We're planting a tree for every job application! Click here toΒ learn more

How to export a full Ethereum history into S3, efficiently

Anton Bryzgalov

21 Nov 2022

β€’

2 min read

How to export a full Ethereum history into S3, efficiently
  • Ethereum

Blockchain technologies are not a geeky thing anymore and many companies run their business around crypto assets. A popular scenario is to build a Data Platform operating blockchain data and launch analytical and realtime services on top of it. This is what I have done recently, and want to share my experience with you.

Hi πŸ‘‹ I am Anton Bryzgalov (aka @bryzgaloff), an Expert Data Engineer and Architect in Cloud & Blockchain.

The solution I have built is described in detail in my blogpost on Medium. This post is a short summary exclusively for Blockchain Works 😊

Which node to use?

I have exported the data using a number of public node providers. This is ok for an MVP. However, in a long term, I strongly advise you to run your own node. Such a node will take several weeks to sync and extra cost for hosting, but the further data extraction speed will be significantly higher than when using a public node.

Also, using free public nodes require a solid fallback mechanism due to rate and other limits. With an owned node the rate is limited only by your VM resources.

Using a paid provider like Alchemy or QuickNode is completely not an option: the full history is too big and will result in $10Ks.

Alternatives?

For a quick launch, I have used public BigQuery datasets with Ethereum data. They are populated using an open source Ethereum ETL tool. Exporting data from BigQuery into S3 is a cheap option.

A set of optimizations described in the article, allowed me to export the data in 24 hours and β‰ˆ$100 (BigQuery -> S3).

However, relying on BigQuery leads to a data structure loss, so to operate on raw JSONs I have had to switch back to the public nodes.

How do we use the data now?

πŸ”₯ As an outcome, I have ingested the full history of Ethereum to S3. The data is hosted a raw JSON format as responded by the nodes. This allows me to easily rerun any calculations on the full history without touching faulty public nodes.

The querying is performed using AWS Athena. It supports JSON brilliantly. But the full dataset is of 4.5TB now and Athena is paid for a volume of scanned data. Reading the full history in raw costs β‰ˆ$25. So, to reduce the costs I convert the data from a raw JSON into a cost-efficient Parquet format (with Snappy compression). For specific purposes, this has reduced querying costs by 97%!

Realtime data ingestion

Our product-level goal was to build a cryptowallets balances API: given a wallet address, it should respond with a list of all tokens contained by the wallet with 100% accurate balances. The latency should be minimal from the latest mined block.

Following a KISS ("keep it simple, stupid!") principle, I have implemented a polling script which "listens" to new blocks on top of the Ethereum and once a new block appears, updates a DynamoDB table containing aggregated balances for each wallet.

Calculating balances using Athena and uploading the data into DynamoDB was quite a challenge:

  • firstly, because of a high volume (β‰ˆ30GB of uncompressed aggregated data);
  • and secondly, because of uint256-based calculations in Athena (which's the biggest number type is 128-bit Decimal).

⚑️ As a result we are able to recalculate the full balances history and upload it to DynamoDB in ≀3 hours (and <$1). The API's latency is <1s from the latest block. Dive deeper into the details in the article.

Thank you for reading!

Working on something similar? I am happy to have a further discussion with you, let's get in touch on LinkedIn! 🀝

Did you like this article?

Anton Bryzgalov

I am an Expert Cloud Data Engineer with 7+ years of production experience, including 5 years as a Solutions Architect. Designed 25+ projects for various domains including finance, marketing, retail, blockchain.

See other articles by Anton

Related jobs

See all

Title

The company

  • Remote

Title

The company

  • Remote

Title

The company

  • Remote

Title

The company

  • Remote

Related articles

JavaScript Functional Style Made Simple

JavaScript Functional Style Made Simple

Daniel Boros

β€’

12 Sep 2021

JavaScript Functional Style Made Simple

JavaScript Functional Style Made Simple

Daniel Boros

β€’

12 Sep 2021

WorksHub

CareersCompaniesSitemapFunctional WorksBlockchain WorksJavaScript WorksAI WorksGolang WorksJava WorksPython WorksRemote Works
email iconhello@works-hub.comUK flag

Ground Floor, Verse Building, 18 Brunswick Place, London, N1 6DZ

US flag

108 E 16th Street, New York, NY 10003

Subscribe to our newsletter

Join over 111,000 others and get access to exclusive content, job opportunities and more!

Β© 2022 WorksHub

Privacy PolicyDeveloped by WorksHub