The introduction to Blockchain ETL produced a high-level overview of what ETL is, why ETL is needed on Blockchain and the Blockchain ETL architecture. In this post, I'm going to dig deep into ETL on some typical blockchain networks such as Bitcoin and Ethereum. Besides that, I'll present some public datasets that are the result of the ETL process on these networks.
About the Bitcoin ETL
Bitcoin ETL is a tool that collects, extracts, transforms, and loads data from the bitcoin chain and other similar bitcoin chains such as bitcion_cash, bitcoin_gold, doge, litecoin, dash, zcash, and so on.
Every ETL flow must have at least one schema that defines the flow's output. In this case, the Bitcoin ETL tool has a schema that includes the four components listed below:
How to use the Bitcoin ETL
There are three steps of using the Bitcoin-ETL, including preparing the provider, installing the tool using pip or docker, and using the tool's commands.
Prepare the provider
After acknowledging what the output of the tool is, one more thing you need to know is where the data source is.
To connect with the data of the bitcoin chain or others above, you need to build a Full Node by following the instructions here. After that, you will have a Full Node on your localhost, for example, http://user:[email protected]:8332
Install the tool using pip or docker
Currently, there are two installing options, including using pip and docker.
Using pip: you simply need to run the command: > pip3 install bitcoin-etl
Using docker: you need to follow these steps:
Step 1: Download and install Docker by following this instruction: https://docs.docker.com/install/
Step 2: Build a docker image by running these commands: > docker build --platform linux/x86_64 -t bitcoin-etl:latest .
Step 3: Run a container.
- > docker run --platform linux/x86_64 -v $HOME/output:/bitcoin-etl/output bitcoin-etl:latest export_blocks_and_transactions --start-block 0 --end-block 500000 --provider-uri http://user:[email protected]:8332 --blocks-output output/blocks.json --transactions-output output/transactions.json
- This command will run a docker container which will collect data from your full node with provider uri http://user:[email protected]:8332 and export data to the output folder.
Use the tool's commands
- > bitcoinetl export_blocks_and_transactions --start-block 0 --end-block 500000 --provider-uri http://user:[email protected]:8332 --blocks-output blocks.json --transactions-output transactions.json
- This command is used to export blocks and transactions of bitcoin chains from block 0 to block 500000 into blocks.json and transactions.json. You can choose export blocks or transactions only if you just need only one of both.
- Note: required_signatures, type, addresses, and value fields will be empty in transaction inputs. Use enrich_transactions to populate those fields.
- > bitcoinetl enrich_transactions --provider-uri http://user:[email protected]:8332 --transactions-input transactions.json --transactions-output enriched_transactions.json
- This command will populate required_signatures, type, addresses, and value fields for exported transactions.json before and export into enriched_transactions.json
- > bitcoinetl get_block_range_for_date --provider-uri http://user:[email protected]:8332 --date=2017-03-01
- This command will return the block range of a specific date.
- > bitcoinetl filter_items -i blocks.json -o blocks_filtered.json -p "datetime.datetime.fromtimestamp(item['timestamp']).astimezone(datetime.timezone.utc).strftime('%Y-%m-%d') == '2017-03-01'"
- Fillter blocks.json by date '2017-03-01' into blocks_filtered.json
- > bitcoinetl export_all --provider-uri http://user:[email protected]:8332 --start 2018-01-01 --end 2018-01-02
- Export everything, including blocks, transactions, input_transactions, output_transactions from 2018-01-01 to 2018-01-02
- > bitcoinetl stream --provider-uri http://user:[email protected]:8332 --start-block 500000
- Outputs blocks and transactions to the console by default.
- Add --output parameter to export data into Google Pub/Sub topic e.g. projects/your-project/topics/crypto_bitcoin
- Add --last-synced-block-file to save its state to file e.g., last_synced_block.txt
- Use the --lag option to specify how many blocks to lag behind the head of the blockchain
- Use the --chain option to specify the type of the chain, e.g., bitcoin, litecoin, dash, zcash, etc.
- You can tune each job above by adding two-parameter --batch-size, --max-workers for performance. When --batch-size determines how many blocks in one batch will be processed and --batch-size determines how many processes will be run concurrency.
About the Ethereum ETL
Familiar with Bitcoin ETL, the Ethereum ETL tool helps us convert blockchain data into convenient formats like CSV or relational databases. Every chain run on top of EVM (Ethereum Virtual Machine) like Ethereum, Binance Smart Chain, etc. can apply this tool to convert blockchain data.
This tool's features are easily exported:
- ERC20 / ERC721 tokens
- Token transfers
- Internal transactions
With its usage, the tool is being used in some other projects such as Google - Public BigQuery Ethereum datasets, Nansen - Analytics Platform for Ethereum, etc.
How to use Ethereum ETL
There are 3 steps from installing the tool to applying it to collect, extract, transform and load blockchain data. In this instruction, I choose the Ethereum chain as an example to explain how the tool works.
This is a convenient point of the Ethereum chain that there are so many public providers for everyone. As a result, you don't have to install a local Full Node on your machine and simply need a provider uri like that: "https://mainnet.infura.io/v3/7aef3f0cd1f64408b163814b22cc643c".
You can find at least one public provider with another chain from this site: https://chainlist.org. However, remember not to hit the Node when using this public provider.
Install Ethereum-ETL tool
Like bitcoin-etl, ethereum-etl has two install options: install by pip or by docker.
Install by pip: run command: > pip3 install ethereum-etl
Install by docker: run command after installing docker: > docker build -t ethereum-etl:latest
How to use the ethereum-etl tool
The ethereum-etl tool supports the following commands:
There are some main commands demo below:
- > ethereumetl export_blocks_and_transactions --start-block 14000000 --end-block 14001000 --provider-uri https://mainnet.infura.io/v3/7aef3f0cd1f64408b163814b22cc643c --blocks-output blocks.csv --transactions-output transactions.csv --max-workers 8 --batch-size 100
- This command will collect blocks and transactions from block 14000000 to block 14001000 on the Ethereum chain, then export them into blocks.csv and transactions.csv
- The command will run like this:
- The output of the command:
- Note: If you want to get this information, you must have an Archive Node because just only Archive Node has stored token transfers, and Full Node hasn't. You can get an Archive Node provider throw two ways:
- Create your own Archive Node by installing geth: https://geth.ethereum.org/docs/install-and-build/installing-geth
- Using a third-party provider example chainstack.com: https://chainstack.com/ethereum-archive-nodes-on-chainstack/
- > ethereumetl export_token_transfers --start-block 14000000 --end-block 14001000 --provider-uri https://nd-892-894-739.p2pify.com/******366e --output token_transfers.csv --max-workers 8 --batch-size 100
- This command uses my chainstack Archive Node provider: https://nd-892-894-739.p2pify.com/******366e, and starts to collect token transfer from block 14000000 to block 14001000, then exporting data to token_transfers.csv
- The command will run like this:
The output of the command:
- ethereumetl stream --provider-uri https://mainnet.infura.io/v3/7aef3f0cd1f64408b163814b22cc643c --start-block 14739860 --max-workers 8 --batch-size 100
- This command will start a stream to export Ethereum data from block 14739860 until synced with the Full Node. When it reaches the last block and there's no block to collect data, the stream will sleep 10s before running again to wait for the Full Node to update new blocks.
Google BigQuery dataset
Google BigQuery is one of the most powerful BigData query tools. If you don't familiar with this platform, you can check here for more information.
Now, data of blockchain etl is being updated on Google BigQuery, which is the fastest way for you to reach the blockchain data. With Google BigQuery, you don't need to install blockchain-etl tools, set up or buy providers, or run the stream or commands to export data and store them, but you are still able to use data and other tools of Google Data Platform to analyze the data or visualize it. Currently, there is eight public datasets map with eight blockchains:
- Bitcoin (new location): bigquery-public-data.crypto_bitcoin
- Bitcoin Cash: bigquery-public-data.crypto_bitcoin_cash
- Dash: bigquery-public-data.crypto_dash
- Dogecoin: bigquery-public-data.crypto_dogecoin
- Ethereum (new location): bigquery-public-data.crypto_ethereum
- Ethereum Classic: bigquery-public-data.crypto_ethereum_classic
- Litecoin: bigquery-public-data.crypto_litecoin
- Zcash: bigquery-public-data.crypto_zcash
Blockchain-etl is a very useful tool to handle and get insights into blockchain data. This tool still was being used in many other projects, which provided many analyses on blockchains to create more value and make blockchain better.
 Running A Full Node, bitcoin.com, accessed 16th September, 2022.
 Bitcoin-ETL, github.com, accessed 16th September 2022.
 Ethereum-ETL, github.com, accessed 16th September 2022.
 Archive novdes on Chainstack, chainstack.com, accessed 16th September 2022.
 BigQuery, cloud.google.com, accessed 16th September 2022.