When does the problem occur?

In an enterprise that combines many different systems with various data types, formats, and access methods, reading data directly from the database can cause several problems:  

  1. Reading data directly from the server that customers are using can overload it and bring it down, affecting the company's business.
  2. With a large number of servers and varied hardware specifications, the code has to be adjusted for each server, which wastes time and resources.
  3. The amount of data at the source is too large and is not exactly what users need, so it has to be transformed into a smaller, more relevant form on the server.

The ETL process is used to solve these problems.

What is ETL?

ETL (Extract, Transform, and Load) is a data integration process that combines data from numerous sources into a single, consistent data store, which is then loaded into a data warehouse or other destination system.

Three phases of the ETL process: Extract, Transform, and Load. Source: support.microsoft.com

Specifically, the ETL process consists of three phases: Extract, Transform, and Load.

Extract

"Extract" here means extracting data from the source, which can be a structured (SQL DB) or unstructured (No SQL) source, or even files, web pages, etc. as long as we can load data from this source, reading it, and then proceeding to the next phase.

Transform

The next step is transforming the data. After extracting it from the source, there are two options:

  • Leave it as-is and save it to the database: the result is just EL (Extract and Load, without a Transform step).
  • Transform the data by validating and cleaning it, removing duplicates, aggregating it to reduce its size, and so on, turning unstructured, inconsistent data into structured, consistent data for future use (Transform).

The second option is the Transform stage of the three-stage ETL process.
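
A minimal sketch of such a transformation, assuming the raw records extracted earlier and hypothetical column names, could use pandas:

```python
import pandas as pd

# Transform: validate, clean, and reduce the raw records.
# Column names ("order_id", "customer_id", "amount") are hypothetical.
def transform(raw_records: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(raw_records)
    df = df.drop_duplicates()                         # remove duplicates
    df = df.dropna(subset=["order_id", "amount"])     # drop invalid rows
    df["amount"] = df["amount"].astype(float)         # enforce consistent types
    return df[["order_id", "customer_id", "amount"]]  # keep only needed columns
```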

Load

Following the transformation stage, we write the data to the target system, which is referred to as the "Load" step. In most enterprises, ETL processes are automated, continuous, well-defined, and batch-driven.
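
Continuing the sketch above, the load step could write the transformed data into a target store; here a local SQLite file and table name stand in for the real warehouse:

```python
import sqlite3
import pandas as pd

# Load: write the transformed DataFrame into the target store.
# "warehouse.db" and "orders_clean" are placeholder names.
def load(df: pd.DataFrame, db_path: str, table: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, if_exists="append", index=False)
    finally:
        conn.close()

# A full (simplified) ETL run chains the three steps together:
# load(transform(extract("source.db", "orders")), "warehouse.db", "orders_clean")
```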

Why do we need ETL on Blockchain?

In 2009, the Bitcoin network was launched as the first application of Blockchain technology, giving birth to many new systems of its kind. In 2015, Ethereum, a Blockchain platform that allows decentralized programs to be deployed and executed, was released. The concept of decentralized program execution expanded the potential of Blockchains and their applications.

In recent years, Blockchain applications have appeared in domains ranging from the Internet of Things to food supply chains. The volume of data kept on these distributed systems is growing in parallel with Blockchain's popularity and adoption. As of July 2018, the Bitcoin Blockchain stored about 174 GB of data; by April 2022, that figure had grown to around 390 GB.

Size of the Bitcoin Blockchain over time. Source: Statista.com

Since all information on public blockchains is accessible to everyone, they are well suited to data analysis, which can reveal insights about user behavior. A user's identity, for instance, can be inferred from Blockchain data. Analysis of this publicly available data can also support financial studies and forecasts of investor behavior in a market worth billions of dollars.

Such analysis also allows the data to be used for investigative purposes: tracing the source of fraud, theft, or abuse in a Blockchain system helps reduce the consequences of thefts and hacker attacks.

In practice, we have several options for processing the data obtained from a Blockchain network. However, a few reasons make the ETL process stand out:

  • We get a well-organized, efficient data-handling model. ETL data flows are simple enough for anyone to understand.
  • We save a lot of time. Instead of extracting data manually or with ad hoc tools, the ETL pipeline does it for us, so we can focus on our core work.
  • We get a clean dataset for data analysis, machine learning, and other purposes.

Blockchain ETL architecture

Let's look at the following architecture to better understand Blockchain ETL.

An example of a Blockchain ETL system, combining the Extract, Transform, and Load stages. Source: github.com

The figure above shows a system that uses an ETL process to obtain data from the Blockchain and then processes and stores it.

First, the data can be gathered with libraries that support interacting with Blockchain networks (BSC and ETH in this example), such as web3 or ethers.js, combined with the smart contract's ABI (in case we want to collect the data of a particular contract). That data is unprocessed, encoded, and may contain irrelevant information. Therefore, in the Transform phase, we filter for the information we want, decode it into more recognizable formats such as JSON or CSV, and then store it in the database.
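
As an illustration of these two phases, a minimal sketch using the Python web3 library might look like this; the RPC endpoint is a placeholder, and a real pipeline would also decode contract events via the ABI:

```python
import json
from web3 import Web3  # pip install web3

RPC_URL = "https://example-rpc-endpoint"  # placeholder: any ETH/BSC node URL

w3 = Web3(Web3.HTTPProvider(RPC_URL))

# Extract: fetch one raw block, including its full transactions.
block = w3.eth.get_block("latest", full_transactions=True)

# Transform: keep only the fields we care about and decode them into a
# plain, JSON-friendly structure that can be loaded into a database.
records = [
    {
        "block": block["number"],
        "tx_hash": tx["hash"].hex(),
        "from": tx["from"],
        "to": tx["to"],                 # None for contract creations
        "value": tx["value"] / 10**18,  # wei -> ETH/BNB
    }
    for tx in block["transactions"]
]

print(json.dumps(records[:3], indent=2))
```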

We can use the data collected in the previous phases to perform Blockchain analyses, visualize statistical results on websites and applications, or train a machine learning model, for example.
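
Once the records are in the database, even a few lines of analysis can produce useful statistics. For instance, assuming the records sketched above, we could rank senders by total transferred value:

```python
import pandas as pd

# Simple analysis on the loaded records: total value sent per address.
df = pd.DataFrame(records)
top_senders = (
    df.groupby("from")["value"]
      .sum()
      .sort_values(ascending=False)
      .head(10)
)
print(top_senders)
```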

Conclusion

As the amount of data on blockchain networks grows, so does the demand for analyzing it. As a result, Blockchain ETL has become a technique that many technology companies are interested in.

References

[1] ETL (Extract, Transform, Load), IBM Cloud Education, accessed 14th August 2022.

[2] Applying the ETL Process to Blockchain Data. Prospect and Findings, mdpi.com, accessed 14th August 2022.

[3] Blockchain ETL, github.com, accessed 14th August 2022.