What is an ETL process, and what does the abbreviation stand for? Furthermore, what are the two most commonly used frameworks? This and more will be covered in today’s blog.
ETL stands for Extract, Transform and Load. It is used to collect all of an organization’s data in a central place (for more information about the different options, read my Data warehouses, Data Lakes and Data Lakehouses blog). It describes a process for handling data and getting it ready for faster and easier processing and querying on this central storage place. A well-designed ETL process is important in order to get reliable, consistent, high-quality, and highly usable data ready for end use by business users. So it is important to think about it when starting a new data management initiative.
The first step of this process is extracting the data. The data can come from different places, both internal and external sources. Sources may include JSON, CSV and XML/HTML files, CRM/ERP source systems (e.g. SAP, Salesforce, etc.), relational and non-relational databases, web crawling (to search for information online, e.g. competitor prices) or pulling data through APIs (Application Programming Interfaces).
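To make the extract step concrete, here is a minimal Python sketch. The file names and the API endpoint are illustrative assumptions, not real sources.

```python
# A minimal sketch of the extract step, assuming a hypothetical CSV export,
# a JSON file, and a REST API endpoint as source systems.
import csv
import json
import urllib.request


def extract_csv(path):
    """Read rows from a CSV export (e.g. from an ERP system)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def extract_json(path):
    """Read records from a JSON file (e.g. a CRM export)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def extract_api(url):
    """Pull records from a (hypothetical) REST API."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    orders = extract_csv("orders.csv")
    customers = extract_json("customers.json")
    prices = extract_api("https://example.com/api/competitor-prices")
```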
The second step is transforming the data: making it suitable for and compatible with the schema of the target data warehouse, data lake or data lakehouse, and, among other things, removing inconsistencies and duplicates, sorting the data, etc. More information about this topic can be found in one of the previous blogs about the Importance of Data Governance and Data Quality.
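A small sketch of what such a transform step could look like, assuming the rows extracted above and illustrative field names like "order_id" and "amount" that are not taken from any real schema:

```python
# A minimal sketch of the transform step: removing duplicates, dropping
# incomplete records, and normalizing types so the rows fit the target schema.
from datetime import datetime


def transform(rows):
    seen = set()
    clean = []
    for row in rows:
        # Skip duplicates based on the (assumed) business key.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        # Drop records with missing mandatory fields.
        if not row.get("amount") or not row.get("order_date"):
            continue
        # Normalize types to match the warehouse schema.
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
        })
    # Sort so the load step receives the data in a predictable order.
    return sorted(clean, key=lambda r: r["order_date"])
```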
The last step is loading the data into the data warehouse, data lake or data lakehouse. From this point onwards the data can be queried for business intelligence, machine learning, data science and analytics.
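As a sketch of the load step, the example below uses SQLite as a stand-in for the target data warehouse; in practice this would be a warehouse-specific connector or bulk-load utility, and the table name is an assumption.

```python
# A minimal sketch of the load step, writing the transformed rows into a
# table. Uses an upsert so re-running the pipeline does not create duplicates.
import sqlite3


def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS fact_orders (
               order_id INTEGER PRIMARY KEY,
               amount REAL,
               order_date TEXT
           )"""
    )
    con.executemany(
        """INSERT INTO fact_orders (order_id, amount, order_date)
           VALUES (:order_id, :amount, :order_date)
           ON CONFLICT(order_id) DO UPDATE SET
               amount = excluded.amount,
               order_date = excluded.order_date""",
        [{**r, "order_date": str(r["order_date"])} for r in rows],
    )
    con.commit()
    con.close()
```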
There are two ways of getting the data into the ETL process. The first is batch ETL, meaning the data is put into the process one batch at a time, on a schedule or as soon as it reaches a given threshold. This is an older method and it requires less computing power. The other is streaming ETL. Here the process is event-driven: data is processed as soon as it arrives in the ETL pipeline, in real time, making it available for querying and analysis almost instantaneously.
Both processes have their pros and cons. Streaming ETL is good for when real-time data is of high importance. Batch ETL is best suited for cases where some latency is acceptable, and where you need to perform operations on large quantities of data.
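The contrast between the two can be sketched as follows, reusing the hypothetical extract, transform and load functions from above; queue.Queue merely stands in for a real message broker.

```python
# A minimal sketch contrasting batch and streaming ingestion.
import queue
import time


def run_batch(interval_seconds=3600):
    """Batch ETL: process everything accumulated since the last run."""
    while True:
        rows = extract_csv("orders.csv")      # whole batch at once
        load(transform(rows))
        time.sleep(interval_seconds)          # wait for the next scheduled run


def run_streaming(events: queue.Queue):
    """Streaming ETL: handle each record as soon as it arrives."""
    while True:
        row = events.get()                    # blocks until a new event arrives
        load(transform([row]))                # queryable almost immediately
```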
The two most commonly used ETL/data warehouse architectures are Kimball’s and Inmon’s. Both models will be discussed in a separate blog.
