Data management and data warehouses have been around about as long as widespread use of the internet. In recent years, however, several new terms have been coined, such as data lakes and data lakehouses. But what are all these different systems for, and what is each one best used for? That is what we will briefly explore in this blog.
The traditional data warehouse is a type of data repository that is mostly used for business intelligence (BI), especially data analytics, and serves as a central repository for mostly structured data. Data warehouses organize data in a formal, predefined way (known as a schema), which makes them less flexible and makes storing data in them more costly. Furthermore, they are not well suited for unstructured data. Well-known examples are the traditional on-premise warehouses.
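To make the idea of a formal schema more concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a warehouse, which is an assumption for illustration only (real warehouses are far larger systems). The `sales` table, its columns and the sample rows are all hypothetical.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory database with a fixed, predefined schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            order_id INTEGER PRIMARY KEY,
            customer TEXT NOT NULL,
            amount   REAL NOT NULL CHECK (amount >= 0)
        )
        """
    )
    return conn


def load_rows(conn, rows):
    """Insert rows one by one; rows that break the schema rules are rejected."""
    accepted, rejected = 0, []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO sales (order_id, customer, amount) VALUES (?, ?, ?)",
                    row,
                )
            accepted += 1
        except sqlite3.IntegrityError:
            rejected.append(row)
    return accepted, rejected


conn = build_warehouse()
rows = [
    (1, "Acme", 120.0),     # fits the schema
    (2, None, 75.5),        # missing customer violates NOT NULL
    (3, "Globex", -10.0),   # negative amount violates the CHECK rule
]
accepted, rejected = load_rows(conn, rows)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(accepted, rejected, total)
```

Only the first row is stored. The schema is applied at write time, which is what keeps the data clean for BI queries, but also what makes this kind of storage less flexible.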
However, in more recent years new terms have emerged, and one of them is the data lake, introduced around 2011. Data lakes work much like data warehouses but are more focused on unstructured data. They store data in a flat architecture (without a predefined schema), which allows easier integration of different data sources as well as storage of raw data. Data lakes are often used for machine learning and data science purposes. Disadvantages include, among others, a lack of built-in support for data quality enforcement and transactions. Well-known companies offering this are Snowflake, AWS and Databricks.
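The flat, schema-free storage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local folder stands in for the lake, and the file names, fields and the helper functions `land_raw` and `read_orders` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir, name, payload):
    """Store a payload as-is; nothing is validated at write time."""
    path = Path(lake_dir) / name
    path.write_text(payload, encoding="utf-8")
    return path


def read_orders(lake_dir):
    """Interpret the raw JSON files only at read time."""
    orders = []
    for path in sorted(Path(lake_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if "amount" in record:
            orders.append(
                {
                    "customer": record.get("customer", "unknown"),
                    "amount": float(record["amount"]),
                }
            )
    return orders


with tempfile.TemporaryDirectory() as lake:
    # Differently shaped records and free text all land in the same flat folder.
    land_raw(lake, "order_1.json", json.dumps({"customer": "Acme", "amount": 120}))
    land_raw(lake, "order_2.json", json.dumps({"amount": "75.5", "channel": "web"}))
    land_raw(lake, "support_call.txt", "Customer asked about a late delivery.")

    stored_files = sorted(p.name for p in Path(lake).iterdir())
    orders = read_orders(lake)

print(stored_files)
print(orders)
```

Everything is accepted on the way in, including the record without a customer and the plain text note. The cost shows up at read time: the reader has to decide how to handle missing fields and inconsistent types, which is where the data quality concerns mentioned above come from.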
Then there is the data lakehouse, the newest term on the block. It can best be seen as a combination of a data warehouse and a data lake. Lakehouses allow easy storage of both unstructured and structured data, which makes them well suited for BI, machine learning and data science. They are therefore more flexible and can reduce cost, as there is no longer a need to maintain both a data warehouse and a data lake. It all sounds very promising. However, this technology is still somewhat unproven, as it was only introduced around 2021, and so far there is only one well-known provider, Databricks.
