In today’s world, data is the new gold. Everyone knows this by now. But like gold miners, companies can do little with a pile of dirt and a few gold nuggets. To extract the true value of gold, it has to be filtered and processed. In the same way, data needs to be stored, cleaned and enriched in a structured way so that it can be used in reports and analytical dashboards or for training machine learning and artificial intelligence models. Today, different approaches exist depending on the volume of data, the frequency of writes and the availability required. So let’s dive into the origin and evolution of the lakehouse and discover how it brings together the best elements of data warehouses and data lakes.
Existing Solution: The Data Warehouse
A data warehouse is defined as a central data management system, specially organized for analytical purposes, which brings together data from a wide variety of sources. It is then used for data analysis and reporting. The data stored here is mostly in a relational format.
This requires the data to be stored in a clean structure. It can be accessed via the most commonly used database language, SQL (Structured Query Language). In addition, BI tools such as Power BI, Informatica or Tableau can be connected directly to the data warehouse, which means that analyses and dashboards can be created by business analysts who are not familiar with SQL.
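As a minimal sketch of this kind of access, a warehouse can also be queried with plain SQL from Python, for example via SQLAlchemy and pandas; the connection string, table and column names below are purely illustrative assumptions, not any particular vendor’s setup.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; replace with your warehouse's driver and credentials.
engine = create_engine("postgresql://analyst:secret@warehouse.example.com:5439/analytics")

# A typical analytical query: aggregate a fact table joined with a dimension table.
# Table and column names are illustrative assumptions.
query = """
    SELECT d.region,
           SUM(f.revenue) AS total_revenue
    FROM   fact_sales f
    JOIN   dim_customer d ON d.customer_id = f.customer_id
    GROUP  BY d.region
    ORDER  BY total_revenue DESC
"""

# The result arrives as a DataFrame, ready for a report or a BI tool.
df = pd.read_sql(query, engine)
print(df.head())
```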
When starting a large new data project, the simplest solution is often to store the collected data directly in the data warehouse. A warehouse is optimized for fast reads, however, and writing becomes slow when data is continuously ingested and transformed at the same time. As a result, loading times can become inconvenient for users of dynamic dashboards. A buffer is therefore needed to prevent small amounts of data from being written continuously.
The Rise of Data Lakes
A data lake is a storage system that can act as exactly this kind of buffer. Data warehouses store data in a structured format, while data lakes can hold data in unstructured or mixed formats. It should be noted, however, that the more uniform the structure, the faster and more efficient the access to the content.
There are several advantages to introducing a data lake. When data is loaded directly into the data warehouse, it is often not possible to transform it beforehand. ETL pipelines (extract, transform, load) are therefore still needed.
Without them, the transformation has to be computed by the data warehouse on every load. This results in longer waiting times for clients and higher costs for the data warehouse. The results can, of course, be precomputed in the warehouse itself to save resources, but this only shifts the problem, since the warehouse still carries the load at the time of computation.
The easy availability of scalable storage, along with cheap on-demand computing power, makes the data lake ideal for implementing an ETL pipeline. Raw data is loaded into the data lake, and aggregated or merged copies of the data are created and loaded into the data warehouse.
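A minimal sketch of such a lake-to-warehouse ETL pipeline with PySpark might look as follows; the lake path, column names and the warehouse JDBC connection details are assumptions made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-warehouse-etl").getOrCreate()

# Extract: read raw event data from the data lake (path is illustrative).
raw = spark.read.json("s3a://example-data-lake/raw/events/2024/")

# Transform: clean and aggregate before the data ever touches the warehouse.
daily = (
    raw.filter(F.col("event_type") == "purchase")
       .withColumn("event_date", F.to_date("event_timestamp"))
       .groupBy("event_date", "product_id")
       .agg(F.sum("amount").alias("daily_revenue"),
            F.count("*").alias("purchases"))
)

# Load: write the much smaller aggregate into the data warehouse via JDBC
# (URL, table name and credentials are placeholders).
(daily.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://warehouse.example.com:5439/analytics")
      .option("dbtable", "daily_product_revenue")
      .option("user", "etl_user")
      .option("password", "secret")
      .mode("append")
      .save())
```

The warehouse then only ever sees the precomputed aggregate, while the raw events stay cheaply in the lake.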
Storing data in the data warehouse is often expensive, so it makes sense to keep only important and frequently used data there. This is not a problem for the data lake. Data can simply be removed from the data warehouse after a certain period of time; it remains available in the data lake and can be retrieved with a longer delay if necessary.
The data in the data lake can be accessed in several ways, primarily via Python and R, but also through frameworks such as Spark. These programming languages are among the most commonly used in data science and are supported by machine learning libraries such as XGBoost, PyTorch and TensorFlow.
These tools are designed to access data lakes, since training machine learning models requires large amounts of data to be loaded and transformed at once, and end users should not experience any delays in their dashboards while these processes run.
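As a rough sketch of this pattern, the example below pulls a Parquet dataset straight from the lake’s object storage and trains a model on it with pandas and XGBoost, without touching the warehouse; the file path and the feature and label columns are hypothetical.

```python
import pandas as pd
import xgboost as xgb

# Read training data directly from the data lake's object storage
# (path is illustrative; s3:// access assumes s3fs is installed).
df = pd.read_parquet("s3://example-data-lake/curated/churn_features.parquet")

# Feature and label column names are hypothetical.
X = df.drop(columns=["churned"])
y = df["churned"]

# Train a gradient-boosted classifier on the full dataset in memory.
model = xgb.XGBClassifier(n_estimators=200, max_depth=6)
model.fit(X, y)

print(model.predict_proba(X[:5]))
```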
Data warehouses, on the other hand, are primarily designed to store the results of analyses or smaller amounts of data. However, there are efforts to integrate machine learning directly into the data warehouse. Examples include Amazon Redshift ML, which gives Redshift direct access to Amazon’s SageMaker machine learning platform, and Snowflake, which also offers the possibility of training machine learning models directly in the data warehouse.
Data Warehouse To Data Lake
One of the disadvantages of traditional data warehouses is that storage and computing power cannot be increased independently. This leads to prohibitively high costs as data volumes grow.
Modern data warehouses, such as Redshift and Snowflake, allow storage capacity and computing power to be scaled at least partially independently, just as they would in a data lake.
Data Lake To Data Warehouse
The most important features of the data warehouse compared to the data lake are probably the DBMS management functions, such as user access rights to individual data, ACID transactions, versioning, auditing, indexing, caching and query optimization.
Open-source technologies already exist that bring some of these functions to the data lake. Delta Lake and Apache Hudi create a metadata layer between the data lake and the applications using it. Among other things, this layer contains information about which objects belong to which version of a table.
This layer enables ACID transactions and can restrict user access to certain data. It also simplifies versioning of the data in the data lake. In addition, data schemas can be enforced by storing them in the metadata layer and validating them at load time.
This metadata can also be used to improve performance. Some of the data to be analyzed can be kept on faster storage such as solid-state drives (SSD) or in random-access memory (RAM), and the metadata can be used to identify which cached data is still valid across transactions. In addition, minimum and maximum values per file can be stored, which speeds up the search for relevant data points.
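A small sketch of what this looks like with Delta Lake on Spark is shown below, assuming a Spark session configured with the delta-spark package; the storage path and table contents are illustrative. Every write becomes an atomic, versioned transaction recorded in the transaction log, and earlier versions remain readable via the metadata layer.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available and the Delta extensions are enabled.
spark = (SparkSession.builder
         .appName("delta-lake-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "s3a://example-data-lake/tables/customers"  # illustrative path

# Each write is an ACID transaction recorded in the table's transaction log.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save(path)

# Appends are checked against the stored schema (schema enforcement).
more = spark.createDataFrame([(3, "Carol")], ["id", "name"])
more.write.format("delta").mode("append").save(path)

# Time travel: read an earlier version of the table via the metadata layer.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```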
Evolution Of The Lakehouse
Both the data warehouse and the data lake have their own advantages and can complement each other in some respects. But new technologies such as Delta Lake and Apache Hudi are increasingly combining them. The question is therefore whether the two systems will still be completely separate in a few years’ time or whether they will merge into a hybrid system.
The lakehouse’s open architecture approach to data storage, transformation and analysis is already gaining ground. For example, resources can be reserved specifically for data ingestion and transformation so that they do not affect customer queries. In addition, data in a data warehouse is often stored in a proprietary format. With increasingly stringent data requirements and companies’ desire to avoid relying on a single provider, the long-term trend in the software industry is toward open data formats.
Data lakes typically use open-source data formats such as Apache Parquet. The lakehouse approach aims to combine the best aspects of the data lake and the data warehouse, replacing the two separate systems with a single one.
After all, reducing the number of ETL pipelines and eliminating redundant technologies saves money and increases the incentive to adopt the lakehouse.