The ETL process in data warehouse is the basis for storage systems. Although not directly visible to users, the efficiency of the ETL process should not be underestimated.
Why Is The ETL Process So Important?
The ETL process is a vital part of the data warehouse. In large data warehouse projects, ETL processes are often used as a monthly backup, a weekly load for subsystems, and a daily load for critical data.
ETL processes are responsible for transferring data from source systems to target systems. Data is extracted, cleaned, logically prepared, and loaded into the target tables.
In many data warehouse projects, the creation of ETL processes accounts for up to 60-80% of the total project cost. As this process is located at the back end of the data warehouse, end-user involvement is only sometimes apparent.
The following article describes the extraction, transformation, and loading elements in more detail.
What Is The ETL Process In Data Warehouse?
ETL stands for Extract, Transform and Load and refers to transforming data, especially in a data warehouse environment.
The term ETL is also used in other software applications, such as self-service BI solutions, but has nothing to do with the concept and architecture of the data warehouse.
Only the data conversion and transfer process is based on the same principle.
Data Conversion Process In DWH
There are different views of the ETL process in the data warehouse. This aspect is related to the architecture of the system. The data warehouse can be built with three or four layers. Therefore, the transformation takes place in the cleaning layer in the processing phase, while the extraction is first performed in the moving phase.
Extraction is the first step of data processing. In this process, data is extracted, i.e., extracted from source systems or documents and made available for further processing steps in the input data warehouse layer. Only certain aspects of the raw data can be used. Unnecessary data shall not be exported.
In the second step, the data is transformed at a step or cleaning level. Here the raw data types are converted into column types for the target tables. In addition, the content of the data is checked. For example, duplicates are identified and eliminated, calculations are performed, or additional data is merged.
In the third step, the data is transferred to the data warehouse. There it is organized and normalized. Some of the data is also historical, so changes over time can be monitored and evaluated.
From ETL To ELT
In data processing, the conversion and loading phases are interchangeable. The data is transferred to a central database and then converted using unique algorithms.
This process is widespread in big data environments where data is collected and then made available for analysis. There are no special layer normalization procedures as in traditional data warehouses.
ETL Extraction Procedure
The ETL extraction process defines the conventions for source system connections and the types of data transfers.
In addition, a data update schedule is defined. When updating databases, a distinction is made between synchronous and asynchronous extraction:
In a synchronous query, the validity of the data is always guaranteed and updated. The disadvantage is that it requires more resources to use the network.
In asynchronous offloading, data is replicated to save resources, i.e., when resources are sufficient.
These operations are usually shifted to day or night to cause as few problems as possible. A distinction should also be made between static and incremental recovery:
A static recovery creates a complete image of the database. This process occurs when a particular system state is interrupted or restored for the first time.
Incremental recovery reads only the changes between the current and last recovery steps. These differences are highlighted in the operational log.
ETL Transformation Process
The second phase of the ETL process is the transformation process. In this phase, data from the different source systems is transferred to a defined internal format.
Besides eliminating structural differences, we also focus on contextual differences.
During the data transfer process, information is aligned to standard data formats, transformed or transcoded, time and line normalization are performed, and units of measurement are converted.
In addition, erroneous, redundant, obsolete, or incomplete values in the databases are corrected according to a correction program.
ETL Loading Procedure
The ETL loading process loads data from a data warehouse directly into a data warehouse or possibly into an operational data warehouse.
Since no evaluation or analysis can be performed during the data warehouse loading process, the data warehouse is locked during the loading phase.
Updates may replace old data or be loaded into the data warehouse as new data records.
As we have learned the most important concepts about the ETL process in a data warehouse, you should know the extraction, transformation, and loading in an ETL process. If you want your organization to maximize the value of your data, it’s time to implement the ETL process in DWH.