ETL stands for Extract, Transform, Load. It is a business intelligence (BI) oriented process to load data from the source system to the target system to enable business reporting. It is used for migrating data from one database or platform to another, forming data marts and data warehouses and also converting databases from one format to another.
The three ETL steps are:
- (E) Extracts data from sources that are mostly heterogeneous or different types of systems such as relational databases, flat files, mainframe systems, xml, etc.;
- (T) Transforms the data through cleansing, calculating, translating, filtering, aggregating, etc. into a structure that is more appropriate for reporting and analysis;
- (L) Loads the data into the end target (database and/or cubes) in a presentation-ready format for end users to make decisions.
Extract from source
The ETL process starts with extracting data from different internal and external sources. Each data source has its distinct set of characteristics. For example, data from point-of-sale, inventory management, production control and general ledger systems are often logically and physically incompatible. In general, the goal of the extraction phase is to effectively extract source systems that have different DBMSs, operating systems, hardware, communications protocols and so on, to convert the disparate and/or heterogeneous source data into a single format appropriate for transformation processing.
A decision may be made by the ETL architect to store data in a physical staging area with the same structure as the source versus processing it in memory. The challenge to achieve the optimal balance between writing data to staging tables and keeping it in memory during the ETL process comes from two conflicting objectives:
- Load the data from the source to the target as fast as possible;
- Be able to recover from failure without restarting from the beginning of the process.
Transform the data
The extract step only moves and stages data. The goal of the transformation stage is to change the data to make it usable for the intended purposes.
Once the data is available in the staging area or processed in memory, it is all on one platform and one database. Various transformations such as joining and union tables, filtering and sort the data using specific attributes, pivoting to another structure, making business calculations, and creating aggregates or disaggregates, can be easily performed. In this step of the ETL process, data is cleaned to remove errors, business rules are applied, and data is checked for integrity and quality. If data validation fails, it may result in exceptions, thus not all data is handed over to the next step. After having all the data prepared, slowly changing dimensions may be chosen to be implemented. In that case the business can keep track in the analysis and reports when attributes change over time. For example, this may be useful when a customer moves from one region to another.
Load into the target
Finally, data is loaded into the end target – usually the data warehouse – which contains fact and dimension tables. From there the data can be combined, aggregated, and loaded into data marts or cubes to support business reporting needs.
As the load phase interacts with a database, the constraints defined in the database schema — as well as in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.
The figure below displays these ETL steps.
When to use ETL?
The key function of an ETL process is data integration which enables business intelligence. Data integration allows companies to migrate, transform, and consolidate information quickly and efficiently between systems of all kinds. Pulling databases, application data and reference data into data warehouses provide businesses with visibility into their operations over time and enable management to make better decisions.
For example, a financial institution might have information on a customer in several departments and each department might have that customer’s information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse.
Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use.
Today, ETL process is widely used in business intelligence with the help of an ETL tool. Before the evolution of ETL tools, the ETL process was done manually by using SQL code created by programmers. This task was tedious and cumbersome since it involved many resources, complex coding, and more work hours. On top of it, maintaining the code placed a great challenge among the programmers.
These difficulties are reduced by ETL tools since they are very powerful and they offer many advantages in all stages of ETL process. Starting with extraction, data cleansing, data profiling, transformation, debugging and concluding with the loading into data warehouse is streamlined and visually documenting when compared to the old method. Most modern ETL software also covers real-time and on-demand data integration in a service oriented architecture (SOA), and master data management.
The widely used ETL tools in the ETL space are Informatica, DataStage, SQL Server Integration Services, Pentaho Data Integration, and Oracle Warehouse Builder.