In database management technology, an extract, transform, load (ETL) process plays a key role in obtaining a 360-degree view of the customer by harmonizing data for operational needs. One might be tempted to assume that ETL processes are simply a matter of extracting data from multiple sources and loading it into the data warehouse where it will be transformed and integrated. But in reality ETL process involves quite a bit of complexity and hence need to be handled adroitly. Here are a few important points to consider while evaluating an ETL process, in order to avoid hiccups.
Abrupt processing of ETL loads
Consider an example wherein there are 19 data loads to be transformed and loaded, but the job stops at the 9th load. You wouldn’t want to have to reload the entire shift again. When a processing job gets interrupted it needs to have the capability to restart from the point at which it left off. If the ETL process is hampered, reports will not be refreshed in time, and you will be making decisions on stale information.
To address the above problem, one option is to have a “checkpoint recording” mechanism at regular intervals. If and when a job is interrupted, the keys can be reprocessed from the last checkpoint. Check for this feature in your ETL process solution package before making a purchase decision.
Another option is to have an interrupt handler written in third-generation languages such as C. This handler would store the position of the processing point at the last failure and seek that position to process forward when the job restarts. An important criterion is how quickly the data can be moved across. While evaluating an ETL tool, check for performance levels and features for restart.
Managing fast changing data sets
For smooth running of the ETL process, the tool that you select should have the following capabilities:
- Handling of large
volumes of data; moving the data from one point to another in the shortest span of time.
- Detection of changes to transactions in real time and synchronization of data.
- Handling a variety of data including text and unstructured data such as images.
- Distribution of processing over multiple processors and options for parallel processing.
Any tool to automate an ETL process should provide for block replication at the lowest level and offer good features to manage fast-changing data sets.
For big data, build a sandbox for Hadoop/Hive/PIG infrastructure. It is important to have a clear-cut strategy wherein new generation big data infrastructures can coexist with the old world. The existing team handling the ETL process could be retrained to deal with big data technologies, or you could consider hiring a new, younger team that is open to cross-pollinating skills -- machine learning, very large data handling and programming.
Loading data into individual data marts
In cases without a centralized database, it is desirable to have data templates. These are standardized interfaces, which each individual or departmental data mart would populate. Make sure that your ETL tool has the ability to expand into a data warehouse platform for flow of information from one data mart to the next.
About the author:Derick Jose is a co-founder at a data science start up which is focused on providing business solutions at the intersect of big data, machine learning and deep domain expertise. Prior to that he was the Vice President at MindTree where he headed the advanced analytics group. He has 20+ years of experience spanning Telecom, CPG, Banking across US, Europe and Asian markets. He has also filed for a patent with USPTO related to Analytical components.
(As told to Sharon D'Souza)
This was first published in January 2012