Creating an enterprise data warehouse (EDW) involves integrating data from multiple sources and formats. The most critical process is uniformly cleaning and formatting the data for effective use by the business. We discuss here the extract, transform & load (ETL) best practices we followed at Shoppers Stop during our EDW build.
Analyzing data. This is the first step of the ETL best practices. Investing time in analyzing the requirements and having a mature data model for the target EDW can reduce the ETL challenges to a large extent. This will ensure scope creeps and keep the overall timeline intact. It is also important to study the source systems, assess data quality and build robust data validation rules for ETL modules. An ETL strategy should be formulated based on the source and target systems’ data structure.
Fixing data issues at source. Users are often aware of data issues, but have no idea where the quality problems originate or how to fix them. It is therefore important to have a strategy to trap errors and rectify them because it is inevitable that data with quality issues will reach the ETL system. A common practice to tackle them is at ETL execution time, but a best practice would be to partner with source system experts to fix such issues.
Choosing the right ETL tool. One of the most important parts of ETL best practices is choosing the tool which is most compatible with the source and target systems, and which can make life much simpler. The ETL tool’s capability to generate native SQL for the source and target systems can reduce the processing time and resources. This allows one to process transformation anywhere within the environment that is most appropriate.
Integrating incremental data. Most data warehouse tables are so large that they cannot be refreshed during every ETL cycle. Incremental loads ensure that only records changed since the last update are brought into the ETL process. Isolating the latest source data is called capturing changed data (CCD). Building a good strategy for CCD will have a huge impact on the scalability and time taken to refresh the EDW. Often, legacy source systems will not have time stamps or a primary key to identify changes easily. Such problems can be very costly in terms of rework if identified at the later stages of the project. One of the ETL best practices to be followed is that the initial source system study should cover such aspects; this knowledge will help the ETL team to identify changed data capture problems and determine the most appropriate strategy.
Scheduling, auditing & monitoring ETL jobs. Another step of primary importance to be followed as a best practice during ETL implementation is scheduling, auditing and monitoring to ensure that the ETL jobs are done as per what was decided.
Ensuring scalability. The most important part of the implementation is to ensure that the proposed solution can scale with business growth. This is one of the ETL best practices to be kept in mind during ETL design.
About the author: Krishnakumar Avanoor is customer care associate & deputy general manager of the solutions & technology team at Shoppers Stop.
(As told to Anuradha Ramamirtham)
This was first published in September 2010