Moving data from multiple sources, reformatting, and cleansing it can be a tedious task. An extract, transform, and load (ETL) tool should include these as the important features of the product. This tip highlights key factors to consider before purchasing an ETL tool.
Ascertain ETL tool’s ability to handle multiple data formats
Third generation ETL tools are capable of handling structured data from different data sources including legacy systems such as mainframe and UNIX based application systems, spreadsheets, XML format, proprietary databases, and others. But managing unstructured data is still a challenge for ETL tools.
The next generation ETL tools would have the ability to handle structured and unstructured data simultaneously as we access huge streams of valuable data from social networking media. Handling streaming data is another challenge for most of the ETL tools.
1) Consider branded ETL tools to save time
Today, numerous business intelligence (BI) vendors are coming up with BI and analytics packages, which contain pre-built data models and ETL modules to load different types of data onto them. These ETL tool packages accelerate the delivery of BI reports. If the package fits the needs of a business user and the required data is easily available, the module can be implemented in a couple of months and the user can start generating reports. On the other hand, the traditional way of building custom BI systems takes at least 9 to10 months.
2) Conduct data profiling at source data level
Data profiling with an ETL tool is an important activity to audit and understand what is being captured into source systems through the business process. A syntactic check validates if the data is meeting technical constraints. whereas a semantic check audits if the data is abiding by the inherent business rules and policies. Data health is quite important for designing and implementing an ETL process or implementing a new platform.
As a best practice, data should be audited and corrected at the source data level rather than the data warehouse, reporting or new application platform.
3) Check if ETL tool supports data quality/ cleansing
Integration of data quality activity with ETL functionality is the most sought after feature in any ETL tool. However, many ETL tools do not support advanced data quality and cleansing features. ETL tools with advanced data quality audit and cleansing features will have better applicability in the modern BI applications where streaming and historical data are required to be integrated seamlessly.
4) Look out for meta data support
Meta data management is a central activity for any ETL and reporting project. Most of the ETL tools support meta data capturing and maintenance features. The main challenge in current meta data management facilities of ETL tools is sharing a meta data at different layers of an information management (IM) system. There are few independent meta data management systems, such as Superglue from Informatica and MetaCenter of Data Advantage Group that can be deployed to create a central meta data repository for IM initiatives.
5) Ensure version control of the ETL maps
An ETL tool should support version controlling mechanism though which the developer can maintain different versions of the source code, without overwriting the original code. The ETL tool should also prevent multiple developers from working on the same extract so that at a time only one can make changes in the ETL work flow.
Almost all ETL tools have version control features, but they are not advanced enough to manage complex project delivery where ETL is being developed with varied groups in different time zones. Integration of advanced software version control tools like CVS and Code Co-op is a challenging process, because they are good to detect and control software versions in a distributed environment, whereas ETL version control requires not only configuration management of ETL codes but also of meta data and embedded business rules.
Confirm if ETL tool supports data manipulation
Most of the current generation ETL tools support data manipulation in and out of target database. In data integration projects, data extracted from multiple data sources are manipulated outside of the target database to fit into a uniform format. In a historical data warehouse, ETL needs to manipulate existing record in the target database before inserting a new instance of the same record. Current ETL tools are capable of managing data manipulation inside and outside of the target database.
About the Author: Dr Pramod Singh is senior consultant of Information Management. Dr Singh has around 20 years of experience in IT and holds Doctorate in Data Mining from University of New South Wales, Australia. He can be contacted through his email: firstname.lastname@example.org.
(As told to Sharon D'Souza)
This was first published in February 2011