Ravi is in trouble. Black Friday is fast approaching, and he needs 35,00,000 names and addresses for the big weekend sale; a clean file of names has to be sent to the mail house in less than a week. The source for these names needs to be consolidated from several different systems.
Even though Ravi has a good CRM system for creating mail files, he is very concerned about the quality, accuracy and reliability of his source data. He knows that many fields are not consistently populated, and that in many cases they contain invalid information. Now, Ravi needs
a reliable way to resolve these data quality problems. He needs to make sure that all the addresses are valid and standardized. If he had the time, Ravi also would like to remove any duplicate addresses.
Ravi’s case is common, and highlights a frequent data quality problem faced by various data users. There are two components of this data quality problem.
- Cleansing records to ensure data validity, accuracy and completeness.
- Determining a unique individual. (For example, is M. Mehra, 12/A, Shanti Vihar, Kurla [W] the same person as Manish Mehra, flat no. 12, A-wing, Shanti Vihar Co-Op., Kural [W]?)
This data quality problem is making Ravi jittery. There is a functional dimension of the data quality issue and a technical approach for remediation. Data quality tools aid technical remediation, and the functional dimension will help you harness the capabilities of these solutions.
There are many tools available in the market to resolve data quality problems. Some of these tools are very elaborate, while others are minimalist. Even in a comprehensive tool, certain modules may be required and others may not. Since the capabilities of data quality tools differ, their prices differ. The functions which are sought after in a data quality tool depend on the organization’s needs. In this context, you can map several of your data quality problems to issues with the tool’s capabilities in two steps.
Step 1: Identify functional tool capability challenges which create data quality problems
As part of the first step in mapping issues to the data quality tool's deficiencies, you have to take a look at the following aspects.
Reporting: How bad is it?
These issues are generally related to data profiling at the column level to generate statistics about aspects such as the range of values, row count against each value and NULL count.
Rules: What can we do about it now?
On this front you have to look at creation of a rule library to define various transformation rules in order to correct the data.
Transformation: how can we fix it?
Here we are talking about the tool’s functionality to ‘correct’ or ‘standardize’ data. For example,
- If the gender field contains ‘M,’ ‘Males,’ or ‘Male,’ then set the target value to ‘M.’
- If the zip code is ‘400001,’ then set the pin code to null.
Other functional requirements at fault are:
Data cleansing:Basic data cleansing functions like:
- Left and right justification
- Trimming blanks and spaces
- Case conversion
- Range checks
- Date/time validation and reformatting
Data enrichment/appends: In addition to standard cleansing, the ability to ‘enrich’ or append data acquired from internal/external sources such as identified missing data elements, etc.
Address standardization and cleansing: It’s essential that your data quality tool has address cleansing functionality to standardize addresses. Examples of these capabilities include:
- Fixing spelling
- Identification of address components, and standardize
- Validation of address against post-office addresses
- Validating postal codes
- Validation of street names
- Consistent and valid street suffixes
- Valid house numbers
Data merging:Data often needs to be obtained from multiple input sources and ‘merged.’ When a data attribute is duplicated in the input source, a user-specified business rule needs to be used to identify the ‘correct’ value. For example, the rule may specify “use pin code from Source-A when available. However, when it is null, use the pin code from Source-B”.
Customer de-duplication and data merging: Data about a particular customer may be in multiple input data sources, and each data source may have a different customer identifier. It is necessary to link or merge data from multiple data sources based on the identification that a particular customer is the same (based on user-specified business rules). For example, same address, same gender and same last name.
Incremental processing: Ability to quickly add or reload values as the source data changes, and ability to track source to target record linkage.
Step 2: Reconciliation
As part of the reconciliation process, we have check whether the measures we undertook deliver the right results. Testing and validation are the ways in which we can ascertain the efficacy of our corrective measures.
Testing:Functionality to test rules on sample sets of data.
Validation:Providing statistics about each data load. These include areas like:
- Number of records processed
- Generate hash total and checksum total
- Number of times a rule was invoked
- Post-cleanse ‘data visualization’
- Audit cleansed and transformed data partition
- Generate and validate statistics of data load
- Create and populate error and exception tables
The business need for a data quality tool is the commercial value (or use) of the data. If data quality problems hamper the use of that data, then it only creates nuisance than business value. In this context, one-time cleaning is not enough. The use of a data quality tool and associated processes (which the tool forces your organization to adopt) leads to reduction of data quality issues across the organization.
About the author: Somnath Dasgupta is the Head of Data Warehouse Practice at NIIT technologies.