Standard, regular data warehouse data is often supplemented by non-standard, irregular data. Such irregular data may be low in volume, loaded or modified infrequently, yet still require special auditing or management. An example of irregular data could be sales commissions entered at the time of sale (by, say, a mobile phone retailer selling handsets as well as SIM cards) that later need to be adjusted or integrated with overriding revenue/margin data from multiple suppliers (say, mobile service providers and handset manufacturers) in order to produce accurate reports. (See table 1: Issues with traditional approaches)
Design patterns for irregular data
Resolving the issues with handling irregular data requires standard design patterns for data integration. Here are five approaches:
1. ETL and full integration with data warehouse (DW) keys:
In this approach -- suitable mainly for one-off transaction loads -- the extract, transform, load (ETL) process uses the irregular data file to update existing dimensional records, matching on the natural keys present in the file. ETL inserts new dimensional records (or serves as the only source of the dimension table) and creates the data warehouse (DW) surrogate keys in the process.
Use this approach to add new data that does not exist in any of the source systems, provided the source data is auditable at the record level and deletions in the source data can be flagged rather than physically deleted. Source file content must be validated internally by the business intelligence competency center (BICC).
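As a rough illustration of pattern 1, the sketch below (using SQLite purely for demonstration; the table and column names are invented) matches irregular records on the natural key, updating existing dimension rows and letting the database generate the DW surrogate key for new ones:

```python
import sqlite3

# Hypothetical sketch of pattern 1: ETL matches on the natural key and
# creates the DW (surrogate) key for new dimension records.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_supplier (
        supplier_dw_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- DW surrogate key
        supplier_code   TEXT UNIQUE NOT NULL,               -- natural key
        supplier_name   TEXT,
        is_deleted      INTEGER DEFAULT 0                   -- deletions flagged, not removed
    )
""")

irregular_file = [  # rows from the validated irregular source file
    ("SUP-01", "Handset Maker A"),
    ("SUP-02", "Mobile Service Provider B"),
]
for code, name in irregular_file:
    # Update if the natural key already exists, otherwise insert
    # (which generates the DW key).
    conn.execute("""
        INSERT INTO dim_supplier (supplier_code, supplier_name)
        VALUES (?, ?)
        ON CONFLICT(supplier_code) DO UPDATE SET supplier_name = excluded.supplier_name
    """, (code, name))
conn.commit()
```

Re-running the load with corrected supplier names updates the existing rows in place while their DW keys stay stable.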
2. Direct data maintenance in DW tables:
Here the DW table is updated or inserted into manually. The table has a natural key only and no surrogate key. Data validations are applied in the portal front end at data entry, and table constraints must be set to ensure the integrity of manually created records. As there is no DW key involved, this approach is best reserved for small tables to avoid performance impact.
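A minimal sketch of pattern 2, again using SQLite with invented names: the table is keyed by its natural key alone, and a database CHECK constraint backs up the front-end validation by rejecting out-of-range manual entries:

```python
import sqlite3

# Hypothetical sketch of pattern 2: a small DW table keyed only by its
# natural key, with constraints guarding manually entered records.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE commission_rate (
        product_code  TEXT PRIMARY KEY,   -- natural key; no surrogate key
        rate_percent  REAL NOT NULL CHECK (rate_percent BETWEEN 0 AND 100),
        valid_from    TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO commission_rate VALUES ('SIM-PREPAID', 7.5, '2011-04-01')")

# A manual entry that violates the CHECK constraint is rejected by the database,
# even if the front-end validation were to miss it.
try:
    conn.execute("INSERT INTO commission_rate VALUES ('HANDSET-X', 150.0, '2011-04-01')")
except sqlite3.IntegrityError:
    print("rejected: rate out of range")
```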
3. Direct maintenance in tables and addition of DW keys with ETL:
This approach is similar to the previous one, with the addition that ETL generates the surrogate key after manual updates or inserts to the DW table. Use this approach for updating values in larger dimensions, where maintaining a source data file is not feasible.
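One way to sketch pattern 3 (table and function names invented for the example): rows are entered with the natural key only, and a later ETL pass backfills the missing surrogate keys:

```python
import sqlite3

# Hypothetical sketch of pattern 3: records are keyed by natural key when
# entered manually; an ETL pass later backfills the DW surrogate key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_store (
        store_dw_key INTEGER,            -- surrogate key, NULL until ETL assigns it
        store_code   TEXT UNIQUE NOT NULL -- natural key, entered manually
    )
""")
conn.executemany("INSERT INTO dim_store (store_code) VALUES (?)",
                 [("ST-100",), ("ST-101",)])

def assign_dw_keys(conn):
    """ETL step: give every keyless record the next surrogate key value."""
    next_key = conn.execute(
        "SELECT COALESCE(MAX(store_dw_key), 0) FROM dim_store").fetchone()[0] + 1
    keyless = conn.execute(
        "SELECT store_code FROM dim_store WHERE store_dw_key IS NULL "
        "ORDER BY store_code").fetchall()
    for (code,) in keyless:
        conn.execute("UPDATE dim_store SET store_dw_key = ? WHERE store_code = ?",
                     (next_key, code))
        next_key += 1

assign_dw_keys(conn)
```

Because the ETL pass starts from the current maximum key, it can be re-run safely after each batch of manual entries.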
4. Standard DW data modeling:
In this approach, external data is loaded into a DW star schema. It is useful when adjustments that occur outside the transactional system need to be captured and loaded. While the dimension type (for example, location) must be defined in the DW in advance, the dimensions loaded may not be at the grain of the regular DW fact tables.
For this method, it should be possible to assign a context to the external data or adjustment and apportion it to the existing facts. If the adjustment arrives at a grain different from that of the existing facts it supplements, and apportioning is not possible, then summary-level analysis between the two areas must be made possible; for example, adjusting sales figures at the state level when the transactional data is at the individual store level.
Data loaded this way must be auditable and set active only for the period in which it is valid. External data must not, however, overwrite the regular transaction data values.
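The apportionment step in pattern 4 can be sketched in a few lines. Here a state-level revenue adjustment is spread across store-level facts in proportion to each store's existing sales; all figures and names are invented for the example:

```python
# Hypothetical sketch of pattern 4: apportioning a coarse-grained adjustment
# to finer-grained existing facts, proportional to each fact's current value.
store_sales = {"store_1": 600.0, "store_2": 300.0, "store_3": 100.0}  # existing facts
state_adjustment = -50.0  # external revenue correction at state grain

total = sum(store_sales.values())
apportioned = {store: state_adjustment * sales / total
               for store, sales in store_sales.items()}
# The apportioned shares sum back to the original adjustment, and the
# original transaction values themselves are left untouched.
```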
5. BI application layer for irregular data:
This approach is used when external data need not be integrated with DW data at all. The external data is made available to the BI application server directly and is accessed directly by the BI presentation layer. To reduce performance impact, it is recommended that external data files be loaded into database tables (those that host DW data) from where the BI tool can access the irregular data. (See table 2: Integrating irregular data)
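A minimal sketch of the loading step in pattern 5 (file contents, table, and column names all invented): an external CSV file is loaded into a database table hosted alongside the DW data so the BI tool can query it like any other table:

```python
import csv
import io
import sqlite3

# Hypothetical sketch of pattern 5: an external file is loaded into a database
# table so the BI presentation layer can read it directly, with no DW integration.
external_csv = io.StringIO("region,target_revenue\nNorth,12000\nSouth,9500\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ext_revenue_target (region TEXT, target_revenue REAL)")
reader = csv.DictReader(external_csv)
conn.executemany("INSERT INTO ext_revenue_target VALUES (?, ?)",
                 [(r["region"], float(r["target_revenue"])) for r in reader])
conn.commit()
# The BI tool now queries ext_revenue_target like any other table.
```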
The design patterns described help realize the following benefits:
- Provide a well-documented, standardized approach to maintaining irregular data, allowing for continuous, agile updates to the process of managing data.
- Allow business as well as IT to better understand the consequences of how data is handled.
- Reduce time spent by the business in integrating data.
- Allow access rights to be assigned.
- Make data available to all information consumers.
- Enable data persistence in the database and not just at report time.
Data governance is critical when dealing with irregular data. The BICC must make sure that data quality standards are defined, and appoint data stewards to govern the data. A standard portal serving as a single source of truth for data emanating from different external sources must be developed. The portal, along with the associated design patterns, will pave the way for efficient reporting, even when considerable irregular data must be handled.
author: Girish Viswanathan is a DW & BI technical architect at Infosys Technologies Ltd. He has worked in the field for over 11 years, five of them as a BI project manager/technical architect. He specializes in ETL integration and is a Certified Business Intelligence Professional and Certified Data Management Professional.
This was first published in June 2011