Data Quality plays a crucial role in the success of a data warehouse. End-users expect that the data residing in the data warehouse database will be accurate and reliable.
Failure to meet these expectations will result in a loss of confidence in the data warehouse by the end-users. Once end-user confidence is lost, the data warehouse will sit unused by the end user community.
An effective data quality program views data quality from two perspectives: a preliminary data quality assessment to identify data quality issues, and an on-going plan to assess and monitor the quality of data entering the data warehouse.
Six Dimensions of Data Quality:
Data quality is defined in terms of six different dimensions:
- Accuracy: Accuracy has two components:
- Correctness – represents the truth of the data.
- Clarity – whether one can truly and clearly understand what the data means.
- Completeness: Completeness has two components:
- Existence – relates to the presence or absence of data.
- Comprehensiveness – measures the data’s ability to answer all possible business questions
- Precision: Precision entails having the appropriate level of detail in the data. It relates primarily to numeric data and is exemplified by having a sufficient number of decimal places.
- Consistency: Consistency has two components:
- Internal Consistency – deals with whether the data is consistent with a set of rules about itself. This deals with the concept of Domain Integrity.
- External Consistency – deals with whether one item or one set of data is consistent with another item or set of data. External Consistency deals with such concepts as Entity Integrity, Referential Integrity, and Time Integrity.
- Timeliness: Timeliness deals with whether the information can be provided to the end user quickly.
- Accessibility: Accessibility has three components:
- Security – does someone who needs the data have the appropriate security authority to get the information?
- Communications and platform architectures – are these in place to enable users to get to the data?
- User interface – does the interface allow the user to understand, access, and navigate the data?
The first four dimensions are assessed both in the preliminary data quality assessment and as part of the on-going assessment/monitoring process. The last two dimensions are only addressed in the preliminary assessment.
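Two of the measurable dimensions above, completeness and internal (domain) consistency, can be scored mechanically during an assessment. The sketch below is illustrative only: the field names, records, and the `VALID_STATUSES` domain rule are assumptions, not taken from any specific source system.

```python
VALID_STATUSES = {"active", "inactive", "pending"}  # assumed domain rule

def completeness(records, required_fields):
    """Fraction of records in which every required field is populated."""
    if not records:
        return 1.0
    ok = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return ok / len(records)

def domain_consistency(records, field, valid_values):
    """Fraction of records whose field value falls inside its valid domain."""
    if not records:
        return 1.0
    return sum(1 for r in records if r.get(field) in valid_values) / len(records)

records = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "unknown"},  # domain violation
    {"customer_id": 3, "status": ""},         # missing value
]
print(completeness(records, ["customer_id", "status"]))       # two of three complete
print(domain_consistency(records, "status", VALID_STATUSES))  # one of three in-domain
```

Scores like these, captured during the preliminary assessment, become the baseline metrics referred to later in this article.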
An essential part of the data quality effort is the preliminary data quality assessment, in which the business subject matter experts play a key supporting role. The preliminary assessment provides the following benefits for the data warehouse project:
- Early discovery of data anomalies
- Better data models
- Accurate source to target mappings
- Provides baseline quality metrics
- Provides foundation for extract, transformation & data cleansing specifications
- Reduces scrappage and rework
- Avoids “White Noise” in testing
The limited amount of time available in a data warehouse project requires a well-organized assessment plan that focuses the assessment on the critical data from the source system. Typically, a consulting company works with the business users to review the business questions and identify critical source data for in-depth analysis. Sampling techniques will be used to limit the amount of source system data so the analysis can be both effective and timely.
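The sampling step mentioned above can be as simple as a reproducible random sample of the extracted rows, so that two analysts profiling the same feed see the same records. A minimal sketch, with an assumed seed and sample size:

```python
import random

def sample_source_rows(rows, sample_size, seed=42):
    """Draw a reproducible simple random sample of source rows for profiling."""
    rng = random.Random(seed)        # fixed seed keeps the sample repeatable
    if len(rows) <= sample_size:
        return list(rows)            # small extracts are profiled in full
    return rng.sample(rows, sample_size)

rows = list(range(10_000))           # stand-in for extracted source records
sample = sample_source_rows(rows, 100)
```

Stratified sampling (by region, account type, etc.) is often preferable when anomalies cluster in particular segments; the simple version above is only a starting point.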
Metrics will play an important role in the data quality process. The data quality team will need them to assess the data quality requirements and to validate the results of the assessment.
The data quality team and the business staff will develop a data quality assessment matrix to assist in prioritizing data. The matrix will include:
- Key data for referential integrity
- Critical data required for decision making
- Dimensional/hierarchical data required to support dimensional data analysis
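One lightweight way to realize such a matrix is a list of entries flagging why each data element matters, with a simple score to order elements for in-depth analysis. The element names and weights below are illustrative assumptions, not a prescribed scoring scheme:

```python
# Hypothetical assessment matrix entries, one per candidate data element.
MATRIX = [
    {"element": "customer_id", "referential_key": True,  "decision_critical": True,  "dimensional": False},
    {"element": "region_code", "referential_key": False, "decision_critical": False, "dimensional": True},
    {"element": "comments",    "referential_key": False, "decision_critical": False, "dimensional": False},
]

def priority(entry):
    # Illustrative weights: referential keys first, then decision-critical
    # data, then dimensional/hierarchical data.
    return (3 * entry["referential_key"]
            + 2 * entry["decision_critical"]
            + 1 * entry["dimensional"])

ranked = sorted(MATRIX, key=priority, reverse=True)
```

In practice the weighting would be negotiated with the business users, but even a crude ranking keeps the assessment focused on data that can break referential integrity or decision making.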
Break-down of the Preliminary Data Quality Assessment:
- Define business questions
- Identify data required to answer the business question
- Prioritize data required to support business requirements
- Determine quality tolerance (regulated, end-user comfort zone etc.)
- Historic trend analysis
- Identify the necessary resources for the appropriate source system
- Analyze Source data
Source to Target Mapping
- Map source data to target data model
- Group data anomalies into categories according to their severity and priority
- Relate categories to guidelines for processing data anomalies
- Determine business requirements for specific data elements, including the levels of tolerance and thresholds
- Change business process, policy or rule
- Change source system’s functionality
- Transform the data before loading it into the data warehouse
- Load the data into an error table, not the data warehouse
- Load marginal data into the data warehouse, with flags
- Don’t use the data (cannot be fixed)
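The handling options above amount to a routing decision per record. The sketch below illustrates one way to encode them; the severity category names, the `dq_flag` field, and the placeholder transformation are assumptions for illustration:

```python
def transform(record):
    """Placeholder transformation: trim whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def route_record(record, severity):
    """Return the destination for a record given its anomaly severity."""
    if severity == "none":
        return "warehouse"
    if severity == "transformable":
        record = transform(record)    # fix the data before loading it
        return "warehouse"
    if severity == "marginal":
        record["dq_flag"] = True      # load marginal data, but flagged
        return "warehouse"
    if severity == "correctable":
        return "error_table"          # hold for correction and reprocessing
    return "rejected"                 # cannot be fixed; do not use the data
```

Which severity a given anomaly category maps to is exactly what the tolerance and threshold discussions with the business users decide.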
Develop DQ Specifications for ETL
- Based on transformation/cleansing rules from the data quality assessment
- Based on tolerances and thresholds developed in the data quality assessment
- Build continuous quality assessment for priority data elements
- Build continuous quality assessment for problem data elements
- Update historical tables
- Build an audit trail
The time invested in the preliminary data quality assessment will reduce the development and testing time required later in the warehouse development cycle. The assessment not only produces transformation and cleansing specifications; it also produces the majority of the specifications and processes required for the on-going data quality assessment/monitoring functions in the extract, transformation and load process for the warehouse refresh and reload cycles.
On-going Data Quality Assessment/Monitoring
Data quality and cleansing is a cyclical, repeatable process. On-going data quality assessment/monitoring and data cleansing is an important part of maintaining data for the life of the data warehouse. Many factors may affect the quality of the source system data and may require changes to the extract, transformation and load process for the warehouse refresh and reload cycles. Often these changes go undetected by the data warehouse team.
Break-Down of the Data Quality Assessment/Monitoring Process
Basic Assessment & Cleansing Extracted Data
The basic data assessment process examines the extracted source data for records with quality issues that make them unusable. These records will be routed to an error file for correction and reprocessing. The IT department will determine usability requirements with the organization’s business users and technical staff. Records that pass the assessment will go through the normal transformation process, with cleansing applied for missing field values and the like. The basic functions of this process are:
- Determine valid values
- Determine precision
- Determine completeness
- Create consistency with transformation
- Derive or supply default values for empty fields (when feasible)
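The functions above can be driven from a small rules table per field. A minimal sketch, where the field names, valid values, precision, and defaults are illustrative assumptions rather than a real schema:

```python
# Assumed per-field rules: valid values, decimal precision, required flag, default.
RULES = {
    "status":  {"valid": {"A", "I"}, "required": True},
    "balance": {"decimals": 2, "required": True},
    "region":  {"required": False, "default": "UNKNOWN"},
}

def assess(record):
    """Return (cleansed_record, errors); any error makes the record unusable."""
    errors, out = [], dict(record)
    for field, rule in RULES.items():
        value = out.get(field)
        if value in (None, ""):
            if "default" in rule:
                out[field] = rule["default"]   # supply a default when feasible
            elif rule.get("required"):
                errors.append(f"{field}: missing")    # completeness failure
            continue
        if "valid" in rule and value not in rule["valid"]:
            errors.append(f"{field}: invalid value {value!r}")  # domain failure
        if "decimals" in rule:
            out[field] = round(float(value), rule["decimals"])  # enforce precision
    return out, errors
```

Records that come back with a non-empty error list are the ones routed to the error file; the rest continue into transformation.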
Advanced Assessment & Cleansing Passed Data
The advanced assessment/cleansing process occurs during the staging process with those records that have passed the basic assessment process. The basic functions of this process are:
- Perform cross-field analysis
- Address cleansing
- Party matching/normalization
- Address matching/normalization
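Two of these advanced steps can be sketched simply: a cross-field rule and a naive address normalization. Real party and address matching is usually done with dedicated cleansing tools; the abbreviation table and the close-date rule below are assumptions for illustration only:

```python
import re

# Assumed suffix abbreviations used to standardize addresses before matching.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize_address(address):
    """Lowercase, collapse whitespace, and standardize common suffixes."""
    tokens = re.sub(r"\s+", " ", address.strip().lower()).split(" ")
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def cross_field_ok(record):
    """Example cross-field rule: a closed account must carry a close date."""
    if record.get("status") == "closed":
        return record.get("close_date") is not None
    return True

print(normalize_address("123  Main   Street"))  # "123 main st"
```

Normalizing both sides to the same canonical form is what makes the subsequent address and party matching steps reliable.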
Error Correction: Failed Data
Records that were determined as unusable are placed in an error file for manual evaluation. Those records that can be corrected will be routed through the basic assessment process. The basic functions of this process are:
- Manual evaluation of bad records
- Identify correction steps
- Correct records or reprocess them from the legacy system
- Document error causes for feedback to the ETL team and source system group
Some data anomalies within a data warehouse system will only be noticed by comparing current averages and distributions of data against trends collected over time. A set of tables will be developed during the preliminary data quality assessment to track these trends for priority data. These tables will be referenced each time the warehouse is refreshed. When the historical analysis indicates that critical data exceeds its tolerances for change, an alert will be raised or, in extreme cases, processing may be aborted. The basic functions of this process are:
- Degree of volatility
- Discover trends
- Discover repeated patterns
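The alert/abort behavior described above reduces to comparing the current refresh's metric against a historical baseline. A minimal sketch, where the warning and abort percentages are illustrative tolerances that the business would actually set:

```python
def trend_check(history, current, warn_pct=10.0, abort_pct=25.0):
    """Return 'ok', 'alert', or 'abort' based on deviation from history."""
    baseline = sum(history) / len(history)            # historical mean
    deviation = abs(current - baseline) / baseline * 100
    if deviation >= abort_pct:
        return "abort"    # extreme change in critical data: stop processing
    if deviation >= warn_pct:
        return "alert"    # notify the data warehouse team
    return "ok"

history = [1000, 1020, 980, 1005]   # e.g. daily row counts for a feed
print(trend_check(history, 1010))   # "ok"
print(trend_check(history, 1150))   # "alert"
print(trend_check(history, 400))    # "abort"
```

A production version would also track distributions, not just means, since a shifted distribution can hide behind a stable average.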
At the close of the extract, transformation and load process, reports will be generated for audit purposes and for management review. The data warehouse management team will use the statistics to evaluate the assessment and cleansing processes and to identify areas for quality improvement in the source systems. The basic functions of this process are:
- Reporting on distribution of defaulted values
- Reporting on data quality anomalies by source
- Reporting on data quality anomalies by data type
- Analysis of data totals against validation data from the source systems
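The by-source and by-type tallies above are simple aggregations over the anomaly records captured during the load. A sketch with an assumed anomaly-record shape:

```python
from collections import Counter

# Assumed anomaly records emitted during the load; source and type names
# are illustrative.
anomalies = [
    {"source": "billing", "type": "missing_value"},
    {"source": "billing", "type": "domain_violation"},
    {"source": "crm",     "type": "missing_value"},
]

by_source = Counter(a["source"] for a in anomalies)  # anomalies per source system
by_type = Counter(a["type"] for a in anomalies)      # anomalies per data type/category
```

Trending these counters across refresh cycles is what gives the feedback report its evidence when asking a source system group to fix a recurring problem.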
The feedback process provides the mechanism for communicating data quality issues back to the groups responsible for the source systems. A formal assessment report will be created by the warehouse management group after each cycle and distributed to MIS and business management for the source systems. This report will:
- Identify current data quality issues
- Assess the benefits vs. costs of correcting the anomaly
What is your experience with Data Quality Assessment?
Please share your insights with regard to this topic or let me know if you have any questions. I look forward to your comments.