There is a vigorous debate going on since the advent of datawarehousing in the DW community about whether to build applications top-down or bottom up.
The model often requires implementation of numerous, large, normalized, relational target databases, including a central data warehouse, Operational Data Store, Persistent Staging Area, and additional databases, such as an Exploration Warehouse.
Although applications may be built using iterative development techniques, the top-down methodology typically requires lengthy interviews with personnel from multiple business groups and preparation of an extensive, Enterprise Data Model for the DW application.
Top-down development approaches often fail because they require a large, up-front development effort, defer ROI, and require implementation of multiple, large, normalized target databases.
The bottom-up development methodology, as advocated by Ralph Kimball, myself, and other analysts, supports business requirements to build applications iteratively and achieve rapid ROI.
To avoid data mart chaos, individual data marts are embedded within a long-term enterprise data warehousing architecture that is specified early in the project. The development effort is controlled through use of incremental, enterprise data modeling techniques, and all components of the architecture are integrated with central meta data, generated and maintained by an ETL tool.
In my experience, bottom-up techniques have been highly successful in meeting business requirements, and in reducing development time, cost, and risk. However, it is necessary to use best practices in the development effort.
This article summarizes some of the mistakes that can be made in bottom-up development.
MISTAKE 1: Use of Normalized, Relational, Target Databases
A central tenet of bottom-up development is avoidance of large, normalized, relational data structures. It has been known for years that joining large, normalized, relational tables in response to queries can take an excessive amount of time.
To avoid these problems, bottom-up techniques are often based on the use of dimensional data structures, such as star schemas, rather than normalized structures. Star schemas consist of one or more normalized Fact Tables surrounded by multiple, denormalized Dimension Tables.
The Star Join feature of many RDBMSs may be used to greatly speed-up the process of joining Dimension Tables to Fact Tables.
A major advantage of the bottom-up approach is elimination of the need to design and populate large, normalized, target databases for a central data warehouse, an ODS, or a Persistent Staging Area.
Atomic data is typically stored in dimensional structures, rather than normalized, relational tables. An ODS is not required because real-time source data can now be accessed by ETL tools via native interfaces to EAI brokers, such as WebSphere MQ, TIBCO, or Web Methods.
Instead of capturing source data in a Persistent Storage Area, staging of data is generally performed using a temporary, volatile, flat file. When additional source data is required, point-and-click data modeling tools and ETL tools are used to modify the target data model and populate the model with additional Facts and attributes of dimensions.
MISTAKE 2: Inadequate Data Warehousing Architecture
Bottom-up development is frequently based on the Data Warehousing Bus Architecture described by Ralph Kimball in his book “The Data Warehouse Toolkit, Second Edition”. In this implementation, data marts, designed using dimensional models with conformed dimensions and facts, are implemented for each business process.
The dimensions of the set of data marts have a high degree of conformity, i.e., many dimensions are identical or an exact subset of a common dimension across all data marts. The data marts are built incrementally, one business process at a time.
All data marts share central meta data definitions generated and maintained by an ETL tool. A meta data exchange architecture is used to automatically synchronize local meta data with central meta data. Additional data marts can be implemented by separate groups and plugged into the standard architecture and infrastructure.
Detailed atomic data and aggregates are typically stored in each data mart. Access of atomic data across data marts is fast and efficient due to the use of conformed facts and dimensions.
A central data warehouse, utilizing a normalized data model, is not required and is not recommended for use in the Bus Architecture.
An Operational Data Store and a Persistent Staging Area are also optional.
MISTAKE 3: Few Conformed Dimensions; High Overlap of Atomic Data
The target database for a data mart is assumed to be a relational database that stores multiple data structures in the form of star schemas. One star schema stores Facts at the atomic level, e.g., individual atomic transactions.
Additional star schemas store pre-computed aggregates, i.e., smaller star schemas that contain summary information with different combinations of dimensions. Aggregates are pre-computed as part of the nightly data warehouse update process.
An objective in designing data marts for the Bus Architecture is to achieve a high level of conformity of dimensions across data marts and a low amount of overlap at the level of atomic transactions.
The set of target databases for the DW Bus Architecture is the logical equivalent of a Central Data Warehouse. The advantage of this approach is that a Central Data Warehouse in normalized form is not required.
MISTAKE 4: Hand-Coded Extraction, Transformation, and Load
A major mistake in the implementation of a data warehousing application is the use of procedural languages to code the extraction, transformation, and load process. There are serious problems with hand-coded ETL programs.
Maintenance of the hand-written ETL programs is likely to become a nightmare. The functions to be supported by data warehouses cannot be predicted in advance. Users initiate a high volume of changes that require hand-written ETL programs to be constantly modified and in many cases rewritten. The effort required to maintain these programs often becomes a major burden.
Meta data is not generated automatically by hand-generated ETL programs. Thus, it is difficult to avoid building stovepipe data marts that cannot be integrated across the enterprise.
Hand-coded ETL programs are often single-threaded and are likely to have slower speed of execution than directly executable, multi-threaded code generated automatically by modern ETL tools.
MISTAKE 5: Lack of Integration of Components at the Meta data Level
Best practices call for the use of a powerful, second-generation ETL tool to automatically generate executable applications and integrate all components of the architecture with central meta data.
Best-of-breed ETL tools support native interfaces to data sources, including legacy files, flat files, spreadsheets, relational databases, ERP, CRM, Web log files, real-time message queues, XML, and external data sources.
They are scalable to a wide range of supported platforms, including Windows and Unix environments and support a wide range of relational and non-relational databases.
The ETL tool should support automatic meta data generation and integration of meta data across all components of the architecture, including the ETL tool, data modeling tools, data cleansing tools, BI tools, and ERP packages.
There should be only a “single version of the truth” for business rules and entity definitions, which should be stored in the central meta data repository, generated and maintained by the ETL tool.
MISTAKE 6: Lack of Data Quality
Blindly copying dirty source data into the data warehouse is extremely dangerous. Dirty data can lead business analysts to make erroneous business decisions and rapidly causes end users to lose confidence in the quality and integrity of the data warehouse.
Lack of data integrity in the data warehouse is a common cause of failure of many data warehouses and analytic applications.
ETL tools can be used, in combination with third-party data cleansing tools, to provide data integrity. Functions supported by these tools include data profiling, assessment of data quality, standardization, integration, de-duping name and address data, and identifying household relationships.
Use of a data cleansing tool to process source data will ensure that the data in the DW target databases is clean, consistent, and reliable.
MISTAKE 7: Specification of an Enterprise Data Model
The traditional, top-down approach is based on rigorous Information Engineering techniques. It typically requires a substantial, long-term effort to interview potential users of the data marts, document user requirements, and prepare a detailed enterprise data model for the data warehouse.
This often involves in-depth business discovery across multiple business units, reconciliation of numerous differences in entity definitions and business rules, and months of work to document functional requirements and specify an enterprise data model for the data warehouse.
The bottom-up approach is based on Rapid Application Development (RAD) techniques that build applications, or data marts, incrementally.
Rather than specifying an Enterprise Data Model up-front, the bottom-up process uses logical data modeling techniques to specify the Enterprise Data Model incrementally, one data mart at a time.
As each data mart is implemented, it contributes additional entities and relationships to the overall logical data model. Over time, the logical data model evolves to become the Enterprise Data Model.
MISTAKE 8: Inappropriate Team
Top-down development often requires a large, dedicated staff to design and implement a central Data Warehouse, ODS, Persistent Staging Area, and other normalized, relational target databases.
In contrast, the bottom-up development effort utilizes a small, self-managed team consisting of three skill sets; a data modeler, an ETL specialist, and a BI tool specialist. The primary development team builds data marts rapidly to achieve high ROI, making maximum use of reusable components, such as conformed dimensions, conformed Facts, and high-level transformation objects.
A separate Maintenance and Enhancement Team responds to user requests for enhancements to completed data marts. Additional staff personnel include a Data Warehouse Administrator, DBA, source system experts, Data Steward, training support personnel, and continued involvement of business analysts.
MISTAKE 9: Development Effort Not Time-Boxed
In my experience, it is important to place time boxes around every step of the development process. The Requirements Specification Phase is typically constrained to a time box of 4 to 6 weeks.
Functions performed during this phase include user interviews, specification of the long-term enterprise DW architecture, selection of components, and modeling of the initial data mart.
Implementation of the initial data mart is constrained to a time box of 90 days.
Prior to the initiation of the 90-day time box, the self-managed development team defines the functionality they are willing to commit to developing within the 90-day period. They then manage their own resources to complete their self-imposed deliverables.
This “Swat Team” approach provides unusually high levels of productivity and efficiency.
MISTAKE 10: Long User Interviews
Top-down development techniques often require a lengthy user interview process across multiple business areas.
The objective of the interview process is to gain a detailed, global view of functional requirements for the entire enterprise data warehouse. The interviews often lead to the specification of an Enterprise Data Model for the data warehouse.
The overall interview and data modeling process may consume one to two years and produce nothing but a large pile of paper.
The bottom-up approach dramatically reduces the requirement for up-front interviews of end users and specification of enterprise data models. In the bottom-up process utilized by my organization, global requirements are defined through short interviews with individual business units that consume no more than one day per interview.
The deliverable from each interview is a short, concise requirement specification and a top-level dimensional model for the business unit. The top-level data models produced as a result of interviews with multiple business units are then synthesized to identify conformed dimensions, facts, transformations, aggregates, etc.
The reason that long user interviews are unnecessary is that prospective users of data marts typically cannot specify the detailed functionality of a data mart in advance. Up to 50% of required functionality is identified after the data mart is in operation.
Data marts are an example of an application in which users don’t know what they want until they see it.
Your Turn Now – Share Your Thoughts
What is your experience around these 10 mistakes I discussed with you today? Have you seen people making these in your professional experience? Share with us!