Numerous aspects and issues must be considered in the design and implementation of successful data warehousing applications.

The aspects to consider include defining the business purpose of the data warehouse, specifying an enterprise data warehousing architecture, and selecting appropriate components and tools.

I define a set of best practices in data warehousing that can be used as the basis for the specification of data warehousing architectures and selection of tools. These best practices, which are derived from extensive consulting experience, include the following:

  1. Ensure that the data warehouse is business-driven, not technology-driven
  2. Define the long-term vision for the data warehouse in the form of an enterprise data warehousing architecture
  3. Provide flexibility to allow for significant changes in data warehousing functionality and integration of new business requirements
  4. Identify best practice solutions and introduce new functionality in a piloted or prototyped manner
  5. Reduce dependence on individuals who have key information
  6. Do not build “virtual” data warehouses that access data directly from source environments and have no target database
  7. Avoid “stovepipe” data marts that do not integrate at the metadata level with a central metadata repository, generated and maintained by an ETL tool. Replace existing stovepipe or tactical data marts by developing fully integrated, dependent data marts, using best practices
  8. Buy, don’t build data warehousing components
  9. Use a data modeling tool to perform logical and physical modeling. Ensure that a single logical data model drives the generation of all physical, target data models
  10. Create a hub-and-spoke architecture, with the ETL tool as the hub, to extract data from all data sources, resolve inconsistencies in data sources, and generate clean, consistent target databases for decision support
  11. Use a second-generation ETL tool to automate the extraction, transformation, and load functions:
    • Use the ETL tool, as the hub of a hub-and-spoke architecture, to access source data from all data sources, apply procedural cleansing, compute aggregates, apply integrity constraints, implement business rules, generate and maintain central metadata, and load clean, consistent, target databases
    • Do not code extraction, transformation, and load functions by hand using procedural code, such as COBOL, C, C++, PL/SQL, Perl scripts, Cold Fusion, etc. In many organizations, procedural code is used extensively to access source data, transform the data, and load target databases. Procedural code is difficult and expensive to modify and enhance. In addition, procedural code does not automatically generate metadata, which leads to the development of non-integrated, stovepipe data marts. An important objective is to eliminate the need to write procedural code by using only point-and-click tools, e.g., point-and-click ETL tools, data modeling tools, data cleansing tools, and BI tools
    • The ETL tool should incorporate native interfaces to source and target databases, including legacy files, relational databases, as well as ERP, CRM, Web log, and real-time data sources
    • Use the ETL tool to access atomic data from data sources, not summary data
    • Ensure that the ETL tool is platform-independent, i.e., it runs in mainframe, Windows and Unix environments, and supports multiple relational and non-relational target databases
    • Use the incremental aggregation capability of the ETL tool to compute aggregates. The aggregates provide fast response to user queries
    • Ensure data security and integrity through automatic encryption, authentication, data recovery and data integrity
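The incremental aggregation idea above can be sketched in Python. The (product, region) keys and (sum, count) aggregates are purely illustrative and not tied to any particular ETL tool; the point is that each new batch folds into the running aggregates without re-reading previously processed source data:

```python
from collections import defaultdict

def incremental_aggregate(running, batch):
    """Fold a new batch of transactions into running (sum, count) aggregates
    without reprocessing previously loaded source data."""
    for key, amount in batch:
        total, count = running[key]
        running[key] = (total + amount, count + 1)
    return running

# Hypothetical daily batches keyed by (product, region)
running = defaultdict(lambda: (0.0, 0))
incremental_aggregate(running, [(("widget", "east"), 100.0), (("widget", "east"), 50.0)])
incremental_aggregate(running, [(("widget", "east"), 25.0)])
print(running[("widget", "east")])  # → (175.0, 3)
```

Because only the new batch is scanned, nightly aggregate maintenance stays proportional to the day's transaction volume rather than to the full history.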
  12. Populate the target databases using the ETL tool:
    • Re-organize the target data into business-centric dimensional models that facilitate analysis using metrics, dimensions, and attributes that are intuitively meaningful to business users
    • Use dimensional star schemas as the physical organization of data in data marts. Establish a goal of 60–70% conformed dimensions across data marts, and low overlap at the atomic level of detail across data marts
    • Eliminate the use of Access databases and Excel spreadsheets to maintain and manage departmental data
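A minimal star schema of the kind described above can be sketched with Python's built-in sqlite3; the table and column names are hypothetical. One fact table holds atomic measures, keyed to dimension tables whose attributes are the terms business users query by:

```python
import sqlite3

# Minimal star schema sketch: one fact table keyed to two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);
""")
con.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
con.execute("INSERT INTO dim_date VALUES (10, '2024-01-05', '2024-01')")
con.execute("INSERT INTO fact_sales VALUES (1, 10, 99.50), (1, 10, 0.50)")

# Star join: constrain on dimension attributes, aggregate the facts.
row = con.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.month
""").fetchone()
print(row)  # → ('Hardware', '2024-01', 100.0)
```

The query reads the way a business user thinks about it: metrics (amount) sliced by dimensions (category, month), with no navigation through normalized intermediate tables.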
  13. Integrate all components of the data warehousing architecture with central metadata
    • Buy only components that integrate with central metadata
    • Look for product suites that are integrated at the metadata level across all components of the DW architecture, i.e., that provide an end-to-end solution across data sources, data profiling tools, data cleansing tools, data modeling tools, ETL tools, target databases, analytic applications, analytic delivery, and general-purpose BI functions
    • Use the ETL tool to generate and maintain central metadata. Metadata generation should be automatic and transparent. Do not generate metadata by hand using procedural code
    • Use the ETL metadata repository as the “single version of the truth” for entity definitions, data models, business rules, transformation logic, calculations, etc.
    • Support global, synchronized metadata by integrating multiple, networked, ETL engines with common, global metadata generated and maintained by a central ETL engine. Beware of “multiple versions of the truth”
    • Map business terms to the data mart schema (business metadata). Ensure consistency in the definition and use of corporate definitions and metrics across business functions, including the systems of record for all data elements
    • Do not use an external metadata repository, such as CA Platinum or Rochade, as a global metadata repository. This creates multiple versions of the truth that cannot be resolved through bi-directional metadata synchronization
    • Generate extensible, CWM (Common Warehouse Metamodel)-compliant metadata and LDAP-compliant directories
    • Ensure that local business rules are 100% dependent on central entity definitions. This important constraint enables business analysts to generate reports using business rules that are defined locally, while supporting the needs of central administrators to generate reports across business units that can be consolidated and reconciled
    • Track where metadata and data models are used across tools, including data profiling, data cleansing, data modeling, ETL, and BI tools
    • Support data lineage reporting, i.e., automatic documentation and mapping of data flow from data sources, through transformations, to target databases
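Data lineage reporting can be illustrated with a toy mapping table. The source/transformation/target strings below are hypothetical and stand in for metadata an ETL tool would generate automatically; the walk simply follows mappings backwards from a target column to its ultimate sources:

```python
# Hypothetical lineage records: (source column, transformation, target column)
mappings = [
    ("crm.customer.cust_name", "trim/uppercase", "dw.dim_customer.name"),
    ("erp.orders.amt",         "currency->USD",  "dw.fact_sales.amount"),
    ("dw.fact_sales.amount",   "SUM by month",   "dw.agg_sales.monthly_amount"),
]

def lineage(target, mappings):
    """Walk mappings backwards from a target column to its ultimate sources."""
    steps = []
    frontier = [target]
    while frontier:
        col = frontier.pop()
        for src, xform, tgt in mappings:
            if tgt == col:
                steps.append((src, xform, tgt))
                frontier.append(src)
    return steps

for src, xform, tgt in reversed(lineage("dw.agg_sales.monthly_amount", mappings)):
    print(f"{src} --[{xform}]--> {tgt}")
```

Run against the sample mappings, this documents the two-hop flow from `erp.orders.amt` through `dw.fact_sales.amount` to the monthly aggregate.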
  14. Use the real-time features of the ETL tool to access near real-time data and support implementation of near real-time, closed-loop, analytic applications.
  15. Support strategic DSS and tactical DSS using the same architecture
    • Strategic DSS uses the ETL tool to process all accumulated source data transactions in a batch mode update at night. Strategic DSS is oriented toward access of summary data that is accurate as of end-of-business yesterday. Aggregates are pre-computed as part of the nightly update process
    • Tactical DSS uses the ETL tool to process near real-time data. It may be used to support 24x7 operation of the data warehouse and provide a high level of reliability and availability. The target databases are updated in near real-time without a nightly data update window. Tactical DSS is oriented toward access of near real-time detail data that is no more than a few minutes old. It is not practical to pre-compute aggregates for tactical DSS. Instead, if aggregates are required, they are computed in real-time
    • To support strategic DSS and tactical DSS in the same target databases, use timestamps on both the aggregates and the detail data. Constrain strategic DSS to access only detail data that has a timestamp that is equal to or prior to the timestamp of the aggregates
  16. Do not load dirty source data into the data warehouse
    • Use a data cleansing tool, in combination with a data profiling tool and an ETL tool, to analyze the source data, clean the source data, and resolve logical inconsistencies. Data profiling tools provide insight into the state of source data before it is extracted, including data types, dimensions, measures, primary and foreign keys, normalization, etc. Functions supported by data cleansing tools include assessment of data quality, standardization, integration, de-duping name and address data, and identifying household relationships.
    • Look for data cleansing tools that integrate at the metadata level with the ETL tool. Data cleansing function calls should be available directly from ETL mappings or as a pre/post process.
    • Use the data cleansing tool to resolve inconsistencies in the Customer, Employee, and Household dimensions, and to append geocoding data
  17. Do not let an ERP package, such as SAP R/3 or PeopleSoft, dominate the data warehouse. Treat the ERP application as one of many data sources. Leverage the off-the-shelf functionality of packaged data warehousing products, such as SAP BW or PeopleSoft EPM, but maintain the ability to build custom data warehousing applications using the same ETL tool utilized to populate the packaged product
  18. Support pre-packaged analytic applications that are fully integrated with the ETL tool and analytic delivery platform. Look for analytic applications that implement common business processes, provide an end-to-end solution, complement the existing infrastructure, deliver metrics in the form of intuitive, dashboard displays, and are modular, extensible, scalable, and customizable.
  19. Build the data warehouse bottom-up, not top-down
    • Bottom-up development maximizes ROI and minimizes risk
    • Develop the data warehouse incrementally, one business area at a time
    • Focus initially on the development of multiple data marts using a Data Warehouse Bus Architecture. Design data marts that support differentiable business processes, such as sales, finance, customer touchpoints, procurement, supply-chain management, asset management, human resources, etc.
    • The Bus Architecture consists of data sources, data cleansing tool, data modeling tool, ETL tool, target databases, and BI tools. All components of the architecture are integrated at the metadata level. In the bottom-up methodology, a central data warehouse, Operational Data Store, and persistent staging area are optional; they are not required components of the architecture
    • Avoid normalized, relational database structures, which can be extremely slow in joining large normalized tables, such as tables of transactions, customers, products, and employees, in response to a query
    • Implement target databases for data marts as dimensional structures. Compared with normalized relational structures, dimensional structures generally provide higher performance, greater flexibility, and less sensitivity to changes in data models. Dimensional structures are easier to navigate for end users and map much better to business processes than normalized relational structures. Ensure that the dimensional model is data-driven and not report-specific.
    • Store atomic transactions for a specified business process, together with multiple, pre-computed aggregates, in the target database for each data mart. The set of data marts, implemented as dimensional structures, is the logical equivalent of a central data warehouse
    • Access facts across data marts using the star-join capability of the RDBMS, rather than joining normalized relational tables in a central data warehouse. The star-join technique provides faster response to queries across business processes at the atomic level, in comparison with joining normalized relational tables
    • Design data marts to achieve minimal overlap in atomic transactions across data marts and a high level of conformed dimensions across data marts
    • Consider implementation of a central data warehouse and/or a persistent staging area only if there is significant overlap in atomic transactions across data marts and a low level of conformed dimensions across data marts
    • Prove each step before moving on to the next step

Deliver a major increment of functionality, in the form of a data mart, every 90 days or less. Measure the ROI from each delivered data mart
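The conformed-dimension idea above can be sketched with sqlite3: two hypothetical fact tables, one per data mart, share a single product dimension, so results drill across business processes on the same keys:

```python
import sqlite3

# Sketch: two data marts share one conformed product dimension, so
# results from separate business processes line up on the same keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, revenue REAL);   -- sales mart
CREATE TABLE fact_orders (product_key INTEGER, qty INTEGER);    -- procurement mart
""")
con.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
con.execute("INSERT INTO fact_sales VALUES (1, 500.0)")
con.execute("INSERT INTO fact_orders VALUES (1, 40)")

# Drill across marts by joining each fact to the shared dimension.
row = con.execute("""
    SELECT p.name, s.revenue, o.qty
    FROM dim_product p
    JOIN (SELECT product_key, SUM(revenue) AS revenue
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(qty) AS qty
          FROM fact_orders GROUP BY product_key) o
      ON o.product_key = p.product_key
""").fetchone()
print(row)  # → ('Widget', 500.0, 40)
```

Because both marts key their facts to the same dimension rows, a cross-process report needs no central data warehouse; the shared dimension does the integration.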

  20. Keep the incremental, bottom-up development effort under control through use of logical data modeling techniques and integration of all components of the architecture with central metadata, generated and maintained by the ETL tool. Create logical and physical models of the target databases with a data modeling tool. Do not allow DBAs to modify physical table structures directly. Modifications to table structures must be performed at the logical level using a data modeling tool. Do not assume that OLTP data modeling experience applies to data warehousing.
  21. Anticipate scalability and performance issues:
    • Select an ETL tool and server that support a linear increase in performance through incorporation of additional CPUs and RAM in the server. Ensure that the hardware resources can scale as the data warehousing application grows
    • Provide high performance through the use of pre-computed aggregates
      • Use an automated tool to identify candidates for aggregates
      • Use the incremental aggregation feature of an ETL tool to pre-calculate aggregates in one pass of the source data. Alternatively, use the DBMS to pre-calculate aggregates, or develop materialized views
      • Use the aggregate navigation functions of a BI tool or a database to re-direct queries to pre-computed aggregates
    • Select an ETL tool that supports a high degree of multi-tasking, high-speed operations in RAM memory, concurrent processing of parallel data streams, load balancing across CPUs and servers, and fail-over across multiple servers
    • Select an ETL tool that supports transparent migration from a Windows to a Unix environment. Consider implementing the initial data marts in a low-cost Windows environment and migrating to a Unix environment as performance requirements increase
    • Work with the DBA to specify indexing and partitioning of the target data to minimize I/Os, provide faster database access, and faster loads
    • Simplify the target data model to eliminate extraneous joins, typically caused by excessive snowflaking (i.e., normalized dimension tables)
    • Use a more powerful RDBMS, such as Teradata, DB2 UDB, Informix, or Tandem NonStop
    • Use hybrid tools, multidimensional query languages, and a persistent multidimensional cache to improve the performance of the BI tool and target database. Move to hybrid OLAP tools as soon as they integrate fully with ETL tools
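Aggregate navigation can be sketched as a simple dispatch: answer from a pre-computed aggregate when it covers the request, otherwise scan atomic detail. The data and structures below are illustrative; a BI tool or database performs this redirection transparently:

```python
# Hypothetical atomic detail and a pre-computed monthly aggregate.
detail = [
    {"product": "widget", "month": "2024-01", "amount": 60.0},
    {"product": "widget", "month": "2024-01", "amount": 40.0},
]
agg_by_month = {"2024-01": 100.0}  # pre-computed during the nightly load

def query_total(month):
    """Redirect to the pre-computed aggregate when available."""
    if month in agg_by_month:                # navigator hit: no detail scan
        return agg_by_month[month]
    return sum(r["amount"] for r in detail   # fallback: scan atomic detail
               if r["month"] == month)

print(query_total("2024-01"))  # → 100.0
print(query_total("2024-02"))  # → 0
```

The user asks the same logical question either way; the navigator decides whether the aggregate can answer it, which is what makes pre-computed aggregates deliver fast response times without changing queries.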
  22. Support data integration standards, including XML, CWM (Common Warehouse Metamodel), LDAP, UML, JMS, SOAP, J2EE, etc.
  23. Do not code reports by hand:
    • Use intuitive, point-and-click, Web-enabled Business-Intelligence tools to support query, reporting, OLAP analysis, drill-down, and drill-through functions
    • Deliver information from the data warehouse over the Web
    • Support ad hoc reporting, “push” reporting (reports are pushed to users on occurrence of user-defined events), and “pull” reporting (named, template-driven reports are pulled to the user)
    • Provide business analysts with intuitive, user-managed, digital dashboard displays, KPIs, balanced scorecards, etc., with business-defined threshold values, color-coded in red, yellow, and green
    • Provide an intuitive interface to communicate the meaning of a business rule or entity to the business user.
    • Use a BI-oriented corporate portal as a gateway to both structured and unstructured information
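The color-coded threshold idea above can be sketched as a small function. The metric values and thresholds here are illustrative; in practice they are business-defined per KPI:

```python
def kpi_color(value, green_at, yellow_at):
    """Map a metric to a dashboard color using business-defined thresholds
    (threshold names and values here are illustrative)."""
    if value >= green_at:
        return "green"
    if value >= yellow_at:
        return "yellow"
    return "red"

# e.g. an on-time-delivery rate with 95% green / 90% yellow thresholds
print(kpi_color(0.97, green_at=0.95, yellow_at=0.90))  # → green
print(kpi_color(0.80, green_at=0.95, yellow_at=0.90))  # → red
```

Keeping the thresholds as parameters rather than constants lets business users tune them without report changes.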
  24. Build for the short-term, but strategize and architect for the future
  25. Set reasonable expectations for the initial data mart delivery, then meet and exceed those expectations within the allocated timeframe.


Your Turn Now – Share Your Thoughts

Do you have a few additional points to add to this list of best practices? Let us know!