Selecting an appropriate ETL tool is the most important decision in choosing the components of a data warehousing application.
The ETL tool operates at the heart of the data warehouse, extracting data from multiple data sources, transforming the data to make it accessible to business analysts, and loading multiple target databases. Unlike the other components of a data warehousing architecture, an ETL tool is very difficult to replace once in use.
Due to a lack of standards, data definitions and transformation rules cannot be moved from one tool to another. If the selected ETL tool fails in production, it is the one component of the architecture that is not easily replaceable.
Organizations are faced with two major choices to implement an extraction, transformation, and load process:
- Write a custom program in COBOL, C, or PL/SQL to extract data from multiple source files, transform the data, and load the target databases
- Purchase an off-the-shelf extraction/transformation/load (ETL) tool
Hand-Coded ETL Processes
A common mistake is to write custom programs to perform the extraction, transformation, and load functions. Writing an ETL program by hand may seem to be a viable option because the program does not appear to be too complex and programmers are available. However, there are serious problems with hand-coded ETL programs:
- Unlike OLTP applications, the functions to be supported by individual data marts cannot be predicted in advance. In a typical data mart, over 50% of the required functionality is defined by end users after the data mart goes into production. To keep up with the high volume of changes initiated by end users, hand-written ETL programs have to be constantly modified and in many cases rewritten. The effort required to maintain these programs often becomes a major burden for the project
- Metadata is not generated automatically by hand-generated ETL programs. Metadata is the key to integrating data marts across business units. If metadata is not available, it is difficult to avoid the development of “stovepipe” data marts that satisfy the needs of individual business units, but cannot be integrated across the enterprise
- Hand-coded ETL programs are likely to have a slower speed of execution, compared with directly executable code generated by off-the-shelf ETL tools. Hand-generated programs are typically single-threaded, while modern ETL tools generate multi-threaded, directly executable code that can run on parallel, high-speed engines
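To make the maintenance burden concrete, here is a minimal sketch of what a hand-coded extract looks like (the field names, layout, and region-code rule are all hypothetical, chosen only for illustration). Every change requested by end users means editing, retesting, and redeploying code like this:

```python
# Hypothetical hand-coded ETL program. Business rules are hard-wired
# into the code itself -- the root of the maintenance problem.
REGION_CODES = {"N": "North", "S": "South", "E": "East", "W": "West"}

def extract_transform(source_rows):
    """Single-threaded, row-at-a-time transform, typical of hand coding."""
    for row in source_rows:
        yield {
            "customer_id": row["cust_no"].strip(),
            "region": REGION_CODES.get(row["rgn"], "Unknown"),
            # Derived value computed inline; no metadata is recorded anywhere.
            "revenue": round(float(row["qty"]) * float(row["unit_price"]), 2),
        }

source = [{"cust_no": " C001 ", "rgn": "N", "qty": "3", "unit_price": "9.50"}]
print(list(extract_transform(source)))
# → [{'customer_id': 'C001', 'region': 'North', 'revenue': 28.5}]
```

Because the rules are buried in procedural code, no metadata is produced as a by-product, and every rule change is a programming task rather than a specification change.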
Off-The-Shelf ETL Tools
To avoid these problems, automated, off-the-shelf ETL tools are increasingly being used to extract, cleanse, transform, and load data into target databases. An important function of these tools is to generate and maintain centralized metadata.
ETL tools solve several important architectural problems for data warehouses:
- The ETL tool provides coordinated access to multiple data sources. Functions supported by ETL tools include extraction of data from multiple source environments, data cleansing, reorganization, transformation, aggregation, calculation, automatic loading of data into the target database, and automatic generation of executable code to perform parallel processing of transformations on multiple engines
- ETL tools are used to generate and maintain a central metadata repository. The metadata repository provides a “single version of the truth” that can be used to define enterprisewide source data definitions, data models for target databases, and transformation rules that convert source data into target data. A metadata exchange architecture is used to synchronize central business rules with local business rules, maintained as local metadata by end-user BI tools
- The ETL tool also addresses the dirty data problem – data from source files can be cleansed and inconsistencies in the data resolved as part of the extraction and transformation process, using procedural data cleansing techniques. Name and address correction, deduping, and householding functions require use of an external data cleansing tool
Analysts define source-to-target mappings, data cleansing rules, and data transformation rules using a graphical point-and-click interface provided by the ETL tool. When all mappings and transformations have been specified, the ETL tool automatically generates the data extract/transformation/load programs, which typically run in batch mode.
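The division of labor described above can be sketched in a few lines: the source-to-target mappings and transformation rules are held as data (in effect, metadata), and a generic engine applies them. This illustrates the principle only, not any vendor's actual design; all field names and rules are hypothetical:

```python
# Metadata-driven transformation sketch: mappings are declared as data,
# and one generic engine serves any mapping specification.
MAPPINGS = {
    # target field: (source field, transformation rule)
    "customer_name": ("CUST_NM", str.title),
    "country_code": ("CTRY", str.upper),
    "balance": ("BAL_AMT", float),
}

def run_mappings(source_row, mappings=MAPPINGS):
    """Apply each declared rule to its source field; no per-rule code."""
    return {target: rule(source_row[src])
            for target, (src, rule) in mappings.items()}

row = {"CUST_NM": "acme corp", "CTRY": "us", "BAL_AMT": "1250.75"}
print(run_mappings(row))
# → {'customer_name': 'Acme Corp', 'country_code': 'US', 'balance': 1250.75}
```

Because the transformation logic lives in the mapping specification rather than in procedural code, the same specification can drive code generation, documentation, and the metadata repository.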
Stovepipe Data Marts
Unfortunately, there is a major limitation in the architecture of many ETL tools that has caused the failure of numerous data warehouses. The problem is lack of integration between central metadata maintained in the central ETL repository, and local metadata maintained in individual data marts.
Local metadata, which defines the local meaning of business data for end users, is maintained by end-user BI tools. However, many vendors of end-user tools do not support a metadata exchange architecture with ETL tools that can be used to integrate central metadata with local metadata. Lack of integration with central metadata results in the development of independent data marts, often called “stovepipe” data marts, which store local, inconsistent, incompatible definitions of business entities, semantics, and business rules. Stovepipe data marts support the needs of individual business units, but cannot be integrated at the corporate level because they do not share data definitions and business semantics across the enterprise.
IS departments have known for years that they should not build stovepipe applications. In spite of the well-known problems with stovepipe applications, most data warehouses built today incorporate stovepipe data marts.
In their rush to deploy individual data marts, many organizations do not implement the infrastructure required to support consolidation of data across data marts. The result is a profusion of stovepipe data marts that satisfy the requirements of individual business units, but cannot be used to support corporate requirements for an integrated view of data across data marts. Subsequent attempts to install the infrastructure required to integrate data marts with a central repository often fail. The failure is due to the complexity and risk associated with making major changes to an operational data warehousing facility that is growing rapidly and supports large numbers of business end users.
Stovepipe data marts are extremely difficult to avoid. The key is to buy only components that integrate, out-of-the-box, with a central metadata repository generated and maintained by an ETL tool.
Architected Data Marts
The solution to the stovepipe problem is to build architected data marts that utilize a metadata exchange architecture to ensure that the local metadata maintained by each data mart conforms with definitions maintained in a central metadata repository. To support an architected data mart, an ETL tool is used to generate and maintain a central metadata repository. The central metadata repository provides a “single version of the truth” that can be used to define enterprisewide source data definitions, data models for target databases, and transformation rules that convert source data into target data. The metadata exchange architecture is used to maintain consistency between the central metadata and the local metadata.
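The metadata exchange can be pictured as a simple synchronization step: definitions held centrally override any local definitions that have drifted, while purely local definitions are left alone. The structures below are a hypothetical sketch of the idea, not a real exchange API:

```python
# Sketch of a metadata exchange: the central repository is the
# "single version of the truth" for shared business definitions.
central_metadata = {
    "net_revenue": "gross_sales - returns - discounts",
    "active_customer": "purchase within trailing 12 months",
}

local_metadata = {
    "net_revenue": "gross_sales - returns",      # drifted local rule
    "churn_rate": "local-only derived measure",  # purely local entry
}

def synchronize(central, local):
    """Central definitions replace local ones for every shared term;
    local-only definitions are preserved untouched."""
    merged = dict(local)
    for term, definition in central.items():
        merged[term] = definition
    return merged

print(synchronize(central_metadata, local_metadata)["net_revenue"])
# → gross_sales - returns - discounts
```

Without this step, each BI tool's local copy of a definition drifts independently, which is precisely how stovepipe data marts arise.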
The central metadata repository maintained by the ETL tool is at the heart of the data warehouse. It is an essential component of the architecture used to integrate all components of the data warehouse.
In practice, it is hard to find combinations of “best-of-breed” ETL tools and end-user tools that integrate at the metadata level. Each vendor of ETL tools generates proprietary metadata that is incompatible with metadata generated by other ETL tools. The lack of metadata standards makes it difficult to select best-of-breed tools from multiple vendors that support metadata integration within a common architecture.
Standards organizations, such as the OLAP Council and the Meta Data Coalition, have tried, but failed, to obtain industry-wide agreement on the definition of a metadata standard. Recently, in another attempt to define a metadata standard, the Meta Data Coalition merged with the Object Management Group. It is possible, but not likely, that the metadata standards being developed jointly by the merged organizations will be adopted by the industry.
Selection Criteria for ETL Tools
Here are some of the important functions to look for when selecting an ETL tool:
- General-purpose ETL tool suitable for all data warehousing applications
- Support for data extraction, cleansing, aggregation, reorganization, transformation, calculation, and load operations, including the following functions:
– Access data from multiple operational data sources
– Re-map source data into a common format
– Standardize data to enable load to conformed target databases
– Filter data, convert codes, perform table lookups, calculate derived values
– Automated, slowly changing dimension support (Type I, Type II, Type III)
– Incremental aggregation – computation of aggregates by the ETL tool in one pass of the source data
– Support for Unicode – multi-byte character sets localized for Japanese and other languages
– Support for a graphical job sequencer, re-usable containers, and nesting of sessions
– Validate data to check content and range of field values
– Perform procedural data cleansing functions
– Support complete development environment, including versioning and run-time debugger
– Load cleansed data to the target data mart or central DW
– Produce audit and operational reports for each data load
– Automatic generation of centralized metadata
– Automatic generation of data extract programs
- Native interfaces to legacy files, relational databases, ERP sources (e.g., SAP R/3 and PeopleSoft), e-Business applications, Web log files, IBM MQ-Series, XML sources, etc.
- Support for near real-time clickstream data warehousing
- Support for an Enterprise e-Business environment, including integration at the metadata level with BI tools, ERP applications, CRM applications, analytic applications, corporate portals, etc.
- Platform independence and scalability to enterprise data warehousing applications
- Directly executable, in-memory, multi-threaded processing for fast, parallel operation
- No requirement to generate and compile source code
- No requirement for intermediate disk files
- Support for concurrent processing of multiple source data streams, without writing procedural code
- Specification of ETL functions using pre-packaged transformation objects, accessible via an intuitive graphical user interface
- Extensible transformation objects that operate at a high level of abstraction
- Ability to specify complex transformations using only built-in transformation objects. The goal is to specify transformations without writing any procedural code
- Automatic generation of central metadata, including source data definitions, transformation objects, target data models, and operational statistics
- Metadata exchange architecture that supports automatic synchronization of central metadata with local metadata for multiple end-user BI tools
- Central management of distributed ETL engines and metadata using a central console and a global metadata repository
- End-user access to central metadata repository via a right-mouse click
- Metadata exchange API compliant with COM, UML, and XML
- Support of metadata standards, including OLE DB for OLAP
- Ability to schedule ETL sessions on time or the occurrence of a specified event, including support for command-line scheduling using external scheduling programs
- Ability to schedule FTP sessions on time or event
- Integration with data cleansing tools
- Import of complete data models from external data modeling tools
- Strong data warehouse administration functions
- Support for the analysis of transformations that failed to be accepted by the ETL process
- Extensive reporting of the results of an ETL session, including automatic notification of significant failures of the ETL process
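Several of the criteria above, such as automated slowly changing dimension support, are easier to evaluate with a concrete picture in mind. The following is a minimal sketch of Type II handling (expire the current row, insert a new current row); the column names and date handling are illustrative, not any tool's actual behavior:

```python
from datetime import date

def apply_scd_type2(dimension, incoming, today=None):
    """Type II slowly changing dimension: when a tracked attribute
    changes, expire the current row and append a new current row.

    dimension: list of row dicts; incoming: {customer_id: new attributes}
    """
    today = today or date.today().isoformat()
    for row in list(dimension):           # snapshot: we append while scanning
        key = row["customer_id"]
        if (row["current"] and key in incoming
                and row["segment"] != incoming[key]["segment"]):
            row["current"] = False        # expire the old version...
            row["end_date"] = today
            dimension.append({            # ...and insert the new version
                "customer_id": key,
                "segment": incoming[key]["segment"],
                "start_date": today,
                "end_date": None,
                "current": True,
            })
    return dimension

dim = [{"customer_id": "C1", "segment": "retail",
        "start_date": "2023-01-01", "end_date": None, "current": True}]
apply_scd_type2(dim, {"C1": {"segment": "wholesale"}}, today="2024-06-30")
print([(r["segment"], r["current"]) for r in dim])
# → [('retail', False), ('wholesale', True)]
```

A tool that automates this pattern (and the simpler Type I overwrite and Type III previous-value column) spares the team from hand-maintaining exactly this kind of bookkeeping code.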
Share Your Experience in the ETL Tool Selection Process
Have you participated in an ETL tool selection process for a data warehousing initiative? How did you go about it? What did you learn through that process? Please share your experience with us.