Data Integration is the combination of technical and business processes used to turn data from disparate sources into meaningful and valuable information.
The process of Data Integration is about taking data from many disparate sources (such as files, various databases, mainframes, etc.) and combining that data to provide a unified view for business intelligence.
Data integration is needed when a business decides to implement a new application and migrate its data from legacy systems into the new application. It becomes even more critical in company mergers, where the two merging companies need to consolidate their applications.
One of the most commonly known uses of data integration is building an enterprise data warehouse, which gives a business a unified view of its data for analysis and business intelligence (BI) needs.
Data Integration Techniques
There are many sophisticated ways to create a unified view of data today. ETL is no longer the only way to achieve the goal, and that brings a new level of complexity to the field of Data Integration.
Data Integration can be performed at several organizational levels; let's discuss them briefly.
1. Manual Data Integration
Technically speaking, this is not really Data Integration at all. In this approach, a web-based user interface or an application is created to show users all the relevant information by accessing each source system directly. There is no actual unification of the data.
2. Middleware Data Integration
A middleware data integration solution is essentially a layer between two disparate systems that allows them to communicate. Middleware can act as the glue that holds multiple legacy applications together, making seamless connectivity possible without requiring the applications to communicate directly.
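To make the idea concrete, here is a minimal sketch of a middleware layer. All names and formats (the pipe-delimited legacy record, the CRM dictionary fields) are illustrative assumptions, not a real product API; the point is that the two systems never talk to each other directly, and the middleware translates formats in transit.

```python
# Hypothetical sketch: a thin middleware layer lets a legacy ordering system
# and a modern CRM exchange customer records without talking directly.
# All system names, field names, and formats are illustrative assumptions.

def from_legacy(record):
    """Translate a legacy pipe-delimited record into the CRM's dict format."""
    cust_id, name, phone = record.split("|")
    return {"customer_id": int(cust_id), "full_name": name, "phone": phone}

def to_legacy(record):
    """Translate a CRM dict back into the legacy pipe-delimited format."""
    return f'{record["customer_id"]}|{record["full_name"]}|{record["phone"]}'

class Middleware:
    """Routes messages between the two systems, translating in transit."""
    def __init__(self, crm_inbox):
        self.crm_inbox = crm_inbox  # queue that the CRM reads from

    def forward_from_legacy(self, legacy_record):
        # The legacy app only knows its own format; the middleware adapts it.
        self.crm_inbox.append(from_legacy(legacy_record))

inbox = []
mw = Middleware(inbox)
mw.forward_from_legacy("42|Ada Lovelace|555-0100")
```

Because each translation rule lives in one place (the middleware), neither application needs to change when the other's format evolves.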
3. Data Virtualization Integration Approach
Data Virtualization leaves the data in the source systems while creating a new set of unified views on top of them. This gives users access to a unified view of the disparate source systems' data across the whole enterprise.
Many organizations prefer this approach today because of its benefits and the technologies that now exist to support it. The main benefit of the virtual integration approach is a near real-time view of data from the source systems. It also eliminates the need for a separate data store to hold the consolidated, unified data.
However, that doesn't mean it's the best way to do Data Integration, even though it certainly has short-term benefits. The drawbacks of this approach include limited access to historical data and data versioning, and the extra load placed on the source systems involved, which may adversely affect their performance.
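The trade-off described above can be sketched in a few lines. In this hypothetical example (source names and fields are mine, not from any real product), the unified "view" stores nothing itself: every read fans out to the sources and joins on the fly, which is why results are near real-time but every query adds load to the sources.

```python
# Hypothetical data virtualization sketch: the unified view holds no data of
# its own; every read queries the source systems and merges the results.
# Source system names and fields are illustrative assumptions.

hr_system = {101: {"name": "Ada", "dept": "Engineering"}}    # source A
payroll_system = {101: {"salary": 120000}}                   # source B

def unified_employee_view(emp_id):
    """Build the unified record at query time; nothing is copied or stored."""
    person = hr_system.get(emp_id, {})
    pay = payroll_system.get(emp_id, {})
    return {**person, **pay}  # reflects the sources as of this exact call

# A change in a source system is visible immediately on the next read -
# the flip side is that there is no stored history of the old value.
payroll_system[101]["salary"] = 125000
print(unified_employee_view(101)["salary"])  # 125000
```

Note that once the salary is updated, the previous value of 120000 is gone: nothing in the virtual layer remembered it, which is exactly the history/versioning drawback mentioned above.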
4. Data Warehouse Approach Of Data Integration
This is the most commonly known approach to Data Integration, and one you may already be familiar with if you have read Ralph Kimball and/or Bill Inmon.
This approach requires building a new Data Warehouse (or Data Marts) that stores a unified version of the data extracted from all the source systems involved and manages it independently of those systems.
The benefits of this approach include the ability to easily manage the history of data (data versioning) and the ability to combine data from very disparate sources (mainframes, databases, flat files, etc.) and store it in a central repository.
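A minimal extract-transform-load sketch shows the contrast with virtualization: data from each source is copied into one uniform structure in a central store, stamped with a load date so history can be kept. The sources and field names here are assumptions for illustration only.

```python
# Minimal ETL sketch under assumed inputs: extract from two disparate sources
# (a flat file and a "database" of dicts), transform them to a common shape,
# and load into a central repository with a load date for history tracking.

from datetime import date

csv_rows = ["2001,Widget,9.99", "2002,Gadget,19.99"]         # flat-file source
db_rows = [{"id": 2003, "product": "Gizmo", "price": 4.50}]  # database source

warehouse = []  # the central repository, independent of the sources

def etl(load_date):
    for row in csv_rows:                      # extract + transform: file source
        pid, name, price = row.split(",")
        warehouse.append({"id": int(pid), "product": name,
                          "price": float(price), "loaded_on": load_date})
    for rec in db_rows:                       # extract + transform: DB source
        warehouse.append({**rec, "loaded_on": load_date})

etl(date(2024, 1, 1))
print(len(warehouse))  # 3
```

Because each load is stamped, re-running `etl` with a later date would preserve earlier versions of the rows, which is the history/versioning benefit the warehouse approach offers over virtualization.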
Data Integration Challenges
Data Integration is the process of identifying ways to bring data in from disparate sources and combining it into a unified view. That is a challenging task in itself.
The technical implementation is the first level of challenge. It is never easy to understand data residing in disparate sources and then design a common structure that fits it all in one place.
Diving a bit deeper, let's categorize the challenges that any Data Integration project should take into account.
1. To Have A Good Understanding Of Data
It is very important to have a person (or a team of people) who understands the data assets of the enterprise as well as the source systems. I like to call them Data Champions.
The data champions should be able to lead the discussions about the long-term data integration goals, keeping the effort consistent and successful.
2. Understanding Of Objectives And Deliverables
The next big thing is to have a good understanding of the business requirements. What is the business purpose behind the data integration initiative? Understanding the objectives and deliverables of the project is critical to the next steps.
What are the source systems? Do the source systems have data to support the business requirements? What are the gaps between data and the requirements? These questions should be answered adequately.
3. Analysis Of The Source Systems And Extraction
Having a good understanding of the options for extracting data from the source systems is critical to overall success. Factors like the frequency of extracts, the extent of extraction (full or incremental), and the quality of the data in the source systems affect the timeline and overall direction of the project.
In addition, it's important to know the volume of data being extracted in order to plan the integration approach, along with the source systems' backup schedules, specific maintenance windows, and anything else that may impact the data integration process.
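The full-versus-incremental distinction mentioned above is often implemented with a high-water mark: the first run pulls everything, and later runs pull only rows changed since the last recorded timestamp. This is a simplified sketch with made-up table and column names.

```python
# Hypothetical sketch of full vs. incremental extraction using a high-water
# mark (the latest change timestamp seen so far). Table and column names
# are illustrative assumptions.

source_table = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

def extract(last_watermark=None):
    """Full extract when no watermark exists; otherwise only changed rows."""
    if last_watermark is None:
        rows = list(source_table)                                  # full
    else:
        rows = [r for r in source_table
                if r["updated_at"] > last_watermark]               # incremental
    # Advance the watermark for the next run (keep the old one if no new rows).
    new_watermark = max((r["updated_at"] for r in rows), default=last_watermark)
    return rows, new_watermark

full, wm = extract()                  # first run: all 3 rows
delta, wm = extract("2024-01-05")     # later run: only rows after Jan 5
print(len(full), len(delta))  # 3 1
```

Incremental extracts keep the load on the source systems small, which matters precisely because of the backup schedules and maintenance windows noted above.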
The tool selection is also very important to the overall success of any data integration project. A feasibility study should be performed to select the right toolset for the job.
Small enterprises, or companies that are just starting their data warehousing initiative, face this challenge, and making that decision isn't always easy given the number of options available today. The feasibility study helps map out which tools are best suited for the organization's overall data integration objectives.
Sometimes an organization has built its data warehouse and many ETL processes using a tool that can no longer scale or is no longer relevant. For example, organizations that invested years of work in OWB faced this challenge when Oracle decided to retire OWB in favor of a much better tool, ODI. In such cases, even mature organizations need to do a feasibility analysis to estimate what it takes to migrate the existing Data Integration infrastructure to the new toolset.
Your Turn, Share Your Experience
What is your experience with data integration? What are some of the different types of data integration projects you have participated in? Would you like to mention some of the challenges you faced? I look forward to your comments!