One of the bottlenecks often encountered in ETL development for a star schema is the sequencing of loads: dimension tables must be loaded before any fact table can be loaded, because fact tables need the surrogate keys generated by the dimension loads in any dimensional data model.
I always wondered whether there could be a better way, where all dimension tables and fact tables are loaded in parallel. Fortunately, there is a way to achieve this, and in this post I am going to share how to do it.
The Problem With Surrogate Keys
I don’t need to spend time introducing surrogate keys in this post, as that fundamental knowledge is a prerequisite for understanding the context here.
Usually, surrogate keys for dimension tables are generated using sequence number objects. All relational databases support sequence generators, and all leading data integration tools have built-in sequence generators as well. The practice works out very well, and data warehouses are full of dimension keys (surrogate keys) built from sequence numbers.
It also helps that the business doesn’t need to know about surrogate keys; therefore the only objective, keeping the surrogate keys unique, is easily met.
However, there are fundamental challenges with this approach and just to list, here are a few:
1. Sequence numbers are generated sequentially, which means the data loads can’t happen in parallel even if the infrastructure supports an MPP architecture.
2. A sequence number (or MAX(key)+1) based approach requires a dimension table to be fully loaded before loading its dependent fact tables. Imagine a couple of large dimension tables (several million rows each) that must be loaded before 20 smaller dependent fact tables; they all have to wait until the big dimensions finish loading.
3. Some people generate surrogate keys in a separate key table and cross-reference it while loading the dimension and fact tables. While this approach can provide a degree of parallelism, it is resource intensive and adds complexity and processing time, especially with slowly changing dimensions. Imagine the hassle of regenerating surrogate keys after an error…
An Elegant Way To Better Manage Surrogate Keys
If we could generate surrogate keys using a function rather than a sequence number, we could leverage an MPP architecture and load data much faster through parallelism. It would also open the door to loading and reloading dimensions and facts consistently in parallel without violating data integrity.
Fortunately, such a feat can be achieved by using a hashing technique to generate surrogate keys instead of sequence numbers. This massively improves load performance, especially when an MPP architecture is available to leverage. For this discussion, and based on my experience, I recommend the MD5 hashing algorithm, which is widely supported and efficient in data warehousing environments. MD5 is already one of the most widely used hashing techniques in the data integration community, but few people use it as a method for generating surrogate keys, and that is the topic of this post.
A Little About MD5 Hashing
According to Wikipedia,
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications, and is also commonly used to verify data integrity.
The idea behind using MD5 hashing is that if you feed it the same data, you get the same hash output every time. This means we can generate a unique MD5 hash for each row of data and use that value as the surrogate key in our dimension tables. It also means we can recompute the hash for any given row and arrive at the same value any number of times, unlike sequence numbers, which change every time we reload a dimension table due to the nature of the sequence generation process itself.
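The determinism described above is easy to verify. Here is a minimal sketch in Python using the standard hashlib module; the function name md5_key and the sample values are hypothetical:

```python
import hashlib

def md5_key(value: str) -> str:
    # The same input always yields the same 32-character hex digest,
    # on every run and on every platform.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

k1 = md5_key("1001~Widget~102")
k2 = md5_key("1001~Widget~102")
print(len(k1))   # 32
print(k1 == k2)  # True
```

Because the key is a pure function of the row's contents, reloading a table regenerates exactly the same keys, which a sequence generator cannot guarantee.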
MD5 is a hash function, so value collisions are possible, although extremely rare: with a 128-bit digest, you would need to generate on the order of 2^64 keys before a collision becomes likely (the birthday bound). You can read all about it and work out your own chances of finding a collision in your environment. In practice, I find it almost impossible to hit one unless an expert is deliberately trying to manufacture a collision, which doesn’t happen by accident in real-life data loads.
Having said that, if you are really worried about collisions and willing to trade ETL speed to make them even less likely, you can go for the MD6, SHA-1 or SHA-2 algorithms, which produce longer digests (160 bits for SHA-1, up to 512 bits for SHA-2 and MD6). But it’s not recommended, since surrogate keys are not worth that much pain unless you have hundreds of trillions of rows being inserted every day into your dimensions.
MD5 Hashing In Change Data Capture
The good thing about MD5 hashing is that ETL developers have already been using this technique for Change Data Capture (CDC) on large dimension tables, where more than 10-15 columns need to be tracked for changes to maintain SCD Type 2 and SCD Type 3 dimensions. Instead of comparing 15 fields between source and target, developers generate an MD5 hash for the combination of those columns (generally called a checksum) and make one comparison instead of many, which significantly improves job performance.
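The CDC pattern above can be sketched in a few lines of Python. The column names and sample rows here are hypothetical; the point is that a single checksum comparison replaces a field-by-field comparison:

```python
import hashlib

DELIM = "~"

def row_checksum(row: dict, columns: list) -> str:
    # Concatenate the tracked columns with a delimiter, then hash once.
    concatenated = DELIM.join(str(row[c]) for c in columns)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

tracked = ["name", "warehouse_id", "price", "weight"]
source = {"name": "Widget", "warehouse_id": 102, "price": 150, "weight": 123}
target = {"name": "Widget", "warehouse_id": 102, "price": 150, "weight": 123}

# One checksum comparison instead of comparing every tracked column.
changed = row_checksum(source, tracked) != row_checksum(target, tracked)
print(changed)  # False: source and target rows match
```

If any tracked column differs between source and target, the checksums diverge and the row is flagged as changed.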
Tools like Informatica provide an MD5() function for this purpose. Oracle has packages like DBMS_OBFUSCATION_TOOLKIT and DBMS_CRYPTO that include MD5 hashing functions.
Similarly, SQL Server has the HASHBYTES function, which can return the MD2, MD4, MD5, SHA, SHA1, or SHA2 hash of its input. The functionality is so common that it is widely available on all major platforms, in all major tools and databases.
You can generate an MD5 hash for a given text in Oracle and compare it to the MD5 hash generated by any other tool (Informatica, SQL Server, PHP or Pentaho), and the values will still match.
The point is, MD5 is already being used in data warehousing environments. All we have to do is extend its use: replace sequence-based surrogate keys with MD5 values uniquely generated for each row in the dimension table. That opens the door to parallel data loads, improving performance and reducing dependencies among ETL processes.
Using MD5 Hash Value As Surrogate Keys
To use MD5 hash values as surrogate keys in dimension tables, make sure your surrogate key field (say, Product_Dim_Key) is 32 characters long and NOT NULL (which will be true for your primary keys anyway).
Following are the best practices for generating MD5 values accurately and efficiently as surrogate keys in your Data Mart:
1. Concatenate all the columns of the table, ensuring that the fields are delimited by a special character such as a ‘~’, a ‘|’ or even a ‘:’. You may argue that a delimiter is unnecessary, but here is why it matters:
Say your product dimension record has fields such as Product ID, Name, WarehouseID, Price and Weight, with the (illustrative) values 1001, ABC, 102, 150 and 123, and you construct the input for MD5 without a field delimiter:
1001ABC102150123
The same record with a delimiter will look like this:
1001~ABC~102~150~123
Now, say the warehouse ID changed in the source from 102 to 1021, the price changed from 150 to 501, and the weight changed from 123 to just 23. The new concatenated record will look like this without a field delimiter:
1001ABC102150123
The same record with a field delimiter will look like this:
1001~ABC~1021~501~23
If you compare the original record and the changed record, without the field delimiter both strings are identical, so their MD5 hash values are the same and the change goes undetected. That is why using a field delimiter is so important.
Now that you know the importance of the field delimiter, also make sure you include all the columns of the target dimension table, including the LOAD_DATE or EFFECTIVE_DATE columns, to ensure that the generated surrogate key is always unique for the row.
2. Pass the concatenated string to the MD5 function in your ETL engine (or to the database function) to get a 32-character hexadecimal value to use as the surrogate key in your dimension table.
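The two steps above, and the delimiter pitfall in particular, can be demonstrated concretely. This Python sketch uses the warehouse/price/weight values from the example; the Product ID (1001) and Name (ABC) are illustrative:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Original row:  WarehouseID=102,  Price=150, Weight=123
# Changed row:   WarehouseID=1021, Price=501, Weight=23
no_delim_old = "1001" + "ABC" + "102" + "150" + "123"
no_delim_new = "1001" + "ABC" + "1021" + "501" + "23"
# Without a delimiter the two different rows concatenate to the SAME string,
# so their MD5 hashes are identical and the change is invisible.
print(no_delim_old == no_delim_new)  # True

delim_old = "~".join(["1001", "ABC", "102", "150", "123"])
delim_new = "~".join(["1001", "ABC", "1021", "501", "23"])
# With a delimiter the strings differ, so the hashes differ: change detected.
print(md5_hex(delim_old) == md5_hex(delim_new))  # False
```

The delimiter preserves field boundaries, so values cannot "slide" between columns and masquerade as an unchanged row.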
Star Schema Parallel Loading For Performance Gain
By now we know that the MD5 hash value for a fixed string can be generated and regenerated again and again, and it will be the same every time. That means we can generate hash keys for incoming source data in the staging area itself and have them ready for loading into the data warehouse, instead of generating these keys during the target load at runtime.
The most important gain of this technique, and the reason for this post, is that dimension and fact loads can happen in parallel, achieving massive parallelism.
Just to reiterate, a star schema usually has a set of dimension tables linked to a central fact table through a primary key / foreign key relationship between each dimension’s surrogate key and the fact table’s foreign keys, as shown in the image below:
To be able to load the entire star schema in parallel by leveraging the technique of MD5 hashing, we need to do the following:
1. Instead of using a number/integer field as the surrogate key for your dimension and fact tables, model your star schema tables to use a CHARACTER(32) surrogate key field. This is the starting point.
2. Disable foreign key constraints before the loads so that fact tables can be loaded while dimension tables are being loaded in parallel.
3. In your source-to-stage ETL mappings, calculate the MD5 hash keys for each dimension table row, rather than calculating them while loading the final warehouse tables. Calculating the hash keys in the staging area itself is essential; without it, parallel loading of the fact tables is not possible.
4. When loading dimension and fact tables from stage to the data warehouse, the surrogate keys are already calculated, so there is no need to calculate them again. Design your ETL mappings to pick up these values from the staging area (join the respective staging tables to pick up the surrogate keys for the fact tables there) and load them all without concerning yourself with data integrity issues in the warehouse. This logic works best if your staging tables are loaded using a “truncate and load” strategy.
5. Execute all “stage-to-target” ETL mappings in parallel to load all dimension and fact tables simultaneously, which will dramatically improve performance if the infrastructure is MPP.
6. Re-enable the foreign key constraints after all the data is loaded. If the load was successful, the constraints will re-enable without any problem, since the MD5-based surrogate keys will always be in sync.
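The steps above can be sketched end to end. This is a simplified illustration in Python, not a production ETL job: the table names, rows and helper functions are hypothetical, the "loads" are in-memory, and the key is hashed from the natural key alone for brevity. The point is that the dimension and fact loads derive identical keys independently, so they can run in parallel:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

DELIM = "~"

def surrogate_key(*natural_key_parts) -> str:
    # Deterministic CHAR(32) key derived from the natural key.
    joined = DELIM.join(str(p) for p in natural_key_parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical staged source rows (step 3: keys computed from staging data).
stage_products = [{"product_id": 1001, "name": "Widget"}]
stage_sales = [{"product_id": 1001, "qty": 5}]

def load_product_dim():
    # Dimension load: key comes from the staged natural key, not a sequence.
    return [{"product_dim_key": surrogate_key(r["product_id"]), **r}
            for r in stage_products]

def load_sales_fact():
    # Fact load derives the same key independently: no wait on the dimension.
    return [{"product_dim_key": surrogate_key(r["product_id"]), **r}
            for r in stage_sales]

# Steps 5-6: run both loads in parallel; keys still line up afterwards.
with ThreadPoolExecutor() as pool:
    dim_future = pool.submit(load_product_dim)
    fact_future = pool.submit(load_sales_fact)
    dim, fact = dim_future.result(), fact_future.result()

print(dim[0]["product_dim_key"] == fact[0]["product_dim_key"])  # True
```

In a real warehouse the parallelism would come from running the stage-to-target mappings concurrently with foreign key constraints disabled; the thread pool here merely stands in for that scheduling.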
Share Your Thoughts, Your Experiences
Have you used MD5 hashing in your data integration projects in any of the ways I shared here, or in any other way? Please share your experience if you have, and help us gain more insight through it.
If you have any questions about the approach I shared for parallel loading of star schema or face any challenges during your implementation of this technique, please feel free to ask your questions through the comments section below.
We look forward to hearing from you…