How To Load Star Schema Dimensions And Facts In Parallel

One of the bottlenecks often encountered in ETL development for a star schema is having to load the dimension tables before any fact table can be loaded. That happens, of course, because fact tables need the surrogate keys from the dimension tables in any dimensional data model.

I always wondered whether there could be a better way, one where all dimension tables and fact tables are loaded in parallel. Fortunately, there is a way to achieve this, and in this post I am going to share how to do it.

The Problem With Surrogate Keys

I don’t need to spend time introducing surrogate keys in this post, as that fundamental knowledge is a prerequisite for understanding the context of this post.

Usually, surrogate keys for dimension tables are generated using sequence number objects. All relational databases support sequence generators, and all leading data integration tools have built-in sequence generators as well. The practice works out very well, and data warehouses are full of dimension keys (surrogate keys) built using sequence numbers.

It also helps that the business doesn’t need to know about surrogate keys; the only objective, keeping the surrogate keys unique, is easily met.

However, there are fundamental challenges with this approach; here are a few:

1. Sequence numbers are generated in a sequential manner, which means the data loads can’t happen in parallel even if the infrastructure supports an MPP architecture.

2. A sequence number (or MAX(key)+1) based approach requires a dimension table to be loaded before its dependent fact tables. Imagine a couple of large dimension tables (with several million rows each) that must be loaded before 20 smaller dependent fact tables; they all have to wait until the big dimensions finish loading.

3. Some people generate surrogate keys, store them in a separate table, and cross-reference it while loading the dimension and fact tables. While this approach can provide a level of parallelism, it is resource intensive and adds to the complexity and processing time, especially with slowly changing dimensions. Imagine the hassle of re-generating surrogate keys in case of an error…

An Elegant Way To Better Manage Surrogate Keys

If there were a way to generate surrogate keys with a function rather than a sequence number, we could leverage an MPP architecture and load data much faster through parallelism. Not only that, it would also open the door to loading and reloading dimensions and facts consistently in parallel without violating data integrity.

Fortunately, such a feat can be achieved by using a hashing technique to generate surrogate keys instead of sequence numbers. This massively improves load performance, especially when an MPP architecture is available to leverage. For the purposes of this discussion, and based on my experience, I recommend the MD5 hashing algorithm, which is widely supported and efficient enough for data warehousing environments. MD5 is already one of the most widely used hashing techniques in the data integration community, but few people use it as a method to generate surrogate keys, and that is the topic of this post.

A Little About MD5 Hashing

According to Wikipedia,

The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications, and is also commonly used to verify data integrity.

The idea behind using MD5 hashing is that if you feed it the same data, you get the same hash output every time. That means we can generate a unique MD5 hash for each row of data and use that value as the surrogate key in our dimension tables. It also means we can re-compute the hash for any given row and arrive at the same value any number of times, unlike sequence numbers, which change every time we reload a dimension table because of the nature of the sequence generation process itself.
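As a minimal Python sketch of that property (the product row below is made up purely for illustration):

import hashlib

# Hashing the same input always yields the same 32-digit hexadecimal value,
# so a key computed today can be recomputed identically during a reload.
row = "1001|Shoe|102|150|123"          # an already concatenated, made-up product row
key_first_run  = hashlib.md5(row.encode("utf-8")).hexdigest()
key_second_run = hashlib.md5(row.encode("utf-8")).hexdigest()
assert key_first_run == key_second_run
print(key_first_run)                   # always the same 32-character string for this input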

MD5 is a hash function, so a value collision is possible, although extremely unlikely: with a 128-bit hash, you would need on the order of 2^64 rows before a collision becomes likely (the birthday bound). You can read up on the math if you like and work out your own chances of hitting a collision in your environment. I find it practically impossible to get one by accident, unless an expert deliberately manufactures a collision, which isn’t a concern in real-life data warehouse loads.

Having said that, if you are really concerned about collisions and willing to trade ETL speed to make them even less likely, you can go for MD6, SHA-1, or SHA-2, which produce digests longer than MD5’s 128 bits. But it’s not recommended, since surrogate keys are not worth that much pain unless you have hundreds of trillions of rows being inserted every day into your dimensions.
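If you do go that route, the switch is usually just a different digest function, at the cost of a wider key column; a purely illustrative Python sketch (column widths noted in the comments):

import hashlib

row = "1001|Shoe|102|150|123"
md5_key    = hashlib.md5(row.encode("utf-8")).hexdigest()      # 32 hex characters
sha256_key = hashlib.sha256(row.encode("utf-8")).hexdigest()   # 64 hex characters: wider key column, lower collision odds
print(len(md5_key), len(sha256_key))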

MD5 Hashing In Change Data Capture

The good thing about MD5 hashing is that ETL developers already use this technique for Change Data Capture (CDC) on large dimension tables where 10-15 or more columns need to be tracked for changes to maintain SCD Type 2 and SCD Type 3 dimensions. Instead of comparing 15 fields between source and target, developers generate an MD5 hash over the combination of those columns (generally called a checksum) and make a single comparison instead of many, which improves job performance significantly.
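As an illustration of that CDC pattern (the tracked columns and sample rows below are made-up assumptions, not from any particular project), the comparison collapses to one checksum per row:

import hashlib

TRACKED_COLUMNS = ["name", "warehouse_id", "price", "weight"]   # in practice often 10-15 attributes

def row_checksum(row):
    # One MD5 checksum over the delimited, tracked columns replaces many column-by-column comparisons.
    payload = "|".join(str(row[column]) for column in TRACKED_COLUMNS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

source_row = {"name": "Shoe", "warehouse_id": 1021, "price": 501, "weight": 23}
target_row = {"name": "Shoe", "warehouse_id": 102,  "price": 150, "weight": 123}

if row_checksum(source_row) != row_checksum(target_row):
    print("Change detected - apply the SCD Type 2/3 handling for this natural key")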

Tools like Informatica provide an MD5() function for this purpose. Oracle has packages like DBMS_OBFUSCATION_TOOLKIT and DBMS_CRYPTO that provide MD5 hashing functions.

Similarly, SQL Server has the HASHBYTES function, which returns the MD2, MD4, MD5, SHA, SHA1, or SHA2 hash of its input. The functionality is so common that it is available in all major tools and databases on all major platforms.

You can generate an MD5 hash value for a given text in Oracle and compare it to the MD5 hash value generated by any other tool (Informatica, SQL Server, PHP, or Pentaho), and the values will still match.
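For instance, a quick check in Python yields the same 32-character value you should get from Informatica’s MD5() or from SQL Server’s HASHBYTES('MD5', ...) converted to hex, as long as both sides hash exactly the same bytes (hex case differences aside):

import hashlib

text = "1001|Shoe|102|150|123"
print(hashlib.md5(text.encode("utf-8")).hexdigest())
# Any other tool hashing exactly these bytes should return the same 32-character value,
# which is what makes the keys portable across databases and ETL engines.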

The point is, MD5 is already being used in data warehousing environments. All we have to do is extend its use: replace sequence-based surrogate keys with MD5 values uniquely generated for each row in the dimension table. That opens the door to parallel data loads, improving performance and reducing dependencies among ETL processes.

Using MD5 Hash Value As Surrogate Keys

To use MD5 hash values as surrogate keys in dimension tables, make sure your surrogate key field (say, Product_Dim_Key) is 32 characters long and is a NOT NULL field (which will be true for your primary keys anyway).

Following are the best practices for generating MD5 values accurately and efficiently as surrogate keys in your Data Mart:

1. Concatenate all the columns of the table, ensuring that fields are delimited with a special character such as ‘~’, ‘|’, or even ‘:’. You may argue that a delimiter is unnecessary, but here is why it matters:

Say your product dimension record has fields such as Product ID, Name, Warehouse ID, Price, and Weight, and you construct the input for MD5 without a field delimiter, as in the following example:

1001Shoe102150123

The above record with a delimiter will look like this:

1001|Shoe|102|150|123

Now, say the warehouse ID changes in the source from 102 to 1021, the price changes from 150 to 501, and the weight changes from 123 to just 23. Without a field delimiter, the new concatenated record looks like this:

1001Shoe102150123

The same record with a field delimiter could look like this:

1001|Shoe|1021|501|23

If you compare the original record and the changed record without the field delimiter, the concatenated strings are identical, so their MD5 hash values are the same and the change is never detected. That is why using a field delimiter is so important.

Now that you know the importance of the field delimiter, also make sure you include all the columns of the target dimension table, including the LOAD_DATE or EFFECTIVE_DATE columns, to ensure that the surrogate key generated is always unique for the row.

2. Pass the concatenated string to the MD5 function in your ETL engine (or to the database function) to get a 32-character hexadecimal value to be used as the surrogate key in your dimension table (a short sketch of both steps follows).
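Here is a minimal Python sketch of both steps, reusing the product example from above (the ‘|’ delimiter and the column values are illustrative):

import hashlib

DELIMITER = "|"   # any character unlikely to appear in the data: '~', '|' or ':'

def dimension_key(columns):
    # Concatenate the delimited column values and return the 32-character MD5 hex digest.
    concatenated = DELIMITER.join("" if value is None else str(value) for value in columns)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

original = [1001, "Shoe", 102, 150, 123]
changed  = [1001, "Shoe", 1021, 501, 23]   # warehouse ID, price and weight all changed

# Without a delimiter both rows collapse to the same string, so the change would be missed.
print("".join(str(v) for v in original) == "".join(str(v) for v in changed))   # True
# With a delimiter the strings differ, so the surrogate keys differ as expected.
print(dimension_key(original) == dimension_key(changed))                       # False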

Star Schema Parallel Loading For Performance Gain

By now we know that an MD5 hash value can be generated and regenerated again and again for a fixed string, and each time it will produce the same value for the same input. That means we can generate the hash keys for incoming source data in the staging area itself and keep them ready for loading into the data warehouse, instead of generating these keys at runtime during the target load.

The most important gain of this technique is also the reason for this post: it lets dimensions and facts be loaded in parallel, and it is how we can achieve massive parallelism.

Just to reiterate, a star schema usually has a set of dimension tables linked to a central fact table through primary key – foreign key relationships between each dimension’s surrogate key and the corresponding foreign key columns in the fact table, as shown in the image below:

[Figure: example star schema showing dimension surrogate keys (primary keys) related to the fact table’s foreign keys]

To be able to load the entire star schema in parallel by leveraging the technique of MD5 hashing, we need to do the following:

    1. Instead of using a number/integer type field as the surrogate key for your dimension and fact tables, model your star schema tables to use a CHARACTER(32) field as the surrogate key. This is the starting point.

    2. Disable foreign key constraints before the loads so that fact tables can be loaded while the dimension tables are still being loaded in parallel.

    3. In your source-to-stage ETL mappings, calculate the MD5 hash keys for each dimension table row rather than calculating them while loading the final warehouse tables. This is what allows the fact tables to be loaded in parallel later on; without calculating the MD5 hash keys in the staging area itself, parallel loading of the fact tables isn’t possible.

    4. While loading the dimension and fact tables from stage to the data warehouse, the surrogate keys are already calculated, so there is no need to calculate them again. Design your ETL mappings to pick up these values from the staging area itself (join the respective staging tables to pick up the surrogate keys for the fact tables right there) and load them all without concerning yourself with data integrity issues in the warehouse. This logic works great if your staging tables are loaded using a “truncate and load” strategy.

    5. Execute all the “stage-to-target” ETL mappings in parallel to load all dimension and fact tables at once, which will dramatically improve performance if the infrastructure is MPP (a sketch follows this list).

    6. Re-enable the foreign key constraints after all the data is loaded. If the load was successful, the constraints will re-enable without any violations, since the surrogate keys will always be in sync thanks to the MD5 hash keys.
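As a rough sketch of step 5, assuming the hash keys already sit in the staging tables: the load routine, table names, and thread pool below are illustrative placeholders rather than a prescription for any particular ETL tool, but they show why the loads no longer need to wait on each other.

from concurrent.futures import ThreadPoolExecutor

def load_table(table_name):
    # Placeholder for one "stage-to-target" mapping; in a real project this would run
    # an ETL session or an INSERT ... SELECT that reads the pre-computed hash keys from staging.
    print(f"loading {table_name}")
    return table_name

# Because key generation no longer creates a dimension-before-fact dependency,
# every stage-to-target load can start at the same time.
tables = ["DIM_PRODUCT", "DIM_CUSTOMER", "DIM_DATE", "FACT_SALES", "FACT_RETURNS"]

with ThreadPoolExecutor(max_workers=len(tables)) as pool:
    loaded = list(pool.map(load_table, tables))

print("loaded:", loaded)   # re-enable the foreign key constraints after this point (step 6)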

Share Your Thoughts, Your Experiences

Have you used MD5 hashing in your data integration projects in any of the ways I shared here, or in any other way? Please share your experience if you have and help us gain more insight through it.

If you have any questions about the approach I shared for parallel loading of star schema or face any challenges during your implementation of this technique, please feel free to ask your questions through the comments section below.

We look forward to hearing from you…

December 19th, 2015 | Data Integration Concepts

About the Author:

Kumar Gauraw is a senior IT professional with over 16 years of experience serving in various capacities at Fortune 500 companies in the fields of Data Integration, Data Modeling, Data Migration, and architecting end-to-end data warehousing solutions. Kumar has strong experience with a wide spectrum of tools and technologies related to BI and Data Integration.

14 Comments

  1. Daniel February 23, 2016 at 3:20 pm

    Hey Kumar,

    This is a really great post. I like the idea of using MD5 hash. We are currently implementing this technique in our data warehouse for change data capture.

    My question is how does this impact performance and how would you manage indexes on these tables. In the fact table, do you have non-clustered indexes on your foreign key columns? Do you create a clustered index on a combination of all the surrogate keys or some other column(s)?

    Also, for the dimension tables, in this setup would you rather create the clustered index on the PK instead of the business key?

    Well, let me know your thoughts!

    Thanks,
    Daniel

    • Kumar Gauraw February 24, 2016 at 1:44 am

      Daniel,

      I don’t recommend using foreign keys on fact tables because they’re unnecessary if a controlled ETL process is loading data into them. You can create indexes on columns that are to be used in join conditions in SELECT statements. The same applies to dimension tables as well. You will create indexes (I prefer bitmap indexes over B-TREE indexes in general) even on the surrogate keys. However, you will need to create indexes on other columns of your dimension table based on query usage.

      Clustered or not, again, depends on your architecture and volume of data in your tables.

      Does this help?

  2. Anand Kolangara May 2, 2016 at 3:50 pm

    The approach is good; however, there is an assumption here that the staging table has a one-to-one mapping to the dimension table. If the staging table loads more than one dimension table at a time, then this won’t be an option that we could consider.

  3. Gautam Chakravorty May 27, 2016 at 1:06 pm

    What if the existing surrogate key number in the table now clashes with an MD5 hex value after the transition?

    • Kumar Gauraw May 30, 2016 at 12:54 am

      Gautam,

      Can you elaborate (maybe with an example) on what you are trying to imply?

      Thanks,
      Kumar

  4. jithin July 11, 2016 at 6:05 am

    Hi Kumar,

    The approach looks good for ETL loading. But for reporting, if we use this 32-character value as the join key, won’t it affect performance?
    Am I right? Please let me know if I am wrong.

    Thanks
    Jithin

    • Kumar Gauraw July 17, 2016 at 8:50 pm

      Hi Jithin,

      I don’t think performance is that big of a deal with a 32-character field holding the MD5 values, since the usual surrogate keys we have traditionally seen use the INTEGER datatype, which is NUMBER(38) as well. So, no, I don’t think performance is a big concern for querying. Instead, it is a better loading scheme that cuts down the load time many fold for large models with many facts and dimensions.

      Regards,
      Kumar

  5. Dipti August 17, 2016 at 7:13 pm

    Hi Kumar

    How to handle SCD with this approach?

    • Kumar Gauraw October 31, 2016 at 2:12 pm

      Which SCD type are you confused about, Dipti?

      • bhasakr August 8, 2018 at 10:27 am

        Dear Kumar, good day! Hope you are doing well.
        Please explain about SCD Type 2.

  6. Siddharth Krishna June 9, 2017 at 3:14 am

    During the fact table load, if the staging table doesn’t have the surrogate key, won’t it require a lookup to the dimension table, since all dimensions may not be available on a given day corresponding to the transactional data for that day?

  7. Mark F June 13, 2017 at 12:26 am

    This is an interesting idea – however, I think it will sacrifice query speed, most likely big time, for the load time gains – which is not usually the tradeoff we want. I’m not sure what you mean in equating Integer and Number(38) in number of bytes – 4 bytes for integer and 20 for number(38) (and who would use Number for a surrogate key anyway ?). If you need a bigint that’s still only 8 bytes. So a hash-based approach has 24 or 28 more bytes, per dimension, of extra space on each FK in the fact, with the implications of that too on the indices you’d have on them. Plus, there is no way a join on a 32 byte value will remotely perform as a join on an int or bigint would (which map to the hardware architecture as no other datatypes do).

    So, on balance, unless load speed trumps query speed, this doesn’t seem like a good idea.

  8. Nick February 24, 2018 at 1:07 am

    Totally agree with the statement above. I like the concept, but it fails the basic principle of what a dimensional model is trying to achieve: query performance. Using CHAR(32) for, say, 10 dimensions and several measures over 500 million records will yield a drop in performance, as you are doing more I/O for each record. DW 101… long skinny facts, short fat dimensions. I like your thinking though.

  9. Isha August 9, 2018 at 10:45 pm

    I am not able to understand how SCD Type 2 would be implemented with this algorithm. We do not have effective date and expiry date columns in our staging layer. How would we know when to expire a record and insert a new row for a given natural key vs. when it is a totally new insert? Could you please explain with an example.

    Thanks in advance!
