Master data is the shared, authoritative copy of core business data, such as customer, product, employee, supplier, and location data, that several applications within an enterprise use. Master Data Management (MDM) matters to organizations today because it gives the enterprise a single version of the truth.
Without clearly defined master data, an enterprise runs the risk of holding multiple copies of the same data that are inconsistent with one another. This is a very common issue in organizations that do not have an MDM solution in place.
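To make the inconsistency problem concrete, here is a minimal sketch in Python. The two record lists, the field names, and the `find_conflicts` helper are all hypothetical, made up for illustration; a real MDM tool would use far more sophisticated matching rules than a plain key comparison.

```python
from collections import defaultdict

# Hypothetical customer records as two separate applications might store them.
crm_records = [
    {"customer_id": "C001", "name": "Acme Corp", "city": "Boston"},
    {"customer_id": "C002", "name": "Globex", "city": "Chicago"},
]
billing_records = [
    {"customer_id": "C001", "name": "ACME Corporation", "city": "Boston"},
    {"customer_id": "C002", "name": "Globex", "city": "Chicago"},
]


def find_conflicts(*sources):
    """Return {customer_id: {field: set of differing values}} across sources."""
    seen = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for record in source:
            for field, value in record.items():
                if field != "customer_id":
                    seen[record["customer_id"]][field].add(value)

    conflicts = {}
    for customer_id, fields in seen.items():
        # A field conflicts when the applications disagree on its value.
        differing = {f: values for f, values in fields.items() if len(values) > 1}
        if differing:
            conflicts[customer_id] = differing
    return conflicts


conflicts = find_conflicts(crm_records, billing_records)
print(conflicts)
```

In this toy data set the two applications disagree on the name of customer `C001`, which is exactly the kind of discrepancy a single master record is meant to eliminate.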
Data Volume Matters
Master data is usually small in volume compared with transactional data, and much smaller than big data.
If you look around, not many companies have several million product records. Having millions of customers, suppliers, or locations is also rare. In fact, only in B2C companies do you see customer records in the tens or sometimes hundreds of millions (e.g. Facebook, Twitter, Google).
Big data, on the other hand, is about much larger volumes: data so large that traditional RDBMS platforms struggle to handle it. The big data toolset keeps improving, and more and more companies are eager to jump on the big data bandwagon.
Hadoop Is Not A Good Fit For MDM
Doing master data management on a big data platform such as Hadoop doesn’t seem to be a natural fit, and most companies wouldn’t be willing to go that route because of the low volume of master data.
So the biggest question that came to my mind was: are these two related in some way? If so, how is big data related to master data management, and how can they coexist?
MDM Has A Different Purpose Than That Of Big Data
After doing research and asking several of my friends who have experience in working with MDM tools, I came to a conclusion that makes perfect sense!
Yes, master data is also data. But the purpose of MDM is different from that of big data or data warehousing.
Therefore, while big data can sometimes add value to MDM, an MDM solution is not driven by big data (or Hadoop). Instead, MDM is used as a reference while building big data solutions or data marts. Where it exists, MDM is the true master that drives other master-data-driven projects across the enterprise.
How MDM Can Drive Big Data And EDW
So, how should you plan an MDM solution while you are building a big data solution for an enterprise data warehouse (EDW) that involves a huge volume of data coming from an array of heterogeneous sources?
Big data will play an important role in your design if your data is high in volume and includes unstructured data. Another thing to keep in mind is that big data is not a replacement for the RDBMS or the EDW. In reality, big data complements your existing EDW infrastructure by helping you analyze patterns you were not able to detect using your standard RDBMS.
Keeping that in mind, you want to keep a familiar dimensional data model in your data warehouse for the relevant data that will be used for analytical purposes, while all the remaining data stays in the Hadoop ecosystem as an active archive. This lets you analyze data that is not necessarily stored in your EDW.
With experience, I realized that you also want to design your MDM solution in parallel with your EDW, in an RDBMS environment, where it is continuously used and maintained as the single source of truth for your enterprise-level master data. It is from this MDM store that your data warehouse and your Hadoop platform read and reference the master data for your enterprise.
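As a rough sketch of that idea, the Python snippet below lets both a warehouse load and a Hadoop-side event feed look up the same master records. Everything in it is an assumption made for illustration: the in-memory `MASTER_CUSTOMERS` dictionary stands in for the MDM store (which in practice would live in an RDBMS), and the `enrich` function, field names, and sample rows are hypothetical.

```python
# Hypothetical master data store: one authoritative record per customer key.
# In practice this would be a table in the MDM database, not a dictionary.
MASTER_CUSTOMERS = {
    "C001": {"name": "Acme Corp", "segment": "Enterprise"},
    "C002": {"name": "Globex", "segment": "Mid-market"},
}


def enrich(rows, master):
    """Attach master attributes to each row; collect rows whose key is unknown."""
    enriched, unmatched = [], []
    for row in rows:
        master_record = master.get(row["customer_id"])
        if master_record is None:
            # No master record: flag the row instead of inventing attributes.
            unmatched.append(row)
        else:
            enriched.append({**row, **master_record})
    return enriched, unmatched


# Structured rows headed for a warehouse fact table.
warehouse_sales = [{"customer_id": "C001", "amount": 250.0}]

# Semi-structured events as they might land on the Hadoop side.
clickstream_events = [
    {"customer_id": "C002", "page": "/pricing"},
    {"customer_id": "C999", "page": "/home"},
]

sales_enriched, sales_unmatched = enrich(warehouse_sales, MASTER_CUSTOMERS)
events_enriched, events_unmatched = enrich(clickstream_events, MASTER_CUSTOMERS)

print(sales_enriched)
print(events_unmatched)
```

Both consumers read customer names and segments from one place, so neither the warehouse nor the Hadoop platform keeps its own competing copy of the master data.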
This concept is depicted in the architecture diagram shown below: