If you have been around big data for any length of time and worked on Hadoop, you have seen plenty of Pig and Hive. And if you are just learning Hadoop and Big Data, Pig and Hive must seem like the tools you need to know in order to get some control over big data.
Well, that’s the way things are in the world of Hadoop and for good reasons.
Pig and Hive both do similar things, with differences in the approach each takes to dealing with data. While Pig is great at migrating (ETL) data into Hadoop and making it ready for querying, Hive is a great fit for data warehousing on Hadoop thanks to its SQL-like syntax.
Pig is a simple and powerful language and does a great job with data migration activities. However, the reason some of us prefer Hive over Pig is that we don’t need to learn a new syntax to work with Hive as long as we know SQL.
This is why the folks at Facebook created Hive. They wanted to avoid the effort of training their users in Pig’s syntax, which nobody knew before the advent of Pig. Since most of their staff were already good at SQL, Hive was the perfect solution in that situation. So Facebook created Hive, and eventually it became a top-level Apache project.
Performance Challenges With Hive
Hive’s syntax looks very much like traditional SQL. However, because Hive is built on Hadoop and MapReduce, it is not exactly an RDBMS on Hadoop. The queries we execute are translated into MapReduce jobs, which means you can expect your Hive queries to run very slowly compared to traditional SQL databases.
Hive can read text files such as logs, CSV, or JSON data exported from other systems, and Hive’s output can be in text format as well. Although not the best format in the world, text files can be easily consumed by other systems and a wide range of tools.
However, there are severe performance issues that make the text format undesirable. Text is a verbose way of storing data, and it forces Hive to scan the entire data set for every query, which makes it extremely slow. Moreover, if the data is delimited and requires escaping, the parsing involved creates a huge performance bottleneck.
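As a concrete illustration, a raw log table stored as delimited text might be declared like this (the table and column names are hypothetical, chosen only for the example):

```sql
-- Hypothetical log table stored as plain delimited text.
-- Every query against it must read and parse the full files.
CREATE TABLE access_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING,
  status     INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

Every field of every row has to be parsed out of the text on each query, which is exactly the overhead described above.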
Optimization Through Partitioning And Compression
So, what do we do to optimize Hive’s query performance?
Well, there are many things that can be done, and one of the first performance enhancements we can make is to partition the data into a directory structure.
For example, we can partition log data by year, month, and day, split geographic data by region and country, and so on. This enables Hive to query and read only the requested partitions of the data instead of scanning the whole set each time. In addition, Hive supports a wide variety of compression algorithms, such as Gzip, Bzip2, LZO, and Snappy.
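As a sketch, a hypothetical log table partitioned by year, month, and day might look like this (the names are made up for illustration):

```sql
-- Hypothetical log table partitioned by date components.
-- Each partition becomes its own directory in HDFS.
CREATE TABLE access_logs (
  event_time STRING,
  url        STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS TEXTFILE;

-- A query that filters on the partition columns only touches
-- the matching directories instead of scanning every file:
SELECT url, COUNT(*)
FROM access_logs
WHERE year = 2014 AND month = 6 AND day = 15
GROUP BY url;
```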
Not all compression algorithms are suitable in every situation; based on your Hadoop infrastructure and data, you can choose the compression technique that best suits your needs. Experience has shown that compression significantly improves query performance.
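As a rough sketch, output compression can be switched on per session with settings along these lines (the Snappy codec here is just one possible choice):

```sql
-- Compress the final query output and pick a codec.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Intermediate data passed between MapReduce stages
-- can be compressed as well:
SET hive.exec.compress.intermediate=true;
```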
Optimizing Performance By Changing Storage File Format
By default, Hadoop stores its data internally in flat sequence files, a binary storage format for key-value pairs. Sequence files can be compressed at the record or block level to improve IO.
Unfortunately, sequence files are not an optimal solution for Hive, because the row-oriented nature of this storage format means Hive must scan a whole row even when it is looking for a single value.
So, here are a couple of optimal file format solutions to help you dramatically improve performance of your Hive queries:
1. RCFile Data Storage Format
The RCFile (Record Columnar File) is a data storage format, co-developed by Facebook, which combines data storage formatting, data compression, and data access optimization to meet four data storage requirements: fast data loading, fast query processing, efficient use of storage space, and strong adaptivity to highly dynamic data access patterns.
The RCFile format partitions data in both directions, horizontally and vertically. This allows data to be stored in a columnar format instead of rows: instead of storing row one and then row two, it stores column one across all rows, then column two across all rows, and so on. On top of that, since the rows are broken into multiple partitions across multiple blocks, Hadoop’s parallelism does the rest of the magic.
The RCFile format allows Hive queries to skip large parts of the data and get results faster and cheaper, since Hive can fetch only the specific columns required for the analysis, eliminating the time normally needed to scan the whole table.
The storage format for a Hive table is specified at table creation time. Here is the syntax for creating a Hive table with the RCFile storage format:
CREATE TABLE table_name (column1 type, column2 type,…) STORED AS RCFILE;
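An existing text-format table can then be converted by copying its rows into an RCFile table; the table and column names below are hypothetical:

```sql
-- Hypothetical conversion of a text table into RCFile format.
CREATE TABLE access_logs_rc (
  event_time STRING,
  url        STRING
)
STORED AS RCFILE;

-- Copy the data over; Hive rewrites it in columnar layout.
INSERT OVERWRITE TABLE access_logs_rc
SELECT event_time, url FROM access_logs;
```

Queries that touch only one or two columns then read just those column groups from disk instead of whole rows.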
2. ORC File Data Storage Format
The ORC (Optimized Row Columnar) file storage format takes storage and performance to a whole new level, providing a much more efficient way to store relational data than the RCFile.
The ORC file format reduces the size of the original data by up to 75% and performs better than the other Hive file formats when Hive is reading, writing, and processing data.
According to people who have tested the ORC file format extensively, it takes less time to access data and less space to store it than the RCFile. However, the ORC format adds CPU overhead, since it takes extra time to decompress the relational data. Also, the ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions, so if you are on an older version of Hive, you will need to upgrade to use ORC.
The storage format for a Hive table is specified at table creation time, so if you want to use the ORC file storage format for a new Hive table, here is the syntax:
CREATE TABLE table_name (column1 type, column2 type,…) STORED AS ORC;
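ORC tables also accept table properties that control the compression used inside the file; for example (SNAPPY is one supported codec, alongside ZLIB and NONE, and the table name is hypothetical):

```sql
-- Hypothetical ORC table with Snappy compression inside the file.
CREATE TABLE access_logs_orc (
  event_time STRING,
  url        STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```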
Improving performance through data compression, partitioning, and file formats that give smaller, faster, more optimal storage is going to become increasingly important as Big Data technologies mature. The RCFile and ORC file formats have certainly added tremendous value to Hive.
However, as big data continues to evolve to cope with the rapid and exponential growth of data, we can expect new and more innovative ways to perform further optimization. Let’s keep learning and keep watching!