The term Big Data is often used to denote a storage system where different types of data in different formats can be stored for analysis and driving business decisions.
Big Data is a collection of data so large and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with traditional RDBMS databases or traditional data processing techniques.
According to IBM, there are three main characteristics of Big Data:
- Volume: Facebook, for example, generates 500+ terabytes of data per day. Data volume is growing at an unprecedented rate every day.
- Velocity: For example, an organization may need to analyze 2 million records each day to identify the reason for losses. Companies like Facebook and Google analyze much bigger data sets every day for their data processing needs.
- Variety: Images, audio, video, sensor data, log files, etc.
What Is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Unstructured data such as log files, Twitter feeds, media files, and data from the internet in general is becoming more and more relevant to businesses. Every day a large amount of unstructured data gets dumped into our machines. The major challenge is not storing large data sets in our systems but retrieving and analyzing this kind of big data in organizations.
Hadoop is a framework that can store and analyze data present on different machines at different locations very quickly and in a very cost-effective manner. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel.
Key Features Behind Popularity Of Hadoop
Because of its power of distributed processing, Hadoop can handle large volumes of structured and unstructured data more efficiently than the traditional enterprise data warehouse. Hadoop is open source and runs on commodity hardware. That means the initial cost savings are dramatic with Hadoop, and it can continue to grow as your organizational data grows.
Here are a few key features of Hadoop:
1. Hadoop Brings Flexibility In Data Processing
One of the biggest challenges organizations have had in the past was handling unstructured data. Let’s face it, only 20% of data in any organization is structured, while the rest is unstructured data whose value has been largely ignored due to a lack of technology to analyze it.
Hadoop manages data whether structured or unstructured, encoded or formatted, or any other type of data. Hadoop brings value to the table by making unstructured data useful in the decision-making process.
2. Hadoop Is Easily Scalable
This is a huge feature of Hadoop. It is an open source platform and runs on industry-standard hardware. That makes Hadoop an extremely scalable platform where new nodes can easily be added to the system as data volume or processing needs grow, without altering anything in the existing systems or programs.
3. Hadoop Is Fault Tolerant
In Hadoop, the data is stored in HDFS, where it is automatically replicated at two other locations. So even if one or two of the systems collapse, the file is still available on at least the third system. This brings a high level of fault tolerance.
The level of replication is configurable, and this makes Hadoop an incredibly reliable data storage system. This means that even if a node gets lost or goes out of service, the system automatically reallocates work to another location of the data and continues processing as if nothing had happened!
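The replication idea above can be illustrated with a small Python simulation. This is only a sketch: the node names, block names and the random placement are assumptions made for illustration, and real HDFS uses a rack-aware placement policy rather than a random pick.

```python
import random

def place_blocks(block_ids, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement, not HDFS's real policy)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}

# Hypothetical five-node cluster holding four blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["blk_0", "blk_1", "blk_2", "blk_3"], nodes)

# With three copies of every block, any two node failures leave every block readable.
survivors = readable_blocks(placement, failed_nodes=["node1", "node2"])
print(len(survivors) == len(placement))  # prints True
```

Lowering `replication` to 1 in this sketch shows the other side of the trade-off: a single failed node can then make a block unreadable.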
4. Hadoop Is Great At Faster Data Processing
While traditional ETL and batch processes can take hours, days, or even weeks to load large amounts of data, the need to analyze that data in real time is becoming more critical by the day.
Hadoop is extremely good at high-volume batch processing because of its ability to do parallel processing. Hadoop can perform batch processes 10 times faster than a single-threaded server or a mainframe.
5. Hadoop Ecosystem Is Robust
Hadoop has a very robust ecosystem that is well suited to meet the analytical needs of developers and of small to large organizations. The Hadoop ecosystem comes with a suite of tools and technologies, making it well suited to a variety of data processing needs.
Just to name a few, the Hadoop ecosystem comes with projects such as MapReduce, Hive, HBase, Zookeeper, HCatalog and Apache Pig, and many new tools and technologies are being added to the ecosystem as the market grows.
6. Hadoop Is Very Cost Effective
Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.
Apache Hadoop was developed to help Internet-based companies deal with prodigious volumes of data. According to some analysts, the cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte, which is about one-fifth to one-twentieth the cost of other data management technologies.
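To make that ratio concrete, here is a quick back-of-the-envelope calculation in Python. The 100 TB data set is a hypothetical size chosen for illustration; the per-terabyte figure and the ratio are the analyst estimates quoted above, not measured prices.

```python
HADOOP_COST_PER_TB = 1_000  # analyst estimate quoted above, in dollars
RATIO_RANGE = (5, 20)       # "one-fifth to one-twentieth" means alternatives cost 5x to 20x as much

def cost_comparison(terabytes):
    """Return (hadoop_cost, alternative_low, alternative_high) in dollars."""
    hadoop = terabytes * HADOOP_COST_PER_TB
    low, high = (hadoop * ratio for ratio in RATIO_RANGE)
    return hadoop, low, high

# Hypothetical 100 TB data set.
print(cost_comparison(100))  # (100000, 500000, 2000000)
```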
Core Architecture Of Hadoop
Hadoop is composed of four core components:
1. Hadoop Common
Hadoop Common is the set of common utilities that support other Hadoop modules.
2. Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout various nodes in the cluster. This way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing. This powerful feature is made possible through the HDFS of Hadoop.
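As a rough illustration of how a file is broken into blocks, the Python sketch below splits a file size into block sizes. The 128 MB block size is an assumption (it is a common default, and the value is configurable per cluster), and the 1,000 MB file is hypothetical.

```python
BLOCK_SIZE_MB = 128  # assumed block size; configurable in a real cluster

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes

# Hypothetical 1,000 MB file.
blocks = split_into_blocks(1000)
print(len(blocks), blocks[-1])  # 8 104
```

Each of those eight blocks can then be stored on, and processed by, a different node in the cluster.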
3. MapReduce
MapReduce is Hadoop's programming framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.
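For readers new to the paradigm, a word count is the usual first example. The sketch below imitates the map, shuffle and reduce steps in plain Python on a single machine; it does not use Hadoop's own API, and the function names and sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data in parallel"]

# In a cluster each line could be mapped on a different node; here the calls run one after another.
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts["hadoop"], counts["data"])  # 2 2
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread across many machines without the pieces needing to coordinate.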
4. Yet Another Resource Negotiator (YARN)
YARN is a cluster management technology. It is one of the key features in second-generation Hadoop.
Sometimes described as next-generation MapReduce, it assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform.
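The idea of several engines sharing one pool of CPU and memory can be pictured with a toy allocator in Python. This is a deliberately simplified first-fit sketch, not how YARN's schedulers actually work; the `Node` class, the application names and all the resource figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int

def allocate(nodes, memory_mb, vcores):
    """Grant resources on the first node with enough free memory and CPU, or return None."""
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node.name
    return None

# Hypothetical two-node cluster.
cluster = [Node("node1", 4096, 2), Node("node2", 8192, 4)]

# Requests from different kinds of applications draw on the same pool of resources.
requests = [("batch-job", 3072, 2), ("sql-query", 2048, 1), ("stream-job", 8192, 2)]
grants = {app: allocate(cluster, memory, cores) for app, memory, cores in requests}
print(grants)  # {'batch-job': 'node1', 'sql-query': 'node2', 'stream-job': None}
```

The last request is refused because no single node has 8192 MB free any more, which is the kind of bookkeeping a cluster-wide resource manager has to do on behalf of every application.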
Big Data is going to dominate the next decade in the data processing world, and the Hadoop ecosystem, with all the supporting data access projects around it, is going to be at the center of it all. Traditional data integration tools now come with Hadoop and Big Data support to meet the next level of data processing challenges.
I hope this post adds value and adds to your knowledge of the Hadoop ecosystem and its components. We will dive deeper into each of Hadoop’s components and data access projects in future posts.
Your Turn, Share Your Knowledge Now!
If you have been working in the Big Data area, would you add value by sharing your experience with Hadoop? What are some of the important features of Hadoop that a beginner should know? Share your thoughts in the comments!