Big data, as the name suggests, is a collection of large datasets that any organization puts together to serve specific analytical goals and to support their operations.
The concept of Big Data is often used to denote a storage system where different types of data in different formats can be stored for analysis and driving business decisions. The data stored in Big Data systems can be structured (coming from RDBMS systems or other formatted data sources) or unstructured (such as log files, pictures, audio files, communications records, email, Twitter feed – anything).
Today, data is growing at an unprecedented rapid pace and even more intriguing fact is that the unstructured data accounts for 90% of all the data.
The Problem With Traditional Approach To Data
That means, with complex and dynamically increasing requirements of data storage, management, and analytics, traditional ways of handling enterprise data is no more efficient or accurate. That is where Big Data comes to rescue with whole set of new and efficient tools and technologies.
The traditional storage systems were about storing data in a relational database in a way that various applications could access it over the network. This approach has following problems:
1. With the volume of data going into terabytes, petabytes and even in zettabytes, traditional data storage systems are highly inefficient in handling that volume of data.
2. With highly dynamic data and need for real time analysis on a data, (which is more unstructured and less structured,) it is painstakingly becoming difficult to deal with traditional storage and applications.
3. If the number of users on the system increase, the load on the system increases and this it degrades the performance due to the way data is stored at one central place and users access it from different places on the network.
Big Data – The New Approach
The Big Data approach is the new approach to data storage and management where structured and unstructured data can co-exist and contribute to the overall analysis and help organizations make more accurate decisions.
Speaking of Big Data, following are the high level requirements of such a system:
1. The system should be fault tolerant.
2. The system should be designed in such a way that the data is always available.
3. The processes running on the system should not be impacted adversely by a single node failure.
4. The data integrity and correctness should be maintained correctly at all times.
5. The system should be highly scalable. That means, it should allow new hardware, new nodes to be added as the need for more processing of more storage arises.
Hadoop – The Ecosystem To Handle Big Data Efficiently
According to the definition by IBM, “Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.”
Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop can handle all types of data from disparate sources and systems – structured, unstructured, log files, pictures, audio files, text, Twitter feeds, social media channels – just about anything you can think of, regardless of its native format.
Furthermore, you can put it all in your Hadoop cluster with no prior need for a schema. In other words, you don’t need to know how you intend to query your data before you store it; Hadoop lets you decide that later.
By making all your data useable, not just what’s in your databases, Hadoop lets you uncover hidden relationships and reveals answers that have always been just out of reach. You can start making more decisions based on hard data, instead of hunches, and look at complete data sets, not just samples and summaries.
Highlights Of The Features Of Hadoop
Hadoop is a distributed framework designed to handle Big Data efficiently. Let’s look at some of the top features that make Hadoop so powerful in today’s atmosphere:
1. Hadoop Is Highly Scalable
Because Hadoop work on a distributed framework where any number of servers can be added as a node within a cluster of servers for processing, it is highly scalable.
Hadoop clusters can be expanded from just a few server nodes to several hundreds of nodes efficiently without bringing the system down or impacting the system in anyway.
Depending upon the need, adding new servers to the cluster or removing servers from the cluster can be done without bringing the cluster down.
2. Hadoop Is Highly Flexible
There is no particular specification of machines that can be added to a Hadoop cluster. Any machine of any capability, can be added to a Hadoop cluster without any difficulty. It gives offers high economical flexibility.
Because Hadoop is Java based, it can run on any operating system as long as Java Virtual Machine is up and running.
In addition, Hadoop can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
3. Hadoop Is Highly Economical
Since Hadoop is an Open Source project, you don’t need to buy licenses to install Hadoop on new systems or to expand your Hadoop clusters.
If you compare that with the traditional systems and their proprietary tools that you needed to buy when your organization needed more licenses, Hadoop seems very economical from licensing perspective alone.
On top of that, you can decide what kind of hardware you want to attach to a Hadoop cluster depending upon your organizational needs and budget and it will work just fine.
4. Hadoop Is Highly Fault Tolerant
Hadoop, by design, treats failure as a part of system and therefore, it maintains replica of data across multiple nodes within its cluster thus providing a high level fault tolerance.
Execution of jobs in a Hadoop cluster continues without any interruption even in case a node or two fails at any time.
Because of distributed framework and data replication across multiple nodes, Hadoop provides a highly reliable and efficient storage system.
5. Hadoop Brings Computations Closer To Data
Data Localization is one of those fundamental concepts that make Hadoop so good at crunching large data sets. Unlike traditional systems where computations are performed by applications on separate machines while data resides on a central server, Hadoop brings computations closer to the node where data is stored making data processing way more faster.
In the core of the HDFS documentation, under “Assumptions and Goals”, follow is stated:
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
This approach ensures that real time analysis at lightening speed making it possible for organizations to make real time business decisions.
Hadoop includes various main components and every year new components are being added to the Hadoop Ecosystem making it even more powerful.
However, HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop. The most important aspect of Hadoop is that both HDFS and MapReduce are designed with each other in mind and each are co-deployed such that there is a single cluster and thus provides the ability to move computation to the data easily.
1. Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that provides high-throughput access to data spread across multiple nodes. It provides a limited interface for managing the file system to allow it to scale and provide high throughput.
HDFS breaks data into multiple blocks and then creates multiple replicas of each data block and distributes them on several nodes throughout its cluster to enable reliable and rapid access.
The main components of HDFS are as described below:
- NameNode is the master of the entire HDFS system. It maintains the metadata of entire system (information about nodes,directories,files) and manages the blocks which are present on the DataNodes attached to it. If NameNode is down, HDFS can’t be accessed.
- DataNodes are the slaves which are deployed on each machine to provide the actual storage. These are the nodes responsible for serving read and write requests for the clients.
- Secondary NameNode should not be assumed to be a backup copy of the NameNode. It is actually responsible for performing periodic checkpoints of the system which can be used to restart the NameNode in the event of NameNode failures.
MapReduce is the component of Hadoop framework for performing distributed data processing.
In the MapReduce paradigm, each job has a user-defined map phase (which is a parallel, share-nothing processing of input; followed by a user-defined reduce phase where the output of the map phase is aggregated).
Typically, HDFS is the storage system for both input and output of the MapReduce jobs.
The main components of MapReduce are as described below:
- JobTracker is the master of the Map Reduce system which manages the jobs and resources in the cluster (TaskTrackers). The JobTracker’s job is to try to schedule each map as closer as possible to the actual data being processed.
- TaskTrackers are the slaves of Job Tracker which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker. Usually, Job tracker assigns the tasks of running the mapping or reducer tasks to Task Trackers that are local to the node where the data being processed resides.
- JobHistoryServer is a daemon that serves historical information about completed applications. Typically, JobHistory server can be co-deployed with JobTracker, but we recommend to run it as a separate daemon.
The following illustration provides details of the core components for the Hadoop stack:
Apache has a new component by name YARN which they call the NextGen MapReduce. While the detailed discussion on YARN is not in the scope of this writing, I do want to share that the fundamental idea of YARN is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
The idea behind YARN is to have a global Resource Manager (RM) and per-application Application Master (AM). An application is either a single job in the classical sense of Map-Reduce jobs.
The Resource Manager (RM) and per-node slave, the Node Manager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system in case of YARN. For more details on YARN please refer to this article on apache’s website.
Hadoop is a powerful distributed framework for Big Data management. A lot of top companies like LinkedIn, Amazon, eBay, Adobe and many more are powered by Hadoop. Yahoo is known for bringing focus of the world on Hadoop through their 1000 node cluster back in 2007 and then 4000 node cluster in 2008.
Hadoop is unstoppable as its open source roots grow wildly and deeply into enterprise data management architectures.
Realizing the trend, many companies, from IBM to Amazon Web Services, Microsoft and Teradata – all have packaged Hadoop into more easily-consumable distributions or services.
As time goes by, more and more companies are adapting Hadoop as the need for Big Data management is consistently rising.