Since Apache Spark became a top-level Apache project in 2014, it has received massive recognition, and the developer community has embraced it for good reasons. Apache Spark is a fast, in-memory data processing engine with elegant development APIs that allow developers to efficiently execute streaming, machine learning, or SQL workloads that require fast iterative access to datasets.

However, there seems to be some confusion: some people have begun to ask whether Spark could replace Hadoop, or whether it is simply better than Hadoop.

As far as I understand Hadoop and Spark, that assumption is not correct. I am going to discuss some aspects of Spark to show that it complements Hadoop and improves its capabilities rather than replacing it.

Let’s get started.

Apache Hadoop – An Ecosystem For Big Data

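Hadoop is a platform for Big Data processing. Its core components are a distributed file storage system known as HDFS (Hadoop Distributed File System), the MapReduce programming engine for data processing, and, since Hadoop 2, YARN, which took over cluster resource management from MapReduce so that other processing engines can run alongside it.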

Rather than discussing the details here, I recommend a couple of earlier posts that cover the basics of Hadoop and its toolset:

  1. Fundamentals Of Big Data And Hadoop Ecosystem
  2. Big Data And Hadoop – Features And Core Architecture

Figure: Applications and tools for Hadoop, including Spark

Apache Spark – A Data Processing Engine

As stated above, Apache Spark is a fast, in-memory data processing engine, while Hadoop provides distributed storage through HDFS and cluster resource management through YARN. Spark can work with both, which adds more power to Big Data processing on Hadoop.

Spark does not have its own distributed storage system, but it can run on Apache Hadoop and work comfortably with YARN. With that kind of setup, developers can create applications that leverage the computational power of Spark.

Basically, Spark is just one of many data processing engines that work with YARN in Hadoop to help process Big Data.

For more information, the Hortonworks website has a simple but very informative post explaining what Apache Spark can and cannot do.

Having said that, Spark isn’t limited to the Hadoop platform, just as Hadoop has many other tools in its ecosystem. Apache Spark can run on Hadoop, on Mesos, standalone, or in the cloud, as the situation demands.
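
As a rough illustration of those deployment options, the sketch below builds spark-submit command lines that differ only in the value passed to --master. The helper name, host names, ports, and the application file my_app.py are placeholders made up for this example, not values from any real cluster.

```python
# Hypothetical helper for illustration: host names, ports and the
# application file name are placeholders, not real cluster settings.

MASTER_URLS = {
    "yarn": "yarn",                           # run on Hadoop YARN
    "mesos": "mesos://mesos-host:5050",       # run on Apache Mesos
    "standalone": "spark://spark-host:7077",  # Spark's own standalone cluster manager
    "local": "local[*]",                      # single machine, all available cores
}


def build_submit_command(cluster, app="my_app.py"):
    """Return the spark-submit argument list for the chosen cluster manager."""
    if cluster not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {cluster}")
    return ["spark-submit", "--master", MASTER_URLS[cluster], app]


if __name__ == "__main__":
    for name in MASTER_URLS:
        print(" ".join(build_submit_command(name)))
```

The point of the sketch is that the application itself stays the same; only the master setting decides where it runs.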

Figure: Spark machine learning libraries, SQL, and GraphX

Spark can access diverse data sources, including HDFS, Cassandra, HBase, and S3. It is therefore an independent data processing engine in its own right, although it works best on HDFS when distributed data storage is needed.

Can Apache Spark Replace Hadoop?

The simple answer is a resounding “No”, as the descriptions of Spark above already suggest.

Hadoop is a general-purpose Big Data processing framework that offers not only a powerful, scalable HDFS but also many data processing engines. Spark, by contrast, is an alternative to just one part of it, Hadoop’s MapReduce, for the following reasons:

  1. MapReduce has traditionally been used to run map/reduce jobs, which are generally long-running batch jobs that take minutes or even several hours to complete, depending on the application.
  2. Spark is an alternative to MapReduce’s batch map/reduce model. Spark is used for near real-time stream processing and for fast interactive queries that can finish within seconds thanks to its in-memory processing.
  3. Spark keeps working data in RAM instead of relying on disk I/O between steps. It therefore uses more memory, but it is significantly faster (almost real-time) than MapReduce.
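
To make the map/reduce model in the list above more concrete, here is a minimal single-process sketch in plain Python. It uses neither Hadoop nor Spark APIs, and the function names are made up for this example. It only shows the three logical phases (map, shuffle, reduce) on a word count.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark runs on hadoop", "hadoop stores data", "spark processes data"]
    print(word_count(sample))
```

In a real MapReduce job, the output of each phase is written to disk before the next phase reads it, which is a large part of why such jobs are slower than the in-memory approach described in points 2 and 3.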

Spark Adds Power To Hadoop In Real-Time Analytics

Developments in streaming technologies and real-time analytics demanded new data processing models, and Apache Spark came to fill that gap in the Hadoop framework. Spark’s speed and versatility, both due to its in-memory processing, make it a key part of today’s Big Data processing stack across organizations.
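
The benefit of in-memory processing for repeated passes over the same data can be sketched with a toy model in plain Python. This is not how Spark is implemented; the class and function names are invented for illustration, and the "disk" is simulated by a counter that records how often the full dataset is read.

```python
class CountingSource:
    """Stands in for a slow, disk-backed dataset and counts full reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._values)


def iterate_from_disk(source, passes):
    """Re-read the dataset on every pass, as a chain of disk-based batch jobs would."""
    total = 0
    for _ in range(passes):
        total += sum(source.read())
    return total


def iterate_in_memory(source, passes):
    """Read once, keep the data in memory, and reuse it on every pass."""
    cached = source.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


if __name__ == "__main__":
    disk = CountingSource(range(5))
    memory = CountingSource(range(5))
    print(iterate_from_disk(disk, 10), "reads:", disk.reads)
    print(iterate_in_memory(memory, 10), "reads:", memory.reads)
```

Both functions compute the same result, but over ten passes the first one reads the source ten times and the second reads it once. Iterative workloads such as machine learning training repeat this pattern many times, which is where keeping data in memory pays off.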

Some of the key characteristics of Spark include:

  • Its ability to leverage distributed storage platforms.
  • Its support for data-parallel computations.
  • Its uniform and elegant APIs, which allow developers to efficiently execute streaming, machine learning, or SQL workloads.
  • Its excellent support for fault tolerance.
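
The idea behind data-parallel computation in the list above can be sketched in plain Python: split the data into partitions, run the same function on each partition, and combine the partial results. The function names are hypothetical, and a thread pool on one machine stands in for a cluster of workers, so this shows the structure of the computation rather than real distributed speed-ups.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(partition):
    """Per-partition task: every worker runs the same function on its own slice."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Apply the task to each partition concurrently, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

Because each partition is processed independently, a lost partial result can be recomputed from its own slice of the input without redoing the whole job, which is the intuition behind fault tolerance in this style of processing.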

The developer community can benefit from Spark’s support for popular programming languages such as Scala, Java, Python, and R, while data scientists can benefit from Spark’s support for machine learning (ML) through its own distributed ML library, MLlib.

One example of how powerful the combination of machine learning and real-time processing can be is the use of Spark to provide personalized search recommendations, or product recommendations based on customers’ buying or browsing profiles.

Producing this type of recommendation could take days or even weeks before Spark came along.

A great example, and an awesome tutorial, of how Spark Streaming can be used on the Hadoop platform for almost real-time data processing and analytics is available on Cloudera’s blog.

In that post, the author shows how to capture all clickstream activity within the timeframe of a single visitor’s website session and produce analytics based on near real-time data collection using Spark Streaming. If you have some experience with Hadoop development, you will enjoy learning from it. If not, you will be intrigued to learn more.
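
To give a feel for what grouping clickstream activity into visitor sessions involves, here is a small sketch in plain Python. It is not the code from the Cloudera post and does not use Spark Streaming. The function name, the event format of (visitor_id, timestamp) pairs, and the 30-minute inactivity timeout are all assumptions made for this example.

```python
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed inactivity gap that ends a session


def sessionize(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (visitor_id, timestamp) click events into per-visitor sessions.

    Returns {visitor_id: [[t1, t2, ...], [t5, ...]]}. A new session starts
    whenever the gap since the visitor's previous click exceeds the timeout.
    """
    clicks_by_visitor = defaultdict(list)
    for visitor_id, timestamp in events:
        clicks_by_visitor[visitor_id].append(timestamp)

    sessions = {}
    for visitor_id, timestamps in clicks_by_visitor.items():
        timestamps.sort()
        visitor_sessions = [[timestamps[0]]]
        for ts in timestamps[1:]:
            if ts - visitor_sessions[-1][-1] > timeout:
                visitor_sessions.append([ts])    # gap too long: start a new session
            else:
                visitor_sessions[-1].append(ts)  # still within the current session
        sessions[visitor_id] = visitor_sessions
    return sessions


if __name__ == "__main__":
    clicks = [("a", 0), ("a", 100), ("b", 50), ("a", 5000)]
    print(sessionize(clicks))
```

A streaming job would apply the same kind of logic to each small batch of incoming events while carrying open sessions forward between batches, instead of processing one complete list as this sketch does.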

Your Turn To Share

I hope this post was useful in shedding some light on the use of Spark in the Hadoop world and on how helpful it can be for real-time data processing on HDFS.

If you have experience with Spark and would like to share some of your insights with our readers, kindly use the comments section.

If you are looking for information about Spark and have questions, please feel free to post them in the comments section. For private consultations, please use our contact page to get in touch with us.

Thank you for dropping by. We hope to see you back again.