Continued Big Data Evolution: the Greater Hadoop System
A Quick Comparison of Multiple Data Processing Engines: Tez vs Impala vs Drill vs Spark vs Flink
Recent research from
BARC[1], shown in the diagram below, reveals the breadth of business areas for which Big Data is now highly relevant. Big Data has allowed
more users to find more business insights from more data in less time.
Figure 1. BARC Research Survey
Results on Big Data Use Cases (2015)
A number of Apache
projects – Spark, Flume, Kafka, Drill, Storm, and Flink, to name a few – have
been developed to enhance and expand Hadoop’s capabilities. Additionally, DataTorrent
announced this past September that Apex had been accepted as an Apache incubator project[2],
with the goal of improving upon Spark and Flink by addressing enterprise needs
for a unified streaming and batch processing framework. This continued evolution has provided users with
abundant choices when selecting tools to solve Big Data problems. It has also caused confusion as to
which tool is best suited for the job.
This blog is an attempt
to sort through the relevant Big Data technologies in the Hadoop ecosystem, based on
secondary research[3].
Hadoop’s Increased Rate of Releases over Time
Based on the
releases from The Apache Software Foundation (ASF)[4],
the open source community has released Hadoop at an increased rate in the last
two years.
Hadoop Releases at a Glance (not including all minor releases)
Release Time | Released Version
December 2011 | v1.0.0
May 2012 | v2.0.0-alpha
October 2013 | v2.2.0
February 2014 | v2.3.0
April 2014 | v2.4.0
August 2014 | v2.5.0
November 2014 | v2.6.0
April 2015 | v2.7.0
July 2015 | v2.7.1
From Hadoop 1.0 to Hadoop 2.0
As the
most recognized big data system, Hadoop is constantly evolving. Hadoop 1.0 caught the attention of enterprise IT professionals because of its low-cost, high-capacity storage and massive computing
power. It was a niche
market player that was applicable to specific use cases.
However, the arrival of Hadoop 2.0 added operational management capabilities. Hadoop 2.0 distilled resource management out
of MapReduce 1.0 as YARN and separated it from the data processing engine (see
Figure 2 below). In this way, Apache Hadoop 2.0 has become a promising data operating system with the ability to pose a real
threat to traditional data management systems.
Figure 2. The Evolution of MapReduce in Hadoop 1.0
vs Hadoop 2.0
With YARN, it is
now possible to run multiple applications in Hadoop instead of just
MapReduce. Unlike previous improvements that targeted enhancements within Hadoop, multiple frameworks that can work as stand-alone clusters are popping
up to enhance or replace MapReduce, the data processing engine in Hadoop.
While leveraging the Hadoop file system (HDFS) and resource management
(YARN), these new data processing frameworks achieve performance and capabilities that are difficult to attain with Hadoop itself.
A Closer Look at Hadoop 2.0
Let’s take a look at the changes in Hadoop 2.0.
In order to improve the performance of MapReduce jobs sent from Hive and
Pig, Hadoop 2.0 replaced the JobTracker and TaskTracker with the ResourceManager
(RM), ApplicationMaster (AM), NodeManager (NM), and JournalNodes. Other operational capabilities in Hadoop 2.0
include NameNode High Availability, Snapshots and Federation.
The modern Hadoop system consists of three essential parts:
1. The file system: HDFS
2. The resource management: YARN
3. The batch data processing engine: MapReduce
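To make this layering concrete, here is a small Scala sketch that touches only the storage layer through the standard Hadoop FileSystem client; the NameNode address and the /user/demo path are placeholders, and the Hadoop client libraries are assumed to be on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  def main(args: Array[String]): Unit = {
    // Point the client at the NameNode; the host name is a placeholder.
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020")

    // HDFS is only the storage layer; YARN and the processing engines sit above it.
    val fs = FileSystem.get(conf)
    fs.listStatus(new Path("/user/demo")).foreach { status =>
      println(s"${status.getPath}\t${status.getLen} bytes")
    }
    fs.close()
  }
}
```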
Although there
are many improvements from Hadoop 1.0 to Hadoop 2.0, the most significant one is
the evolution of MapReduce into MapReduce NextGen, aka MRv2 or YARN, as the ASF calls it[5]. YARN stands for “Yet Another Resource
Negotiator”.
In Hadoop 2.0, MapReduce
is now one application instead of two.
This is made possible by the new resource management model, YARN, using the RM, AM and other components described above. In addition, Mapper
and Reducer slots are no longer statically predefined. Instead, containers are requested per application by the AM from the RM, and more specifically from the scheduler within the
RM. This allows more efficient resource
utilization of the cluster.
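As a rough illustration of that per-application request flow, the Scala sketch below uses YARN's AMRMClient to register with the ResourceManager and ask for a single container. It is only a sketch: a real ApplicationMaster is launched by YARN itself, deals with security tokens, and calls allocate() in a heartbeat loop, and the 1 GB / 1 vcore sizing is an arbitrary example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import scala.collection.JavaConverters._

object ContainerRequestSketch {
  def main(args: Array[String]): Unit = {
    // The ApplicationMaster talks to the ResourceManager's scheduler via AMRMClient.
    val rm = AMRMClient.createAMRMClient[ContainerRequest]()
    rm.init(new Configuration())
    rm.start()
    rm.registerApplicationMaster("", 0, "")

    // Containers are sized per application (here 1024 MB, 1 vcore)
    // instead of using fixed, statically predefined map/reduce slots.
    val capability = Resource.newInstance(1024, 1)
    rm.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)))

    // The scheduler grants containers on a later allocate() heartbeat.
    val granted = rm.allocate(0.1f).getAllocatedContainers.asScala
    granted.foreach(c => println(s"Granted container ${c.getId} on node ${c.getNodeId}"))
  }
}
```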
However, despite these improvements, the bottleneck in real world applications continued to
be MapReduce, the data processing engine of the Hadoop system, due to its
nature as a batch processing engine and its rigid programming paradigm. There are multiple innovations attempting to
improve or replace the MapReduce framework, such as Tez, Impala and Drill. There
are also frameworks like Spark and Flink that were not designed for Hadoop but have found their way into the Hadoop cluster, working seamlessly as replacements for the MapReduce engine.
So, which engine should you use for your next
job?
Quick Comparison of Data Processing Engines Used in Hadoop: Tez vs Impala vs Drill vs Spark vs Flink
Impala, Tez and Drill
were all developed for Hadoop. Both Tez[6]
and Impala[7] claimed to have improved Hive/MapReduce
speed by 10-100 times, setting aside biases in the benchmarks, such as the 384
GB memory machine used for Impala. Spark and Flink were brought to Hadoop not
only as more powerful data processing engines, but also with other capabilities in
real-time data processing, complex query processing and machine learning. Below is a quick comparison among all of these engines.
Impala: Shipped by Cloudera, MapR,
Oracle and Amazon since 2013, Impala is an open source tool developed by Cloudera to combat the slowness of
Hive/MapReduce and to work as a promising interactive SQL-on-Hadoop solution. Impala
includes a processing engine that is derived from Google Dremel and does not build
on MapReduce. Impala processes data in memory and is faster than Hive/MapReduce.
It initially lacked Hive’s breadth of capabilities, but has added
many functions over time such as UDFs, COMPUTE STATS and window functions for
aggregation. Impala does not support
mid-query fault tolerance. It supports data stored in HDFS, Apache HBase and
Amazon S3. Impala is best used with
Parquet. Depending on whom you talk to, some believe that Impala may be better than Hive on Tez, while others believe that Hive on Tez is better than Impala.
Tez: Tez originated from a Microsoft research paper and was implemented mainly by Hortonworks. In July 2014, Tez became
a top level Apache project.[8] Its main goal is to improve Hive and Pig’s MapReduce jobs. It is shipped with Hortonworks and supported
by Microsoft HDInsight and some third party applications like Datameer v5.0 or
later. Tez uses Directed Acyclic
Graphs (DAGs) and does everything MapReduce does, but faster.
Tez enabled interactive SQL for Hive.
Tez is best used for queries on poorly
defined heterogeneous data in Hive.
It is tightly integrated with MapReduce and, unfortunately, inherited the same rigidity as MapReduce.
The best use case for Tez is heavy Hive users who want to speed up their query performance on Hive.
Drill: Led by MapR, Drill became an Apache top level project in December 2014[9].
Like Impala, Apache Drill is based on Google’s Dremel; both are native massively parallel processing query engines for read-only data.
To its advantage, and in contrast to Impala, Drill uses a schema-free document model similar to MongoDB’s, so it can query non-relational data easily.
Drill can
discover metadata dynamically and does not have to use Hive’s metastore like
Impala.
However, it is less mature than Impala and has fewer functions.
“It is designed for short queries,” said Nitin Bandugula, Sr. Product Manager at MapR, in his blog earlier this year[10].
MapR ships its Hadoop distribution with Impala and, at the end of 2014, also integrated it with both Drill and Spark.
Spark: Donated by UC Berkeley and Databricks,
Spark supports SQL natively.
Among the data processing frameworks compared here, Spark has gained
the most support from BI and DI vendors, based on vendor announcements. Spark processes data in memory and tries to keep data sets in memory, although it can spill to disk at the expense of slower processing speeds when needed. As a result, data processing nodes typically need more RAM for Spark. Like Tez, it uses DAGs that allow
flexibility and speed, but it deploys its own data format, the RDD (Resilient
Distributed Dataset), and more recently DataFrames as well. Although Spark has map and reduce capability,
it does not share its programming design with MapReduce. The classic word count job is just 3 lines of Scala for Spark while it would take 300
lines for Tez/MapReduce (see the sketch at the end of this section). In addition, Spark
has its own ecosystem and can work as a stand-alone cluster. Besides serving as a data processing engine that
provides a SQL-on-Hadoop solution, Spark also offers pseudo-streaming
(micro-batch), graph processing and machine learning capabilities. For organizations looking for wider use cases
out of their Hadoop cluster, Spark could be a good fit. Although Spark can be used as a standalone
cluster, as in the case of many POCs, in
production environments, most enterprises integrate Spark with their Hadoop
cluster to provide an enhanced big data solution.
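To make the brevity claim above concrete, here is a minimal word count in Scala against Spark's core RDD API, roughly as it looked in the Spark 1.x era. The input and output paths are placeholders, and the sketch assumes the job is packaged and submitted with spark-submit to a YARN or standalone Spark cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // The three transformation lines below are the entire computation;
    // the HDFS paths are placeholders.
    val counts = sc.textFile("hdfs:///user/demo/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///user/demo/output")
    sc.stop()
  }
}
```

The same RDD code runs unchanged whether Spark is deployed standalone or on YARN, which is part of what makes integrating it with an existing Hadoop cluster straightforward.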
Flink: An Apache top level project since January 2015, Flink was
donated by the Technical University of Berlin as a general purpose data processing engine like Spark. Compared to Spark, Flink’s main advantages
are: 1) it allows iterative processing to take place on the same nodes rather
than having the cluster run each iteration independently; 2) it is built with YARN in mind, so it
works with Tez and allows existing MapReduce jobs to run directly on the Flink
engine; and 3) Flink avoids the memory spikes
typically seen in a Spark cluster by managing its own memory resource requirements. Although Flink may have another advantage as a
true streaming engine instead of micro-batch processing, most of the real world
streaming problems could be solved sufficiently with micro-batch type streaming
that Spark provides. On the downside, Flink
is currently less mature and unproven in production. It has a smaller group of committers and does
not offer many readily available libraries for graph processing or machine
learning.
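For a side-by-side feel with the Spark sketch above, here is the same word count written against Flink's batch (DataSet) Scala API of that era; the input path is a placeholder and the snippet assumes the flink-scala dependency is on the classpath.

```scala
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Same word count as the Spark sketch, expressed as a Flink DataSet program.
    val counts = env.readTextFile("hdfs:///user/demo/input")
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    // print() triggers execution and writes the result to stdout.
    counts.print()
  }
}
```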
In summary, Tez
is a framework for purpose-built tools such as Hive and Pig. It is
simple to install as it is a client-side application with no deployment to the
cluster needed. Impala has an easy-to-use
SQL-like query engine with many analytical functions and is currently the most mature
SQL-on-Hadoop solution. Compared
to Tez or Impala, the major advantage of Drill is its schema-free model, which
allows it to get to data in NoSQL stores, like MongoDB, more quickly without
modeling it first. As a less mature
solution, Drill still lacks some of the analytical functions that Impala
offers.
Spark is a general purpose engine with APIs readily
available for developers to write applications in a language of their choice
such as Scala, Python and Java. Spark
has the most support in the open source community, with an unprecedentedly large
group of over 300 committers from many companies including Yahoo, Intel,
Facebook, Cloudera, Hortonworks, Netflix and Alibaba, in addition to UC Berkeley
and Databricks. Only time will tell if Flink will continue to have
those advantages as it matures or if Spark’s army of committers will help Spark bypass Flink.
Spark and Flink are not only able to access data from
different sources (SQL, NoSQL, HDFS or other files), but also bring more than
just SQL-on-Hadoop. They provide
machine learning, streaming and graph processing. Impala, Hive and Drill provide
standard JDBC or ODBC connectors for easy integration with traditional BI tools like Tableau, QlikView, MicroStrategy, etc. However, those BI tool vendors have also announced support for Spark.
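As a minimal sketch of that JDBC path, the snippet below queries HiveServer2 with nothing but the standard java.sql API plus the Hive JDBC driver on the classpath; the host, database, user and the sales table are hypothetical, and Impala and Drill expose analogous JDBC drivers with their own connection URLs.

```scala
import java.sql.DriverManager

object HiveJdbcQuery {
  def main(args: Array[String]): Unit = {
    // Load the HiveServer2 JDBC driver (optional with JDBC 4 auto-loading).
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Host, port, database, user and table below are placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver-host:10000/default", "demo_user", "")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")
      while (rs.next()) {
        println(s"${rs.getString(1)}\t${rs.getLong(2)}")
      }
    } finally {
      conn.close()
    }
  }
}
```

The same pattern is what BI tools use under the hood, which is why these engines plug into existing reporting stacks so easily.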
Outside of the comparisons above, there are closed-source
data processing engines like Big SQL from IBM, SQL-H from Teradata, open source
HAWQ from Pivotal, non-supported open source Presto from Facebook, and Kinesis from
Amazon. There are also new Apache projects
like Apache Ignite and the two-month-old Apache Apex that are growing
to challenge both Spark and Flink.
What is Next for Hadoop?
Many people in the
industry have been asking where Hadoop will go from here and which framework
they should choose. Right now it looks
like Hadoop is becoming more of a platform that integrates various Big Data technologies and supports them with the Hadoop file system (HDFS), resource
management (YARN) and other maturing operational functions (such as security
and administration) that Hadoop 2.0 provides.
With new frameworks continuously being added to Hadoop, reasonable conclusions would
include:
1) Hadoop is here to stay and expand.
2) Depending on the jobs and the problems to be
solved, Big Data solutions may include Hadoop as well as other frameworks working
in sync with Hadoop.
3) There is a continuous flow of new data processing frameworks being brought into Apache
that can be used to enhance and expand Hadoop’s capabilities, gradually shaping the next generation of Hadoop into Hadoop + other
frameworks. We may call it... the Greater Hadoop System!
So, keep an eye
out for new Apache projects that are still being created and choose a data processing
framework that is best suited for the job, the existing system environment,
your future expansion needs… whatever your situation might be. One thing is for sure: more and more Big
Data challenges are becoming easier and easier to solve.
Labels: apache drill, Big Data, Drill vs Impala, Flink vs Spark, Hadoop, Impala, Spark, Spark vs Flink, Tez