Continued Big Data Evolution: the Greater Hadoop System


A Quick Comparison of Multiple Data Processing Engines: Tez vs Impala vs Drill vs Spark vs Flink



Recent research from BARC[1], shown in the diagram below, reveals the breadth of business areas for which Big Data is now highly relevant.  Big Data has allowed more users to find more business insights from more data in less time.  

Figure 1. BARC Research Survey Results on Big Data Use Cases (2015)


A number of Apache projects – Spark, Flume, Kafka, Drill, Storm, and Flink, to name a few – have been developed to enhance and expand Hadoop's capabilities. Additionally, DataTorrent announced this past September that Apex was accepted as an Apache incubator project[2], with the goal of improving upon Spark and Flink by addressing enterprise needs for a unified streaming and batch processing framework. This continued evolution has given users abundant choices when selecting tools to solve Big Data problems. It has also caused confusion as to which tool is best suited for the job.

This blog is an attempt to sort through the relevant Big Data technologies in the Hadoop ecosystem, based on secondary research[3].

Hadoop's Increased Rate of Releases over Time

Based on the release history from The Apache Software Foundation (ASF)[4], the open source community has been releasing Hadoop at an increasing rate over the last two years.

Hadoop Releases at a Glance (not including all minor releases)

Release Time     Released Version
December 2011    v1.0.0
May 2012         v2.0.0-alpha
October 2013     v2.2.0
February 2014    v2.3.0
April 2014       v2.4.0
August 2014      v2.5.0
November 2014    v2.6.0
April 2015       v2.7.0
July 2015        v2.7.1


From Hadoop 1.0 to Hadoop 2.0


As the most recognized big data system, Hadoop is constantly evolving. Hadoop 1.0 caught the attention of enterprise IT professionals because of its low-cost, high-capacity storage and massive computing power, but it remained a niche player applicable to specific use cases. The arrival of Hadoop 2.0 added operational management capabilities: Hadoop 2.0 distilled resource management out of MapReduce 1.0 as YARN and separated it from the data processing engine (see Figure 2 below). In this way, Apache Hadoop 2.0 has become a promising data operating system and a real threat to traditional data management systems.

Figure 2. The Evolution of MapReduce in Hadoop 1.0 vs Hadoop 2.0

With YARN, it is now possible to run multiple applications in Hadoop instead of just MapReduce. Unlike previous improvements that targeted enhancements within Hadoop, multiple frameworks that can work as stand-alone clusters are popping up to enhance or replace MapReduce, the data processing engine in Hadoop. While leveraging the Hadoop file system (HDFS) and resource management (YARN), these new data processing frameworks achieve performance and enable capabilities that are difficult to attain from Hadoop itself.

A Closer Look at Hadoop 2.0

Let's take a look at the changes in Hadoop 2.0. In order to improve the performance of MapReduce jobs sent from Hive and Pig, Hadoop 2.0 replaced the JobTracker and TaskTracker with the ResourceManager (RM), ApplicationMaster (AM), NodeManager (NM), and JournalNodes. Other operational abilities in Hadoop 2.0 include NameNode High Availability, Snapshots, and Federation.

The modern Hadoop system consists of three essential parts:  
1. The file system: HDFS
2. The resource management: YARN
3. The batch data processing engine: MapReduce


Although there are many improvements from Hadoop 1.0 to Hadoop 2.0, the most significant one is the evolution of MapReduce into MapReduce NextGen, aka MRv2 or YARN, as ASF calls it[5]. YARN stands for "Yet Another Resource Negotiator".

In Hadoop 2.0, MapReduce is now just one application among many rather than the cluster's built-in framework. This is made possible by the new resource management model, YARN, using the RM, AM, and other components described above. In addition, Mapper and Reducer slots are no longer statically predefined. Instead, containers are requested per application by the AM from the RM, or more specifically from the scheduler in the RM. This allows more efficient utilization of the cluster's resources.
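To make the container model concrete, below is a minimal Scala sketch of an ApplicationMaster asking the RM's scheduler for a container through YARN's client API. It assumes the Hadoop 2.x YARN client libraries are on the classpath; the resource sizes are arbitrary examples, and in practice this code only does real work when it runs inside an application that YARN itself has launched.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{FinalApplicationStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

object AmContainerSketch {
  def main(args: Array[String]): Unit = {
    // The AM-side client that talks to the ResourceManager's scheduler.
    val amrm = AMRMClient.createAMRMClient[ContainerRequest]()
    amrm.init(new Configuration())
    amrm.start()
    amrm.registerApplicationMaster("", 0, "")

    // Ask for one 1024 MB / 1 vcore container anywhere in the cluster --
    // no statically predefined Mapper or Reducer slots.
    val capability = Resource.newInstance(1024, 1)
    amrm.addContainerRequest(
      new ContainerRequest(capability, null, null, Priority.newInstance(0)))

    // Heartbeat the RM; granted containers come back in the allocate() response.
    val granted = amrm.allocate(0.0f).getAllocatedContainers
    println(s"Granted ${granted.size} container(s)")

    amrm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
  }
}
```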

However, despite these improvements, the bottleneck in real world applications continued to be MapReduce, the data processing engine of the Hadoop system, because it is a batch processing engine with a rigid programming paradigm. There are multiple innovations attempting to improve on or replace the MapReduce framework, such as Tez, Impala and Drill. There are also frameworks like Spark and Flink that were not designed for Hadoop but have found their way into the Hadoop cluster, working seamlessly as replacements for the MapReduce engine.

So, which engine should you use for your next job?

Quick Comparison of Data Processing Engines Used in Hadoop: Tez vs Impala vs Drill vs Spark vs Flink

Impala, Tez and Drill were all developed for Hadoop. Both Tez[6] and Impala[7] claim to have improved Hive/MapReduce speed by 10-100 times, setting aside biases in the benchmarks such as the 384 GB memory machine used for Impala. Spark and Flink were brought to Hadoop not only as more powerful data processing engines, but also with other capabilities in real-time data processing, complex query processing and machine learning. Below is a quick comparison of all of these engines.

Impala: Shipped by Cloudera, MapR, Oracle and Amazon since 2013, Impala is an open source tool developed by Cloudera to combat the slowness of Hive/MapReduce and to serve as a promising interactive SQL-on-Hadoop solution. Impala's processing engine is derived from Google's Dremel and does not build on MapReduce. Impala processes data in memory and is faster than Hive/MapReduce. It initially lacked Hive's breadth of capabilities, but has added many functions over time, such as UDFs, COMPUTE STATS and window functions for aggregation. Impala does not support mid-query fault tolerance. It supports data stored in HDFS, Apache HBase and Amazon S3, and is best used with Parquet. Depending on whom you ask, some believe Impala is better than Hive on Tez; others believe Hive on Tez is better than Impala.

Tez: Tez originated from a Microsoft research paper and was implemented mainly by Hortonworks. In July 2014, Tez became a top level Apache project.[8] Its main goal is to speed up Hive and Pig's MapReduce jobs. It is shipped with the Hortonworks distribution and supported by Microsoft HDInsight and some third party applications such as Datameer v5.0 or later. Tez uses Directed Acyclic Graphs (DAGs) and does everything MapReduce does, but faster; it also enabled interactive SQL for Hive. Tez is best used for queries on poorly defined, heterogeneous data in Hive. Because it is tightly integrated with MapReduce, it unfortunately inherited the same rigidity as MapReduce. The best use case for Tez is for heavy Hive users to speed up their query performance on Hive.

Drill: Led by MapR, Drill became an Apache top level project in December 2014[9]. Like Impala, Apache Drill is based on Google's Dremel; both are native massively parallel processing query engines on read-only data. To its advantage, and in contrast to Impala, Drill uses a schema-free document model similar to MongoDB's, so it can query non-relational data easily. Drill can discover metadata dynamically and does not have to use Hive's metastore as Impala does. However, it is less mature than Impala and has fewer functions. "It is designed for short queries," said Nitin Bandugula, Sr. Product Manager at MapR, in his blog earlier this year[10]. MapR ships its Hadoop distribution with Impala and, at the end of 2014, integrated its Hadoop distribution with both Drill and Spark.
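To illustrate the schema-free model, here is a minimal Scala sketch that queries a raw JSON file through Drill's JDBC driver, with no metastore and no schema declared up front. The Drillbit address, file path and field names are hypothetical, and it assumes the Apache Drill JDBC driver is on the classpath.

```scala
import java.sql.DriverManager

object DrillJsonQuery {
  def main(args: Array[String]): Unit = {
    // Load the Drill JDBC driver and connect to a (hypothetical) local Drillbit.
    Class.forName("org.apache.drill.jdbc.Driver")
    val conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010")

    // Query the JSON file directly -- no table definition, no metastore.
    val rs = conn.createStatement().executeQuery(
      "SELECT t.name, t.age FROM dfs.`/tmp/users.json` t WHERE t.age > 30")
    while (rs.next()) println(s"${rs.getString("name")}: ${rs.getInt("age")}")
    conn.close()
  }
}
```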

Spark: Donated to Apache by UC Berkeley and Databricks, Spark supports SQL natively. Among the data processing frameworks compared here, Spark has gained the most support from BI and DI vendors, based on vendor announcements. Spark processes data in memory and keeps data sets in memory, although it can spill to disk at the expense of slower processing speeds; as a result, data processing nodes running Spark need larger RAM. Like Tez, it uses DAGs for flexibility and speed, but deploys its own data abstraction, the RDD (Resilient Distributed Dataset), and more recently DataFrames as well. Although Spark has map and reduce capabilities, it does not share its programming design with MapReduce: the classic word count job is just 3 lines of Scala for Spark, while it would take 300 lines for Tez/MapReduce. In addition, Spark has its own ecosystem and can work as a stand-alone cluster. Besides a data processing engine that provides a SQL-on-Hadoop solution, Spark also provides pseudo-streaming (micro-batch), graph processing and machine learning capabilities. For organizations looking for wider use cases out of their Hadoop cluster, Spark could be a good fit. Although Spark can run as a standalone cluster, as in the case of many POCs, in production environments most enterprises integrate Spark with their Hadoop cluster to provide an enhanced big data solution.
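To back up the brevity claim, here is the classic word count as a self-contained Spark sketch in Scala. The HDFS paths are hypothetical, and it assumes the job is packaged and launched with spark-submit, which supplies the cluster master; the count itself is the short RDD pipeline in the middle.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val counts = sc.textFile("hdfs:///tmp/input.txt") // read lines from HDFS
      .flatMap(_.split("\\s+"))                       // split lines into words
      .map(word => (word, 1))                         // pair each word with a 1
      .reduceByKey(_ + _)                             // sum the 1s per word

    counts.saveAsTextFile("hdfs:///tmp/wordcounts")
    sc.stop()
  }
}
```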

Flink: An Apache top level project since January 2015, Flink was donated by the Technical University of Berlin as a general purpose data processing engine like Spark. Compared to Spark, Flink's main advantages are: 1) it allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently; 2) it is built with YARN in mind, so it works with Tez and allows existing MapReduce jobs to run directly on the Flink engine; and 3) Flink avoids the memory spikes typically seen in a Spark cluster by managing its own memory resource requirements. Although Flink has a further advantage as a true streaming engine rather than a micro-batch processor, most real world streaming problems can be solved sufficiently with the micro-batch type streaming that Spark provides. On the down side, Flink is currently less mature and unproven in production. It has a smaller group of committers and does not offer many readily available libraries for graph processing or machine learning.
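To show what record-at-a-time streaming looks like in contrast to micro-batching, below is a minimal Flink streaming word count sketch in Scala. The socket source on localhost:9999 is hypothetical (e.g. fed by netcat), and it assumes a recent Flink release on the classpath; each incoming record updates the running counts immediately rather than waiting for a batch boundary.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Count words on an unbounded socket stream, one record at a time.
    env.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\s+")) // split each incoming line into words
      .map((_, 1))                          // pair each word with a 1
      .keyBy(0)                             // partition the stream by word
      .sum(1)                               // keep a running count per word
      .print()                              // emit every update as it happens

    env.execute("StreamingWordCount")
  }
}
```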

In summary, Tez is a framework for purpose-built tools such as Hive and Pig. It is simple to install, as it is a client side application with no deployment to the cluster needed. Impala has an easy to use, SQL-like query engine with many analytical functions and is currently the most mature SQL-on-Hadoop solution. Compared to Tez or Impala, the major advantage of Drill is its schema-free model, which lets it get to data in NoSQL stores, like MongoDB, more quickly, without modeling it first. As a less mature solution, Drill still lacks some of the analytical functions that Impala offers.

Spark is a general purpose engine with APIs readily available for developers to write applications in a language of their choice, such as Scala, Python or Java. Spark has the most support in the open source community, with an unprecedentedly large group of over 300 committers from many companies, including Yahoo, Intel, Facebook, Cloudera, Hortonworks, Netflix and Alibaba, besides UC Berkeley and Databricks. Only time will tell whether Flink will keep its advantages as it matures, or whether Spark's army of committers will help Spark bypass Flink.

Spark and Flink are not only able to access data from different sources (SQL, NoSQL, HDFS or other files), but also bring more than just SQL-on-Hadoop: they provide machine learning, streaming and graph processing. Impala, Hive and Drill provide standard JDBC or ODBC connectors for easy integration with traditional BI tools like Tableau, QlikView, MicroStrategy, etc. Those BI tool vendors, however, have also announced their support for Spark.
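As a sketch of that integration path, the snippet below connects to HiveServer2 over plain JDBC from Scala, which is essentially the mechanism those BI tools use. The host, credentials and the sales table are hypothetical; it assumes the Hive JDBC driver is on the classpath, and Impala and Drill expose analogous JDBC drivers of their own.

```scala
import java.sql.DriverManager

object BiStyleQuery {
  def main(args: Array[String]): Unit = {
    // Load the HiveServer2 JDBC driver and open a standard JDBC connection.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")

    // An ordinary aggregate query, exactly as a BI tool would issue it.
    val rs = conn.createStatement().executeQuery(
      "SELECT region, COUNT(*) AS orders FROM sales GROUP BY region")
    while (rs.next()) println(s"${rs.getString(1)}: ${rs.getLong(2)}")
    conn.close()
  }
}
```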

Outside of the comparisons above, there are closed source data processing engines like Big SQL from IBM, SQL-H from Teradata, the open source HAWQ from Pivotal, the non-supported open source Presto from Facebook, and Kinesis from Amazon. There are also new Apache projects like Apache Ignite and the two month old Apache Apex that are growing up to challenge both Spark and Flink.

What is Next for Hadoop?

Many people in the industry have been asking where Hadoop will go from here and which framework they should choose. Right now it looks like Hadoop is becoming more of a platform that integrates various Big Data technologies and supports them with the Hadoop file system (HDFS), resource management (YARN) and other maturing operational functions (such as security and administration) that Hadoop 2.0 provides.

With new frameworks continuously being added to Hadoop, reasonable conclusions would include: 
1) Hadoop is here to stay and expand.  
2) Depending on the jobs and the problems to be solved, Big Data solutions may include Hadoop as well as other frameworks working in sync with Hadoop.   
3) There is a continuous flow of new data processing frameworks being brought into Apache that can be used to enhance and expand Hadoop's capabilities, gradually turning the next generation of Hadoop into Hadoop + other frameworks. We may call it... the Greater Hadoop System!

So, keep an eye out for new Apache projects that are still being created, and choose the data processing framework that is best suited for the job, the existing system environment, your future expansion needs… whatever your situation might be. One thing is for sure: more and more Big Data challenges are becoming easier and easier to solve.


