Building a Hadoop Cluster using Cloudera Manager

Starting from scratch and building a Hadoop cluster can be a scary thing. There are a lot of dependencies, such as Java versions, and there are many components in the Hadoop ecosystem, like Hive, Oozie, ZooKeeper, and Impala. Installing each of those packages and getting them to compatible versions can be an exercise in tedium, and once you're up and running you still have to keep every one of them current. It's enough to give anyone a headache.

Fortunately, our friends at Cloudera have put together a great solution called Cloudera Manager, which takes care of many of these dependencies for us. It also has the benefit of being free with Cloudera's standard release of Hadoop. The enterprise edition of the software, which is available via subscription, adds some nice features like LDAP integration, rolling updates, and automated disaster recovery. The details are available here, but for the purposes of getting started in your lab environment the standard edition is great.

Here at Anexinet, we've been building out some Big Data concepts in our lab environment, so we started out with four VMs: one will be our Name Node, which controls the cluster, and the other three will be our Data Nodes. To each data node we added a large additional LUN to store the data. Hadoop has some specific best practices for formatting and mounting disk devices in Linux, so send your sysadmin here to review the recommended mount options for the data disk(s).
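As a minimal sketch of that disk prep, here is how one of those data LUNs might be formatted and mounted. The device name (/dev/sdb) and mount point (/data/1) are assumptions for illustration; ext4 mounted with noatime and no reserved blocks is a commonly cited Hadoop recommendation, but check the Cloudera documentation for your version:

# Assumes the new LUN shows up as /dev/sdb; confirm with lsblk first
$ sudo mkfs.ext4 -m 0 /dev/sdb
$ sudo mkdir -p /data/1
$ sudo mount -o noatime /dev/sdb /data/1
# Persist the mount across reboots
$ echo '/dev/sdb  /data/1  ext4  defaults,noatime  0 0' | sudo tee -a /etc/fstab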

We started out by installing our standard CentOS image; it has the great advantage of being free, and it is broadly compatible with Hadoop solutions. Unlike Oracle or Microsoft clusters you may have worked with in the past, no clusterware or cluster-specific configuration is required for a Hadoop cluster, and with Cloudera Manager the build process is incredibly easy. The first step is to go here and download Cloudera Standard onto the machine you want to be the manager. In a production environment this should be a dedicated machine, but for our lab we are using the machine we are planning to use for the Name Node. You will need root SSH access to all of the machines in your cluster, plus access either to the internet or to a local package repository reachable from all of the machines.
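Fetching the installer and spot-checking SSH access can both be done from the command line. The download URL below is an assumption (grab the current link from Cloudera's download page), and the hostname is just an example:

# URL is illustrative; use the current link from Cloudera's download page
$ wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
# Confirm root SSH access to each node in the cluster
$ ssh root@datanode1.example.com hostname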

After downloading the file (cloudera-manager-installer.bin), you will want to make it executable. From the command line of the host where you are working, issue the following command:

$ chmod u+x cloudera-manager-installer.bin

Then, from a terminal emulator or console session, execute that file, as shown after this list. Two things to make note of:

- SELinux is disabled during the process
- iptables (Linux's firewall) is either disabled or opened up for all of the ports Hadoop requires
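Running the installer and then spot-checking those two items might look like the following on CentOS; getenforce and the iptables service status are the standard checks there:

$ sudo ./cloudera-manager-installer.bin
# Afterward, confirm the SELinux mode and firewall state
$ getenforce
$ sudo service iptables status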

Once the installation of the manager completes, you can log in to the Manager at:

http://myhost.example.com:7180/
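If the page doesn't come up, a quick check that the manager is actually listening on port 7180 can save some head-scratching (substitute your own hostname):

# On the manager host, confirm port 7180 is listening
$ sudo netstat -tlnp | grep 7180
# Or request the login page from your workstation
$ curl -I http://myhost.example.com:7180/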

The default login is admin/admin. From the manager, you’ll have the option to add hosts to your cluster.
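One handy detail: the Add Hosts search box accepts ranges as well as individual names or addresses, and in a lab without DNS it helps to have consistent /etc/hosts entries on every node. The addresses and hostnames below are purely illustrative:

# Range patterns the host search accepts
10.1.1.[1-4]
host[1-4].example.com

# Example /etc/hosts entries, repeated on every node in a lab without DNS
10.1.1.1   namenode.example.com    namenode
10.1.1.2   datanode1.example.com   datanode1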

Add the IP addresses of your hosts (and change the SSH port if needed) and click next. Cloudera Manager will verify that the hosts exist and are reachable over the network, and then install the packages you select onto your cluster. It really is that easy. If you need more details from Cloudera you can go here, or contact Anexinet for help getting started with your big data project.