Getting Started with Hadoop and Big Data

When I first started working with Apache Hadoop a couple of years ago, the ecosystem was a big and scary place. First, I had to set up Linux machines (not too bad), then get the Hadoop software (a little worse), and finally make sure I had the right version of Java installed and linked correctly (uggh!!). In 2013, however, things are a lot easier, so let’s walk through one of the fastest ways you can get started with Hadoop.

Microsoft Platforms

You can download HDInsight from Microsoft here. It runs locally on your workstation and installs Hive (a metadata store that acts as a SQL front end to HDFS, the Hadoop Distributed File System) and Pig (a procedural language for working with data in HDFS), along with a few other tools such as a job monitor. Once that’s installed, there will be a shortcut on your desktop called “Hadoop Command Prompt”, and if we launch it we see:

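In text form, the session looks roughly like the sketch below. Treat it as an illustration rather than a capture of my machine; the prompt, your HDFS home directory, and the listing output will all vary with your install and user name.

    c:\hadoop\hadoop-1.1.0-SNAPSHOT> hadoop fs -mkdir test
    c:\hadoop\hadoop-1.1.0-SNAPSHOT> hadoop fs -ls
    Found 1 items
    drwxr-xr-x   - <your-user> supergroup          0 2013-03-15 10:42 /user/<your-user>/test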

In this screenshot, I’m doing two things: creating a directory called test (hadoop fs -mkdir test) and then listing the contents of my HDFS home directory (hadoop fs -ls). Next, let’s load some data.

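Again in sketch form, assuming addresses.txt sits in the directory the command prompt opens in (substitute your own local path and file name):

    c:\hadoop\hadoop-1.1.0-SNAPSHOT> hadoop fs -put addresses.txt test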

We’ve loaded a file, addresses.txt, from the C:\hadoop\hadoop-1.1.0-SNAPSHOT directory by using the put command (the format is hadoop fs -put <local path\filename> <destination HDFS directory>, in this case test). For anyone who’s used Hadoop on Linux, you may have noticed that newer releases deprecate the hadoop dfs form of these commands in favor of hdfs dfs; this Hortonworks/Windows version, which is based on Hadoop 1.1.0, hasn’t been updated to that syntax yet. We can see where our file now lives in HDFS by running hadoop fs -ls test (where test is the directory we put the file into).

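The listing looks something like the following; the owner, size, and timestamp shown here are illustrative and will differ on your machine:

    c:\hadoop\hadoop-1.1.0-SNAPSHOT> hadoop fs -ls test
    Found 1 items
    -rw-r--r--   1 <your-user> supergroup       1284 2013-03-15 10:47 /user/<your-user>/test/addresses.txt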

As you may have noticed, the slashes and permissions resemble those of a Linux file system, and the commands we’ve been using (ls and mkdir specifically) are Linux commands. I don’t have much to add here, except that we’re working with a system that has largely been developed by Linux users and developers, and the porting to Windows can only go so far.

So how can we access this data? Let’s start with Hive. Here we’re going to create a table based on our underlying data structure in HDFS; it’s a fairly simple address table. Note that even though we run a LOAD DATA command in Hive, we aren’t physically loading the table the way a relational database would; Hive just associates the table definition with its underlying physical files in HDFS.

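I haven’t reproduced the exact column list of addresses.txt here, so treat the columns and the comma delimiter below as assumptions (and substitute your own HDFS path in the LOAD DATA statement); the shape of the statements is what matters. From the Hive command line, it looks something like this:

    CREATE TABLE addresses (
        name   STRING,
        street STRING,
        city   STRING,
        state  STRING,
        zip    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/user/<your-user>/test/addresses.txt' INTO TABLE addresses;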

You’ll note the CREATE TABLE statement looks almost exactly like T-SQL or Oracle SQL; only the data types are different and a bit more simplistic. Queries work the same way, too.

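For example, a simple query against the hypothetical addresses table sketched above might look like this (the column names are, again, just placeholders):

    SELECT state, COUNT(*) AS address_count
    FROM addresses
    GROUP BY state;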

As I mentioned, this was a small sample, and this has been just a brief introduction to interacting with HDFS and Hive. If you have any questions about using Big Data, or about how to integrate it into your existing BI landscape, please contact your Anexinet representative.