Drilling Data with Apache Drill

Apache Drill is a SQL query engine for Big Data exploration, defined on their website. Drill allows you to perform analysis on semi-structured data in Big Data applications using ANSI SQL. What differentiates Drill from other exploratory tools is its ability to query data sources that do not have a defined schema on a distributed architecture. I am looking at you Data Lake.

Drill has plug and play support for Hive and HBase as well. If you have an architecture with Hive, you can run interactive queries on these tables and access all Hive input/output formats. Drill also allows you to join tables across Hive metastores and to other types of data like HBase tables and log files all in a familiar SQL syntax. Here is an example provided on their website:

One last point to mention before we drill some data, is its high performance capabilities. Drill can run as a single node on a laptop and scale up to a 1000-node cluster. Data is aggregated in memory on the cluster and when memory is exhausted, data is automatically swapped to disk to maintain availability.

Example: Beer Advocate Beer Reviews

In this example we are going to use some data extracted from the excellent Beer Advocate website. I have 3 CSV files - beer_features, beer_ratings and beer_reviews. Below are the schemas.

Beer Features

Beer Ratings

Beer Reviews

Apache Drill Setup

Before we begin using Drill for querying CSV files, there is a configuration we will change to make the process easier. By default, Drill will read the header line of a CSV file as a line of data.

We don’t want the header as data in our example, instead we want Drill to skip this line. In order to configure this, we need to open up the Web Console, which by default is found here: http://localhost:8047. Click on Storage on the menu bar and will will see the following:

We are going to work with the dfs plugin, which stands for distributed filesystem. Click on the Update button and a JSON configuration file will appear. Scroll down to the ”csv” entry and it should look similar to this.

We are going to add a property called “skipFirstLine” that will skip the header line in CSV files.

Make this change and click on Update and you will see a quick notification that says success. If we run the same query as before we should see two data rows and no header.

Drilling CSV Files

There are two main ways we can use Drill for querying data. As you saw above, there is a Web Console that provides a simple interface for administration and development. You can also run Drill in embedded mode and use a console to query your data sources.

Embedded Mode

To run Drill in embedded mode, you need to change into the directory where Drill is located, or put the Drill bin directory in your PATH variable. Then run drill-embedded and you will get a Drill prompt.

Web Console

To use Drill in the Web Console, you need to enter embedded mode first, then navigate to http://localhost:8047. Once you click on the Query menu item, you will get an interface for entering queries in your browser.

Here is the query that was used above to retrieve the two rows of data:

As you can see, the query follows the ANSI SQL syntax and the FROM clause uses the dfs plugin for the filesytem. Drill also allows you to perform more complex SQL queries as well.
In this example, we will get the average beer rating for Sierra Nevada Pale Ale.

Looking at the above query, you can see we can perform joins across CSV files as well as use aggregate functions. Also, notice the liberal use of columns[n] syntax. This syntax is needed because we did not define a schema for these files, so columns need to be referenced by ordinal numbers.

Query Plans

The Drill Web Console, which we used above to configure the CSV storage format, has additional features regarding the query plans used by the optimizer. If you go to the Web Console at http://localhost:8047 and click on the Profiles menu, you will see a list of queries that were run.

Clicking on the Query will open the Query and Planning screen.

This screen gives you the option of looking at the Physical or Visualized plan used by the optimizer. Below is an example of the visualized plan.

This is nice to look at to show what steps were performed to arrive at the query, but it does not give us any detail. To gain a more comprehensive insight, you will need to scroll down to the Operator Profiles section.

I trimmed down the profile to highlight the highest cost operations used in the query plan. As you can see by the profile, the optimizer used a HASH_JOIN for our inner join and used 5MB of memory to do so. Additionally, a HASH_AGGREGATE was used for our group by clause which also consumed 5MB of memory. The last operation worth pointing out is the TEXT_SUB_SCAN. A scan was used in this case to scan the beer_features.csv file to look for the beer Sierra Nevada Pale Ale.

Conclusion

As you can see in this basic example, Drill provides the ability to work with files on the filesystem and perform SQL operations to query, join, and aggregate the data. Drill also exposes performance and tuning features in a web application to allow the user see exactly what is going on under the covers.

There are many uses cases for this tool, like interactive querying and schema-on-the-fly, but you should evaluate all the SQL-on-Hadoop options (Drill, Impala, Hive, Spark SQL) to see what works best for your problem.

Ed.

Labels: apache drill, sql