Accelerate Your Data Warehouse Performance with Columnar Databases

When it comes to Big Data the first thing a lot people think of is Hadoop and MapReduce (M/R), and that's a very valid assumption. Hadoop and the other products in it's ecosystem (Hive, Pig, Zookeeper) do drive many of the big data projects in the world today. However, MapReduce is all about taking very large data sets, and doing batch analysis of them--if we think relationally, M/R is basically doing a large amount of full table scans. So what to we do if what want to do real time analytics against a subset of that data? Think columnar. If you need to do frequent rapid updates, columnar data, probably isn't for you--all that compression is expensive from a CPU perspective.

What is a columnar database? Think of our traditional relational table, that is laid out in rows. Well a columnar database is a database that stores data in a column oriented format. From a database perspective, particularly a data warehouse and analytic workload perspective it is much more efficient to run aggregates across columns of data, rather than rows. Additionally, from a data compression perspective, columns (with much more similar data than rows) will compress data far more efficiently than row based compression. This allows for even fewer reads against a given data set to achieve a result.

So what are the columnar databases that are commercially available and how do they fit into your big data equation?

HP Vertica
Microsoft SQL Server 2012/2014 Columnstore Indexes
Apache HBase (open source)
Datastax Cassandra (open source)
Paraccel

The advantages of offerings from Microsoft and HP offer a lot more commercial support than the open source offerings, and are more closely related to relational offerings, so they can be easier to deal with. The limit of SQL Server (at least in 2012--this is fixed in the soon to be released SQL 2014) is that columnstore indexes cannot be updated, if you insert new records, you need to rebuild the indexes. Vertica is a very robust solution that integrates nicely with other big data solutions like Hadoop.