In a previous post, we talked about query strategies for distributed analytics databases. In Part 2, we’ll explain how to adjust your data storage to take advantage of distributed columnar architectures.

Data Distribution Strategies

Distributing the data based on the date or user_id is another way to make our query faster. With this strategy, worker nodes don’t need to communicate to complete the query. Each either has all of a user’s or date’s data, and that can easily be counted and later combined by the leader node. Usually, this is a big boost to performance.

However, it’s important to keep an eye out for skewed distributions. If the underlying data is skewed, query performance can get dramatically worse.

Here’s our example cluster distributed by user_id:

Example cluster

When a query runs to scan our activity_logs table, node #3 will finish early and node #4 will finish late. This will cause the cluster to waste resources waiting for node #4 to finish, ultimately running queries slower.

Columnar Storage

Distributing the data is only half the solution. How the data is stored also has a huge impact on improving query performance in these analysis databases.

In traditional databases like MySQL, a table’s data is stored in a single file:

Single table file

Row data is contiguous, so it’s very fast to select a single row. That’s why this style of storage is great for databases serving websites. The downside is that with common analytics tasks like finding the minimum value of a column, or the sum of a column, you need to read the entire table because all of the data is in one file. That’s why these analytics databases store each column’s data separately:

Column data

Data is split into many files per column. Now finding the minimum value of a column only requires reading that column’s data. Reading less data is the number one way to speed up queries.

Compressing Columns

With each column’s data split into many files, it’s easy to distribute the column’s data around the cluster. And since each file contains the same data type (integers, dates, etc), the data can be compressed:

Compressed data

This 12-row table can be compressed to 3 rows with some metadata! Compression means there is less data to read, making the query even faster.

Sorting Columns

Finally, analytics databases maintain minimum and maximum values for the data in each of these files if you define sort keys:

create table
app_logs sortkey(created_at)
as (...);
When you query based on a sort key column:
select ...
from app_logs
where created_at > '2015-01-10'

The database can skip files it knows don’t have matching data:

Sort keys

And skipping these files means there is less data to read yet again.

Distributed Columnar Databases

Distributing data across commodity hardware and then optimizing the storage of that data are two key features of modern analytics databases. Armed with the knowledge of how they both work, you’ll be able to reorganize your warehouse and queries even faster!