In a previous post, we talked about query strategies for distributed analytics databases. In Part 2, we’ll explain how to adjust your data storage to take advantage of distributed columnar architectures.
Data Distribution Strategies
Distributing the data based on the date or user_id is another way to make our query faster. With this strategy, worker nodes don’t need to communicate to complete the query. Each either has all of a user’s or date’s data, and that can easily be counted and later combined by the leader node. Usually, this is a big boost to performance.
However, it’s important to keep an eye out for skewed distributions. If the underlying data is skewed, query performance can get dramatically worse.
Here’s our example cluster distributed by user_id:
When a query runs to scan our activity_logs table, node #3 will finish early and node #4 will finish late. This will cause the cluster to waste resources waiting for node #4 to finish, ultimately running queries slower.
Distributing the data is only half the solution. How the data is stored also has a huge impact on improving query performance in these analysis databases.
In traditional databases like MySQL, a table’s data is stored in a single file:
Row data is contiguous, so it’s very fast to select a single row. That’s why this style of storage is great for databases serving websites. The downside is that with common analytics tasks like finding the minimum value of a column, or the sum of a column, you need to read the entire table because all of the data is in one file. That’s why these analytics databases store each column’s data separately:
Data is split into many files per column. Now finding the minimum value of a column only requires reading that column’s data. Reading less data is the number one way to speed up queries.
With each column’s data split into many files, it’s easy to distribute the column’s data around the cluster. And since each file contains the same data type (integers, dates, etc), the data can be compressed:
This 12-row table can be compressed to 3 rows with some metadata! Compression means there is less data to read, making the query even faster.
Finally, analytics databases maintain minimum and maximum values for the data in each of these files if you define sort keys:
create table app_logs sortkey(created_at) as (...); When you query based on a sort key column: select ... from app_logs where created_at > '2015-01-10'
The database can skip files it knows don’t have matching data:
And skipping these files means there is less data to read yet again.
Distributed Columnar Databases
Distributing data across commodity hardware and then optimizing the storage of that data are two key features of modern analytics databases. Armed with the knowledge of how they both work, you’ll be able to reorganize your warehouse and queries even faster!