Having seven years of experience with managing Redshift, a fleet of 335 clusters, combining for 2000+ nodes, we (your co-authors Neha, Senior Customer Solutions Engineer, and Chris, Analytics Manager, here at Sisense) have had the benefit of hours of monitoring their performance and building a deep understanding of how best to manage a Redshift cluster. This blog is intended to give an overview of the considerations you’ll want to make as you build your Redshift data warehouse to ensure you are getting the optimal performance.
Redshift, like BigQuery and Snowflake, is a cloud-based distributed multi-parallel processing (MPP) database, built for big data sets and complex analytical workflows. Fundamentally they are different than transactional databases we’ve seen in the past, and before we jump into how to build your data warehouse, it’s important to understand that distinction. So let’s dive in!
OLTP vs OLAP
First, we’ll dive into the two types of databases: OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing). An OLAP database is best for situations where you read from the database more often than you write to it. Think of it like something that houses the metrics used to power daily, weekly, or monthly business KPIs. OLAP databases excel at queries that require large table scans (e.g. roll-ups of many rows of data). Redshift is a type of OLAP database.
On the other hand, OLTP databases are great for cases where your data is written to the database as often as it is being read from it. As the name suggests, a common use case for this is any transactional data. Let’s say, for example, you’re keeping track of the amount of money across several accounts. The data needs to be real-time and updated whenever there is a deposit or withdrawal. After all, we would want to detect any fraudulent transactions right away! OLTP databases are best at queries where we are doing point scans or short scans of the data, think “return the number of deposits by X user this week.”
Comparing OLAP and OLTP databases can be a conversation of its’ own, but here’s a quick summary of their differences for reference.
|Write once, read many||Write many, read many|
|Best for large table scans||Best for short table scans|
|Typically a warehouse that collects data from other sources||Usually one of the sources that feed into a warehouse|
|Petabyte level storage||Terabyte level storage|
|Columnar based storage||Row-based storage|
|Lower concurrency||Higher concurrency|
|Examples: Redshift, Bigquery, Snowflake||Examples: Postgres, MySQL|
Let’s say your use case fits an OLAP database. Then, you may be one of many who opt to use a Redshift Warehouse. There are a few settings available that can tailor the performance of your Redshift queries. We will review these below.
Choosing a Cluster
As your warehouse will be the central point from which you deliver data to your business, it is important to choose the storage and performance levels that will match the needs of your business.
Selecting the Right Nodes
The first step in setting up your Redshift cluster is selecting which type of nodes you’ll want to use. This selection will be the biggest driver for the performance of your warehouse, so you’ll want to consider the end user’s needs when making this decision.
Redshift offers four options for node types that are split into two categories: dense compute and dense storage. Amazon describes the dense storage nodes (DS2) as optimized for large data workloads and use hard disk drives (HDD) for storage. The dense compute nodes are optimized for performance-intensive workloads and utilize solid state drives (SSD) to deliver faster I/O, but with the downside of less storage space per node.
Within each of these types there is a small and large node type:
When it comes to picking your node type, while it is easy to see the storage of the Dc2.8xlarge and choose it because it is the largest, if you don’t have the need for that node type, you risk losing out on the benefits of having a distributed database that can process your query in parallel across all of your nodes. The other consideration to make here is that the smallest configuration for a dc2.8xlarge is two nodes, so unless you have the storage needs, we don’t recommend jumping to that option initially.
As a Redshift cluster scales, if you find that it slows down when you have 30 dc2.xlarge nodes, this may be a good time to consider moving to the dc2.8xlarge. At a certain point, a Redshift cluster’s performance slows down as it tries to pass data back and forth between the nodes during query execution. To overcome this I/O hurdle, you can reduce the number of nodes, but maintain the power and storage by opting for the larger dc2.8xlarge. We’ve found the equivalent performance when using a 16:1 ratio of dc2.xlarge nodes to dc2.8xlarge nodes.
Planning for the Right Data Volume
The next step in building your warehouse is to determine the number of nodes your Redshift cluster will need. The number of nodes you have will control how much storage space you have available in your cluster, and as such your data volume will drive this decision.
Across the Sisense redshift fleet, we’ve found that at about 80% disk consumption, you begin to see a degradation in query performance. You’ll want to consider your total data volume as it will be stored on the cluster, and add an additional 30-40% of space. This overhead is consumed by ETL jobs rebuilding tables, complex queries that write temp tables to disk during execution, copying uncompressed data, and spikes in source data volume. Depending on the purpose of the Redshift cluster, degradation in performance can be extremely painful. For example, if you are using Redshift solely for analytics purposes, you can scale the cluster up with more nodes when this happens and resume work once it is complete. Inversely though, if your cluster is used for production reporting (i.e. dashboards), it can leave your consumers frustrated with their experience. With that in mind, consider your use case when planning for cluster sizes.
Cluster Performance Configurations
One of the main settings to configure is your WLM (workload management). WLM can be thought of as “how many processes can be managed by a cluster at one time.” It allows for control over how much of the clusters compute resources can be allocated to a query or set of queries being run at a given time. As an example, you may have ETL jobs that consume high amounts of compute resources if left unchecked. This can have an impact on your other workflows like ad-hoc analysis or BI reporting. Using WLMs you can assign queries from the ETL user or a user group to a specific WLM that only can consume a limited percentage of the cluster’s available compute resources. In this way, you are able to cap specific workflows ability to hog compute resources.
WLM also allows for controlling the number of queries being run for a specific user or user group. For your reporting users, it may be fine to support higher concurrency, but for transformation workloads, it may be best to only have a single transformation running at a given time. Using a WLM allows for control over query concurrency as well.
Finding the best WLM that works for your use case may require some tinkering, many land between the 6-12 range.
Short Query Acceleration
Long queries can hold up analytics by preventing shorter, faster queries from returning as they get queued up behind the long-running queries. To mitigate this, Redshift has the option to enable “short query acceleration,” which allows queries with shorter historical runtimes to complete without waiting for longer queries to complete. This can be used in conjunction with WLM to find the concurrency/performance that works best for you.
Typically, when a Redshift cluster is upsized, both the compute and storage resources are simultaneously upgraded. But a business could very well have a scenario where they need more compute power and higher concurrency to process a backlog of queries, but not necessarily more storage. As data warehousing solutions have become more popular and have become the hub from which reporting infrastructure pulls data from, the use case for DW technologies like Redshift have shifted. It is no longer only a few analysts running queries against a database for one-off analytical queries. We are now seeing these DWs process thousands of queries per day, and serving hundreds of users’ BI and reporting needs. Redshift, like many OLAP databases, wasn’t initially built for this purpose but they have built concurrency scaling to address this specific problem.
If you have a case where you don’t need more storage and have peaks of usage that would require more computational resources/concurrency, Redshift’s concurrency scaling would be a good option to reduce the time spent waiting for queries to even begin running. While each instance may see different results, we have seen up to a 90%+ reduction in queue time when enabling concurrency scaling.
Whenever there are more queries queued up than can be managed by WLM at a given moment, Redshift assesses whether it would be worth the overhead to spin up additional clusters to go through the queued up queries. Currently, Redshift gives the ability to spin up to 10 additional clusters (giving 11X the resources in total) with concurrency scaling.
Users can track how often clusters are spinning up additional clusters due to concurrency scaling. If a higher query volume occurs regularly, rather than simply peaking occasionally, then consider upsizing your cluster to minimize overhead from building and taking down clusters.
Modeling Your Data for Performance
The data landscape has changed significantly over the last two decades. The volume of data being created has increased, and the storage and computational resources needed to store and analyze that data has become cheaper and more widely available.
Through this transition, there have been new technologies that change the way data should be stored, and Redshift is no different. Historically, the preferred approach to storing your data was to use a Snowflake schema, in which the schema consists of highly normalized tables describing all of the dimensions of your data alongside a few long, but narrow, fact tables. This approach made sense during a time in which the cost of storage was high, so normalizing tables reduced the total footprint. At the same time joins were computationally inexpensive making it feasible to merge that data with joins in a query and get quick results.
As data volumes increased though, the cost of joining tables together increased, making highly normalized tables feasible for everyday querying, and thus the star schema was born. It collapses the dimensions of the dimensions into a single dimension from the fact. This results in less joins between the metric data in fact tables, and the dimensions.
While the star schema is the predominant choice for most in a data warehouse, as storing data becomes cheaper, schemas with a series of rollups/aggregations have become more popular. These combine the dimensions and fact tables and pre-aggregate the metrics to simplify reporting. This method optimizes for a BI layer, by reducing the number of joins to query KPIs, and providing those values in a single table.
However you end up designing your schema, it’s important to document it and maintain consistency across your tables.
Sort & Dist Keys
As Redshift is a distributed database, it stores your data across many nodes. To improve query speeds and optimize the storage of your data, it can distribute the records for your table across those nodes and sort that data to make querying it faster. This benefits query speeds on larger data sets tremendously as each node is able to find data quickly and process the smaller amounts of data it has stored locally.
It can be daunting to configure the sort keys and distribution styles when first configuring your database. If you are going to set sort and dist keys, as a general rule you’ll want the dist key set to the column most commonly joined on and the sort key set to the fields that are most commonly filtered on (or used in the WHERE clause).
Redshift offers three different approaches to distribution. If you don’t choose one when you are creating your table, Redshift will set the distribution method to auto, which means that it will pick an EVEN or ALL dist style based on the table size. To check the current distribution style of your table, you can query SVV_TABLE INFO:
select * from SVV_TABLE_INFO where table_name = <your_table’s_name>;
The three distribution styles are:
All – With the “all” distribution, Redshift will keep a copy of the table on every node. This makes sense for small tables where you won’t run into issues from storage consumption by duplicating your data. This style of distribution can be useful when your table is commonly joined to other tables and is relatively small. This prevents Redshift from having to pass the table’s data across the nodes to support processing larger tables in parallel across many nodes.
Key – With the “key” distribution method, a single column is used to determine how to sort the table across the nodes. Redshift will use the values within that column to determine which rows of data are placed on specific nodes, so that rows with the same value are stored on the same node. This style of distribution can be useful for large tables that are joined on the same value regularly. As a general rule, you’ll want to set your dist key to a column that is most regularly joined on if you are using the “key” distribution style. This column should not have high cardinality, and the number of records should not be skewed to a single value.
Even – The “even” distribution method will use a round-robin approach to dividing a table’s rows across the cluster’s nodes. If you’re unsure about using “key” or “all” distribution, and your table is not being joined to regularly, this is a good distribution method.
To check if the distribution method you’ve chosen is reasonably distributed across your nodes, you can query SVV_TABLE_INFO and measure the skew of your table. If you see your table reported as having a high skew, it means that the distribution of the table has resulted in more data being stored on a few nodes, rather than a more even distribution across them all.
Redshift sort keys allow you to specify in what order the data is stored across your nodes. Similar to an index in an OLTP database, they can help accelerate your queries. By using metadata about where the data is stored, it allows the query engine to skip over chunks of data that it knows are not within the bounds of your query’s parameters. This is similar to going to a library and looking for a book by an author’s last name that starts with a “K”, and knowing that you can skip the aisles and shelves with books by authors whose last name starts with A-J.
The sort key is defined during table creation as part of the create table statement. You can set one or more columns as the sort key. If multiple sort keys are set, there are two options you can choose from for how it handles the keys:
Compound – With a compound sort key, the data will be sorted by the first key provided, and then within a grouping of rows that share the same first sort key, the data will be sorted by the next key provided. This style of sort key can improve query performance when the sort key is used for joins, GROUP BY, ORDER BY and window functions that use partition by when the sort key is used.
Interleaved – With an interleaved key, the unique combination of the sort keys provided will be used to order the columns. For tables that get joined to or filtered on multiple columns regularly, this sort method can improve query performance.
When choosing sort keys, the best candidates are often the columns that are joined to or filtered on the most. It’s also important to be aware that as rows are added to a sorted table that already contains data, the new rows are not sorted into the existing data. If this is not managed regularly, you can end up with a large amount of unsorted data that will adversely affect query performance.
As rows are added and deleted from tables, they will either be unsorted or marked for deletion. This means query performance can be reduced and the cluster will have less disk space available. The vacuum operation is useful for resorting data within tables to ensure the maximum benefits from sort keys and reclaims the disk space from rows that have been marked for deletion.
There are four different methods for vacuuming your cluster (or specific table if you add that parameter):
VACUUM DELETE ONLY: Reclaims spaces from deleted rows.
VACUUM SORT ONLY: Resorts rows within tables based on the sort keys.
VACUUM FULL: Performs a combination of VACUUM DELETE ONLY and VACUUM SORT ONLY. This can also be executed as just VACUUM.
VACUUM REINDEX: Used for special cases where tables have interleaved sort keys.
The most common method is VACUUM FULL. However, if you rarely delete data from your Redshift warehouse, running the VACUUM SORT ONLY is likely sufficient for regular maintenance.
Depending on how much data needs to be re-sorted or deleted, the vacuum process can be resource-intensive and time-consuming. As such, it is best practice to run this regularly to reduce the amount of work that needs to be done and to run it during off-peak hours to avoid adversely impacting other services querying the Redshift. We’ve found that once a day is sufficient for most clusters, but if your data is changing less regularly, a less frequent schedule may be more appropriate.
Analyze is a process that you can run in Redshift that will scan all of your tables, or a specified table, and gathers statistics about that table. These statistics are used to guide the query planner in finding the best way to process the data. If the statistics are off on your tables, it can result in queries taking longer to run as the query planner does not know how the data is structured/distributed across the nodes in your cluster. If you are incrementally loading into tables, and not running analyze on a regular basis, it is likely that you’ll experience a reduction in query performance.
As a best practice, we recommend vacuuming and analyzing your database on a regular basis. Beyond keeping your queries fast and saving space, it also means that when it comes to managing your disk consumption, you will have up-to-date table statistics when querying SVV_TABLE_INFO.
Together, selecting the cluster type and optimizing performance configurations will provide the best experience when using a Redshift warehouse. The best configuration will vary based on whether the cluster is to be solely used for analytics, or if it will power your BI layer. By understanding the concepts outlined in this post, you should be well on your way to determining a strategy for how to configure your Redshift so that it is both highly-performant and scalable.
About the authors:
Neha Kumar is a member of the Solutions Team at Sisense, providing architecture guidance and technical support for enterprise customers. Outside of work, Neha is pursuing a Masters in Information and Data Science from UC Berkeley and enjoys dancing and painting.
Chris Meier is a member of our data team, a steward of our single source of truth, and is always in search of new ways to make the business more data-driven. Beyond analytics, he is an avid photographer and drone enthusiast ever-determined to get the perfect shot.