It’s happened to everyone. Your nightly ETLs didn’t run because someone’s long-running job locked the entire cluster overnight. Now your analysts don’t have the data they need to run daily reports.
To keep this from happening, Redshift allows you to separate ETL queries into their own queue. This gives them a dedicated share of memory and concurrent query slots.
Creating ETL Job Queues
Let’s start by creating some queues for our ETL jobs. For this walkthrough, we’ll assume we have two types: small, quick jobs that run frequently and a few big, heavy ones that require serious resources.
The first step is to create a group for all our ETL users:
create group etl with etl_ro, etl_rw
Next, we need to set up workload management (WLM) queues, and assign them based on our group. WLM settings are part of the Parameter Group Settings, which you can find through the “Parameter Group” link in the sidebar of the Redshift console.
Select a parameter set or create a new one, and click Edit WLM.
Let’s start with two new queues for our two types of ETL jobs. Queue #1 will be the big job queue, and we’ll give it one slot and 20% of memory.
Queue #2 will be for all other ETL jobs. We’ll give it concurrency of two and 20% memory, meaning each job will get 10% memory. Here’s what our WLM settings look like:
Notice that queries from users in the etl user group will automatically run in Queue #2.
Assigning Queries to Queues
By default, queries run in the default queue for their user group. For users in the etl group, that’s Queue #2.
Users can also set their query queue on a per-session basis with set query_group. For example, to run in Queue #1:
set query_group to slow_etl_queue
For our set of queues, we might have all ETL jobs use a user account in the etl user group, and additionally have slow queries set their query_group before running. This way, slow ETLs use the slow queue, all other ETLs use the standard ETL queue.
Non-ETL queries will run in Queue #3, which is the default, because they have not specified otherwise. Since that queue has the default concurrency of 5, we can expect each slot to get one fifth of the remaining 60% memory, or 12% of cluster memory apiece.
Checking Our Work
We’ve set up what we think is the ideal slot structure. Now we want to make sure it works.
First, let’s get the service classes for our queues:
select id , condition , action_service_class from STV_WLM_CLASSIFICATION_CONFIG where action_service_class > 4
We get the following results:
The first condition is the special super user queue. If all query slots are taken, the super user can still run queries with set query_group to superuser. This allows one query at a time with a reserved portion of memory.
Now that we have each queue’s service class, we can match it up to number of query slots and total memory:
select name , service_class , num_query_tasks as slots , query_working_mem from STV_WLM_SERVICE_CLASS_CONFIG where service_class > 4
Here are our results:
The name “Service Class #X” maps to “Queue #X” in the Redshift console, and of course service_class maps to our previous query and other Redshift system tables. service_class 1-4 are reserved for internal use, and service_class 5 is always the superuser queue.
Finally, we’ll look at the queries in the general-purpose queue, service_class 8:
select userid , query , service_class , slot_count , total_queue_time , total_exec_time , service_class_start_time from stl_wlm_query order by service_class_start_time desc limit 5
We get these results:
Notice that some queries are using multiple slots, as shown in the slot_count column. That’s because those queries have their wlm_slot_count set above 1, instructing them to wait for multiple slots to be open and then consume all of them.
Time is logged first for waiting in queue, and then for executing. Both of those numbers can be very helpful in debugging slowdowns.
When Not to Use WLM
WLM is great at partitioning your cluster’s resources, but be wary of creating too many queues. If one queue is often full and another empty, then you’ll be wasting cluster capacity. In general, the more constraints you give your Redshift cluster, the less flexibility it will have to optimize your workload for you.
Finally, keep an eye on the total_queue_time field of stl_wlm_query and consider increasing the number of query slots for queues and service class with high queue times.