More Rows, Not More Columns
Many a MySQL or Postgres DBA, with responsibility for serving a fast and responsive website, will design a schema for a survey that looks like this:
This is a great schema for serving a live site. You only need to show one user her responses, and in a row-based DB like MySQL or Postgres, select * from survey_responses where user_id = 5 is very fast.
When this table is moved to the data warehouse, however, tables like this make your data analyst crazy. The analyst would much rather see a table like this:
For one thing, the analyst will tend to run queries that scan all the rows, like this one calculating the “yes” rate on Question 2:
select sum( case when response is true then 1 else 0 end )::float / count(1) from survey_responses where question = 'question 2'
For another, the analyst will be using a data warehouse, which makes looking at all the rows in a couple of columns very efficient, but looking at all the columns — even only a couple hundred — quite expensive.
Making the transition
We know we want a row-based table design for serving our website, and a columnar design for analyzing the data. So how do we get from one to the other? How do we transpose the first table into the second table?
A naive solution is a set of unions:
select user_id, 'question 1', question_1_response from survey_responses where user_id = 1 union select user_id, 'question 2', question_2_response from survey_responses where user_id = 1 union select user_id, 'question d', question_3_response from survey_responses where user_id = 1 ...
This approach has a number of drawbacks. First and foremost, it is a pain in the ass to write.
Moreover, it is expensive. It asks for every whole row n times if there are n questions. And since this transformation typically takes place on the data warehouse, asking for whole rows over and over again is especially brutal.
Row numbering and full joins
We’ll start with a simple list of numbers:
with numbers as ( select row_number() over (order by true) i from survey_responses )
Now, for every row in our row-based table, we’ll create a row per question:
expanded_responses as ( select * from survey_responses join numbers on true )
Notice survey_responses join numbers on true. This is a full join, meaning a row is created in the resulting table for every possible combination of rows survey_responses and numbers.
Our table so far looks like so:
Notice the i column on the far right. We now have one row per user per question, which is about to come in very handy.
Now we will use our row numbering to select responses to question 1 from rows numbered 1, responses to question 2 from rows numbered 2, etc.:
select user_id, 'question ' || i as question, case when i = 1 then question_1_response when i = 2 then question_2_response when i = 3 then question_3_response when i = 4 then question_4_response when i = 5 then question_5_response end as response from expanded_responses
This gives us our fully-transposed table!
The row numbers gave us a handy way of picking which row to use for selecting the answer to each question. And since this is running on a columnar data warehouse, blowing out the row count is not a big deal.
Putting it all together, the query looks like this:
with numbers as ( select row_number() over (order by true) i from survey_responses ), expanded_responses as ( select * from survey_responses join numbers on true ) select user_id, 'question ' || i as question case when i = 1 then question_1_response when i = 2 then question_2_response when i = 3 then question_3_response when i = 4 then question_4_response when i = 5 then question_5_response end as response from expanded_responses
When transposing a 111-column, 132,000-row table on Amazon Redshift, this technique ran in 7 seconds. We’d love to be able to tell you how long the union-based approach took for comparison, but it crashed the cluster!
You guessed it!
There’s an easier way. Using Sisense for Cloud Data Teams, use the Transpose Table checkbox in the Settings tab to transpose your whole table in a single click: