One of the more popular questions our customers ask is how to make histograms in SQL. In this blog post, we’ll take a look at a couple of quick implementations of histograms before showing how you can automate the queries.
A Basic Histogram Example
The defining feature of a histogram is using bins for x-axis values to represent ranges, rather than having set values. Depending on our schema, we can use Amazon Redshift’s case expression to quickly and easily create bins.
Let’s start off with a basic example. We have a users table, and we want to create bins by age groups. We can use a case expression to sort our users into different bins by their ages.
Then, we can just use a simple count for our y-axis values, grouping by the bins we created. Our query would look something like this:
select case when age between 0 and 9 then '0-9' when age between 10 and 19 then '10-19' when age between 20 and 29 then '20-29' when age between 30 and 39 then '30-39' when age between 40 and 49 then '40-49' else '50+' end, count(1) from users group by 1
You can see that this translates into a bar chart, where our x-axis values represent age ranges by 10 year increments.
The major takeaway from this example is that you can create the range of values that the bins represent by modifying what values you are checking in the case expression. While we only checked for integer values between two numbers in this example, you can also check strings or results of entire subqueries!
Segmenting on the Bins
Let’s say we want our age bins to be segmented by gender as well. All that we would have to do is include our gender column in the query, and group by it. Our case expression has already done all of the work of creating the bins, and a simple group by allows us to segment the individual bins.
select case when age between 0 and 9 then '0-9' when age between 10 and 19 then '10-19' when age between 20 and 29 then '20-29' when age between 30 and 39 then '30-39' when age between 40 and 49 then '40-49' else '50+' end, gender, count(1) from users group by 1, 2
This query retains the bins we created, and merely segments them by the new gender column that we added. The resulting histogram looks like this:
Automating Histogram Creation
Let’s take a look at another popular use case we’ve come across. Say that you already have a column that you can group by to create your bins. And now you want to create a histogram using the counts of the different values in that column.
There’s a quick and easy way to automate creating these kinds of histograms!
We recently introduced Parameterized SQL Snippets as a core feature of Sisense for Cloud Data Teams. These snippets let you pass values directly into them when used, and can add a new dynamic to your queries.
Ryan Iyengar from Ziprecruiter generously shared this Parameterized SQL Snippet he created to help generate histograms:
-- histogram(table_name, column) with counts as ( select , count(*) as "count_columns" from [table_name] where is not null group by 1 ) , histogram as ( select count_columns , count(*) from counts group by 1 order by 1 ) select * from histogram
Ryan’s snippet essentially takes the table and column that you pass to it, and generates a histogram of how often each of the counts occurs for that column.
Let’s try it out with our users table, and we’ll say that we’re looking for a histogram of how often the same zipcodes appear.
Our query would simply look like this:
Finally, you can look at the underlying SQL query to see how we’re interpreting the Parameterized SQL Snippet for building a histogram:
with counts as ( select zipcode , count(*) as "count_columns" from users where zipcode is not null group by 1 ) , histogram as ( select count_columns , count(*) from counts group by 1 order by 1 ) select * from histogram limit 5000
And now you can try making a histogram yourself in SQL!