When writing SQL for data analysis, there are two kinds of errors: errors that prevent the query from running, and errors that merrily return the wrong data. The second kind of error can be hard to spot, especially when your results are within range of your expectations.
Over the years we’ve encountered many of these deceptive errors. Here’s a list of common problems you need to look out for when your data looks off:
One of the most common types of analysis queries is the time series. Missing time zone conversions, returning data from outside your intended time period, and aggregating by the wrong created_at field are all common time series errors.
Aggregating by the Wrong Date
Aggregating by the wrong created_at field after a join will preserve aggregate metrics but shift around data for individual dates.
Imagine we have a users and purchases table, and we want to look at how many purchases were made each day per user source. Both tables have a created_at field that we can use.
If we accidentally aggregate on users.created_at, we’d get cohorted purchases rather than daily purchases.
select date_trunc('day', users.created_at), users.source, count(1) from users join purchases using(user_id) group by 1, 2
Instead of users.created_at in the date_trunc function, you’ll want to use purchases.created_at.
Partial Time Zone Conversion
In a query where many date fields are converted to a common time zone, it’s easy to miss a conversion — especially with constants.
select count(1) from events where convert_tz(ts, 'UTC', 'PST') > '2015-06-01'
Here, you’ll want to use convert_tz(‘2015-06-01’, ‘UTC’, ‘PST) in place of ‘2015-06-01’.
Time Traveling — Including Future Data
It’s easy to grab data from outside your intended time period in a time series query. Consider a query that counts the number of blog posts viewed by a user before signing up.
If we forget to restrict the views to those earlier than signups.created_at, our data will look reasonable, but include blog views after the signup time.
select signup.id, count(1) from signups join views on signup.cookie = view.cookie
This query needs another join condition: and views.created_at <= signups.created_at.
Join clauses are usually straightforward, but the common mistakes below can be hard to notice.
Joining More than One Row
Imagine a set of customers and subscription plans, where each customer has one active plan at most. Say we wanted to know how many customers we have. We could join the customers table and the plans table and count the number of rows.
However, the plans table may have all historical plans for each customer. In this case, counting the inner join of accounts and plans will overcount customers that have changed plans:
select count(1) from accounts join plans using (account_id)
To fix this, we could either join the first row in each group, or count(distinct accounts.id).
Joining on the Wrong ID
Imagine we have a users table and a table of users invited by other users named referrals, and they are roughly the same size. We want to determine how many of our users invite their friends.
In our referrals table, we have an id column that tracks all referrals and a user_id column that tracks the user who made the referral. If we confuse referrals.id and referrals.user_id, the result would look like almost all users refer another user:
select count(distinct users.id) from users join referrals on users.id = referrals.id
Instead of referrals.id, you’ll want to use referrals.user_id.
Omitting Left or Right in a Join
A standard join will return the data that two tables have in common, whereas a left join returns all of the records in the left table and any matching records from the right table. A right join returns all of the records in the right table and any matching records from the left table.
Accidentally omitting either left or right when joining two tables will often return a plausible subset of the intended rows.
Let’s use the users and referrals tables from our previous example. If we wanted to count the percent of users who referred another user, the following query will always return 100%, as it drops user rows that do not have any referrals:
select count(distinct referrals.user_id) / count(distinct users.id)::float from users join referrals on users.id = referrals.id
In this query, join should be left join.
Comparing against Null
Null has surprising behavior in comparisons. It is not greater than, less than, or equal to zero. It’s neither equal to itself nor not equal to itself.
If you have a table of users with a purchases column, and wanted to see how many had more than zero purchases, this query would lead you astray:
select count(1) from users where num_purchases != 0
Instead of including only rows with more than zero purchases, this query will include all null rows as well. This is why you want to use where num_purchases is not null and num_purchases > 0 and not where num_purchases != 0.
Reversing Function Parameters
Let’s start with a signups and customers table. Say you wanted to look at how many signups become customers in less than a week. You could look at the datediff of the customer and signup timestamps and filter out rows with a difference of less than seven days.
However, if you reverse signups.date and customers.date in the datediff function, all of the differences will be negative and thus below the filter. This will overcount the signups that convert in less than seven days.
select count(1) from signups join customers using (signup_id) where datediff('day', customers.date, signups.date) < 7
The order of customers.date and signups.date needs to be reversed here.
MySQL Group By
MySQL has the bizarre behavior of not erroring if there are too few group statements.
You will get very surprising results if you write a query like this:
select created_date, source, count(1) from users group by 1
This query should also group by source.
MySQL will return arbitrary results for source, picked from any of the rows with a given created_date.