<p>When writing SQL for data analysis, there are two kinds of errors: errors that prevent the query from running, and errors that merrily return the wrong data. The second kind of error can be hard to spot, especially when your results are within range of your expectations.</p>



<p>Over the years we’ve encountered many of these deceptive errors. Here’s a list of common problems you need to look out for when your data looks off:</p>



<h2 class="wp-block-heading"><strong>Time Mistakes</strong></h2>



<p>One of the most common types of analysis queries is the time series. Missing time zone conversions, returning data from outside your intended time period, and aggregating by the wrong created_at field are all common time series errors.</p>



<h4 class="wp-block-heading"><strong>Aggregating by the Wrong Date</strong></h4>



<p>Aggregating by the wrong created_at field after a <strong>join</strong> will preserve aggregate metrics but shift around data for individual dates.</p>



<p>Imagine we have a users and purchases table, and we want to look at how many purchases were made each day per user source. Both tables have a created_at field that we can use.</p>



<p>If we accidentally aggregate <strong>on users.created_at</strong>, we’d get cohorted purchases rather than daily purchases.</p>



<pre class="wp-block-code"><code>select 
  date_trunc('day', users.created_at), 
  users.source,
  count(1)
from users join purchases using(user_id)
group by 1, 2 </code></pre>



<p>Instead of users.created_at in the date_trunc function, you’ll want to use purchases.created_at.</p>



<h4 class="wp-block-heading"><strong>Partial Time Zone Conversion</strong></h4>



<p>In a query where many date fields are converted to a common time zone, it’s easy to miss a conversion — especially with constants.</p>



<pre class="wp-block-code"><code>select count(1)
from events
where 
  convert_tz(ts, 'UTC', 'PST') > '2015-06-01'</code></pre>



<p>Here, you’ll want to use convert_tz(&#8216;2015-06-01&#8217;, &#8216;UTC&#8217;, &#8216;PST) in place of &#8216;2015-06-01&#8217;.</p>



<h4 class="wp-block-heading"><strong>Time Traveling — Including Future Data</strong></h4>



<p>It’s easy to grab data from outside your intended time period in a time series query. Consider a query that counts the number of blog posts viewed by a user before signing up.</p>



<p>If we forget to restrict the views to those earlier than signups.created_at, our data will look reasonable, but include blog views after the signup time.</p>



<pre class="wp-block-code"><code>select 
  signup.id, 
  count(1)
from signups
join views on
  signup.cookie = view.cookie</code></pre>



<p>This query needs another join condition: and views.created_at &lt;= signups.created_at.</p>



<h2 class="wp-block-heading"><strong>Joins</strong></h2>



<p>Join clauses are usually straightforward, but the common mistakes below can be hard to notice.</p>



<h4 class="wp-block-heading"><strong>Joining More than One Row</strong></h4>



<p>Imagine a set of customers and subscription plans, where each customer has one active plan at most. Say we wanted to know how many customers we have. We could <strong>join</strong> the customers table and the <strong>plans</strong> table and count the number of rows.</p>



<p>However, the plans table may have all historical plans for each customer. In this case, counting the inner join of accounts and plans will overcount customers that have changed plans:</p>



<pre class="wp-block-code"><code>select count(1)
from accounts 
join plans using (account_id)</code></pre>



<p>To fix this, we could either <a href="https://www.sisense.com/blog/4-ways-to-join-only-the-first-row-in-sql/">join the first row in each group</a>, or count(distinct accounts.id).</p>



<h4 class="wp-block-heading"><strong>Joining on the Wrong ID</strong></h4>



<p>Imagine we have a users table and a table of users invited by other users named referrals, and they are roughly the same size. We want to determine how many of our users invite their friends.</p>



<p>In our referrals table, we have an id column that tracks all referrals and a user_id column that tracks the user who made the referral. If we confuse referrals.id and referrals.user_id, the result would look like almost all users refer another user:</p>



<pre class="wp-block-code"><code>select count(distinct users.id)
from users 
join referrals 
  on users.id = referrals.id </code></pre>



<p>Instead of referrals.id, you’ll want to use referrals.user_id.</p>



<h4 class="wp-block-heading"><strong>Omitting Left or Right in a Join</strong></h4>



<p>A standard <strong>join</strong> will return the data that two tables have in common, whereas a left join returns all of the records in the left table and any matching records from the right table. A right join returns all of the records in the right table and any matching records from the left table.</p>



<p>Accidentally omitting either left or right when joining two tables will often return a plausible subset of the intended rows.</p>



<p>Let’s use the users and referrals tables from our previous example. If we wanted to count the percent of users who referred another user, the following query will always return 100%, as it drops user rows that do not have any referrals:</p>



<pre class="wp-block-code"><code>select 
  count(distinct referrals.user_id) / 
    count(distinct users.id)::float
from users
join referrals
  on users.id = referrals.id</code></pre>



<p>In this query, join should be left join.</p>



<h2 class="wp-block-heading"><strong>Other Problems</strong></h2>



<h4 class="wp-block-heading">Comparing against Null</h4>



<p>Null has surprising behavior in comparisons. It is not greater than, less than, or equal to zero. It’s neither equal to itself nor not equal to itself.</p>



<p>If you have a table of users with a purchases column, and wanted to see how many had more than zero purchases, this query would lead you astray:</p>



<pre class="wp-block-code"><code>select count(1)
from users
where num_purchases != 0</code></pre>



<p>Instead of including only rows with more than zero purchases, this query will include all null rows as well. This is why you want to use where num_purchases is not null and num_purchases &gt; 0 and not where num_purchases != 0.</p>



<h4 class="wp-block-heading"><strong>Reversing Function Parameters</strong></h4>



<p>Let’s start with a signups and customers table. Say you wanted to look at how many signups become customers in less than a week. You could look at the datediff of the customer and signup timestamps and filter out rows with a difference of less than seven days.</p>



<p>However, if you reverse signups.date and customers.date in the datediff function, all of the differences will be negative and thus below the filter. This will overcount the signups that convert in less than seven days.</p>



<pre class="wp-block-code"><code>select count(1)
from signups 
join customers using (signup_id)
where datediff('day', customers.date, signups.date) &lt; 7</code></pre>



<p>The order of customers.date and signups.date needs to be reversed here.</p>



<h4 class="wp-block-heading"><strong>MySQL Group By</strong></h4>



<p>MySQL has the bizarre behavior of not erroring if there are too few group statements.</p>



<p>You will get very surprising results if you write a query like this:</p>



<pre class="wp-block-code"><code>select
  created_date,
  source,
  count(1)
from users
group by 1</code></pre>



<p>This query should also group by source.</p>



<p>MySQL will return arbitrary results for source, picked from any of the rows with a given created_date.</p>


The Top SQL Analysis Errors

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article