Why We Built Our Own Analytics System

We love analyzing data, but found some of the existing web analytics solutions like Google Analytics too constraining. There were complex questions about our readers and users that we just couldn't answer, or that required excessive boilerplate.

Iterating on analyses of A/B tests, user retention, and a hundred custom metrics just wasn't fast enough.

We ultimately moved to an in-house SQL based system for flexibility, control, and access to the rest of our data, and have been incredibly happy with this decision.

What We’ll Cover

There are four essential tasks for an analytics system:

  • Collecting data from user touch points
  • Reconciling data across touch points
  • Importing data from other systems
  • Running queries

We’ll explain how we designed and implemented each of these pieces for our internal analytics system, and show some example queries that this system enables.

Collecting Web Traffic

The first piece we built was a simple replacement for the Google Analytics tracking code. Whenever a user visits a page, we want to add a ping row to a pings table:


This can be accomplished with a bit of JavaScript on each page of a site. Send an Ajax request to www.oursite.com/ping, log it to your database, and set a cookie in the response header. Now we'll have the same raw data that Google Analytics does:

  • The time the user visited the site
  • The URL they visited
  • The referring URL they came from
  • An id for the user to connect multiple visits

For bonus points, grab the browser and OS in JavaScript and pass them along on the ping request to separate desktop from mobile traffic.
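For concreteness, the pings table described above might be declared like this in Redshift (a sketch; column names and types are our illustrative choices):

```sql
-- Illustrative schema for the pings table (column names are assumptions).
create table pings (
  id          bigint identity(1,1),  -- monotonically increasing ping id
  cookie      varchar(64),           -- anonymous visitor id set via the response header
  url         varchar(2048),         -- the URL the user visited
  referrer    varchar(2048),         -- the referring URL, if any
  browser     varchar(64),           -- optional, parsed client-side
  os          varchar(64),           -- optional, parsed client-side
  created_at  timestamp default getdate()
);
```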

We also set extra URL parameters on our ad campaigns so we can eventually tell which ads perform best; we'll look at how to parse out these parameters and segment by ad campaign later on.
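As a preview, if the full URL (query string included) lands in the url column, an ad campaign tag such as a utm_campaign parameter can be pulled out with string functions (a sketch; the parameter name is an assumption):

```sql
-- Count pageviews per ad campaign by extracting the utm_campaign
-- parameter from the ping URL (assumes url keeps the query string).
select
  split_part(split_part(url, 'utm_campaign=', 2), '&', 1) as campaign
  , count(1) as pageviews
from pings
where url like '%utm_campaign=%'
group by 1
```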

Collecting Signup Events

The next most critical data to collect is conversion data — a signup form submission in our case. Most companies likely already have a database table for this kind of data, so it’s just a matter of logging the same user cookie alongside the form submission.


The cookie is how we can connect the dots between the first time we see a visitor to when they become a customer.

We can join web traffic to form submissions and ask simple questions, such as: how many pages did a user view before signing up?

select
  signups.cookie
  , count(1) as pageviews
from signups 
join pings on signups.cookie = pings.cookie
              and pings.created_at < signups.created_at
group by 1

Attributing Events to Channels

While we have the raw data, it’d be unwieldy to directly use the pings table for larger queries. A given user may have dozens of pings across multiple days and channels. This is the attribution problem.

While more established companies that have many touch points with a potential customer can justify the added complexity of attributing a user to multiple channels, we went with the simpler option of attributing a user to the first channel they discovered us on.

To help our queries run quickly, we created a regularly updated materialized view of the first ping for each cookie. Here's the Redshift SQL for this:

create view first_pings as 
select *
from (
  select *
    , row_number() over (partition by cookie order by created_at asc) as row_number
  from pings
) t
where row_number = 1

Postgres can also use a select distinct on… query, and MySQL can select the min ping id grouped by cookie, then select all pings in that list of ids.
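For reference, sketches of those two variants:

```sql
-- Postgres: distinct on keeps the first row per cookie
-- when ordered by created_at.
create view first_pings as
select distinct on (cookie) *
from pings
order by cookie, created_at asc;

-- MySQL: take the minimum ping id per cookie, then select those pings
-- (assumes ids increase with time).
create view first_pings as
select *
from pings
where id in (select min(id) from pings group by cookie);
```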

Before we use this handy view, let's also define a cookie_channel view that maps a cookie to a consolidated channel based on its referrer. Then we can map all pings originating from Reddit, Twitter, and Hacker News to a canonical Social channel.

create view cookie_channel as 
select
  cookie
  , case when referrer ilike '%reddit.com%' then 'Social'
         when referrer ilike '%ycombinator.com%' then 'Social'
         when referrer ilike '%twitter.com%' then 'Social'
         when referrer ilike '%bing.com%' then 'Search'
         else 'Other' end as channel
from first_pings

In practice, we have dozens of rules and both a channel and more specific subchannel columns, but this is a good start.

Let’s create a chart of visitor growth by channel with our new views:

Monthly Unique Visitors
select
  date_trunc('month', created_at)
  , channel
  , count(1)
from first_pings
join cookie_channel using (cookie)
group by 1, 2

We could make a similar chart for signups by channel by joining from signups.cookie to first_pings.cookie.
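A sketch of that query, leaning on the cookie_channel view (which is itself built on first_pings):

```sql
-- Monthly signups by first-touch channel.
select
  date_trunc('month', signups.created_at)
  , channel
  , count(1) as signups
from signups
join cookie_channel using (cookie)
group by 1, 2
```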

Collecting External Events

While it's great to have all of our signup and visitor data, several important events happen in separate systems. We have a lot of sales data in an external CRM, and it includes great fields we'd like to use. The most important are whether a user has had a conversation or a product trial, and whether they eventually became a customer.

We set up a cron job to periodically pull the CRM data and reconcile based on the signup email. We match on email because the CRM doesn’t have our domain cookie, but it does have the user’s email.

We update a series of these sales funnel bits on the signups table we introduced above. Rather than keeping a single current status, we track individual bits (demo, trial, customer) so we can see how far a user got down the funnel, even if they eventually decide not to become a customer.
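Assuming the cron job stages the CRM rows in a table (crm_deals and its columns are hypothetical names here), the reconciliation can be a handful of updates keyed on email:

```sql
-- Set funnel bits on signups from a staged CRM table.
-- crm_deals, had_demo, had_trial, is_customer are hypothetical names.
update signups set demo = true
where email in (select email from crm_deals where had_demo);

update signups set trial = true
where email in (select email from crm_deals where had_trial);

update signups set customer = true
where email in (select email from crm_deals where is_customer);
```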


Tracking App Usage and Customer Funnels

When a user first logs in, we can connect the app account to the user cookie by logging their cookie in the account table, similarly to what we do with signups. At this point, we can trace all app behavior (usage, conversion to a paid plan, upgrade, churn) back to the original visit and channel.
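For example, assuming an accounts table with the logged cookie and a paid flag (both hypothetical names), paid conversion by first-touch channel is a short query:

```sql
-- Paid accounts by original acquisition channel.
-- accounts.cookie and accounts.paid are assumed column names.
select
  channel
  , count(1) as accounts
  , count(case when paid then 1 end) as paid_accounts
from accounts
join cookie_channel using (cookie)
group by 1
```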

Now we can generate entire lifecycle funnels, looking at how each month of signups converts through each stage:

Conversion funnel
select
  date_trunc('month', created_at)
  , count(1) as "Signups"
  , count(distinct case when demo is true 
                          then signups.id 
                          else null end) as "Demos"
  , count(distinct case when trial is true 
                          then signups.id 
                          else null end) as "Trials"
  , count(distinct case when customer is true 
                          then signups.id 
                          else null end) as "Customers"
from signups
group by 1


With all this data, we can calculate retention of visitors, users, and paid users. We've previously written about how to calculate retention and cohort retention in SQL, and it makes for fine-looking charts:

New User Retention

Cohort retention is a bit complex, so check out the above link for the step-by-step details.
Basic retention is a bit simpler, especially if you are already familiar with self-joins. We’ll calculate the 1-day retention of webpage visitors, and make use of our cookie_channel view.

Visitor 1 day retention
-- A helper CTE for date truncation and joining in channel
with activity as (
  select
    date_trunc('day', created_at) as date
    , channel
    , cookie
  from pings 
  join cookie_channel using (cookie)
)
-- The main retention query.  Joins activity to itself
-- with a one day offset.
select
  activity.date
  , activity.channel
  , count(distinct activity.cookie) as active_users
  , count(distinct future_activity.cookie) as retained_users
  , count(distinct future_activity.cookie) / 
      count(distinct activity.cookie)::float as retention
from activity
left join activity as future_activity on
  activity.cookie = future_activity.cookie
  and activity.date = future_activity.date - interval '1 day'
group by 1, 2

There are also alternate definitions of retention that are commonly used in mobile game analysis.

Building your own system

We covered how to collect the data you need to get started on your own in-house analytics platform, and walked through some of the most common queries.

Next, you’ll want to create hundreds of insightful charts about your business. Sisense for Cloud Data Teams is a great option if you want to type SQL and get charts and dashboards to share with your team.