Why We Built Our Own Analytics System
We love analyzing data, but found some of the existing web analytics solutions like Google Analytics too constraining. There were some complex questions about our readers and users that we just couldn’t answer, or required excessive boilerplate.
It wasn’t fast enough to iterate on analysis of a/b tests, user retention, and a hundred custom metrics.
We ultimately moved to an in-house SQL based system for flexibility, control, and access to the rest of our data, and have been incredibly happy with this decision.
What We’ll Cover
There are four essential tasks for an analytics system:
- Collecting data from user touch points
- Reconciling data across touch points
- Importing data from other systems
- Running queries
We’ll explain how we designed and implemented each of these pieces for our internal analytics system, and show some example queries that this system enables.
Collecting Web Traffic
The first piece we built was a simple replacement for the Google Analytics tracking code. Whenever a user visits a page, we want to add a ping row to a pings table:
- The time the user visited the site
- The URL they visited
- The referring URL they came from
- An id for the user to connect multiple visits
We also set extra URL parameters on our ad campaigns so we can eventually tell which ads are performant — we look at how to parse out these parameters and segment by ad campaign later on.
Collecting Signup Events
The next most critical data to collect is conversion data — a signup form submission in our case. Most companies likely already have a database table for this kind of data, so it’s just a matter of logging the same user cookie alongside the form submission.
The cookie is how we can connect the dots between the first time we see a visitor to when they become a customer.
We can join web traffic to form submissions, and ask simple questions such as how many pages did a user view before signing up?
select signups.email, count(1) as pageviews from signups join pings on signups.cookie = pings.cookie and pings.created_at < signups.created_at
Attributing Events to Channels
While we have the raw data, it’d be unwieldy to directly use the pings table for larger queries. A given user may have dozens of pings across multiple days and channels. This is the attribution problem.
While more established companies that have many touch points with a potential customer can justify the added complexity of attributing a user to multiple channels, we went with the simpler option of attributing a user to the first channel they discovered us on.
To help our queries run quickly, we created a regularly updated materialized view of the first ping for each cookie. Here’s the redshift SQL for this:
create view first_pings as select * from ( select *, row_number() over (partition by cookie order by created_at asc) from pings ) t where row_number = 1
Postgres can also use a select distinct on… query, and MySQL can select the min ping id grouped by cookie, then select all pings in that list of ids.
Before we use this handy view, let’s also define a cookie_channel view, that will map from a cookie a consolidated channel from a given referrer. Then we can map all pings originating Reddit, Twitter, and Hacker News to a canonical Social channel.
create view cookie_channel as select cookie, case when referrer ilike '%reddit.com%' then 'Social' when referrer ilike '%ycombinator.com%' then 'Social' when referrer ilike '%twitter.com%' then 'Social' when referrer ilike '%bing.com%' then 'Search' else 'Other' end as channel from first_pings
In practice, we have dozens of rules and both a channel and more specific subchannel columns, but this is a good start.
Let’s create a chart of visitor growth by channel with our new views:
select date_trunc('week', created_at) , channel , count(1) from first_pings join cookie_channel using (cookie) group by 1, 2
We could make a similar chart for signups by channel by joining from signups.cookie to first_pings.cookie.
Collecting External Events
While it’s great to have all of our signup and visitor data, several important events happen in separate systems. We have a lot of sales data in an external CRM, and it includes great fields we’d like to include. The most important of which are whether a user has had a conversation, a product trial, or eventually became a customer.
We set up a cron job to periodically pull the CRM data and reconcile based on the signup email. We match on email because the CRM doesn’t have our domain cookie, but it does have the user’s email.
We update a series of these sales funnel bits on the signup table we introduced above. Rather than keeping a current status, we track individual bits — demo, trial, customer — so we can see how far a user got down the funnel, even if eventually they decide to not become a customer.
Tracking App Usage and Customer Funnels
When a user first logs in, we can connect the app account to the user cookie by logging their cookie in the account table, similarly to what we do with signups. At this point, we can trace all app behavior (usage, conversion to a plaid plan, upgrade, churn) back to the original visit and channel.
Now we can generate entire lifecycle funnels, looking at how each month of signups converts through each stage:
select date_trunc('month', created_at , count(1) as "Signups" , count(distinct case when demo is true then signup.id else null end) as "Demos" , count(distinct case when trial is true then signup.id else null end) as "Trials" , count(distinct case when customer is true then signup.id else null end) as "Customers" from signups group by 1
With all this data, we can calculate retention of visitors, users, and paid users. We’ve previously written about how to write retention and cohort retention in SQL, and it makes for fine looking charts:
Cohort retention is a bit complex, so check out the above link for the step-by-step details.
Basic retention is a bit simpler, especially if you are already familiar with self-joins. We’ll calculate the 1-day retention of webpage visitors, and make use of our cookie_channel view.
-- A helper CTE for date truncation and joining in channel with activity as ( select date_trunc('day', created_at) as date, channel, pings.* from pings join cookie_channel using (cookie) ) -- The main retention query. Joins activity to itself -- with a one day offset. select activity.date, channel, count(distinct activity.user_id) as active_users, count(distinct future_activity.user_id) as retained_users, count(distinct future_activity.user_id) / count(distinct activity.user_id)::float as retention from activity left join activity as future_activity on activity.user_id = future_activity.user_id and activity.date = future_activity.date - interval '1 day' group by 1,2
There are also alternate definitions of retention that are commonly used in mobile game analysis.
Building your own system
We covered how to collect the data you need to get started on your own in-house analytics platform, and some of the most common queries.
Next, you’ll want to create hundreds of insightful charts about your business. Sisense for Cloud Data Teams is a great option if you want to type SQL and get charts and dashboards to share with your team.