Single denormalized events tables are increasingly common, especially for storing log data from mobile clients. These single-table implementations often use JSON blobs to store properties.
A single table is great for quickly ingesting data. Unfortunately, it can also make real-time queries slow and cumbersome.
User-Based Analysis on Denormalized Tables
A lot of data analysis involves aggregating data by user and reasoning about those users. With a denormalized table, you have to extract these user data points separately for each query, which wastes CPU time and analyst time.
For example, calculating paid user retention requires finding all the users who have paid. Then you join that set back to your activity table to see how those users' behavior changes over time.
Since this is a core metric for your business, you’ll be running a query like this daily, if not hourly:
with retention_users as (
  select user_id, date(min(created_at)) as first_login
  from events
  group by 1
  having min(created_at) > now() - interval '14 day'
), user_activity as (
  select user_id, date(created_at) as day
  from events
  where created_at > now() - interval '7 day'
  group by 1, 2
)
select day,
  count(retention_users.user_id)::float / count(user_activity.user_id)
from user_activity
right join retention_users using (user_id)
where retention_users.first_login + interval '7 day' > user_activity.day
group by 1
Luckily, there's a solution: normalizing metadata from the events table into dedicated tables. This can make analysis queries much faster.
A users table is a great place to start. You'll want to store some metadata about each user. The details depend on the analyses you'll be doing, but common columns include total spend, platform, marketing channel, join date, and experiment groups.
Here’s an example:
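As a sketch, a users table along these lines would cover the columns mentioned above (the exact names and types are illustrative, not prescriptive):

```sql
create table users (
    user_id           bigint primary key,
    first_login       date,
    platform          text,     -- e.g. 'ios', 'android', 'web'
    marketing_channel text,
    experiment_group  text,
    total_spend       numeric default 0
);
```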
With such a table, your retention query simplifies to something like this:
with user_activity as (
  select user_id, date(created_at) as day
  from events
  where created_at > now() - interval '7 day'
  group by 1, 2
)
select day,
  users.spend_level,
  count(users.user_id)::float / count(user_activity.user_id)
from user_activity
right join users using (user_id)
where users.first_login + interval '7 day' > user_activity.day
  and users.age between 18 and 35
group by day, users.spend_level
As a bonus, this version will run quite a bit faster, as you’re not joining the whole events table to itself!
Creating and Updating Your Tables
Depending on your stack and preferences, you have lots of options for creating and updating these tables.
Sisense for Cloud Data Teams’ Views feature will let you materialize a table with a simple select statement. You write the statement once, and Sisense for Cloud Data Teams updates the view every hour. This is an especially good option if you don’t own the database because your events are stored by a third-party service like Amplitude or Segment.
If you do own the database, some databases (Postgres, for example) support materialized views, which are a good option.
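In Postgres, for instance, a user metadata table like the one described above can be maintained as a materialized view and refreshed on a schedule (the column choices here are illustrative):

```sql
create materialized view user_metadata as
select user_id,
       date(min(created_at)) as first_login,
       count(*) as event_count
from events
group by 1;

-- then, on a schedule (e.g. from cron):
refresh materialized view user_metadata;
```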
If you're already ETLing data into the database, that pipeline is a natural home for the code that builds these tables. You can also set it up to trigger the table updates when new events are added, keeping the metadata tables perfectly up-to-date.
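If the ETL loads events in batches, the follow-up step might be an upsert that folds each new batch into the users table. A minimal sketch in Postgres syntax, assuming a staging table `new_events` with an `amount` column (both names are hypothetical):

```sql
insert into users (user_id, first_login, total_spend)
select user_id,
       date(min(created_at)),
       sum(coalesce(amount, 0))
from new_events
group by 1
on conflict (user_id) do update
  set total_spend = users.total_spend + excluded.total_spend;
```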
If you want the metadata tables to live in your database, but don’t own your ETL, a database trigger works well. The trigger will run when new events are inserted and update your metadata tables as well. Remember to keep them fast, or they’ll bog down your insert statements.
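As a sketch, a Postgres trigger along these lines could keep a running spend total current as events arrive (function and column names are assumptions; `execute function` requires Postgres 11+, older versions use `execute procedure`):

```sql
create or replace function update_user_spend() returns trigger as $$
begin
  -- keep the aggregate cheap: one indexed update per inserted event
  update users
  set total_spend = total_spend + coalesce(new.amount, 0)
  where user_id = new.user_id;
  return new;
end;
$$ language plpgsql;

create trigger events_update_users
after insert on events
for each row execute function update_user_spend();
```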
Denormalized logs tables can be convenient when you’re logging the data. Just remember to set up normalized metadata tables on the other end!