Earlier, we showed how to make count distinct 50x faster with subqueries. Using probabilistic counting, we’ll make count distinct even faster, trading a little accuracy for the increase in speed.
We’ll optimize a very simple query, which calculates the daily distinct sessions for 5,000,000 gameplays (~150,000/day):
select date(created_at), count(distinct session_id) from gameplays group by 1
The original query takes 162.2s. The HyperLogLog version is 5.1x faster (31.5s) with a 3.7% error and uses a small fraction of the RAM.
Databases often implement count(distinct) in two ways. When there are few distinct elements, the database builds a hash set in RAM and then counts the keys. When there are too many elements to fit in RAM, the database writes them to disk, sorts the file, and then counts the groups of identical elements. The second case, writing the intermediate data to disk, is very slow. Probabilistic counters are designed to use as little RAM as possible, making them ideal for large data sets that would otherwise page to disk.
The HyperLogLog probabilistic counter is an algorithm for approximating the number of distinct elements in a set using minimal RAM. Counting the distinct elements of a set of 10 million unique 100-character strings can take over a gigabyte of RAM using a hash table, while HyperLogLog uses less than a megabyte (the “log log” in HyperLogLog refers to this space efficiency). Since the probabilistic counter can stay entirely in RAM for the whole computation, it’s much faster than any alternative that has to write to disk, and usually faster than alternatives that use far more RAM.
The core of the HyperLogLog algorithm relies on one simple property of uniform hashes: the probability that the leftmost set bit of a random hash is at position n is 1/2^n. We call the position of the leftmost set bit the most significant bit, or MSB.
Here are some hash patterns and the positions of their MSBs:
We’ll use the MSB position soon, so here it is in SQL:
select 31 - floor(log(2, hashtext(session_id) & ~(1 << 31))) as bucket_hash from gameplays
Hashtext is an undocumented hashing function in Postgres that hashes strings to 32-bit numbers. We could use md5 and convert it from a hex string to an integer, but hashtext is faster. We use ~(1 << 31) to clear the leftmost bit of the hashed number: Postgres uses that bit to determine whether the number is positive or negative, and we only want to deal with positive numbers when taking the logarithm. The floor(log(2, ...)) does the heavy lifting: the integer part of the base-2 logarithm tells us the position (from the right) of the MSB. Subtracting that from 31 gives us the position of the MSB from the left, starting at 1. With that line, we’ve got our MSB for each hash of the session_id field!
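To make the bit arithmetic concrete, here’s the same MSB computation as an illustrative Python sketch. Python’s math.log2 stands in for Postgres’s log(2, ...); this is not hashtext itself, just the position math:

```python
import math

def msb_position(h):
    # Mirror the SQL: clear the sign bit, then locate the leftmost
    # set bit (1-based from the left) of the remaining 31 bits.
    h &= ~(1 << 31)
    return 31 - int(math.floor(math.log2(h)))

msb_position(1 << 30)  # leftmost of the 31 usable bits -> 1
msb_position(1)        # rightmost bit -> 31
```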
The maximum MSB across our elements gives a crude estimate of the number of distinct elements in the set. If the maximum MSB we’ve seen is 3, given the probabilities above we’d expect around 8 (i.e. 2^3) distinct elements in our set. Of course, this is a terrible estimate on its own, as there are many ways to skew the data. The HyperLogLog algorithm divides the data into evenly-sized buckets and takes the harmonic mean of the maximum MSBs of those buckets.
The harmonic mean is better here since it discounts outliers, reducing the bias in our count. Using more buckets reduces the error in the distinct count calculation, at the expense of time and space. The expected error for m buckets is about 1.04 / sqrt(m), so the number of buckets needed for a desired error rate is:

m = (1.04 / error)^2, rounded up to the next power of two
We’ll aim for a +/- 5% error. Plugging in 0.05 for the error rate gives about 433 buckets, which we round up to 512, the next power of two. Here’s the SQL for grouping MSBs by date and bucket:
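As a quick check on that arithmetic, here’s a hypothetical Python helper (the rounding up to a power of two is an assumption, chosen so the bucket number can be read off the low bits of the hash):

```python
import math

def buckets_for_error(target_error):
    # m = (1.04 / error)^2, rounded up to the next power of two.
    m = (1.04 / target_error) ** 2
    return 1 << math.ceil(math.log2(m))

buckets_for_error(0.05)  # (1.04 / 0.05)^2 = 432.64 -> 512
```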
select
  date(created_at) as created_date,
  hashtext(session_id) & (512 - 1) as bucket_num,
  31 - floor(log(2, min(hashtext(session_id) & ~(1 << 31)))) as bucket_hash
from gameplays
group by 1, 2
order by 1, 2
hashtext(...) & (512 - 1) gives us the rightmost 9 bits (511 in binary is 111111111), and we’re using that for the bucket number. The bucket_hash line uses a min inside the logarithm instead of something like max(31 - floor(log(...))): since 31 - floor(log(2, x)) decreases as x grows, the minimum hash yields the maximum MSB, and we only have to compute the logarithm once per bucket instead of once per row, greatly speeding up the calculation. Now we’ve got 512 rows for each date, one for each bucket, with the maximum MSB for the hashes that fell into that bucket. In future examples we’ll call this select bucketed_data.
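An illustrative Python sketch of the bucketing step (the hash values are hardcoded stand-ins, not real hashtext output):

```python
import math
from collections import defaultdict

M = 512  # number of buckets

def bucket_and_msb(h):
    # Low 9 bits pick the bucket; the cleared-sign-bit hash
    # supplies the MSB position, as in the bucketed_data query.
    h &= ~(1 << 31)
    return h & (M - 1), 31 - int(math.floor(math.log2(h)))

# Keep the maximum MSB seen per bucket (the SQL keeps min(hash),
# which is equivalent: a smaller hash means a higher MSB position).
max_msb = defaultdict(int)
for h in (0x7FFFFFFF, 0x000001FF, 0x12345678):
    bucket, msb = bucket_and_msb(h)
    max_msb[bucket] = max(max_msb[bucket], msb)
```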
It’s time to put together the buckets and the MSBs. The paper linked above has a lengthy discussion on the derivation of this function, so we’ll only recreate the result here. The new variables are m (the number of buckets, 512 in our case) and M (the list of buckets indexed by j, the rows of SQL in our case):

E = alpha_m * m^2 / (sum over j of 2^(-M[j])), where alpha_m = 0.7213 / (1 + 1.079 / m)

The denominator of this equation is the harmonic mean mentioned earlier.
In SQL, it looks like this:
select
  created_date,
  ((pow(512, 2) * (0.7213 / (1 + 1.079 / 512))) /
    ((512 - count(1)) + sum(pow(2, -1 * bucket_hash))))::int as num_uniques,
  512 - count(1) as num_zero_buckets
from bucketed_data
group by 1
order by 1
We add in (512 - count(1)) to account for missing rows. If no hashes fell into a bucket, it won’t be present in the SQL, but an empty bucket should still contribute 2^0 = 1 to the sum; adding 1 per missing row to the result of the sum achieves the same effect. The num_zero_buckets is pulled out for the next step, where we account for sparse data.
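Here’s the same estimate arithmetic as a Python sketch, where a dict of per-bucket MSB maxima stands in for the rows of bucketed_data:

```python
def raw_estimate(max_msb, m=512):
    # Mirrors the counted_data query: alpha_m * m^2 over the
    # harmonic-mean denominator, with each of the (m - len) empty
    # buckets contributing 2^0 = 1 to the sum.
    alpha = 0.7213 / (1 + 1.079 / m)
    zero_buckets = m - len(max_msb)
    denom = zero_buckets + sum(2.0 ** -msb for msb in max_msb.values())
    return alpha * m * m / denom
```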
Almost there! We have distinct counts that will be right most of the time – now we need to correct for the extremes. In future examples, we’ll call this select counted_data.
The results above work great when most of the buckets have data. When a lot of the buckets are zeros (missing rows), the counts get a heavy bias. To correct for that, we switch to this formula whenever the estimate is likely biased (when it is below 2.5 * m and there are empty buckets):

E* = alpha_m * m * log2(m / V), where V is the number of zero buckets
And the SQL for that looks like this:
select
  counted_data.created_date,
  case when num_uniques < 2.5 * 512 and num_zero_buckets > 0
    then ((0.7213 / (1 + 1.079 / 512)) *
      (512 * log(2, (512::numeric) / num_zero_buckets)))::int
    else num_uniques
  end as approx_distinct_count
from counted_data
order by 1
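The case expression translates to a small Python sketch:

```python
import math

def corrected_count(num_uniques, num_zero_buckets, m=512):
    # Fall back to the zero-bucket formula when the raw estimate
    # is small (below 2.5 * m) and some buckets are empty.
    if num_uniques < 2.5 * m and num_zero_buckets > 0:
        alpha = 0.7213 / (1 + 1.079 / m)
        return int(alpha * (m * math.log2(m / num_zero_buckets)))
    return num_uniques

corrected_count(5000, 10)  # large estimate: returned unchanged -> 5000
```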
Now, putting it all together:
select
  counted_data.created_date,
  case when num_uniques < 2.5 * 512 and num_zero_buckets > 0
    then ((0.7213 / (1 + 1.079 / 512)) *
      (512 * log(2, (512::numeric) / num_zero_buckets)))::int
    else num_uniques
  end as approx_distinct_count
from (
  select
    created_date,
    ((pow(512, 2) * (0.7213 / (1 + 1.079 / 512))) /
      ((512 - count(1)) + sum(pow(2, -1 * bucket_hash))))::int as num_uniques,
    512 - count(1) as num_zero_buckets
  from (
    select
      date(created_at) as created_date,
      hashtext(session_id) & (512 - 1) as bucket_num,
      31 - floor(log(2, min(hashtext(session_id) & ~(1 << 31)))) as bucket_hash
    from gameplays
    group by 1, 2
  ) as bucketed_data
  group by 1
  order by 1
) as counted_data
order by 1
And that’s the HyperLogLog probabilistic counter in pure SQL!
The HyperLogLog algorithm really shines when you’re in an environment where you can count distinct in parallel. The results of the bucketed_data step can be combined from multiple nodes into one superset of data, greatly reducing the cross-node overhead usually required when counting distinct elements across a cluster of nodes. You can also preserve the results of the bucketed_data step for later, making it nearly free to update the distinct count of a set on the fly!
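For example, merging two bucketed_data results is just an elementwise max over the per-bucket MSBs, sketched here in Python with dicts standing in for the SQL rows:

```python
def merge_buckets(a, b):
    # Combine two bucketed_data results by taking the max MSB per
    # bucket; the merged dict estimates the distinct count of the
    # union of the two underlying sets.
    merged = dict(a)
    for bucket, msb in b.items():
        merged[bucket] = max(merged.get(bucket, 0), msb)
    return merged
```

Because merging is lossless, each node (or each day) can keep just its 512 bucket maxima and still produce an exact HyperLogLog estimate for any combination of them.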
Hash     MSB Position   Hashes like this
1xxxxx   1              50%
01xxxx   2              25%
001xxx   3              12.5%
0001xx   4              6.25%