Leaving random() Behind
If you’ve ever had to generate sample data in your database, odds are you’ve started with random() or its equivalent. It’s fast and it’s reliable. Unfortunately, it’s also evenly distributed. You’re just as likely to get 0.5 as 0.9. Most real-world data is not like this.
After an even distribution, your most basic distribution of random numbers is a normal distribution. In this post, we’ll show you how to use the Marsaglia Polar Method to convert random()’s uniform distribution into a normal distribution that looks like this:
Two Random Numbers Are Better Than One
The Marsaglia method converts a pair of uniformly distributed random numbers into a pair of normally distributed random numbers.
First, we need a list of random numbers. We’ll use generate_series for brevity. If you’re on Redshift or MySQL, check out the alternatives to generate_series. The Marsaglia method requires numbers between -1 and 1. We’ll generate 100,000 of them:
with numbers as ( select 2 * random() - 1 as x, 2 * random() - 1 as y from generate_series(0, 100000) ),
The Marsaglia method is a type of rejection sampling, which is a fancy way of saying that it throws away inputs that don’t fit the distribution.
Since it’s based on polar coordinates, the sum of the squares of the numbers needs to be less than one. This ends up filtering out about 𝜋/4 (21%) of the data, so make you generate enough numbers at the outset.
polar_points as ( select x, y, x*x + y*y as s from numbers where x*x + y*y < 1 ),
Now we know we have the right set of inputs, so we can turn them into normally distributed pairs. We’ll compute the new pairs separately, and then union the lists together to get one long list of normally distributed numbers.
The formulas are straightforward:
Translating these to SQL, we get:
marsaglia as ( select x * sqrt((-2 * ln(s))/s) as n from polar_points union select y * sqrt((-2 * ln(s))/s) as n from polar_points )
Bringing It All Together
To make a pretty graph of your distribution like the one above, combine the previous steps and then bucket and count the numbers (by rounding them) like so:
with numbers as ( select 2 * random() - 1 as x, 2 * random() - 1 as y from generate_series(0, 100000) ), polar_points as ( select x, y, x*x + y*y as s from numbers where x*x + y*y < 1 ), marsaglia as ( select x * sqrt((-2 * ln(s))/s) as n from polar_points union select y * sqrt((-2 * ln(s))/s) as n from polar_points ) select round(n::numeric, 1), count(1) from marsaglia group by 1 order by 1
Now your randomly generated test data can look a lot more normal!