When dealing with massive quantities of customer data, it can be difficult to answer simple questions like: is a customer going to churn or not? In this post, we explore some basics of judgment making using a statistical method called hypothesis testing.
A hypothesis is a proposal on the underlying relationship between two data sets. Through hypothesis testing, we can make an inference on this relationship and the underlying unknown factors by manipulating our observable data.
There are two hypotheses in this test – the null hypothesis and the alternative hypothesis. The hypothesis test involves making a comparison between the two hypotheses and deciding whether or not we reject the null hypothesis according to our observable data.
In order to evaluate the null hypothesis, we will need to identify the p-value. The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set if the null hypothesis is true.
In other words, the p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative.
To calculate this p-value, we first derive the Z-score, then use this Z-score to calculate the probability using a distribution. Since we will be working with sampled observations, we will derive the p-value by using the cumulative value of the t-distribution.
The first step to calculating the p-value is the Z-score, whose formula looks like the following:
For this Z-score, we need each of the components: sample mean and sample standard error.
Once we have the Z-score and the corresponding mapping of the p-value through the t-distribution, we make a judgment decision on the null hypothesis. In most statistical cases, we judge a p-value to be statistically significant against the null hypothesis when the p-value is 5% or less.
If the p-value is less than 5%, it is very unlikely (with < 5% probability) that the null hypothesis is actually true when we’ve rejected it. As such, when the p-value crosses the 5% threshold, we reject the null hypothesis in favor of the alternative hypothesis.
Application to Customer Usage Data
To test out this statistical method, let’s take a look at some daily active user data for a mobile game. The question we want to answer is: Has our top user from our product significantly played less in the past month than historically, implying that this user will churn? From an initial exploration of the data, our top user is Julian Ford:
Julian has played our game 1874 times since he first signed up. To see if Julian’s usage from the most recent month is significantly lower, we want to set up a hypothesis test. First, we draft out the null and alternative hypotheses:
H0 (null hypothesis): There is no difference between Julian’s daily gameplays average in the most recent month and his historical daily average. Ha (alternative hypothesis): Julian’s daily average gameplays is less in the most recent month than his historical daily average.
To conduct our hypothesis setting, we first clean the data and get the summary statistics needed for computing the Z-score. Our original data in Redshift looks like:
with all_data as ( select email , date_trunc('day', gameplays.created_at) as date , count(1) from gameplays join users on users.id = user_id where users.first_name = 'Julian' and users.last_name = 'Ford' group by 1, 2 )
We create more CTEs to compute the sample average, sample standard deviation, and sample size for each data sets:
... , historical_stats as ( select 'historical average' as category , avg(daily_plays) as avg , STDDEV_SAMP(daily_plays) as stdev , count(*) from all_data where date < (getdate() - interval '30 days') group by 1,2 ) ... , recent_stats as ( select 'recent average' as category , avg(daily_plays) as avg , STDDEV_SAMP(daily_plays) as stdev , count(*) from all_data where date >= (getdate() - interval '30 days') group by 1,2 )
Next, we create a user-defined function (UDF) in Redshift to calculate our p-value. This user-defined function is based on our discussion of UDFs in our A/B Testing blog post.
In this user-defined function, we use Python’s SciPy library to calculate the p-value of the Z-score. The family of functions we’ll use falls under the group stats.t.
Since we are comparing two datasets, we will use the modified standard error formula to take into consideration the distribution of both samples:
create or replace function hypotest ( historical_avg integer , historical_stdev integer , historical_count integer , recent_avg integer , recent_stdev integer , recent_count integer ) returns float stable as $$ from scipy.stats import t def standard_error (a_stdev, a_count, b_stdev, b_count): return (a_stdev**2/a_count+b_stdev**2/b_count) ** 0.5 def zscore (a_avg, b_avg, standard_error): return (b_avg-a_avg)/standard_error def df(a_count, b_count): return min (a_count, b_count) - 1 def p_value (zscore, df): return t.cdf (zscore, df) test_df = df (historical_count, recent_count) test_se = standard_error (historical_stdev, historical_count , recent_stdev, recent_count) test_zscore = zscore (historical_avg, recent_avg, test_se) return p_value ( test_zscore , test_df ) $$ language plpythonu
We conclude our query with the output of the function:
select hypotest ( historical_stats.avg , historical_stats.stddev , historical_stats.count , recent_stats.avg , recent_stats.stddev , recent_stats.count )
And the result from our test is 4.80%
From this result, we have evidence convincing enough to reject the null hypothesis that there was no difference between average daily gameplays for Julian’s most recent month versus his historical daily average.
As such, with Julian’s performance from the most recent month, our hypothesis test identifies him as someone who will probably churn. With the power of hypothesis testing, we can more easily pinpoint customers who have been losing interest in our games, and could benefit from a friendly outreach!