Everyone wants to get more out of their data, but how exactly to do that can leave you scratching your head. Our BI Best Practices demystify the analytics world and empower you with actionable how-to guidance.

One of the simplest ways to start exploring your data is to aggregate the metrics you are interested in by their relevant dimensions. For example, to explore your company’s revenue, you would analyze it by country, product, time, and so on. In most cases, this kind of analysis leads to insights that can later be translated into business actions. However, it can also produce incorrect results if interpreted poorly. How does this happen? How can good data lead to faulty conclusions?

Let’s start with an example. You’re working for an ice cream company that is about to market a brand new special-edition flavor. After lots of meetings and discussions, two flavors are chosen as finalists: ginger and sugar cookie. Only one flavor will be selected for production. Your department conducts a survey and asks 100 people if they like the ginger ice cream and 100 different people if they like the sugar cookie flavor. Here are the results:

According to your first analysis, ginger is the clear winner: 62% of respondents liked it, compared with 54% for sugar cookie. However, to explore the data further, you decide to break down the likes by gender:

Something strange has happened! When looking at responses by gender, we can see that both men and women preferred the sugar cookie flavor over the ginger. So how do we get totally different results when breaking the data down by gender?

This is an example of Simpson’s paradox, a statistical phenomenon in which a trend that is present when data is put into groups reverses or disappears when the data is combined. It was first introduced by the statistician Edward H. Simpson in 1951 (although other people had described similar effects earlier).

In our example, when the data is split into two groups, we can say that both groups prefer the sugar cookie flavor. But when the data is combined, our conclusion reverses and it seems like ginger is preferable.

It’s time to introduce a new statistical term. A lurking variable (also known as a confounding variable) is an extra variable that was not taken into account during the experiment or analysis and that can lead to wrong conclusions.

In our example, two effects combined together create the paradox:

1. Men liked both flavors less than women did. Maybe men are generally more critical than women when asked about ice cream flavors? We don’t know. Either way, gender is a lurking variable here because we didn’t take it into account when we first analyzed the data.
2. In addition, we can see that the distribution of men and women surveyed is unbalanced. In the ginger flavor survey, 37% of the responders were men and 63% were women, while in the sugar cookie survey, 90% were men and only 10% were women.

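Together, these two effects create the paradox: the sugar cookie survey was dominated by men, the more critical group, which pulled its overall like rate below that of ginger, whose survey was dominated by women.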
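To see how the reversal works arithmetically, here is a minimal Python sketch. The per-gender counts are hypothetical: they were picked only so that they agree with the figures quoted above (62% and 54% overall, 37/63 and 90/10 men/women splits) and are not the survey's actual table. The names `survey`, `like_rate`, and `overall_rate` are made up for this illustration.

```python
# Hypothetical (likes, respondents) counts per flavor and gender.
# Assumption: chosen only to match the overall rates and gender splits
# quoted in the text; the real per-gender numbers are not reproduced here.
survey = {
    "ginger": {"men": (18, 37), "women": (44, 63)},
    "sugar cookie": {"men": (45, 90), "women": (9, 10)},
}


def like_rate(likes, respondents):
    """Share of respondents in one group who liked the flavor."""
    return likes / respondents


def overall_rate(groups):
    """Like rate for a flavor after pooling all gender groups."""
    total_likes = sum(likes for likes, _ in groups.values())
    total_respondents = sum(n for _, n in groups.values())
    return total_likes / total_respondents


for flavor, groups in survey.items():
    total = sum(n for _, n in groups.values())
    print(flavor)
    for gender, (likes, n) in groups.items():
        # Each group's weight is its share of that flavor's respondents.
        print(f"  {gender}: rate={like_rate(likes, n):.1%}, weight={n / total:.0%}")
    print(f"  overall: {overall_rate(groups):.1%}")
```

The overall rate is a weighted average of the per-gender rates, with the weights given by each group's share of respondents. Sugar cookie wins inside each gender, but 90% of its weight sits on the lower-rated group, so its pooled rate ends up below ginger's.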

## Better analysis, better decisions

Which flavor is the real winner? In our case, when we take the lurking variable into consideration, it’s clear that the sugar cookie flavor should be the winner since both men and women prefer it over ginger.

In general, there is no rule of thumb for when data should be partitioned and when it should be combined. It really depends on the circumstances. As an example, I’ll present a case from The Book of Why by Judea Pearl and Dana Mackenzie.

A new drug promising to reduce the risk of heart attack was tested with two groups. The participants of the first group, the control group, did not use the drug, while the participants of the second group, the treatment group, did. The results show the proportions of people who had a heart attack:

Again, we see Simpson’s paradox in the results. When the data is combined, it seems that the drug reduces the risk of getting a heart attack. On the other hand, when the results are grouped by gender, we can observe that for both men and women, the risk of getting a heart attack after using the drug increases. “This drug seems to be bad for women, bad for men, but good for people!”

Of course, that statement doesn’t make sense. The paradox can be resolved by better understanding the data: exploring how it was generated and identifying the lurking variable. This isn’t a randomized controlled trial (RCT), but an observational study in which people decide for themselves whether to take the drug. In this study, women preferred to take the drug (⅔ of women took it) and men preferred not to (only ⅓ of men took it). In addition, men are at a greater risk of having a heart attack overall. Gender therefore affects both the target variable (heart attack) and the decision to take the drug, so in this case it is correct to analyze the data by gender. The drug is actually bad for women, bad for men, and bad for people.
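One way to make "analyze the data by gender" concrete is to compute a gender-adjusted risk: take the risk within each gender and average it using the gender mix of the whole study rather than the mix inside each arm. The sketch below assumes illustrative counts that match the proportions described above (two thirds of women and one third of men took the drug, and men are at higher risk); treat them as assumptions rather than the study's published table. The names `trial`, `crude_risk`, and `adjusted_risk` are hypothetical.

```python
# Assumed (heart attacks, participants) counts per arm and gender.
# Assumption: illustrative numbers consistent with the text's proportions.
trial = {
    "control": {"women": (1, 20), "men": (12, 40)},
    "treatment": {"women": (3, 40), "men": (8, 20)},
}


def crude_risk(arm):
    """Heart attack risk in one arm, ignoring gender."""
    events = sum(e for e, _ in trial[arm].values())
    participants = sum(n for _, n in trial[arm].values())
    return events / participants


def gender_shares():
    """Share of each gender across the whole study (both arms pooled)."""
    totals = {}
    for groups in trial.values():
        for gender, (_, n) in groups.items():
            totals[gender] = totals.get(gender, 0) + n
    grand_total = sum(totals.values())
    return {gender: n / grand_total for gender, n in totals.items()}


def adjusted_risk(arm):
    """Per-gender risk, averaged with the study-wide gender mix."""
    shares = gender_shares()
    return sum((e / n) * shares[gender] for gender, (e, n) in trial[arm].items())


for arm in trial:
    print(f"{arm}: crude={crude_risk(arm):.1%}, adjusted={adjusted_risk(arm):.1%}")
```

With these assumed counts, the crude comparison favors the drug, while the adjusted comparison shows a higher risk in the treatment arm, which agrees with the per-gender picture. This kind of adjustment is only appropriate when the variable you adjust for is a confounder, as gender is here.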

Now, let’s check a slightly different case in which grouping the data leads to incorrect results. Continuing the previous example, let’s assume that high blood pressure is known to be a cause of heart attacks and that the goal of the test drug is to reduce blood pressure. The researchers wanted to check whether the drug would also reduce the risk of heart attacks. They measured both the blood pressure of the participants and whether they had a heart attack. All the participants had high blood pressure at the beginning.

Notice that the numbers are exactly the same as in the previous example. Yet, since blood pressure doesn’t affect the decision to take the drug (it is a result of taking the drug, not a cause of it), focusing on the combined data is correct. We can see that the drug reduced blood pressure among participants in the treatment group. It also reduced their risk of heart attack.

To better understand when the data should be grouped, you should be familiar with causal inference. If you don’t have the time to read “The Book of Why,” you can refer to Towards Data Science.

## How common is Simpson’s paradox?

In 2009, researchers suggested that Simpson’s paradox may occur more often than commonly thought. (See “How likely is Simpson’s paradox?”) They showed that the paradox occurred in 1.67% of cases simulated with uniformly distributed random data. Another study used empirical examples to show that the paradox can occur in practice and that people are often poor at recognizing it. (See Kievit, Rogier, et al.)
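If you want a feel for this kind of estimate, the following Monte Carlo sketch draws random 2×2×2 tables (exposure × outcome × stratum) and counts how often the within-stratum trend reverses when the strata are pooled. It rests on assumptions: the cell weights are independent exponential draws, which corresponds to cell probabilities uniform over all possible tables, and reversals in either direction are counted. Whether this matches the cited paper's exact setup is not confirmed here, so read the result as a rough order of magnitude. The function names are made up for this illustration.

```python
import random


def risk(cells, exposure, stratum=None):
    """P(outcome = 1 | exposure [, stratum]) from unnormalised cell weights.

    cells[exposure][outcome][stratum] holds a non-negative weight.
    """
    strata = (0, 1) if stratum is None else (stratum,)
    positives = sum(cells[exposure][1][s] for s in strata)
    total = sum(cells[exposure][o][s] for o in (0, 1) for s in strata)
    return positives / total


def is_simpson_reversal(cells):
    """True if both strata agree on a direction and the pooled data disagrees."""
    within = [risk(cells, 1, s) - risk(cells, 0, s) for s in (0, 1)]
    pooled = risk(cells, 1) - risk(cells, 0)
    both_up = all(diff > 0 for diff in within)
    both_down = all(diff < 0 for diff in within)
    return (both_up and pooled < 0) or (both_down and pooled > 0)


def estimate_frequency(n_tables=50_000, seed=0):
    """Fraction of random tables that show a reversal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tables):
        # Eight independent Exp(1) weights; only their ratios matter.
        cells = [
            [[rng.expovariate(1.0) for _ in (0, 1)] for _ in (0, 1)]
            for _ in (0, 1)
        ]
        hits += is_simpson_reversal(cells)
    return hits / n_tables


print(f"estimated frequency: {estimate_frequency():.2%}")
```

Under these assumptions the estimate should come out small, somewhere around one to two percent of tables, and it will vary slightly with the seed and the number of simulated tables.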

## Tackling Simpson’s paradox in correlations

Simpson’s paradox can also arise in correlations, when two variables appear to be correlated in one direction (positive or negative) but the direction reverses once the data is broken down by a dimension. A very nice example was demonstrated in a blog post by Jon Wayland:

Teachers investigated the effect of students’ study time before tests on their test scores. The results were very surprising and indicated a strong negative correlation (-0.7981) between study time and scores (the less a student studied, the higher they tended to score on tests).

When the data is broken down by course, the correlation reverses and we can see that investing more time in studying is worth the effort!

In this case, the course difficulty is a lurking variable — it affects both test results and the number of hours needed for preparation.
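The same effect is easy to reproduce with a few made-up numbers. The sketch below uses invented data, not the data from Jon Wayland's post: students in a hypothetical hard course study longer and score lower than students in a hypothetical easy course, while within each course more study time goes with higher scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented (study hours, test score) pairs; course names are hypothetical.
courses = {
    "easy course": [(1, 82), (2, 85), (3, 84), (4, 88), (5, 90)],
    "hard course": [(6, 62), (7, 63), (8, 67), (9, 66), (10, 70)],
}

# Pooled view: ignore the course and correlate all students together.
all_points = [point for points in courses.values() for point in points]
pooled_r = pearson([h for h, _ in all_points], [s for _, s in all_points])
print(f"pooled: r = {pooled_r:.2f}")

# Grouped view: correlate study time and score inside each course.
for course, points in courses.items():
    hours = [h for h, _ in points]
    scores = [s for _, s in points]
    print(f"{course}: r = {pearson(hours, scores):.2f}")
```

The pooled correlation comes out negative because the gap between the two courses dominates, while the correlation inside each course is positive.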

## Forewarned is forearmed

Simpson’s paradox, when it goes unnoticed, can lead to wrong conclusions and bad decisions. It’s important to be aware of this phenomenon when analyzing your data. Knowing your data, understanding how it was generated, and getting a handle on confounding variables are all crucial if you’re going to make smarter data-driven decisions!

Ayelet Arditi is a data scientist on the AI research team at Sisense, constantly improving the platform’s data and analytics capabilities to enable users to build and consume AI applications for augmented analytics, automatic data preparation, and conversational data exploration.