Analyze with Insight Miner is a technology developed by Sisense that uses machine learning to identify statistically important insights in dashboards. We apply a vast range of analytical and statistical tests to detect interesting insights and compute common exploratory analysis, such as cohort, decision tree, bivariate analysis, and more.
As the lead of the team who built this technology, in this post, I’ll introduce a high-level flow of data in Analyze with Insight Miner – starting from reading the data, cleaning it, and lastly generating insights. It’s important to keep in mind during this section that we needed a data flow that is robust and generic enough to handle many types of datasets from different domains and with different distributions. Following that, I will discuss how we detect non-relevant or too obvious insights.
Analyze With Insight Miner
Analyze with Insight Miner generates insights based on tabular datasets. Imagine you have a big table with lots of columns and want to explore the impact of all the columns on one specific column. This one column will be the target variable, i.e. the column that we are interested to explore. All other columns will be referred to as the explaining variables, which are the variables by which we want to explain the target.
Using Analyze with Insight Miner, some typical insights a user might see are:
“When the age of the customer is larger than 60 in the US, the likelihood of the customer to churn is 5%, compared to a 30% likelihood in the entire population.”
“We detect a decreasing trend in the likelihood to churn over time in Italy.”
If we look at the examples above, the target variable is the churn column (a binary column describing if a customer churned or not) and the explaining variables are the age of the customers and their countries.
But first things first…
Before Analyze with Insight Miner starts looking for insights, it needs to preprocess the dataset. It starts by understanding the type of the variables in the dataset (are they numeric, categorical, dates, and so on). It then applies some preprocessing steps. For instance, many statistical tests are not robust for outliers so it needs to detect and remove them.
On top of this, if there are missing values in the dataset, Analyze with Insight Miner needs to decide what to do with them. In some cases, it imputes them. A common approach for imputation if you’re working with a numeric column, for example, is to replace the missing values with the average of all other values of the same column. Imputation of values in a categorical column can be done by replacing the missing values with the most common value of the column.
After applying a sequence of cleansing and preprocessing steps it can get to the interesting part – generating insights.
In this step, Analyze with Insight Miner conducts a set of tests to discover patterns in the data or subpopulations that are different than the general population. For example, it creates combinations of subgroups (such as customers under 35 from the US) and checks if their target variable distribution is significantly different than the others (for example, the likelihood of this group to churn is significantly lower than all the others).
To form these groups Analyze with Insight Miner uses, among other techniques, decision trees. A decision tree is a popular machine learning algorithm used both for classification and regression. One of its strengths is the ability to interpret its logic, in contrast to many other machine learning algorithms that are black boxes. Analyze with Insight Miner trains a decision tree on the target variable using the explaining variables. Below you can see an example for a decision tree.
We can extract several insights from this decision tree. For example, females above 35 are likely to churn 23%.
Now that we understand the basics behind how Analyze with Insight Miner works I would like to present one of the techniques it uses to avoid generating insights that are obvious and uninteresting for the end user.
What Are Obvious Insights?
One of our assumptions in developing Analyze with Insight Miner was that end users don’t necessarily need to have the knowledge of constructing the right dataset that is most suitable for the exploratory analysis.
Continuing from the previous example, let’s assume that we are interested in uncovering insights related to the churn patterns of our customers. In this case, our target variable can be a binary column indicating if a customer churned or renewed his account. The explaining variables can be columns related to the customers’ demographics, past usage, and so on.
Below we can see a mini example for such a dataset. The last column in this dataset (“IsRenewal”) indicates if the customer renewed his account and is the complete opposite of “IsChurn”. In this case, we’ll return a very strong insight – “when customers renew their account they don’t churn.” Unfortunately, for obvious reasons, this is not interesting at all.
This example is easy to detect since there is a one-to-one mapping between the target variable and the “IsRenewal” explaining variable. But what if the relationship between the variables is not so straightforward? How can we detect those relationships automatically and avoid obvious insights?
How to Avoid Generating Obvious Insights
One approach to avoid generating obvious insights is to apply what we refer to as “opposite feature selection.”
Before we dive into the details, let’s first define feature selection. Feature selection is a well-known process in machine learning that selects a subset of features (columns) to use in the model construction.
Eventually, after applying feature selection, you would stay with the subset of features that are the most informative/predictive and drop the redundant or irrelevant. There are a few reasons to use feature selection techniques prior to model construction. Among them are the simplification of models, avoiding the curse of dimensionality, and reducing overfitting.
While there are many feature selection techniques, we’ll focus on mutual information based technique. Mutual information is a measure taken from information theory that measures the mutual dependence of two random variables. Intuitively, mutual information measures how much knowing one of these variables reduces uncertainty about the other.
In our example, the mutual information between the target variable “IsChurn” and the “IsRenewal” column will be very high since knowing the values of one of these columns completely reduces the uncertainty of the other. To apply feature selection, we can rank the explaining variables by their mutual information with the target variable and select the top K variables with the highest mutual information.
Our problem is a bit different than feature selection in which we want to choose the variables with the highest dependency on the target. Here, we want to avoid obvious insights that are caused by variables that are too dependent on the target. So, it stands to reason that we can do the opposite from feature selection and remove the variables with very high mutual information, right?
Not exactly. To do so, we will need to define a threshold such that all variables with mutual information higher than 0.95 are discarded. The problem is that mutual information is not a normalized measure – it can get any non-negative, real value. To do so we can use the normalized version of mutual information, which yields values between 0 to 1. When the mutual information values are normalized we can set a threshold and remove columns with normalized mutual information above it.
Another technique for selecting the most relevant and interesting insights is to learn from user feedback. This can be accomplished in several ways. We can explicitly ask the users for feedback by presenting “Like” or “Dislike” buttons next to the insights they receive.
We can also implicitly get feedback from the users’ usage. We can consider insights that were shared with other users as interesting insights or we can detect if a new widget was constructed based on the insight presented. By doing this we can learn using machine learning models the patterns of interesting insights and predict in advance how interesting a user will find specific insights.
More to Come
Outsourcing data science often doesn’t live up to customer expectations because there needs to be a strong understanding of the domain data for insights to be delivered. However, with a basic overview of how Analyze with Insight Miner works, it’s easy to understand how it can help data engineers, analysts, and developers scale delivery of hidden insights beyond predefined dashboards to their end users.
I’m so excited to share this feature with all of you. We are planning on adding some more new cool features in the future and integrating Analyze with Insight Miner with other features in Sisense – so stay tuned!