In a previous post, we learned what Machine Learning (ML) classification problems are and saw how Naive Bayes can be used to solve one of them: sentiment analysis, which detects whether text is positive or negative. In this post, we are going to learn about Support Vector Machines (SVM), another popular technique used for classification problems. We are going to use this technique to predict whether someone is likely to have diabetes using predictor factors such as age, number of pregnancies, insulin levels, glucose levels, and more.
Diabetes is a chronic illness affecting many people and is characterized by the presence of high blood sugar levels. Early detection is important since diabetes detected in early stages can be controlled by lifestyle changes and/or minimal medication. Diabetes prediction serves as a useful reference for doctors because they can order further tests to detect diabetes early.
Preparing Our Training Data
The training data we are going to use for this problem is the Pima Indian Diabetes database. The dataset contains several predictor factors for diabetes and an outcome. The outcome indicates whether the person has diabetes (1) or not (0). In ML terms, these predictor factors are called features.
As usual, the first step in the ML process is preparing the training data. The dataset is available as CSV, so we can import the data using our CSV upload feature. Once the data has been imported, it needs to be filtered to include only the relevant features for training and cleaned to remove duplicates and missing data. This is best done using SQL, the most popular language for data analysts. Here’s the SQL I used to prepare this dataset for ML analysis:
select
  PREGNANCIES
  , GLUCOSE
  , BLOODPRESSURE
  , INSULIN
  , BMI
  , AGE
  , OUTCOME
from
  [pima_indian_diabetes]
where
  PREGNANCIES is not null
  and GLUCOSE is not null
  and BLOODPRESSURE is not null
  and INSULIN is not null
  and BMI is not null
  and AGE is not null
  and OUTCOME is not null
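For readers preparing the data outside SQL, the same cleaning step can be sketched in pandas. This is only an illustration under assumptions: the rows below are made-up stand-ins rather than real records, and the `drop_duplicates` call is an addition that the SQL above does not perform.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the diabetes CSV.
raw = pd.DataFrame({
    "PREGNANCIES":   [6, 1, 8, 1, 1],
    "GLUCOSE":       [148, 85, 183, 85, 89],
    "BLOODPRESSURE": [72, 66, 64, 66, 66],
    "SKINTHICKNESS": [35, 29, 0, 29, 23],      # a column we do not train on
    "INSULIN":       [0, 0, None, 0, 94],      # one missing value
    "BMI":           [33.6, 26.6, 23.3, 26.6, 28.1],
    "AGE":           [50, 31, 32, 31, 21],
    "OUTCOME":       [1, 0, 1, 0, 0],
})

columns = ["PREGNANCIES", "GLUCOSE", "BLOODPRESSURE", "INSULIN", "BMI", "AGE", "OUTCOME"]

clean = (
    raw[columns]              # keep only the training features plus the outcome
    .dropna()                 # same effect as the "is not null" conditions
    .drop_duplicates()        # not in the SQL; added here to remove repeated rows
    .reset_index(drop=True)
)
print(clean)
```

In this toy input, one row is dropped for its missing insulin value and one for being an exact repeat, leaving three rows.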
The result of that query is a table like this:
Once the data has been cleaned, we will use it as our training data. It’s ready to be fed into our ML algorithm (Support Vector Machine) to build our model. Before we do that, let me explain Support Vector Machines.
Understanding Support Vector Machines
To understand SVM, we need to understand what an N-dimensional hypercube is. To explain that concept, we’ll start with shapes and geometry.
A point is a 0-dimensional shape. It has no axis and no size.
A line is a 1-dimensional shape. There is a single axis. A point on a line is represented by a single variable (x), which represents the distance of the point from some origin.
A square is a 2-dimensional shape. There are 2 axes. A point on a square is represented by two variables (x, y), where (x) represents the distance of the point on X-axis and (y) represents the distance of the point on Y-axis.
A cube is a 3-dimensional shape. There are 3 axes. A point on a cube is represented by three variables (x, y, z), where (x) represents the distance of the point on the X-axis, (y) represents the distance of the point on the Y-axis, and (z) represents the distance of the point on the Z-axis.
Extending this idea, an N-dimensional hypercube is an N-dimensional shape. There are N axes. A point on this hypercube is represented by N number of variables. Our human eyes are only capable of visualizing up to three dimensions, so we are going to have to imagine this shape.
The second concept is how to divide an N-dimensional hypercube. Start with the simplest case: a line can be divided into two sections using a point.
A square can be divided into two sections using a line.
A cube can be divided into two sections using a 2-D plane.
Extending this idea, an N-dimensional hypercube can be divided into two sections using an (N-1)-dimensional hyperplane.
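One common way to express such a division in code is to describe the hyperplane as all points x where w · x + b = 0. The sign of w · x + b then says which of the two sections a point falls in. The sketch below uses a made-up plane in 3 dimensions; the values of `w` and `b` and the helper `side` are purely illustrative.

```python
import numpy as np

# Hypothetical dividing plane in 3-D: x + y + z = 3
w = np.array([1.0, 1.0, 1.0])   # direction perpendicular to the plane
b = -3.0                        # offset of the plane from the origin

def side(point):
    """Return -1 or +1 for the two sections, or 0 if the point is on the plane."""
    return int(np.sign(np.dot(w, point) + b))

print(side([0, 0, 0]))   # -> -1 (0 + 0 + 0 - 3 is negative)
print(side([2, 2, 2]))   # -> 1  (2 + 2 + 2 - 3 is positive)
print(side([1, 1, 1]))   # -> 0  (exactly on the plane)
```

The same expression works unchanged for any number of dimensions, because `w` and the point simply gain more entries.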
In the training phase, the SVM algorithm first draws an N-dimensional hypercube by representing each feature as a separate dimension. It then uses the numerical values of those features to plot points on the N-dimensional hypercube. Finally, it attempts to find a boundary that separates the two classes of data: for example, points where the outcome is 0 (no diabetes) and points where the outcome is 1 (diabetes). The boundary is an (N-1)-dimensional hyperplane.
Here is an example boundary (a line) when there are two features.
Here is an example boundary (a 2D plane) when there are three features.
If there are more than two classes of data, then the SVM algorithm draws more hyperplanes.
In the testing phase, we can start with real-time data about a patient such as age, number of pregnancies, insulin levels, and so on. The SVM algorithm determines a 1/0 outcome about diabetes based on which side of that boundary the data falls on.
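The training and testing phases described above can be sketched with scikit-learn on a tiny, made-up two-feature dataset (not the diabetes data). The sketch shows that the predicted class matches the side of the boundary a new point falls on, which scikit-learn exposes as the sign of `decision_function`.

```python
import numpy as np
from sklearn import svm

# Made-up training points with two features each
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class 0
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# Training phase: find the separating boundary
model = svm.SVC(kernel="linear")
model.fit(X, y)

# Testing phase: classify points the model has not seen
new_points = np.array([[1.0, 2.0], [6.0, 6.0]])
scores = model.decision_function(new_points)   # signed score relative to the boundary
labels = model.predict(new_points)

# A positive score means the point lies on the class-1 side of the boundary
for point, score, label in zip(new_points, scores, labels):
    print(point, round(float(score), 3), label)
```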
Why is this algorithm called Support Vector Machines? To accurately classify all the data points, the SVM algorithm needs to find the optimum hyperplane between the two classes. The optimum hyperplane is the one that maximizes the margin between the two classes. The data points, also known as vectors, that lie closest to the hyperplane are called Support Vectors, which gives the name Support Vector Machines to the algorithm.
Support Vectors are the most important data points of the training dataset. If these data points are removed from the training dataset, the position of the dividing hyperplane would change. They are also the data points that are the most difficult to classify.
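A small experiment makes this concrete. The sketch below uses made-up, cleanly separable points and a large `C` value (an assumption chosen so that scikit-learn behaves close to a hard-margin SVM). It refits the model once with only the support vectors and once with one support vector removed.

```python
import numpy as np
from sklearn import svm

X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 1.0],    # class 0
              [4.0, 4.0], [5.0, 5.0], [6.0, 4.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = svm.SVC(kernel="linear", C=1000).fit(X, y)
print("support vector indices:", model.support_)
print("support vectors:", model.support_vectors_)

# Refit using only the support vectors: the hyperplane should stay the same
keep = model.support_
sv_only = svm.SVC(kernel="linear", C=1000).fit(X[keep], y[keep])
same = np.allclose(model.coef_, sv_only.coef_, atol=1e-2)

# Refit after removing one support vector: the hyperplane should move
mask = np.ones(len(X), dtype=bool)
mask[model.support_[0]] = False
without_sv = svm.SVC(kernel="linear", C=1000).fit(X[mask], y[mask])
moved = not np.allclose(model.coef_, without_sv.coef_, atol=1e-2)

print("unchanged with only support vectors:", same)
print("changed after removing a support vector:", moved)
```

Here the support vectors should be the two closest points of opposite classes, (2, 2) and (4, 4); the other four points could be deleted without affecting the boundary.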
An ideal SVM analysis produces a hyperplane that perfectly separates the data points into two non-overlapping classes, like in the picture above. However, perfect separation is not always possible. Even when it is possible, forcing perfect separation of the training data may result in a model that misclassifies many new data points. In these situations, the SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.
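In scikit-learn, this trade-off between a wide margin and few misclassifications is controlled by the `C` parameter of `SVC`. The sketch below uses made-up data in which each class has one point sitting inside the other cluster, so at least two of the ten points are misclassified whichever straight line is chosen. The three `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import svm

# Made-up, deliberately overlapping data
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [5.5, 5.5],    # class 0
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [1.5, 1.5]])   # class 1
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# A small C tolerates more margin violations; a large C penalizes them heavily
results = {}
for C in [0.01, 1.0, 100.0]:
    model = svm.SVC(kernel="linear", C=C).fit(X, y)
    accuracy = model.score(X, y)          # fraction of training points classified correctly
    n_support = len(model.support_)       # number of support vectors
    results[C] = (accuracy, n_support)
    print(f"C={C}: training accuracy={accuracy:.2f}, support vectors={n_support}")
```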
The simplest way to separate data into two classes is with a straight line when there are 2 features, a 2-D plane when there are 3 features, or an N-dimensional hyperplane when there are (N+1) features. These separations are called linear separations. There are many situations where a non-linear region can separate the data more effectively, with fewer misclassifications. SVM can handle these cases using non-linear kernel functions. The most common of these is RBF (Radial Basis Function). Others are polynomial and sigmoid kernel functions. While performing deep analysis, it is important to try different kernel functions and pick the one that provides the best results, ideally measured on data held out from training.
Below is an example where non-linear separation performs better than any linear separation.
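One way to see this effect in code is to compare kernels on synthetic data where one class forms a ring around the other. The sketch below uses scikit-learn's `make_circles` helper rather than the diabetes dataset, and the sample size, noise level, and 5-fold cross-validation are arbitrary choices for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

# Synthetic data: an inner cluster surrounded by an outer ring,
# so no straight line can separate the two classes
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = svm.SVC(kernel=kernel)
    # Mean accuracy across 5 cross-validation folds
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()
    print(kernel, round(scores[kernel], 3))
```

On data shaped like this, the RBF kernel should score far higher than the linear kernel. On other datasets the ranking can differ, which is why it is worth trying several kernels.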
Applying Support Vector Machines
The next step is to build our model using Support Vector Machines. The output of the SQL query above is available as a dataframe (df). The scikit-learn package provides an SVM implementation, which we import. The code for building our model is below. We select the features we want to include and pass them, along with the outcomes, to the fit method of SVC (Support Vector Classifier). This builds the model. Note that we are using the linear kernel function.
# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
from sklearn import svm

outcomes = df['OUTCOME']
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
features = df[['PREGNANCIES', 'GLUCOSE', 'BLOODPRESSURE', 'INSULIN', 'BMI', 'AGE']].to_numpy()

# Train a Support Vector Classifier with a linear kernel
model = svm.SVC(kernel='linear')
model.fit(features, outcomes)
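Before relying on the model, it can help to check how it performs on rows it was not trained on. The sketch below is one possible way to do that, not part of the original workflow. It builds a synthetic stand-in for `df` with the same column names (random values, not real patient data), holds out 25% of the rows, and reports accuracy on them.

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 'df'; only GLUCOSE is made to depend on the outcome
rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "PREGNANCIES": rng.integers(0, 10, size=n),
    "GLUCOSE": rng.normal(100, 10, size=n) + 50 * outcome,
    "BLOODPRESSURE": rng.normal(70, 10, size=n),
    "INSULIN": rng.normal(120, 40, size=n),
    "BMI": rng.normal(30, 5, size=n),
    "AGE": rng.integers(21, 70, size=n),
    "OUTCOME": outcome,
})

feature_names = ["PREGNANCIES", "GLUCOSE", "BLOODPRESSURE", "INSULIN", "BMI", "AGE"]
features = df[feature_names].to_numpy()
outcomes = df["OUTCOME"].to_numpy()

# Hold out a quarter of the rows for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    features, outcomes, test_size=0.25, stratify=outcomes, random_state=0
)

model = svm.SVC(kernel="linear")
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # fraction of held-out rows predicted correctly
print("held-out accuracy:", round(accuracy, 3))
```

With real data, replace the synthetic dataframe with the output of the SQL query; the accuracy on the real dataset will differ from this toy example.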
After that Python code has been run, we are ready to test our model. Values can be entered manually in the Python code or automatically by setting up filters to pass values from the dashboard.
Once the filters are set up, modify the SQL to pass the input values from a filter into Python code.
select
  PREGNANCIES
  , GLUCOSE
  , BLOODPRESSURE
  , INSULIN
  , BMI
  , AGE
  , OUTCOME
  , '[INPUT_PREGNANCIES]' as INPUT_PREGNANCIES
  , '[INPUT_GLUCOSE]' as INPUT_GLUCOSE
  , '[INPUT_BLOOD_PRESSURE]' as INPUT_BLOOD_PRESSURE
  , '[INPUT_INSULIN]' as INPUT_INSULIN
  , '[INPUT_BMI]' as INPUT_BMI
  , '[INPUT_AGE]' as INPUT_AGE
from
  [pima_indian_diabetes]
where
  PREGNANCIES is not null
  and GLUCOSE is not null
  and BLOODPRESSURE is not null
  and INSULIN is not null
  and BMI is not null
  and AGE is not null
  and OUTCOME is not null
In the Python code, we reference the values passed from the dashboard through the filters.
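# Every row carries the same filter values, so read them from the first row.
# Assumption: the values arrive as strings (they are quoted in the SQL), so convert to float.
input_columns = ['INPUT_PREGNANCIES', 'INPUT_GLUCOSE', 'INPUT_BLOOD_PRESSURE',
                 'INPUT_INSULIN', 'INPUT_BMI', 'INPUT_AGE']
inputs = [float(df[column].iloc[0]) for column in input_columns]

result = model.predict([inputs])
sisense.text('DIABETES' if result[0] == 1 else 'NO DIABETES')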
This allows us to invoke the diabetes predictor by supplying values directly from the dashboard.
Visualizing the Hyperplane and Support Vectors
Since we cannot visualize data when there are so many dimensions, let’s pick only 2 dimensions for visualizing our hyperplane: insulin levels and age. For illustrative purposes, we can filter the data to only include patients who are older than 30 and have serum insulin levels over 350 μU/ml, which gives a separation without misclassifications.
select
  PREGNANCIES
  , GLUCOSE
  , BLOODPRESSURE
  , INSULIN
  , BMI
  , AGE
  , OUTCOME
  , '[INPUT_PREGNANCIES]' as INPUT_PREGNANCIES
  , '[INPUT_GLUCOSE]' as INPUT_GLUCOSE
  , '[INPUT_BLOOD_PRESSURE]' as INPUT_BLOOD_PRESSURE
  , '[INPUT_INSULIN]' as INPUT_INSULIN
  , '[INPUT_BMI]' as INPUT_BMI
  , '[INPUT_AGE]' as INPUT_AGE
from
  [pima_indian_diabetes]
where
  PREGNANCIES is not null
  and GLUCOSE is not null
  and BLOODPRESSURE is not null
  and INSULIN is not null
  and BMI is not null
  and AGE is not null
  and OUTCOME is not null
  and INSULIN > 350
  and age > 30
limit 10
Now let’s plot insulin levels vs. age and see what the visualization looks like. The code for that chart is below.
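import pandas as pd
import seaborn as sns

# Scatter plot of insulin vs. age, colored by outcome, with no regression line.
# Recent seaborn versions require x and y to be passed as keyword arguments.
data_plot = sns.lmplot(x='INSULIN', y='AGE', data=df, hue='OUTCOME', fit_reg=False)
sisense.image(data_plot)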
The output is a chart like this:
Next, let’s use the piece of code below to draw the separating hyperplane and parallels that pass through the Support Vectors for this data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm
import numpy as np

# Plot the data points
data_plot = sns.lmplot(x='INSULIN', y='AGE', data=df, hue='OUTCOME', fit_reg=False)

outcomes = df['OUTCOME']
features = df[['INSULIN', 'AGE']].to_numpy()
model = svm.SVC(kernel='linear')
model.fit(features, outcomes)

# Plot the separating hyperplane: w[0] * insulin + w[1] * age + intercept = 0
w = model.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(30, 800)
yy = a * xx - (model.intercept_[0]) / w[1]
plt.plot(xx, yy, linewidth=2, color='black')

# Plot the parallels to the hyperplane that pass through the support vectors
b = model.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = model.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

# Circle the support vectors
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k')

sisense.image(data_plot)
This results in the hyperplane and parallels (dotted lines) below, which pass through the Support Vectors.
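The same fitted attributes also give the size of the margin. For a linear SVM whose parallels pass through the support vectors, the distance between the two parallels is 2 / ||w||, where w is the coefficient vector. The sketch below checks this on made-up, separable points rather than the diabetes data, with a large `C` assumed so that the fit is close to a hard margin.

```python
import numpy as np
from sklearn import svm

X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 1.0],    # class 0
              [4.0, 4.0], [5.0, 5.0], [6.0, 4.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = svm.SVC(kernel="linear", C=1000).fit(X, y)

w = model.coef_[0]
intercept = model.intercept_[0]

# Boundary written as y = slope * x + offset, as in the plotting code
slope = -w[0] / w[1]
offset = -intercept / w[1]

# Distance between the two dashed parallels
margin = 2 / np.linalg.norm(w)

# Support vectors sit on the parallels, where the decision score is -1 or +1
sv_scores = model.decision_function(model.support_vectors_)

print("slope:", round(float(slope), 3), "offset:", round(float(offset), 3))
print("margin width:", round(float(margin), 3))
print("scores at support vectors:", np.round(sv_scores, 3))
```

For these points the margin should equal the distance between (2, 2) and (4, 4), which is about 2.83.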
Using a few lines of SQL, we have prepared our training diabetes data to be analyzed; using a few lines of Python, we have trained a model that is capable of predicting whether a person is likely to have diabetes, providing an efficient means to utilize medical resources to identify and treat the highest percentage of patients with diabetes. This shows the power of today’s most advanced data analysis tools. Sisense supports dozens of R and Python libraries made for data analysis and visualization, ready and waiting for your next data project!