Analytics is constantly evolving; as datasets become bigger and more complex, it takes advanced techniques like AI, materialized views, and more sophisticated coding languages to glean insights from them. In Next-Level Moves, we dig into the ways advanced analytics are paving the way for the next wave of innovation.
Machine learning (ML) refers to the use of existing data, computing power, and effective algorithms to identify patterns in data, recognize those patterns when they occur again, and correctly predict an outcome based on those patterns. A frequent type of problem encountered in ML is the classification problem. In these problems, we attempt to predict whether an object or an event belongs to a certain category. Some examples of classification problems are detecting whether a credit card transaction is fraudulent, detecting whether an email is spam, and detecting whether a customer is likely to churn.
Sentiment analysis is a classification problem where data teams attempt to predict whether text is positive or negative in tone. Many companies use sentiment analysis to automatically analyze product reviews, social media comments, and survey responses to quantify feedback about their products and services. In this post, we will build a sentiment analyzer using Python after preparing text data using SQL. We will use the Naive Bayes algorithm, a popular algorithm for sentiment analysis problems. Let’s get started.
The ML Process
The ML process involves three major steps: preparing data, training a model, and testing the model. After the model is tested, it is deployed, and applications use it to answer a question; in this case, whether a piece of text is positive or negative. But it does not stop there: the ML process is highly iterative. A successful model needs to be continually tested, retrained, and re-created as the world changes!
The first step in any ML process is preparing the training data. We will use the Sentiment Labelled Sentences Dataset from UCI Machine Learning Repository. That dataset contains user reviews from Amazon, IMDB, and Yelp plus a judgment about whether each review is positive (score of 1) or negative (score of 0). The dataset is available as a CSV, so we can import the data using the CSV upload feature.
Once the data has been imported, it needs to be cleaned to remove duplicates and missing data. This is best done using SQL, the most popular language for data analysts. Here’s a look at the SQL I used to prepare this dataset for ML analysis:
select review, sentiment
from [govind_amazon_reviews]
where review is not null and sentiment is not null
union
select review, sentiment
from [govind_yelp_restaurant_reviews]
where review is not null and sentiment is not null
union
select review, sentiment
from [govin_imdb_movie_reviews]
where review is not null and sentiment is not null
Once the data has been cleaned, we will use it as our training data. It’s ready to be fed into our ML algorithm (Naive Bayes) to build our model. Before we do that, let me spend a little time explaining the Naive Bayes algorithm.
Understanding Naive Bayes
If we pick a review from the Labelled Sentences Dataset at random, the probability of it being positive is P and the probability of it being negative (N) is 1-P. Reviews are made up of words. Using the frequency of a specific word across all the reviews, we can compute a positive score and a negative score for each word. For example, here’s the calculation for P, N, and the positive and negative scores of the word “love.”
P = Number of Positive Reviews / Total Number of Reviews
N = 1 – P
Positive Score(“Love”) = Sum of freq. of “Love” in Positive Reviews / Sum of freq. of “Love” in All Reviews
Negative Score(“Love”) = 1 – Positive Score(“Love”)
After going through our entire training data, we will have P and N, the probabilities that a review picked at random from the dataset is positive or negative respectively, plus a Positive Score and Negative Score for every individual word present in our training data. Let’s assume that at the end of our training phase, P is 60%, N is 40%, Positive Score(“Love”) is 90%, and Positive Score(“Sisense”) is 80%.
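To make the training-phase arithmetic concrete, here is a minimal hand-rolled sketch. The reviews and labels are made up purely for illustration; the real model later in this post is built with scikit-learn.

```python
from collections import Counter

# Tiny made-up training set: (review, sentiment) with 1 = positive, 0 = negative
reviews = [
    ("love this product", 1),
    ("love the service", 1),
    ("great experience love it", 1),
    ("terrible service", 0),
    ("do not love the wait", 0),
]

# P = share of positive reviews; N = 1 - P
P = sum(s for _, s in reviews) / len(reviews)
N = 1 - P

# Count each word's occurrences in positive reviews and in all reviews
pos_counts, all_counts = Counter(), Counter()
for text, sentiment in reviews:
    for word in text.split():
        all_counts[word] += 1
        if sentiment == 1:
            pos_counts[word] += 1

def positive_score(word):
    # Sum of freq. in positive reviews / sum of freq. in all reviews
    return pos_counts[word] / all_counts[word]

print(P, N)                    # 0.6 0.4 (3 of the 5 reviews are positive)
print(positive_score("love"))  # 0.75 ("love" appears 3 times in positive reviews, 4 overall)
```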
Given a new review, the algorithm now determines a positive score and a negative score for that review based on the individual words in it. If the positive score is greater than the negative score, it treats the overall review as positive.
To compute the positive and negative score for a comment, our model uses the information obtained in the training phase. For example,
Positive Score(“Love Sisense”) = Positive Score(“Love”) * Positive Score(“Sisense”) * P
Negative Score(“Love Sisense”) = (1 – Positive Score(“Love”)) * (1 – Positive Score(“Sisense”)) * (1 – P)
Positive Score(“Love Sisense”) = 0.9 * 0.8 * 0.6 = 0.432
Negative Score(“Love Sisense”) = 0.1 * 0.2 * 0.4 = 0.008
Hence the review “Love Sisense” is classified as a positive review. The Naive Bayes algorithm assumes that each word contributes independently to the positive or negative score of a review; it does not consider dependencies between the words. Despite this simplification, Naive Bayes generates strong results, especially when we don’t have a large amount of training data or much information about the problem domain.
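The scoring rule above can be sketched directly in Python, plugging in the assumed training-phase numbers from the example (P = 0.6, Positive Score(“Love”) = 0.9, Positive Score(“Sisense”) = 0.8):

```python
# Assumed word scores from the training phase (illustrative numbers from the text)
P = 0.6  # probability that a random review is positive
positive_score = {"love": 0.9, "sisense": 0.8}

def classify(review):
    words = review.lower().split()
    pos = P      # start from the prior probability of a positive review
    neg = 1 - P  # and of a negative review
    for w in words:
        # Naive independence assumption: multiply each word's score in
        pos *= positive_score[w]
        neg *= 1 - positive_score[w]
    return ("positive" if pos > neg else "negative"), pos, neg

label, pos, neg = classify("Love Sisense")
print(label)  # positive  (pos ≈ 0.432, neg ≈ 0.008)
```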
Building the model
Let’s get back to building our model using the Naive Bayes algorithm. The output of our SQL query is available as a dataframe (df). The first step in building the Naive Bayes model is to represent each review in a term-frequency representation. The scikit-learn package has a built-in class named CountVectorizer, which represents our reviews as a term-frequency matrix.
# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)
print(transformed_reviews.shape)
Each review has been converted into a vector of 4,812 numbers, one for each unique word in the dataset. Most of these numbers will be 0, since any single review contains only a small fraction of the vocabulary. If we print any one review’s vector, we see only the elements that are 1.
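As a rough illustration of what CountVectorizer produces with binary=True, here is a simplified hand-built version (made-up reviews; scikit-learn’s actual tokenization rules are more involved):

```python
# Build a binary bag-of-words matrix by hand for a few short reviews
reviews = ["great food", "terrible service", "great service"]

# Vocabulary: one column per unique word, in sorted order
vocab = sorted({word for r in reviews for word in r.split()})

# Each review becomes a row of 0s and 1s: 1 if the word occurs in it, else 0
matrix = [[1 if word in r.split() else 0 for word in vocab] for r in reviews]

print(vocab)  # ['food', 'great', 'service', 'terrible']
for row in matrix:
    print(row)
# [1, 1, 0, 0]
# [0, 0, 1, 1]
# [0, 1, 1, 0]
```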
The scikit-learn package also provides Naive Bayes classifiers. We instantiate the Bernoulli variant (BernoulliNB) and pass the reviews in term-frequency representation, along with the sentiments, to the fit method. This builds a model capable of classifying text as positive or negative.
import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)
classifier = sklnb.BernoulliNB().fit(transformed_reviews, sentiments)
Testing the Model
Now that the model has been built, we are ready to test it. This is done by calling the predict method on the classifier and passing the review to test in term frequency representation. The method returns whether the review is positive or negative.
import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)
classifier = sklnb.BernoulliNB().fit(transformed_reviews, sentiments)

result = classifier.predict(count_vectorizer.transform(['I love Sisense']))
sisense.text('POSITIVE' if result[0] == 1 else 'NEGATIVE')
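Testing on a single sentence tells us little about overall quality. A common complementary check is to hold out part of the labelled data during training and measure accuracy on that held-out portion. Here is a self-contained sketch with a tiny made-up dataset standing in for the real dataframe:

```python
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb
from sklearn.model_selection import train_test_split

# Tiny made-up labelled set standing in for the real reviews dataframe
reviews = ["love it", "great product", "awful service", "terrible food",
           "really love this", "great support", "awful experience", "terrible app"]
sentiments = [1, 1, 0, 0, 1, 1, 0, 0]

vectorizer = skltext.CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

# Hold out 25% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, sentiments, test_size=0.25, random_state=0)

classifier = sklnb.BernoulliNB().fit(X_train, y_train)

# Fraction of held-out reviews classified correctly
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```

With a realistic dataset, this number gives a far more honest picture of the model than any single hand-picked example.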
Instead of modifying the Python code each time to supply the text for testing, we can set up a filter that the user can enter free-form or load from a data source and pass the text from a filter into our analysis code.
Once the filter is set up, we modify the SQL to pass the input values from a filter into Python code.
select review, sentiment, '[InputText]' as InputText
from [govind_amazon_reviews]
where review is not null and sentiment is not null
union
select review, sentiment, '[InputText]' as InputText
from [govind_yelp_restaurant_reviews]
where review is not null and sentiment is not null
union
select review, sentiment, '[InputText]' as InputText
from [govin_imdb_movie_reviews]
where review is not null and sentiment is not null
Then in the Python code, we replace our test text “I love Sisense” with the filter input received through the dataframe as df['INPUTTEXT'].
# Every row carries the same filter value, so take the first one
result = classifier.predict(count_vectorizer.transform([df['INPUTTEXT'].iloc[0]]))
This allows us to test our sentiment analyzer now by entering text directly from the user interface.
Using a few lines of SQL, we prepared data to be analyzed; using a few lines of Python, we trained a model capable of analyzing the sentiment of text. This shows the power of the tools available for data analysis today. Sisense for Cloud Data Teams supports dozens of R and Python libraries made for data analysis and visualization, ready and waiting for your next data project!
Govind Rajagopalan is a senior engineering manager at Sisense.