In Talking Data, we delve into the rapidly evolving worlds of Natural Language Processing and Generation. Text data is proliferating at a staggering rate, and programming languages like Python and R are among the most practical tools for pulling insights out of these datasets at scale.

As the title suggests, in this article we’ll explore best practices in natural language processing (NLP). To do this, we’ll use machine learning (ML) algorithms, which are based on concepts of linear algebra and statistics. These best practices will help us address a common question: “How do we represent text for machine learning systems?”


Text preprocessing

Understanding the real meaning of words by analyzing the context of the surrounding text is called semantic analysis. One of the first things you have to do before semantic analysis in an NLP project is text preprocessing. This vital practice makes the data more understandable for the algorithms. To understand text preprocessing, let’s use a common natural language processing task, sentiment analysis, as an example.

Sentiment analysis has many applications, and it’s something we do as humans without really thinking. For instance: When customers write product reviews, a human reader can easily identify whether a given review is positive or negative. For computers, this process is much more complicated, so preprocessing steps are essential to get clean data.

Let’s take a sample review posted for a product on https://www.fakereviewsite.com:

https://www.fakereviewsite.com Can’t do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.<br /><br />Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage.<br /><br/> Unbelievably delicious…<br /><br /> Can you tell I like it? 🙂

Preprocessing works like this:

1. Begin by removing the URLs and HTML tags

URLs like https://www.fakereviewsite.com in the above example and HTML tags like <br /> don’t add any value to the review text. Think of this as doing “noise removal” for a piece of recorded audio.

Sample code:

import re
from bs4 import BeautifulSoup

# Strip URLs first, then let BeautifulSoup parse out the remaining HTML tags
text_review = re.sub(r"http\S+", "", text_review)
text = BeautifulSoup(text_review, 'lxml').get_text()

Output:

Can’t do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂

2. Expand the contractions (text standardization)

Contractions, the shortened/combined forms of commonly used words, are usually easier for humans to understand. We often think of them as making text and speech sound more “natural.” But for machines, using the full words allows them to handle sentiment analysis more easily. That’s why the next step involves expanding these contractions into complete words (e.g., “can’t” becomes “can not”). This is also called “text standardization.”

Sample code:

text = re.sub(r"Can\'t", "can not", text)

Output:

can not do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂
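
The regex above handles only the single contraction in our sample. In practice, you would expand many contractions at once. Here is a minimal sketch using a small hand-rolled mapping (the CONTRACTIONS dictionary below is illustrative, not exhaustive):

import re

# Illustrative, incomplete contraction map; extend as needed
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "n't": " not",  # generic fallback: "doesn't" -> "does not"
    "'re": " are",
    "'ve": " have",
}

def expand_contractions(text):
    # Try longer patterns first so "can't" wins over the generic "n't" fallback
    for pattern in sorted(CONTRACTIONS, key=len, reverse=True):
        text = re.sub(re.escape(pattern), CONTRACTIONS[pattern], text, flags=re.IGNORECASE)
    return text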

3. Remove numeric and alphanumeric words

Remove numbers and any word containing numbers, as numbers do not add any value in sentiment analysis.

Sample code:

# Drop any whitespace-delimited token that contains a digit
text = re.sub(r"\S*\d\S*", "", text).strip()

Output:

can not do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.   ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂

4. Remove punctuation and special characters

Remove all punctuation and special characters: pound signs (#), commas, periods, parentheses, etc. The text analysis we are about to perform derives no value from these characters.

Sample code:

# Collapse every run of non-alphanumeric characters into a single space
text = re.sub('[^A-Za-z0-9]+', ' ', text)

Output:

can not do sugar Have tried scores of SF Syrups NONE of them can touch the excellence of this product Thick delicious Perfect ingredients Water Maltitol Natural Maple Flavor PERIOD No chemicals No garbage Unbelievably delicious Can you tell I like it

5. Convert the text to lowercase

Next, we continue preprocessing the text by converting it all to lowercase. This is because the ML system interprets words with different cases as being different words. For example, “delicious” and “Delicious” appearing in the same text will be counted as two different words, as opposed to two instances of the same word!

6. Remove stop words

“Stop words” occur in every language and include common words like “we,” “they,” “can,” “not,” etc. These stop words do not add any value in NLP analysis, so we remove them. The Natural Language Toolkit (NLTK) ships a list of English stop words in nltk.corpus.stopwords.

However, be careful with stop words like “no” and “not,” as these words change the meaning of a sentence. For example, “tasty” and “not tasty” lead to two different sentiments.

So, we generally do not remove these specific stop words.

Sample code for text to lowercase and remove stop words:

from nltk.corpus import stopwords

# Keep "no" and "not" so negations survive (see the caveat above)
stop_words = set(stopwords.words('english')) - {'no', 'not'}
text = ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

Output:

not sugar tried scores sf syrups none touch excellence product thick delicious perfect ingredients water maltitol natural maple flavor period no chemicals no garbage unbelievably delicious tell like

7. Lemmatization

Lemmatization means converting a word to its meaningful base form. For example, “tried” is turned into “try,” “cries” to “cry,” “cars” to “car,” etc. So, words like “tried,” “tries,” and “try” will all be considered by the system as multiple instances of the same word: “try.” Lemmatization can also be considered a part of normalization.

Sample code:

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokenization = nltk.word_tokenize(text)
# pos="v" lemmatizes each word as a verb: "tried" -> "try"
output = ' '.join([lemmatizer.lemmatize(w, pos="v") for w in tokenization])

Output:

not sugar try score sf syrup none touch excellence product thick delicious perfect ingredients water maltitol natural maple flavor period no chemical no garbage unbelievably delicious tell like
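
Putting all seven steps together, here is a minimal end-to-end sketch of the pipeline described above (the helper name preprocess_review is our own, and the contraction handling is deliberately reduced to the one case in our sample):

import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_review(text):
    text = re.sub(r"http\S+", "", text)              # 1. remove URLs
    text = BeautifulSoup(text, 'lxml').get_text()    # 1. remove HTML tags
    text = re.sub(r"Can't", "can not", text)         # 2. expand contractions (simplified)
    text = re.sub(r"\S*\d\S*", "", text).strip()     # 3. drop numeric/alphanumeric words
    text = re.sub('[^A-Za-z0-9]+', ' ', text)        # 4. drop punctuation/special characters
    stop_words = set(stopwords.words('english')) - {'no', 'not'}
    words = [w.lower() for w in text.split() if w.lower() not in stop_words]  # 5 & 6
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w, pos="v") for w in words)          # 7. lemmatize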

That does it for text preprocessing. Let’s move to the next best practice of NLP: data tokenization.


Data tokenization

Tokenization is one of the most common best practices when working with text data. Tokenization means splitting the text into sentences or splitting the sentences into words. These split units are called tokens.

For example, the phrase “best practices in natural language processing” would be tokenized as: “best,” “practices,” “in,” “natural,” “language,” and “processing.”

Tokenization is important in NLP because the essence of a text can often be interpreted by analyzing the tokens it contains.

Sample code:

tokenization = nltk.word_tokenize(output)

Output:

['not', 'sugar', 'try', 'score', 'sf', 'syrup', 'none', 'touch', 'excellence', 'product', 'thick', 'delicious', 'perfect', 'ingredients', 'water', 'maltitol', 'natural', 'maple', 'flavor', 'period', 'no', 'chemical', 'no', 'garbage', 'unbelievably', 'delicious', 'tell', 'like']
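
Since tokenization also works at the sentence level, here is a minimal sketch using nltk.sent_tokenize on a fragment of the original (unpreprocessed) review, which still has its sentence boundaries:

import nltk

sentences = nltk.sent_tokenize("Thick, delicious. Perfect. Can you tell I like it?")
# ['Thick, delicious.', 'Perfect.', 'Can you tell I like it?']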

Word embedding

Word embedding is a feature-engineering technique in NLP in which words or tokens are mapped to vectors of real numbers. Each word is converted into a vector with some fixed number of dimensions, say “d.”

This process produces a learned representation of text in which words or tokens with similar meanings have similar representations and can easily be understood by machine learning algorithms.

Word embedding extracts meaning from text data, since it also captures semantic relationships such as gender, verb tense, and country-capital pairings.

There are various word embedding models available, such as word2vec by Google, GloVe by Stanford, and fastText by Facebook.

As an example, let’s look at GloVe’s 300-dimensional embedding for the word “delicious.”
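
The lookup below assumes an embeddings_index dictionary mapping each word to its GloVe vector. Here is a minimal sketch of how it could be built (the file name glove.6B.300d.txt is an assumption, matching the standard 300-dimensional GloVe download):

import numpy as np

embeddings_index = {}
# Each line of a GloVe file holds a word followed by its vector components
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')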

Sample code:

embeddings_index["delicious"]

Output:

array([
-0.27801  , -0.14519  ,  0.49453  ,  0.12529  , -0.057677 ,
0.70151  ,  0.28826  , -0.20441  ,  0.03009  ,  1.3899   ,
-0.26564  ,  0.43441  , -0.47501  , -0.13348  ,  0.24737  ,
-0.45528  , -0.67027  ,  1.1701   ,  0.040979 , -0.12553  ,
-0.075785 , -0.27344  , -0.049158 , -0.42694  , -0.041506 ,
-0.11606  ,  0.3838   , -0.1245   ,  0.018793 , -0.78534  ,
0.055234 ,  0.007924 ,  0.055043 , -0.20598  , -0.06414  ,
-0.10594  , -0.06305  , -0.027444 , -0.26364  ,  0.64279  ,
-0.29369  ,  0.11132  ,  0.28754  , -0.27219  ,  0.48337  ,
0.93093  , -0.23844  ,  0.61936  ,  0.12593  , -0.24751  ,
-0.08677  ,  0.19172  , -0.36446  , -0.071028 ,  0.64807  ,
0.12868  , -0.046247 ,  0.42061  , -0.12793  ,  0.19642  ,         
0.68146  , -0.55865  , -0.27874  ,  0.039101 , -0.17919  ,
-0.59897  ,  0.20486  ,  0.15241  ,  0.34993  ,  0.47898  ,
0.36544  ,  0.57892  ,  0.24779  , -0.35317  ,  0.2616   ,
-0.22896  , -0.22391  ,  0.16569  , -0.61168  , -0.18378  ,
-0.023205 , -0.18056  ,  0.054312 , -0.1776   , -0.098411 ,
-0.6113   ,  0.38856  ,  0.88379  , -0.29055  , -0.12958  ,
0.015754 ,  0.23812  ,  0.10429  ,  0.41016  ,  0.23708  ,
1.0123   , -0.86614  , -0.16838  ,  0.066406 , -0.0050272,
-0.22711  , -0.28863  ,  0.36877  , -0.25895  , -0.22054  ,
-0.31888  ,  0.58853  ,  0.21332  ,  0.55837  , -0.23193  ,
-0.21208  , -0.56664  , -0.66216  ,  0.22095  ,  0.12373  ,
-0.48547  , -0.44839  ,  0.0091947, -0.27908  ,  0.014443 ,
0.21652  ,  0.18283  , -0.35423  , -0.4034   ,  0.27554  ,
-0.52523  ,  0.436    , -0.60157  ,  0.18374  ,  0.11548  ,
-0.17291  , -0.89038  ,  0.38101  ,  0.32373  , -0.25688  ,
0.19965  , -0.11587  ,  0.025945 , -0.041164 , -0.31512  ,
-2.5693   ,  0.35069  ,  0.57936  ,  0.24183  , -0.025946 ,
-1.205    , -0.061434 ,  0.057562 ,  0.55008  , -0.016724 ,
0.32504  ,  0.11345  ,  0.40961  ,  0.16263  ,  0.19778  ,
-0.017293 , -0.12313  , -0.1714   ,  0.80623  , -0.039624 ,
-0.39989  ,  0.52971  ,  0.42122  ,  0.077219 , -0.005385 ,
-0.63461  ,  0.25213  , -0.37005  ,  0.20136  , -0.052806 ,
0.59213  , -0.12338  ,  0.22055  , -0.195    , -0.66998  ,
-0.18867  ,  0.022199 , -0.82456  , -0.080926 , -0.41921  ,
0.034355 , -0.69545  ,  0.24931  , -0.10916  ,  0.12605  ,
-0.75361  , -0.033696 , -0.2305   , -0.0046053, -0.31902  ,
-0.31114  , -0.036903 , -0.022895 , -0.13569  , -0.27607  ,
-0.1031   ,  0.30701  ,  0.34186  ,  0.45891  , -0.13587  ,
-0.45223  , -0.090191 , -0.099395 , -0.074891 ,  0.19245  ,
0.29045  , -0.39008  ,  0.73567  ,  0.62942  , -0.2174   ,
-0.51588  , -0.19546  ,  0.081308 , -0.030894 , -0.34068  ,
0.28449  ,  0.40505  , -0.33851  ,  0.24456  ,  0.093235 ,
-0.53475  ,  0.34995  ,  0.1656   ,  0.80673  ,  0.48231  ,
-0.39488  , -0.20581  ,  0.063178 ,  0.065316 ,  0.17051  ,
-0.16726  , -0.28956  , -0.50795  , -0.39699  ,  0.19386  ,      
-0.16445  , -0.4318   , -0.47626  , -0.20233  ,  0.11089  ,
0.13755  ,  0.026714 , -0.93893  , -0.20077  ,  0.20623  ,
0.7216   , -0.58006  , -0.38965  , -0.2282   ,  0.17188  ,
-0.3815   ,  0.04917  , -0.32791  ,  0.10739  ,  0.023031 ,
0.67157  ,  0.32911  ,  0.28143  ,  0.036222 ,  0.1453   ,
-0.12512  , -0.27149  ,  0.04054  , -0.020042 , -0.056311 ,
0.34275  , -0.62091  , -0.0058532, -0.49363  ,  0.072698 ,
-0.81502  , -0.026662 , -0.23517  , -0.34235  , -0.54425  ,
0.45515  ,  0.085665 ,  0.070533 ,  0.36966  ,  0.95099  ,
0.47395  , -0.1195   ,  0.12501  , -0.50397  ,  0.10813  ,
0.53519  ,  0.58557  ,  0.56703  , -0.17101  ,  0.48838  ,
0.46119  ,  0.4737   , -0.14692  ,  0.0055303, -0.37672  ,
-0.090149 ,  0.52314  , -0.97767  ,  0.18443  ,  0.25023  ],
dtype=float32)

The token vectors produced by word embedding feed the next ML step (not covered in this exercise): modeling. Modeling is where we teach a program to learn from training data using ML algorithms.
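
One simple, common way to hand these vectors to a downstream model is to average the token vectors into a single document vector per review. A minimal sketch, assuming the embeddings_index and token list from above:

import numpy as np

tokens = ['not', 'sugar', 'try', 'delicious']  # subset of our review's tokens, for illustration
vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
doc_vector = np.mean(vectors, axis=0)  # one 300-dimensional feature vector for the review
print(doc_vector.shape)  # (300,)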

Proper preparation, accurate execution

Now we’ve got a properly prepared text dataset: Unnecessary filler words have been removed, all text is lowercase and in its simplest form, and words have been assigned values that our ML algorithms can work with. These best practices will help you get the most from your text analysis, pulling sentiments out of human words without having to read each one yourself. And that’s just the start of what ML can do for you. Happy coding!


Nidhi Bansal is a data scientist, machine learning/artificial intelligence enthusiast, and writer who loves to experiment with data and write about it. She has over a decade of experience in software development in various programming languages and holds a B.Tech and an M.E. in Electronics and Communications Engineering.
