In Talking Data, we delve into the rapidly evolving worlds of Natural Language Processing and Generation. Text data is proliferating at a staggering rate, and programming languages like Python and R make it possible to pull insights out of these datasets at scale.
Today, text data is everywhere. As humans, we can easily understand this information, but for computers it’s a complicated task. The science of understanding and learning from text data is called natural language processing (NLP). Programmers encounter many common challenges when trying to teach computers to understand natural language text data.
In this post, we’ll discuss these challenges in detail and include some tips and tricks to help you handle text data more easily.
Unstructured data and Big Data
The most common challenges we face in NLP involve unstructured data and Big Data. Data generated from online conversations, comments, tweets, etc. is “big” and highly unstructured. It’s a huge challenge to process that data and extract useful information from it.
Big Data and unstructured data can be converted into useful, meaningful text in the following ways:
Preprocessing of data
Preprocessing of data means removing unwanted URLs, HTML tags, stop words, numeric and alphanumeric strings, punctuation, and special characters. It also involves converting all text to lowercase.
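These steps can be sketched with the standard library alone. Note that the stop-word list below is a tiny illustrative sample, not a complete one, and the function name is our own:

```python
import re

STOP_WORDS = {"a", "an", "the", "in", "is", "of", "to"}  # sample only

def preprocess(text):
    text = text.lower()                         # convert all text to lowercase
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits, punctuation, special characters
    words = text.split()
    # keep only multi-letter words that are not stop words
    return " ".join(w for w in words if len(w) > 1 and w not in STOP_WORDS)

print(preprocess("Check <b>THIS</b> out: https://example.com, it's the best of 2023!"))
# check this out it best
```

A real pipeline would use a full stop-word list (e.g., NLTK’s) and may keep numbers when they carry meaning, but the order of operations above is typical.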
Data standardization and lemmatization
Data standardization means converting words into a standard form, like expanding contractions into complete words (e.g., “can’t” becomes “cannot”). Most NLP pipelines handle fully expanded words more reliably.
Lemmatization means converting the word to its meaningful base form. For example, “tried” is turned into “try,” “cries” to “cry,” “cars” to “car,” etc. So, words like “tried,” “tries,” and “try” will all be considered by the system as multiple instances of the same word: “try.” Lemmatization can also be considered a part of normalization.
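Contraction expansion can be sketched with a simple lookup table. The mapping below covers only a few illustrative cases (in practice you would use a fuller list, and lemmatization itself is usually handled by a library such as NLTK or spaCy):

```python
# Illustrative sample mapping; a real pipeline would use a much larger one.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

def standardize(text):
    # lowercase, then replace each known contraction with its expanded form
    return " ".join(CONTRACTIONS.get(w, w) for w in text.lower().split())

print(standardize("I can't drive, it's raining"))
# i cannot drive, it is raining
```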
Tokenization
Tokenization means splitting text into sentences or splitting sentences into words. These split units are called tokens.
For example, the phrase “best practices in natural language processing” would be tokenized as: “best,” “practices,” “in,” “natural,” “language,” and “processing.”
In NLP, tokenization is important because most downstream analysis operates on the tokens of a text rather than on the raw string.
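A minimal word tokenizer can be written with the standard library; production systems typically use a library tokenizer (e.g., NLTK or spaCy) instead, since real text has edge cases like hyphens and contractions:

```python
import re

def tokenize(text):
    # \w+ captures each run of word characters as one token
    return re.findall(r"\w+", text.lower())

print(tokenize("Best practices in natural language processing"))
# ['best', 'practices', 'in', 'natural', 'language', 'processing']
```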
By applying the above methods, we can extract meaningful data from Big Data and produce output that is structured in nature.
Semantic meaning of words
Another common challenge is the semantic meaning of words. The vocabulary of any given language is vast, and many words have similar meanings, so machines need a way to recognize that different words can express the same idea.
While training a model for NLP, words not present in the training data commonly appear in the test data. Because of this, predictions made using test data may not be correct. To solve this problem, machines need to capture the semantic meaning of words. Using the semantic meaning of words it already knows as a base, the model can understand the meanings of words it doesn’t know that appear in test data.
For example, the words “tasty” and “delicious” are close in terms of semantic meaning.
Consider an example where the training data contains this sentence:
Pasta is very tasty.
Test data then contains this sentence:
Pasta is delicious.
The word “delicious” is not in the training data, but “tasty” is. As both words are semantically close to each other, machine learning models can easily understand that “delicious” also refers to the pasta tasting good.
Pretrained word embeddings can be used here. A word embedding is a type of word representation in which words with similar meanings get similar vector representations, which machine learning algorithms can then compare directly.
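The idea can be illustrated with cosine similarity over toy vectors. The three-dimensional “embeddings” below are made up for illustration; real embeddings (e.g., from word2vec or GloVe) have hundreds of dimensions learned from large corpora:

```python
import math

# Toy vectors, invented for illustration only
embeddings = {
    "tasty":     [0.9, 0.8, 0.1],
    "delicious": [0.8, 0.9, 0.2],
    "car":       [0.1, 0.0, 0.9],
}

def cosine(u, v):
    # cosine of the angle between two vectors: dot product over norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["tasty"], embeddings["delicious"]))  # close to 1.0
print(cosine(embeddings["tasty"], embeddings["car"]))        # much lower
```

“tasty” and “delicious” point in nearly the same direction, so their similarity is near 1, while “tasty” and “car” are nearly orthogonal.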
Extracting useful information
Another major challenge is extracting useful information from data. With the increase in data availability, extracting important information is quite challenging, like searching for a needle in a haystack.
Named entity recognition (NER) is the technique used extensively for extraction of useful information from huge text data. It locates and classifies different entities such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.
Here’s what sample text data might look like:
“Sisense is a business analytics software company with offices in New York City, San Francisco, Tel Aviv, London, Melbourne, Tokyo, Kiev, and Scottsdale, Arizona. Sisense was founded in 2004 in Tel Aviv by Elad Israeli, Eldad Farkash, Aviad Harell, Guy Boyangu, and Adi Azaria. Amit Bendov was appointed CEO in July 2012. In April 2013, Sisense announced a $10 million series B funding round led by Battery Ventures, with participation from Genesis Partners and Opus Capital.”
And here is the output of NER:
[('New York City', 'GPE'), ('San Francisco', 'GPE'), ('Tel Aviv', 'GPE'), ('London', 'GPE'), ('Melbourne', 'GPE'), ('Tokyo', 'GPE'), ('Kiev', 'GPE'), ('Scottsdale', 'GPE'), ('Arizona', 'GPE'), ('2004', 'DATE'), ('Tel Aviv', 'GPE'), ('Israeli', 'NORP'), ('Eldad Farkash', 'PERSON'), ('Aviad Harell', 'PERSON'), ('Guy Boyangu', 'PERSON'), ('Adi Azaria', 'PERSON'), ('Amit Bendov', 'PERSON'), ('July 2012', 'DATE'), ('April 2013', 'DATE'), ('Sisense', 'ORG'), ('$10 million', 'MONEY'), ('Battery Ventures', 'ORG'), ('Genesis Partners', 'ORG'), ('Opus Capital', 'ORG')]
We can see that the major named entities like “PERSON” meaning people, “ORG” meaning organization, and “GPE” meaning geopolitical entity have all been identified.
Understanding different meanings of the same word
One of the most important and challenging tasks in the entire NLP process is to train a machine to derive the actual meaning of words, especially when the same word can have multiple meanings within a single document.
There are many words that have the same spelling but different meanings. Consider the following two sentences:
“A crane flew above the car.”
“I saw a big construction crane.”
The word “crane” is used in quite different contexts in these two sentences, and its actual meaning can only be extracted by understanding that context.
Word embeddings can be used to understand the context of words. Deep learning models such as recurrent neural networks also help machines understand the context in which words are used.
Dealing with spelling mistakes
Spelling mistakes are another common challenge in NLP. They can cause problems in understanding the correct meaning of words, which can lead to the system missing important information from the text.
Spelling mistakes can occur for a variety of reasons, from typing errors to extra spaces between letters or missing letters.
Cosine similarity is one method for finding the correct word when a spelling mistake has been detected. Each word is represented as a vector of character (or character n-gram) counts, and the similarity between a dictionary word and the misspelled word is the cosine of the angle between their two vectors. By setting a threshold on this similarity, we can identify all dictionary words above the threshold as candidate replacements for the misspelled word.
For example, if the misspelled word is “speling,” the system will find the correct word: “spelling.”
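A compact sketch of this approach, using character bigrams; the tiny dictionary and the 0.8 threshold are illustrative choices, not fixed values:

```python
import math
from collections import Counter

# Illustrative dictionary; a real system would use a full word list
DICTIONARY = ["spelling", "speaking", "spinning", "testing"]

def bigrams(word):
    # count overlapping two-character sequences, e.g. "cat" -> {"ca": 1, "at": 1}
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def cosine(a, b):
    # cosine similarity between two bigram-count vectors
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

def suggest(misspelled, threshold=0.8):
    mg = bigrams(misspelled)
    return [w for w in DICTIONARY if cosine(mg, bigrams(w)) >= threshold]

print(suggest("speling"))
# ['spelling']
```

“speling” shares six of its bigrams with “spelling,” giving a similarity above the threshold, while “speaking” and “spinning” fall well below it.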
Handling constantly growing data
Datasets are expanding at breakneck speed; new data is generated every second, and old information is updated in real time. It’s difficult to frequently retrain models from scratch on new data. This is where transfer learning comes to the rescue.
Transfer learning is a technique in which a pretrained model, one that has already been trained on a different but related problem, is reused as a starting point. The benefits of transfer learning are:
- Helps solve real-world complex problems
- Saves time, effort, and machine memory
- Handles Big Data more easily
There are many pretrained models available for transfer learning, including Embeddings from Language Models (ELMo) and Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT). BERT in particular can be fine-tuned for a wide variety of NLP tasks.
Building a better future with NLP
Data is the new oil. It creates new prospects and challenges every day. Established and emerging companies alike are putting their efforts into creating platforms and apps that understand natural language the way humans do. In the future, we’ll simply talk to all of our devices to get them to do what we want, and techniques like these are part of the foundation of that future.
Scott Castle is the VP & GM for Cloud Data Teams at Sisense. He brings over 25 years of experience in software development and product management at leading technology companies including Adobe, Electric Cloud, and FileNet. Scott is a prolific writer and speaker on all things data, appearing at events like the Gartner Enterprise Data Conference, Data Champions, and Strata Data NYC.