In Talking Data, we delve into the rapidly evolving worlds of Natural Language Processing and Generation. Text data is proliferating at a staggering rate, and programming languages like Python and R are among the most widely used tools for pulling insights out of these datasets at scale.

One of the most-asked questions from aspiring data scientists is: “What is the best language for data science? R or Python?”

Newcomers to data science are often unsure which of the two to learn first. Both are extremely useful for an array of data science applications, including Natural Language Processing (NLP). To understand the strengths and weaknesses of each, let’s explore R vs Python for data science by analyzing which language works best with NLP.

Natural language processing: Teaching computers human words

Natural language processing means, as the name implies, teaching computers to process natural human languages (English, Hindi, etc.) and perform analyses. NLP can be used on written text or speech data.
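To make "processing" concrete, here is a minimal Python sketch of a first step that most text pipelines share: splitting raw text into lowercase word tokens and counting them. The `tokenize` helper and the sample sentence are illustrative assumptions, and the pattern is a simple English-oriented one rather than a general-purpose tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (ASCII letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical sample sentence, used only for illustration.
sample = "The pasta is tasty, and the sauce is tasty too."
tokens = tokenize(sample)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

Libraries such as NLTK or spaCy handle many more cases (other scripts, contractions, punctuation rules), which is one reason they are used instead of hand-written patterns.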

For our example, we will use written text for our comparison of R vs Python for data science. We are surrounded by written text every day: emails, SMS messages, webpages, books, and much more. Text data plays a vital role in our day-to-day life, which makes NLP a very important area for data scientists to explore.

R vs Python for data science: Digging into the differences

Python and R are two of the top data science languages. Both are open-source and have large user bases. In the real world, it’s often difficult to choose between R and Python for data science and NLP. Here, we’re going to run through some of the must-know info about each of these versatile languages.

R: Analytics powerhouse 

R is a language built by statisticians, mainly for mathematics, statistics, research, and data analysis. It’s especially popular for its visualizations: charts, graphs, and a wide range of plots. These make it easier to spot trends, outliers, and patterns in data.

Python: Versatile workhorse

Python is a general-purpose, robust, versatile language. Its readable syntax makes it easy to learn and understand, since it reads much like a human language. Python also integrates well into a variety of project environments.

Libraries for NLP

Libraries are collections of modules and functions that programmers can include in programs and projects to accomplish specific tasks. Programmers choose different libraries because they help do a particular task more efficiently. For example, the wordcloud library is used to create a word cloud displaying the most frequently used words from a text dataset. We’ll actually do this later in this article.

R libraries

R boasts more than 10,000 libraries, such as caret, dplyr, tidyr, caTools, ggplot2, and many others. These support a wide array of uses, such as data analysis, manipulation, visualization, and machine learning (ML) modeling. Some of the libraries used for NLP are tm, tidytext, text2vec, and wordcloud. Again, the library you use will depend on your use case. Check out this page from the R Project for a detailed look at the other libraries used for NLP.

Python libraries

Python ships with a standard library of 200+ modules and has a vast ecosystem of third-party libraries. Some of the most popular third-party libraries are pandas, NumPy, scikit-learn, SciPy, and Matplotlib. These libraries are used for data collection, analysis, data mining, visualization, and ML modeling. Libraries used for NLP include NLTK, gensim, spaCy, glove, and scikit-learn. Every library has its own purpose and benefits. For instance, NLTK is excellent for learning and exploring NLP concepts, but it is generally not the first choice for production. spaCy, meanwhile, is a newer NLP library that’s designed to be fast and production-ready.

Data exploration in R and Python

Data exploration is the initial step in data analysis, yielding visualizations like charts or plots that show human users patterns and trends. For our NLP demo, let’s take a dataset of commonly used words from Kaggle and do some data exploration on it in both data science languages.

Loading data in both R and Python

First, let’s load training data in both Python and R and check how much time it takes each language.

R code:

# Load data and time how long it takes
start_time <- Sys.time()
train <- read.csv("train.csv")
timediff <- difftime(Sys.time(), start_time)
cat("Time taken to load csv is: ", timediff, units(timediff))

Output:

The time taken to load the csv is 3.038562 minutes.

Python code:

import time
import pandas as pd

# Load data and time how long it takes
start_time = time.time()
tr_data = pd.read_csv("train.csv")
time_diff = time.time() - start_time
print("Time taken to load csv is: {} seconds".format(time_diff))

Output:  

The time taken to load the CSV is 17.733536 seconds.

It takes significantly less time for Python to load the CSV than for R to load the same dataset.
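A single wall-clock measurement like the ones above is sensitive to disk caching and to whatever else the machine is doing. As a rough sketch of a steadier approach in Python, the snippet below repeats the same parsing step several times with `time.perf_counter` and reports the fastest and median runs. It parses a small in-memory CSV with the standard `csv` module so that it runs anywhere; the `time_call` helper and the generated rows are assumptions for illustration, not the `train.csv` file used in this article.

```python
import csv
import io
import statistics
import time

def time_call(func, repeats=5):
    """Run func several times and return the elapsed seconds of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical stand-in for a real file: 10,000 rows of generated CSV text.
csv_text = "id,comment\n" + "".join(f"{i},sample comment {i}\n" for i in range(10_000))

def load_rows():
    """Parse the CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

durations = time_call(load_rows)
print(f"fastest: {min(durations):.4f} s, median: {statistics.median(durations):.4f} s")
```

Reporting the fastest or median of several runs tends to give a fairer comparison between two tools than one measurement each.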

Create a word cloud to find the most-repeated words

Word clouds are a representation of the words in a dataset as a cluster of words: The more frequently a word appears in the text data, the bigger and bolder it appears in the word cloud.
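The sizing rule described above can be sketched without any plotting library. The snippet counts words, keeps the most frequent ones, and maps each count linearly to a font size. The `font_sizes` helper, the size range, and the sample text are hypothetical; real word cloud libraries apply their own scaling and layout rules.

```python
from collections import Counter

def font_sizes(text, max_words=3, min_size=10, max_size=40):
    """Map the most frequent words to font sizes, scaled linearly by count."""
    counts = Counter(text.lower().split()).most_common(max_words)
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / span
        for word, count in counts
    }

# Hypothetical sample text, used only for illustration.
sample = "data data data science science language"
sizes = font_sizes(sample)
print(sizes)
```

Here "data" appears most often and so gets the largest size, while "language" appears once and gets the smallest.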

Let’s create a word cloud of the 50 most frequent words in our training dataset, using both the Python and R programming languages.

R code:

library(wordcloud)
# d_toxic is assumed to be a data frame of words and their frequencies, built beforehand
start_time <- Sys.time()
wordcloud(d_toxic$word, d_toxic$freq, min.freq = 100, max.words = 50, colors = brewer.pal(8, "Dark2"))
timediff <- difftime(Sys.time(), start_time)
cat("Time taken to create wordcloud is ", timediff, units(timediff))

Output:

The time taken to create the word cloud is 0.2203801 seconds.

Word cloud of top 50 words in R

Python code:

import time
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# text is assumed to be one string holding all of the training text, built beforehand
start_time = time.time()
wordcloud = WordCloud(max_words=50, background_color="white", collocations=False).generate(text)
time_diff = time.time() - start_time
print("Time taken to create wordcloud is {} seconds".format(time_diff))
plt.imshow(wordcloud, interpolation="bilinear")  # display the generated image
plt.title("Most popular 50 words")
plt.show()

Output:

The time taken to create the word cloud was 129.392610 seconds.

Word cloud of top 50 words in Python

Let’s compare the time taken by both data science languages for data loading and word cloud creation.


Task | R (time taken) | Python (time taken)
Data loading | 3.038562 minutes | 17.7335 seconds
Word cloud | 0.2203801 seconds | 129.3926 seconds (about 2 minutes)

We can see that the most frequent words are similar in both languages, though to our eye the R visualization is the more attractive of the two. Word cloud creation was also much faster in R (about 0.2 seconds versus about 129 seconds), while loading the data was much faster in Python (about 18 seconds versus about 3 minutes).

Modeling in R and Python

When we say “modeling” in data science, we mean teaching a program to learn from training data using machine learning algorithms. In modeling, we find relationships between input variables and the target variable. Many machine learning algorithms can be used to find these relationships, including KNN, Naive Bayes, SVM, Logistic Regression, Decision Trees, Random Forest, and XGBoost.

For example, a customer has written a review for a product (let’s say pasta), so that review (in text) is our input variable, and classifying that text as a positive review or negative review is our target variable. 

Review 1: The pasta is very tasty.

Review 2: The pasta is cheap and not tasty.

So, Review 1 is a positive review and Review 2 is a negative review. ML algorithms learn from the input and output variables of past data and predict the target variable for new data.
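As a rough illustration of "learning from past data," the sketch below trains a tiny Naive Bayes classifier (one of the algorithms listed above) on the two pasta reviews and uses it to label new text. This is not the XGBoost approach used later in the article; the `tokenize`, `train`, and `predict` helpers and the two new reviews are hypothetical, and two training examples are far too few for a real model.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text, strip basic punctuation, and split on whitespace."""
    return text.lower().replace(".", "").replace(",", "").split()

def train(labeled_reviews):
    """Count words per label; these counts are everything the model learns."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_reviews:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_reviews = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_reviews)
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# The two reviews from the text, with their labels.
training_data = [
    ("The pasta is very tasty.", "positive"),
    ("The pasta is cheap and not tasty.", "negative"),
]
model = train(training_data)

# Hypothetical new reviews, used only for illustration.
print(predict(model, "Very tasty pasta"))
print(predict(model, "Cheap and not tasty"))
```

The model labels the first new review positive and the second negative because words such as "very" appeared only in the positive example, while "cheap" and "not" appeared only in the negative one.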

Here, we will implement the XGBoost algorithm, which builds an ensemble of decision trees in which each new tree is trained to correct the errors of the trees before it. The model learns from the training data we loaded earlier in both R and Python, and once trained it can be used to make predictions on new, unseen data.
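Before the library calls, it may help to see the boosting idea in miniature. The sketch below is plain gradient boosting for squared error with one-split trees ("stumps") on a single numeric feature: each round fits a stump to the current errors and adds a scaled-down copy of it to the model. XGBoost itself adds regularization, deeper trees, and many optimizations, so this is only an illustration of the principle. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )

def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Hypothetical toy data: low targets for small x, high targets for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = boost(xs, ys)
baseline = [sum(ys) / len(ys)] * len(ys)
boosted = [predict(model, x) for x in xs]
print("baseline MSE:", mse(ys, baseline))
print("boosted MSE:", mse(ys, boosted))
```

The boosted error on the training data is lower than the error of simply predicting the mean, which is the effect each boosting round is designed to produce.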

R code:

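library(xgboost)
library(Matrix)
library(pROC)  # assumption: the original does not say which package provides auc()

# X_train, y_train, X_val, and y_val are assumed to be prepared beforehand
ctrain <- xgb.DMatrix(Matrix(data.matrix(X_train), sparse = TRUE), label = y_train)
xb <- xgboost(data = ctrain, max_depth = 100,
              nrounds = 100,
              eval_metric = "auc",
              objective = "binary:logistic")
# Predict on the validation features, not on the validation labels
y_pred <- predict(xb, data.matrix(X_val))
val_auc <- auc(y_val, y_pred)  # pROC::auc(response, predictor)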

Output: 
val-auc: 0.80443

Python code:

import xgboost as xgb
from sklearn.metrics import roc_curve, auc

# min_samples_split is a scikit-learn tree parameter that XGBoost does not use, so it is dropped
clf = xgb.XGBClassifier(max_depth=100)
clf.fit(X_train, y_train)
train_fpr, train_tpr, thresholds = roc_curve(y_train, clf.predict_proba(X_train)[:, 1])
train_auc = auc(train_fpr, train_tpr)
val_fpr, val_tpr, thresholds = roc_curve(y_val, clf.predict_proba(X_val)[:, 1])
val_auc = auc(val_fpr, val_tpr)

Output: 
val-auc: 0.83768

XGBoost performs similarly in both programming languages, with Python scoring slightly higher on this run.

Comparison of Python and R for NLP

Using XGBoost to model the text data resulted in similar scores for Python and R (0.838 vs. 0.804). There are many metrics for evaluating the performance of machine learning models. Here we have used the AUC score; AUC means “area under the curve,” specifically the ROC curve. In classification, it measures how well a model ranks positive examples above negative ones on held-out data: 0.5 is no better than random guessing, and 1.0 is a perfect ranking. The slightly higher AUC we get via Python suggests that model performed slightly better here, although a single run on one validation set is not enough to conclude that one language is better than the other.
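One way to read an AUC score is as the chance that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The `auc_score` helper and the toy labels and scores are hypothetical; library functions compute the same quantity more efficiently from the ROC curve.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and model scores for four examples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 3 of the 4 pairs are ranked correctly: 0.75
```

In this toy case the only mistake is that one negative example (0.4) outscored one positive example (0.35), which costs a quarter of the possible pairs.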

Nowadays text data is huge, so deep learning also comes into the picture. Deep learning works well with big data sets, and it is loosely based on the way our brain cells (neurons) work, which is the root of the term “Artificial Neural Networks.” As the amount of data in a dataset increases, the performance of many ML algorithms plateaus, but deep learning performance tends to keep improving as data volume grows.
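For readers who have not met the term before, a single artificial neuron can be sketched in a few lines: it multiplies each input by a weight, adds a bias, and passes the sum through an activation function (a sigmoid here). The `neuron` helper and the numbers are hypothetical; real networks stack many such units and learn the weights from data.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical inputs and weights, chosen so the weighted sum is exactly zero.
output = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
print(output)  # the sigmoid of 0.0 is 0.5
```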

We can also implement deep learning models like a Bidirectional LSTM to improve our performance further. Deep learning models are commonly built with the Keras and TensorFlow APIs, which are primarily used from Python. However, an R interface for Keras is now available for programming in R. The Keras R package allows us to enjoy the benefits of R programming while having access to the capabilities of the Python Keras package, which makes for a powerful combination.

Conclusion

So, after all that: What’s the best language for data science and NLP: Python or R? The answer is, it depends on your personal preferences and what you’re trying to do in your analysis!

All common and necessary data science tasks (data loading, data analysis, data exploration, data preprocessing, data featurization, data modeling, and predictive modeling) can be done in both R and Python. Both languages are user-friendly and easy to get started with. There is also a wide array of libraries available in both languages for text processing, text analysis, and text modeling. Most common ML and deep learning models can be implemented in either language. Deep learning models in R can even run on top of the Python APIs in the back end, meaning there’s an added benefit to being conversant in both languages.

A dedicated data expert never stops developing their skills. While you can certainly start with either language, learning a little bit of the other and building more knowledge over time will definitely help make you more capable in the long run.

Nidhi Bansal is a Data Scientist, Machine Learning/Artificial Intelligence enthusiast, and writer who loves to experiment with data and write about it. She has over a decade of experience in software development in various programming languages and holds a B.Tech and M.E in Electronics and Communications Engineering.
