
Natural Language Processing with R

The field of natural language processing (NLP), which studies how computers and human language interact, is growing rapidly. By enabling machines to comprehend, interpret, and produce natural language, NLP opens up a world of research and application possibilities. The first section of this article looks at the various languages that can be used for NLP, and the second focuses on five NLP packages available in R. We’ll also build a small NLP project in R with the “sentimentr” package.

Natural Language Processing (NLP) plays a crucial role in advancing research in various fields, such as computational linguistics, computer science, and artificial intelligence. The ability to analyze and understand human language, in context, is becoming increasingly important in many areas of research, such as natural language understanding, text mining, and sentiment analysis.

In this article, we’ll look at a few of the languages used for NLP tasks and dive into a Twitter NLP task with R.

Languages

With NLP techniques, researchers can extract valuable insights from unstructured data such as social media posts, customer reviews, and scientific articles. This allows them to gain a deeper understanding of a wide range of phenomena, from social dynamics and consumer behavior to medical diagnostics and drug discovery. In short, NLP is an essential tool for researchers: it enables them to gain new insights and knowledge, leading to advances in many fields.

Several programming languages support NLP tasks, and the choice among them can depend on various factors.
Some of the factors that can affect your choice of programming language for your NLP project include:

– Availability of versatile libraries
– Execution and runtime ability of the language
– Your project goals and deliverables
– Cross-language ability

The mainstream languages that have NLP libraries and allow for exploratory model selection and model development include:

Python

Python’s versatility has led to its reputation as the go-to language for machine learning programming. Its consistent syntax and readable, human-like phrasing also make it one of the easiest languages for beginners to learn. Python includes a large number of packages that allow for code reuse, and its transparent semantics and syntax make it a fantastic option for natural language processing.

Python packages such as Scikit-learn support fundamental machine learning algorithms such as classification and regression, while Keras, Caffe, and TensorFlow enable deep learning. Python is a popular natural language processing language due to its simple syntax and text-processing libraries such as NLTK and spaCy.

R

Statisticians developed R as a tool for statistical computing. R is frequently used for statistical software development, data analysis, and data visualization because it handles large data sets with ease. The language offers a variety of methods for model training and evaluation, making it well suited to machine learning projects that involve heavy data processing. You can read more about the creation of the R language here.

Many R libraries can be used for NLP and related modeling, including randomForest for building random-forest models and caret for classification and regression training. The quanteda package supports the most common NLP techniques, such as tokenizing, stemming, and creating n-grams, making it easy and fast to manipulate the texts in a corpus. Because of its interactive nature, R is an excellent tool for quick prototyping and problem solving, though it is often used for exploratory model building and selection rather than model deployment. You can read more about the packages available in the R project here.

Java

Java is a popular programming language with a large number of open-source libraries. Java is user-friendly and platform-independent, making it ideal for developing AI applications.

A powerful open-source Java NLP framework called Apache OpenNLP serves as a learning-based toolkit for natural language text processing. Supported tools include a name finder, tokenizer, document categorizer, POS tagger, parser, chunker, and sentence detector.

Other languages that can also be used for NLP are:

  • C++: This language, an extension of the C programming language, can be used to build neural networks. C++’s main advantage is its speed, which lets it perform the complex computations vital to AI development quickly.
  • Prolog: Short for “programming in logic,” Prolog is a logical, declarative programming language. It enables users to write shorter, clearer programs even when dealing with challenging AI problems, and because many AI problems are inherently recursive, it is a great choice for artificial intelligence programming.

NLP with R in action

Now let’s dive into the main part of our learning. R is a popular and effective programming language for natural language processing. Its key advantage for NLP is its ability to hold large amounts of text data and perform complex text analysis tasks with relative ease. The “tm” package for text mining and the “openNLP” package for natural language processing are just two of the many libraries and packages available in R for NLP.

  1. The “tm” package:
    This package provides a comprehensive framework for text mining and text analysis in R. It includes functions for text filtering, stemming, and tokenization, among others. Text pre-processing and cleaning, a crucial step in text mining and NLP projects, is one of the best uses for the “tm” package: it offers stopword removal, stemming, and punctuation removal to help prepare text data for further analysis (a cleaning sketch follows the code below).
#To install it, simply type into the R terminal.
install.packages("tm")

#Use of this library
library(tm)
data <- "I travelled yesterday to the great Benin city. The journey was a bit tiring as my flight got delayed for about 4 hours,
and I had to stay in traffic for an hour plus to get to my hotel.
The hotel I am staying at is quite nice, the ambiance of the place is nice."

#Tokenization
tokens <- scan_tokenizer(data)
#The line above uses the 'tm' package's scan_tokenizer() function to tokenize the text data into individual words.
print(tokens)

#DocumentTermMatrix
dtm <- DocumentTermMatrix(Corpus(VectorSource(data)))
inspect(dtm)
#The tm package's DocumentTermMatrix() function generates a Document-Term Matrix (DTM) that represents the frequency of terms in the documents.

#This gives you a matrix with rows as documents and columns as terms, where each cell holds the frequency of that term in that document.
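
Since pre-processing is where the “tm” package shines, here is a minimal cleaning sketch using the package's standard transformations (stemDocument additionally requires the SnowballC package to be installed):

library(tm)

# Build a one-document corpus from the raw text defined above
corpus <- Corpus(VectorSource(data))

# Standard tm cleaning steps: lowercase, strip punctuation and numbers,
# remove English stopwords, stem, and squash extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# The cleaned corpus can then be passed to DocumentTermMatrix() as before
dtm_clean <- DocumentTermMatrix(corpus)
inspect(dtm_clean)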

2. The “openNLP” package:
This package provides an interface to the Apache OpenNLP library, a machine-learning toolkit for natural language processing. It includes tokenization, part-of-speech tagging, and named entity recognition functions. Tokenization and sentence segmentation are two of the “openNLP” package’s best applications: it provides functions for tokenizing text into words or sentences, a necessary step in many NLP tasks such as text classification, sentiment analysis, and text generation. (A part-of-speech tagging sketch follows the tokenization example below.)

#To install it, simply type into the R terminal.
install.packages("openNLP")

#To use the library
library(openNLP)

# You might get an error that "JAVA_HOME cannot be determined from the Registry".
# The error occurs when you are using a 64-bit version of R but not a 64-bit version of Java.
# It's possible you installed a 32-bit version of Java or did not install any Java at all.
# Download 64-bit Java and reinstall the rJava package.

# openNLP builds on the NLP package, and the English maxent models are
# distributed separately from the datacube repository.
install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/")
library(NLP)

# Define the text string to be tokenized
data <- "I travelled yesterday to the great Benin city. The journey was a bit tiring as my flight got delayed for about 4 hours,
and I had to stay in traffic for an hour plus to get to my hotel.
The hotel I am staying at is quite nice, the ambiance of the place is nice."

# Tokenize the text string using openNLP's maxent annotators
text <- as.String(data)
annotations <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                                   Maxent_Word_Token_Annotator()))
tokens <- text[subset(annotations, type == "word")]

# Print the tokens
print(tokens)
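
The description above also mentions part-of-speech tagging. Here is a minimal POS-tagging sketch using the same NLP/openNLP annotator API (it assumes the openNLPmodels.en models from the previous step are installed):

library(NLP)
library(openNLP)

text <- as.String("The hotel I am staying at is quite nice.")

# The POS annotator needs sentence and word annotations first
annotations <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                                   Maxent_Word_Token_Annotator()))
pos_annotations <- annotate(text, Maxent_POS_Tag_Annotator(), annotations)

# Keep only the word-level annotations and extract each word's POS tag
words <- subset(pos_annotations, type == "word")
tags <- sapply(words$features, `[[`, "POS")

# Pair each token with its tag, e.g. "hotel/NN"
print(paste(text[words], tags, sep = "/"))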

3. The “sentimentr” library:
The library enables quick and simple sentiment analysis, with functions for sentiment scoring, classification, and visualization. To score English text, sentimentr combines a polarity lexicon (by default, the Jockers-Rinker lexicon from the “lexicon” package) with valence shifters such as negators and amplifiers. The package offers a number of functions for text sentiment analysis; the most significant is sentiment(), which scores the sentiment of a given text.

# To install it, simply run the command
install.packages ("sentimentr")

# Load the sentimentr package
library(sentimentr)

# Define the text string to be analyzed
text_data <- "The ambiance of the hotel is nice. I love staying at the hotel"

# Perform sentiment analysis on the text string
sentiment_result <- sentiment(text_data)

# Print the sentiment result
print(sentiment_result)

The sentiment() function returns an object of class “sentiment” containing columns such as element_id, sentence_id, word_count, and sentiment. Each element (document) in the text has its own identifier, the element_id; the sentence_id is the number of the sentence within that element, and word_count is that sentence’s word count.

The sentiment column holds a numeric score, typically between -1 and 1. Positive values represent positive sentiment, negative values represent negative sentiment, and values close to zero represent neutral sentiment.
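
If you want one score per document rather than per sentence, sentimentr also provides sentiment_by(), which averages the sentence-level scores for each element. A quick illustration:

# Load the sentimentr package
library(sentimentr)

text_data <- c("The ambiance of the hotel is nice. I love staying at the hotel",
               "The flight delay was terrible and the traffic was awful.")

# One row per element, with the average sentiment across its sentences
by_element <- sentiment_by(text_data)
print(by_element)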

4. The “wordcloud” package:
The R “wordcloud” package makes it easy to create word clouds: visual representations of text data in which each word’s size reflects how frequently it appears in the text.

The most important function in the “wordcloud” package is wordcloud(), which produces a word cloud from a supplied text corpus. The function takes several inputs, including the text data, the maximum number of words to include, and parameters controlling the size and shape of the cloud (a sketch using these options follows the basic example below).

# Install the wordcloud package if it is not already installed
install.packages("wordcloud")

# Load the wordcloud package
library(wordcloud)

# Define the text string to be used for the word cloud
text_data <- "This is a very nice hotel, I love it so much! The hotel is so good, I highly recommend it to everyone."

# Create the word cloud (when given raw text, wordcloud() uses the tm package internally to tokenize it)
wordcloud(text_data)

By default, the word cloud includes all of the terms in the text, with each word’s size proportional to its frequency. The cloud is plotted in the active graphics window.
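
To exercise the optional inputs mentioned above, you can pass arguments such as max.words, scale, and colors (a minimal sketch; the RColorBrewer package supplies the color palette):

# Load the packages
library(wordcloud)
library(RColorBrewer)

text_data <- "This is a very nice hotel, I love it so much! The hotel is so good, I highly recommend it to everyone."

# Cap the cloud at 50 words, scale word sizes between 0.5 and 3,
# place the most frequent words in the center, and color by frequency
wordcloud(text_data,
          max.words = 50,
          scale = c(3, 0.5),
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))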

5. The “quanteda” package:
Quanteda is an R package for quantitative text analysis. It provides a flexible and effective framework for working with text data in R. Tokenization, stemming, part-of-speech tagging, n-grams, and text statistics are just a few of the text analysis tools available. It also provides a simple interface for creating and editing text corpora, or groupings of text documents.

Text pre-processing and cleaning is one of the “quanteda” package’s best applications: it offers stopword removal, stemming, and punctuation removal functions that can assist in preparing text data for additional analysis (a cleaning sketch follows the example below). Additionally, through its companion readtext package, it can read data in a variety of formats, including plain text, PDF, and Microsoft Word, which is helpful when gathering data from different sources.

# Install the quanteda package if it is not already installed
install.packages("quanteda")

# Load the quanteda package
library(quanteda)

# Define the text data to be used for the corpus
text_data <- c("This is a very nice hotel, I love it so much!",
               "The hotel is so good, I highly recommend it to everyone.")

# Create the corpus
corpus <- corpus(text_data)

# Perform some basic text analysis
tokens <- tokens(corpus)
dfmat <- dfm(tokens)

# Print the tokens
print(tokens)

# Print the Document-Feature Matrix
print(dfmat)
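
The pre-processing described earlier maps onto quanteda's tokens_* helpers. A minimal cleaning sketch before building the DFM:

library(quanteda)

# Reuse the text_data vector defined above;
# tokenize while dropping punctuation, then remove stopwords and stem
toks <- tokens(text_data, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# Build the Document-Feature Matrix from the cleaned tokens
dfmat_clean <- dfm(toks)
print(topfeatures(dfmat_clean))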

Finally, let’s put these packages to work on a small Twitter sentiment-analysis project. The first step is to get your Twitter credentials, which authenticate your application with the Twitter API and allow you to access Twitter data.

Here are the steps to get these credentials:

  1. Go to the Twitter Developer website (https://developer.twitter.com/) and sign in with your Twitter account.
  2. Click on the “Create an app” button.
  3. Fill in the required information for your application, including the name, website, and a brief description.
  4. Once you have created your app, click on the “Keys and Tokens” tab.
  5. Click on the “Generate” button to generate an API key and API secret for your app.
  6. Click on the “Generate” button under “Access Token & Access Token Secret” to generate an access token and an access token secret for your app.
  7. Save these credentials, as they will be used in the setup_twitter_oauth() function.
# Install the twitteR and sentimentr packages if they are not already installed
install.packages(c("twitteR", "sentimentr"))

# Load the twitteR and sentimentr packages
library(twitteR)
library(sentimentr)

# Authenticate with Twitter using your Twitter API credentials
setup_twitter_oauth("API_key", "API_secret", "access_token", "access_token_secret")

# Define the search term and number of tweets to retrieve
search_term <- "#2023election"
num_tweets <- 1000

# Search for tweets containing the search term
tweets <- searchTwitter(search_term, n = num_tweets)

# Extract the text from the tweets
tweet_text <- sapply(tweets, function(x) x$getText())

# Score the sentiment of each tweet; sentiment_by() averages sentence-level scores per tweet
sentiment_result <- sentiment_by(tweet_text)

# Create a data frame of the tweets and their average sentiment scores
tweet_sentiment <- data.frame(text = tweet_text, sentiment = sentiment_result$ave_sentiment)

# Print the first few rows of the data frame
head(tweet_sentiment)
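
From here you can summarize the results; for example, a quick base-R look at the distribution of scores (this assumes the tweet_sentiment data frame built above):

# Distribution of per-tweet sentiment scores
hist(tweet_sentiment$sentiment,
     breaks = 30,
     main = "Sentiment of #2023election tweets",
     xlab = "Average sentiment score")

# Share of negative, neutral, and positive tweets
table(cut(tweet_sentiment$sentiment,
          breaks = c(-Inf, -0.05, 0.05, Inf),
          labels = c("negative", "neutral", "positive")))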

Conclusion

The field of natural language processing (NLP) is becoming increasingly important in a variety of industries. As we have seen, R is a powerful language that meets the majority of NLP analysis requirements, particularly when paired with popular packages such as “tm”, “sentimentr”, and “quanteda”. These tools enable text mining, sentiment analysis, and text classification.

By utilizing these tools and taking an organized approach, it is possible to develop a successful NLP project using R, as shown in the simple project. R offers a user-friendly and effective platform for NLP projects, making it a crucial tool for data scientists and researchers who study natural language processing.

Thanks for taking the time to read my blog ❤️. You can reach out to me on LinkedIn.

Daniel Tope Omole, Heartbeat author
