
How To Perfectly Clean Your Text Data For NLP

Photo by Afif Kusuma on Unsplash

Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language.

NLP is a branch of AI that takes text data as input and returns models that can understand and generate insights from new text data. One of the most important steps in creating these models is converting raw text data into a much cleaner version that contains only useful information. In this blog, we will look at some techniques to perfectly clean text data for natural language processing.

It is important to apply these steps in the same order as presented below; otherwise, you could end up losing lots of useful data.

Cleaning Data

Normalize Text

It is very common for text data to contain words that follow a certain capitalization scheme (camel case, title case, sentence case, etc.) or mis-capitalized words (e.g., pYthOn). Both create problems in analysis, so it is important to normalize the text to lowercase.

text = 'Python PROGRAMMING LanGUage.'
text.lower()
------------------
python programming language.

Remove Unnecessary Whitespaces

Most of the text data you collect from the web may contain extra spaces between words or before and after sentences. It is important to remove these before applying any text processing or cleaning technique to the data.

import regex as re

doc = 'python programming    language     '
re.sub(r"\s+", " ", doc).strip()
------------------------
python programming language

Removing Unwanted Data

Unwanted data refers to parts of the text that don’t add any value to analysis and model building: for example, hashtags, HTML tags, mentions, emails, URLs, phone numbers, or special combinations of characters. We can remove these from our text data entirely or replace them with a representative word.
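
Hashtags and mentions, for instance, can be stripped with a couple of simple regular expressions (or replaced with a representative token instead of an empty string). The sample sentence and patterns below are only illustrative sketches, not exhaustive rules:

import regex as re

doc = 'loved the cinematography! kudos to @chris #mustwatch #weekend'
no_mentions = re.sub(r'@\w+', '', doc)                # drop mentions entirely
no_hashtags = re.sub(r'#(\w+)', r'\1', no_mentions)   # keep the word, drop the #
re.sub(r'\s+', ' ', no_hashtags).strip()
--------------------
loved the cinematography! kudos to mustwatch weekend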

HTML Tags
HTML tags start with a < followed by the tag name and end with a >.

import regex as re

doc = '<p> Food is very good and <b>cheap</b>.</p>'
re.sub('<.*?>', '', doc).strip()
-------------------
Food is very good and cheap.

Emails
Email addresses follow a common pattern: a personalized name, sometimes followed by digits or special symbols, then an @, and finally the domain of an email service provider (Gmail being one of the most popular), like dazzleninja_44@gmail.com.

import regex as re

doc = 'you can contact me on my work email dazzleninja_44@gmail.com for any queries.'
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', "", doc)
---------------------
you can contact me on my work email for any queries.
"""
[a-z0-9+._-]+
@
[a-z0-9+._-]+
\
.
[a-z0-9+_-]+
"""

URLs
A generic URL contains a protocol, subdomain, domain name, top level domain, and directory path.

doc = 'follow my medium profile at https://medium.com/@abhayparashar31 and subscribe to my email list at https://abhayparashar31.medium.com/subscribe'
import regex as re
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , doc)
--------------------
"""
follow my medium profile at  and subscribe to my email list at
"""

"""
(http|https|ftp|ssh)
://
([\w_-]+(?:(?:\.[\w_-]+)+))
([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
"""

Accented Characters
Accent marks are symbols placed over letters, especially vowels, to indicate the pronunciation of a word. These characters cause problems in analysis by unnecessarily increasing the vocabulary size.

For example, résumé and resume are two different words to our model, whereas both carry the same meaning. Accented characters usually appear when you collect data from a web source or a multilingual source.

import unicodedata

doc = 'résumé length is good. resume font is bad.'
unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
-----------------------
resume length is good. resume font is bad.

Abbreviations

An abbreviation is a shortened form of a word or phrase, for example, TTYL: talk to you later. These usually occur in social media datasets. It is important to replace abbreviations with their full forms, otherwise our model will not be able to learn proper patterns from the data. You can find a JSON file with the most common abbreviations and their full versions here on my GitHub profile.

x = "it'd've better if less food oil is added."import json
abbreviations = json.load(open('PATH'))
for key in abbreviations:
    if key in x:
    x = x.replace(key,abbreviations[key])
print(x)
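
If you just want to see the replacement logic in action without downloading the JSON file, a small inline dictionary works the same way (the entries below are made up for illustration):

x = "it'd've better if less food oil is added."
abbreviations = {"it'd've": "it would have", "ttyl": "talk to you later"}
for key in abbreviations:
    if key in x:
        x = x.replace(key, abbreviations[key])
print(x)
------------------
it would have better if less food oil is added.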

Remove Special Symbols

Special symbols are characters that are considered neither letters nor digits. Various symbols, punctuation, and accent marks all count as special symbols. They don’t add any value for modeling, so it is important to remove them from the text.

import regex as re

doc = 'Congrats!, David You have won 1000$.'
re.sub(r'[^\w ]+', "", doc)
-----------------------
Congrats David You have won 1000

Stopwords

Stopwords are common words that add little value to a sentence’s meaning. For the purpose of analyzing text and building NLP models, these words add little information, so it is best practice to remove them before proceeding to vectorization. Some of the most common English stopwords are: the, is, for, when, to, at, etc.

There are many ways to remove stopwords, one of the simplest methods is by using the NLTK library.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

doc = 'this is one of the best action movie i have ever watched.'
english_stopwords = set(stopwords.words('english'))
cleaned_doc = ' '.join([word for word in doc.split() if word not in english_stopwords])
print(cleaned_doc)
------------------------
one best action movie ever watched.

Stemming

Stemming is the process of reducing a word to its root by removing suffixes and prefixes. Stemming will reduce ‘Learning,’ ‘Learns,’ and ‘Learned’ to their root word ‘Learn.’ The NLTK library offers many stemmers, but of them all, the Porter stemmer and its upgraded version (the Snowball stemmer) are the most widely used.

# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
doc = 'learning learn learned learns'
text = " ".join([ps.stem(word) for word in word_tokenize(doc)])
print(text)
------------------
learn learn learn learn
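
The upgraded version mentioned above, the Snowball stemmer, also ships with NLTK, and swapping it in is a one-line change. A quick sketch on the same example:

from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

ss = SnowballStemmer('english')
doc = 'learning learn learned learns'
print(" ".join([ss.stem(word) for word in word_tokenize(doc)]))
------------------
learn learn learn learn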

Lemmatization

Lemmatization is similar to stemming, but the difference between the two is that lemmatization takes the morphological analysis of a word into consideration, which allows it to differentiate between present, past, and indefinite tenses.

# nltk.download('wordnet')
# nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

doc = 'history always repeat itself.'
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()
lemma_text = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(doc)])
stem_text = " ".join([ps.stem(word) for word in word_tokenize(doc)])
print('Lemmatization: ', lemma_text)
print('Stemming: ', stem_text)
----------------
Lemmatization:  history always repeat itself. 
Stemming:  histori alway repeat itself.
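
The tense handling comes from the part-of-speech argument: by default, the WordNet lemmatizer treats every word as a noun, so passing pos='v' is what lets it reduce verb forms. A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('repeated'))           # treated as a noun by default
print(lemmatizer.lemmatize('repeated', pos='v'))  # treated as a verb
----------------
repeated
repeat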

Conclusion

As a quick recap of the article, the first step of text cleaning is normalization, which converts the text to lowercase. The next steps use regular expressions to remove unwanted data from the text by replacing it with whitespace or a representative token. The cleaning process ends with removing stopwords and converting words to their base form using stemming or lemmatization.
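
To tie these steps together, here is a minimal sketch of a single cleaning function that applies them in the order described above. The URL pattern is simplified to \S+ for brevity, the sample sentence is made up, and stopword removal plus stemming/lemmatization would follow on the cleaned output:

import unicodedata
import regex as re

def clean_text(doc):
    doc = doc.lower()                                                     # normalize case
    doc = re.sub(r'<.*?>', ' ', doc)                                      # drop HTML tags
    doc = re.sub(r'(http|https|ftp|ssh)://\S+', ' ', doc)                 # drop URLs (simplified pattern)
    doc = re.sub(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+', ' ', doc)  # drop emails
    doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')  # strip accents
    doc = re.sub(r'[^\w ]+', '', doc)                                     # drop special symbols
    return re.sub(r'\s+', ' ', doc).strip()                               # squeeze whitespace

clean_text('<p>My Résumé is at https://example.com/cv. Email me at name_1@gmail.com!</p>')
--------------------
my resume is at email me at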

Abhay Parashar, Heartbeat
