{"id":7264,"date":"2023-08-21T09:17:42","date_gmt":"2023-08-21T17:17:42","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7264"},"modified":"2025-04-24T17:14:38","modified_gmt":"2025-04-24T17:14:38","slug":"how-to-perfectly-clean-your-text-data-for-nlp","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/","title":{"rendered":"How To Perfectly Clean Your Text Data For NLP"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<figure class=\"mk ml mm mn mo mp mh mi paragraph-image\">\n<div class=\"mq mr eb ms bg mt\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mu mv c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG\" alt=\"\" width=\"700\" height=\"394\"><\/figure><div class=\"mh mi mj\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mw mx my mh mi mz na be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af hb\" href=\"https:\/\/unsplash.com\/@javaistan?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Afif Kusuma<\/a> on <a class=\"af hb\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"838d\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\">Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. 
In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language.<\/p>\n<p id=\"0e3c\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\">NLP is a branch of AI that takes text data as input and produces models that can understand and generate insights from new text data. One of the most important steps in creating these models is converting raw text into a cleaned version that contains only useful information. In this post, we will look at some techniques to perfectly clean text data for natural language processing.<\/p>\n<p id=\"2d6d\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\">It is important to apply the steps in the order given below; otherwise, you could end up losing a lot of useful data.<\/p>\n<h1 id=\"a620\" class=\"nw nx fo be ny nz oa go ob oc od gr oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Cleaning Data<\/h1>\n<h2 id=\"a29c\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Normalize Text<\/h2>\n<p id=\"df08\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Text data very commonly contains words that follow a particular capitalization scheme, such as camel case, title case, or sentence case, as well as mis-capitalized words (e.g., pYthOn). 
Both types create problems in analysis, so it is important to normalize the text to <code class=\"cw po pp pq pr b\">lowercase<\/code>.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"a1fa\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">text = 'Python PROGRAMMING LanGUage.'\ntext.lower()\n------------------\n<em class=\"pz\">python programming language.<\/em><\/span><\/pre>\n<h2 id=\"1562\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Remove Unnecessary Whitespaces<\/h2>\n<p id=\"738d\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Text collected from the web often contains extra spaces between words and before or after a sentence. It is important to remove these before applying any other text processing or cleaning technique to the data.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"a2cc\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = 'python programming    language     '<\/span><span id=\"6932\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import re\n# collapse each run of whitespace into one space, then trim both ends\nre.sub(r\"\\s+\", \" \", doc).strip()\n------------------------\n<em class=\"pz\">python programming language<\/em><\/span><\/pre>\n<h2 id=\"7f62\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Removing Unwanted Data<\/h2>\n<p id=\"371a\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Unwanted data refers to parts of the text that add no value to analysis or model building, for example hashtags, HTML tags, mentions, emails, URLs, phone numbers, or some special combination of characters. 
We can remove these completely from our text data or replace each of them with a representative placeholder word.<\/p>\n<p id=\"7835\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\"><strong class=\"be qb\">HTML Tags<br>\n<\/strong>An HTML tag starts with <code class=\"cw po pp pq pr b\">&lt;<\/code>, followed by the tag name, and ends with <code class=\"cw po pp pq pr b\">&gt;<\/code>.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"133f\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = '&lt;p&gt; Food is very good and &lt;b&gt;cheap&lt;\/b&gt;.&lt;\/p&gt;'<\/span><span id=\"e483\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import re\n# non-greedy match removes each tag; strip() drops the leftover leading space\nre.sub(r'&lt;.*?&gt;', '', doc).strip()\n-------------------\n<em class=\"pz\">Food is very good and cheap.<\/em><\/span><\/pre>\n<p id=\"5953\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\"><strong class=\"be qb\">Emails<br>\n<\/strong>Gmail is one of the most widely used email service providers. An email address usually starts with a personalized name, optionally followed by digits or special symbols, then an <code class=\"cw po pp pq pr b\">@<\/code> sign, and ends with the domain of the email service provider. 
For example, <code class=\"cw po pp pq pr b\">dazzleninja_44@gmail.com<\/code>.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"b603\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = 'you can contact me on my work email dazzleninja_44@gmail.com for any queries.'<\/span><span id=\"3b4d\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import re\n# the pattern is lowercase-only, so apply it after lowercasing the text\nre.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\\.[a-z0-9+_-]+)',\"\", doc)\n---------------------\n<em class=\"pz\">you can contact me on my work email  for any queries.\n<\/em><\/span><span id=\"6625\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\"><strong class=\"pr fp\">\"\"\"<\/strong>\n[a-z0-9+._-]+   local part (user name)\n@               literal @ separator\n[a-z0-9+._-]+   domain name\n\\.              literal dot\n[a-z0-9+_-]+    top-level domain\n<strong class=\"pr fp\">\"\"\"<\/strong><\/span><\/pre>\n<p id=\"c51e\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\"><strong class=\"be qb\">URLs<br>\n<\/strong>A generic URL contains a protocol, subdomain, domain name, top-level domain, and directory path.<\/p>\n<pre>doc = 'follow my medium profile at https:\/\/medium.com\/@abhayparashar31 and subscribe to my email list at https:\/\/abhayparashar31.medium.com\/subscribe'\nimport re\nre.sub(r'(http|https|ftp|ssh):\/\/([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&amp;:\/~+#-]*[\\w@?^=%&amp;\/~+#-])?', '' , doc)\n--------------------\n\"\"\"\nfollow my medium profile at  and subscribe to my email list at\n\"\"\"\n\n\"\"\"\n(http|https|ftp|ssh)\n:\/\/\n([\\w_-]+(?:(?:\\.[\\w_-]+)+))\n([\\w.,@?^=%&amp;:\/~+#-]*[\\w@?^=%&amp;\/~+#-])?\n\"\"\"<\/pre>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<blockquote class=\"qn\"><p id=\"3006\" class=\"qo qp fo be qq qr qs qt qu qv qw nv dv\" data-selectable-paragraph=\"\">Relying on traditional processes and inconsistent model 
management can block your team from getting models to production. Building an MLOps strategy can help.<a class=\"af hb\" href=\"https:\/\/go.comet.ml\/ebook-Building-Effective-ML-Teams.html\" target=\"_blank\" rel=\"noopener ugc nofollow\"> Learn more with our free ebook<\/a>.<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"ab ca qf qg qh qi\" role=\"separator\"><\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"aed1\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\"><strong class=\"be qb\">Accented Characters<br>\n<\/strong>Accent marks are symbols used over letters especially vowels to emphasize the pronunciation of a word. These characters cause problems in analysis by increasing the vocabulary size unnecessarily.<\/p>\n<p id=\"db61\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\">For example, r\u00e9sum\u00e9 and resume are two different words for our model, whereas both of them produce the same meaning. These usually occur when you try to collect data from a web source, or a multilingual source.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"0f38\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = 'r\u00e9sum\u00e9 length is good. resume font is bad.'<\/span><span id=\"4aea\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import unicodedata\nunicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')\n-----------------------\n<em class=\"pz\">resume length is good. 
resume font is bad.<\/em><\/span><\/pre>\n<h2 id=\"8e69\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Abbreviations<\/h2>\n<p id=\"f150\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">An abbreviation is a shortened form of a word or phrase, for example TTYL (talk to you later). These usually occur in social media datasets. It is important to replace abbreviations with their full forms; otherwise, the model will not be able to learn proper patterns from the data. You can find a JSON file that maps the most common abbreviations to their full forms on my <a class=\"af hb\" href=\"https:\/\/github.com\/Abhayparashar31\/crazytext\/blob\/main\/crazytext\/data\/abbreviations_wordlist.json\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"pz\">GitHub<\/em><\/a> profile.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"5150\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">x = \"it'd've better if less food oil is added.\"<\/span><span id=\"8739\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import json\n# PATH is the local path to the downloaded abbreviations JSON file\nwith open('<a class=\"af hb\" href=\"https:\/\/raw.githubusercontent.com\/Abhayparashar31\/crazytext\/main\/crazytext\/data\/abbreviations_wordlist.json\" target=\"_blank\" rel=\"noopener ugc nofollow\">PATH<\/a>') as f:\n    abbreviations = json.load(f)\n# replace every abbreviation found in the text with its full form\nfor key in abbreviations:\n    if key in x:\n        x = x.replace(key, abbreviations[key])\nprint(x)<\/span><\/pre>\n<h2 id=\"90a0\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Remove Special Symbols<\/h2>\n<p id=\"47fb\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Special symbols are characters that are 
neither letters nor digits. Various symbols, punctuation marks, and accent marks count as special symbols. They add no value during modeling, so it is important to remove all of them from the text.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"f021\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = 'Congrats!, David You have won 1000$.'<\/span><span id=\"2083\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import re\n# drop every character that is not a word character or a space\nre.sub(r'[^\\w ]+', \"\", doc)\n-----------------------\n<em class=\"pz\">Congrats David You have won 1000<\/em><\/span><\/pre>\n<h2 id=\"b012\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Stopwords<\/h2>\n<p id=\"3726\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Stopwords are very common words that add little meaning to a sentence. When analyzing text and building NLP models, these words usually add little value, so it is common practice to remove them before moving on to vectorization. 
Some of the most common English stopwords are: the, is, for, when, to, at, etc.<\/p>\n<p id=\"2478\" class=\"pw-post-body-paragraph nb nc fo be b gm nd ne nf gp ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fh bj\" data-selectable-paragraph=\"\">There are many ways to remove stopwords. One of the simplest is to use the NLTK library.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"117f\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\">doc = 'this is one of the best action movie i have ever watched.'<\/span><span id=\"af7a\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">import nltk\nnltk.download('stopwords')\nfrom nltk.corpus import stopwords<\/span><span id=\"4f52\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">english_stopwords = set(stopwords.words('english'))<\/span><span id=\"1334\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">cleaned_doc = ' '.join([word for word in doc.split() if word not in english_stopwords])\nprint(cleaned_doc)\n------------------------\n<em class=\"pz\">one best action movie ever watched.<\/em><\/span><\/pre>\n<h2 id=\"322c\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Stemming<\/h2>\n<p id=\"e0a3\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Stemming is the process of reducing a word to its root by stripping affixes, most commonly suffixes, from it. 
Stemming will reduce \u2018Learning,\u2019 \u2018Learns,\u2019 and \u2018Learned\u2019 to their root word \u2018Learn.\u2019 The NLTK library offers many stemmers, but the Porter stemmer and its upgraded version are the most widely used.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"940d\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\"># nltk.download('punkt')<\/span><span id=\"2191\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">from nltk.tokenize import word_tokenize\nfrom nltk.stem import PorterStemmer\nps = PorterStemmer()<\/span><span id=\"ad7c\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">doc = 'learning learn learned learns'<\/span><span id=\"9a1a\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">text = \" \".join([ps.stem(word) for word in word_tokenize(doc)])\nprint(text)\n------------------<em class=\"pz\">\nlearn learn learn learn<\/em><\/span><\/pre>\n<h2 id=\"7d20\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Lemmatization<\/h2>\n<p id=\"ab75\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">Lemmatization is similar to stemming, but it takes the morphological analysis of words into account, which allows it to differentiate between present, past, and indefinite tenses.<\/p>\n<pre class=\"mk ml mm mn mo ps pr pt pu ax pv bj\"><span id=\"e2cc\" class=\"os nx fo pr b ib pw px l ir py\" data-selectable-paragraph=\"\"># nltk.download('wordnet')\n# nltk.download('omw-1.4')<\/span><span id=\"7808\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">doc = 'history always repeat itself.'<\/span><span id=\"3299\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">from 
nltk.stem import WordNetLemmatizer\nlemmatizer = WordNetLemmatizer()<\/span><span id=\"31ad\" class=\"os nx fo pr b ib qa px l ir py\" data-selectable-paragraph=\"\">text = \" \".join([lemmatizer.lemmatize(word) for word in word_tokenize(doc)])\nprint('Lemmatization: ',text)\n----------------\n<strong class=\"pr fp\"><em class=\"pz\">Lemmatization<\/em><\/strong><em class=\"pz\">:  history always repeat itself.\n<\/em><strong class=\"pr fp\"><em class=\"pz\">Stemming<\/em><\/strong><em class=\"pz\">:  histori alway repeat itself.<\/em><\/span><\/pre>\n<h2 id=\"6b5e\" class=\"os nx fo be ny ot ou ov ob ow ox oy oe nj oz pa pb nn pc pd pe nr pf pg ph pi bj\" data-selectable-paragraph=\"\">Conclusion<\/h2>\n<p id=\"5da6\" class=\"pw-post-body-paragraph nb nc fo be b gm pj ne nf gp pk nh ni nj pl nl nm nn pm np nq nr pn nt nu nv fh bj\" data-selectable-paragraph=\"\">To recap, the first step in text cleaning is normalization, which converts text to lowercase. The next step uses regular expressions to remove unwanted data from the text, either deleting it or replacing it with a representative word. The text cleaning process ends by removing stopwords and reducing words to their base form using stemming or lemmatization.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Afif Kusuma on Unsplash Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language. 
NLP is [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[140],"class_list":["post-7264","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How To Perfectly Clean Your Text Data For NLP - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How To Perfectly Clean Your Text Data For NLP\" \/>\n<meta property=\"og:description\" content=\"Photo by Afif Kusuma on Unsplash Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language. 
NLP is [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-21T17:17:42+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG\" \/>\n<meta name=\"author\" content=\"Abhay Parashar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Abhay Parashar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How To Perfectly Clean Your Text Data For NLP - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/","og_locale":"en_US","og_type":"article","og_title":"How To Perfectly Clean Your Text Data For NLP","og_description":"Photo by Afif Kusuma on Unsplash Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language. 
NLP is [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-21T17:17:42+00:00","article_modified_time":"2025-04-24T17:14:38+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG","type":"","width":"","height":""}],"author":"Abhay Parashar","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Abhay Parashar","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"How To Perfectly Clean Your Text Data For NLP","datePublished":"2023-08-21T17:17:42+00:00","dateModified":"2025-04-24T17:14:38+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/"},"wordCount":814,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/","url":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/","name":"How To Perfectly Clean Your Text Data For NLP - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG","datePublished":"2023-08-21T17:17:42+00:00","dateModified":"2025-04-24T17:14:38+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*V5vgPHChyVTcsUUG"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-perfectly-clean-your-text-data-for-nlp\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"How To Perfectly Clean Your Text Data For NLP"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet 
ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7264","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7264"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7264\/revisions"}],"predecessor-version":[{"id":15573,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7264\/revisions\/15573"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7264"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}