{"id":6913,"date":"2023-07-24T15:49:05","date_gmt":"2023-07-24T23:49:05","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6913"},"modified":"2025-04-24T17:15:07","modified_gmt":"2025-04-24T17:15:07","slug":"the-biggest-challenges-in-nlp-and-how-to-overcome-them","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/","title":{"rendered":"The biggest challenges in NLP and how to overcome them"},"content":{"rendered":"\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<figure class=\"lx ly lz ma mb mc lu lv paragraph-image\">\n<div class=\"md me eb mf bg mg\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mh mi c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*jVPlabZnlU1PqZFKHGv8zg.jpeg\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"lu lv lw\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mj mk ml lu lv mm mn be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af mo\" href=\"https:\/\/unsplash.com\/@mrthetrain\" target=\"_blank\" rel=\"noopener ugc nofollow\">Joshua Hoehne<\/a> via Unsplash<\/figcaption><\/figure>\n<p id=\"3fc6\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Humans produce so much text data that we do not even realize the value it holds for businesses and society today. 
We don\u2019t realize its importance because it\u2019s part of our day-to-day lives and easy for us to understand, but for a computer, working out what this same text says or describes is a big challenge.<\/p>\n<p id=\"e5b1\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">This is where NLP (Natural Language Processing) comes into play: the set of techniques used to help computers understand text data. However, this is not an easy task. Learning a language is already hard for us humans, so you can imagine how difficult it is to teach a computer to understand text data.<\/p>\n<p id=\"8a89\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Although NLP has been growing and works hand-in-hand with NLU (Natural Language Understanding) to help computers understand and respond to human language, the major challenge it faces is how fluid and inconsistent language can be.<\/p>\n<p id=\"da76\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">So let\u2019s look into some of these challenges and a few solutions.<\/p>\n<h1 id=\"eff9\" class=\"nm nn fo be no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj bj\" data-selectable-paragraph=\"\">Context<\/h1>\n<p id=\"06fe\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Much of a message\u2019s meaning comes from its context rather than from the words alone. Words make up text data, but words and phrases have different meanings depending on the context of a sentence. As humans, we learn from birth to pick up on context. 
Although NLP models are trained on many words and their usages, one thing they struggle to capture is context.<\/p>\n<p id=\"3e02\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Let\u2019s take Mandarin, for example. The language has four main tones, and the same syllable spoken in a different tone can be a different word. On top of that, Mandarin has many homophones: two or more words that have the same pronunciation but different meanings. This makes tasks such as speech recognition difficult, because the input is audio rather than text and the intended word has to be inferred from context.<\/p>\n<h2 id=\"3746\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Solution<\/h2>\n<p id=\"8791\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Embedding. This is the representation of words as numeric vectors for text analysis. It helps a machine to better understand human language through a distributed representation of the text in an n-dimensional space. 
The technique is widely used across NLP tasks, one of them being understanding the context of words.<\/p>\n<p id=\"b842\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">There are two types of embedding techniques I will talk about here:<\/p>\n<ul class=\"\">\n<li id=\"cf27\" class=\"mp mq fo be b mr ms mt mu mv mw mx my pg na nb nc ph ne nf ng pi ni nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Word embedding<\/li>\n<li id=\"efec\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Contextual embedding<\/li>\n<\/ul>\n<p id=\"a539\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">The aim of both embedding techniques is to learn a representation of each word in the form of a vector.<\/p>\n<p id=\"2389\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Word embedding builds a global vocabulary, assigning one fixed vector to each unique word regardless of the sentence it appears in. The model learns these vectors from the words that frequently appear close to one another in a document, so related words end up with similar vectors. However, the limitation of word embedding comes from the very challenge we are discussing: context.<\/p>\n<p id=\"71ad\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">One of the most popular word embedding techniques is word2vec, an NLP tool that uses a neural network model to learn word associations from a large corpus of text. 
However, the major limitation of word2vec is that it cannot use context to tell apart the senses of polysemous words, which have several meanings but share a single vector.<\/p>\n<p id=\"0596\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Some examples:<\/p>\n<ul class=\"\">\n<li id=\"3250\" class=\"mp mq fo be b mr ms mt mu mv mw mx my pg na nb nc ph ne nf ng pi ni nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\"><em class=\"pr\">\u201cI need to go to the <\/em><strong class=\"be ps\"><em class=\"pr\">bank<\/em><\/strong><em class=\"pr\"> to deal with some financial matters\u201d<\/em><\/li>\n<li id=\"12f6\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\"><em class=\"pr\">\u201cYou need to cross over the river <\/em><strong class=\"be ps\"><em class=\"pr\">bank<\/em><\/strong><em class=\"pr\"> to reach your destination\u201d<\/em><\/li>\n<\/ul>\n<p id=\"4045\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">This is where contextual embedding comes into play. It learns sequence-level semantics by taking into consideration the sequence of all the words in the document. This gives the model a better understanding of polysemous words such as \u201cbank\u201d in the examples above.<\/p>\n<p id=\"92bd\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Contextual word embedding works by building a vector for each word based on the words around it. 
This provides a representation for each token in the input sentence.<\/p>\n<figure class=\"pu pv pw px py mc lu lv paragraph-image\">\n<div class=\"md me eb mf bg mg\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mh mi c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*LgLiPOG4Y_GCcttq\" alt=\"\" width=\"700\" height=\"481\"><\/figure>\n<\/div>\n<figcaption class=\"mj mk ml lu lv mm mn be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mo\" href=\"https:\/\/www.cs.princeton.edu\/courses\/archive\/spring20\/cos598C\/lectures\/lec3-contextualized-word-embeddings.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Contextualised Word Embedding Princeton<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"f22f\" class=\"nm nn fo be no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj bj\" data-selectable-paragraph=\"\">Errors in spelling<\/h1>\n<p id=\"2646\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Everybody makes spelling mistakes, but most of us can work out what the word was meant to be. However, this is a major challenge for computers, as they don\u2019t have the same ability to infer the intended word. 
They take the text literally, so NLP models are very sensitive to spelling mistakes.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"6c65\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Solution<\/h2>\n<p id=\"6057\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Cosine similarity is one method that can be used to resolve spelling mistakes in NLP tasks. It measures the cosine of the angle between two vectors in a multi-dimensional space: the closer the value is to 1, the more similar the two vectors are. Because it depends on the angle between the vectors rather than their length, it still works when the items being compared differ in size.<\/p>\n<p id=\"98ab\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Applied to spelling, each word is represented as a vector of its letters, for example by counting how often each letter occurs, and the similarity between a dictionary word and the misspelt word is the cosine between their two vectors. 
Using this technique, we can set a threshold, scan through the dictionary words whose spelling is similar to the misspelt word, and treat those that score above the threshold as potential replacements.<\/p>\n<figure class=\"pu pv pw px py mc lu lv paragraph-image\">\n<div class=\"md me eb mf bg mg\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mh mi c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*VtjrWJqq8Ksu8vYB\" alt=\"\" width=\"700\" height=\"328\"><\/figure>\n<\/div>\n<figcaption class=\"mj mk ml lu lv mm mn be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mo\" href=\"https:\/\/datascience-enthusiast.com\/DL\/Operations_on_word_vectors.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Data Science Enthusiast<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"fe17\" class=\"nm nn fo be no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj bj\" data-selectable-paragraph=\"\">Unstructured Data<\/h1>\n<p id=\"9542\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Most of the data used for NLP tasks comes from conversations, emails, tweets, etc. This type of data is highly unstructured, which makes it challenging to extract useful information from it.<\/p>\n<p id=\"8ea7\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Before you start cooking, preparing your ingredients makes your life 10x easier. 
You don\u2019t want to be in the middle of cooking your dish and realize you are missing three ingredients.<\/p>\n<h2 id=\"0137\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Solution<\/h2>\n<p id=\"2d20\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">The same applies when working with data. You want to ensure that the data you feed into your NLP model is of high quality. Common cleaning steps include:<\/p>\n<ul class=\"\">\n<li id=\"0d44\" class=\"mp mq fo be b mr ms mt mu mv mw mx my pg na nb nc ph ne nf ng pi ni nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Removing the URLs and HTML tags<\/li>\n<li id=\"dd0f\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Removing numeric and alphanumeric words<\/li>\n<li id=\"5b8d\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Removing punctuation and special characters<\/li>\n<li id=\"f54f\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Removing stop words<\/li>\n<li id=\"9c20\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Converting the text to lowercase<\/li>\n<li id=\"80b2\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Text standardization<\/li>\n<li id=\"16dc\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Lemmatization<\/li>\n<li id=\"7127\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" 
data-selectable-paragraph=\"\">Stemming<\/li>\n<li id=\"93d8\" class=\"mp mq fo be b mr pm mt mu mv pn mx my pg po nb nc ph pp nf ng pi pq nj nk nl pj pk pl bj\" data-selectable-paragraph=\"\">Tokenization<\/li>\n<\/ul>\n<p id=\"8501\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">I will briefly expand on a few.<\/p>\n<h2 id=\"eb3f\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Text Standardization<\/h2>\n<p id=\"f261\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Text standardization, as used here, is the process of expanding contractions into their complete words. Contractions are words or combinations of words that are shortened by dropping a letter or letters and replacing them with an apostrophe.<\/p>\n<p id=\"fc8e\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Although contractions simplify our text and speech and we can easily understand them, machines handle text data better when it uses full words. For example, \u201ccan\u2019t\u201d will be standardized to \u201ccannot.\u201d<\/p>\n<h2 id=\"4b7a\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Lemmatization<\/h2>\n<p id=\"ae9c\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">In linguistics, lemmatization means grouping together the different inflected forms of a word by reducing each to its base form, or lemma. 
For example, the words \u201ctried,\u201d \u201ctries,\u201d and \u201ctrying\u201d will all be converted to \u201ctry.\u201d<\/p>\n<h2 id=\"916b\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Stemming<\/h2>\n<p id=\"de77\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Stemming is similar to lemmatization, but it cannot apply context: it simply removes the last few characters from a word to find the \u2018stem\u2019. For example, a simple stemmer that strips the \u201cing\u201d suffix would reduce \u201ccaring\u201d to \u201ccar.\u201d Lemmatization, however, applies context to the word \u201ccaring\u201d and brings it down to \u201ccare.\u201d<\/p>\n<p id=\"a25b\" class=\"pw-post-body-paragraph mp mq fo be b mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl fh bj\" data-selectable-paragraph=\"\">Lemmatization is more computationally expensive than stemming, as it needs to scan through look-up tables, etc.<\/p>\n<h2 id=\"97a9\" class=\"op nn fo be no oq or os ns ot ou ov nw mz ow ox oy nd oz pa pb nh pc pd pe pf bj\" data-selectable-paragraph=\"\">Tokenization<\/h2>\n<p id=\"d463\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">Tokenization is the process of splitting paragraphs and sentences into smaller units, that is, splitting a string of text into a list of tokens. 
For example, the sentence \u201cthis article is about text data\u201d will be split into the individual tokens \u201cthis,\u201d \u201carticle,\u201d \u201cis,\u201d \u201cabout,\u201d \u201ctext,\u201d and \u201cdata.\u201d<\/p>\n<h1 id=\"cabf\" class=\"nm nn fo be no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi oj bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"f9d6\" class=\"pw-post-body-paragraph mp mq fo be b mr ok mt mu mv ol mx my mz om nb nc nd on nf ng nh oo nj nk nl fh bj\" data-selectable-paragraph=\"\">These are some of the most common challenges faced in NLP, and each has a practical remedy. The main problem with a lot of models and the output they produce comes down to the data they are given. If you focus on improving the quality of your data using a <a class=\"af mo\" href=\"https:\/\/heartbeat.comet.ml\/towards-data-centric-ai-7a291ef2d508\" target=\"_blank\" rel=\"noopener ugc nofollow\">Data-Centric AI mindset<\/a>, you will start to see the accuracy of your model\u2019s output increase.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Joshua Hoehne via Unsplash Humans produce so much text data that we do not even realize the value it holds for businesses and society today. 
We don\u2019t realize its importance because it\u2019s part of our day-to-day lives and easy to understand, but if you input this same text data into a computer, it\u2019s a big [&hellip;]<\/p>\n","protected":false},"author":59,"featured_media":7006,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[161],"class_list":["post-6913","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The biggest challenges in NLP and how to overcome them - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The biggest challenges in NLP and how to overcome them\" \/>\n<meta property=\"og:description\" content=\"Joshua Hoehne via Unsplash Humans produce so much text data that we do not even realize the value it holds for businesses and society today. 
We don\u2019t realize its importance because it\u2019s part of our day-to-day lives and easy to understand, but if you input this same text data into a computer, it\u2019s a big [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-24T23:49:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:15:07+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"304\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Nisha Arya Ahmed\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nisha Arya Ahmed\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"The biggest challenges in NLP and how to overcome them - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/","og_locale":"en_US","og_type":"article","og_title":"The biggest challenges in NLP and how to overcome them","og_description":"Joshua Hoehne via Unsplash Humans produce so much text data that we do not even realize the value it holds for businesses and society today. We don\u2019t realize its importance because it\u2019s part of our day-to-day lives and easy to understand, but if you input this same text data into a computer, it\u2019s a big [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-07-24T23:49:05+00:00","article_modified_time":"2025-04-24T17:15:07+00:00","og_image":[{"width":300,"height":304,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","type":"image\/png"}],"author":"Nisha Arya Ahmed","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Nisha Arya Ahmed","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/"},"author":{"name":"Nisha Arya Ahmed","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/0e05c37ace46014784df75d236c71907"},"headline":"The biggest challenges in NLP and how to overcome them","datePublished":"2023-07-24T23:49:05+00:00","dateModified":"2025-04-24T17:15:07+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/"},"wordCount":1237,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/","url":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/","name":"The biggest challenges in NLP and how to overcome them - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","datePublished":"2023-07-24T23:49:05+00:00","dateModified":"2025-04-24T17:15:07+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","width":300,"height":304,"caption":"Comet ML: The biggest challenges in NLP and how to overcome them"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/the-biggest-challenges-in-nlp-and-how-to-overcome-them\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"The biggest challenges in NLP and how to overcome them"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/0e05c37ace46014784df75d236c71907","name":"Nisha Arya Ahmed","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/e58bf351134edf9e1df492b22fc283cd","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/nisha-arya-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/nisha-arya-96x96.jpg","caption":"Nisha Arya 
Ahmed"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/nishaaryaahmedgmail-com\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/07\/Screen-Shot-2023-07-28-at-5.50.47-PM.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6913","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/59"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=6913"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6913\/revisions"}],"predecessor-version":[{"id":15598,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6913\/revisions\/15598"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/7006"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=6913"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=6913"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=6913"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=6913"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}