{"id":7346,"date":"2023-08-29T13:50:38","date_gmt":"2023-08-29T21:50:38","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7346"},"modified":"2025-04-24T17:14:27","modified_gmt":"2025-04-24T17:14:27","slug":"introduction-to-text-wrangling-techniques-for-natural-language-processing","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/","title":{"rendered":"Introduction to Text Wrangling Techniques for Natural Language Processing"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg\" alt=\"\" width=\"2000\" height=\"1333\"><\/figure><div class=\"mg bg\">\n<figure class=\"mh mi mj mk ml mg bg paragraph-image\"><picture><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"9f30\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">What is Text Wrangling?<\/h1>\n<p id=\"fe19\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Although it has many forms, text wrangling is basically the pre-processing work that\u2019s done to get raw text data ready for training. Simply put, it\u2019s the process of cleaning your data to make it readable by your program, and then formatting it as such.<\/p>\n<p id=\"9357\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Many of you may be wrangling text without knowing it yourself. 
In this tutorial, I will teach you how to clean up your text in Python. I will show you how to perform the most common forms of text wrangling: sentence splitting, tokenization, stemming, lemmatization, and stop word removal.<\/p>\n<h1 id=\"34aa\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Prerequisites<\/strong><\/h1>\n<p id=\"7365\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Obviously, you\u2019ll need a little bit of Python know-how in order to run the code I\u2019ll show below. I\u2019ll be using a Google Colab notebook to host all my code. I\u2019ll share the link at the end so you can see how your code compares. To create a new notebook, click <a class=\"af ok\" href=\"https:\/\/colab.research.google.com\/notebooks\/welcome.ipynb#recent=true\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n<p id=\"0868\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">If you\u2019ve worked with natural language code before in Python, you\u2019re probably familiar with the Python package <code class=\"cw ol om on oo b\">nltk<\/code>, or the <a class=\"af ok\" href=\"https:\/\/heartbeat.comet.ml\/nlp-chronicles-intro-to-nlp-with-nltk-b2c369fbb9a7\" target=\"_blank\" rel=\"noopener ugc nofollow\">Natural Language Toolkit.<\/a> It\u2019s an amazing library with many functions for building Python programs to work with human language data. 
Let\u2019s begin by typing the following code:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"7136\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">import nltk\nnltk.download('punkt')<\/span><\/pre>\n<p id=\"880b\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">In this cell, we are importing the library and asking our notebook to download <code class=\"cw ol om on oo b\">punkt<\/code>. This is a tokenizer that divides a text into a list of sentences. This brings us to our first example of text wrangling: sentence splitting.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"5600\" class=\"mo mp fo be mq mr pp go mt mu pq gr mw mx pr mz na nb ps nd ne nf pt nh ni nj bj\" data-selectable-paragraph=\"\">Sentence Splitting<\/h1>\n<p id=\"4623\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">If you\u2019ve ever been given a large paragraph of text, you know that the best way to analyze it is by splitting the text into sentences. In real-life conversations, we also process information at the sentence level by analyzing how words are joined together. However, trying to split paragraphs of text into sentences can be difficult in raw code. Luckily, with <code class=\"cw ol om on oo b\">nltk<\/code>, we can do this quite easily. Type the following code:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"c541\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">sampleString = \"Let's make this our sample paragraph. It will split at the end of a sentence marker, like a period. It even knows that the period in Mr. Jones is not the end. 
Try it out!\"<\/span><span id=\"5695\" class=\"ot mp fo oo b ia pu ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">from<\/strong> nltk.tokenize <strong class=\"oo fp\">import<\/strong> sent_tokenize\ntokenized_sent = sent_tokenize(sampleString)\nprint(tokenized_sent)<\/span><\/pre>\n<p id=\"7d15\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">This code might be self-explanatory, but it\u2019s okay if this is your first time. Here is what we typed line by line:<\/p>\n<ol class=\"\">\n<li id=\"838a\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">First, we define a variable called <code class=\"cw ol om on oo b\">sampleString<\/code> that contains a few sentences. You can change the text in this variable to whatever you wish.<\/li>\n<li id=\"728c\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Next, we import <code class=\"cw ol om on oo b\">sent_tokenize<\/code>, which is the sentence tokenization function from the <code class=\"cw ol om on oo b\">nltk<\/code> library.<\/li>\n<li id=\"5b8d\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">We call the <code class=\"cw ol om on oo b\">sent_tokenize<\/code> function on our <code class=\"cw ol om on oo b\">sampleString<\/code>. This runs the tokenization function over our string and saves the results to a new variable called <code class=\"cw ol om on oo b\">tokenized_sent<\/code>.<\/li>\n<li id=\"00cb\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Finally, we print <code class=\"cw ol om on oo b\">tokenized_sent<\/code> to the console. 
You should receive an output that looks like this:<\/li>\n<\/ol>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"11d5\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">[\"Let's make this our sample paragraph.\", 'It will split at the end of a sentence marker, like a period.', 'It even knows that the period in Mr. Jones is not the end.', 'Try it out!']<\/span><\/pre>\n<p id=\"1e8e\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">As you can see, we were able to split up the paragraph into exact sentences. What\u2019s even more fascinating is that the code knows the difference between a period used to end a sentence and a period used in the name <strong class=\"be qg\">Mr. Jones<\/strong>.<\/p>\n<h1 id=\"3719\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">Tokenization<\/h1>\n<p id=\"daf8\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">By now, you\u2019re probably wondering what tokenization is. Well, a <strong class=\"be qg\">token<\/strong> is the smallest text unit a machine can process. Therefore, every chunk of text needs to be tokenized before you can run natural language programs on it. Depending on the task, it can make sense for that unit to be a sentence, a word, or even a single letter. In the previous section, we tokenized the paragraph into sentences.<\/p>\n<p id=\"7bbd\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">For a language like English, it can be easy to tokenize text, especially with <code class=\"cw ol om on oo b\">nltk<\/code> to guide us. 
Here\u2019s how we can tokenize text using just a few lines of code:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"8edc\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">msg<\/strong> = \"Hey everyone! The party starts in 10mins. Be there ASAP!\"\nprint(msg.split())<\/span><\/pre>\n<p id=\"f279\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Like before, we define a variable called <code class=\"cw ol om on oo b\">msg<\/code> (short for message). Then, we call the string method <code class=\"cw ol om on oo b\">split<\/code> on this chunk of text and print the results to the console. You should receive an output like this:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"0004\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">['Hey', 'everyone!', 'The', 'party', 'starts', 'in', '10mins.', 'Be', 'there', 'ASAP!']<\/span><\/pre>\n<p id=\"8aa8\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The <code class=\"cw ol om on oo b\">split()<\/code> method is one of the simplest tokenizers. It uses whitespace as the delimiter (the limit or boundary) and returns the chunks of text between it, which is why punctuation stays attached to the words. However, we can take this to the next level with more functions. 
Type the following:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"0aa8\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">from<\/strong> nltk.tokenize <strong class=\"oo fp\">import<\/strong> word_tokenize, regexp_tokenize\nword_tokenize(msg)<\/span><\/pre>\n<ol class=\"\">\n<li id=\"57a7\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">We import two functions from the <code class=\"cw ol om on oo b\">nltk.tokenize<\/code> module.<\/li>\n<li id=\"1188\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">First, let\u2019s look at the <code class=\"cw ol om on oo b\">word_tokenize()<\/code> function. This is very similar to the <code class=\"cw ol om on oo b\">split()<\/code> method with one key difference. Instead of splitting only on whitespace, it also separates punctuation, treating exclamation points and periods as their own tokens.<\/li>\n<\/ol>\n<p id=\"a48c\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">This is what your output should look like:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"08cf\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">['Hey',\n 'everyone',\n '!',\n 'The',\n 'party',\n 'starts',\n 'in',\n '10mins',\n '.',\n 'Be',\n 'there',\n 'ASAP',\n '!']<\/span><\/pre>\n<p id=\"8c69\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Finally, let\u2019s take a look at the <code class=\"cw ol om on oo 
b\">regexp_tokenize<\/code> function. This is an even more advanced tokenizer that can be customized to fit your needs. Let\u2019s take a look at an example:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"25fd\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">regexp_tokenize(msg, pattern=r\"\\w+\")<\/span><\/pre>\n<p id=\"6b63\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">You might notice that we have an extra parameter in this function called <code class=\"cw ol om on oo b\">pattern<\/code>. This is a regular expression that lets developers choose how they want to tokenize the text. <code class=\"cw ol om on oo b\">\\w+<\/code> matches runs of letters, digits, and underscores, so words and numbers become tokens while symbols like punctuation are ignored. This is why our output looks like this:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"c9b8\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">['Hey',\n 'everyone',\n 'The',\n 'party',\n 'starts',\n 'in',\n '10mins',\n 'Be',\n 'there',\n 'ASAP']<\/span><\/pre>\n<p id=\"1d9d\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Now, let\u2019s try a different pattern:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"7ce6\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">regexp_tokenize(msg, pattern=r\"\\d+\")<\/span><\/pre>\n<p id=\"3c38\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Just like before, we have the same function, but with a different pattern: <code class=\"cw ol om on oo b\">\\d+<\/code>. 
This pattern matches only runs of digits. That\u2019s why our output only contains the number 10.<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"e93b\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">['10']<\/span><\/pre>\n<p id=\"3adf\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">These are the most common tokenizers you\u2019ll need to clean up your text. Next, let\u2019s move over to stemming, another crucial step in text wrangling.<\/p>\n<h1 id=\"a16b\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">Stemming<\/h1>\n<p id=\"f9cd\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Stemming is exactly what it sounds like: cutting down a token to its root stem. For instance, take the word \u201crunning\u201d. It can be broken down to its root: \u201crun\u201d. However, \u201crun\u201d itself has many variations: runs, ran, etc. With stemming, we can group the regular variations of the word under a single root, although irregular forms like \u201cran\u201d are usually left untouched by rule-based stemmers. Let\u2019s look at the code to do this:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"3905\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">from<\/strong> nltk.stem <strong class=\"oo fp\">import<\/strong> PorterStemmer\nporter = PorterStemmer()\nporter.stem(\"running\")<\/span><\/pre>\n<ol class=\"\">\n<li id=\"ecbb\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">First, we import the <code class=\"cw ol om on oo b\">PorterStemmer<\/code> from the toolkit. 
There are many algorithms to stem words, and <code class=\"cw ol om on oo b\">PorterStemmer<\/code> uses just one of the many. However, I\u2019ve found it to be the most precise since it uses a lot of rules.<\/li>\n<li id=\"7fca\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Next, we define a variable called <code class=\"cw ol om on oo b\">porter<\/code> and set it to a new <code class=\"cw ol om on oo b\">PorterStemmer()<\/code> instance.<\/li>\n<li id=\"7482\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Finally, we ask the stemmer to stem the word \u201crunning\u201d. You should receive the following output:<\/li>\n<\/ol>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"1c21\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">'run'<\/span><\/pre>\n<p id=\"dd5b\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Now, you could skip to the next section, but I\u2019d like to take a moment and show you two more stemmers that use different algorithms. The first is <a class=\"af ok\" href=\"https:\/\/www.nltk.org\/_modules\/nltk\/stem\/lancaster.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Lancaster stemming<\/a>. It\u2019s very easy to implement and the results are close to those of <a class=\"af ok\" href=\"https:\/\/tartarus.org\/martin\/PorterStemmer\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Porter stemming<\/a>. 
Here\u2019s a look:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"0ce1\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">from<\/strong> nltk.stem <strong class=\"oo fp\">import<\/strong> LancasterStemmer\nlancaster = LancasterStemmer()\nlancaster.stem(\"eating\")<\/span><\/pre>\n<p id=\"65d7\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">You should recognize this code by now. It\u2019s the same as the previous example, only this time we import <code class=\"cw ol om on oo b\">LancasterStemmer<\/code>. Running the stemmer on the word \u201ceating\u201d gives us an output of:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"50b7\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">'eat'<\/span><\/pre>\n<p id=\"123f\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Now, the last stemmer I want to show you is the <code class=\"cw ol om on oo b\"><a class=\"af ok\" href=\"https:\/\/snowballstem.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">SnowballStemmer<\/a><\/code>. What makes this stemmer unique is that it comes with rule sets for many languages and works well for English, German, French, Russian, and many others. Here\u2019s how you implement it. 
It\u2019s a little different from the previous two stemmers:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"4455\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\"><strong class=\"oo fp\">from<\/strong> nltk.stem.snowball <strong class=\"oo fp\">import<\/strong> SnowballStemmer\nsnowball = SnowballStemmer(\"english\")\nsnowball.stem(\"having\")<\/span><\/pre>\n<ol class=\"\">\n<li id=\"c412\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Instead of <code class=\"cw ol om on oo b\">nltk.stem<\/code>, we import <code class=\"cw ol om on oo b\">SnowballStemmer<\/code> from its own submodule, <code class=\"cw ol om on oo b\">nltk.stem.snowball<\/code>.<\/li>\n<li id=\"ae32\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">We define <code class=\"cw ol om on oo b\">snowball<\/code> as our stemmer. However, when we do so, we specify which language\u2019s rules the stemmer should use.<\/li>\n<li id=\"17b2\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Finally, we stem the word using our newly created stemmer.<\/li>\n<\/ol>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"6c91\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">'have'<\/span><\/pre>\n<p id=\"cf0e\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Stemming is great for its simplicity in NLP-related tasks. However, if we want to get more complex, stemming won\u2019t be the best technique to use. 
Instead, this is where <strong class=\"be qg\">lemmatization<\/strong> shines.<\/p>\n<h1 id=\"2544\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">Lemmatization<\/h1>\n<p id=\"232a\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Lemmatization is much more advanced than stemming because rather than just following rules, this process also takes into account context and part of speech to determine the <em class=\"qh\">lemma<\/em>, or the root form of the word. Here\u2019s a perfect example to show the difference between lemmatization and stemming:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"dfac\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">nltk.download('wordnet')\n<strong class=\"oo fp\">from<\/strong> nltk.stem <strong class=\"oo fp\">import<\/strong> WordNetLemmatizer\nlem = WordNetLemmatizer()\nprint(lem.lemmatize(\"ate\", pos=\"v\"))\nprint(porter.stem(\"ate\"))<\/span><\/pre>\n<ol class=\"\">\n<li id=\"5758\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">First we download <code class=\"cw ol om on oo b\">wordnet<\/code> from the toolkit. 
<code class=\"cw ol om on oo b\">Wordnet<\/code> is a massive semantic dictionary that\u2019s used to look up the <em class=\"qh\">lemmas <\/em>of words.<\/li>\n<li id=\"bff3\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Next, we import the <code class=\"cw ol om on oo b\">WordNetLemmatizer<\/code> from <code class=\"cw ol om on oo b\">nltk.stem<\/code>.<\/li>\n<li id=\"1b42\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">We create a lemmatizer and store it in our variable <code class=\"cw ol om on oo b\">lem<\/code>.<\/li>\n<li id=\"8cb4\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Finally, we lemmatize the word \u201cate\u201d and ask for the results to be printed to the console. We pass <code class=\"cw ol om on oo b\">pos=\"v\"<\/code> to tell the lemmatizer to treat the word as a verb; by default it treats every word as a noun, and \u201cate\u201d would come back unchanged.<\/li>\n<li id=\"bc07\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">For comparison, we use our previously created <code class=\"cw ol om on oo b\">PorterStemmer<\/code> to stem the same word and print the result to the console.<\/li>\n<\/ol>\n<p id=\"3765\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">WordNet is constantly updating, but at the time of writing, this is what my console displayed:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"363c\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">eat\nate<\/span><\/pre>\n<p id=\"7b39\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">So we can see that through lemmatization, we can even handle the tenses of the word and present the simplest form of the word in 
the present tense, all with a few lines of code. Lemmatization is one of the many powerful techniques in text wrangling.<\/p>\n<h1 id=\"db5b\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">Stop Word Removal<\/h1>\n<p id=\"fb23\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Finally, we come to the last section of this tutorial: stop word removal. Stop words are commonly used words that are usually ignored because of their many occurrences. Most of these words are articles and prepositions, such as \u201cthe\u201d, \u201ca\u201d, \u201cin\u201d, etc.<\/p>\n<p id=\"83dd\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">These words can either end up taking too much space or eating up too much time. Luckily, <code class=\"cw ol om on oo b\">nltk<\/code> has a list of stop words in 16 different languages. We can use this list to parse paragraphs of text and remove the stop words from them. Here\u2019s how to do it:<\/p>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"8a5d\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">nltk.download('stopwords')\n<strong class=\"oo fp\">from<\/strong> nltk.corpus <strong class=\"oo fp\">import<\/strong> stopwords\nstop_words = stopwords.words('english')\nparagraph = \"This is a long paragraph of text. Sometimes important words like Apple and Machine Learning show up. 
Other words that are not important get removed.\"\npostPara = [word for word in paragraph.split() if word not in stop_words]\nprint(postPara)<\/span><\/pre>\n<p id=\"1f23\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">This is perhaps the most complex code in this tutorial, so I\u2019ll run through it piece-by-piece:<\/p>\n<ol class=\"\">\n<li id=\"bf11\" class=\"nk nl fo be b gm of nn no gp og nq nr pv oh nu nv pw oi ny nz px oj oc od oe py pz qa bj\" data-selectable-paragraph=\"\">First, we download the <code class=\"cw ol om on oo b\">stopwords<\/code> from the toolkit.<\/li>\n<li id=\"b35b\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Next, we import <code class=\"cw ol om on oo b\">stopwords<\/code> from <code class=\"cw ol om on oo b\">nltk.corpus<\/code>. A <strong class=\"be qg\">corpus<\/strong> is a large dataset of texts.<\/li>\n<li id=\"09fc\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Next, we define a variable <code class=\"cw ol om on oo b\">stop_words<\/code> and set this to contain all the English stop words. (Naming it <code class=\"cw ol om on oo b\">list<\/code> would also run, but it would shadow Python\u2019s built-in type of the same name.)<\/li>\n<li id=\"60fc\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">As with any text wrangling technique, we need a sample text, so we type up a short paragraph and define it as the variable <code class=\"cw ol om on oo b\">paragraph<\/code>.<\/li>\n<li id=\"b983\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">We create a new variable called <code class=\"cw ol om on oo b\">postPara<\/code>, which is a list of all the words in <code class=\"cw ol om on oo b\">paragraph<\/code> split up and not including the words in <code class=\"cw ol om on oo 
b\">stop_words<\/code>.<\/li>\n<li id=\"1072\" class=\"nk nl fo be b gm qb nn no gp qc nq nr pv qd nu nv pw qe ny nz px qf oc od oe py pz qa bj\" data-selectable-paragraph=\"\">Finally, we print <code class=\"cw ol om on oo b\">postPara<\/code> to our console:<\/li>\n<\/ol>\n<pre class=\"mh mi mj mk ml op oo oq or ax os bj\"><span id=\"8387\" class=\"ot mp fo oo b ia ou ov l iq ow\" data-selectable-paragraph=\"\">['This', 'long', 'paragraph', 'text.', 'Sometimes', 'important', 'words', 'like', 'Apple', 'Machine', 'Learning', 'show', 'up.', 'Other', 'words', 'important', 'get', 'removed.']<\/span><\/pre>\n<p id=\"9897\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">As you can see, our text is split up into different words, but the stop words are removed, showing you only the words deemed important. Most articles and prepositions are gone! Note that \u201cThis\u201d survives because the stop word list is all lowercase and the comparison is case-sensitive; lowercasing each word before the check would remove it too.<\/p>\n<h1 id=\"e49d\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"ddcb\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">As you can see, text wrangling can be essential in making sure you have the best data to work with. With NLTK, it\u2019s easier than ever to run complex algorithms on your text using only a few lines of code. 
You can split up your text however you want, weed out the unnecessary parts, and even reduce words to the most useful form for your computations.<\/p>\n<p id=\"07e2\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">We\u2019ve barely scratched the surface in terms of what can be done with NLTK. I\u2019d suggest taking a look at the <a class=\"af ok\" href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">official NLTK website<\/a>. Feel free to leave me a message in the comments if you\u2019ve got a question or need some help! For reference, here is the link to the <a class=\"af ok\" href=\"https:\/\/colab.research.google.com\/drive\/1WzrQEVr4bjLaNlzE6nXMRFNHAeYT0dFA\" target=\"_blank\" rel=\"noopener ugc nofollow\">complete Colab notebook<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>What is Text Wrangling? Although it has many forms, text wrangling is basically the pre-processing work that\u2019s done to get raw text data ready for training. Simply put, it\u2019s the process of cleaning your data to make it readable by your program, and then formatting it as such. 
Many of you may be wrangling text [&hellip;]<\/p>\n","protected":false},"author":83,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[180],"class_list":["post-7346","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Introduction to Text Wrangling Techniques for Natural Language Processing - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introduction to Text Wrangling Techniques for Natural Language Processing\" \/>\n<meta property=\"og:description\" content=\"What is Text Wrangling? Although is has many forms, text wrangling is basically the pre-processing work that\u2019s done to prepare raw text data ready for training. Simply put, it\u2019s the process of cleaning your data to make it readable by your program, and then formatting it as such. 
Many of you may be wrangling text [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-29T21:50:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg\" \/>\n<meta name=\"author\" content=\"Sai Kambampati\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sai Kambampati\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Introduction to Text Wrangling Techniques for Natural Language Processing - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/","og_locale":"en_US","og_type":"article","og_title":"Introduction to Text Wrangling Techniques for Natural Language Processing","og_description":"What is Text Wrangling? Although is has many forms, text wrangling is basically the pre-processing work that\u2019s done to prepare raw text data ready for training. 
Simply put, it\u2019s the process of cleaning your data to make it readable by your program, and then formatting it as such. Many of you may be wrangling text [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-29T21:50:38+00:00","article_modified_time":"2025-04-24T17:14:27+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg","type":"","width":"","height":""}],"author":"Sai Kambampati","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Sai Kambampati","Est. reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/"},"author":{"name":"Sai Kambampati","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/687f104c400b25f91a596170b5f4af9d"},"headline":"Introduction to Text Wrangling Techniques for Natural Language Processing","datePublished":"2023-08-29T21:50:38+00:00","dateModified":"2025-04-24T17:14:27+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/"},"wordCount":1722,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg","articleSection":["Machine 
Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/","url":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/","name":"Introduction to Text Wrangling Techniques for Natural Language Processing - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg","datePublished":"2023-08-29T21:50:38+00:00","dateModified":"2025-04-24T17:14:27+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*GyAwrqkCetMRkGWGibdgOA.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-text-wrangling-techniques-for-natural-language-processing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Introduction to Text Wrangling Techniques for Natural Language 
Processing"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/687f104c400b25f91a596170b5f4af9d","name":"Sai Kambampati","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/e37628a7389669a684002af46e2c007b","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/9nO1Y5uI_400x400-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/9nO1Y5uI_400x400-96x96.jpg","caption":"Sai 
Kambampati"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/saikambampatigmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7346","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/83"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7346"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7346\/revisions"}],"predecessor-version":[{"id":15563,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7346\/revisions\/15563"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7346"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7346"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7346"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}