{"id":8092,"date":"2023-11-02T10:22:21","date_gmt":"2023-11-02T18:22:21","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8092"},"modified":"2025-04-24T17:04:43","modified_gmt":"2025-04-24T17:04:43","slug":"natural-language-processing-nlp-concepts-with-nltk","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\/","title":{"rendered":"Natural Language Processing (NLP) Concepts With NLTK"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\">\n\n\n\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg\" alt=\"\" width=\"700\" height=\"467\"><\/figure><div class=\"mi mj mk\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Photo by <a class=\"af he\" href=\"https:\/\/unsplash.com\/@hostreviews?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Stephen Phillips \u2014 Hostreviews.co.uk<\/a> on <a class=\"af he\" href=\"https:\/\/unsplash.com\/photos\/3Mhgvrk4tjM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"832d\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">At its core, the discipline of <strong class=\"be nx\">Natural Language Processing (NLP)<\/strong> tries to make the human language 
\u201cpalatable\u201d to computers. Much of the data we analyze as data scientists consists of a corpus of human-readable text. Before we can feed this data into a computer for analysis, we must preprocess it.<\/p>\n<p id=\"0652\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">In this article, let\u2019s dive deep into the <strong class=\"be nx\">Natural Language Toolkit (NLTK) <\/strong>data processing concepts for NLP data. Before building our model, we will also see how we can visualize this data with <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nx\">Kangas<\/strong><\/a> as part of exploratory data analysis (EDA).<\/p>\n<p id=\"ca00\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Ultimately, we will create an <strong class=\"be nx\">Email Spam Classifier with TensorFlow and the Keras API<\/strong> and track it with <a class=\"af he\" href=\"https:\/\/www.comet.com\/docs\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nx\">Comet<\/strong><\/a>. In the meantime, if you need to become more familiar with Comet, check out their incredible platform <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n<h1 id=\"51a3\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Getting started with the NLTK library<\/h1>\n<p id=\"11ad\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">NLTK offers excellent tools for developing Python programs that leverage natural language data. 
It has several text-processing libraries for tokenization, stemming, part-of-speech tagging, semantic reasoning, and many more tasks.<\/p>\n<p id=\"6b27\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To start using NLTK, you need to install it. To do this, just run <code class=\"cw oz pa pb pc b\">pip install nltk<\/code> in your command line or <code class=\"cw oz pa pb pc b\">!pip install nltk<\/code> if using an interactive notebook.<\/p>\n<h1 id=\"4686\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Tokenization<\/h1>\n<p id=\"a275\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Tokenization, also referred to as <a class=\"af he\" href=\"https:\/\/en.wikipedia.org\/wiki\/Lexical_analysis\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nx\">lexical analysis<\/strong><\/a>, is a common practice in NLP. It involves splitting text data into lexical tokens, which allow us to work with smaller pieces of text (phrases, sentences, words, or paragraphs) that are still relevant, even once separated from the rest of the text. A <strong class=\"be nx\">tokenizer <\/strong>is responsible for this process.<\/p>\n<p id=\"ff1c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">There are two ways we can tokenize our text data with NLTK. 
We can either:<\/p>\n<ul class=\"\">\n<li id=\"8764\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Tokenize by word.<\/strong><\/li>\n<li id=\"531b\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Tokenize by sentence.<\/strong><\/li>\n<\/ul>\n<h2 id=\"cd45\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Tokenize by word<\/strong><\/h2>\n<p id=\"1a1e\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Here, we break down the text into individual words. Among other things, this can help identify frequently occurring words for a given analysis.<\/p>\n<p id=\"650e\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">NLTK provides the <code class=\"cw oz pa pb pc b\">word_tokenize<\/code> function, which splits text into words. 
Let\u2019s see how:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"46c1\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">import<\/span> nltk\n<span class=\"hljs-keyword\">from<\/span> nltk.tokenize <span class=\"hljs-keyword\">import<\/span> word_tokenize\n\n<span class=\"hljs-comment\"># Download the tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)<\/span>\nnltk.download(<span class=\"hljs-string\">'punkt'<\/span>)\n\n<span class=\"hljs-comment\"># Text to tokenize<\/span>\ntext_example = <span class=\"hljs-string\">\"\"\"We can do so much with Natural Language Processing ranging\nfrom speech recognition, recommendation systems, spam classifiers, and so\nmany more applications.\"\"\"<\/span>\n\n<span class=\"hljs-comment\"># Tokenizing<\/span>\nword_tokenized_sent = word_tokenize(text_example.casefold())\n<span class=\"hljs-built_in\">print<\/span>(word_tokenized_sent)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n<span class=\"hljs-string\">'''\n\n['we', 'can', 'do', 'so', 'much', 'with', 'natural', 'language',\n'processing', 'ranging', 'from', 'speech', 'recognition', ',',\n'recommendation', 'systems', ',', 'spam', 'classifiers', ',',\n'and', 'so', 'many', 'more', 'applications', '.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"8298\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Notice <\/strong>that we called <code class=\"cw oz pa pb pc b\">casefold()<\/code> on <code class=\"cw oz pa pb pc b\">text_example<\/code>. This method returns a case-insensitive version of the text, so uppercase and lowercase letters are treated alike. Alternatively, we could use <code class=\"cw oz pa pb pc b\">text_example.lower()<\/code> to convert the text into lowercase.<\/p>\n<p id=\"9fab\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We must convert the text data into lowercase so all words are in the same case. 
If we skip this step, the model could interpret words like stock, Stock, and STOCK as different tokens.<\/p>\n<h2 id=\"0cbe\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Tokenize by sentence<\/strong><\/h2>\n<p id=\"3f1d\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">When we have text with multiple sentences, we can break them into a list of individual sentences.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"e56e\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> nltk.tokenize <span class=\"hljs-keyword\">import<\/span> sent_tokenize\n\ntext_example2 = <span class=\"hljs-string\">\"\"\"We can do so much with Natural Language Processing ranging from\nspeech recognition, recommendation systems, spam classifiers, and so many more applications.\nThese applications also leverage the power of Machine Learning and Deep Learning.\n\"\"\"<\/span>\n\n<span class=\"hljs-comment\"># Split the text into a list of sentences<\/span>\nsentence_tokenized = sent_tokenize(text_example2.casefold())\n<span class=\"hljs-built_in\">print<\/span>(sentence_tokenized)\n\n<span class=\"hljs-comment\"># RESULTS <\/span>\n<span class=\"hljs-string\">'''\n\n['we can do so much with natural language processing ranging from speech recognition, recommendation systems, spam classifiers, and so many more applications.',\n 'these applications also leverage the power of machine learning and deep learning.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"f0ce\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">With sentence tokenizing, we get a list of two sentences from the text.<\/p>\n<h1 id=\"f73d\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" 
data-selectable-paragraph=\"\">Removing stop words<\/h1>\n<p id=\"d8f2\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">In most cases, this operation occurs after we tokenize the data. Stop words are words that only add to the fluidity of text or speech but contain no meaningful information relevant to our task or the text itself.<\/p>\n<p id=\"1ab3\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">First, we need to <strong class=\"be nx\">download and import the stop words<\/strong> that are provided by NLTK. Alternatively, we can also define our own stop words, but for this example the default list is more than sufficient.<\/p>\n<p id=\"81f6\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Let\u2019s use our word-tokenized sentence from above:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"57dc\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">nltk.download(<span class=\"hljs-string\">'stopwords'<\/span>) <span class=\"hljs-comment\"># download the stopwords from nltk<\/span>\n<span class=\"hljs-keyword\">from<\/span> nltk.corpus <span class=\"hljs-keyword\">import<\/span> stopwords\n\n<span class=\"hljs-comment\"># get stopwords in english<\/span>\neng_stopwords = stopwords.words(<span class=\"hljs-string\">'english'<\/span>)\n\n<span class=\"hljs-comment\"># filter stopwords with list comprehension<\/span>\ntokens_no_stopwords = [word <span class=\"hljs-keyword\">for<\/span> word <span class=\"hljs-keyword\">in<\/span> word_tokenized_sent <span class=\"hljs-keyword\">if<\/span> word <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> eng_stopwords]\n<span 
class=\"hljs-built_in\">print<\/span>(tokens_no_stopwords)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n<span class=\"hljs-string\">'''\n\n['much', 'natural', 'language', 'processing', 'ranging', 'speech', 'recognition',\n',', 'recommendation', 'systems', ',', 'spam', 'classifiers', ',', 'many',\n'applications', '.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"36f6\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Notice the disappearance of stop words like <em class=\"qn\">\u2018do\u2019, \u2018so\u2019, \u2018from\u2019<\/em> etc.<\/p>\n<h1 id=\"bc64\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Stemming<\/h1>\n<p id=\"6a0e\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Stemming is where we reduce words to their base form or root form, like cutting down tree branches to their stem. Stemming normalizes text and makes processing easier for Information Retrieval (IR) or text mining tasks.<\/p>\n<p id=\"444e\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">A stemmer will apply <a class=\"af he\" href=\"https:\/\/vijinimallawaarachchi.com\/2017\/05\/09\/porter-stemming-algorithm\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">rules<\/a> that remove suffixes and prefixes from a word and reduce it to its root. This, in turn, reduces the time and space needed to process the text. 
There are various types of stemming algorithms:<\/p>\n<ul class=\"\">\n<li id=\"7247\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Porter stemmer:<\/strong> This algorithm is widely used for its speed, low error rate, and simplicity. It is based on the fact that suffixes in the English language are composed of smaller and simpler suffixes. It is limited to English words. <strong class=\"be nx\">Import as:<\/strong> <code class=\"cw oz pa pb pc b\">from nltk.stem.porter import PorterStemmer<\/code>.<\/li>\n<li id=\"3d38\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Snowball stemmer: <\/strong>This is a multilingual stemmer, so it supports languages other than English. It is more aggressive than the Porter stemmer. <strong class=\"be nx\">Import as<\/strong>: <code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/www.nltk.org\/api\/nltk.stem.snowball.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">from nltk.stem import SnowballStemmer<\/a><\/code>.<\/li>\n<li id=\"f88a\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Lancaster stemmer: <\/strong>This is more aggressive and dynamic than the other stemmers, and its output can be confusing, especially for short words. This stemmer is also less efficient. 
<strong class=\"be nx\">Import as<\/strong>: <code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/www.nltk.org\/api\/nltk.stem.lancaster.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">from nltk.stem.lancaster import LancasterStemmer<\/a><\/code>.<\/li>\n<\/ul>\n<p id=\"e502\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will use the Porter stemmer, as it is the most commonly preferred choice. The approach is similar for the other stemmers, so pick one based on the model you are building.<\/p>\n<p id=\"9c21\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will feed the stemmer the <code class=\"cw oz pa pb pc b\">tokens_no_stopwords<\/code> list that we built in the stop words section.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"fa24\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> nltk.stem.porter <span class=\"hljs-keyword\">import<\/span> PorterStemmer\n\n<span class=\"hljs-comment\"># Create a stemmer object<\/span>\nstemmer = PorterStemmer()\n\n<span class=\"hljs-comment\"># Stemming<\/span>\nstemmed_tokens = [stemmer.stem(token) <span class=\"hljs-keyword\">for<\/span> token <span class=\"hljs-keyword\">in<\/span> tokens_no_stopwords]\n\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"===Unstemmed tokens==== <span class=\"hljs-subst\">{tokens_no_stopwords}<\/span>\"<\/span>)\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"===Stemmed tokens==== <span class=\"hljs-subst\">{stemmed_tokens}<\/span>\"<\/span>)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n<span class=\"hljs-string\">'''\n\n===Unstemmed tokens====\n ['much', 'natural', 'language', 'processing', 'ranging', 'speech', 
'recognition',\n  ',', 'recommendation', 'systems', ',', 'spam', 'classifiers', ',', 'many', 'applications', '.']\n\n===Stemmed tokens====\n ['much', 'natur', 'languag', 'process', 'rang', 'speech', 'recognit', ',', 'recommend',\n  'system', ',', 'spam', 'classifi', ',', 'mani', 'applic', '.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"cb6e\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Notice how the words were cut. The resulting words are a bit confusing and can appear meaningless. This is one of the drawbacks of stemming: the readability of the text is compromised, and in some cases, the stemmer may not even produce the correct root form of the word.<\/p>\n<p id=\"b7eb\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">A stemmer can also produce inconsistent results. Let\u2019s look at an example:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"96b4\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">some_txt = <span class=\"hljs-string\">\"Discovering the wheel is among the best scientific discoveries ever made.\"<\/span>\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">stem_text<\/span>(<span class=\"hljs-params\">text<\/span>):\n    <span class=\"hljs-comment\"># Tokenize the text passed in, drop stop words, then stem what remains<\/span>\n    token_lst = word_tokenize(text.casefold())\n    token_no_stpw = [word <span class=\"hljs-keyword\">for<\/span> word <span class=\"hljs-keyword\">in<\/span> token_lst <span class=\"hljs-keyword\">if<\/span> word <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> eng_stopwords]\n    inco_stemmed_tokens = [stemmer.stem(token) <span class=\"hljs-keyword\">for<\/span> token <span class=\"hljs-keyword\">in<\/span> token_no_stpw]\n\n    <span class=\"hljs-keyword\">return<\/span> inco_stemmed_tokens\n\n<span class=\"hljs-built_in\">print<\/span>(stem_text(some_txt))\n\n<span 
class=\"hljs-comment\"># RESULT<\/span>\n<span class=\"hljs-string\">'''\n\n['discov', 'wheel', 'among', 'best', 'scientif', 'discoveri', 'ever', 'made', '.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"0e3c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We notice that the words <em class=\"qn\">\u2018Discovering\u2019 <\/em>and <em class=\"qn\">\u2018discoveries\u2019 <\/em>do not end up in the same base form (<em class=\"qn\">discover<\/em>). There are <strong class=\"be nx\">two main errors<\/strong> we could encounter while stemming:<\/p>\n<ul class=\"\">\n<li id=\"7695\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Over-stemming<\/strong> (False positives): Occurs when two unrelated words are reduced to the same base form when they should not be.<\/li>\n<li id=\"927d\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Under-stemming<\/strong> (False negatives): Occurs when two related words that should share a base form are reduced to different base forms, as with \u2018Discovering\u2019 and \u2018discoveries\u2019 above.<\/li>\n<\/ul>\n<p id=\"273e\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">In most cases, the above errors can be caused by the following:<\/p>\n<ul class=\"\">\n<li id=\"3a50\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">When the stemmer is too aggressive.<\/li>\n<li id=\"e3f5\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">When the stemmer does not consider the context of the text.<\/li>\n<li id=\"e98d\" class=\"nc nd fr be b gp pj nf ng gs 
pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">When the stemmer in use is not designed for the particular language.<\/li>\n<\/ul>\n<p id=\"dcde\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To get around these errors, we could use <strong class=\"be nx\">Lemmatizers<\/strong> instead of stemmers.<\/p>\n<p id=\"da1d\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Hang on for lemmatization!<\/p>\n<h1 id=\"ff47\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Part of speech (PoS) Tagging<\/h1>\n<p id=\"e3fb\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Part of speech tagging involves labeling the words in the text data according to their part of speech. 
These parts of speech include:<\/p>\n<ul class=\"\">\n<li id=\"f2b5\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Noun (labeled with <strong class=\"be nx\">NN <\/strong>tag)<\/li>\n<li id=\"dda0\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Pronoun (labeled with <strong class=\"be nx\">PRP <\/strong>tag)<\/li>\n<li id=\"96da\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Adjective (labeled with <strong class=\"be nx\">JJ <\/strong>tag)<\/li>\n<li id=\"8c22\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Verb (labeled with <strong class=\"be nx\">VB <\/strong>tag)<\/li>\n<li id=\"7f84\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Adverb (labeled with <strong class=\"be nx\">RB <\/strong>tag)<\/li>\n<li id=\"1ba0\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Preposition<\/li>\n<li id=\"3314\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Conjunction<\/li>\n<li id=\"5b8b\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Interjection<\/li>\n<\/ul>\n<p id=\"d321\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To see all available tags and their meanings run:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"b9ce\" class=\"qi nz fr pc b bf qj qk l ql qm\" 
data-selectable-paragraph=\"\">nltk.download(<span class=\"hljs-string\">'averaged_perceptron_tagger'<\/span>)\nnltk.download(<span class=\"hljs-string\">'tagsets'<\/span>)\nnltk.<span class=\"hljs-built_in\">help<\/span>.upenn_tagset()<\/span><\/pre>\n<p id=\"1883\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Let\u2019s use the <code class=\"cw oz pa pb pc b\">tokens_no_stopwords<\/code> list derived from removing the stop words section and tag parts of speech.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"6c50\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">pos_tagged_tokens = nltk.pos_tag(tokens_no_stopwords)\n<span class=\"hljs-built_in\">print<\/span>(pos_tagged_tokens)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n<span class=\"hljs-string\">'''\n\n[('much', 'JJ'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),\n ('ranging', 'VBG'), ('speech', 'NN'), ('recognition', 'NN'), (',', ','),\n ('recommendation', 'NN'), ('systems', 'NNS'), (',', ','), ('spam', 'JJ'),\n ('classifiers', 'NNS'), (',', ','), ('many', 'JJ'), ('applications', 'NNS'),\n ('.', '.')]\n\n'''<\/span> <\/span><\/pre>\n<p id=\"8fbc\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">An example of where PoS tagging would be applicable is when there is a need to know a product\u2019s qualities in a review. In this case, we could tag the tokenized data, extract all the adjectives, and evaluate the review\u2019s sentiment.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<blockquote class=\"qw\"><p id=\"45e3\" class=\"qx qy fr be qz ra rb rc rd re rf nw dw\" data-selectable-paragraph=\"\">Standardizing model management can be tricky but there is a solution. 
<a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/investing-in-ai-unlocking-profitable-machine-learning-with-experiment-management\/?utm_source=heartbeat&amp;utm_medium=referral&amp;utm_campaign=AMS_US_EN_AWA_heartbeat_CTA\" target=\"_blank\" rel=\"noopener ugc nofollow\">Learn more about experiment management from Comet\u2019s own Nikolas Laskaris<\/a>.<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<h1 id=\"4064\" class=\"ny nz fr be oa ob rg gr od oe rh gu og oh ri oj ok ol rj on oo op rk or os ot bj\" data-selectable-paragraph=\"\">Lemmatizing<\/h1>\n<p id=\"9cd1\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">This is similar to stemming, but with a significant difference. Lemmatizing also reduces words to their root form, but unlike stemming, it returns a complete English word (as would appear in a dictionary) that is meaningful on its own, rather than just a fragment of a word like \u2018marbl\u2019 (from marble).<\/p>\n<p id=\"d82f\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">A lemmatizer can also take into consideration the <strong class=\"be nx\">context of a word<\/strong> (for example, its part of speech). Unlike a stemmer, it maps the different inflected forms of a word to a single dictionary form.<\/p>\n<p id=\"6e69\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">When we lemmatize a word, we generate a <strong class=\"be nx\">lemma. 
A lemma is a word that represents a whole group of words.<\/strong><\/p>\n<p id=\"2daa\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Although text preprocessing can involve either stemming or lemmatizing, we prefer lemmatization over stemming in most cases.<\/p>\n<p id=\"2f36\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Note:<\/strong> To use the lemmatizer from NLTK, we need to download <code class=\"cw oz pa pb pc b\">wordnet<\/code> and Open Multilingual Wordnet (<code class=\"cw oz pa pb pc b\">omw-1.4<\/code>):<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"8566\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">nltk.download(<span class=\"hljs-string\">'omw-1.4'<\/span>)\nnltk.download(<span class=\"hljs-string\">'wordnet'<\/span>)<\/span><\/pre>\n<p id=\"d666\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Then import <code class=\"cw oz pa pb pc b\">WordNetLemmatizer<\/code> from <code class=\"cw oz pa pb pc b\">nltk.stem<\/code> (it lives in the <code class=\"cw oz pa pb pc b\">nltk.stem.wordnet<\/code> module).<\/p>\n<p id=\"aeb2\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will feed the lemmatizer the <code class=\"cw oz pa pb pc b\">tokens_no_stopwords<\/code> list that we built in the stop words section.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"2a18\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> nltk.stem <span class=\"hljs-keyword\">import<\/span> WordNetLemmatizer\n\n<span class=\"hljs-comment\"># Create lemmatizer object<\/span>\nlemmatizer = 
WordNetLemmatizer()\n\n<span class=\"hljs-comment\"># lemmatizing<\/span>\nlemmatized_tokens = [lemmatizer.lemmatize(token) <span class=\"hljs-keyword\">for<\/span> token <span class=\"hljs-keyword\">in<\/span> tokens_no_stopwords]\n\n<span class=\"hljs-comment\"># Compare with the stemmed tokens from the stemming section<\/span>\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"===Stemmed tokens==== <span class=\"hljs-subst\">{stemmed_tokens}<\/span>\"<\/span>)\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"===Lemmatized tokens==== <span class=\"hljs-subst\">{lemmatized_tokens}<\/span>\"<\/span>)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n<span class=\"hljs-string\">'''\n\n===Stemmed tokens====\n['much', 'natur', 'languag', 'process', 'rang', 'speech', 'recognit', ',', 'recommend',\n 'system', ',', 'spam', 'classifi', ',', 'mani', 'applic', '.']\n\n===Lemmatized tokens====\n['much', 'natural', 'language', 'processing', 'ranging', 'speech', 'recognition',\n ',', 'recommendation', 'system', ',', 'spam', 'classifier', ',', 'many', 'application', '.']\n\n'''<\/span><\/span><\/pre>\n<p id=\"85cb\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Observe the difference between the stemmed and lemmatized tokens. For a word like <em class=\"qn\">\u2018<\/em>language<em class=\"qn\">\u2019<\/em>, a stemmer produces <em class=\"qn\">\u2018languag\u2019<\/em> while a lemmatizer produces the expected root word <em class=\"qn\">\u2018language.\u2019<\/em><\/p>\n<p id=\"9c53\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">In most cases, we do not expect a word to be very different from its lemma.<\/p>\n<h1 id=\"b863\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Using Named Entity Recognition (NER)<\/h1>\n<p id=\"be2e\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">A text may contain noun phrases that refer to organizations, people, specific locations, etc. 
These phrases are called named entities, and we can use <strong class=\"be nx\">named entity recognition<\/strong> to determine what kind of named entities are in your text data.<\/p>\n<p id=\"ae01\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">The <a class=\"af he\" href=\"https:\/\/www.nltk.org\/book\/ch07.html#sec-ner\" target=\"_blank\" rel=\"noopener ugc nofollow\">NLTK book<\/a> lists the common types of named entities, which include:<\/p>\n<ul class=\"\">\n<li id=\"0178\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">ORGANIZATION<\/li>\n<li id=\"d904\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">PERSON<\/li>\n<li id=\"2e1e\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">LOCATION<\/li>\n<li id=\"ce7c\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">DATE<\/li>\n<li id=\"879d\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">TIME<\/li>\n<li id=\"d314\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">MONEY<\/li>\n<li id=\"df77\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">PERCENT<\/li>\n<li id=\"5329\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">FACILITY<\/li>\n<li id=\"a1cf\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" 
data-selectable-paragraph=\"\">GPE<\/li>\n<\/ul>\n<p id=\"ac0c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">NLTK provides a pre-trained classifier that can identify the named entities in our text data. We can access this classifier with the <code class=\"cw oz pa pb pc b\">nltk.ne_chunk()<\/code> function.<\/p>\n<p id=\"2a8c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">The function, when applied to a text, returns a tree.<\/p>\n<p id=\"1265\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">First, we need to download the following for it to work:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"b2f4\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">nltk.download(<span class=\"hljs-string\">'maxent_ne_chunker'<\/span>)\nnltk.download(<span class=\"hljs-string\">'words'<\/span>)<\/span><\/pre>\n<p id=\"aede\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Next, import <code class=\"cw oz pa pb pc b\">Tree<\/code> from <code class=\"cw oz pa pb pc b\">nltk.tree module<\/code> that will help visualize the named entities once we apply the <code class=\"cw oz pa pb pc b\">ne_chunk()<\/code> function.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"7965\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> nltk.tree <span class=\"hljs-keyword\">import<\/span> Tree<\/span><\/pre>\n<p id=\"9735\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Then 
let\u2019s identify the named entities:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"b818\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># Let's use the following text as an example<\/span>\nnamed_ent_txt = <span class=\"hljs-string\">\"I'm trying to track down a guy named Josh Doew who worked in mining in Ouray back in the 1960's.\"<\/span>\ntree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(named_ent_txt)))\n\ntree<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*BDlyqK7ZbJZ2EdSE52pUxw.png\" alt=\"\" width=\"700\" height=\"116\"><\/figure><div class=\"mi mj rl\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*BDlyqK7ZbJZ2EdSE52pUxw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"20bf\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Notice that the named entities are labeled with their types of entities. 
For instance, <strong class=\"be nx\">Ouray <\/strong>is tagged with <strong class=\"be nx\">GPE<\/strong>, which stands for <strong class=\"be nx\">geo-political entities.<\/strong><\/p>\n<p id=\"7223\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">The <code class=\"cw oz pa pb pc b\">ne_chunk()<\/code> function also has a <code class=\"cw oz pa pb pc b\">binary=True<\/code> argument where, if specified, we only get the named entities labeled with <strong class=\"be nx\">NE<\/strong> showing that they are named entities rather than what type they are.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"107d\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">named_ent_txt = <span class=\"hljs-string\">\"I'm trying to track down a guy named Josh Doew who worked in mining in Ouray back in the 1960's.\"<\/span>\ntree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(named_ent_txt)), binary=True)\n\ntree<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*PHm_y-NqfrS6T03SWoOATw.png\" alt=\"\" width=\"700\" height=\"117\"><\/figure><div class=\"mi mj rm\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*PHm_y-NqfrS6T03SWoOATw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*PHm_y-NqfrS6T03SWoOATw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*PHm_y-NqfrS6T03SWoOATw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*PHm_y-NqfrS6T03SWoOATw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*PHm_y-NqfrS6T03SWoOATw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*PHm_y-NqfrS6T03SWoOATw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*PHm_y-NqfrS6T03SWoOATw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*PHm_y-NqfrS6T03SWoOATw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"513c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" 
data-selectable-paragraph=\"\">Rather than displaying a tree of all the named entities, we can create a function that extracts them without any repeats and stores them in a list.<\/p>\n<p id=\"19ef\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To do so, we tokenize the text, tag each word with its part of speech (PoS), chunk the tagged tokens, and then collect each named entity once, even if it appears multiple times in the text.<\/p>\n<p id=\"2373\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">In the function, we loop over the <a class=\"af he\" href=\"https:\/\/www.nltk.org\/api\/nltk.chunk.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">chunk structure<\/a>, which is a <code class=\"cw oz pa pb pc b\">Tree<\/code> whose children are either tagged tokens or chunks (subtrees of tokens). 
In our case, these chunks are:<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*AvUbabETBjBZ19XtC2nAXQ.png\" alt=\"\" width=\"700\" height=\"118\"><\/figure><div class=\"mi mj rn\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*AvUbabETBjBZ19XtC2nAXQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*AvUbabETBjBZ19XtC2nAXQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*AvUbabETBjBZ19XtC2nAXQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*AvUbabETBjBZ19XtC2nAXQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*AvUbabETBjBZ19XtC2nAXQ.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*AvUbabETBjBZ19XtC2nAXQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*AvUbabETBjBZ19XtC2nAXQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*AvUbabETBjBZ19XtC2nAXQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"706f\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">So, we convert each chunk back into a list of tokens with the <code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/www.nltk.org\/api\/nltk.chunk.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">leaves()<\/a><\/code> method and add the resulting entity to the <code class=\"cw oz pa pb pc b\">continuous_chunk<\/code> list.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"9ce7\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">extract_named_entities<\/span>(<span class=\"hljs-params\">text<\/span>):\n  tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(text)))\n  continuous_chunk = []\n  current_chunk = []\n\n  <span class=\"hljs-keyword\">for<\/span> chunk <span class=\"hljs-keyword\">in<\/span> tree:\n    <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-built_in\">isinstance<\/span>(chunk, Tree):\n      current_chunk.append(<span class=\"hljs-string\">\" \"<\/span>.join([token <span 
class=\"hljs-keyword\">for<\/span> token, pos <span class=\"hljs-keyword\">in<\/span> chunk.leaves()]))\n      <span class=\"hljs-keyword\">if<\/span> current_chunk:\n        named_entity = <span class=\"hljs-string\">\" \"<\/span>.join(current_chunk)\n        <span class=\"hljs-keyword\">if<\/span> named_entity <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> continuous_chunk:\n          continuous_chunk.append(named_entity)\n        <span class=\"hljs-comment\"># reset even for a duplicate so it does not leak into the next entity<\/span>\n        current_chunk = []\n      <span class=\"hljs-keyword\">else<\/span>:\n        <span class=\"hljs-keyword\">continue<\/span>\n  <span class=\"hljs-keyword\">return<\/span> continuous_chunk<\/span><\/pre>\n<pre class=\"ro qf pc qg bo qh ba bj\"><span id=\"2c9f\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># call the function<\/span>\nextract_named_entities(named_ent_txt)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n\n[<span class=\"hljs-string\">'Josh Doew'<\/span>, <span class=\"hljs-string\">'Ouray'<\/span>]<\/span><\/pre>\n<p id=\"6960\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Great!<\/p>\n<h1 id=\"5070\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Word Frequency Distribution<\/h1>\n<p id=\"0c13\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">We can identify the words that appear most often in our text data by building a frequency distribution. 
To do this, we use the <code class=\"cw oz pa pb pc b\">FreqDist<\/code> class in NLTK.<\/p>\n<p id=\"9d52\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Let\u2019s use the <code class=\"cw oz pa pb pc b\">tokens_no_stopwords<\/code> list that we built in the section on removing stop words.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"caf7\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> nltk <span class=\"hljs-keyword\">import<\/span> FreqDist\nfreq_distribution = FreqDist(tokens_no_stopwords)\n\n<span class=\"hljs-comment\"># extract the 10 most frequent words in the text<\/span>\nfreq_distribution.most_common(<span class=\"hljs-number\">10<\/span>)\n\n<span class=\"hljs-comment\"># RESULTS<\/span>\n\n<span class=\"hljs-string\">'''\n[(',', 3),\n ('much', 1),\n ('natural', 1),\n ('language', 1),\n ('processing', 1),\n ('ranging', 1),\n ('speech', 1),\n ('recognition', 1),\n ('recommendation', 1),\n ('systems', 1)]\n\n'''<\/span><\/span><\/pre>\n<p id=\"d37b\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We can also visualize the distribution:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"3ad4\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">freq_distribution.plot(<span class=\"hljs-number\">20<\/span>, cumulative=<span class=\"hljs-literal\">True<\/span>)<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:562\/1*wGPwl3buggszMzAu9vI6HQ.png\" alt=\"\" width=\"562\" height=\"371\"><\/figure><div class=\"mi mj rp\"><picture><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1124\/format:webp\/1*wGPwl3buggszMzAu9vI6HQ.png 1124w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 562px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*wGPwl3buggszMzAu9vI6HQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*wGPwl3buggszMzAu9vI6HQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*wGPwl3buggszMzAu9vI6HQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*wGPwl3buggszMzAu9vI6HQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*wGPwl3buggszMzAu9vI6HQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*wGPwl3buggszMzAu9vI6HQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1124\/1*wGPwl3buggszMzAu9vI6HQ.png 1124w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 562px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"2999\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">With the knowledge we have gained about text preprocessing with NLTK, it\u2019s time to put some of it to use. Next, we will build an Email Spam Classifier.<\/p>\n<h1 id=\"568e\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Email spam classification with TensorFlow, Keras, and NLTK<\/h1>\n<p id=\"9254\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">In this section, we will create a spam classification model with TensorFlow and Keras. 
The ability to classify a text or email as spam or ham (a legitimate, non-spam message) is a crucial feature of many messaging applications we use today, like Gmail or Apple\u2019s official messaging app.<\/p>\n<p id=\"08ae\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">The following are the steps we will take:<\/p>\n<ul class=\"\">\n<li id=\"b33e\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Initialize <a class=\"af he\" href=\"https:\/\/www.comet.com\/b-capt#projects\" target=\"_blank\" rel=\"noopener ugc nofollow\">Comet<\/a> tracking.<\/li>\n<li id=\"f06b\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Take care of all the necessary imports.<\/li>\n<li id=\"5c79\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Grab a dataset and visualize it with <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas<\/a>.<\/li>\n<li id=\"6908\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Preprocess the data:<br>\n&#8211; Remove stop words.<br>\n&#8211; Tokenize the text data.<br>\n&#8211; Lemmatize the tokens. We choose lemmatization over stemming because it returns valid dictionary words.<br>\n&#8211; Vectorize the data.<\/li>\n<li id=\"bdc4\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Build the model.<\/li>\n<li id=\"474d\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" 
data-selectable-paragraph=\"\">Evaluate the model.<\/li>\n<\/ul>\n<h2 id=\"0691\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Initialize Comet tracking<\/h2>\n<p id=\"8fd7\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">To use Comet, we need a personal API key, which lets our experiment communicate with the Comet platform. Sign up, then find the API key in the Account Settings of your profile.<\/p>\n<p id=\"48a4\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Also, install <code class=\"cw oz pa pb pc b\">comet_ml<\/code> with pip if it is not already installed in your environment:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"39cb\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># on Jupyter<\/span>\n\n%pip install comet_ml\n\n<span class=\"hljs-comment\"># on command line<\/span>\n\npip install comet_ml<\/span><\/pre>\n<pre class=\"ro qf pc qg bo qh ba bj\"><span id=\"6d35\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># initialize comet<\/span>\n\n<span class=\"hljs-keyword\">import<\/span> comet_ml\n\nexperiment = comet_ml.Experiment(\n    api_key=<span class=\"hljs-string\">\"your_API_Key\"<\/span>, <span class=\"hljs-comment\"># use the API key from your Comet account<\/span>\n    project_name=<span class=\"hljs-string\">\"Email Spam Classifier\"<\/span>,\n    log_code=<span class=\"hljs-literal\">True<\/span>,\n    auto_metric_logging=<span class=\"hljs-literal\">True<\/span>,\n    auto_param_logging=<span class=\"hljs-literal\">True<\/span>,\n    auto_histogram_weight_logging=<span class=\"hljs-literal\">True<\/span>,\n    
auto_histogram_gradient_logging=<span class=\"hljs-literal\">True<\/span>,\n    auto_histogram_activation_logging=<span class=\"hljs-literal\">True<\/span>,\n)<\/span><\/pre>\n<p id=\"bc31\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We are good to go!<\/p>\n<p id=\"75cf\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Let\u2019s take care of all the necessary imports:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"6825\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># for data<\/span>\n<span class=\"hljs-keyword\">import<\/span> numpy <span class=\"hljs-keyword\">as<\/span> np\n<span class=\"hljs-keyword\">import<\/span> pandas <span class=\"hljs-keyword\">as<\/span> pd\n\n<span class=\"hljs-comment\"># for visuals<\/span>\n<span class=\"hljs-keyword\">import<\/span> matplotlib.pyplot <span class=\"hljs-keyword\">as<\/span> plt\n<span class=\"hljs-keyword\">from<\/span> matplotlib.pyplot <span class=\"hljs-keyword\">import<\/span> figure\n<span class=\"hljs-keyword\">import<\/span> seaborn <span class=\"hljs-keyword\">as<\/span> sns\n<span class=\"hljs-keyword\">from<\/span> wordcloud <span class=\"hljs-keyword\">import<\/span> WordCloud\n\n<span class=\"hljs-comment\"># from nltk for NLP<\/span>\n<span class=\"hljs-keyword\">import<\/span> nltk\n<span class=\"hljs-keyword\">import<\/span> string <span class=\"hljs-comment\"># for use in punctuation removal<\/span>\n<span class=\"hljs-keyword\">from<\/span> nltk.corpus <span class=\"hljs-keyword\">import<\/span> stopwords\n<span class=\"hljs-keyword\">from<\/span> nltk.tokenize <span class=\"hljs-keyword\">import<\/span> word_tokenize\n<span class=\"hljs-keyword\">from<\/span> nltk.stem <span class=\"hljs-keyword\">import<\/span> WordNetLemmatizer\n\n<span class=\"hljs-comment\"># text preprocessing from sklearn<\/span>\n<span 
class=\"hljs-keyword\">from<\/span> sklearn.feature_extraction.text <span class=\"hljs-keyword\">import<\/span> TfidfVectorizer\n<span class=\"hljs-keyword\">from<\/span> sklearn.model_selection <span class=\"hljs-keyword\">import<\/span> train_test_split\n\n<span class=\"hljs-comment\"># for the NN<\/span>\n<span class=\"hljs-keyword\">import<\/span> tensorflow <span class=\"hljs-keyword\">as<\/span> tf\n<span class=\"hljs-keyword\">from<\/span> tensorflow <span class=\"hljs-keyword\">import<\/span> keras\n<span class=\"hljs-keyword\">from<\/span> tensorflow.keras.models <span class=\"hljs-keyword\">import<\/span> Sequential\n<span class=\"hljs-keyword\">from<\/span> tensorflow.keras.layers <span class=\"hljs-keyword\">import<\/span> Dense\n\n<span class=\"hljs-comment\"># Evaluation<\/span>\n<span class=\"hljs-keyword\">from<\/span> sklearn.metrics <span class=\"hljs-keyword\">import<\/span> confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score<\/span><\/pre>\n<h1 id=\"7bfa\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Working on the dataset for NLP<\/h1>\n<p id=\"b030\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">We will use the <a class=\"af he\" href=\"https:\/\/www.kaggle.com\/datasets\/uciml\/sms-spam-collection-dataset\" target=\"_blank\" rel=\"noopener ugc nofollow\">spam collection dataset<\/a> from <a class=\"af he\" href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kaggle<\/a>. 
We will read the data using the <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas <\/a><code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/github.com\/comet-ml\/kangas\/wiki\/DataGrid#read_csv\" target=\"_blank\" rel=\"noopener ugc nofollow\">DataGrid.read_csv()<\/a><\/code> class method.<\/p>\n<p id=\"39ab\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">I know we are used to Pandas, but this time let\u2019s do it the <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas<\/a> way.<\/p>\n<p id=\"4649\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Simply install <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas <\/a>with <code class=\"cw oz pa pb pc b\">%pip install kangas<\/code> in your notebook.<\/p>\n<h2 id=\"85c6\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Reading and visualizing the data with Kangas<\/h2>\n<p id=\"b97b\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">First, we import <code class=\"cw oz pa pb pc b\">kangas<\/code> with the alias <code class=\"cw oz pa pb pc b\">kg<\/code>:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"d976\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">import<\/span> kangas <span class=\"hljs-keyword\">as<\/span> 
kg<\/span><\/pre>\n<pre class=\"ro qf pc qg bo qh ba bj\"><span id=\"e750\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># Read the data using Kangas<\/span>\ndg = kg.DataGrid.read_csv(<span class=\"hljs-string\">\"spam_data.csv\"<\/span>)\ndg.save()<\/span><\/pre>\n<p id=\"e3c9\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">When we read the data with Kangas, we get a <code class=\"cw oz pa pb pc b\">DataGrid<\/code> rather than the <code class=\"cw oz pa pb pc b\">DataFrame<\/code> that Pandas returns. Several familiar Pandas-style methods, such as head() and tail(), also work on the <code class=\"cw oz pa pb pc b\">DataGrid<\/code>.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"6e6f\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># For example, get the first ten rows<\/span>\n\ndg.head(<span class=\"hljs-number\">10<\/span>)<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:449\/1*KeB6f8ukt2riN-eLvw2HUg.png\" alt=\"\" width=\"449\" height=\"225\"><\/figure><div class=\"mi mj rq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:898\/format:webp\/1*KeB6f8ukt2riN-eLvw2HUg.png 898w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 449px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*KeB6f8ukt2riN-eLvw2HUg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*KeB6f8ukt2riN-eLvw2HUg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*KeB6f8ukt2riN-eLvw2HUg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*KeB6f8ukt2riN-eLvw2HUg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*KeB6f8ukt2riN-eLvw2HUg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*KeB6f8ukt2riN-eLvw2HUg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:898\/1*KeB6f8ukt2riN-eLvw2HUg.png 898w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 449px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"e784\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># Get the columns<\/span>\n\ndg.get_columns()\n\n<span class=\"hljs-comment\"># 
['label', 'message']<\/span><\/span><\/pre>\n<p id=\"2168\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">However, inspecting rows in code is not the main goal of Kangas. <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas <\/a>has a user interface in which we can explore and visualize our data more clearly. To fire up the UI, we call the <code class=\"cw oz pa pb pc b\">show()<\/code> method on the <code class=\"cw oz pa pb pc b\">DataGrid<\/code>:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"b877\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># Fire up the Kangas UI<\/span>\n\ndg.show()<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*QQIY7ep25xmpLlP36qmm4A.png\" alt=\"\" width=\"700\" height=\"530\"><\/figure><div class=\"mi mj rr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*QQIY7ep25xmpLlP36qmm4A.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 
4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*QQIY7ep25xmpLlP36qmm4A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*QQIY7ep25xmpLlP36qmm4A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*QQIY7ep25xmpLlP36qmm4A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*QQIY7ep25xmpLlP36qmm4A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*QQIY7ep25xmpLlP36qmm4A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*QQIY7ep25xmpLlP36qmm4A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*QQIY7ep25xmpLlP36qmm4A.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Kangas UI<\/figcaption>\n<\/figure>\n<p id=\"bac1\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Clearly, we can visualize:<\/p>\n<ul class=\"\">\n<li id=\"023d\" class=\"nc nd fr be b gp ne nf ng gs nh 
ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">The first ten rows of our data (<strong class=\"be nx\">dg.head(10)<\/strong>).<\/li>\n<li id=\"94c4\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">The columns (ROW-ID, LABEL, MESSAGE) (<strong class=\"be nx\">dg.get_columns()<\/strong>).<\/li>\n<\/ul>\n<p id=\"e496\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We can also view the last rows of the data (<strong class=\"be nx\">dg.tail()<\/strong>):<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*bX7ckh0KVKpIrLd_pnkRPw.png\" alt=\"\" width=\"700\" height=\"532\"><\/figure><div class=\"mi mj rs\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*bX7ckh0KVKpIrLd_pnkRPw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 
700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*bX7ckh0KVKpIrLd_pnkRPw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*bX7ckh0KVKpIrLd_pnkRPw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*bX7ckh0KVKpIrLd_pnkRPw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*bX7ckh0KVKpIrLd_pnkRPw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*bX7ckh0KVKpIrLd_pnkRPw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*bX7ckh0KVKpIrLd_pnkRPw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*bX7ckh0KVKpIrLd_pnkRPw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Last rows of the data(Kangas UI)<\/figcaption>\n<\/figure>\n<p id=\"fc0c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We can check the value counts of the spam and ham messages. 
Use Group by labels on the UI:<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*SIoV9JBsgOWzy_Rc9D85Tg.png\" alt=\"\" width=\"700\" height=\"270\"><\/figure><div class=\"mi mj rt\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*SIoV9JBsgOWzy_Rc9D85Tg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Spam and Ham messages count(Kangas UI)<\/figcaption>\n<\/figure>\n<p id=\"b0a1\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We have <strong class=\"be nx\">4825 messages<\/strong> that are <strong class=\"be nx\">ham <\/strong>and <strong class=\"be nx\">747 messages<\/strong> that are <strong class=\"be nx\">spam.<\/strong><\/p>\n<p id=\"de7f\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We can also check if we have <strong class=\"be nx\">null values<\/strong> in the data. 
To do this in the UI, we can use the filter expressions that <code class=\"cw oz pa pb pc b\">DataGrid<\/code> supports.<\/p>\n<p id=\"9949\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We use <code class=\"cw oz pa pb pc b\">is None<\/code> to check for null values, combine the expressions with <code class=\"cw oz pa pb pc b\">or<\/code> so that a row with a null in either column matches, and use parentheses to force the evaluation order.<\/p>\n<p id=\"aadc\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To check if we have null values:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"bf80\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># paste this into the UI's Filter input box<\/span>\n\n(({<span class=\"hljs-string\">\"message\"<\/span>} <span class=\"hljs-keyword\">is<\/span> <span class=\"hljs-literal\">None<\/span>) <span class=\"hljs-keyword\">or<\/span> ({<span class=\"hljs-string\">\"label\"<\/span>} <span class=\"hljs-keyword\">is<\/span> <span class=\"hljs-literal\">None<\/span>))<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*mkAcxCr_HEjJz3DFFl55Ew.png\" alt=\"\" width=\"700\" height=\"155\"><\/figure><div class=\"mi mj ru\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*mkAcxCr_HEjJz3DFFl55Ew.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*mkAcxCr_HEjJz3DFFl55Ew.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*mkAcxCr_HEjJz3DFFl55Ew.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*mkAcxCr_HEjJz3DFFl55Ew.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*mkAcxCr_HEjJz3DFFl55Ew.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*mkAcxCr_HEjJz3DFFl55Ew.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*mkAcxCr_HEjJz3DFFl55Ew.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*mkAcxCr_HEjJz3DFFl55Ew.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" 
data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Checking for Null Values (Kangas UI)<\/figcaption>\n<\/figure>\n<p id=\"9991\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Since no rows match our criteria, we have no null values.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"ab ca qo qp qq qr\" role=\"separator\"><\/div>\n\n\n\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<p id=\"e050\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\"><em class=\"qn\">To satisfy your curiosity, here is an example from another dataset in which the same kind of filter in the UI does return rows with null values:<\/em><\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*FhsRHWZHhXmIb9LshwKulQ.png\" alt=\"\" width=\"700\" height=\"196\"><\/figure><div class=\"mi mj rv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*FhsRHWZHhXmIb9LshwKulQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*FhsRHWZHhXmIb9LshwKulQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*FhsRHWZHhXmIb9LshwKulQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*FhsRHWZHhXmIb9LshwKulQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*FhsRHWZHhXmIb9LshwKulQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*FhsRHWZHhXmIb9LshwKulQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*FhsRHWZHhXmIb9LshwKulQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*FhsRHWZHhXmIb9LshwKulQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"ab ca qo qp qq qr\" role=\"separator\"><\/div>\n\n\n\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<p id=\"5038\" class=\"pw-post-body-paragraph 
nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">It is also a good idea to check for empty strings. Some data sources store an empty or whitespace-only string instead of a null when a value is missing.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"71f3\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># df is assumed to be a pandas DataFrame with the same 'label' and 'message' columns<\/span>\nempty_strs = []\n<span class=\"hljs-keyword\">for<\/span> idx, label, msg <span class=\"hljs-keyword\">in<\/span> df.itertuples(name=<span class=\"hljs-string\">\"Ham_spam\"<\/span>):\n  <span class=\"hljs-comment\"># strip() also catches zero-length strings, which isspace() misses<\/span>\n  <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> msg.strip():\n    empty_strs.append(idx)\n\n<span class=\"hljs-comment\"># Result is an empty list, as we have no empty strings<\/span>\n[]<\/span><\/pre>\n<p id=\"ab4b\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Just to be \u201cfancy\u201d, let\u2019s check the percentage distribution of each label (ham and spam):<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"a5a0\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">category_count = df[<span class=\"hljs-string\">'label'<\/span>].value_counts()\n\nfig, ax = plt.subplots(figsize=(<span class=\"hljs-number\">15<\/span>, <span class=\"hljs-number\">5<\/span>))\nexplode = [<span class=\"hljs-number\">0.1<\/span>, <span class=\"hljs-number\">0.3<\/span>]\nax.pie(category_count, labels=category_count.index, explode=explode, autopct=<span class=\"hljs-string\">'%1.2f%%'<\/span>)\nax.set_title(<span class=\"hljs-string\">'Distribution of spam and ham'<\/span>)\nplt.show()<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" 
src=\"https:\/\/miro.medium.com\/v2\/resize:fit:527\/1*U225_S4VYffrhKdmAr-nUw.png\" alt=\"\" width=\"527\" height=\"280\"><\/figure><div class=\"mi mj rw\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1054\/format:webp\/1*U225_S4VYffrhKdmAr-nUw.png 1054w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 527px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*U225_S4VYffrhKdmAr-nUw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*U225_S4VYffrhKdmAr-nUw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*U225_S4VYffrhKdmAr-nUw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*U225_S4VYffrhKdmAr-nUw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*U225_S4VYffrhKdmAr-nUw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*U225_S4VYffrhKdmAr-nUw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1054\/1*U225_S4VYffrhKdmAr-nUw.png 1054w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 527px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<h2 id=\"72a1\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Preprocessing the text data with NLTK<\/h2>\n<p id=\"7e5e\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Here we will:<\/p>\n<ul class=\"\">\n<li id=\"6ccb\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Tokenize each message in the message column.<\/li>\n<li id=\"8059\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Convert all text to lowercase and remove stop words and punctuation.<\/li>\n<li id=\"666e\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Lemmatize the message.<\/li>\n<\/ul>\n<p id=\"ab0a\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We have chosen lemmatization over stemming for this dataset because it reduces each word to a valid dictionary form (its lemma), so the cleaned text keeps meaningful words.<\/p>\n<p id=\"dec5\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Let\u2019s create a function that performs the above steps for 
us:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"d053\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">import<\/span> re\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">preprocess_mesg<\/span>(<span class=\"hljs-params\">message<\/span>):\n\n  <span class=\"hljs-comment\"># convert the message to lowercase and replace non-alphanumeric characters with spaces<\/span>\n  message_lower = message.lower()\n  message = re.sub(<span class=\"hljs-string\">r'[^a-zA-Z0-9\\s]'<\/span>, <span class=\"hljs-string\">' '<\/span>, message_lower)\n\n  <span class=\"hljs-comment\"># combine stop words and punctuation<\/span>\n  stopword_and_punctuation = <span class=\"hljs-built_in\">set<\/span>(stopwords.words(<span class=\"hljs-string\">'english'<\/span>) + <span class=\"hljs-built_in\">list<\/span>(string.punctuation))\n\n  <span class=\"hljs-comment\"># tokenize the message with word_tokenize<\/span>\n  tokens = word_tokenize(message)\n\n  <span class=\"hljs-comment\"># Initialize the lemmatizer<\/span>\n  lemmatizer = WordNetLemmatizer()\n\n  <span class=\"hljs-comment\"># lemmatize every token that is not a stop word or punctuation mark<\/span>\n  preprocessed_msg = [lemmatizer.lemmatize(word) <span class=\"hljs-keyword\">for<\/span> word <span class=\"hljs-keyword\">in<\/span> tokens <span class=\"hljs-keyword\">if<\/span> word <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> stopword_and_punctuation]\n  cleaned_message = <span class=\"hljs-string\">\" \"<\/span>.join(preprocessed_msg)\n\n  <span class=\"hljs-keyword\">return<\/span> cleaned_message\n\ndf[<span class=\"hljs-string\">'message'<\/span>] = df[<span class=\"hljs-string\">'message'<\/span>].apply(preprocess_mesg)<\/span><\/pre>\n<p id=\"4f56\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl 
nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">If we display the DataFrame, the message column now holds cleaned text: lowercased, lemmatized, and free of stop words.<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:507\/1*CfY7_XjV3B-_PKH_mG84Nw.png\" alt=\"\" width=\"507\" height=\"197\"><\/figure><div class=\"mi mj rx\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1014\/format:webp\/1*CfY7_XjV3B-_PKH_mG84Nw.png 1014w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 507px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*CfY7_XjV3B-_PKH_mG84Nw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*CfY7_XjV3B-_PKH_mG84Nw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*CfY7_XjV3B-_PKH_mG84Nw.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*CfY7_XjV3B-_PKH_mG84Nw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*CfY7_XjV3B-_PKH_mG84Nw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*CfY7_XjV3B-_PKH_mG84Nw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1014\/1*CfY7_XjV3B-_PKH_mG84Nw.png 1014w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 507px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"f914\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Now that our data is clean, we can extract the predictor and target variables.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"6883\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">messages = df[<span class=\"hljs-string\">'message'<\/span>].values <span class=\"hljs-comment\"># X<\/span>\nlabels = df[<span class=\"hljs-string\">'label'<\/span>].values <span class=\"hljs-comment\"># y<\/span><\/span><\/pre>\n<p id=\"a54a\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will map the <em class=\"qn\">labels<\/em> to integers and then one-hot encode them with Keras\u2019 <code class=\"cw oz pa pb pc b\">to_categorical()<\/code> function.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"4b95\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span 
class=\"hljs-keyword\">from<\/span> keras.utils <span class=\"hljs-keyword\">import<\/span> to_categorical\n\nlabel2int = {<span class=\"hljs-string\">'ham'<\/span>: <span class=\"hljs-number\">1<\/span>, <span class=\"hljs-string\">'spam'<\/span>:<span class=\"hljs-number\">0<\/span>}\nlabels = [label2int[label] <span class=\"hljs-keyword\">for<\/span> label <span class=\"hljs-keyword\">in<\/span> labels]\nlabels = to_categorical(labels)\n\nlabels<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:406\/1*iIFZ38YDefWjp5Mriipd3w.png\" alt=\"\" width=\"406\" height=\"125\"><\/figure><div class=\"mi mj ry\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:812\/format:webp\/1*iIFZ38YDefWjp5Mriipd3w.png 812w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 406px\"><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*iIFZ38YDefWjp5Mriipd3w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*iIFZ38YDefWjp5Mriipd3w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*iIFZ38YDefWjp5Mriipd3w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*iIFZ38YDefWjp5Mriipd3w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*iIFZ38YDefWjp5Mriipd3w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*iIFZ38YDefWjp5Mriipd3w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:812\/1*iIFZ38YDefWjp5Mriipd3w.png 812w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 406px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<h2 id=\"53ca\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Splitting the data into training and test sets<\/h2>\n<p id=\"6dc8\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">We need to split and shuffle the data into training sets for training the model and test sets for evaluating the model\u2019s performance.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"d798\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">X_train, X_test, y_train, y_test = train_test_split(messages, labels, test_size=<span class=\"hljs-number\">0.30<\/span>, random_state=<span class=\"hljs-number\">42<\/span>)<\/span><\/pre>\n<pre class=\"ro qf pc qg bo qh ba 
bj\"><span id=\"02d2\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">X_train.shape, X_test.shape, y_train.shape, y_test.shape\n\n<span class=\"hljs-comment\"># Shapes<\/span>\n((<span class=\"hljs-number\">3900<\/span>,), (<span class=\"hljs-number\">1672<\/span>,), (<span class=\"hljs-number\">3900<\/span>, <span class=\"hljs-number\">2<\/span>), (<span class=\"hljs-number\">1672<\/span>, <span class=\"hljs-number\">2<\/span>))<\/span><\/pre>\n<h2 id=\"1f23\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Feature extraction with TfidfVectorizer<\/h2>\n<p id=\"2c68\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Before training our model, we need to convert our text samples into numerical data that the model can work with.<\/p>\n<p id=\"7a17\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">A common way to do this is with the <code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.TfidfVectorizer.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">TfidfVectorizer<\/a><\/code> from Scikit-learn. We could use the <code class=\"cw oz pa pb pc b\"><a class=\"af he\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.CountVectorizer.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">CountVectorizer<\/a><\/code> instead, but <code class=\"cw oz pa pb pc b\">CountVectorizer<\/code> only returns a sparse matrix of integer word counts, whereas <code class=\"cw oz pa pb pc b\">TfidfVectorizer<\/code> returns floating-point scores that take the importance of each word into account. 
This means that words appearing in most messages receive low scores, which helps the model focus on more distinctive terms.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"00a5\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># initialize the vectorizer<\/span>\n\ntfidfV = TfidfVectorizer()\n\n<span class=\"hljs-comment\"># fit on the training set only, then transform both sets and convert them to dense arrays<\/span>\nX_train_tfidfV = tfidfV.fit_transform(X_train).toarray()\nX_test_tfidfV = tfidfV.transform(X_test).toarray()<\/span><\/pre>\n<h1 id=\"1fff\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Building the model<\/h1>\n<p id=\"3c47\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Here we will use the Keras Sequential model and set the input dimension equal to the number of features in the vectorized training data.<\/p>\n<p id=\"2de1\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will also use the <a class=\"af he\" href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/activations\/relu\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be nx\">rectified linear unit (ReLU)<\/strong><\/a> activation function for our hidden layers and the <a class=\"af he\" href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/activations\/softmax\" target=\"_blank\" rel=\"noopener ugc nofollow\">softmax<\/a> function for our output layer, which converts the output vector into a probability distribution.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"c4c6\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># initialize the Sequential model<\/span>\nmodel = Sequential()\n\ninput_dim = 
X_train_tfidfV.shape[<span class=\"hljs-number\">1<\/span>]\n\n<span class=\"hljs-comment\"># add Dense layers to the neural network; only the first layer needs the input dimension<\/span>\nmodel.add(Dense(<span class=\"hljs-number\">8<\/span>, input_dim=input_dim, activation=<span class=\"hljs-string\">'relu'<\/span>))\nmodel.add(Dense(<span class=\"hljs-number\">8<\/span>, activation=<span class=\"hljs-string\">'relu'<\/span>))\nmodel.add(Dense(<span class=\"hljs-number\">2<\/span>, activation=<span class=\"hljs-string\">'softmax'<\/span>))\n\nmodel.<span class=\"hljs-built_in\">compile<\/span>(loss=<span class=\"hljs-string\">'categorical_crossentropy'<\/span>, optimizer=<span class=\"hljs-string\">'adam'<\/span>, metrics=[<span class=\"hljs-string\">'accuracy'<\/span>])<\/span><\/pre>\n<pre class=\"ro qf pc qg bo qh ba bj\"><span id=\"90a9\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># call the summary method of the model<\/span>\n\nmodel.summary()<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:587\/1*qurXCygMFFfJk_Wa5zQQ_A.png\" alt=\"\" width=\"587\" height=\"257\"><\/figure><div class=\"mi mj rz\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1174\/format:webp\/1*qurXCygMFFfJk_Wa5zQQ_A.png 1174w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 587px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*qurXCygMFFfJk_Wa5zQQ_A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*qurXCygMFFfJk_Wa5zQQ_A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*qurXCygMFFfJk_Wa5zQQ_A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*qurXCygMFFfJk_Wa5zQQ_A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*qurXCygMFFfJk_Wa5zQQ_A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*qurXCygMFFfJk_Wa5zQQ_A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1174\/1*qurXCygMFFfJk_Wa5zQQ_A.png 1174w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 587px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"8e48\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Next, we need to train the model for the fixed number of <code class=\"cw oz pa pb 
pc b\">epochs<\/code> or iterations on the dataset using the <code class=\"cw oz pa pb pc b\">fit<\/code> method of the Sequential model class.<\/p>\n<p id=\"e8d7\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We will iterate through the training set 10 times (epochs) and pass the test set as validation data, which Keras evaluates at the end of each epoch. The validation loss and accuracy help us gauge the model\u2019s performance and identify whether it is overfitting.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"39ad\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">model.fit<span class=\"hljs-punctuation\">(<\/span>X_train_tfidfV, y_train, epochs<span class=\"hljs-punctuation\">=<\/span><span class=\"hljs-number\">10<\/span>, verbose<span class=\"hljs-punctuation\">=<\/span><span class=\"hljs-literal\">True<\/span>, validation_data<span class=\"hljs-punctuation\">=<\/span><span class=\"hljs-punctuation\">(<\/span>X_test_tfidfV, y_test<span class=\"hljs-punctuation\">)<\/span>, batch_size<span class=\"hljs-punctuation\">=<\/span><span class=\"hljs-number\">10<\/span><span class=\"hljs-punctuation\">)<\/span><\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*qqRKMkcsIM-6aIp8sNkaqQ.png\" alt=\"\" width=\"700\" height=\"250\"><\/figure><div class=\"mi mj sa\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*qqRKMkcsIM-6aIp8sNkaqQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" 
data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h2 id=\"e63c\" class=\"po nz fr be oa pp pq pr od ps pt pu og nk pv pw px no py pz qa ns qb qc qd qe bj\" data-selectable-paragraph=\"\">Evaluating the model<\/h2>\n<p id=\"d8ba\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Now that our model is trained, we can create the confusion matrix and classification report. We will also view the performance in Comet.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"db3a\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">preds = (model.predict(X_test_tfidfV) &gt; <span class=\"hljs-number\">0.5<\/span>).astype(<span class=\"hljs-string\">\"int32\"<\/span>)\n\n<span class=\"hljs-comment\"># convert the one-hot test labels and thresholded predictions to class indices<\/span>\n\ny_test = np.argmax(y_test, axis=<span class=\"hljs-number\">1<\/span>)\npreds = np.argmax(preds, axis=<span class=\"hljs-number\">1<\/span>)<\/span><\/pre>\n<p id=\"88d1\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Classification report:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"efba\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-built_in\">print<\/span>(classification_report(y_test, preds))<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:522\/1*tt3ILXPmyFupwEQPhWtB_A.png\" alt=\"\" width=\"522\" height=\"157\"><\/figure><div class=\"mi mj sb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 
720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1044\/format:webp\/1*tt3ILXPmyFupwEQPhWtB_A.png 1044w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 522px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*tt3ILXPmyFupwEQPhWtB_A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*tt3ILXPmyFupwEQPhWtB_A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*tt3ILXPmyFupwEQPhWtB_A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*tt3ILXPmyFupwEQPhWtB_A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*tt3ILXPmyFupwEQPhWtB_A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*tt3ILXPmyFupwEQPhWtB_A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1044\/1*tt3ILXPmyFupwEQPhWtB_A.png 1044w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 522px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"aa6b\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Confusion matrix:<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"f6b4\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> sklearn.metrics <span class=\"hljs-keyword\">import<\/span> confusion_matrix\n\n<span class=\"hljs-comment\"># compute the confusion matrix from the true and predicted class indices<\/span>\ncf_matrix = confusion_matrix(y_test, preds)\n\nfigure(figsize=(<span class=\"hljs-number\">7<\/span>, <span class=\"hljs-number\">5<\/span>), dpi=<span class=\"hljs-number\">80<\/span>)\nsns.heatmap(pd.DataFrame(cf_matrix), annot=<span class=\"hljs-literal\">True<\/span>, cmap=<span class=\"hljs-string\">\"YlGnBu\"<\/span>, fmt=<span class=\"hljs-string\">'g'<\/span>, cbar=<span class=\"hljs-literal\">False<\/span>)\nplt.title(<span class=\"hljs-string\">'Confusion matrix'<\/span>, y=<span class=\"hljs-number\">1.1<\/span>)\nplt.ylabel(<span class=\"hljs-string\">'Actual label'<\/span>)\nplt.xlabel(<span class=\"hljs-string\">'Predicted label'<\/span>)<\/span><\/pre>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:568\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png\" alt=\"\" width=\"568\" height=\"316\"><\/figure><div class=\"mi mj sc\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1136\/format:webp\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 1136w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 568px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1136\/1*JMzEHZ7_Gb1d59U7t7iO_Q.png 1136w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 568px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"51dc\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" 
data-selectable-paragraph=\"\">Looking at the confusion matrix, we can see that the model has <strong class=\"be nx\">high TP (188)<\/strong> and <strong class=\"be nx\">TN (1451)<\/strong> counts and <strong class=\"be nx\">low FP and FN counts<\/strong>. In other words, it classifies most of the test messages correctly.<\/p>\n<p id=\"c59c\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\"><strong class=\"be nx\">Comet<\/strong> time!<\/p>\n<p id=\"192b\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">To view the experiment on Comet, include the following line in the last cell of your notebook or code.<\/p>\n<pre class=\"ml mm mn mo mp qf pc qg bo qh ba bj\"><span id=\"802e\" class=\"qi nz fr pc b bf qj qk l ql qm\" data-selectable-paragraph=\"\">experiment.end()<\/span><\/pre>\n<p id=\"4ddc\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">The accuracy, loss, and validation loss are logged automatically to Comet:<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*5JrLiQ365EvCz9yzIQrtUQ.png\" alt=\"\" width=\"700\" height=\"418\"><\/figure><div class=\"mi mj sd\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*5JrLiQ365EvCz9yzIQrtUQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*5JrLiQ365EvCz9yzIQrtUQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*5JrLiQ365EvCz9yzIQrtUQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*5JrLiQ365EvCz9yzIQrtUQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*5JrLiQ365EvCz9yzIQrtUQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*5JrLiQ365EvCz9yzIQrtUQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*5JrLiQ365EvCz9yzIQrtUQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*5JrLiQ365EvCz9yzIQrtUQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" 
data-selectable-paragraph=\"\">Comet tracking<\/figcaption>\n<\/figure>\n<p id=\"adf9\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Comet also logs the trained model\u2019s hyperparameters automatically.<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:657\/1*sUPOIDf-UeFB9d_5JFIGaA.png\" alt=\"\" width=\"657\" height=\"385\"><\/figure><div class=\"mi mj se\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1314\/format:webp\/1*sUPOIDf-UeFB9d_5JFIGaA.png 1314w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 657px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*sUPOIDf-UeFB9d_5JFIGaA.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*sUPOIDf-UeFB9d_5JFIGaA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*sUPOIDf-UeFB9d_5JFIGaA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*sUPOIDf-UeFB9d_5JFIGaA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*sUPOIDf-UeFB9d_5JFIGaA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*sUPOIDf-UeFB9d_5JFIGaA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1314\/1*sUPOIDf-UeFB9d_5JFIGaA.png 1314w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 657px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mx my mz mi mj na nb be b bf z dw\" data-selectable-paragraph=\"\">Comet logged hyperparameters<\/figcaption>\n<\/figure>\n<p id=\"6e1e\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Metrics:<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*_2lPgSZQJMHg9cTYMGlLVg.png\" alt=\"\" width=\"700\" height=\"240\"><\/figure><div class=\"mi mj sf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*_2lPgSZQJMHg9cTYMGlLVg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*_2lPgSZQJMHg9cTYMGlLVg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*_2lPgSZQJMHg9cTYMGlLVg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*_2lPgSZQJMHg9cTYMGlLVg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*_2lPgSZQJMHg9cTYMGlLVg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*_2lPgSZQJMHg9cTYMGlLVg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*_2lPgSZQJMHg9cTYMGlLVg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*_2lPgSZQJMHg9cTYMGlLVg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 
700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"1961\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We can also view visualizations for the gradients, activations, weights, and biases. For instance, we can see the biases over each epoch in the visualizations below.<\/p>\n<figure class=\"ml mm mn mo mp mq mi mj paragraph-image\">\n<div class=\"mr ms ee mt bg mu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mv mw c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png\" alt=\"\" width=\"700\" height=\"750\"><\/figure><div class=\"mi mj sg\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 
(-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*6OCDBvn9Qeb2Yu7Mij5DFA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"c0b6\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">Comet is a helpful platform for building models: it lets you focus entirely on developing the model itself while it logs the rest for you.<\/p>\n<h1 id=\"ee02\" class=\"ny nz fr be oa ob oc gr od oe of gu og oh oi oj ok ol om on oo op oq or os ot bj\" data-selectable-paragraph=\"\">Final thoughts<\/h1>\n<p id=\"1672\" class=\"pw-post-body-paragraph nc nd fr be b gp ou nf ng gs ov ni nj nk ow nm nn no ox nq nr ns oy nu nv nw fk bj\" data-selectable-paragraph=\"\">Natural language processing is a broad field and is not limited to classifying spam in text. 
You can apply the text preprocessing knowledge you have gained with NLTK to explore other NLP tasks.<\/p>\n<p id=\"6cdd\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">We have not exhausted all of NLTK\u2019s text preprocessing capabilities, but we have covered the most common concepts you will start with when doing NLP.<\/p>\n<p id=\"4ed1\" class=\"pw-post-body-paragraph nc nd fr be b gp ne nf ng gs nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw fk bj\" data-selectable-paragraph=\"\">In this all-in-one article, you have learned about:<\/p>\n<ul class=\"\">\n<li id=\"896b\" class=\"nc nd fr be b gp ne nf ng gs nh ni nj nk pd nm nn no pe nq nr ns pf nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">NLTK text preprocessing concepts.<\/li>\n<li id=\"2f94\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Visualizing data with <a class=\"af he\" href=\"https:\/\/www.comet.com\/site\/blog\/kangas-visualize-multimedia-data-at-scale\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Kangas<\/a>.<\/li>\n<li id=\"9fce\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Vectorizing data with <a class=\"af he\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.TfidfVectorizer.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">TfidfVectorizer<\/a>. 
Also, explore other methods, such as <a class=\"af he\" href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.CountVectorizer.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">CountVectorizer<\/a>, for the same purpose.<\/li>\n<li id=\"916d\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Building an email spam classifier using TensorFlow and Keras.<\/li>\n<li id=\"8c0d\" class=\"nc nd fr be b gp pj nf ng gs pk ni nj nk pl nm nn no pm nq nr ns pn nu nv nw pg ph pi bj\" data-selectable-paragraph=\"\">Tracking your experiment with <a class=\"af he\" href=\"https:\/\/www.comet.com\/docs\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Comet<\/a> and viewing the various metrics and visualizations produced by the platform.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Stephen Phillips \u2014 Hostreviews.co.uk on Unsplash At its core, the discipline of Natural Language Processing (NLP) tries to make the human language \u201cpalatable\u201d to computers. Many data we analyze as data scientists consist of a corpus of human-readable text. 
Before we can feed this data into a computer for analysis, we must preprocess [&hellip;]<\/p>\n","protected":false},"author":108,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[],"coauthors":[206],"class_list":["post-8092","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Natural Language Processing (NLP) Concepts With NLTK - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Natural Language Processing (NLP) Concepts With NLTK\" \/>\n<meta property=\"og:description\" content=\"Photo by Stephen Phillips \u2014 Hostreviews.co.uk on Unsplash At its core, the discipline of Natural Language Processing (NLP) tries to make the human language \u201cpalatable\u201d to computers. Many data we analyze as data scientists consist of a corpus of human-readable text. 
Before we can feed this data into a computer for analysis, we must preprocess [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-02T18:22:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:04:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg\" \/>\n<meta name=\"author\" content=\"Brian Mutea\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Brian Mutea\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"22 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Natural Language Processing (NLP) Concepts With NLTK - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk","og_locale":"en_US","og_type":"article","og_title":"Natural Language Processing (NLP) Concepts With NLTK","og_description":"Photo by Stephen Phillips \u2014 Hostreviews.co.uk on Unsplash At its core, the discipline of Natural Language Processing (NLP) tries to make the human language \u201cpalatable\u201d to computers. Many data we analyze as data scientists consist of a corpus of human-readable text. 
Before we can feed this data into a computer for analysis, we must preprocess [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-02T18:22:21+00:00","article_modified_time":"2025-04-24T17:04:43+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg","type":"","width":"","height":""}],"author":"Brian Mutea","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Brian Mutea","Est. reading time":"22 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\/"},"author":{"name":"Brian Mutea","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/45acdda6535e03a9542e665f23953c3b"},"headline":"Natural Language Processing (NLP) Concepts With NLTK","datePublished":"2023-11-02T18:22:21+00:00","dateModified":"2025-04-24T17:04:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\/"},"wordCount":2784,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg","articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk\/","url":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk","name":"Natural Language Processing (NLP) Concepts 
With NLTK - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg","datePublished":"2023-11-02T18:22:21+00:00","dateModified":"2025-04-24T17:04:43+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-MUoHYxWr5DFi5UmJHUePQ.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/natural-language-processing-nlp-concepts-with-nltk#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Natural Language Processing (NLP) Concepts With NLTK"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/45acdda6535e03a9542e665f23953c3b","name":"Brian Mutea","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/0008644e1041f4f2e48e3566c59bc055","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/11\/1652705747012-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/11\/1652705747012-96x96.jpg","caption":"Brian 
Mutea"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/brianmuteakgmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8092","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/108"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8092"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8092\/revisions"}],"predecessor-version":[{"id":15463,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8092\/revisions\/15463"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8092"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8092"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8092"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8092"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}