{"id":7126,"date":"2023-08-14T04:46:47","date_gmt":"2023-08-14T12:46:47","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7126"},"modified":"2025-04-24T17:14:51","modified_gmt":"2025-04-24T17:14:51","slug":"spam-filtering-using-bag-of-words","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/","title":{"rendered":"Spam Filtering Using Bag-of-Words"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\">\n\n\n\n<div class=\"mf bg\">\n<figure class=\"mg mh mi mj mk mf bg paragraph-image\"><picture><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg\" alt=\"\" width=\"2400\" height=\"1643\"><\/picture><figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mu\" href=\"https:\/\/unsplash.com\/@minarikd?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Daniel Min\u00e1rik<\/a> on <a class=\"af mu\" href=\"https:\/\/unsplash.com\/s\/photos\/word-pile?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"eafb\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">In this post, we\u2019re going to employ one simple natural language processing (NLP) algorithm known as <em class=\"nq\">bag-of-words<\/em> to classify messages as ham or spam. 
Using bag of words and feature engineering related to NLP, we\u2019ll get hands-on experience on a small dataset for SMS classification.<\/p>\n<p id=\"438a\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\"><em class=\"nq\">So, what are we waiting for?<\/em><\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:642\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg\" alt=\"\" width=\"642\" height=\"428\"><\/figure><div class=\"mq mr nr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1284\/format:webp\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 1284w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 642px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1284\/1*74cMLAx3N6nHwdX0MvKXTg.jpeg 1284w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 642px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mu\" href=\"https:\/\/adage.com\/article\/digitalnext\/holiday-email-practices-avoiding-spam-filter\/306437\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/adage.com\/article\/digitalnext\/holiday-email-practices-avoiding-spam-filter\/306437<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"6b77\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">The Problem: Spam Messages<\/h2>\n<p id=\"7b6d\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Spam emails or messages belong to the broad category of unsolicited messages received by a user. 
Spam wastes storage space and bandwidth, increases exposure to malware such as trojans, and in general exploits a user\u2019s connection to social networks.<\/p>\n<p id=\"ff77\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Spam can also be used in Denial of Service (DoS) or Distributed Denial of Service (DDoS) attacks. Various techniques are employed to filter out spam messages, usually centered on content-based filtering. This works because the same keywords, links, or websites are sent to users repeatedly and in bulk, which characterizes those messages as spam.<\/p>\n<h2 id=\"63d9\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">The Solution: Text Classification<\/h2>\n<p id=\"5c36\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Natural language is harder for algorithms to interpret and analyze than numeric data. This is true for a few reasons:<\/p>\n<ol class=\"\">\n<li id=\"97e0\" class=\"mv mw fo be b gm mx my mz gp na nb nc os ne nf ng ot ni nj nk ou nm nn no np ov ow ox bj\" data-selectable-paragraph=\"\">Sentences do not have a fixed length, but most algorithms require a fixed-size input vector. 
Thus, shorter sentences must be padded to the length of the longest sentence in the corpus.<\/li>\n<li id=\"b2a4\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\">ML algorithms cannot take words as input; each word needs to be represented by some numeric value.<\/li>\n<\/ol>\n<h1 id=\"5098\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Bag-of-Words Model<\/h1>\n<p id=\"569f\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">A bag-of-words model allows us to extract features from textual data. As we know, an algorithm doesn\u2019t understand language. Thus, we need to use a numeric representation for the words in the corpus. This numeric representation can later be fed to any algorithm for further analysis.<\/p>\n<p id=\"de99\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">It\u2019s called \u201cbag-of-words\u201d because the order of the words and the structure of the sentence are lost in this model. Only the occurrence or presence of a word matters.<\/p>\n<p id=\"c9d6\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">We can think of the model this way: we have a big bag, empty at the start, and a corpus of text. 
We pick up words one by one and put them in the bag, adding to the frequency of their occurrence, and then select the most common words as features for passing through our algorithm of choice.<\/p>\n<p id=\"8865\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Thus, it promotes the view that similar documents consist of similar kinds of words.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:661\/1*WN18F5oVHKzf_DXcCpSFiQ.png\" alt=\"\" width=\"661\" height=\"380\"><\/figure><div class=\"mq mr pv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1322\/format:webp\/1*WN18F5oVHKzf_DXcCpSFiQ.png 1322w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 661px\"><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*WN18F5oVHKzf_DXcCpSFiQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*WN18F5oVHKzf_DXcCpSFiQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*WN18F5oVHKzf_DXcCpSFiQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*WN18F5oVHKzf_DXcCpSFiQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*WN18F5oVHKzf_DXcCpSFiQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*WN18F5oVHKzf_DXcCpSFiQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1322\/1*WN18F5oVHKzf_DXcCpSFiQ.png 1322w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 661px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mu\" href=\"https:\/\/dudeperf3ct.github.io\/lstm\/gru\/nlp\/2019\/01\/28\/Force-of-LSTM-and-GRU\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/dudeperf3ct.github.io\/lstm\/gru\/nlp\/2019\/01\/28\/Force-of-LSTM-and-GRU\/<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"a5a3\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Dataset<\/h1>\n<p id=\"0a92\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">The dataset that we\u2019re going to use in this article is an SMS spam collection dataset. 
It contains over 5,500 messages in English: one column holds the message text, and the column next to it specifies whether the message is ham or spam.<\/p>\n<p id=\"6a83\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">You can find the dataset <a class=\"af mu\" href=\"https:\/\/github.com\/nikitaa30\/Spam-Filtering-techniques\/tree\/master\/dataset%20for%20bag%20of%20words\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>. The complete source code can be found in <a class=\"af mu\" href=\"https:\/\/github.com\/nikitaa30\/Spam-Filtering-techniques\/blob\/master\/bag%20of%20words.py\" target=\"_blank\" rel=\"noopener ugc nofollow\">this<\/a> repository.<\/p>\n<h2 id=\"2ab9\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Importing the dataset<\/h2>\n<p id=\"49dd\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">To import the dataset into a Pandas DataFrame, we use the two lines below:<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"261e\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">import pandas as pd\ndataset = pd.read_csv('spam.csv', encoding='ISO-8859-1')<\/span><\/pre>\n<p id=\"f2e6\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Here\u2019s a glimpse of the dataset we are working on. 
We later convert the labels into dummy variables.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:435\/1*0CAJJpxnTGK7v9z2hsD0NA.png\" alt=\"\" width=\"435\" height=\"211\"><\/figure><div class=\"mq mr qe\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:870\/format:webp\/1*0CAJJpxnTGK7v9z2hsD0NA.png 870w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 435px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*0CAJJpxnTGK7v9z2hsD0NA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*0CAJJpxnTGK7v9z2hsD0NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*0CAJJpxnTGK7v9z2hsD0NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*0CAJJpxnTGK7v9z2hsD0NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*0CAJJpxnTGK7v9z2hsD0NA.png 
828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*0CAJJpxnTGK7v9z2hsD0NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:870\/1*0CAJJpxnTGK7v9z2hsD0NA.png 870w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 435px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<h1 id=\"0270\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Pre-requisites<\/h1>\n<ol class=\"\">\n<li id=\"3956\" class=\"mv mw fo be b gm on my mz gp oo nb nc os op nf ng ot oq nj nk ou or nn no np ov ow ox bj\" data-selectable-paragraph=\"\"><a class=\"af mu\" href=\"https:\/\/heartbeat.comet.ml\/nlp-chronicles-intro-to-nlp-with-nltk-b2c369fbb9a7\" target=\"_blank\" rel=\"noopener ugc nofollow\">NLTK<\/a> for NLP-related tasks<\/li>\n<li id=\"d6a4\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\">NumPy and <a class=\"af mu\" href=\"https:\/\/heartbeat.comet.ml\/tips-and-tricks-for-data-analysis-with-pandas-dc0ae909e6be\" target=\"_blank\" rel=\"noopener ugc nofollow\">Pandas<\/a> for mathematical operations<\/li>\n<li id=\"c849\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\">Scikit-learn for tokenization and tf-idf model<\/li>\n<li id=\"ce98\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\"><a class=\"af mu\" 
href=\"https:\/\/heartbeat.comet.ml\/introduction-to-matplotlib-data-visualization-in-python-d9143287ae39\" target=\"_blank\" rel=\"noopener ugc nofollow\">Matplotlib<\/a> and <a class=\"af mu\" href=\"https:\/\/heartbeat.comet.ml\/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07\" target=\"_blank\" rel=\"noopener ugc nofollow\">Seaborn<\/a> for visualisation tasks<\/li>\n<\/ol>\n<h1 id=\"e086\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Data Pre-processing<\/h1>\n<h2 id=\"edee\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Removing Stopwords<\/h2>\n<p id=\"df84\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Stopwords are words in a sentence that add little meaning of their own. They are often prepositions, helping verbs, and articles (e.g., in, the, an, is). Since these add little value to our model, we remove them.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"7cc5\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">import nltk\nnltk.download('stopwords')  # fetch the stopword lists once\nfrom nltk.corpus import stopwords\nstopwords.words('english')  # list of English stopwords<\/span><\/pre>\n<h2 id=\"a5c5\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Removing non-alphabetic characters<\/h2>\n<p id=\"c4f2\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Since we use only words for text classification, we need to get rid of punctuation and numbers. For this, we use regular expressions in Python. 
The regex below keeps only alphabetic characters and replaces everything else with a space.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"da6b\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">import re\ntext = re.sub('[^A-Za-z]', ' ', text)  # replace non-letters with spaces<\/span><\/pre>\n<h2 id=\"7ff6\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Changing all to lower case<\/h2>\n<p id=\"12ee\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">The model is case-sensitive: it treats \u2018Text\u2019 and \u2018text\u2019 as different words. We don\u2019t want that, so we convert all words to lowercase.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"e695\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">text = text.lower()<\/span><\/pre>\n<h2 id=\"6cd1\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Spelling Corrections<\/h2>\n<p id=\"409d\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Sometimes, people use abbreviations or misspell words by mistake. 
To correct these instances, we use the autocorrect package and its spell corrector.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"2fa4\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">from autocorrect import spell\ntext.append(spell(word))<\/span><\/pre>\n<h2 id=\"e21a\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Stemming and Lemmatization<\/h2>\n<p id=\"d15e\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Words like act, actor, and acting are all forms of the same root word (act). Stemming and lemmatization are techniques used to reduce words to their stem or base form.<\/p>\n<p id=\"471f\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">The difference between these is that after stemming, the stem may not be an actual word, whereas lemmatization always produces a real word, which makes the corpus easier for humans to interpret.<\/p>\n<p id=\"1c77\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, studies could be stemmed as studi (not a word), but will be lemmatized as study (an existing word).<\/p>\n<p id=\"607c\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Here\u2019s a comparison of the dataset before and after stemming.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" 
src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*1KFpmWTPvGo1Azzn78Pyiw.png\" alt=\"\" width=\"700\" height=\"245\"><\/figure><div class=\"mq mr qf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*1KFpmWTPvGo1Azzn78Pyiw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*1KFpmWTPvGo1Azzn78Pyiw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*1KFpmWTPvGo1Azzn78Pyiw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*1KFpmWTPvGo1Azzn78Pyiw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*1KFpmWTPvGo1Azzn78Pyiw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*1KFpmWTPvGo1Azzn78Pyiw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*1KFpmWTPvGo1Azzn78Pyiw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*1KFpmWTPvGo1Azzn78Pyiw.png 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:169\/1*KtfgBjoT9YbHycthEcTfTw.png\" alt=\"\" width=\"169\" height=\"426\"><\/figure><div class=\"mq mr qk\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:338\/format:webp\/1*KtfgBjoT9YbHycthEcTfTw.png 338w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) 
and (max-width: 700px) 100vw, 169px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*KtfgBjoT9YbHycthEcTfTw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*KtfgBjoT9YbHycthEcTfTw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*KtfgBjoT9YbHycthEcTfTw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*KtfgBjoT9YbHycthEcTfTw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*KtfgBjoT9YbHycthEcTfTw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*KtfgBjoT9YbHycthEcTfTw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:338\/1*KtfgBjoT9YbHycthEcTfTw.png 338w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 169px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"2aff\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">The highlighted words are a few of those that were stemmed. Note that a stem is often not an actual word.<\/p>\n<h2 id=\"262d\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Visualizing Spam Keywords<\/h2>\n<p id=\"3407\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Data visualization is a handy way to better understand the text data in our dataset. 
For example, we can make a wordcloud, which displays the most common words, with the size of each word proportional to the frequency of its occurrence. A few other visualization techniques are discussed <a class=\"af mu\" href=\"https:\/\/machinelearningmastery.com\/data-visualization-methods-in-python\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"e89e\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">from wordcloud import WordCloud\nimport matplotlib.pyplot as plt\n# spam_words is assumed to be a single string of all spam text\nspam_wc = WordCloud(width=600, height=512).generate(spam_words)\nplt.figure(figsize=(12, 8), facecolor='k')\nplt.imshow(spam_wc)\nplt.show()<\/span><\/pre>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*5Ymqkx0YiLcqfFtvwUP70Q.png\" alt=\"\" width=\"700\" height=\"589\"><\/figure><div class=\"mq mr ql\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and 
(max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*5Ymqkx0YiLcqfFtvwUP70Q.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Wordcloud of spam words from our dataset.<\/figcaption>\n<\/figure>\n<h1 id=\"2d0b\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Feature Engineering<\/h1>\n<p id=\"5c8f\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Now we need to perform manipulation on 
the cleaned, pre-processed dataset to transform it into a form more suitable for applying a machine learning algorithm.<\/p>\n<h2 id=\"47db\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Tokenisation<\/h2>\n<p id=\"c17a\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">For the bag of words implementation, we use <code class=\"cw qm qn qo px b\">CountVectorizer<\/code> from scikit-learn, which counts the frequency of each word present in our pre-processed dataset, and takes the n most common words as features.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*GIb1j98BvG5oLjMqWMC3NA.png\" alt=\"\" width=\"700\" height=\"275\"><\/figure><div class=\"mq mr qq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*GIb1j98BvG5oLjMqWMC3NA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 
3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*GIb1j98BvG5oLjMqWMC3NA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*GIb1j98BvG5oLjMqWMC3NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*GIb1j98BvG5oLjMqWMC3NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*GIb1j98BvG5oLjMqWMC3NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*GIb1j98BvG5oLjMqWMC3NA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*GIb1j98BvG5oLjMqWMC3NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*GIb1j98BvG5oLjMqWMC3NA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mu\" href=\"https:\/\/towardsdatascience.com\/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e\" target=\"_blank\" rel=\"noopener\">https:\/\/towardsdatascience.com\/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e<\/a><\/figcaption>\n<\/figure>\n<p id=\"aad7\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" 
data-selectable-paragraph=\"\"><code class=\"cw qm qn qo px b\">CountVectorizer<\/code> returns a matrix in which each row represents a message, each column is one of the selected top features (words), and each entry is the number of times that word occurs in that message.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"aac4\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">from sklearn.feature_extraction.text import CountVectorizer\n# Keep only the 2000 most frequent words as features\ndata = CountVectorizer(max_features=2000)\n# Rows are messages, columns are per-word counts\nX = data.fit_transform(dataset).toarray()\n<\/span><\/pre>\n<p id=\"d618\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Since the <code class=\"cw qm qn qo px b\">CountVectorizer<\/code> output has 2000 features, it is hard to show here. Thus, <strong class=\"be qp\">for the example below, we take the first 25 words of our dataset, tokenise them, and select the 10 most frequent ones.<\/strong><\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*34ryEQ10eX_Be19ktXp1ZQ.png\" alt=\"\" width=\"700\" height=\"80\"><\/figure><div class=\"mq mr qr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*34ryEQ10eX_Be19ktXp1ZQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*34ryEQ10eX_Be19ktXp1ZQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*34ryEQ10eX_Be19ktXp1ZQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*34ryEQ10eX_Be19ktXp1ZQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*34ryEQ10eX_Be19ktXp1ZQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*34ryEQ10eX_Be19ktXp1ZQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*34ryEQ10eX_Be19ktXp1ZQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*34ryEQ10eX_Be19ktXp1ZQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"a2a6\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">The matrix that represents the frequency of each of these features in our 
messages (the dataset) is given below:<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:548\/1*7Baa5XqFyLhlVBqcrfMIAg.png\" alt=\"\" width=\"548\" height=\"593\"><\/figure><div class=\"mq mr qs\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1096\/format:webp\/1*7Baa5XqFyLhlVBqcrfMIAg.png 1096w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 548px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*7Baa5XqFyLhlVBqcrfMIAg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*7Baa5XqFyLhlVBqcrfMIAg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*7Baa5XqFyLhlVBqcrfMIAg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*7Baa5XqFyLhlVBqcrfMIAg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*7Baa5XqFyLhlVBqcrfMIAg.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*7Baa5XqFyLhlVBqcrfMIAg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1096\/1*7Baa5XqFyLhlVBqcrfMIAg.png 1096w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 548px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">z is a sparse matrix consisting of mostly zeros.<\/figcaption>\n<\/figure>\n<h1 id=\"581f\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Developing the Model<\/h1>\n<p id=\"a31c\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Now that our dataset is ready with its attributes, we pass it through any algorithm of our choice. Here, after splitting the dataset into training and test sets, I\u2019ve used a simple <a class=\"af mu\" href=\"https:\/\/heartbeat.comet.ml\/understanding-the-mathematics-behind-naive-bayes-ab6ee85f50d0\" target=\"_blank\" rel=\"noopener ugc nofollow\">Naive Bayes<\/a> classifier for demonstration. 
You can use any algorithm of your choice depending on the dataset.<\/p>\n<pre class=\"mg mh mi mj mk pw px py pz ax qa bj\"><span id=\"fbcf\" class=\"ns nt fo px b ia qb qc l iq qd\" data-selectable-paragraph=\"\">from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, y)<\/span><span id=\"370e\" class=\"ns nt fo px b ia qt qc l iq qd\" data-selectable-paragraph=\"\">from sklearn.naive_bayes import GaussianNB\nclassifier = GaussianNB()\nclassifier.fit(X_train, y_train)\ny_pred = classifier.predict(X_test)<\/span><\/pre>\n<h1 id=\"f7ca\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Results<\/h1>\n<p id=\"611d\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Let\u2019s see how our simple model works on a test set:<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:596\/1*FrixIVa3qa_UliVVVDJBXg.png\" alt=\"\" width=\"596\" height=\"147\"><\/figure><div class=\"mq mr qu\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1192\/format:webp\/1*FrixIVa3qa_UliVVVDJBXg.png 1192w\" type=\"image\/webp\" 
sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 596px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*FrixIVa3qa_UliVVVDJBXg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*FrixIVa3qa_UliVVVDJBXg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*FrixIVa3qa_UliVVVDJBXg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*FrixIVa3qa_UliVVVDJBXg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*FrixIVa3qa_UliVVVDJBXg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*FrixIVa3qa_UliVVVDJBXg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1192\/1*FrixIVa3qa_UliVVVDJBXg.png 1192w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 596px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"1836\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Looks like our model crossed the finish line with a decent accuracy of ~80%. 
That is nothing to boast about, but it is reasonable given the simplicity of our model and its drawbacks, which are discussed in the next section. Thus, our model separates ham from spam fairly well for a simple baseline.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"qg qh eb qi bg qj\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*vkFAPzI2X1UQQTaVcrV7NA.png\" alt=\"\" width=\"700\" height=\"176\"><\/figure><div class=\"mq mr qv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*vkFAPzI2X1UQQTaVcrV7NA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*vkFAPzI2X1UQQTaVcrV7NA.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*vkFAPzI2X1UQQTaVcrV7NA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*vkFAPzI2X1UQQTaVcrV7NA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*vkFAPzI2X1UQQTaVcrV7NA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*vkFAPzI2X1UQQTaVcrV7NA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*vkFAPzI2X1UQQTaVcrV7NA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*vkFAPzI2X1UQQTaVcrV7NA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"04f8\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Here is an image of the confusion matrix depicting the true positives and false positives in the first row and false negatives and true negatives in the next row respectively.<\/p>\n<h1 id=\"e633\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Drawbacks of the Bag-of-Words Model<\/h1>\n<p id=\"711c\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">The bag-of-words model assumes that the words are independent. Thus, it doesn\u2019t take into account any relationship between words. 
Hence, the meaning of sentences is lost.<\/p>\n<p id=\"65a5\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Also, the structure of the sentence has no importance in the eyes of our model. Two sentences like \u201cThese clams are good\u201d and \u201cAre these clams good?\u201d look the same to the bag-of-words model because they contain the same words, though one is a claim and one is a question. Additionally, for a large vocabulary, bag-of-words results in a very high-dimensional vector.<\/p>\n<h1 id=\"90e7\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Improvements to the above model<\/h1>\n<p id=\"30e6\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">A few ways to improve the accuracy of the above model include:<\/p>\n<ol class=\"\">\n<li id=\"67c1\" class=\"mv mw fo be b gm mx my mz gp na nb nc os ne nf ng ot ni nj nk ou nm nn no np ov ow ox bj\" data-selectable-paragraph=\"\">Use custom stopwords tailored to the dataset (the language or lingo it uses), adding words that are specific to the corpus. Some text or links are specific to spam messages, and some lingo is added deliberately to get past generic spam filters. Analyzing the dataset well, and knowing the structure and content of spam messages in particular, lets us tailor the stopwords to our needs.<\/li>\n<li id=\"fea3\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\">Instead of using uni-grams (individual words), using bi-grams and tri-grams can be beneficial to better understand a message\u2019s meaning. For example, take the message \u201cgift card worth millions\u201d. 
Here, instead of using uni-grams, which would give us \u2018gift\u2019, \u2018card\u2019, \u2018worth\u2019, \u2018millions\u2019, we can use bi-grams, which give us pairs such as \u2018gift card\u2019 and \u2018worth millions\u2019, each treated as one feature. As you can see, these pairs could clearly indicate a spam message, whereas \u2018gift\u2019, \u2018card\u2019, \u2018worth\u2019, and \u2018millions\u2019 could individually be a part of any day-to-day conversation.<\/li>\n<li id=\"af68\" class=\"mv mw fo be b gm oy my mz gp oz nb nc os pa nf ng ot pb nj nk ou pc nn no np ov ow ox bj\" data-selectable-paragraph=\"\">Apply a vector space model with a measure like cosine similarity between messages, and use tf-idf vectorization to weight each word by how important it is to a message relative to the rest of the corpus. You can read more about this in <a class=\"af mu\" href=\"https:\/\/heartbeat.fritz.ai\/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831\" target=\"_blank\" rel=\"noopener ugc nofollow\">this<\/a> article.<\/li>\n<\/ol>\n<h2 id=\"1871\" class=\"ns nt fo be nu nv nw nx ny nz oa ob oc nd od oe of nh og oh oi nl oj ok ol om bj\" data-selectable-paragraph=\"\">Repository<\/h2>\n<p id=\"b93e\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">Here\u2019s the <a href=\"https:\/\/github.com\/nikitaa30\/Spam-Filtering-techniques\">link<\/a> to my repository, where you can find the complete source code for this tutorial. 
I\u2019ll also keep adding code for spam filtering with other techniques, so stay connected.<\/p>\n<h1 id=\"2fc6\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Sources to get started with NLP<\/h1>\n<div class=\"qw qx qy qz ra rb\">\n<div class=\"rc ab ik\">\n<div class=\"rd ab cn ca re rf\"><\/div>\n<\/div>\n<\/div>\n<div class=\"qw qx qy qz ra rb\">\n<div class=\"rc ab ik\">\n<div class=\"rk l\">\n<ul>\n<li class=\"rr l rm rn ro rk rp ml rb\"><a href=\"https:\/\/www.kaggle.com\/code\/abhishek\/approaching-almost-any-nlp-problem-on-kaggle\/notebook\">Approaching (Almost) Any NLP Problem on Kaggle<\/a><\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/code\/philculliton\/nlp-getting-started-tutorial\/notebook\">NLP Getting Started Tutorial<\/a><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<h1 id=\"ee94\" class=\"pd nt fo be nu pe pf go ny pg ph gr oc pi pj pk pl pm pn po pp pq pr ps pt pu bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"982a\" class=\"pw-post-body-paragraph mv mw fo be b gm on my mz gp oo nb nc nd op nf ng nh oq nj nk nl or nn no np fh bj\" data-selectable-paragraph=\"\">In this post, we implemented a spam text classifier using a bag-of-words model. We learned how to work with text data and build a working baseline model using a few NLP concepts.<\/p>\n<p id=\"f3e2\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">There is a lot more to NLP, and spam filtering in general is a mature field, with various machine learning and deep learning techniques commonly used to improve model results. In future posts, I\u2019ll try to approach spam filtering with different techniques. All feedback is welcome. 
Please help me improve!<\/p>\n<p id=\"cb9a\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\"><em class=\"nq\">Until next time!<\/em>\ud83d\ude01<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Daniel Min\u00e1rik on Unsplash In this post, we\u2019re going to employ one simple natural language processing (NLP) algorithm known as bag-of-words to classify messages as ham or spam. Using bag of words and feature engineering related to NLP, we\u2019ll get hands-on experience on a small dataset for SMS classification. So, what are we [&hellip;]<\/p>\n","protected":false},"author":36,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[114],"class_list":["post-7126","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Spam Filtering Using Bag-of-Words - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spam Filtering Using Bag-of-Words\" \/>\n<meta property=\"og:description\" content=\"Photo by Daniel Min\u00e1rik on Unsplash In this post, we\u2019re going to employ one simple natural language processing (NLP) algorithm known as 
bag-of-words to classify messages as ham or spam. Using bag of words and feature engineering related to NLP, we\u2019ll get hands-on experience on a small dataset for SMS classification. So, what are we [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-14T12:46:47+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg\" \/>\n<meta name=\"author\" content=\"Nikita Sharma\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nikita Sharma\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Spam Filtering Using Bag-of-Words - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/","og_locale":"en_US","og_type":"article","og_title":"Spam Filtering Using Bag-of-Words","og_description":"Photo by Daniel Min\u00e1rik on Unsplash In this post, we\u2019re going to employ one simple natural language processing (NLP) algorithm known as bag-of-words to classify messages as ham or spam. 
Using bag of words and feature engineering related to NLP, we\u2019ll get hands-on experience on a small dataset for SMS classification. So, what are we [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-14T12:46:47+00:00","article_modified_time":"2025-04-24T17:14:51+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg","type":"","width":"","height":""}],"author":"Nikita Sharma","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Nikita Sharma","Est. reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/"},"author":{"name":"Nikita Sharma","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/ddaf0d52f59c3a234abfa717ee44af05"},"headline":"Spam Filtering Using Bag-of-Words","datePublished":"2023-08-14T12:46:47+00:00","dateModified":"2025-04-24T17:14:51+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/"},"wordCount":1600,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/","url":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/","name":"Spam Filtering Using Bag-of-Words - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg","datePublished":"2023-08-14T12:46:47+00:00","dateModified":"2025-04-24T17:14:51+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*VN_iumSQ1jKm_Gq1F5PXAQ.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/spam-filtering-using-bag-of-words\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Spam Filtering Using Bag-of-Words"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/ddaf0d52f59c3a234abfa717ee44af05","name":"Nikita Sharma","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/19634f9b328196b88e91a242ab1b3576","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1688123806627-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1688123806627-96x96.jpg","caption":"Nikita 
Sharma"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/nikitasharma\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7126","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/36"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7126"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7126\/revisions"}],"predecessor-version":[{"id":15584,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7126\/revisions\/15584"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7126"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7126"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7126"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7126"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}