{"id":7852,"date":"2023-10-06T12:56:06","date_gmt":"2023-10-06T20:56:06","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7852"},"modified":"2025-04-24T17:05:53","modified_gmt":"2025-04-24T17:05:53","slug":"vectorization-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/vectorization-in-machine-learning\/","title":{"rendered":"Vectorization In Machine Learning"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/vectorization-in-machine-learning\">\n\n\n\n<div class=\"fi fj fk fl fm\">\n<div class=\"ab ca\">\n<div class=\"ch bg eu ev ew ex\">\n<figure class=\"mi mj mk ml mm mn mf mg paragraph-image\">\n<div class=\"mo mp ec mq bg mr\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ms mt c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*OaJ1HuEz12IaUPR6\" alt=\"\" width=\"700\" height=\"525\"><\/figure><div class=\"mf mg mh\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">Photo by <a class=\"af mz\" href=\"https:\/\/unsplash.com\/@sure_mp?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Surendran MP<\/a> on <a class=\"af mz\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"d94a\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Natural language processing is a subfield of artificial intelligence that combines computational linguistics, statistics, machine learning, and deep learning models to allow computers to process human language and understand its context, intent, and sentiment.<\/p>\n<p id=\"3fbe\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">A generic natural language processing (NLP) model is a combination of multiple mathematical and statistical steps. It usually starts with raw text and ends with a model that can predict outcomes. In between, there are multiple steps that include text cleaning, modeling, and hyperparameter tuning.<\/p>\n<p id=\"2879\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Text cleaning, or normalization, is one of the most important steps of any NLP task. It includes removing unwanted data, converting words to their base forms (<a class=\"af mz\" href=\"https:\/\/en.wikipedia.org\/wiki\/Stemming\" target=\"_blank\" rel=\"noopener ugc nofollow\">stemming<\/a> or <a class=\"af mz\" href=\"https:\/\/en.wikipedia.org\/wiki\/Lemmatisation\" target=\"_blank\" rel=\"noopener ugc nofollow\">lemmatization<\/a>), and vectorization.<\/p>\n<p id=\"00d9\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Vectorization is the process of converting textual data into numerical vectors and is a process that is usually applied once the text is cleaned. It can help improve the execution speed and reduce the training time of your code. 
Vectorization is the process of converting textual data into numerical vectors, and it is usually applied once the text has been cleaned. It can improve execution speed and reduce the training time of your code. In this article, we will discuss some of the best techniques to perform vectorization.

# Vectorization techniques

There are three major methods for performing vectorization on text data:
1. CountVectorizer
2. TF-IDF
3. Word2Vec

## 1. CountVectorizer

CountVectorizer is one of the simplest techniques for converting text into vectors. It starts by tokenizing the document into a list of tokens (words), selects the unique tokens to build a vocabulary, and finally creates a sparse matrix of word frequencies, where each row represents a sentence and each column represents a unique word from the vocabulary.
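Conceptually, the whole procedure fits in a few lines of plain Python. The sketch below is our own illustration of the tokenize, build-vocabulary, count pipeline, not scikit-learn's actual implementation:

```python
# Illustrative only: a hand-rolled version of the tokenize -> vocabulary -> count idea.
corpus = ['food is bad', 'bad service bad food']

# 1. Tokenize each document into words.
tokenized = [doc.split() for doc in corpus]

# 2. Build a sorted vocabulary of the unique tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# 3. Count how often each vocabulary word occurs in each document.
counts = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)  # ['bad', 'food', 'is', 'service']
print(counts)      # [[1, 1, 1, 0], [2, 1, 0, 1]]
```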
Python's scikit-learn has a class named `CountVectorizer` that provides a simple implementation for performing vectorization on text data. Let's create a sample corpus of food reviews and convert it into vectors using `CountVectorizer`.

```python
corpus = ['Food is Bad',
          'Bad Service Bad Food',
          'Food is Good',
          'Good Service With Good Food.',
          'Service is Bad but Food is Good.']
```

The first step is to clean the data by converting the text to lowercase and removing stopwords.

```python
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))

corpus = ['Food is Bad',
          'Bad Service Bad Food',
          'Food is Good',
          'Good Service With Good Food.',
          'Service is Bad but Food is Good.']

cleaned_corpus = []
for sent in corpus:
    sent = sent.lower()
    cleaned_corpus.append(' '.join([word for word in sent.split() if word not in english_stopwords]))
print(cleaned_corpus)
```

The cleaned corpus:

```python
cleaned_corpus = ['food bad',
                  'bad service bad food',
                  'food good',
                  'good service good food.',
                  'service bad food good.']
```

Next, we import the `CountVectorizer` class and build the document-term matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_corpus)  # passing the cleaned corpus

# Note: get_feature_names() was removed in newer scikit-learn releases;
# use get_feature_names_out() instead.
doc_term_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(doc_term_matrix)
```
| | bad | food | good | service |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 2 | 1 |
| 4 | 1 | 1 | 1 | 1 |

*[1.1] Document-term matrix of the corpus*

We can easily interpret the document-term matrix that `CountVectorizer` outputs. It contains every unique word and its frequency count in each sentence. For example, the count for 'bad' is 2 in the second row because the word appears twice in the second review.

The `CountVectorizer` class also has many parameters that help process the data before vectorization is applied; a short sketch follows the list below.

## Important parameters of CountVectorizer

- `lowercase` : convert all characters to lowercase before tokenizing.
- `stop_words` : remove stopwords from the text before vectorization. {'english'} or a list, default=None. If 'english', a built-in stop word list is used.
- `strip_accents` : {'ascii', 'unicode'}, default=None. Remove accents and perform other character normalization during the preprocessing step.
- `ngram_range` : tuple (min_n, max_n), default=(1, 1). The range of n-values for the word n-grams to be extracted.
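As a quick illustration of these parameters (our own example, with arbitrarily chosen settings), the same class can lowercase the text, drop English stopwords, and extract unigrams plus bigrams in one pass:

```python
# Illustrative parameter combination; the settings here are arbitrary examples.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    lowercase=True,           # normalize case before tokenizing
    stop_words='english',     # use the built-in English stop word list
    strip_accents='unicode',  # normalize accented characters
    ngram_range=(1, 2),       # extract unigrams and bigrams
)
X = vectorizer.fit_transform(['Food is Bad', 'Bad Service Bad Food'])
print(vectorizer.get_feature_names_out())
# ['bad' 'bad food' 'bad service' 'food' 'food bad' 'service' 'service bad']
```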
> Isolating difficult data samples? Comet can do that. [Learn more with our PetCam scenario and discover Comet Artifacts.](https://www.comet.com/site/blog/debugging-your-machine-learning-models-with-comet-artifacts/)

## 2. TF-IDF

TF-IDF, or [Term Frequency–Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), is a statistical measure of how relevant a word is to a document. It combines two metrics, term frequency and inverse document frequency, to produce a relevance score.

The term frequency (TF) is the frequency of a word in a document. It is calculated by dividing the number of occurrences of the word inside the document by the total number of words in that document.

The inverse document frequency (IDF) is a measure of how much information a word provides. Words like "the," for example, occur very frequently but provide little context or value to a sentence. IDF is calculated as the logarithm of the inverse of the document frequency, that is, of the proportion of documents that contain the word.

TF-IDF scores range from 0 to 1; the closer a score is to 1, the more important the word is to the document.
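To make the arithmetic concrete, here is a small hand computation of the textbook TF-IDF score. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization, so its numbers will differ from these:

```python
# Textbook TF-IDF computed by hand, for illustration only.
# scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes rows,
# so it will not reproduce these exact numbers.
import math

documents = [['food', 'bad'],
             ['bad', 'service', 'bad', 'food'],
             ['food', 'good']]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)                      # term frequency
    df = sum(1 for d in docs if word in d) / len(docs)   # document frequency
    idf = math.log(1 / df)                               # inverse document frequency
    return tf * idf

# 'bad' appears in 2 of 3 documents, twice in the second (4-word) document:
print(tf_idf('bad', documents[1], documents))   # 0.5 * ln(3/2) ≈ 0.2027
print(tf_idf('food', documents[1], documents))  # ln(1) = 0: it appears everywhere
```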
Let's use the same cleaned corpus from the previous example and perform vectorization using TF-IDF.

```python
corpus = ['food bad',
          'bad service bad food',
          'food good',
          'good service good food.',
          'service bad food good.']
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tf_idf = TfidfVectorizer()
vectors = tf_idf.fit_transform(corpus)
print(pd.DataFrame(vectors.toarray(), columns=tf_idf.get_feature_names_out()))
```

*[2.1] Combined output of TF-IDF*

According to the TF-IDF scores, 'bad' and 'good' are the most important words across the whole corpus, which makes intuitive sense: since our data consists of reviews, words that carry sentiment (good, bad, and so on) tend to have the highest impact.

Unlike a plain bag of words, TF-IDF does not rely on frequency alone; it also captures how important each word is to the document.

## 3. Word2Vec

Word2Vec is a word embedding technique that uses neural networks to convert words into vectors in such a way that semantically similar words end up close to each other in N-dimensional space, where N refers to the dimensionality of the vectors.
This technique was first implemented by [Tomas Mikolov](https://en.wikipedia.org/wiki/Tomas_Mikolov) at [Google](https://en.wikipedia.org/wiki/Google) back in 2013.

Word2Vec has the ability to maintain semantic relations between words. A simple example: if we take the "king" vector, subtract the "man" vector, and add the "woman" vector, we get a vector close to the "queen" vector in N-dimensional space.

```
king - man + woman = queen
```
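You can reproduce this analogy with pretrained vectors. Here is a minimal sketch using gensim's downloader and a small GloVe model; the model choice is ours, and any pretrained `KeyedVectors` would do:

```python
# Sketch: the king - man + woman analogy with pretrained embeddings.
# Requires an internet connection for the first download; the GloVe model
# below is just a small, convenient choice.
import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-50')  # small pretrained KeyedVectors
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# 'queen' is expected at or near the top of this list.
```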
There are two ways to implement Word2Vec, Skip-Gram and CBOW, both of which we'll cover below.

## 1. Skip-Gram

Skip-Gram tries to predict several context words from a single input word.

*[3.1] Skip-gram model architecture*

Here w[i] is the input word at position 'i' in the sentence. The output of the model contains the two preceding and the two succeeding words with respect to position 'i'.

## 2. CBOW

CBOW stands for Continuous Bag of Words. It is trained to predict a single word from a series of context words, making it the mirror image of the Skip-Gram technique.
*[3.2] CBOW model architecture*

Both techniques work well and can generate vectors that capture semantic similarity. Skip-Gram works well with small datasets and can represent rare words well. However, CBOW trains faster and better represents frequent words. According to the original paper, CBOW takes a few hours to train whereas Skip-Gram needs a few days to learn the patterns in the data.
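In gensim, switching between the two architectures is a single constructor flag. A minimal sketch, using gensim 4 parameter names and illustrative values:

```python
# Sketch: choosing the Word2Vec architecture in gensim (version 4 naming).
# `sentences` is a list of token lists; the parameter values are illustrative.
from gensim.models import Word2Vec

sentences = [['food', 'bad'], ['good', 'service', 'good', 'food']]

cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)      # CBOW (default)
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)  # Skip-Gram
```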
Let's train a Word2Vec model on a real dataset of tweets:

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

## Loading the dataset from the github repository raw url
df = pd.read_csv('https://raw.githubusercontent.com/Abhayparashar31/datasets/master/twitter.csv')

## Cleaning the text with an external cleaning function.
## 'clean_text' is a separate python file containing a function for cleaning
## this data. You can find it here:
## https://gist.github.com/Abhayparashar31/81997c2e2268338809c46a220d08649f
corpus = []
for i in range(0, len(df)):   # we have 1000 reviews
    corpus.append(clean_text(df['text'][i]))
corpus_splitted = [i.split() for i in corpus]

## Generating word embeddings
from gensim import models
w2v = models.Word2Vec(corpus_splitted)

## Vector representation of the word 'flood'
## (in gensim 4 the vectors live on the model's .wv attribute)
print(w2v.wv['flood'])

## 5 most similar words to the word 'flood'
print(w2v.wv.most_similar('flood')[:5])
```

*[3.3] Vectorization of the word 'flood'*
width=\"333\" height=\"126\"><\/figure><div class=\"mf mg rl\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:666\/format:webp\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 666w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 333px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:666\/1*yQ0Yl7NsdPGmWIP1JTXcrQ.png 666w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 333px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mu mv mw mf mg mx my be b bf z dw\" data-selectable-paragraph=\"\">[3.4] \u2014 5 Most similar words to the word \u2018flood\u2019<\/figcaption>\n<\/figure>\n<p id=\"c162\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">The default dimension of word embeddings is 100. we can increase or decrease this number by changing <code class=\"cw pn po pp pq b\">vector_size<\/code> parameter. You can find more details about different parameters <a class=\"af mz\" href=\"https:\/\/radimrehurek.com\/gensim\/models\/word2vec.html#:~:text=KeyedVectors.load_word2vec_format().-,Parameters\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be qi\">here<\/strong><\/a>.<\/p>\n<p id=\"c321\" class=\"pw-post-body-paragraph na nb fp be b gn nc nd ne gq nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu fi bj\" data-selectable-paragraph=\"\">Word2Vec offers different built-in functions to get more insights into the data. 
<code class=\"cw pn po pp pq b\">most_simiar<\/code> is one of them that extracts words that have the highest semantic similarity to the input word. Picture <em class=\"qo\">3.4 <\/em>is a screenshot of the 5 most similar words to the word \u2018<em class=\"qo\">flood<\/em>\u2019 based on our data. If we validate the output words [<em class=\"qo\">injury, death, people, world, life<\/em>] are somehow related to \u2018<em class=\"qo\">flood\u2019<\/em>. This clearly indicates that our algorithm has done its job.<\/p>\n<h2 id=\"1a12\" class=\"ow nw fp be nx ox oy oz oa pa pb pc od ni pd pe pf nm pg ph pi nq pj pk pl pm bj\" data-selectable-paragraph=\"\">Recommended readings<\/h2>\n<pre class=\"mi mj mk ml mm pr pq ps pt ax pu bj\"><span id=\"8871\" class=\"ow nw fp pq b ia pv pw l iq px\" data-selectable-paragraph=\"\">1. <strong class=\"pq fq\">Efficient Estimation of Word Representations in Vector Space<\/strong>\n[<a class=\"af mz\" href=\"https:\/\/arxiv.org\/pdf\/1301.3781.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"qo\">word2vec original paper<\/em><\/a>]<\/span><span id=\"0fe8\" class=\"ow nw fp pq b ia rm pw l iq px\" data-selectable-paragraph=\"\">2. <strong class=\"pq fq\">Text Similarity in Vector Space Models: A Comparative Study<\/strong>\n[<a class=\"af mz\" href=\"https:\/\/arxiv.org\/pdf\/1810.00664.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">By Omid Shahmirzadi and others<\/a>]<\/span><\/pre>\n<h2 id=\"2972\" class=\"ow nw fp be nx ox oy oz oa pa pb pc od ni pd pe pf nm pg ph pi nq pj pk pl pm bj\" data-selectable-paragraph=\"\">Conclusion<\/h2>\n<p id=\"8f95\" class=\"pw-post-body-paragraph na nb fp be b gn or nd ne gq os ng nh ni ot nk nl nm ou no np nq ov ns nt nu fi bj\" data-selectable-paragraph=\"\">Vectorization is an important step that is mostly gets ignored by most beginner data scientists. It is a crucial phase of text cleaning that converts your text data into vectors. Later, these vectors are used for training a machine learning algorithm for generating useful predictions. In short, we can say that vectorization is as important as removing unwanted data from the raw text or training an ml model on the data.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Surendran MP on Unsplash Natural language processing is a subfield of artificial intelligence that combines computational linguistics, statistics, machine learning, and deep learning models to allow computers to process human language and understand its context, intent, and sentiment. 