{"id":7413,"date":"2023-09-11T09:23:30","date_gmt":"2023-09-11T17:23:30","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7413"},"modified":"2025-04-24T17:14:17","modified_gmt":"2025-04-24T17:14:17","slug":"tokenization-techniques-in-nlp","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/","title":{"rendered":"Tokenization Techniques in NLP"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<figure class=\"lw lx ly lz ma mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU\" alt=\"\" width=\"700\" height=\"406\"><\/figure><div class=\"lt lu lv\"><picture><\/picture><\/div>\n<\/div><figcaption class=\"mi mj mk lt lu ml mm be b bf z dv\" data-selectable-paragraph=\"\">Image by <a class=\"af mn\" href=\"https:\/\/www.kaggle.com\/code\/shivanirana63\/beginner-s-guide-to-word-tokenization\" target=\"_blank\" rel=\"noopener ugc nofollow\">Shivani Rana<\/a> and <a class=\"af mn\" href=\"https:\/\/towardsdatascience.com\/tokenization-for-natural-language-processing-a179a891bad4\" target=\"_blank\" rel=\"noopener\">Srinivas Chakravarthy<\/a><\/figcaption><\/figure>\n<p id=\"db1d\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">When you learn about a specific topic, there\u2019s always more to it. In the world of data, there\u2019s always more math behind it, more processes to unravel, and more to learn. 
In this article, I will go through the different techniques you can use for tokenization in NLP.<\/p>\n<p id=\"5bf8\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">Let\u2019s start off with some definitions\u2026<\/p>\n<h1 id=\"4302\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Definitions<\/h1>\n<p id=\"e9d8\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\"><strong class=\"be oo\">Natural Language Processing (NLP)<\/strong> is the ability of a computer, piece of software, or application to detect and understand human language through speech and text, just as we humans can.<\/p>\n<p id=\"5b5a\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\"><strong class=\"be oo\">Tokenization<\/strong> is the process of splitting paragraphs and sentences into smaller units that can be more easily processed by NLP models. 
These smaller units are called tokens.<\/p>\n<p id=\"d3a7\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">A <strong class=\"be oo\">delimiter<\/strong> is a sequence of one or more characters that marks the beginning or end of a unit of data.<\/p>\n<h1 id=\"7b4d\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Tokenization Techniques<\/h1>\n<p id=\"793b\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">If text is split into words, it is called <strong class=\"be oo\">word tokenization<\/strong>. If text is split into sentences, it is called <strong class=\"be oo\">sentence tokenization<\/strong>.<\/p>\n<p id=\"fa5b\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">Tokenization is the first step in an NLP pipeline, so it has a domino effect on everything downstream. Counting tokens tells us how frequently a particular word occurs in the data, and those counts can be used directly as a vector representing the data. 
In this way, the text goes from an unstructured string to a numerical data structure, which helps the rest of your NLP pipeline run smoothly.<\/p>\n<p id=\"7641\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">There are different types of tokenization techniques, so let\u2019s dive in!<\/p>\n<h1 id=\"80be\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">White Space Tokenization<\/h1>\n<p id=\"a754\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">This is one of the simplest tokenization techniques: it uses the whitespace within the string as the delimiter between words, splitting the text wherever whitespace occurs.<\/p>\n<p id=\"ba92\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">Although it is one of the simplest and fastest techniques, it only works well for languages that use whitespace to separate words, such as English.<\/p>\n<p id=\"88b3\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">This technique can be easily executed using Python\u2019s built-in string methods, such as <code class=\"cw ov ow ox oy b\">str.split()<\/code>.<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*0pxTg0t9Z_Nh-dAI\" alt=\"\" width=\"700\" height=\"259\"><\/figure><div class=\"lt lu op\"><picture><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*0pxTg0t9Z_Nh-dAI 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*0pxTg0t9Z_Nh-dAI 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*0pxTg0t9Z_Nh-dAI 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*0pxTg0t9Z_Nh-dAI 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*0pxTg0t9Z_Nh-dAI 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*0pxTg0t9Z_Nh-dAI 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*0pxTg0t9Z_Nh-dAI 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*0pxTg0t9Z_Nh-dAI 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*0pxTg0t9Z_Nh-dAI 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*0pxTg0t9Z_Nh-dAI 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*0pxTg0t9Z_Nh-dAI 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*0pxTg0t9Z_Nh-dAI 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*0pxTg0t9Z_Nh-dAI 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*0pxTg0t9Z_Nh-dAI 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 
(-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"a043\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Regular Expression Tokenizer<\/h1>\n<p id=\"c823\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">The regular expression tokenizer is a rule-based tokenizer technique and should be used when other techniques are not serving a specific purpose. For example, there may be punctuation in the text that is causing the data to be unclean, therefore it uses regular expression to split a string into substrings.<\/p>\n<p id=\"b4d3\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">This can be easily executed with <a class=\"af mn\" href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">NLTK<\/a>, a Python toolkit built for working with NLP. The <code class=\"cw ov ow ox oy b\">nltk.tokenize.regexp<\/code> module splits the string into substrings. 
For example:<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*qKHintTDzw4ut3y5\" alt=\"\" width=\"700\" height=\"388\"><\/figure><div class=\"lt lu oz\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*qKHintTDzw4ut3y5 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*qKHintTDzw4ut3y5 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*qKHintTDzw4ut3y5 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*qKHintTDzw4ut3y5 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*qKHintTDzw4ut3y5 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*qKHintTDzw4ut3y5 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*qKHintTDzw4ut3y5 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*qKHintTDzw4ut3y5 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*qKHintTDzw4ut3y5 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*qKHintTDzw4ut3y5 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*qKHintTDzw4ut3y5 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*qKHintTDzw4ut3y5 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*qKHintTDzw4ut3y5 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*qKHintTDzw4ut3y5 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"92e3\" class=\"nl nm fo be nn no ps nq nr ns pt nu nv nw pu ny nz oa pv oc od oe pw og oh oi bj\" data-selectable-paragraph=\"\">Penn TreeBank Tokenization<\/h1>\n<p id=\"508c\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">The Penn TreeBank Tokenization reads in raw text and outputs tokens of classes based on <a class=\"af mn\" href=\"https:\/\/paperswithcode.com\/dataset\/penn-treebank\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be oo\">Penn Treebank<\/strong><\/a>, one of the largest treebanks published. 
A treebank provides the syntactic and semantic annotation of a language.<\/p>\n<p id=\"d70d\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">Like the technique above, it uses regular expressions to tokenize text, following the tokenization conventions used in the Penn Treebank, and it is also part of the <a class=\"af mn\" href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">NLTK<\/a> Python toolkit. For example, you can see the slight difference between the example above and the one below:<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*K_47eN3SOa9GxmYC\" alt=\"\" width=\"700\" height=\"337\"><\/figure><div class=\"lt lu px\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*K_47eN3SOa9GxmYC 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*K_47eN3SOa9GxmYC 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*K_47eN3SOa9GxmYC 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*K_47eN3SOa9GxmYC 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*K_47eN3SOa9GxmYC 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*K_47eN3SOa9GxmYC 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*K_47eN3SOa9GxmYC 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and 
<source srcset">
(max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*K_47eN3SOa9GxmYC 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*K_47eN3SOa9GxmYC 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*K_47eN3SOa9GxmYC 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*K_47eN3SOa9GxmYC 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*K_47eN3SOa9GxmYC 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*K_47eN3SOa9GxmYC 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*K_47eN3SOa9GxmYC 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"a99c\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">SpaCy<\/h1>\n<p id=\"7164\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\"><a class=\"af mn\" href=\"https:\/\/spacy.io\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">SpaCy<\/a> is an open-source Python library that provides great flexibility. It offers a modern approach to tokenization in NLP that is fast and easy to customize. It can process large volumes of text without you having to write the segmentation rules yourself. 
For example:<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*wJid1Z4gth03EUNl\" alt=\"\" width=\"700\" height=\"517\"><\/figure><div class=\"lt lu py\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*wJid1Z4gth03EUNl 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*wJid1Z4gth03EUNl 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*wJid1Z4gth03EUNl 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*wJid1Z4gth03EUNl 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*wJid1Z4gth03EUNl 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*wJid1Z4gth03EUNl 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*wJid1Z4gth03EUNl 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*wJid1Z4gth03EUNl 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*wJid1Z4gth03EUNl 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*wJid1Z4gth03EUNl 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*wJid1Z4gth03EUNl 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*wJid1Z4gth03EUNl 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*wJid1Z4gth03EUNl 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*wJid1Z4gth03EUNl 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"dfed\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Moses Tokenizer<\/h1>\n<p id=\"7131\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">The Moses tokenizer, like SpaCy, is a rule-based tokenizer. It separates punctuation from words while preserving special tokens such as dates, and it normalizes characters as part of its segmentation logic. 
For example:<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<div class=\"mc md eb me bg mf\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*_zbaz_9mwRUSM3oS\" alt=\"\" width=\"700\" height=\"365\"><\/figure><div class=\"lt lu pz\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*_zbaz_9mwRUSM3oS 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*_zbaz_9mwRUSM3oS 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*_zbaz_9mwRUSM3oS 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*_zbaz_9mwRUSM3oS 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*_zbaz_9mwRUSM3oS 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*_zbaz_9mwRUSM3oS 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*_zbaz_9mwRUSM3oS 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*_zbaz_9mwRUSM3oS 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*_zbaz_9mwRUSM3oS 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*_zbaz_9mwRUSM3oS 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*_zbaz_9mwRUSM3oS 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*_zbaz_9mwRUSM3oS 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*_zbaz_9mwRUSM3oS 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*_zbaz_9mwRUSM3oS 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"c4da\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Subword Tokenization<\/h1>\n<p id=\"07d1\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">Subword tokenization is based on the idea that frequently occurring words, such as \u201cthere\u201d and \u201chelping\u201d, should be kept whole in the vocabulary, while less common words should be split into more frequent subwords. For example, the word \u201creiterate\u201d can be split into the subwords \u201cre\u201d and \u201citerate\u201d. These subwords occur more frequently than the full word, and the meaning is still kept intact. 
Each subword is assigned a unique ID, so that the model can learn from it during the training phase.<\/p>\n<p id=\"ff2a\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">There are different types of subword tokenization:<\/p>\n<ul class=\"\">\n<li id=\"73e3\" class=\"mo mp fo be b mq mr ms mt mu mv mw mx my qa na nb nc qb ne nf ng qc ni nj nk qd qe qf bj\" data-selectable-paragraph=\"\">Byte-Pair Encoding (BPE)<\/li>\n<li id=\"e92c\" class=\"mo mp fo be b mq qg ms mt mu qh mw mx my qi na nb nc qj ne nf ng qk ni nj nk qd qe qf bj\" data-selectable-paragraph=\"\">WordPiece<\/li>\n<li id=\"a59f\" class=\"mo mp fo be b mq qg ms mt mu qh mw mx my qi na nb nc qj ne nf ng qk ni nj nk qd qe qf bj\" data-selectable-paragraph=\"\">Unigram Language Model<\/li>\n<li id=\"992f\" class=\"mo mp fo be b mq qg ms mt mu qh mw mx my qi na nb nc qj ne nf ng qk ni nj nk qd qe qf bj\" data-selectable-paragraph=\"\">SentencePiece<\/li>\n<\/ul>\n<h2 id=\"947d\" class=\"ql nm fo be nn qm qn qo nr qp qq qr nv my qs qt qu nc qv qw qx ng qy qz ra rb bj\" data-selectable-paragraph=\"\">Byte-Pair Encoding<\/h2>\n<p id=\"85e6\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">BPE was first described in the article \u201c<a class=\"af mn\" href=\"https:\/\/www.derczynski.com\/papers\/archive\/BPE_Gage.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">A New Algorithm for Data Compression<\/a>\u201d, which was published in 1994. 
In its original form, BPE is a data compression algorithm in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur in that data. Applied to tokenization, the same idea is used to repeatedly merge the most frequent pair of adjacent symbols into a new vocabulary entry.<\/p>\n<h2 id=\"338a\" class=\"ql nm fo be nn qm qn qo nr qp qq qr nv my qs qt qu nc qv qw qx ng qy qz ra rb bj\" data-selectable-paragraph=\"\">WordPiece<\/h2>\n<p id=\"e860\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">WordPiece is a tokenization algorithm developed by Google and used to pre-train BERT. It first pre-tokenizes the text into words by splitting on punctuation and whitespace. It then tokenizes each word into subword units, which are called wordpieces.<\/p>\n<figure class=\"oq or os ot ou mb lt lu paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mg mh c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*ZMXUnI9YxBb1lOTm\" alt=\"\" width=\"640\" height=\"230\"><\/figure><div class=\"lt lu rc\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*ZMXUnI9YxBb1lOTm 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*ZMXUnI9YxBb1lOTm 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*ZMXUnI9YxBb1lOTm 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*ZMXUnI9YxBb1lOTm 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*ZMXUnI9YxBb1lOTm 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*ZMXUnI9YxBb1lOTm 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1280\/0*ZMXUnI9YxBb1lOTm 1280w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 
700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 640px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*ZMXUnI9YxBb1lOTm 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*ZMXUnI9YxBb1lOTm 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*ZMXUnI9YxBb1lOTm 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*ZMXUnI9YxBb1lOTm 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*ZMXUnI9YxBb1lOTm 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*ZMXUnI9YxBb1lOTm 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1280\/0*ZMXUnI9YxBb1lOTm 1280w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 640px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mi mj mk lt lu ml mm be b bf z dv\" data-selectable-paragraph=\"\">Source: <a class=\"af mn\" href=\"https:\/\/ai.googleblog.com\/2021\/12\/a-fast-wordpiece-tokenization-system.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">GoogleBlog<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"32d0\" class=\"ql nm fo be nn qm qn qo nr qp qq qr nv my qs qt qu nc qv qw qx ng qy qz ra rb bj\" data-selectable-paragraph=\"\">Unigram Language Model<\/h2>\n<p id=\"4116\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">The Unigram language model considers each token independent irrespective of the tokens before it. 
The model starts with a large base vocabulary of symbols and progressively trims it down to a smaller vocabulary. It relies on how often a specific token appears relative to all the tokens in the training text.<\/p>\n<p id=\"7ce7\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">During the training phase, the Unigram language model computes a loss (based on the log-likelihood) over the training data and, for each symbol in the vocabulary, how much that overall loss would increase if the symbol were removed. The model then removes the symbols with the lowest loss increase and repeats this until it reaches its desired vocabulary size.<\/p>\n<p id=\"9183\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">It is important to know that the vocabulary will always contain the base characters so that any word can be tokenized.<\/p>\n<h2 id=\"ad7a\" class=\"ql nm fo be nn qm qn qo nr qp qq qr nv my qs qt qu nc qv qw qx ng qy qz ra rb bj\" data-selectable-paragraph=\"\">SentencePiece<\/h2>\n<p id=\"02cc\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">Sounds very similar to WordPiece, right? However, SentencePiece is not a single tokenization algorithm: it is a language-independent library that implements subword algorithms such as BPE and the Unigram language model, and it learns its vocabulary from a corpus you supply. 
It also supports subword regularization, which samples different segmentations of the same text during training.<\/p>\n<p id=\"22a5\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">SentencePiece treats each sentence simply as a sequence of Unicode characters and then trains tokenization and detokenization models directly from the raw sentences.<\/p>\n<h1 id=\"cddd\" class=\"nl nm fo be nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi bj\" data-selectable-paragraph=\"\">Wrapping it up<\/h1>\n<p id=\"463e\" class=\"pw-post-body-paragraph mo mp fo be b mq oj ms mt mu ok mw mx my ol na nb nc om ne nf ng on ni nj nk fh bj\" data-selectable-paragraph=\"\">It\u2019s interesting how something that comes to us naturally can be processed in so many different ways by computers.<\/p>\n<p id=\"8a11\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">However, tokenization comes with its own challenges. The biggest challenge is the language input itself. Take English, for example: most words are separated by a space, and punctuation gives us a better understanding of the context. But not every language is like this. For example, written Mandarin does not mark the boundaries between words with spaces.<\/p>\n<p id=\"eed6\" class=\"pw-post-body-paragraph mo mp fo be b mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk fh bj\" data-selectable-paragraph=\"\">Language is difficult to learn, especially for computers&#8230;<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Image by Shivani Rana and Srinivas Chakravarthy When you learn about a specific topic, there\u2019s always more to it. In the world of data, there\u2019s always more math behind it, more processes to unravel, and more to learn. 
In this article, I will go through the different techniques you can use for tokenization in NLP. [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[139],"class_list":["post-7413","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Tokenization Techniques in NLP - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Tokenization Techniques in NLP\" \/>\n<meta property=\"og:description\" content=\"Image by Shivani Rana and Srinivas Chakravarthy When you learn about a specific topic, there\u2019s always more to it. In the world of data, there\u2019s always more math behind it, more processes to unravel, and more to learn. In this article, I will go through the different techniques you can use for tokenization in NLP. 
[&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-11T17:23:30+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU\" \/>\n<meta name=\"author\" content=\"Nisha Arya Ahmed\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nisha Arya Ahmed\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Tokenization Techniques in NLP - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/","og_locale":"en_US","og_type":"article","og_title":"Tokenization Techniques in NLP","og_description":"Image by Shivani Rana and Srinivas Chakravarthy When you learn about a specific topic, there\u2019s always more to it. In the world of data, there\u2019s always more math behind it, more processes to unravel, and more to learn. In this article, I will go through the different techniques you can use for tokenization in NLP. 
[&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-09-11T17:23:30+00:00","article_modified_time":"2025-04-24T17:14:17+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU","type":"","width":"","height":""}],"author":"Nisha Arya Ahmed","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Nisha Arya Ahmed","Est. reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/"},"author":{"name":"Team Comet Digital","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf"},"headline":"Tokenization Techniques in NLP","datePublished":"2023-09-11T17:23:30+00:00","dateModified":"2025-04-24T17:14:17+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/"},"wordCount":1134,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/","url":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/","name":"Tokenization Techniques in NLP - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU","datePublished":"2023-09-11T17:23:30+00:00","dateModified":"2025-04-24T17:14:17+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*pxP4JMEHzcgs-yoU"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/tokenization-techniques-in-nlp\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Tokenization Techniques in NLP"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/6266601170c60a7a82b3e0043fbe8ddf","name":"Team Comet Digital","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/4f0c0a8cc7c0e87c636ff6a420a6647c","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/Screen-Shot-2023-08-12-at-8.58.50-AM-96x96.png","caption":"Team Comet 
Digital"},"sameAs":["https:\/\/www.comet.ml\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/teamcometdigital\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7413","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7413"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7413\/revisions"}],"predecessor-version":[{"id":15553,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7413\/revisions\/15553"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7413"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7413"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7413"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7413"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}