{"id":9067,"date":"2024-01-30T06:00:54","date_gmt":"2024-01-30T14:00:54","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9067"},"modified":"2025-04-24T17:03:23","modified_gmt":"2025-04-24T17:03:23","slug":"the-tokenization-concept-in-nlp-using-python","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\/","title":{"rendered":"The Tokenization Concept in NLP Using\u00a0Python"},"content":{"rendered":"\n<section class=\"section section--body\">\n<p class=\"section-divider\">Tokenization is one of the main concepts of NLP. By definition, it is the process of breaking down given text in natural language processing into the smallest unit in a sentence, called a token. The smallest unit can be considered a word, not an individual character. A sentence\u2019s lowest and smallest unit can be regarded as a word and separate special characters like an exclamation, dot, etc.<\/p>\n<p class=\"section-divider\">In this article, we will learn the following kinds of tokenization:<\/p>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<ul class=\"postList\">\n<li class=\"graf graf--li\">Text Split<\/li>\n<li class=\"graf graf--li\">Sentence Tokenization<\/li>\n<li class=\"graf graf--li\">Word Tokenization<\/li>\n<\/ul>\n<p class=\"graf graf--p\">Let&#8217;s start with our first tokenization type.<\/p>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Text Split Example:<\/strong><\/p>\n<p class=\"graf graf--p\">Let us write the text:<code class=\"markup--code markup--p-code\">\u201cHi Everyone! We are learning NLP.\u201d<\/code> These are the words we have and the whole text. 
If we split the words up based only on spaces, the result will look like this:<\/p>\n<figure class=\"graf graf--figure\"><img loading=\"lazy\" decoding=\"async\" class=\"graf-image alignnone\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png\" alt=\"code screenshot of tokenization\" width=\"800\" height=\"720\" data-image-id=\"1*YgsRYkI4kV9Fc-gyuOEWlw.png\" data-width=\"1224\" data-height=\"1102\" data-is-featured=\"true\"><figcaption class=\"imageCaption\">Example Text&nbsp;Split<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p class=\"graf graf--p\">In the code above, we simply split the text on periods and spaces. As the screenshot shows, each word becomes a separate token, such as <code class=\"markup--code markup--p-code\">\u201cHi\u201d<\/code> and <code class=\"markup--code markup--p-code\">\u201cEveryone\u201d<\/code>.<\/p>\n<p class=\"graf graf--p\">Notice, however, that a plain split does not separate special characters: the exclamation mark stays attached to its word, so the entire expression <code class=\"markup--code markup--p-code\">\u201cEveryone!\u201d<\/code> ends up as a single token.<\/p>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Example of Sentence Tokenization:&nbsp;<\/strong><\/p>\n<p class=\"graf graf--p\">Sentence tokenization separates a text into its individual sentences. 
Let&#8217;s see an example with code:<\/p>\n<figure class=\"graf graf--figure\"><img loading=\"lazy\" decoding=\"async\" class=\"graf-image alignnone\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*F2RbeJs4qNCIOttMOUQ52A.png\" alt=\"code screenshot of tokenization\" width=\"800\" height=\"741\" data-image-id=\"1*F2RbeJs4qNCIOttMOUQ52A.png\" data-width=\"1226\" data-height=\"1136\">\n<figcaption class=\"imageCaption\">Example Sentence Tokenization<\/figcaption>\n<\/figure>\n<p>&nbsp;<\/p>\n<p class=\"graf graf--p\">In the above code, we have imported <code class=\"markup--code markup--p-code\">sent_tokenize<\/code> and <code class=\"markup--code markup--p-code\">word_tokenize<\/code> from <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.nltk.org\/install.html\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.nltk.org\/install.html\"><strong class=\"markup--strong markup--p-strong\"><em class=\"markup--em markup--p-em\">nltk<\/em><\/strong><\/a> and passed in our text. Here we display only the sentence tokens.<\/p>\n<p class=\"graf graf--p\">You can see that the text is split into the corresponding sentences, <strong class=\"markup--strong markup--p-strong\">Hi Everyone!<\/strong> and <strong class=\"markup--strong markup--p-strong\">We are learning NLP. <\/strong>This is how you can break a larger text, such as an article or a set of paragraphs, into a list of sentences.<\/p>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Example of Word Tokenization:<\/strong><\/p>\n<p class=\"graf graf--p\">Let\u2019s explore the process of breaking down that sentence into individual words. 
This is a more granular level of tokenization, in which each token corresponds to an individual word (or special character) of the given text.<\/p>\n<p class=\"graf graf--p\">Let&#8217;s jump to the code:<\/p>\n<figure class=\"graf graf--figure\"><img loading=\"lazy\" decoding=\"async\" class=\"graf-image alignnone\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*ocfnlBUcrAu-2iukTF6eAg.png\" alt=\"code screenshot of tokenization\" width=\"800\" height=\"741\" data-image-id=\"1*ocfnlBUcrAu-2iukTF6eAg.png\" data-width=\"1226\" data-height=\"1136\">\n<figcaption class=\"imageCaption\">Example Word Tokenization<\/figcaption>\n<\/figure>\n<p>&nbsp;<\/p>\n<p class=\"graf graf--p\">Now, let\u2019s execute this. You can clearly see the contrast between a straightforward split on spaces and <code class=\"markup--code markup--p-code\">word_tokenize<\/code>: the former leaves the exclamation mark attached to its word, while the latter treats \u201cHi,\u201d \u201cEveryone,\u201d and the exclamation mark as three separate tokens. Treating special characters as distinct tokens makes it possible to match words accurately against a dictionary.<\/p>\n<p class=\"graf graf--p\">This makes the text easier to process and helps filter out unnecessary, meaningless tokens. That&#8217;s why tokenization appears in the pre-processing step of virtually every NLP project.<\/p>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<h3 class=\"graf graf--h3\">Conclusion<\/h3>\n<p class=\"graf graf--p\">This article taught you the concept of tokenization in NLP. 
We discussed splitting the text, sentence tokenization, and word tokenization.<\/p>\n<p class=\"graf graf--p\">If you want to explore tokenization further, check out this source:<\/p>\n<ul>\n<li class=\"graf graf--mixtapeEmbed\"><a class=\"markup--anchor markup--mixtapeEmbed-anchor\" title=\"https:\/\/www.tokenex.com\/blog\/ab-what-is-nlp-natural-language-processing-tokenization\/\" href=\"https:\/\/www.tokenex.com\/blog\/ab-what-is-nlp-natural-language-processing-tokenization\/\" data-href=\"https:\/\/www.tokenex.com\/blog\/ab-what-is-nlp-natural-language-processing-tokenization\/\"><strong class=\"markup--strong markup--mixtapeEmbed-strong\">What is NLP (Natural Language Processing) Tokenization?<\/strong><br>\n<em class=\"markup--em markup--mixtapeEmbed-em\">Natural Language Processing (NLP) enables machine learning algorithms to organize and understand human language.<\/em><\/a><\/li>\n<\/ul>\n<p class=\"graf graf--p\">I hope this article was helpful. If you think something is missing, have questions, or would like to offer any thoughts or suggestions, go ahead and leave a comment below.<\/p>\n<p class=\"graf graf--p\">I&#8217;ve written some other content as well, and if you liked what you read here, you&#8217;ll probably also enjoy my <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/daniamjad.medium.com\/\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/daniamjad.medium.com\/\">Medium page<\/a>.<\/p>\n<p class=\"graf graf--p\">Sharing (knowledge) is caring \ud83d\ude0a Thanks for reading this article. Be sure to clap or recommend this article if you found it helpful. 
It means a lot to me.<\/p>\n<p class=\"graf graf--p\">If you need any help, join me on <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/twitter.com\/DanishAmjad10\" target=\"_blank\" rel=\"noopener ugc nofollow\" data-href=\"https:\/\/twitter.com\/DanishAmjad10\">Twitter<\/a>, <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.linkedin.com\/in\/danish-amjad-06a43090\/\" target=\"_blank\" rel=\"noopener ugc nofollow\" data-href=\"https:\/\/www.linkedin.com\/in\/danish-amjad-06a43090\/\">LinkedIn<\/a>, <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/DanishAmjad12\" target=\"_blank\" rel=\"noopener ugc nofollow\" data-href=\"https:\/\/github.com\/DanishAmjad12\">GitHub<\/a>, and subscribe to my <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.youtube.com\/channel\/UC06GphxCS1gzZhdT9dn6kQA?view_as=subscriber\" target=\"_blank\" rel=\"noopener ugc nofollow\" data-href=\"https:\/\/www.youtube.com\/channel\/UC06GphxCS1gzZhdT9dn6kQA?view_as=subscriber\">YouTube Channel<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Tokenization is one of the main concepts of NLP. By definition, it is the process of breaking down given text in natural language processing into the smallest unit in a sentence, called a token. The smallest unit can be considered a word, not an individual character. 
A sentence\u2019s lowest and smallest unit can be regarded [&hellip;]<\/p>\n","protected":false},"author":30,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[111],"class_list":["post-9067","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The Tokenization Concept in NLP Using\u00a0Python - Comet<\/title>\n<meta name=\"description\" content=\"Learn about tokenization in NLP, discussing splitting the text, sentence tokenization, and word tokenization.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Tokenization Concept in NLP Using\u00a0Python\" \/>\n<meta property=\"og:description\" content=\"Learn about tokenization in NLP, discussing splitting the text, sentence tokenization, and word tokenization.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-30T14:00:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:03:23+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png\" \/>\n<meta name=\"author\" content=\"Danish Amjad\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Danish Amjad\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"The Tokenization Concept in NLP Using\u00a0Python - Comet","description":"Learn about tokenization in NLP, discussing splitting the text, sentence tokenization, and word tokenization.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python","og_locale":"en_US","og_type":"article","og_title":"The Tokenization Concept in NLP Using\u00a0Python","og_description":"Learn about tokenization in NLP, discussing splitting the text, sentence tokenization, and word tokenization.","og_url":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-01-30T14:00:54+00:00","article_modified_time":"2025-04-24T17:03:23+00:00","og_image":[{"url":"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png","type":"","width":"","height":""}],"author":"Danish Amjad","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Danish Amjad","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\/"},"author":{"name":"Danish Amjad","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/390a28cbafbb9f3eff1d8af3eebe1242"},"headline":"The Tokenization Concept in NLP Using\u00a0Python","datePublished":"2024-01-30T14:00:54+00:00","dateModified":"2025-04-24T17:03:23+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\/"},"wordCount":556,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python\/","url":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python","name":"The Tokenization Concept in NLP Using\u00a0Python - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png","datePublished":"2024-01-30T14:00:54+00:00","dateModified":"2025-04-24T17:03:23+00:00","description":"Learn about tokenization in NLP, discussing splitting the text, sentence tokenization, and word 
tokenization.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#primaryimage","url":"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png","contentUrl":"https:\/\/cdn-images-1.medium.com\/max\/800\/1*YgsRYkI4kV9Fc-gyuOEWlw.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/the-tokenization-concept-in-nlp-using-python#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"The Tokenization Concept in NLP Using\u00a0Python"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, 
Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/390a28cbafbb9f3eff1d8af3eebe1242","name":"Danish Amjad","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/52f0c6afeaa011255a815a1584fe9e6f","url":"https:\/\/secure.gravatar.com\/avatar\/67a8202dec3be1f1cbd3b7997971e45865e8c4b9aab1c83113ce81fde4e9694d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/67a8202dec3be1f1cbd3b7997971e45865e8c4b9aab1c83113ce81fde4e9694d?s=96&d=mm&r=g","caption":"Danish Amjad"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/danishamjad\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9067","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/30"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9067"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9067\/revisions"}],"predecessor-version":[{"id":15396,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9067\/revisions\/15396"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9067"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9067"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9067"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/si
te\/wp-json\/wp\/v2\/coauthors?post=9067"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}