{"id":7417,"date":"2023-09-11T09:53:44","date_gmt":"2023-09-11T17:53:44","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7417"},"modified":"2025-04-24T17:14:15","modified_gmt":"2025-04-24T17:14:15","slug":"bert-state-of-the-art-model-for-natural-language-processing","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/","title":{"rendered":"BERT: State-of-the-Art Model for Natural Language Processing"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"e930\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">BERT is one of the developments from the Google research team that shifted machine learning standards by demonstrating outstanding results on NLP tasks such as question answering in chatbot applications, machine translation, language understanding, next sentence prediction, and much more.<\/p>\n<h2 id=\"aca0\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">Overview<\/h2>\n<p id=\"1148\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning framework designed to help computers understand the context of ambiguous language and derive knowledge and patterns from sequences of words.<\/p>\n<p id=\"ce1e\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">What gives BERT an edge 
over other NLP models is its bidirectional training, in contrast to earlier models that read a text sequence in only one direction, either left to right or right to left. The BERT model is pre-trained on the Wikipedia corpus and the BooksCorpus. It can later be fine-tuned and adapted to the vocabulary and datasets of our choice.<\/p>\n<h2 id=\"edfb\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">Introduction to the BERT model<\/h2>\n<p id=\"f283\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">BERT is a transformer model that includes an attention mechanism. It learns contextual relations using a stack of encoders that derive feature representations, or embeddings, from the text.<\/p>\n<p id=\"f07a\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Bidirectional training helps BERT stand out from naive language models that cannot derive features from both directions of the text simultaneously; hence those models miss part of the correlation between words.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg\" alt=\"\" width=\"700\" height=\"404\"><\/figure><div class=\"nq nr ns\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*9LX_2r3W_4V1g6mU.jpg 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*9LX_2r3W_4V1g6mU.jpg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*9LX_2r3W_4V1g6mU.jpg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*9LX_2r3W_4V1g6mU.jpg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*9LX_2r3W_4V1g6mU.jpg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*9LX_2r3W_4V1g6mU.jpg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*9LX_2r3W_4V1g6mU.jpg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*9LX_2r3W_4V1g6mU.jpg 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div><figcaption class=\"of og oh 
nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Self-attention enabled embeddings generated from the BERT model [<a class=\"af ok\" href=\"https:\/\/www.geeksforgeeks.org\/explanation-of-bert-model-nlp\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a>]<\/figcaption><\/figure>\n<p id=\"d5b8\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">A <code class=\"cw ol om on oo b\">max_len<\/code> parameter bounds the length of the input sequence fed to the BERT model. If <code class=\"cw ol om on oo b\">max_len<\/code> is set to 512, every input reaches the input layer as exactly 512 tokens: shorter sequences are padded with padding tokens, and longer sequences are truncated to 512 tokens.<\/p>\n<p id=\"4ba3\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Deriving features from sequences longer than those seen during training is hard because contextual information is lost. In transformers, when the model makes a prediction, the attention blocks let it look across the whole sequence for the positions that carry the most relevant information. The attention weights act as a middleman between the input text and the output value. They are continuous scores rather than a binary mask: positions that carry more relevant information receive higher weights, while less relevant positions receive weights close to 0. 
BERT applies self-attention inside each encoder block so that the most relevant tokens receive more weight.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*qbs7ghGnbQgAGWGg.jpg\" alt=\"\" width=\"700\" height=\"273\"><\/figure><div class=\"nq nr ns\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*qbs7ghGnbQgAGWGg.jpg 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*qbs7ghGnbQgAGWGg.jpg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*qbs7ghGnbQgAGWGg.jpg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*qbs7ghGnbQgAGWGg.jpg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*qbs7ghGnbQgAGWGg.jpg 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*qbs7ghGnbQgAGWGg.jpg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*qbs7ghGnbQgAGWGg.jpg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*qbs7ghGnbQgAGWGg.jpg 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">BERT output as contextualized embeddings [<a class=\"af ok\" href=\"https:\/\/www.geeksforgeeks.org\/explanation-of-bert-model-nlp\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a>]<\/figcaption>\n<\/figure>\n<p id=\"ecf6\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The model takes a <code class=\"cw ol om on oo b\">CLS<\/code> token prepended with the input text, which acts as a classification token. Each of the encoder layers applies self-attention mechanism and passes the derived embeddings through a feedforward network. The base version of the model outputs a vector of size <em class=\"oq\">768 <\/em>which can be used later with a classifier model to predict the corresponding class labels.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<blockquote class=\"oz\"><p id=\"39b3\" class=\"pa pb fo be pc pd pe pf pg ph pi mp dv\" data-selectable-paragraph=\"\">Innovation and academia go hand-in-hand. 
<a class=\"af ok\" href=\"https:\/\/www.youtube.com\/watch?v=7XCsi64HLQ8\" target=\"_blank\" rel=\"noopener ugc nofollow\">Listen to our own CEO Gideon Mendels chat with the Stanford MLSys Seminar Series team about the future of MLOps and give the Comet platform a try for free<\/a>!<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"3677\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">LSTM vs BERT Transformer model<\/h2>\n<blockquote class=\"pj pk pl\"><p id=\"58be\" class=\"lt lu oq be b lv lw lx ly lz ma mb mc pm me mf mg pn mi mj mk po mm mn mo mp fh bj\" data-selectable-paragraph=\"\">LSTM is dead, long live transformers \ud83d\ude00<\/p><\/blockquote>\n<p id=\"1a03\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Traditional LSTM models have deep roots in the NLP world and have long been used to solve natural language processing tasks. However, LSTMs are slower to train because they consume words sequentially, one timestep at a time. LSTMs (or rather, RNNs in general) use sequential processing, i.e., they process text word by word. 
On the other hand, transformers use non-sequential processing and look at a complete sentence at once.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*mKc-hkPYdnn_dlxI.gif\" alt=\"\" width=\"640\" height=\"566\"><\/figure><div class=\"nq nr pp\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*mKc-hkPYdnn_dlxI.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*mKc-hkPYdnn_dlxI.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*mKc-hkPYdnn_dlxI.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*mKc-hkPYdnn_dlxI.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*mKc-hkPYdnn_dlxI.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*mKc-hkPYdnn_dlxI.gif 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1280\/0*mKc-hkPYdnn_dlxI.gif 1280w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 640px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*mKc-hkPYdnn_dlxI.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*mKc-hkPYdnn_dlxI.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*mKc-hkPYdnn_dlxI.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*mKc-hkPYdnn_dlxI.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*mKc-hkPYdnn_dlxI.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*mKc-hkPYdnn_dlxI.gif 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1280\/0*mKc-hkPYdnn_dlxI.gif 1280w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 640px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Encoder-decoder mechanism in Transformers [<a class=\"af ok\" href=\"https:\/\/ai.googleblog.com\/2017\/08\/transformer-novel-neural-network.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a>]<\/figcaption>\n<\/figure>\n<p id=\"74ff\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">On the other hand, transformers have much higher bandwidth and long-term memory. 
The model predicts the next word in the sequence from context vectors associated with the most relevant source positions, together with all the previously generated target words, using the <a class=\"af ok\" href=\"https:\/\/medium.com\/analytics-vidhya\/https-medium-com-understanding-attention-mechanism-natural-language-processing-9744ab6aed6a\" rel=\"noopener\">attention mechanism<\/a>.<\/p>\n<p id=\"40e8\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The BERT model, which builds directly on the transformer architecture, overcomes the two limitations of LSTMs noted above, slow sequential training and limited long-term memory, making the model faster and deeply bidirectional.<\/p>\n<h1 id=\"bbe6\" class=\"pq mr fo be ms pr ps pt mw pu pv pw na px py pz qa qb qc qd qe qf qg qh qi qj bj\" data-selectable-paragraph=\"\">Different learning strategies of BERT<\/h1>\n<h2 id=\"5650\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Masked Language Model (MLM)<\/strong><\/h2>\n<p id=\"a083\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">BERT takes in sentences in which random words have been replaced with [MASK] tokens and uses them as input text. During training, BERT attempts to predict the original words at the masked positions from the surrounding non-masked words. This helps the model learn the context in which words are used. 
The end goal of the BERT model is to predict the same sentence that was input, thereby filling in the masked tokens with the original words.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg\" alt=\"\" width=\"700\" height=\"514\"><\/figure><div class=\"nq nr qk\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 
750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*l8zo-INydmcuhW4c7Ug6mQ.jpeg 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Masked Language model in BERT [<a class=\"af ok\" href=\"https:\/\/www.geeksforgeeks.org\/understanding-bert-nlp\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be op\">Source<\/strong><\/a>]<\/figcaption>\n<\/figure>\n<p id=\"f637\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The model uses a \u2018softmax\u2019 layer to predict the word from the vocabulary that would best fill in the blank using probability.<\/p>\n<h2 id=\"ca83\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">Next Sentence Prediction<\/h2>\n<p id=\"7365\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">In this training technique, BERT takes a set of sentences as input and predicts if one sentence will logically follow the other. 
This helps BERT understand context across sentences and use attention to capture long-range dependencies.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*FGYOxtx-Wfd8H5sTl8sh4w.png\" alt=\"\" width=\"700\" height=\"221\"><\/figure><div class=\"nq nr ql\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*FGYOxtx-Wfd8H5sTl8sh4w.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Next sentence prediction training strategy followed by BERT [<a class=\"af ok\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a>]<\/figcaption>\n<\/figure>\n<p id=\"e27f\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The model distinguishes between the sets of sentences by the use of tokens which are fed to the model in combination with the input set of sentences:<\/p>\n<ol class=\"\">\n<li id=\"3384\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc md qm mf mg mh qn mj mk ml qo mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\">Two sentences are separated by a [<code class=\"cw ol om on oo b\">SEP<\/code>] token and start with a [<code class=\"cw ol om on oo b\">CLS<\/code>] token.<\/li>\n<li id=\"825b\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\">A sentence embedding indicating 
sentence A\/B is added to each token so that the model can tell the two sentences apart.<\/li>\n<li id=\"8f03\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\">Each token\u2019s positional embedding is added to indicate its position in the sequence, since self-attention on its own has no notion of word order.<\/li>\n<\/ol>\n<p id=\"873c\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">BERT is trained on the Masked Language Model and Next Sentence Prediction strategies simultaneously, with the aim of minimizing the combined loss of the two strategies and gaining a better understanding of the overall language.<\/p>\n<h2 id=\"a587\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">Why are pre-trained models like BERT better for NLP-related tasks?<\/h2>\n<p id=\"1ca5\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">Transfer learning is a technique of reusing a previously trained ML model on a new task, drawing on the knowledge the model gained from a related task.<\/p>\n<p id=\"33e6\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The main goal of using a pre-trained model is to solve a similar problem using the features derived by the previously trained model. 
Instead of building a model from scratch and going through all the initial stages of training the model, it&#8217;s often more efficient to re-use the model trained on another problem as a starting point.<\/p>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*TXb_t_Ywesrr_HTjWiwzTA.png\" alt=\"\" width=\"700\" height=\"346\"><\/figure><div class=\"nq nr qx\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*TXb_t_Ywesrr_HTjWiwzTA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*TXb_t_Ywesrr_HTjWiwzTA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*TXb_t_Ywesrr_HTjWiwzTA.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*TXb_t_Ywesrr_HTjWiwzTA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*TXb_t_Ywesrr_HTjWiwzTA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*TXb_t_Ywesrr_HTjWiwzTA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*TXb_t_Ywesrr_HTjWiwzTA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*TXb_t_Ywesrr_HTjWiwzTA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Training vs Performance of models following Transfer Learning [<a class=\"af ok\" href=\"https:\/\/www.researchgate.net\/figure\/Performance-graph-with-and-without-Transfer-Learning_fig2_345904103\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a>]<\/figcaption>\n<\/figure>\n<ol class=\"\">\n<li id=\"2150\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc md qm mf mg mh qn mj mk ml qo mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Higher start:<\/strong> The initial knowledge curve of the base model is already higher than the model trained from scratch.<\/li>\n<li id=\"8751\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Higher slope:<\/strong> The source model\u2019s performance and quality of training, as depicted in the graph, are steeper than they otherwise would be.<\/li>\n<li id=\"43cb\" 
class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Higher asymptote:<\/strong> The consolidated knowledge of the pre-trained model is better than it otherwise would be.<\/li>\n<\/ol>\n<h2 id=\"b1e9\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">Fine Tuning BERT model<\/h2>\n<figure class=\"nt nu nv nw nx ny nq nr paragraph-image\">\n<div class=\"nz oa eb ob bg oc\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg od oe c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-AD1XLbXFVrPPlTKvdB4WA.png\" alt=\"\" width=\"700\" height=\"251\"><\/figure><div class=\"nq nr qy\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*-AD1XLbXFVrPPlTKvdB4WA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 
2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*-AD1XLbXFVrPPlTKvdB4WA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*-AD1XLbXFVrPPlTKvdB4WA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*-AD1XLbXFVrPPlTKvdB4WA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*-AD1XLbXFVrPPlTKvdB4WA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*-AD1XLbXFVrPPlTKvdB4WA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*-AD1XLbXFVrPPlTKvdB4WA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*-AD1XLbXFVrPPlTKvdB4WA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"of og oh nq nr oi oj be b bf z dv\" data-selectable-paragraph=\"\">Fine-tuning the BERT model<\/figcaption>\n<\/figure>\n<p id=\"1e83\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The base version of BERT is pre-trained on the Wikipedia corpus and BooksCorpus. Some applications, however, require domain-specific knowledge of the language that these general-purpose corpora do not cover. 
To handle such cases, it&#8217;s important to fine-tune the model on domain-specific data so that it performs well in scenarios where that knowledge is required.<\/p>\n<p id=\"3158\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Fine-tuning on a dataset much smaller than the one used for pre-training can still bring major improvements. Depending on the problem at hand, fine-tuning techniques include:<\/p>\n<ul class=\"\">\n<li id=\"6aff\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc md qm mf mg mh qn mj mk ml qo mn mo mp qz qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Retraining the entire architecture<\/strong> <strong class=\"be op\">of the model: <\/strong>One can continue training every layer of the model. The training loss is back-propagated through the entire architecture, updating the weights of all the pre-trained layers.<\/li>\n<li id=\"b721\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qz qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Train some layers while freezing the others<\/strong>: In this partial-training strategy, the weights of some layers (typically the lower ones) stay frozen while only the remaining layers are retrained.<\/li>\n<li id=\"f090\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qz qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Freeze the entire architecture: <\/strong>One can even freeze all the pre-trained layers and train only a new task-specific layer attached on top, using BERT as a fixed feature extractor.<\/li>\n<\/ul>\n<h2 id=\"ac06\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">NLP applications where BERT is useful<\/h2>\n<p id=\"22de\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb 
mc md nn mf mg mh no mj mk ml np mn mo mp fh bj\" data-selectable-paragraph=\"\">Below are some NLP applications where BERT has proven to be useful:<\/p>\n<ol class=\"\">\n<li id=\"4abb\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc md qm mf mg mh qn mj mk ml qo mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Text classification: <\/strong>Sentiment analysis, for example, assigns text to categories that reflect how positive or negative a sentence is.<\/li>\n<li id=\"0d0a\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Text summarization:<\/strong> BERT is useful for both extractive and abstractive text summarization. Extractive summarization selects the most important sentences from the text and assembles them into a summary. Abstractive summarization generates new sentences that convey the same meaning in different words.<\/li>\n<li id=\"1de5\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Smart search engines:<\/strong> With BERT, search engines like Google can better understand the intent and context of a query and return more relevant results by processing all the words of the query in parallel with self-attention.<\/li>\n<li id=\"8d19\" class=\"lt lu fo be b lv qs lx ly lz qt mb mc md qu mf mg mh qv mj mk ml qw mn mo mp qp qq qr bj\" data-selectable-paragraph=\"\"><strong class=\"be op\">Question answering:<\/strong> The same contextual understanding lets BERT locate the answer to a question within a passage, and it can be extended to power chatbots.<\/li>\n<\/ol>\n<h2 id=\"5051\" class=\"mq mr fo be ms mt mu mv mw mx my mz na md nb nc nd mh ne nf ng ml nh ni nj nk bj\" data-selectable-paragraph=\"\">End Notes<\/h2>\n<p id=\"678a\" class=\"pw-post-body-paragraph lt lu fo be b lv nl lx ly lz nm mb mc md nn mf mg mh no mj mk ml np mn mo 
mp fh bj\" data-selectable-paragraph=\"\">BERT performs strongly on a variety of NLP tasks, largely thanks to its bidirectional training strategy. Through transfer learning, the pre-trained model can apply the language patterns it has learned to other tasks. We can then use it to solve simpler tasks like sequence classification or more complex ones like machine translation and question answering.<\/p>\n<p id=\"936c\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">For a detailed overview, have a look at Google\u2019s <a class=\"af ok\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noopener ugc nofollow\">original BERT research paper<\/a>. Another useful reference is the <a class=\"af ok\" href=\"https:\/\/github.com\/google-research\/bert\" target=\"_blank\" rel=\"noopener ugc nofollow\">BERT source code<\/a> and models, which were generously released as open source by the research team.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>BERT is among those developments proposed by the Google research team that shifted machine learning standards by demonstrating outstanding results in NLP tasks like question-answering in chatbot applications, computer translation, language interpretation, next sentence prediction, and much, much more. 
Overview BERT (Bidirectional Encoder representation from Transformers) is an open-source machine learning framework designed to assist [&hellip;]<\/p>\n","protected":false},"author":53,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[155],"class_list":["post-7417","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>BERT: State-of-the-Art Model for Natural Language Processing - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"BERT: State-of-the-Art Model for Natural Language Processing\" \/>\n<meta property=\"og:description\" content=\"BERT is among those developments proposed by the Google research team that shifted machine learning standards by demonstrating outstanding results in NLP tasks like question-answering in chatbot applications, computer translation, language interpretation, next sentence prediction, and much, much more. 
Overview BERT (Bidirectional Encoder representation from Transformers) is an open-source machine learning framework designed to assist [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-11T17:53:44+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg\" \/>\n<meta name=\"author\" content=\"Pragati Baheti\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Pragati Baheti\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"BERT: State-of-the-Art Model for Natural Language Processing - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/","og_locale":"en_US","og_type":"article","og_title":"BERT: State-of-the-Art Model for Natural Language Processing","og_description":"BERT is among those developments proposed by the Google research team that shifted machine learning standards by demonstrating outstanding results in NLP tasks like question-answering in chatbot applications, computer translation, language interpretation, next sentence prediction, and much, much more. Overview BERT (Bidirectional Encoder representation from Transformers) is an open-source machine learning framework designed to assist [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-09-11T17:53:44+00:00","article_modified_time":"2025-04-24T17:14:15+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg","type":"","width":"","height":""}],"author":"Pragati Baheti","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Pragati Baheti","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/"},"author":{"name":"Pragati Baheti","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/54958874fd9a373469e70e19b6597439"},"headline":"BERT: State-of-the-Art Model for Natural Language Processing","datePublished":"2023-09-11T17:53:44+00:00","dateModified":"2025-04-24T17:14:15+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/"},"wordCount":1523,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/","url":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/","name":"BERT: State-of-the-Art Model for Natural Language Processing - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg","datePublished":"2023-09-11T17:53:44+00:00","dateModified":"2025-04-24T17:14:15+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*9LX_2r3W_4V1g6mU.jpg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/bert-state-of-the-art-model-for-natural-language-processing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"BERT: State-of-the-Art Model for Natural Language Processing"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/54958874fd9a373469e70e19b6597439","name":"Pragati Baheti","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/851362323c20d10f17041155fc07cae2","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1535716570395-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1535716570395-96x96.jpg","caption":"Pragati 
Baheti"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/pragatibaheti001gmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7417","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/53"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7417"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7417\/revisions"}],"predecessor-version":[{"id":15551,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7417\/revisions\/15551"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7417"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7417"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7417"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7417"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}