{"id":7342,"date":"2023-08-29T13:32:31","date_gmt":"2023-08-29T21:32:31","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7342"},"modified":"2025-04-24T17:14:29","modified_gmt":"2025-04-24T17:14:29","slug":"the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/","title":{"rendered":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg\" alt=\"\" width=\"2400\" height=\"1665\"><\/figure><div class=\"mg bg\">\n<figure class=\"mh mi mj mk ml mg bg paragraph-image\"><picture><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"4611\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Introduction<\/strong><\/h1>\n<p id=\"60ce\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Speech recognition is invading our lives. It\u2019s built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). 
But speech recognition has been around for decades, so why is it just now hitting the mainstream?<\/p>\n<p id=\"0bc1\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The reason is that deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments. In this blog post, we\u2019ll learn how to perform speech recognition with 3 different implementations of popular deep learning frameworks.<\/p>\n<blockquote class=\"ok ol om\"><p id=\"337a\" class=\"nk nl on be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe fh bj\" data-selectable-paragraph=\"\"><strong class=\"be or\">Note<\/strong>: The content of this blog post comes from <a class=\"af os\" href=\"https:\/\/www.youtube.com\/watch?v=3MjIkWxXigM&amp;t=2220s&amp;authuser=2\" target=\"_blank\" rel=\"noopener ugc nofollow\">Navdeep Jaitly\u2019s lecture at Stanford<\/a>. I\u2019d highly recommend watching his talk for the full details.<\/p><\/blockquote>\n<h1 id=\"f02a\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Speech Recognition \u2014 The Classic Way<\/strong><\/h1>\n<p id=\"149d\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">In the era of <em class=\"on\">OK Google<\/em>, I might not really need to define ASR, but here\u2019s a basic description: say you have a person or an audio source producing speech, and a bunch of microphones receiving the audio signals. 
You can get these signals from one or many devices, and then pass them into an ASR system \u2014 whose job it is to infer the original source transcript that the person spoke or that the device played.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*LnORfDgmydjugnbWws5-Pw.png\" alt=\"\" width=\"700\" height=\"210\"><\/figure>\n<\/div>\n<\/figure>\n<h2 id=\"fba8\" class=\"pa mp fo be mq pb pc pd mt pe pf pg mw ns ph pi pj nw pk pl pm oa pn po pp pq bj\" data-selectable-paragraph=\"\">So why is ASR important?<\/h2>\n<p id=\"b1ea\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Firstly, it\u2019s a very natural interface for human communication. You don\u2019t need a mouse or a keyboard, so it\u2019s an intuitive way to interact with machines. You don\u2019t even need to learn new skills, because most people learn to speak in the course of natural development. 
It also offers a simple way to talk to everyday devices such as cars, handheld phones, and chatbots.<\/p>\n<h2 id=\"810e\" class=\"pa mp fo be mq pb pc pd mt pe pf pg mw ns ph pi pj nw pk pl pm oa pn po pp pq bj\" data-selectable-paragraph=\"\">So how is this done classically?<\/h2>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*6NVTrAW3UdsILOcJaxyGcA.png\" alt=\"\" width=\"700\" height=\"188\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"6367\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">As observed above, the classic way of building a speech recognition system is to build a generative model of language. On the rightmost side, a language model produces a sequence of words. Then, for each word, a pronunciation model describes how that word is spoken. 
Typically it\u2019s written out as a sequence of phonemes \u2014 the basic units of sound \u2014 but for our purposes we\u2019ll just say a sequence of tokens: categories that have been defined by linguistics experts.<\/p>\n<p id=\"94db\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Then, the pronunciation models are fed into an acoustic model, which defines how a given token sounds. The acoustic model describes the data itself: here the data is x, a sequence of frames of audio features from x1 to xT. Typically, these features are something that signal processing experts have defined (such as the frequency components of the captured audio waveforms).<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*GEhB6B2nzBICGqgs__DeDQ.png\" alt=\"\" width=\"700\" height=\"230\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"1177\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Each of these different components in this pipeline uses a different statistical model:<\/p>\n<ul class=\"\">\n<li id=\"bd4a\" class=\"nk nl fo be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">In the past, language models were typically N-gram models, which worked very well for simple problems with limited speech input data. They are essentially tables describing the probabilities of token sequences.<\/li>\n<li id=\"8bc1\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">The pronunciation models were simple lookup tables, often very large ones, mapping words to pronunciations with associated probabilities.<\/li>\n<li id=\"fd13\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Acoustic models are built using <a class=\"af os\" href=\"https:\/\/towardsdatascience.com\/gaussian-mixture-models-explained-6986aaf5a95\" target=\"_blank\" rel=\"noopener\">Gaussian Mixture Models<\/a> with very specific architectures associated with them.<\/li>\n<li id=\"f3a2\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">The speech feature processing was hand-designed in advance.<\/li>\n<\/ul>\n<p id=\"8485\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Once such a model is built, we can perform recognition by running inference on the incoming data. 
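Concretely, the classical decoding step picks the transcript Y that maximizes p(Y|X), which by Bayes\u2019 rule is proportional to p(X|Y) * p(Y). Here is a toy sketch of that idea; the candidate transcripts and all probabilities are invented for illustration, standing in for a real N-gram language model and GMM acoustic model.

```python
import math

# Invented numbers for illustration only: stand-ins for a real N-gram
# language model p(Y) and a GMM acoustic model p(X | Y).
language_model = {
    "recognize speech": 0.6,
    "wreck a nice beach": 0.4,
}
acoustic_model = {
    "recognize speech": 0.01,
    "wreck a nice beach": 0.002,
}

def decode(candidates):
    """Return the transcript maximizing log p(X | Y) + log p(Y)."""
    return max(candidates,
               key=lambda y: math.log(acoustic_model[y]) + math.log(language_model[y]))

best = decode(language_model)  # "recognize speech" wins on the combined score
```

A real system searches over an enormous hypothesis space rather than two fixed strings, but the scoring rule is the same combination of acoustic and language-model probabilities.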
So you get a waveform, you compute its features (X), and you search for the Y that assigns the highest probability to X.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"6634\" class=\"mo mp fo be mq mr qt go mt mu qu gr mw mx qv mz na nb qw nd ne nf qx nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">The Neural Network Invasion<\/strong><\/h1>\n<p id=\"90be\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Over time, researchers started noticing that each of these components could work more effectively if we used neural networks.<\/p>\n<ul class=\"\">\n<li id=\"b8f2\" class=\"nk nl fo be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Instead of the N-gram language models, we can build neural language models and use them to rescore hypotheses produced by a first-pass speech recognition system.<\/li>\n<li id=\"9dcc\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Looking into the pronunciation models, a neural network can predict the pronunciation of a sequence of characters it has never seen before.<\/li>\n<li id=\"c9e0\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">For acoustic models, we can build deep neural networks (such as LSTM-based models) that classify the features of the current frame with much better accuracy.<\/li>\n<li id=\"6662\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Interestingly enough, even the speech pre-processing steps were found to be 
replaceable with convolutional neural networks on raw speech signals.<\/li>\n<\/ul>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*tUBIQuzq1JcSSPyA_ESqOQ.png\" alt=\"\" width=\"700\" height=\"193\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"ef94\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">However, there\u2019s still a problem. There are neural networks in each component, but they\u2019re trained independently with different objectives. Because of that, errors in one component can compound with errors in another. That is the basic motivation for devising a process where the entire pipeline is trained as one model.<\/p>\n<p id=\"5ca3\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">These so-called <strong class=\"be or\">end-to-end models<\/strong> encompass more and more components in the pipeline discussed above. 
The 2 most popular ones are (1) <mark class=\"adv adw ao\"><strong class=\"be or\">Connectionist Temporal Classification<\/strong><\/mark> (CTC), which is in wide use today at Baidu and Google, but requires a lot of training; and (2) <strong class=\"be or\">Sequence-To-Sequence<\/strong> (Seq-2-Seq), which doesn\u2019t require manual customization.<\/p>\n<p id=\"7ab4\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The basic motivation is that we want to do end-to-end speech recognition. We are given the audio X \u2014 a sequence of frames from x1 to xT \u2014 and the corresponding output text Y \u2014 a sequence from y1 to yL. Y is just a text sequence (transcript) and X is the processed audio spectrogram. We want to perform speech recognition by learning a probabilistic model p(Y|X): starting with the data and predicting the target sequences themselves.<\/p>\n<h1 id=\"121a\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">1 \u2014 Connectionist Temporal Classification<\/strong><\/h1>\n<p id=\"0a26\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">The first of these models is called Connectionist Temporal Classification (CTC) ([1], [2], [3]). X is a sequence of data frames with length T: x1, x2, \u2026, xT, and Y is the sequence of output tokens with length L: y1, y2, \u2026, yL. 
Because of the way the model is constructed, we require T to be greater than L.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*i70AUnRnHSauVRIJ1afWDg.png\" alt=\"\" width=\"700\" height=\"313\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"5ff0\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">This model has a very specific structure that makes it suitable for speech:<\/p>\n<ul class=\"\">\n<li id=\"2ddb\" class=\"nk nl fo be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">You get the spectrogram at the bottom (X). You feed it into a bi-directional recurrent neural network, and as a result, the output at any time step depends on the entirety of the input data. As such, it can compute a fairly complicated function of the entire data X.<\/li>\n<li id=\"9e2d\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">This model, at the top, has softmax functions at every timeframe corresponding to the input. The softmax is computed over a vocabulary of whatever size you\u2019re interested in. 
In this case, you have the lowercase letters a to z and some punctuation symbols. So the vocabulary for CTC would be all of that, plus an extra token called the blank token.<\/li>\n<li id=\"b7ba\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">At each frame, the model produces a log probability for every token class at that time step. In the case above, the score s(k, t) is the log probability of category k at time step t given the data X.<\/li>\n<\/ul>\n<p id=\"9ebf\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">In a CTC model, if you look at the softmax outputs produced by the recurrent neural network over all time steps, you can compute the probability of a transcript from these per-frame distributions.<\/p>\n<p id=\"30b3\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Let\u2019s take a look at an example (below). 
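To make s(k, t) concrete, here is a small sketch that turns one hypothetical matrix of per-frame network outputs into per-frame log probabilities with a log-softmax. The random logits stand in for the bi-directional RNN's outputs, and the vocabulary is reduced to a to z plus the blank (punctuation omitted for brevity).

```python
import math
import random

# Assumed vocabulary for this sketch: a-z plus the CTC blank token.
vocab = list("abcdefghijklmnopqrstuvwxyz") + ["<b>"]

def log_softmax(logits):
    """Turn one frame's raw network outputs into log probabilities s(k, t)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

# Stand-in for the bi-directional RNN: random logits, one row per audio frame.
random.seed(0)
T = 4
logits = [[random.gauss(0.0, 1.0) for _ in vocab] for _ in range(T)]

# s[t][k] is the log probability of token k at time step t given the input.
s = [log_softmax(frame) for frame in logits]

# Sanity check: each frame's probabilities sum to 1.
for frame in s:
    assert abs(sum(math.exp(v) for v in frame) - 1.0) < 1e-9
```

In a trained model the logits would of course come from the network, not a random generator; only the normalization step is shown here.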
The CTC model considers all paths through the entire space of per-frame softmax outputs, reading off the symbol emitted at each time step.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:416\/1*3d8our9f89mxbYf8e1Po2g.png\" alt=\"\" width=\"638\" height=\"282\"><\/figure>\n<\/figure>\n<p id=\"61e3\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">As seen on the left, the CTC model will go through 2 C symbols, then through a blank symbol, then produce 2 A symbols, then produce another blank symbol, then transition to a T symbol, and then finally produce a blank symbol again.<\/p>\n<p id=\"8251\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">When tracing out these paths, the constraint is that from one step to the next you can only repeat the same symbol or move on to the next one. 
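The collapsing rule implied here (merge consecutive repeats, then drop blanks) can be sketched as:

```python
def collapse(path, blank="<b>"):
    """Map a frame-level CTC path to its transcript: merge repeats, drop blanks."""
    out = []
    prev = None
    for token in path:
        # Emit a token only when it differs from the previous frame's token
        # and is not the blank symbol.
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

# Paths like cc <b> aa <b> t <b> and cccc <b> aaaa <b> tttt <b> both collapse
# to the same transcript, while a blank keeps genuine repeated letters distinct.
assert collapse(["c", "c", "<b>", "a", "a", "<b>", "t", "<b>"]) == "cat"
assert collapse(["l", "<b>", "l"]) == "ll"  # blank separates a real double letter
```

This is why an exponential number of frame-level paths can all correspond to one output sequence.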
Therefore, you\u2019ll end up with different ways of representing the same output sequence.<\/p>\n<p id=\"92e4\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">For the example above, we have <strong class=\"be or\">cc &lt;b&gt; aa &lt;b&gt; t &lt;b&gt; <\/strong>or <strong class=\"be or\">cc &lt;b&gt; &lt;b&gt; a &lt;b&gt; t &lt;b&gt;<\/strong> or <strong class=\"be or\">cccc &lt;b&gt; aaaa &lt;b&gt; tttt &lt;b&gt;.<\/strong> Given these constraints, it turns out that even though there\u2019s an exponential number of paths that produce the same output sequence, you can sum over all of them exactly using a dynamic programming algorithm. Because of dynamic programming, it\u2019s possible to compute both the log probability p(Y|X) and its gradient exactly. This gradient can be backpropagated to a neural network whose parameters can then be adjusted by your favorite optimizer!<\/p>\n<p id=\"eaa7\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Below are some results for CTC, which show how the model behaves on given audio. A raw waveform is aligned at the bottom, and the corresponding predictions are output at the top. You can see that it produces the symbol H at the beginning. 
At a certain point, it gets a very high probability, which means that the model is confident that it hears the sound corresponding to H.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:662\/1*35HAf2Agf8AT7pQ2MlZzbw.png\" alt=\"\" width=\"662\" height=\"324\"><\/figure><div class=\"ot ou rb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1324\/format:webp\/1*35HAf2Agf8AT7pQ2MlZzbw.png 1324w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 662px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*35HAf2Agf8AT7pQ2MlZzbw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*35HAf2Agf8AT7pQ2MlZzbw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*35HAf2Agf8AT7pQ2MlZzbw.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*35HAf2Agf8AT7pQ2MlZzbw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*35HAf2Agf8AT7pQ2MlZzbw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*35HAf2Agf8AT7pQ2MlZzbw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1324\/1*35HAf2Agf8AT7pQ2MlZzbw.png 1324w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 662px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"2a71\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">However, there are some drawbacks to CTC models. They often misspell words and struggle with grammar. So if you had some way to re-rank the different paths produced by the model using a language model, the results should be much better.<\/p>\n<p id=\"9027\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Google actually fixed these problems by integrating a language model as part of the CTC model itself during training. 
That\u2019s the kind of production model currently being deployed with <em class=\"on\">OK Google<\/em>.<\/p>\n<h1 id=\"b1a6\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">2 \u2014 Sequence-To-Sequence<\/strong><\/h1>\n<p id=\"8830\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">An alternative approach to speech processing is the sequence-to-sequence model that makes next-step predictions. Let\u2019s say that you\u2019re given some input data X and that you\u2019ve already produced the symbols y1 to y{i}. The model predicts the probability of the next symbol y{i+1}. The goal here is to learn a very good model for this probability p.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*EjsrlveQH-K3pNO4O8M_ag.png\" alt=\"\" width=\"700\" height=\"317\"><\/figure><div class=\"ot ou rc\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*EjsrlveQH-K3pNO4O8M_ag.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*EjsrlveQH-K3pNO4O8M_ag.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*EjsrlveQH-K3pNO4O8M_ag.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*EjsrlveQH-K3pNO4O8M_ag.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*EjsrlveQH-K3pNO4O8M_ag.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*EjsrlveQH-K3pNO4O8M_ag.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*EjsrlveQH-K3pNO4O8M_ag.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*EjsrlveQH-K3pNO4O8M_ag.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"8cec\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">With the model architecture (left), you have a neural network (which is the decoder in a sequence-to-sequence model) that looks at the entire input (which is the encoder). 
It feeds the symbols produced so far back in through a recurrent neural network, and then predicts the next token as the output.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*Vywp5idCLQb3iMhH2VSbcw.png\" alt=\"\" width=\"700\" height=\"297\"><\/figure><div class=\"ot ou rd\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*Vywp5idCLQb3iMhH2VSbcw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Vywp5idCLQb3iMhH2VSbcw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Vywp5idCLQb3iMhH2VSbcw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Vywp5idCLQb3iMhH2VSbcw.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Vywp5idCLQb3iMhH2VSbcw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Vywp5idCLQb3iMhH2VSbcw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Vywp5idCLQb3iMhH2VSbcw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*Vywp5idCLQb3iMhH2VSbcw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"c497\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">So this model does speech recognition with the sequence-to-sequence framework. In translation, the X would be the source language. In the speech domain, the X would be a huge sequence of audio that\u2019s now encoded with a recurrent neural network.<\/p>\n<p id=\"abc5\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">What it needs to function is the ability to look at different parts of temporal space, because the input is really long. Intuitively, translation results get worse as the source sentence becomes longer. That\u2019s because it\u2019s really difficult for the model to look in the right place. Turns out, that problem is aggravated a lot more with audio streams that are much longer. 
Therefore, you would need to implement an attention mechanism if you want to make this model work at all.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png\" alt=\"\" width=\"700\" height=\"330\"><\/figure><div class=\"ot ou qz\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*LfeKh4OPR-Jp4GVrpdEQ6w.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"019a\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">Seen in the example on the left, you\u2019re trying to produce the 1st character C. 
You create an attention vector that looks at different parts of the input time steps and, after shifting the attention, produces the next character (which is A).<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*b6SKvhvGLRUq4Su78NHSXQ.png\" alt=\"\" width=\"700\" height=\"461\"><\/figure><div class=\"ot ou re\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*b6SKvhvGLRUq4Su78NHSXQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*b6SKvhvGLRUq4Su78NHSXQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*b6SKvhvGLRUq4Su78NHSXQ.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*b6SKvhvGLRUq4Su78NHSXQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*b6SKvhvGLRUq4Su78NHSXQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*b6SKvhvGLRUq4Su78NHSXQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*b6SKvhvGLRUq4Su78NHSXQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*b6SKvhvGLRUq4Su78NHSXQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"46ba\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">If you keep doing this over the entire input stream, the attention moves forward on its own, in a way learned entirely by the model. Seen here, it produces the output sequence \u201ccancel, cancel, cancel.\u201d<\/p>\n<p id=\"6185\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The Listen, Attend, and Spell [4] model is the canonical model for the seq-2-seq category. 
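<\/p>
<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The attention step itself can be captured in a minimal numpy sketch. Everything here is illustrative\u2014the names, shapes, and dot-product scoring are assumptions, not details from the paper: the decoder state is scored against every encoder state, the scores are normalized into attention weights, and a weighted context vector summarizes the input.<\/p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(encoder_states, decoder_state):
    '''One attention step: score every encoder time step against the
    current decoder state, then build a weighted context vector.'''
    scores = encoder_states @ decoder_state   # (T,) one score per time step
    weights = softmax(scores)                 # (T,) non-negative, sums to 1
    context = weights @ encoder_states        # (H,) summary of the input
    return weights, context

rng = np.random.default_rng(0)
h = rng.normal(size=(50, 8))   # 50 encoder time steps, hidden size 8
s = rng.normal(size=8)         # current decoder state
w, c = attend(h, s)
print(round(w.sum(), 6), c.shape)   # weights sum to 1; context has shape (8,)
```

<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">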
Let\u2019s look at the diagram below taken from the paper:<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*LRPyLHYjoX_gk8dCjN2hQg.png\" alt=\"\" width=\"700\" height=\"831\"><\/figure><div class=\"ot ou rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*LRPyLHYjoX_gk8dCjN2hQg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*LRPyLHYjoX_gk8dCjN2hQg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*LRPyLHYjoX_gk8dCjN2hQg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*LRPyLHYjoX_gk8dCjN2hQg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*LRPyLHYjoX_gk8dCjN2hQg.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*LRPyLHYjoX_gk8dCjN2hQg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*LRPyLHYjoX_gk8dCjN2hQg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*LRPyLHYjoX_gk8dCjN2hQg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<ul class=\"\">\n<li id=\"2301\" class=\"nk nl fo be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">In the Listener architecture, you have an encoder structure. For every time step of the input, it produces a vector representation that encodes the input and is represented as h_t at time step t.<\/li>\n<li id=\"01d2\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">In the Speller architecture, you have a decoder architecture. You generate the next character c_t at every time step t.<\/li>\n<li id=\"5799\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">The LAS model uses a hierarchical encoder to replace the traditional recurrent neural network. Instead of processing one frame for every time step, it collapses neighboring frames as you feed into the next layer. 
Because of that, it reduces the number of time steps to be processed, thus making the processing faster.<\/li>\n<\/ul>\n<p id=\"342b\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">So what are the limitations of this model?<\/p>\n<ul class=\"\">\n<li id=\"c839\" class=\"nk nl fo be b gm of nn no gp og nq nr oo oh nu nv op oi ny nz oq oj oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">One of the big limitations preventing its use in an online system is that the output produced is being conditioned on the entire input. That means if you\u2019re going to put the model in a real-world speech recognition system, you\u2019d have to first wait for the entire audio to be received before outputting the symbol.<\/li>\n<li id=\"e32f\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Another limitation is that the attention model itself is a computational bottleneck since every output token pays attention to every input time step. 
This makes it harder and slower for the model to do its learning.<\/li>\n<li id=\"7469\" class=\"nk nl fo be b gm pw nn no gp px nq nr oo py nu nv op pz ny nz oq qa oc od oe pt pu pv bj\" data-selectable-paragraph=\"\">Further, as the input is received and becomes longer, the word error rate goes up.<\/li>\n<\/ul>\n<h1 id=\"47ee\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">3 \u2014 Online Sequence-to-Sequence<\/strong><\/h1>\n<p id=\"df46\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">Online sequence-to-sequence models are designed to overcome the limits of sequence-to-sequence models\u2014you don\u2019t want to wait for the entire input sequence to arrive, and you also want to avoid running the attention model over the entire sequence. Essentially, the intention is to produce the outputs as the inputs arrive. The model has to solve the following problem: is it ready to produce an output now that it\u2019s received this much input?<\/p>\n<p id=\"61bd\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The most notable online seq-2-seq model is called a <strong class=\"be or\">Neural Transducer<\/strong> [5]. You take the input as it comes in, and at regular intervals you run a seq-2-seq model on what\u2019s been received in the last block. As seen in the architecture below, the encoder&#8217;s attention (instead of looking at the entire input) will focus only on a little block. 
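<\/p>
<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">As a rough sketch of that blocking idea (the block size and the toy scoring below are made up for illustration), the incoming frames can be cut into fixed-size blocks, with attention computed only inside the current block:<\/p>

```python
import numpy as np

def block_contexts(frames, block_size):
    '''Transducer-style blocking sketch: split incoming frames into
    fixed-size blocks and attend only within the current block.'''
    contexts = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]   # only this block is visible
        scores = block.sum(axis=1)                 # toy relevance score per frame
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax within the block
        contexts.append(weights @ block)           # per-block context vector
    return np.stack(contexts)

x = np.random.default_rng(1).normal(size=(12, 4))  # 12 audio frames arriving over time
print(block_contexts(x, block_size=4).shape)       # one context per block: (3, 4)
```

<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">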
The transducer will produce the output symbols.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:610\/1*h7IAUR8txM9hkJanypAdQA.png\" alt=\"\" width=\"610\" height=\"330\"><\/figure><div class=\"ot ou rg\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1220\/format:webp\/1*h7IAUR8txM9hkJanypAdQA.png 1220w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 610px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*h7IAUR8txM9hkJanypAdQA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*h7IAUR8txM9hkJanypAdQA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*h7IAUR8txM9hkJanypAdQA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*h7IAUR8txM9hkJanypAdQA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*h7IAUR8txM9hkJanypAdQA.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*h7IAUR8txM9hkJanypAdQA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1220\/1*h7IAUR8txM9hkJanypAdQA.png 1220w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 610px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"3b75\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">The nice thing about the neural transducer is that it maintains causality. More specifically, the model removes the main disadvantage of a seq-2-seq model: it doesn\u2019t have to see the entire input before producing outputs. It does, however, introduce an alignment problem: you know you have to produce some symbols as outputs, but you don\u2019t know which chunk these symbols should be aligned to.<\/p>\n<p id=\"a04d\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">You can actually make this model better by incorporating convolutional neural networks, which are borrowed from computer vision. 
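<\/p>
<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">For reference, the plain pyramid reduction that such a convolutional layer can replace looks something like the sketch below (the shapes are illustrative): neighboring frames are concatenated, halving the number of time steps passed to the next layer.<\/p>

```python
import numpy as np

def stack_neighbors(frames):
    '''Pyramid-style reduction: concatenate pairs of adjacent frames,
    halving the number of time steps the next layer must process.'''
    t, h = frames.shape
    t -= t % 2                            # drop a trailing odd frame, if any
    return frames[:t].reshape(t // 2, 2 * h)

x = np.zeros((100, 40))                   # 100 frames of 40 filterbank features
print(stack_neighbors(x).shape)           # (50, 80): half the steps, twice the width
```

<p class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">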
The paper [6] uses CNNs to do the encoder side in speech architecture.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*MV_vtzZHZzvt-thxxiTDkg.png\" alt=\"\" width=\"700\" height=\"392\"><\/figure><div class=\"ot ou rh\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*MV_vtzZHZzvt-thxxiTDkg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*MV_vtzZHZzvt-thxxiTDkg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*MV_vtzZHZzvt-thxxiTDkg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*MV_vtzZHZzvt-thxxiTDkg.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*MV_vtzZHZzvt-thxxiTDkg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*MV_vtzZHZzvt-thxxiTDkg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*MV_vtzZHZzvt-thxxiTDkg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*MV_vtzZHZzvt-thxxiTDkg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"3eff\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">You take the traditional model for the pyramid as seen to the left, and instead of building the pyramid by simply stacking 2 things together, you can put a fancy architecture on top when you do the stacking. More specifically, as seen below, you can stack them as feature maps and put a CNN on the top. For the speech recognition problem, the frequency bands and the timestamps of the features that you look at will correspond to a natural substructure of the input data. 
The convolutional architecture essentially looks at that substructure.<\/p>\n<figure class=\"mh mi mj mk ml mg ot ou paragraph-image\">\n<div class=\"ow ox eb oy bg oz\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg mm mn c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:586\/1*AH3Vm5bX937DlMTOm7v4lQ.png\" alt=\"\" width=\"586\" height=\"449\"><\/figure><div class=\"ot ou ri\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1172\/format:webp\/1*AH3Vm5bX937DlMTOm7v4lQ.png 1172w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 586px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*AH3Vm5bX937DlMTOm7v4lQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*AH3Vm5bX937DlMTOm7v4lQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*AH3Vm5bX937DlMTOm7v4lQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*AH3Vm5bX937DlMTOm7v4lQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*AH3Vm5bX937DlMTOm7v4lQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*AH3Vm5bX937DlMTOm7v4lQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1172\/1*AH3Vm5bX937DlMTOm7v4lQ.png 1172w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 586px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<h1 id=\"3943\" class=\"mo mp fo be mq mr ms go mt mu mv gr mw mx my mz na nb nc nd ne nf ng nh ni nj bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Conclusion<\/strong><\/h1>\n<p id=\"0694\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">You should now generally be up to speed on the 3 most common deep learning-based frameworks for performing automatic speech recognition in a variety of contexts. The papers that I\u2019ve referenced below will help you get into the nitty-gritty technical details of how they work if you\u2019re inclined to do that.<\/p>\n<h2 id=\"c5ee\" class=\"pa mp fo be mq pb pc pd mt pe pf pg mw ns ph pi pj nw pk pl pm oa pn po pp pq bj\" data-selectable-paragraph=\"\"><strong class=\"al\">References<\/strong><\/h2>\n<p id=\"725a\" class=\"pw-post-body-paragraph nk nl fo be b gm nm nn no gp np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe fh bj\" data-selectable-paragraph=\"\">[1] Graves, Alex, and Navdeep Jaitly. 
\u201c<a class=\"af os\" href=\"http:\/\/proceedings.mlr.press\/v32\/graves14.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Towards End-To-End Speech Recognition with Recurrent Neural Networks<\/a>.\u201d ICML. Vol. 14. 2014.<\/p>\n<p id=\"ba36\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">[2] Amodei, Dario, et al. \u201c<a class=\"af os\" href=\"http:\/\/proceedings.mlr.press\/v48\/amodei16.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Deep speech 2: End-to-end speech recognition in english and mandarin<\/a>.\u201d arXiv preprint arXiv:1512.02595 (2015).<\/p>\n<p id=\"185b\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">[3] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, \u201c<a class=\"af os\" href=\"https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/pubs\/archive\/43908.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Learning acoustic frame labeling for speech recognition with recurrent neural networks<\/a>,\u201d in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015.<\/p>\n<p id=\"db0f\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. \u201c<a class=\"af os\" href=\"https:\/\/arxiv.org\/pdf\/1508.01211.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Listen, Attend, and Spell<\/a>,\u201d in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015.<\/p>\n<p id=\"cd1b\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">[5] N. Jaitly, D. Sussillo, Q. Le, O. Vinyals, I. 
Sutskever, and S. Bengio. \u201c<a class=\"af os\" href=\"https:\/\/arxiv.org\/pdf\/1511.04868.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">A Neural Transducer<\/a>,\u201d arXiv preprint arXiv:1511.04868 (2016).<\/p>\n<p id=\"09e5\" class=\"pw-post-body-paragraph nk nl fo be b gm of nn no gp og nq nr ns oh nu nv nw oi ny nz oa oj oc od oe fh bj\" data-selectable-paragraph=\"\">[6] N. Jaitly, W. Chan, and Y. Zhang. \u201c<a class=\"af os\" href=\"https:\/\/arxiv.org\/pdf\/1610.03022.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Very Deep Convolutional Networks for End-to-End Speech Recognition<\/a>,\u201d arXiv preprint arXiv:1610.03022 (2016).<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Speech recognition is invading our lives. It\u2019s built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). But speech recognition has been around for decades, so why is it just now hitting the mainstream? 
The reason is that deep learning finally made speech recognition accurate [&hellip;]<\/p>\n","protected":false},"author":39,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[150],"class_list":["post-7342","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices\" \/>\n<meta property=\"og:description\" content=\"Introduction Speech recognition is invading our lives. It\u2019s built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). But speech recognition has been around for decades, so why is it just now hitting the mainstream? 
The reason is that deep learning finally made speech recognition accurate [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-29T21:32:31+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:29+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg\" \/>\n<meta name=\"author\" content=\"James Le\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"James Le\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/","og_locale":"en_US","og_type":"article","og_title":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices","og_description":"Introduction Speech recognition is invading our lives. It\u2019s built into our phones (Siri), our game consoles (Kinect), our smartwatches (Apple Watch), and even our homes (Amazon Echo). 
But speech recognition has been around for decades, so why is it just now hitting the mainstream? The reason is that deep learning finally made speech recognition accurate [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-29T21:32:31+00:00","article_modified_time":"2025-04-24T17:14:29+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg","type":"","width":"","height":""}],"author":"James Le","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"James Le","Est. reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/"},"author":{"name":"James Le","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/9ea207111d311668f59477646ffd469a"},"headline":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your 
Devices","datePublished":"2023-08-29T21:32:31+00:00","dateModified":"2025-04-24T17:14:29+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/"},"wordCount":2653,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/","url":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/","name":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg","datePublished":"2023-08-29T21:32:31+00:00","dateModified":"2025-04-24T17:14:29+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*9s0eyPQqGOhQW5fxmM90Zw.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/the-3-deep-learning-frameworks-for-end-to-end-speech-recognition-that-power-your-devices\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"The 3 Deep Learning Frameworks For End-to-End Speech Recognition That Power Your Devices"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/9ea207111d311668f59477646ffd469a","name":"James Le","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/e9faebcdd7afdaff187857dc289b23ba","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1678305362870-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1678305362870-96x96.jpg","caption":"James 
Le"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/khanhle-1013gmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7342"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7342\/revisions"}],"predecessor-version":[{"id":15565,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7342\/revisions\/15565"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7342"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7342"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7342"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}