{"id":13153,"date":"2025-03-13T18:38:36","date_gmt":"2025-03-13T18:38:36","guid":{"rendered":"https:\/\/comet-marketing-site.lndo.site\/?page_id=13153"},"modified":"2025-11-17T20:57:25","modified_gmt":"2025-11-17T20:57:25","slug":"assemblyai","status":"publish","type":"page","link":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/","title":{"rendered":"Building an End-to-End Speech Recognition Model in PyTorch with AssemblyAI"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"880\" height=\"365\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/assembly_featured.jpg\" alt=\"\" class=\"wp-image-1318\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/assembly_featured.jpg 880w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/assembly_featured-300x124.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/assembly_featured-768x319.jpg 768w\" sizes=\"auto, (max-width: 880px) 100vw, 880px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>This post was written by Michael Nguyen, Machine Learning Research Engineer at&nbsp;<a href=\"https:\/\/www.assemblyai.com\/\">AssemblyAI<\/a>. AssemblyAI uses Comet to log, visualize, and understand their model development pipeline.&nbsp;<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Deep Learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio, and directly output transcriptions. Two of the most popular end-to-end models today are Deep Speech by Baidu, and Listen Attend Spell (LAS) by Google. Both Deep Speech and LAS, are recurrent neural network (RNN) based architectures with different approaches to modeling speech recognition. Deep Speech uses the Connectionist Temporal Classification (CTC) loss function to predict the speech transcript. 
LAS uses a sequence-to-sequence network architecture for its predictions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These models simplified speech recognition pipelines by taking advantage of the capacity of deep learning systems to learn from large datasets. With enough data, you should, in theory, be able to build a robust speech recognition model that accounts for all the nuance in speech without spending a ton of time and effort hand-engineering acoustic features or dealing with the complex pipelines of older architectures such as GMM-HMM models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Deep learning is a fast-moving field, and Deep Speech and LAS-style architectures are already becoming outdated. You can read about where the industry is moving in the Latest Advancement Section below.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How to Build Your Own End-to-End Speech Recognition Model in PyTorch<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s walk through how to build your own end-to-end speech recognition model in PyTorch. The model we\u2019ll build is inspired by Deep Speech 2 (Baidu\u2019s second revision of their now-famous model) with some personal improvements to the architecture. The output of the model will be a probability matrix of characters, and we\u2019ll use that probability matrix to decode the most likely characters spoken from the audio. You can find the full code and also run it with GPU support on&nbsp;<a href=\"https:\/\/colab.research.google.com\/drive\/1IPpwx4rX32rqHKpLz7dc8sOKspUa-YKO\">Google Colaboratory<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Preparing the data pipeline<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Data is one of the most important aspects of speech recognition. 
We\u2019ll take raw audio waves and transform them into Mel Spectrograms.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"220\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/spectogram-1024x220-1.png\" alt=\"\" class=\"wp-image-1317\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/spectogram-1024x220-1.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/spectogram-1024x220-1-300x64.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/spectogram-1024x220-1-768x165.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can read more on the details about how that transformation looks from this excellent post&nbsp;<a href=\"https:\/\/haythamfayek.com\/2016\/04\/21\/speech-processing-for-machine-learning.html\">here<\/a>. For this post, you can just think of a Mel Spectrogram as essentially a picture of sound.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"299\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/freq-1024x299-1.jpg\" alt=\"\" class=\"wp-image-1316\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/freq-1024x299-1.jpg 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/freq-1024x299-1-300x88.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/freq-1024x299-1-768x224.jpg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For handling the audio data, we are going to use an extremely useful utility called&nbsp;<strong>torchaudio&nbsp;<\/strong>which is a library built by the PyTorch team specifically for audio data. 
We\u2019ll be training on a 100-hour subset of&nbsp;<a href=\"http:\/\/www.openslr.org\/12\/\">LibriSpeech<\/a>, a corpus of read English speech derived from audiobooks. You can easily download this dataset using&nbsp;<strong>torchaudio<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torchaudio\n\ntrain_dataset = torchaudio.datasets.LIBRISPEECH(\".\/\", url=\"train-clean-100\", download=True)\ntest_dataset = torchaudio.datasets.LIBRISPEECH(\".\/\", url=\"test-clean\", download=True)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Each sample of the dataset contains the waveform, the sample rate of the audio, the utterance\/label, and more metadata on the sample. You can view what each sample looks like in the source code&nbsp;<a href=\"https:\/\/github.com\/pytorch\/audio\/blob\/master\/torchaudio\/datasets\/librispeech.py#L40\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Augmentation \u2013 SpecAugment<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Data augmentation is a technique used to artificially increase the diversity of your dataset, effectively increasing its size. This strategy is especially helpful when data is scarce or if your model is overfitting. For speech recognition, you can apply standard augmentation techniques, such as changing the pitch or speed, injecting noise, and adding reverb to your audio data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We found Spectrogram Augmentation (SpecAugment) to be a much simpler and more effective approach. 
SpecAugment, was first introduced in the paper&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1904.08779\">SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition<\/a>, in which the authors found that simply cutting out random blocks of consecutive time and frequency dimensions improved the models generalization abilities significantly!<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"619\" height=\"286\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/tmp.png\" alt=\"\" class=\"wp-image-1315\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/tmp.png 619w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/tmp-300x139.png 300w\" sizes=\"auto, (max-width: 619px) 100vw, 619px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In PyTorch, you can use the<strong>&nbsp;torchaudio<\/strong>&nbsp;function&nbsp;<strong>FrequencyMasking<\/strong>&nbsp;to mask out the frequency dimension, and&nbsp;<strong>TimeMasking<\/strong>&nbsp;for the time dimension.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>torchaudio.transforms.FrequencyMasking()\ntorchaudio.transforms.TimeMasking()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we have the data, we\u2019ll need to transform the audio into Mel Spectrograms, and map the character labels for each audio sample into integer labels:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class TextTransform:\n    \"\"\"Maps characters to integers and vice versa\"\"\"\n    def __init__(self):\n        char_map_str = \"\"\"\n        ' 0\n        &lt;SPACE&gt; 1\n        a 2\n        b 3\n        c 4\n        d 5\n        e 6\n        f 7\n        g 8\n        h 9\n        i 10\n        j 11\n        k 12\n        l 13\n        m 14\n        n 15\n        o 16\n        p 17\n        q 18\n        r 19\n        s 20\n        t 21\n        u 22\n        v 23\n        w 24\n        x 25\n        y 26\n   
     z 27\n        \"\"\"\n        self.char_map = {}\n        self.index_map = {}\n        for line in char_map_str.strip().split('\\n'):\n            ch, index = line.split()\n            self.char_map&#91;ch] = int(index)\n            self.index_map&#91;int(index)] = ch\n        self.index_map&#91;1] = ' '\n\n    def text_to_int(self, text):\n        \"\"\" Use a character map and convert text to an integer sequence \"\"\"\n        int_sequence = &#91;]\n        for c in text:\n            if c == ' ':\n                ch = self.char_map&#91;'&lt;SPACE&gt;']\n            else:\n                ch = self.char_map&#91;c]\n            int_sequence.append(ch)\n        return int_sequence\n\n    def int_to_text(self, labels):\n        \"\"\" Use a character map and convert integer labels to a text sequence \"\"\"\n        string = &#91;]\n        for i in labels:\n            string.append(self.index_map&#91;i])\n        return ''.join(string).replace('&lt;SPACE&gt;', ' ')\n\n\ntrain_audio_transforms = nn.Sequential(\n    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128),\n    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),\n    torchaudio.transforms.TimeMasking(time_mask_param=35)\n)\n\nvalid_audio_transforms = torchaudio.transforms.MelSpectrogram()\n\ntext_transform = TextTransform()\n\n\ndef data_processing(data, data_type=\"train\"):\n    spectrograms = &#91;]\n    labels = &#91;]\n    input_lengths = &#91;]\n    label_lengths = &#91;]\n    for (waveform, _, utterance, _, _, _) in data:\n        if data_type == 'train':\n            spec = train_audio_transforms(waveform).squeeze(0).transpose(0, 1)\n        else:\n            spec = valid_audio_transforms(waveform).squeeze(0).transpose(0, 1)\n        spectrograms.append(spec)\n        label = torch.Tensor(text_transform.text_to_int(utterance.lower()))\n        labels.append(label)\n        input_lengths.append(spec.shape&#91;0]\/\/2)\n        label_lengths.append(len(label))\n\n    spectrograms = 
nn.utils.rnn.pad_sequence(spectrograms, batch_first=True).unsqueeze(1).transpose(2, 3)\n    labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)\n\n    return spectrograms, labels, input_lengths, label_lengths<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Define the Model \u2013 Deep Speech 2 (but better)<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Our model will be similar to the Deep Speech 2 architecture. The model will have two main neural network modules \u2013 N layers of Residual Convolutional Neural Networks (ResCNN) to learn the relevant audio features, and a set of Bidirectional Recurrent Neural Networks (BiRNN) to leverage the learned ResCNN audio features. The model is topped off with a fully connected layer used to classify characters per time step.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1281\" height=\"126\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/BHOBfDVTcGCQKTtp.png\" alt=\"\" class=\"wp-image-1314\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/BHOBfDVTcGCQKTtp.png 1281w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/BHOBfDVTcGCQKTtp-300x30.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/BHOBfDVTcGCQKTtp-1024x101.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/BHOBfDVTcGCQKTtp-768x76.png 768w\" sizes=\"auto, (max-width: 1281px) 100vw, 1281px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Convolutional Neural Networks (CNN) are great at extracting abstract features, and we\u2019ll apply the same feature extraction power to audio spectrograms. Instead of just vanilla CNN layers, we choose to use Residual CNN layers. 
Residual connections (AKA skip connections) were first introduced in the paper&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1512.03385\">Deep Residual Learning for Image Recognition<\/a>, where the authors found that you can build really deep networks with good accuracy gains if you add these connections to your CNNs. Adding these residual connections also helps the model learn faster and generalize better. The paper&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1712.09913\">Visualizing the Loss Landscape of Neural Nets<\/a>&nbsp;shows that networks with residual connections have a \u201cflatter\u201d loss surface, making it easier for models to navigate the loss landscape and find a lower, more generalizable minimum.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"2486\" height=\"1100\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI.png\" alt=\"\" class=\"wp-image-1313\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI.png 2486w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI-300x133.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI-1024x453.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI-768x340.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI-1536x680.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/qbBytawOwsmKfYlI-2048x906.png 2048w\" sizes=\"auto, (max-width: 2486px) 100vw, 2486px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Recurrent Neural Networks (RNN) are naturally great at sequence modeling problems. RNNs process the audio features step by step, making a prediction for each frame while using context from previous frames. 
We use BiRNN\u2019s because we want the context of not only the frame before each step, but the frames after it as well. This can help the model make better predictions, as each frame in the audio will have more information before making a prediction. We use Gated Recurrent Unit (GRU\u2019s) variant of RNN\u2019s as it needs less computational resources than LSTM\u2019s, and works just as well in some cases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The model outputs a probability matrix for characters which we\u2019ll use to feed into our decoder to extract what the model believes are the highest probability characters that were spoken.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class CNNLayerNorm(nn.Module):\n    \"\"\"Layer normalization built for cnns input\"\"\"\n    def __init__(self, n_feats):\n        super(CNNLayerNorm, self).__init__()\n        self.layer_norm = nn.LayerNorm(n_feats)\n\n    def forward(self, x):\n        # x (batch, channel, feature, time)\n        x = x.transpose(2, 3).contiguous() # (batch, channel, time, feature)\n        x = self.layer_norm(x)\n        return x.transpose(2, 3).contiguous() # (batch, channel, feature, time) \n\n\nclass ResidualCNN(nn.Module):\n    \"\"\"Residual CNN inspired by https:\/\/arxiv.org\/pdf\/1603.05027.pdf\n        except with layer norm instead of batch norm\n    \"\"\"\n    def __init__(self, in_channels, out_channels, kernel, stride, dropout, n_feats):\n        super(ResidualCNN, self).__init__()\n\n        self.cnn1 = nn.Conv2d(in_channels, out_channels, kernel, stride, padding=kernel\/\/2)\n        self.cnn2 = nn.Conv2d(out_channels, out_channels, kernel, stride, padding=kernel\/\/2)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n        self.layer_norm1 = CNNLayerNorm(n_feats)\n        self.layer_norm2 = CNNLayerNorm(n_feats)\n\n    def forward(self, x):\n        residual = x  # (batch, channel, feature, time)\n        x = self.layer_norm1(x)\n        
x = F.gelu(x)\n        x = self.dropout1(x)\n        x = self.cnn1(x)\n        x = self.layer_norm2(x)\n        x = F.gelu(x)\n        x = self.dropout2(x)\n        x = self.cnn2(x)\n        x += residual\n        return x # (batch, channel, feature, time)\n\n\nclass BidirectionalGRU(nn.Module):\n\n    def __init__(self, rnn_dim, hidden_size, dropout, batch_first):\n        super(BidirectionalGRU, self).__init__()\n\n        self.BiGRU = nn.GRU(\n            input_size=rnn_dim, hidden_size=hidden_size,\n            num_layers=1, batch_first=batch_first, bidirectional=True)\n        self.layer_norm = nn.LayerNorm(rnn_dim)\n        self.dropout = nn.Dropout(dropout)\n\n    def forward(self, x):\n        x = self.layer_norm(x)\n        x = F.gelu(x)\n        x, _ = self.BiGRU(x)\n        x = self.dropout(x)\n        return x\n\n\nclass SpeechRecognitionModel(nn.Module):\n    \"\"\"Speech Recognition Model Inspired by DeepSpeech 2\"\"\"\n\n    def __init__(self, n_cnn_layers, n_rnn_layers, rnn_dim, n_class, n_feats, stride=2, dropout=0.1):\n        super(SpeechRecognitionModel, self).__init__()\n        n_feats = n_feats\/\/2\n        self.cnn = nn.Conv2d(1, 32, 3, stride=stride, padding=3\/\/2)  # cnn for extracting hierarchical features\n\n        # n residual cnn layers with filter size of 32\n        self.rescnn_layers = nn.Sequential(*&#91;\n            ResidualCNN(32, 32, kernel=3, stride=1, dropout=dropout, n_feats=n_feats) \n            for _ in range(n_cnn_layers)\n        ])\n        self.fully_connected = nn.Linear(n_feats*32, rnn_dim)\n        self.birnn_layers = nn.Sequential(*&#91;\n            BidirectionalGRU(rnn_dim=rnn_dim if i==0 else rnn_dim*2,\n                             hidden_size=rnn_dim, dropout=dropout, batch_first=True)  # every layer receives (batch, time, feature)\n            for i in range(n_rnn_layers)\n        ])\n        self.classifier = nn.Sequential(\n            nn.Linear(rnn_dim*2, rnn_dim),  # birnn returns rnn_dim*2\n            nn.GELU(),\n            nn.Dropout(dropout),\n  
          nn.Linear(rnn_dim, n_class)\n        )\n\n    def forward(self, x):\n        x = self.cnn(x)\n        x = self.rescnn_layers(x)\n        sizes = x.size()\n        x = x.view(sizes&#91;0], sizes&#91;1] * sizes&#91;2], sizes&#91;3])  # (batch, feature, time)\n        x = x.transpose(1, 2) # (batch, time, feature)\n        x = self.fully_connected(x)\n        x = self.birnn_layers(x)\n        x = self.classifier(x)\n        return x<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Picking the Right Optimizer and Scheduler \u2013 AdamW with Super Convergence<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The optimizer and learning rate schedule play a very important role in getting our model to converge to the best point. Picking the right optimizer and scheduler can also save you compute time and help your model generalize better to real-world use cases. For our model, we\u2019ll be using&nbsp;<strong>AdamW<\/strong>&nbsp;with the&nbsp;<strong>One Cycle Learning Rate Scheduler<\/strong>.&nbsp;<strong>Adam<\/strong>&nbsp;is a widely used optimizer that helps your model converge more quickly, and therefore saves compute time, but it is notorious for not generalizing as well as&nbsp;<strong>Stochastic Gradient Descent<\/strong>&nbsp;AKA&nbsp;<strong>SGD<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>AdamW<\/strong>&nbsp;was first introduced in&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1711.05101\">Decoupled Weight Decay Regularization<\/a>, and is considered a \u201cfix\u201d to&nbsp;<strong>Adam<\/strong>. The paper pointed out that the original&nbsp;<strong>Adam<\/strong>&nbsp;algorithm implements weight decay incorrectly, which&nbsp;<strong>AdamW<\/strong>&nbsp;attempts to fix. 
This fix helps with&nbsp;<strong>Adam<\/strong>\u2018s generalization problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<strong>One Cycle Learning Rate Scheduler<\/strong>&nbsp;was first introduced in the paper&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1708.07120\">Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates<\/a>. This paper shows that you can train neural networks an order of magnitude faster, while keeping their generalizable abilities, using a simple trick. You start with a low learning rate, which warms up to a large maximum learning rate, then decays linearly to the same point of where you originally started.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/XhVKIzvgauDALinA.jpg\" alt=\"\" class=\"wp-image-1312\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/XhVKIzvgauDALinA.jpg 800w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/XhVKIzvgauDALinA-300x225.jpg 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/XhVKIzvgauDALinA-768x576.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Because the maximum learning rate is magnitudes higher than the lowest, you also gain some regularization benefits which helps your model generalize better if you have a smaller set of data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With PyTorch, these two methods are already part of the package.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>optimizer = optim.AdamW(model.parameters(), hparams&#91;'learning_rate'])\nscheduler = optim.lr_scheduler.OneCycleLR(optimizer,\n\tmax_lr=hparams&#91;'learning_rate'],\n\tsteps_per_epoch=int(len(train_loader)),\n\tepochs=hparams&#91;'epochs'],\n\tanneal_strategy='linear')<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">The CTC 
Loss Function \u2013 Aligning Audio to Transcript<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Our model will be trained to predict the probability distribution of all characters in the alphabet for each frame (ie, timestep) in the spectrogram we feed into the model.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"803\" height=\"774\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mQEJyPYmfQAHalLK.png\" alt=\"\" class=\"wp-image-1311\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mQEJyPYmfQAHalLK.png 803w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mQEJyPYmfQAHalLK-300x289.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mQEJyPYmfQAHalLK-768x740.png 768w\" sizes=\"auto, (max-width: 803px) 100vw, 803px\" \/><figcaption class=\"wp-element-caption\"><em>Image from\u00a0<\/em><a href=\"https:\/\/distill.pub\/2017\/ctc\/\"><em>distill.pub<\/em><\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Traditional speech recognition models would require you to align the transcript text to the audio before training, and the model would be trained to predict specific labels at specific frames.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The innovation of the CTC loss function is that it allows us to skip this step. Our model will learn to align the transcript itself during training. The key to this is the \u201cblank\u201d label introduced by CTC, which gives the model the ability to say that a certain audio frame did not produce a character. 
You can see a more detailed explanation of CTC and how it works in&nbsp;<a href=\"https:\/\/distill.pub\/2017\/ctc\/\">this excellent post<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The CTC loss function is also built into PyTorch.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>criterion = nn.CTCLoss(blank=28).to(device)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluating Your Speech Model<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When evaluating your speech recognition model, the industry-standard metric is the Word Error Rate (WER). The Word Error Rate does exactly what it says \u2013 it takes the transcription your model outputs and the true transcription, and measures the word-level error between them: the number of word substitutions, insertions, and deletions needed to turn one into the other, relative to the length of the true transcription. You can see how that\u2019s implemented&nbsp;<a href=\"https:\/\/colab.research.google.com\/drive\/1IPpwx4rX32rqHKpLz7dc8sOKspUa-YKO\">here<\/a>. Another useful metric is the Character Error Rate (CER), which measures the same kind of error at the character level between the model\u2019s output and the true labels. Together, these metrics tell you how well your model performs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For this tutorial, we\u2019ll use a \u201cgreedy\u201d decoding method to process our model\u2019s output into characters that can be combined to create the transcript. A \u201cgreedy\u201d decoder takes in the model output, which is a softmax probability matrix of characters, and for each time step (spectrogram frame), it chooses the label with the highest probability. 
If the label is a blank label, we remove it from the final transcript.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def GreedyDecoder(output, labels, label_lengths, blank_label=28, collapse_repeated=True):\n    arg_maxes = torch.argmax(output, dim=2)\n    decodes = &#91;]\n    targets = &#91;]\n    for i, args in enumerate(arg_maxes):\n        decode = &#91;]\n        targets.append(text_transform.int_to_text(labels&#91;i]&#91;:label_lengths&#91;i]].tolist()))\n        for j, index in enumerate(args):\n            if index != blank_label:\n                if collapse_repeated and j != 0 and index == args&#91;j -1]:\n                    continue\n                decode.append(index.item())\n        decodes.append(text_transform.int_to_text(decode))\n    return decodes, targets<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Training and Monitoring Your Experiments Using comet.ml<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.comet.com\/site\/\">comet.ml<\/a>&nbsp;provides a platform that allows deep learning researchers to track, compare, explain, and optimize their experiments and models. comet.ml has improved our productivity at AssemblyAI and we highly recommend using this platform for teams doing any sort of data science experiments. comet.ml is super easy to set up. 
And works with just a few lines of code.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># initialize experiment object\nexperiment = Experiment(api_key=comet_api_key, project_name=project_name)\nexperiment.set_name(exp_name)\n\n# track metrics\nexperiment.log_metric('loss', loss.item())<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.comet.com\/site\/\">comet.ml<\/a>&nbsp;provides you with a very productive dashboard where you can view and track your model\u2019s progress.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"3584\" height=\"2324\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz.png\" alt=\"\" class=\"wp-image-1310\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz.png 3584w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz-300x195.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz-1024x664.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz-768x498.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz-1536x996.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/DKQrnmKaVwPXdTZz-2048x1328.png 2048w\" sizes=\"auto, (max-width: 3584px) 100vw, 3584px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can use Comet to track metrics, code, hyper parameters, your model\u2019s graphs, among many other things! 
A really handy feature that Comet provides is the ability to compare your experiment among many other experiments.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"3584\" height=\"2324\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg.png\" alt=\"\" class=\"wp-image-1319\" srcset=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg.png 3584w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg-300x195.png 300w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg-1024x664.png 1024w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg-768x498.png 768w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg-1536x996.png 1536w, https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/mdwMVAypIpIDBFJg-2048x1328.png 2048w\" sizes=\"auto, (max-width: 3584px) 100vw, 3584px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Comet has a rich feature set that we won\u2019t cover all here, but we highly recommended using it for a productivity and sanity boost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the rest of our training script.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class IterMeter(object):\n    \"\"\"keeps track of total iterations\"\"\"\n    def __init__(self):\n        self.val = 0\n\n    def step(self):\n        self.val += 1\n\n    def get(self):\n        return self.val\n\n\ndef train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, experiment):\n    model.train()\n    data_len = len(train_loader.dataset)\n    with experiment.train():\n        for batch_idx, _data in enumerate(train_loader):\n            spectrograms, labels, input_lengths, label_lengths = _data \n            spectrograms, labels = spectrograms.to(device), labels.to(device)\n\n            optimizer.zero_grad()\n\n     
       output = model(spectrograms)  # (batch, time, n_class)\n            output = F.log_softmax(output, dim=2)\n            output = output.transpose(0, 1) # (time, batch, n_class)\n\n            loss = criterion(output, labels, input_lengths, label_lengths)\n            loss.backward()\n\n            experiment.log_metric('loss', loss.item(), step=iter_meter.get())\n            experiment.log_metric('learning_rate', scheduler.get_last_lr()&#91;0], step=iter_meter.get())\n\n            optimizer.step()\n            scheduler.step()\n            iter_meter.step()\n            if batch_idx % 100 == 0 or batch_idx == data_len:\n                print('Train Epoch: {} &#91;{}\/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                    epoch, batch_idx * len(spectrograms), data_len,\n                    100. * batch_idx \/ len(train_loader), loss.item()))\n\n\ndef test(model, device, test_loader, criterion, epoch, iter_meter, experiment):\n    print('\\nevaluating\u2026')\n    model.eval()\n    test_loss = 0\n    test_cer, test_wer = &#91;], &#91;]\n    with experiment.test():\n        with torch.no_grad():\n            for i, _data in enumerate(test_loader):\n                spectrograms, labels, input_lengths, label_lengths = _data \n                spectrograms, labels = spectrograms.to(device), labels.to(device)\n\n                output = model(spectrograms)  # (batch, time, n_class)\n                output = F.log_softmax(output, dim=2)\n                output = output.transpose(0, 1) # (time, batch, n_class)\n\n                loss = criterion(output, labels, input_lengths, label_lengths)\n                test_loss += loss.item() \/ len(test_loader)\n\n                decoded_preds, decoded_targets = GreedyDecoder(output.transpose(0, 1), labels, label_lengths)\n                for j in range(len(decoded_preds)):\n                    test_cer.append(cer(decoded_targets&#91;j], decoded_preds&#91;j]))\n                    test_wer.append(wer(decoded_targets&#91;j], 
decoded_preds&#91;j]))\n\n\n    avg_cer = sum(test_cer)\/len(test_cer)\n    avg_wer = sum(test_wer)\/len(test_wer)\n    experiment.log_metric('test_loss', test_loss, step=iter_meter.get())\n    experiment.log_metric('cer', avg_cer, step=iter_meter.get())\n    experiment.log_metric('wer', avg_wer, step=iter_meter.get())\n\n    print('Test set: Average loss: {:.4f}, Average CER: {:.4f} Average WER: {:.4f}\\n'.format(test_loss, avg_cer, avg_wer))\n\n\ndef main(learning_rate=5e-4, batch_size=20, epochs=10,\n        train_url=\"train-clean-100\", test_url=\"test-clean\",\n        experiment=Experiment(api_key='dummy_key', disabled=True)):\n\n    hparams = {\n        \"n_cnn_layers\": 3,\n        \"n_rnn_layers\": 5,\n        \"rnn_dim\": 512,\n        \"n_class\": 29,\n        \"n_feats\": 128,\n        \"stride\": 2,\n        \"dropout\": 0.1,\n        \"learning_rate\": learning_rate,\n        \"batch_size\": batch_size,\n        \"epochs\": epochs\n    }\n\n    experiment.log_parameters(hparams)\n\n    use_cuda = torch.cuda.is_available()\n    torch.manual_seed(7)\n    device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n\n    if not os.path.isdir(\".\/data\"):\n        os.makedirs(\".\/data\")\n\n    train_dataset = torchaudio.datasets.LIBRISPEECH(\".\/data\", url=train_url, download=True)\n    test_dataset = torchaudio.datasets.LIBRISPEECH(\".\/data\", url=test_url, download=True)\n\n    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}\n    train_loader = data.DataLoader(dataset=train_dataset,\n                                batch_size=hparams&#91;'batch_size'],\n                                shuffle=True,\n                                collate_fn=lambda x: data_processing(x, 'train'),\n                                **kwargs)\n    test_loader = data.DataLoader(dataset=test_dataset,\n                                batch_size=hparams&#91;'batch_size'],\n                                shuffle=False,\n                                
collate_fn=lambda x: data_processing(x, 'valid'),\n                                **kwargs)\n\n    model = SpeechRecognitionModel(\n        hparams&#91;'n_cnn_layers'], hparams&#91;'n_rnn_layers'], hparams&#91;'rnn_dim'],\n        hparams&#91;'n_class'], hparams&#91;'n_feats'], hparams&#91;'stride'], hparams&#91;'dropout']\n        ).to(device)\n\n    print(model)\n    print('Num Model Parameters', sum(&#91;param.nelement() for param in model.parameters()]))\n\n    optimizer = optim.AdamW(model.parameters(), hparams&#91;'learning_rate'])\n    criterion = nn.CTCLoss(blank=28).to(device)\n    scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=hparams&#91;'learning_rate'], \n                                            steps_per_epoch=int(len(train_loader)),\n                                            epochs=hparams&#91;'epochs'],\n                                            anneal_strategy='linear')\n\n    iter_meter = IterMeter()\n    for epoch in range(1, epochs + 1):\n        train(model, device, train_loader, criterion, optimizer, scheduler, epoch, iter_meter, experiment)\n        test(model, device, test_loader, criterion, epoch, iter_meter, experiment)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The<strong>&nbsp;train<\/strong>&nbsp;function trains the model on a full epoch of data. The&nbsp;<strong>test<\/strong>&nbsp;function evaluates the model on test data after every epoch. It reports the<strong>&nbsp;test_loss<\/strong>&nbsp;as well as the character error rate (<strong>cer<\/strong>)&nbsp;and word error rate (<strong>wer<\/strong>)&nbsp;of the model. You can start running the training script right now with GPU support in&nbsp;<a href=\"https:\/\/colab.research.google.com\/drive\/1IPpwx4rX32rqHKpLz7dc8sOKspUa-YKO\">Google Colaboratory<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to Improve Accuracy<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Speech recognition requires a ton of data and a ton of compute resources. 
The example laid out here is trained on a subset of LibriSpeech (100 hours of audio) on a single GPU. To get state-of-the-art results, you\u2019ll need to do distributed training on thousands of hours of data, on tens of GPUs spread across many machines.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another way to get a big accuracy improvement is to decode the CTC probability matrix using a language model and the CTC beam search algorithm. CTC-type models are very dependent on this decoding process to get good results. Luckily, there is a handy&nbsp;<a href=\"https:\/\/github.com\/parlance\/ctcdecode\">open-source library<\/a>&nbsp;that allows you to do that.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial was made to be accessible, so it uses a relatively small model (23 million parameters) compared to something like BERT (340 million parameters). It seems that the larger you can make your network, the better it performs, although there are diminishing returns. A larger model does not always mean better performance, though, as shown by OpenAI\u2019s research on&nbsp;<a href=\"https:\/\/openai.com\/assets\/deep-double-descent\/\">Deep Double Descent<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This model has 3 residual CNN layers and 5 bidirectional GRU layers, which should allow you to train with a reasonable batch size on a single GPU with at least 11GB of memory. You can tweak some of the hyperparameters in the main function to reduce or increase the model size for your use case and compute availability.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Latest Advancements In Speech Recognition with Deep Learning<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Deep learning is a fast-moving field. It seems like you can\u2019t go a week without some new technique getting state-of-the-art results. 
Here are a few things worth exploring in the world of speech recognition.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Transformers<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Transformers have taken the Natural Language Processing world by storm! First introduced in the paper&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, Transformers have been adapted and modified to beat the state of the art on pretty much every existing NLP task, dethroning RNN-type architectures. The Transformer\u2019s ability to see the full context of sequence data is transferable to speech as well.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Unsupervised Pre-training<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you follow deep learning closely, you\u2019ve probably heard of BERT, GPT, and GPT-2. These Transformer models are first pre-trained on a language modeling task with unlabeled text data, then fine-tuned on a wide array of NLP tasks, where they get state-of-the-art results! During pre-training, the model learns something fundamental about the statistics of language and uses that power to excel at other tasks. We believe this technique has great promise for speech data as well.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Word Piece Models<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Our model defined above outputs characters. One benefit is that the model doesn\u2019t have to worry about out-of-vocabulary words when running inference on speech. So for the word&nbsp;<strong>c h a t<\/strong>,&nbsp;each character is its own label. The downsides of using characters are inefficiency and a model that is more prone to errors, because you\u2019re predicting one character at a time. Using whole words as labels has been explored, with some degree of success. With this method, the entire word&nbsp;<strong>chat<\/strong>&nbsp;would be the label. 
But with whole words, you would have to keep an index of every possible word in the vocabulary to make a prediction, which is memory inefficient and still leaves the possibility of running into out-of-vocabulary words during prediction. The sweet spot would be using word piece or sub-word units as labels. Instead of using characters as the individual labels, you can chop up the words into sub-word units and use those as labels, e.g.&nbsp;<strong>ch at<\/strong>. This solves the out-of-vocabulary issue and is much more efficient, as it needs fewer steps to decode than using characters, without the need for an index of all possible words. Word pieces have been used successfully with many NLP models, like BERT, and would work naturally with speech recognition problems as well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post was written by Michael Nguyen, Machine Learning Research Engineer at&nbsp;AssemblyAI. AssemblyAI uses Comet to log, visualize, and understand their model development pipeline.&nbsp; Deep Learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio, and directly output transcriptions. 
Two of the most popular end-to-end models [&hellip;]<\/p>\n","protected":false},"author":140,"featured_media":18111,"parent":488,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"case-study","meta":{"customer_name":"AssemblyAI","customer_description":"Making advanced deep learning technology accessible to developers","customer_industry":"Technology - Speech AI Models","customer_technologies":"PyTorch","customer_logo":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/assembly-ai-logo-white-1.svg","footnotes":""},"coauthors":[127],"class_list":["post-13153","page","type-page","status-publish","has-post-thumbnail","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Customer Case Study: Building an end-to-end Speech Recognition model in PyTorch with AssemblyAI - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/customers\/assemblyai\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building an End-to-End Speech Recognition Model in PyTorch with AssemblyAI\" \/>\n<meta property=\"og:description\" content=\"This post was written by Michael Nguyen, Machine Learning Research Engineer at&nbsp;AssemblyAI. AssemblyAI uses Comet to log, visualize, and understand their model development pipeline.&nbsp; Deep Learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio, and directly output transcriptions. 
Two of the most popular end-to-end models [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/customers\/assemblyai\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-17T20:57:25+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/Case-study-AssemblyAI.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1842\" \/>\n\t<meta property=\"og:image:height\" content=\"650\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"13 minutes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data2\" content=\"Caroline Brady\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Customer Case Study: Building an end-to-end Speech Recognition model in PyTorch with AssemblyAI - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/","og_locale":"en_US","og_type":"article","og_title":"Building an End-to-End Speech Recognition Model in PyTorch with AssemblyAI","og_description":"This post was written by Michael Nguyen, Machine Learning Research Engineer at&nbsp;AssemblyAI. AssemblyAI uses Comet to log, visualize, and understand their model development pipeline.&nbsp; Deep Learning has changed the game in speech recognition with the introduction of end-to-end models. These models take in audio, and directly output transcriptions. 
Two of the most popular end-to-end models [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_modified_time":"2025-11-17T20:57:25+00:00","og_image":[{"width":1842,"height":650,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/Case-study-AssemblyAI.webp","type":"image\/webp"}],"twitter_card":"summary_large_image","twitter_site":"@Cometml","twitter_misc":{"Est. reading time":"13 minutes","Written by":"Caroline Brady"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/","url":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/","name":"Customer Case Study: Building an end-to-end Speech Recognition model in PyTorch with AssemblyAI - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/Case-study-AssemblyAI.webp","datePublished":"2025-03-13T18:38:36+00:00","dateModified":"2025-11-17T20:57:25+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/customers\/assemblyai\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/Case-study-AssemblyAI.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/10\/Case-study-AssemblyAI.webp","width":1842,"height":650,"caption":"Comet x 
AssemblyAI"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/customers\/assemblyai\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Customers","item":"https:\/\/www.comet.com\/site\/customers\/"},{"@type":"ListItem","position":3,"name":"Building an End-to-End Speech Recognition Model in PyTorch with AssemblyAI"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, 
Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]}]}},"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/pages\/13153","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/140"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=13153"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/pages\/13153\/revisions"}],"predecessor-version":[{"id":18475,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/pages\/13153\/revisions\/18475"}],"up":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/pages\/488"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/18111"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=13153"}],"wp:term":[{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=13153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}