{"id":6615,"date":"2023-07-10T09:36:43","date_gmt":"2023-07-10T17:36:43","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6615"},"modified":"2025-04-24T17:15:13","modified_gmt":"2025-04-24T17:15:13","slug":"working-with-audio-data-for-machine-learning-in-python","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/working-with-audio-data-for-machine-learning-in-python\/","title":{"rendered":"Working with Audio Data for Machine Learning in Python"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/working-with-audio-data-for-machine-learning-in-python\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"lt bg\">\n<figure class=\"lu lv lw lx ly lt bg paragraph-image\"><picture><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*vB3QdvugPeA_4mw0PTWQeA.jpeg\" alt=\"\" width=\"2400\" height=\"1667\"><\/picture><figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mi\" href=\"https:\/\/unsplash.com\/@thomasble?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Thomas Le<\/a> on <a class=\"af mi\" href=\"https:\/\/unsplash.com\/s\/photos\/audio?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"e694\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Most of the attention, when it comes to machine learning or deep learning models, is given to computer vision or natural language sub-domain problems.<\/p>\n<p id=\"82bf\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">However, there\u2019s an ever-increasing need to process audio data, with emerging advancements in technologies like Google Home and Alexa that extract information from voice signals. As such, working with audio data has become a new trend and area of study.<\/p>\n<p id=\"7b55\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The possible applications extend to voice recognition, music classification, tagging, and generation, and are paving the way for audio use cases to become the new era of deep learning.<\/p>\n<h2 id=\"6aa2\" class=\"ng nh fo be ni nj nk nl nm nn no np nq mt nr ns nt mx nu nv nw nb nx ny nz oa bj\" data-selectable-paragraph=\"\">Audio File Overview<\/h2>\n<p id=\"c06f\" class=\"pw-post-body-paragraph mj mk fo be b ml ob mn mo mp oc mr ms mt od mv mw mx oe mz na nb of nd ne nf fh bj\" data-selectable-paragraph=\"\">Sound are pressure waves, and these waves can be represented by numbers over a time period. These air pressure differences communicates with the brain. 
Audio files are generally stored in .wav format and need to be digitized using the concept of sampling.

> The **sampling frequency** (or sample rate) is the number of samples (data points) per second in a sound. For example, if the sampling frequency is 44.1 kHz, a recording with a duration of 60 seconds will contain 2,646,000 samples. In practice, sampling at even higher rates (e.g., 10x the highest signal frequency) helps measure the amplitude correctly in the time domain.
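The arithmetic behind that example is simply rate times duration; a quick sketch:

```python
sr = 44100        # sampling frequency: 44.1 kHz
duration = 60     # recording length in seconds
n_samples = sr * duration
print(n_samples)  # 2646000
```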
## Loading and Visualizing an Audio File in Python

Librosa is a Python library that helps us work with audio data.
For complete documentation, you can also refer to this [link](https://librosa.github.io/).

1. Install the library: `pip install librosa`

2. Loading the file: The audio file is loaded into a NumPy array after being sampled at a particular sample rate (`sr`).

```python
import librosa

# Path of the audio file
audio_data = 'sampleaudio.wav'

# librosa.load returns the audio time series as a NumPy array together with
# the sampling rate; by default the signal is resampled to sr=22050 (22.05 kHz)
x, sr = librosa.load(audio_data)

# Passing sr=None preserves the file's native sampling rate instead
x, sr = librosa.load(audio_data, sr=None)

# We can also resample explicitly, e.g. at 44.1 kHz
x, sr = librosa.load(audio_data, sr=44100)
```

3. Playing audio: Using `IPython.display.Audio`, we can play the audio file in a Jupyter Notebook with the command `IPython.display.Audio(audio_data)`.

4. Waveform visualization: To visualize the sampled signal and plot it, we need two Python libraries: Matplotlib and Librosa.
The following code depicts the waveform visualization of the amplitude vs. time representation of the signal.

```python
%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display

plt.figure(figsize=(14, 5))
# Plot the sampled signal (waveshow replaces the older waveplot,
# which was removed in librosa 0.10)
librosa.display.waveshow(x, sr=sr)
```

5. Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. Spectrograms are time-frequency portraits of signals.
Using a spectrogram, we can see how energy levels (dB) vary over time.

```python
# x: NumPy array holding the audio time series
# Short-time Fourier transform (STFT) of the signal
X = librosa.stft(x)
# Convert the amplitudes into energy levels (dB)
Xdb = librosa.amplitude_to_db(abs(X))

plt.figure(figsize=(20, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()
```
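For intuition, `amplitude_to_db` is essentially a logarithmic rescaling. Here is a rough sketch of what it computes with its default parameters (`ref=1.0`, `amin=1e-5`, `top_db=80`), not a drop-in replacement:

```python
import numpy as np

amplitude = np.abs(librosa.stft(x))
# Approximately 20*log10(amplitude), with a floor (amin) to avoid log(0)...
db_manual = 20 * np.log10(np.maximum(amplitude, 1e-5))
# ...clipped so the dynamic range stays within top_db (80 dB by default)
db_manual = np.maximum(db_manual, db_manual.max() - 80.0)
```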
6. Log-frequency axis: Features can be obtained from a spectrogram by converting the linear frequency axis, as shown above, into a logarithmic axis. The resulting representation is also called a log-frequency spectrogram. The code we need to write here is:

`librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')`

## Creating an Audio Signal and Saving It

A digitized audio signal is a NumPy array with a specified frequency and sample rate. The analog wave format of the audio signal represents a function (i.e., sine, cosine, etc.).
We need to save the composed audio signal generated from the NumPy array. This kind of audio creation could be used in applications that require voice-to-text translation in audio-enabled bots or search engines.

```python
import numpy as np
import IPython.display as ipd
import soundfile as sf

sr = 22050  # sample rate
T = 5.0     # duration in seconds
t = np.linspace(0, T, int(T*sr), endpoint=False)  # time variable
x = 0.5*np.sin(2*np.pi*220*t)  # pure sine wave at 220 Hz

# Play the generated audio from the NumPy array
ipd.Audio(x, rate=sr)

# Write the wave file in .wav format (librosa.output.write_wav was removed
# in librosa 0.8; the soundfile package is the recommended replacement)
sf.write('generated.wav', x, sr)
```

So far, so good. Easy and fun to learn. But data pre-processing steps can be difficult and memory-consuming, as we'll often have to deal with audio signals that are longer than 1 second. Compared to the number of pixels in each training item in popular image datasets such as MNIST or CIFAR, the number of data points in digital audio is much higher, and this may lead to memory issues.

## Pre-processing of Audio Signals

### Normalization

Normalization is a technique used to adjust the volume of audio files to a standard set level; if this isn't done, the volume can differ greatly from word to word, and the file can end up unable to be processed clearly.

```python
import sklearn.preprocessing

# Scale each row of the feature vector to the [0, 1] range
# (min/max = the minimum/maximum value of each row of the vector)
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

# Spectral centroid of each frame, plus matching time stamps for plotting
spectral_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
t = librosa.frames_to_time(range(len(spectral_centroids)))

# Plot the normalized spectral centroid along the waveform
librosa.display.waveshow(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r')
```
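Note that the snippet above normalizes a derived feature (the spectral centroid) so it can be overlaid on the waveform. To normalize the loudness of the raw signal itself, a simple peak normalization works; a minimal sketch, where the helper name `peak_normalize` is just illustrative:

```python
import numpy as np

def peak_normalize(x, peak=1.0):
    # Scale the signal so its largest absolute sample equals `peak`
    return x * (peak / np.max(np.abs(x)))

x_norm = peak_normalize(x)
```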
### Pre-emphasis

Pre-emphasis is done before starting feature extraction. We do this by boosting only the signal's high-frequency components, while leaving the low-frequency components in their original state.
This is done to compensate for the high-frequency components, which are naturally suppressed when humans produce sound.

```python
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load('audio_file.wav', offset=30, duration=10)
y_filt = librosa.effects.preemphasis(y)

# Plot the results for comparison
S_orig = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
S_preemph = librosa.amplitude_to_db(np.abs(librosa.stft(y_filt)), ref=np.max)

plt.subplot(2, 1, 1)
librosa.display.specshow(S_orig, y_axis='log', x_axis='time')
plt.title('Original signal')
plt.subplot(2, 1, 2)
librosa.display.specshow(S_preemph, y_axis='log', x_axis='time')
plt.title('Pre-emphasized signal')
```
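Under the hood, pre-emphasis is just a first-order difference filter, y[n] = x[n] - coef * x[n-1]. A minimal NumPy sketch of the same idea (coef=0.97 matches librosa's default; the boundary handling at n=0 differs slightly from librosa's):

```python
import numpy as np

def manual_preemphasis(x, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]: attenuates low, boosts high frequencies
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coef * x[:-1]
    return y
```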
## Feature Extraction from Audio Signals

Up until now, we've gone through a basic overview of audio signals and how they can be visualized in Python. To take us one step closer to model building, let's look at the various ways to extract features from this data.

### Zero Crossing Rate

The zero crossing rate is the number of times over a given interval that the signal's amplitude crosses a value of zero. Essentially, it denotes the number of times the signal changes sign from positive to negative in the given time period. If the count of zero crossings is higher for a given signal, the signal is said to be changing rapidly, which implies that it contains high-frequency information, and vice versa.
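Besides counting crossings over a raw slice of samples (as the snippet below does), librosa also exposes this as a frame-level feature; a minimal sketch, assuming `x` is the signal loaded earlier:

```python
# Fraction of sign changes within each analysis frame
zcr = librosa.feature.zero_crossing_rate(x)
print(zcr.shape)   # (1, number of frames)
print(zcr.mean())  # average rate across the whole signal
```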
```python
# Find the zero crossings within a slice of the signal
n0 = 9000
n1 = 9100
plt.figure(figsize=(20, 5))
plt.plot(x[n0:n1])
plt.grid()

# Boolean array marking each sample where the sign changes
zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
# Count of zero crossings in the slice
print(zero_crossings.sum())
```

### Spectral Rolloff

The rolloff frequency is defined as the frequency below which a specified fraction of the total spectral energy, e.g., 85%, is contained. It can be used to distinguish between harmonic and noisy sounds.

```python
# librosa.example replaces the removed librosa.util.example_audio_file()
y, sr = librosa.load(librosa.example('trumpet'))

# Approximate maximum frequencies with roll_percent=0.85 (the default)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)

# Approximate minimum frequencies with roll_percent=0.1
rolloff_min = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.1)
```
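To see the rolloff in context, it can be plotted as a curve over time; a minimal sketch, assuming the arrays computed above:

```python
# Map each analysis frame to its time stamp and plot both rolloff curves
times = librosa.times_like(rolloff)
plt.figure(figsize=(15, 5))
plt.plot(times, rolloff[0], label='Rolloff (85%)')
plt.plot(times, rolloff_min[0], label='Rolloff (10%)')
plt.legend()
```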
### MFCC

One popular audio feature extraction method is Mel-frequency cepstral coefficients (MFCC), which in its classic formulation has 39 features per frame. The feature count is small enough to force the model to learn only the most relevant information in the audio; 12 of the parameters are related to the amplitude of the frequencies. The extraction flow of MFCC features is as follows:
1. Framing and windowing: The continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples; each frame is then transformed into its frequency spectrum.

2. Mel frequency wrapping: For each tone with a frequency f, a pitch is measured on the Mel scale. This scale uses linear spacing for frequencies below 1000 Hz and transforms frequencies above 1000 Hz using a logarithmic function.

3. Cepstrum: The log-Mel spectrum is converted back to the time domain. This provides a good representation of a signal's local spectral properties, and the result is the MFCC features.

The MFCC features can be extracted using the Librosa Python library we installed earlier:

`librosa.feature.mfcc(y=x, sr=sr)`

where `x` is the time-domain NumPy series and `sr` is the sampling rate.
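To inspect what comes back, a quick sketch (`n_mfcc`, which defaults to 20, controls how many coefficients are returned per frame):

```python
# Extract 20 MFCCs and visualize how they evolve over time
mfccs = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=20)
print(mfccs.shape)  # (n_mfcc, number of frames)

plt.figure(figsize=(15, 5))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
```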
522px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*LKeX6j2QZTyGVWFzQAla9A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*LKeX6j2QZTyGVWFzQAla9A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*LKeX6j2QZTyGVWFzQAla9A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*LKeX6j2QZTyGVWFzQAla9A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*LKeX6j2QZTyGVWFzQAla9A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*LKeX6j2QZTyGVWFzQAla9A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1044\/1*LKeX6j2QZTyGVWFzQAla9A.png 1044w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 522px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<h2 id=\"e829\" class=\"ng nh fo be ni nj nk nl nm nn no np nq mt nr ns nt mx nu nv nw nb nx ny nz oa bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Chroma Frequencies<\/strong><\/h2>\n<p id=\"cbcf\" class=\"pw-post-body-paragraph mj mk fo be b ml ob mn mo mp oc mr ms mt od mv mw mx oe mz na nb of nd ne nf fh bj\" data-selectable-paragraph=\"\">The entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. The human perception of pitch is periodic in the sense that two pitches are perceived as similar if they differ by one or several octaves (where 1 octave=12 pitches).<\/p>\n<pre>x, sr = librosa.load('audio.wav')\nipd.Audio(x, rate=sr)\n\nhop_length = 512\n# returns normalized energy for each chroma bin at each frame.\nchromagram = librosa.feature.chroma_stft(x, sr=sr, hop_length=hop_length)\nplt.figure(figsize=(15, 5))\nlibrosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')<\/pre>\n<figure class=\"oh oi oj ok ol lt me mf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:550\/1*FzFmSQfOLL5hEB7dX2DBdQ.png\" alt=\"\" width=\"550\" height=\"237\"><\/figure><div class=\"me mf rk\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*FzFmSQfOLL5hEB7dX2DBdQ.png 1100w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 
## Conclusion

In this article on how to work with audio signals in Python, we covered the following sub-topics:

- Loading and visualizing audio signals
- Techniques for pre-processing audio data: pre-emphasis and normalization
- Feature extraction from audio files: zero crossing rate, MFCC, and chroma frequencies

Thanks for sticking with us till the end!