{"id":8532,"date":"2024-01-12T06:00:36","date_gmt":"2024-01-12T14:00:36","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8532"},"modified":"2025-04-24T17:03:33","modified_gmt":"2025-04-24T17:03:33","slug":"generating-images-from-audio-with-machine-learning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\/","title":{"rendered":"Generating Images from Audio with Machine\u00a0Learning"},"content":{"rendered":"\n<figure class=\"wp-block-image lp lq lr ls lt lu lm ln paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg\" alt=\"green audio waves\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading ma mb fr be mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx bj\" id=\"d13b\">Quick Summary<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"03ae\">In this article, I\u2019ll show you how to create amazing images from audio using the magic of Machine Learning and the Transformers models. I\u2019ll explain each step clearly, uncover the secrets behind Whisper, and highlight the incredible abilities of Hugging Face models. By the end, you\u2019ll know how to transform audio into stunning images with these powerful tools.<\/p>\n\n\n\n<h2 class=\"wp-block-heading ma mb fr be mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx bj\" id=\"4a10\">Introduction<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"c7d8\">Have you ever listened to someone describe a scene in a speech or audiobook and pictured it in your head? 
Using AI, we can automatically turn spoken words into images by transcribing audio and generating pictures.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"5374\">In this guide, I will demonstrate how to generate images from audio using pre-trained models. The key steps are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Transcribing audio to text using the Whisper speech recognition model<\/li>\n\n\n\n<li>Summarizing the text transcripts using a transformer model<\/li>\n\n\n\n<li>Generating images from the text summaries with Stable Diffusion<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"17d3\">I will walk through Python code examples for each stage of this audio-to-image tutorial. By the end, you will understand how to combine speech recognition, text summarization, and text-to-image generation models to produce relevant images from audio input.<\/p>\n\n\n\n<h2 class=\"wp-block-heading ma mb fr be mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx bj\" id=\"3825\">Whisper and Hugging Face Models: A Deep Dive<\/h2>\n\n\n\n<h3 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"9083\"><strong class=\"al\">Whisper: The Automatic Speech Recognition System<\/strong><\/h3>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"6de4\">Our journey into audio-to-image transformation begins with <a class=\"af pa\" href=\"https:\/\/openai.com\/research\/whisper\" target=\"_blank\" rel=\"noopener ugc nofollow\">Whisper<\/a>, an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper ASR plays a pivotal role in our project. 
It\u2019s the initial step where audio gets transformed into text. In this pipeline we don\u2019t turn sound waves directly into images; text is the intermediary, and Whisper fulfills that role well. It ensures that the spoken words in the audio are accurately represented in written form.<\/p>\n\n\n\n<figure class=\"wp-block-image pc pd pe pf pg lu lm ln paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*WJEHQFvpvYC7F-vjVGaX3A.png\" alt=\"Whisper Architecture for audio and machine learning\"\/><figcaption class=\"wp-element-caption\">Whisper Architecture \u2014 Credit: OpenAI<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"0fdf\"><strong class=\"al\">Pre-Trained Models from Hugging Face<\/strong><\/h3>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"9df9\"><a class=\"af pa\" href=\"https:\/\/huggingface.co\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"na fs\">Hugging Face<\/strong><\/a> is a hub for developers. It\u2019s packed with pre-trained models for language, audio, and vision tasks. Whether you need to understand text, translate languages, summarize paragraphs, or generate images, Hugging Face has your back.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"cab8\">In our journey to turn audio into images, we picked two models from the Hugging Face model hub:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong class=\"na fs\">T5 (Text-to-Text Transfer Transformer): <\/strong>This model frames every NLP task in a unified text-to-text format. In this format, both the input and output are always text strings. 
This versatility allows one T5 model to handle multiple tasks effectively. Imagine it as a jack of all NLP trades! You can dive deeper into T5 in this <a class=\"af pa\" href=\"https:\/\/blog.research.google\/2020\/02\/exploring-transfer-learning-with-t5.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Google Research blog<\/a> and access it on the hub <a class=\"af pa\" href=\"https:\/\/huggingface.co\/t5-small\" target=\"_blank\" rel=\"noopener ugc nofollow\">here<\/a>.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image pc pd pe pf pg lu lm ln paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*60g54I1mCfKKYv3wTgfq8A.gif\" alt=\"T5 model architecture\"\/><figcaption class=\"wp-element-caption\">T5 model architecture<\/figcaption><\/figure>\n\n\n\n<p>&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong class=\"na fs\">Stable Diffusion: <\/strong><a class=\"af pa\" href=\"https:\/\/huggingface.co\/runwayml\/stable-diffusion-v1-5\" target=\"_blank\" rel=\"noopener ugc nofollow\">Stable Diffusion <\/a>is a text-to-image AI model created by CompVis, Stability AI, and LAION using latent diffusion, an efficient image generation technique proposed in the paper \u201c<a class=\"af pa\" href=\"https:\/\/paperswithcode.com\/paper\/high-resolution-image-synthesis-with-latent\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"na fs\">High-Resolution Image Synthesis with Latent Diffusion Models<\/strong><\/a><strong class=\"na fs\">.<\/strong>\u201d<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"1399\">The Stable Diffusion model combines three components: a variational autoencoder (VAE) that compresses images into a latent space and decodes them back, a UNet that denoises those latents step by step, and a CLIP text encoder that conditions the denoising on the prompt. 
It is trained on LAION-5B data to generate 512&#215;512 images matching text prompts on consumer GPUs.<\/p>\n\n\n\n<figure class=\"wp-block-image pc pd pe pf pg lu lm ln paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*7Bljw25WbtoPOH49ZWklYw.png\" alt=\"Stable Diffusion Architecture for audio and machine learning\"\/><figcaption class=\"wp-element-caption\">Stable Diffusion Architecture<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"c3b0\">Innovations like latent diffusion and CLIP make Stable Diffusion an accessible, state-of-the-art text-to-image synthesis model. You can learn more about the model <a class=\"af pa\" href=\"https:\/\/github.com\/Stability-AI\/StableDiffusion\" target=\"_blank\" rel=\"noopener ugc nofollow\">here.<\/a><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"c383\">Now, we\u2019re going under the hood. We\u2019ll explore the code that drives our audio-to-image magic. But don\u2019t worry, I\u2019ll explain it step by step.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"7401\"><strong class=\"al\">Prerequisites<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"3407\">Before we start our audio-to-image adventure, let\u2019s make sure we have a few things in place:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong class=\"na fs\">Python Knowledge:<\/strong> You should know a bit about Python, the programming language we\u2019ll use. 
Don\u2019t worry; you don\u2019t need to be a Python pro, just familiar with the basics.<\/li>\n\n\n\n<li><strong class=\"na fs\">Some NLP Understanding:<\/strong> We\u2019ll deal with text and language processing, so having a basic idea of NLP concepts, like how we handle words and sentences, will be helpful.<\/li>\n\n\n\n<li><strong class=\"na fs\">Curiosity:<\/strong> Most importantly, bring your curiosity and interest. This project is all about exploring cool tech stuff and being creative. So, let\u2019s get started!<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading ma mb fr be mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx bj\" id=\"9a4c\">Model Setup and Code Walkthrough<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"3fa1\">In this section, we\u2019ll get our tools ready. We\u2019ll use a Colab notebook (the free version is enough), the powerful Whisper ASR system, and some pre-trained models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"15b8\"><strong class=\"al\">Installing the Libraries<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"152c\">We first need to install some essential Python libraries to get started with audio-to-image generation. 
These will provide the abilities our code needs.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"196d\">The core libraries we\u2019ll install are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>PyTorch for deep learning capabilities<\/li>\n\n\n\n<li><a href=\"https:\/\/huggingface.co\/docs\/transformers\/index\">Transformers<\/a> for natural language processing models<\/li>\n\n\n\n<li><a href=\"https:\/\/huggingface.co\/docs\/diffusers\/index\">Diffusers<\/a> for diffusion models like Stable Diffusion<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"6e90\">You can install these essential libraries using pip:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"99d7\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">!pip install torch transformers <\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"833b\">And for the Diffusers:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"fee4\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">!pip install --upgrade diffusers[torch]<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"8868\">With the essential libraries installed, we can now import them to start using their capabilities:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"2a7e\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> transformers <span class=\"hljs-keyword\">import<\/span> pipeline\n<span class=\"hljs-keyword\">from<\/span> diffusers <span class=\"hljs-keyword\">import<\/span> StableDiffusionPipeline\n<span class=\"hljs-keyword\">import<\/span> 
torch\n<span class=\"hljs-keyword\">import<\/span> os\n<span class=\"hljs-keyword\">from<\/span> IPython.display <span class=\"hljs-keyword\">import<\/span> Image, display<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"3328\">These libraries will enable us to perform audio transcription, text summarization, and image generation as part of our project. If you\u2019re all set, we can now dive into the next steps of the process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"7a34\"><strong class=\"al\">Loading the Models<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"a900\">Let\u2019s set up the models for handling audio, text, and images.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"eb3d\">First, we need a way to convert spoken audio into text. We will use Whisper, an automatic speech recognition model trained by OpenAI. 
Rather than loading the model directly, we can use the <a href=\"https:\/\/huggingface.co\/learn\/nlp-course\/chapter1\/3?fw=pt#working-with-pipelines\">pipeline function<\/a> to load Whisper like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"2200\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># Load the Whisper model using a pipeline<\/span>\nwhisper_pipeline = pipeline(<span class=\"hljs-string\">\"automatic-speech-recognition\"<\/span>, model=<span class=\"hljs-string\">\"openai\/whisper-base\"<\/span>)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"7097\">Whisper comes in several model sizes, each trading speed against accuracy. The code above loads the multilingual <strong class=\"na fs\">base<\/strong> checkpoint, <code class=\"cw py pz qa pq b\">openai\/whisper-base<\/code>. For English-only audio, the <strong class=\"na fs\">.en<\/strong> variants such as <code class=\"cw py pz qa pq b\">openai\/whisper-base.en<\/code> tend to perform better, especially at the tiny and base sizes.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"2d72\">Next, we need to load a model to generate images from text. 
For this, we will use Stable Diffusion, a text-to-image diffusion model.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"01e2\">We load the Stable Diffusion model like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"9f6d\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">\n<span class=\"hljs-comment\"># Load the weights in half precision to reduce GPU memory use<\/span>\nmodel_id = <span class=\"hljs-string\">\"runwayml\/stable-diffusion-v1-5\"<\/span>\ntext_to_image_pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)\n<span class=\"hljs-comment\"># Move the pipeline to the GPU (requires a CUDA runtime)<\/span>\ntext_to_image_pipe = text_to_image_pipe.to(<span class=\"hljs-string\">\"cuda\"<\/span>)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"1c2f\">Calling <code class=\"cw py pz qa pq b\">StableDiffusionPipeline.from_pretrained(model_id)<\/code> downloads the pretrained weights and configuration and builds the pipeline. Passing <code class=\"cw py pz qa pq b\">torch_dtype=torch.float16<\/code> loads the weights in half precision, and <code class=\"cw py pz qa pq b\">.to(\"cuda\")<\/code> moves the pipeline to the GPU, so make sure a GPU runtime is enabled in Colab. Now, the text-to-image generation capabilities are ready to use.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"3906\">In addition to transcription and image generation, we want to summarize long texts. For this purpose, we load the T5-small model:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"4fe3\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">summarization_pipeline = pipeline(<span class=\"hljs-string\">\"summarization\"<\/span>, model=<span class=\"hljs-string\">\"t5-small\"<\/span>, tokenizer=<span class=\"hljs-string\">\"t5-small\"<\/span>)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"2988\">At this point, we have set up Whisper, Stable Diffusion, and now T5-small. 
Our pipelines for speech-to-text, text-to-image, and text summarization are initialized and ready for us to start feeding in audio.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"0278\"><strong class=\"al\">Transcribing Audio to Text<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"ba2e\">To transcribe audio into text, we\u2019ll wrap the Whisper pipeline in a small Python function:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"4fc4\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">\n<span class=\"hljs-comment\"># Function to transcribe audio<\/span>\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">transcribe_audio<\/span>(<span class=\"hljs-params\">audio_path<\/span>):\n    <span class=\"hljs-comment\"># Transcribe the audio using the Whisper pipeline<\/span>\n    <span class=\"hljs-comment\"># The pipeline returns a dict; the transcript is under the \"text\" key<\/span>\n    transcribed_text = whisper_pipeline(audio_path)[<span class=\"hljs-string\">\"text\"<\/span>]\n    <span class=\"hljs-keyword\">return<\/span> transcribed_text<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"818f\">The transcribe_audio function accepts an audio file path and handles the speech-to-text transcription process. Under the hood, it passes the audio file to the Whisper automatic speech recognition model we loaded earlier using the pipeline. Whisper analyzes the audio, detects speech, and transcribes the spoken words into text. 
This text transcript is then returned by the transcribe_audio function as the output.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"16d4\">Let\u2019s look at an example of using transcribe_audio:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"08c5\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">\n<span class=\"hljs-comment\"># Example usage<\/span>\naudio_path = <span class=\"hljs-string\">\"\/content\/audio.wav\"<\/span>  <span class=\"hljs-comment\"># Replace with your audio file path<\/span>\ntranscribed_text = transcribe_audio(audio_path)\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">\"Transcribed Text:\"<\/span>, transcribed_text)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"41bd\">The code above shows how we can call transcribe_audio, passing our audio file path to get back the text transcript. This demonstrates the simplicity of using the function to transcribe any audio file into text.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"e3ec\"><strong class=\"na fs\">Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"3952\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">Transcribed Text: The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. 
A zestful food is the hot cross bun.<\/span><\/pre>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"a792\"><strong class=\"al\">Summarizing the Transcript<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"9365\">After transcribing the audio, the resulting text transcript can often be lengthy. Stable Diffusion\u2019s CLIP text encoder only reads the first 77 tokens of a prompt, so long descriptions get truncated and tend to produce muddled images.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"c42f\">To address this, we create a summary that concisely captures the essence of the transcript.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"e479\">We\u2019ll use the T5-small summarization pipeline loaded earlier and create a function to generate summaries:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"5e15\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">\n<span class=\"hljs-comment\"># Function to summarize text<\/span>\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">summarize_text<\/span>(<span class=\"hljs-params\">text<\/span>):\n    <span class=\"hljs-comment\"># len(text) counts characters, not tokens, so for any real transcript this evaluates to 15<\/span>\n    max_length = <span class=\"hljs-built_in\">min<\/span>(<span class=\"hljs-built_in\">len<\/span>(text) + <span class=\"hljs-number\">10<\/span>, <span class=\"hljs-number\">15<\/span>)  <span class=\"hljs-comment\"># Cap the summary at 15 tokens<\/span>\n    summary = summarization_pipeline(text, max_length=max_length, min_length=<span class=\"hljs-number\">5<\/span>, do_sample=<span class=\"hljs-literal\">False<\/span>)[<span class=\"hljs-number\">0<\/span>][<span class=\"hljs-string\">'summary_text'<\/span>]\n    <span 
class=\"hljs-keyword\">return<\/span> summary\n\nsummarized_text = summarize_text(transcribed_text)\n<span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">\"Summarized Text:\"<\/span>, summarized_text)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"9248\">This function calls the Transformers summarization pipeline with T5-small. It caps the summary at 15 tokens, requires at least 5, and sets <code class=\"cw py pz qa pq b\">do_sample=False<\/code> so decoding is deterministic. Finally, we print the summarized text.<\/p>\n\n\n\n<h2 class=\"wp-block-heading oj mb fr be mc ok ol om mg on oo op mk nj oq or os nn ot ou ov nr ow ox oy oz bj\" id=\"fa60\"><strong class=\"al\">Synthesizing Images from Text<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"559e\">With the summarized text, the next step is to transform it into cool visual imagery. This is where our project reaches its artistic peak. 
To achieve this, we pass the summary to the Stable Diffusion pipeline as a prompt.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span id=\"823f\" class=\"pt mb fr pq b bf pu pv l pw px\" data-selectable-paragraph=\"\">\n<span class=\"hljs-comment\"># Generate image from summarized text<\/span>\ngenerated_image = text_to_image_pipe(summarized_text).images[<span class=\"hljs-number\">0<\/span>]\n<span class=\"hljs-comment\"># Save the generated image to disk<\/span>\ngenerated_image.save(<span class=\"hljs-string\">\"generated_image.png\"<\/span>)\n<span class=\"hljs-comment\"># Display the generated image in the notebook<\/span>\ndisplay(generated_image)<\/span><\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"ee99\">In this code, we pass the summarized text to the Stable Diffusion pipeline as a prompt. The pipeline returns its results in an <code class=\"cw py pz qa pq b\">images<\/code> list of PIL images, and we take the first one. Once the image is generated, we can display it or save it as a file for later use. 
The result is a captivating visual representation of the spoken words, ready to be admired or shared.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"529a\">Here is the output:<\/p>\n\n\n\n<figure class=\"wp-block-image pc pd pe pf pg lu lm ln paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:512\/1*0gEjsF_mc2m8kcuSTEVGiQ.png\" alt=\"piles of salt, 2 in bowls, and 1 lemon\"\/><figcaption class=\"wp-element-caption\">Generated Image<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading ma mb fr be mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx bj\" id=\"b454\">What\u2019s Next?<\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv fk bj\" id=\"397d\">Now that you\u2019ve seen how we transform audio into amazing images, how about taking it a step further? Imagine creating your own app where you and others can turn spoken words into captivating visuals.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph my mz fr na b nb nw nd ne nf nx nh ni nj ny nl nm nn nz np nq nr oa nt nu nv fk bj\" id=\"ce19\">And here\u2019s the best part: you don\u2019t need to be a coding expert. User-friendly tools like Gradio and Streamlit are fantastic choices for building your app. So, what are you waiting for? Dive into this creative journey and let your imagination take the lead with audio and machine learning!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Quick Summary In this article, I\u2019ll show you how to create amazing images from audio using the magic of Machine Learning and the Transformers models. I\u2019ll explain each step clearly, uncover the secrets behind Whisper, and highlight the incredible abilities of Hugging Face models. 
By the end, you\u2019ll know how to transform audio into stunning [&hellip;]<\/p>\n","protected":false},"author":117,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[7],"tags":[],"coauthors":[214],"class_list":["post-8532","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Audio and Machine Learning: Image Generation<\/title>\n<meta name=\"description\" content=\"Learn how to create amazing images from audio and Machine Learning and the Transformers models. Dive into the article.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Generating Images from Audio with Machine\u00a0Learning\" \/>\n<meta property=\"og:description\" content=\"Learn how to create amazing images from audio and Machine Learning and the Transformers models. 
Dive into the article.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-01-12T14:00:36+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:03:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg\" \/>\n<meta name=\"author\" content=\"Joas Pambou\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Joas Pambou\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Audio and Machine Learning: Image Generation","description":"Learn how to create amazing images from audio and Machine Learning and the Transformers models. Dive into the article.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning","og_locale":"en_US","og_type":"article","og_title":"Generating Images from Audio with Machine\u00a0Learning","og_description":"Learn how to create amazing images from audio and Machine Learning and the Transformers models. 
Dive into the article.","og_url":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-01-12T14:00:36+00:00","article_modified_time":"2025-04-24T17:03:33+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg","type":"","width":"","height":""}],"author":"Joas Pambou","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Joas Pambou","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\/"},"author":{"name":"Joas Pambou","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/57ba61eb4732a196c58dc4fda404bf76"},"headline":"Generating Images from Audio with Machine\u00a0Learning","datePublished":"2024-01-12T14:00:36+00:00","dateModified":"2025-04-24T17:03:33+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\/"},"wordCount":1411,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg","articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning\/","url":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning","name":"Audio and Machine Learning: Image 
Generation","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg","datePublished":"2024-01-12T14:00:36+00:00","dateModified":"2025-04-24T17:03:33+00:00","description":"Learn how to create amazing images from audio and Machine Learning and the Transformers models. Dive into the article.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*dIcYQqeGSOJlU48nUuIjaA.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/generating-images-from-audio-with-machine-learning#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Generating Images from Audio with Machine\u00a0Learning"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/57ba61eb4732a196c58dc4fda404bf76","name":"Joas Pambou","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/b8b6da574a6f75b9323f92cb3dac6a34","url":"https:\/\/secure.gravatar.com\/avatar\/2196ab692938538189a06d69b675c707c422352bc9f7c5ea20ce88b125b8ad2e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/2196ab692938538189a06d69b675c707c422352bc9f7c5ea20ce88b125b8ad2e?s=96&d=mm&r=g","caption":"Joas 
Pambou"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/joaspambougmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8532","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/117"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8532"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8532\/revisions"}],"predecessor-version":[{"id":15405,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8532\/revisions\/15405"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8532"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8532"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8532"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8532"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}