{"id":5714,"date":"2023-04-24T13:53:17","date_gmt":"2023-04-24T21:53:17","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=5714"},"modified":"2025-04-24T17:15:36","modified_gmt":"2025-04-24T17:15:36","slug":"an-introduction-to-multimodal-models","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/an-introduction-to-multimodal-models\/","title":{"rendered":"An Introduction to Multimodal Models"},"content":{"rendered":"\n<p><span style=\"font-weight: 300;\">Multimodal Learning seeks to allow computers to represent real world objects and concepts using multiple data streams. This post provides an overview of diverse applications and state-of-the-art techniques for training and evaluating multimodal models.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">What Does Multimodal Mean?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">Modality refers to a type of information or representation format in which information is stored. In the context of Deep Learning, modality refers to the type of data a model processes. These data modes include images, text, audio, video, and more. By combining multiple data modes, multimodal learning creates a more comprehensive understanding of a particular object, concept, or task. This approach leverages the strengths of each data type, producing more accurate and robust predictions or classifications.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">For example, in computer vision, a multimodal model can combine image and text data to perform image captioning or visual question answering. 
By processing visual and textual information, the model can provide more accurate and detailed image descriptions.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">What Are Some of the Applications for These Types of Models?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">Multimodal Learning models have various applications, such as:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 300;\">Visual Search and Question Answering: E-commerce websites can use multimodal models to help customers find products that interest them. For example, a customer could upload a picture of a dress they like, and the website&#8217;s multimodal model would associate the picture with descriptions of similar dresses in the store&#8217;s inventory.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 300;\">Structuring Unstructured Data: Multimodal models can aid organizations in transforming unstructured data into structured data that can be analyzed. For instance, a company could use a multimodal model to extract data from images or PDFs of invoices or receipts.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 300;\">Facilitating robots&#8217; manipulation of their surroundings based on natural language instructions: Multimodal models can improve robots&#8217; comprehension of natural language instructions. For example, a robot could use a multimodal model to understand verbal instructions to &#8220;pick up the red ball&#8221; and then use computer vision to locate and pick up the red ball.<\/span><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">How Do You Train a Model to Understand Multiple Types of Data?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">The main idea behind multimodal models is to create consistent representations of a given concept across different modalities. 
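To make this concrete, here is a minimal, hypothetical sketch of such an objective: a symmetric contrastive (CLIP-style) loss that pulls matched image/text embedding pairs together and pushes mismatched pairs apart. The embeddings here are stand-in NumPy arrays rather than outputs of real encoders, and `contrastive_loss` is an illustrative name, not a library function.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Embeddings are L2-normalized so dot products are cosine similarities.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    idx = np.arange(len(logits))
    # Matched pairs sit on the diagonal; score each against all alternatives,
    # in both directions (image -> text and text -> image).
    loss_img = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_txt = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_img + loss_txt) / 2
```

With perfectly aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is exactly the pressure that makes the two encoders agree on a shared space.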
There are several ways to build these representations, but the most popular approach involves creating encoders for each modality and using an objective function that encourages the models to produce similar embeddings for similar data pairs.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">One popular objective is contrastive learning, which pulls the representations of matched data pairs together while pushing the representations of mismatched pairs apart. A similarity metric, typically cosine similarity or the dot product, measures how close two embeddings are in the model&#8217;s latent space. The model is then trained to maximize the similarity between matched pairs and minimize it between mismatched pairs.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">In this approach, the encoders are independent networks; for a matched pair they produce embeddings that are close in the shared latent space, but not identical.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Notable Models: OpenAI\u2019s CLIP<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">CLIP (Contrastive Language-Image Pre-Training) is a state-of-the-art multimodal deep learning model trained using contrastive learning. It was trained on a dataset of 400 million image-text pairs, covering a vast range of visual and textual concepts and enabling it to learn the relationships between images and text.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">CLIP addresses some major problems in the standard deep learning approach to computer vision, such as:<\/span><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"font-weight: 400;\">Costly Labeled Datasets:<\/span><span style=\"font-weight: 300;\"> CLIP learns from text-image pairs that are already publicly available on the internet. 
This reduces the need for expensive, hand-labeled datasets.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Narrow Range of Concepts:<\/span><span style=\"font-weight: 300;\"> Models trained using supervised learning are restricted to predicting the fixed set of labels present in their training dataset. To expand the model&#8217;s functionality to new classes, a practitioner would need to change the output head of the model to include these extra classes, and then retrain the model on a dataset that includes these classes. To apply CLIP to a new task, all we need to do is &#8220;tell&#8221; CLIP&#8217;s text-encoder the names of the task&#8217;s visual concepts, and it will output a linear classifier of CLIP&#8217;s visual representations.<\/span><\/li>\n\n\n\n<li><span style=\"font-weight: 400;\">Poor Generalization:<\/span><span style=\"font-weight: 300;\"> The benchmark performance of supervised models tends to overstate their real-world performance, because they can &#8220;cheat&#8221; by optimizing for their evaluation datasets. CLIP&#8217;s performance on out-of-distribution datasets is closer to its real-world performance.<\/span><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Integrating CLIP<\/h2>\n\n\n\n<p><span style=\"font-weight: 300;\">CLIP has enabled numerous advancements in multimodal learning. DeepMind leveraged image encoders from their own CLIP-like models to create a training methodology that interleaves image tokens into text sequences to train their Flamingo LLM. 
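The zero-shot mechanism described above, building a classifier out of class-name prompts, can be sketched in a few lines. Here `toy_embed` is a hypothetical stand-in for CLIP's text and image encoders (a deterministic hash-based vector) so the example runs without model weights; only the prompt-to-classifier logic mirrors what CLIP does.

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    """Hypothetical stand-in for a CLIP encoder: a deterministic unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def zero_shot_classify(image_emb, class_names, embed_text=toy_embed):
    """Classify an image embedding against text prompts, CLIP-style.

    Each class weight vector is the embedding of a natural-language prompt,
    so adding a new class only requires a new prompt -- no retraining.
    """
    weights = np.stack([embed_text(f"a photo of a {c}") for c in class_names])
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = weights @ image_emb  # cosine similarity to each class prompt
    return class_names[int(np.argmax(sims))]

# A "dog image" embedding: near the dog prompt, plus a little noise.
noise = np.random.default_rng(0).normal(scale=0.05, size=64)
dog_image = toy_embed("a photo of a dog") + noise
print(zero_shot_classify(dog_image, ["cat", "dog", "car"]))  # -> dog
```
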
StabilityAI&#8217;s Stable Diffusion model uses CLIP&#8217;s text encoder to help its generation model create images based on a text description.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">Furthermore, CLIP facilitates the development of various types of zero-shot models for computer vision tasks, such as image classification and object detection.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">How Do You Evaluate a Multimodal Model?<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">Multimodal models are judged based on the quality of their representations. For models like CLIP, their zero-shot performance on tasks such as image classification is used as a proxy for overall performance. For example, CLIP achieves a Top 1 Accuracy score of 56% and a Top 5 Accuracy score of 83% on the ImageNet Dataset.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 300;\">Models like Flamingo are also evaluated by comparing their zero-shot and few-shot performance on multimodal tasks such as visual question answering, image classification, OCR, and image captioning to task-specific models that have been trained with significantly more data.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Conclusion<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 300;\">Multimodal Learning is a rapidly growing and exciting field of computer vision and AI that has the potential to revolutionize how computers interact with the world. There has never been a better time to get involved in multimodal learning and explore the cutting-edge techniques used to train and evaluate these complex models. 
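The Top-1/Top-5 accuracy numbers quoted in the evaluation section above are straightforward to compute; here is a minimal sketch using hypothetical per-class scores (not CLIP's actual outputs).

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: (n_samples, n_classes) array of per-class scores, e.g. the
    image-to-prompt similarities a CLIP-style model produces.
    labels: (n_samples,) sequence of true class indices.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes
    return float((topk == np.asarray(labels)[:, None]).any(axis=1).mean())

# Hypothetical scores for three samples over four classes.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.4, 0.3, 0.2, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
labels = [1, 1, 2]
print(top_k_accuracy(scores, labels, k=1))  # 2 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # -> 1.0
```
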
With its diverse applications and potential to transform numerous industries, multimodal learning offers many opportunities for researchers, engineers, and enthusiasts.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Multimodal Learning seeks to allow computers to represent real world objects and concepts using multiple data streams. This post provides an overview of diverse applications and state-of-the-art techniques for training and evaluating multimodal models. What Does Multimodal Mean? Modality refers to a type of information or representation format in which information is stored. In the [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":7313,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,6],"tags":[],"coauthors":[128],"class_list":["post-5714","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>An Introduction to Multimodal Models - Comet<\/title>\n<meta name=\"description\" content=\"Multimodal models are capable of processing information from different modalities like images, videos, and text.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/an-introduction-to-multimodal-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"An Introduction to Multimodal Models\" \/>\n<meta property=\"og:description\" content=\"Multimodal models are capable 
of processing information from different modalities like images, videos, and text.\" \/>\n<meta name=\"author\" content=\"Dhruv Nair\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->"}