skip to Main Content
Comet Launches Prompt Management Tools to revolutionize your Large Language Model workflow.

An Introduction to Multimodal Models

Multimodal Learning seeks to allow computers to represent real-world objects and concepts using multiple data streams. This post provides an overview of diverse applications and state-of-the-art techniques for training and evaluating multimodal models.

What Does Multimodal Mean?

Modality refers to a type of information or representation format in which information is stored. In the context of Deep Learning, modality refers to the type of data a model processes. These data modes include images, text, audio, video, and more. By combining multiple data modes, multimodal learning creates a more comprehensive understanding of a particular object, concept, or task. This approach leverages the strengths of each data type, producing more accurate and robust predictions or classifications.

For example, in computer vision, a multimodal model can combine image and text data to perform image captioning or visual question answering. By processing visual and textual information, the model can provide more accurate and detailed image descriptions.

What Are Some of the Applications for These Types of Models?

Multimodal Learning models have various applications, such as:

  • Visual Search and Question Answering: E-commerce websites can use multimodal models to help customers find products they are interested in. For example, a customer could upload a picture of a dress they like, and the website’s multimodal model would associate the picture with descriptions of similar dresses in the store’s inventory.
  • Structuring Unstructured Data: Multimodal models can aid organizations in transforming unstructured data into structured data that can be analyzed. For instance, a company could use a multimodal model to extract data from images or PDFs of invoices or receipts.
  • Facilitating robots’ manipulation of their surroundings based on natural language instructions: Multimodal models can improve robots’ comprehension of natural language instructions. For example, a robot could use a multimodal model to understand verbal instructions to “pick up the red ball” and then use computer vision to locate and pick up the red ball.

How Do You Train a Model to Understand Multiple Types of Data?

The main idea behind multimodal models is to create consistent representations of a given concept across different modalities. There are several ways to build these representations, but the most popular approach involves creating encoders for each modality and using an objective function that encourages the models to produce similar embeddings for similar data pairs.

One popular approach for integrating the representations of similar data pairs while separating the representations of dissimilar data pairs is contrastive learning. This is accomplished by defining a similarity metric that measures the distance between data pairs in the model’s latent space. The similarity between the representations is measured using cosine similarity or dot product. The model is then trained to minimize the distance between similar pairs and maximize the distance between dissimilar pairs.

In this approach, the encoder models are independent and produce embeddings that are similar but not exactly the same.

Notable Models: OpenAI’s CLIP

CLIP (Contrastive Language-Image Pre-Training) is a state-of-the-art multimodal deep learning model trained using contrastive learning. This approach leverages massive datasets of 400 million image and text pairs to train the model on a vast range of visual and textual concepts, enabling it to learn the relationships between images and text.

CLIP addresses some major problems in the standard deep learning approach to computer vision, such as:

  • Costly Labeled Datasets: CLIP learns from text-image pairs that are already publicly available on the internet. This reduces the need for expensive, large labeled datasets, which has been extensively studied by prior work.
  • Narrow Range of Concepts: Models trained using supervised learning are restricted to predicting the number of labels present in their training dataset. To expand the model’s functionality to new classes, a practitioner would need to change the output head of the model to include these extra classes, and then retrain the model on a dataset that includes these classes. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text-encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations.
  • Poor Generalization: The real-world performance of supervised models tends to be slightly inflated, owing to the fact that they can “cheat” by optimizing for performance on their evaluation datasets. CLIP’s performance on out-of-distribution datasets is closer to its real-world performance.

CLIP has enabled numerous advancements in multimodal learning. DeepMind leveraged image encoders from their own CLIP-like models to create a unique training methodology that interleaves image tokens into text sequences to train their Flamingo LLM. StabilityAI’s Stable Diffusion model uses CLIP’s text encoder to help its generation model create images based on a text description.

Furthermore, CLIP facilitates the development of various types of zero-shot models for computer vision tasks, such as image classification and object detection.

How Do You Evaluate a Multimodal Model?

Multimodal models are judged based on the quality of their representations. For models like CLIP, their zero-shot performance on tasks such as image classification is used as a proxy for overall performance. For example, CLIP achieves a Top 1 Accuracy score of 56% and a Top 5 Accuracy score of 83% on the ImageNet Dataset.

Models like Flamingo are also evaluated by comparing their zero-shot and few-shot performance on multimodal tasks such as visual question answering, image classification, OCR, and image captioning to task-specific models that have been trained with significantly more data.


Multimodal Learning is a rapidly growing and exciting field of computer vision and AI that has the potential to revolutionize how computers interact with the world. There has never been a better time to get involved in multimodal learning and explore the cutting-edge techniques used to train and evaluate these complex models. With its diverse applications and potential to transform numerous industries, multimodal learning offers many opportunities for researchers, engineers, and enthusiasts.

Dhruv | Comet ML

Dhruv Nair

Data Scientist at
Back To Top