dH #005: Exploring Multimodal Deep Learning: From CLIP to Contrastive Models
*A comprehensive guide to understanding how AI systems learn from multiple data types simultaneously*
—
## Introduction
The classic multimodal problem – which receives a lot of attention in the DL literature – is image captioning. But there are lots of other potential multimodal applications as well. It is natural to think of many document problems as multimodal, combining representations from the document images and the OCR'ed text.
Multimodal Problem Overview
We can learn representations that combine information from multiple modalities (vision, text, audio). When working with multiple modalities, you need a way to fuse the representations. You can jointly train a model on images and text to align the representations.
Multimodal Training Approach
—
## The Convergence Challenge in Multimodal Learning
With vision and NLP, we largely saw a coalescence of the field around a few main models (ResNet, ViT, a handful of transformer LLMs). The multimodal space feels more dispersed. This lecture will cover a range of different models, and is by no means comprehensive.
Model Convergence Comparison
It's a very active area; for the development of things like virtual reality, multimodality is at the center of that, so there's lots and lots of research taking place. But the field tends to feel a bit more dispersed, without a few models that everyone has settled on.
—
## Course Overview: Contrastive Models and Beyond
All right, so we'll start by talking about contrastive models like CLIP, Locked-image Tuning (LiT), and Contrastive Captioners (CoCa). This is where you learn an aligned space between different types of representations.
Contrastive Models Overview
Then I’ll talk about multimodal classification and multimodal transformers like CMA-CLIP and Multimodal Bottleneck Transformer.
—
## CLIP: The Breakthrough in Multimodal Learning
All right, so probably the most famous model in the multimodal space is CLIP. CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which captions during training. It can then be used for zero-shot prediction by encoding a new image and text, and predicting the class of the image by finding the nearest text embedding to the image embedding.
CLIP Architecture and Process
### CLIP’s Training Foundation
So CLIP was trained on a proprietary dataset called WebImageText of 400 million image-caption pairs scraped from the internet. (There are now open-source web image datasets like LAION-400M, which have been used to train open-source CLIP models.)
CLIP Training Details
It was trained with contrastive learning on 256 GPUs for 2 weeks; it does take a lot of compute to train a model on 400 million image-caption pairs. The vision encoder is a ViT; the text encoder is a GPT-2-style transformer. The training objective is a symmetric cross-entropy loss over the image-text similarity matrix.
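To make that loss concrete, here is a minimal PyTorch-style sketch in the spirit of the pseudocode in the CLIP paper; the encoders, batch tensors, and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over an in-batch similarity matrix.

    image_features, text_features: [batch, dim] outputs of the two encoders,
    projected into the shared embedding space.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] matrix of scaled pairwise similarities
    logits = image_features @ text_features.t() / temperature

    # The i-th image goes with the i-th caption, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), then average
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```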
—
## Addressing the Limitations of Traditional Approaches
A supervised classification model trained on ImageNet can only be applied off-the-shelf to classification for the categories in ImageNet. It doesn’t have a zero-shot capacity to classify categories unseen during training.
ImageNet Limitations
They also point out that it was costly to create ImageNet: it required over 25,000 annotators to label those 14 million images, which was no doubt a considerable cost, whereas they scrape their 400 million images from the internet with essentially no human supervision required.
—
## CLIP in Action: Zero-Shot Classification Example
And so this is a data set called Food 101. Food 101 has images of food and a text description. And so they’re using the images from this data set and then they encode the text descriptions.
Food 101 Classification Example
And you can see that the closest neighbor to this particular image is the embedding of the text ‘a photo of guacamole, a type of food’, which is the correct class guacamole with 90.1% confidence, ranked 1 out of 101 labels. And so that’s how they do zero-shot classification.
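As a rough sketch of that zero-shot procedure (the encode_text callable and the prompt template stand in for whatever CLIP implementation and prompt engineering you use):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_embedding, class_names, encode_text):
    """Pick the class whose prompt embedding is nearest to the image embedding.

    image_embedding: [dim] tensor from the image encoder
    encode_text: callable mapping a list of strings to a [n_classes, dim] tensor
    (a stand-in for CLIP's text encoder).
    """
    # Wrap each label in the prompt template from the Food 101 example
    prompts = [f"a photo of {name}, a type of food" for name in class_names]
    text_embeddings = F.normalize(encode_text(prompts), dim=-1)   # [n_classes, dim]
    image_embedding = F.normalize(image_embedding, dim=-1)        # [dim]

    # Cosine similarity of the image to every candidate class prompt
    similarities = text_embeddings @ image_embedding              # [n_classes]
    probs = similarities.softmax(dim=-1)   # CLIP scales by a learned temperature first
    best = probs.argmax().item()
    return class_names[best], probs[best].item()
```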
—
## The Historical Context: ConVIRT and the Road to CLIP
CLIP was not the first model to contrastively train an image-language model, but it really popularized the approach. So I thought that this tweet from Chris Manning, who's a really influential figure in NLP, was pretty interesting.
Chris Manning Tweet
He's saying he's happy to share that their ConVIRT paper was finally accepted at MLHC 2022 (PMLR 182). They first wrote the paper in 2020, and it was pioneering work in contrastive learning of perception using naturally occurring paired text, with an image encoder and a text encoder.
### ConVIRT’s Medical Focus and Performance
The paper is called 'Contrastive Learning of Medical Visual Representations from Paired Images and Text' (by Yuhao Zhang, Hang Jiang, Yasuhide Miura, Chris Manning, and Curt Langlotz). It shows much better unsupervised visual representation learning using paired text than vision alone (SimCLR, MoCo v2): for example, under linear classification with 1% labels, ConVIRT scores 90.7 on RSNA and 85.1 on CheXpert.
ConVIRT Performance Results
And the fact that it was in radiology (x-rays), rather than showing performance on something like ImageNet, seemed to dampen interest. All this is to say that the idea of contrastive pre-training on paired image-text data wasn't necessarily original to the CLIP paper, but CLIP did it on 400 million images, which nobody else had done before.
And they were really able to kind of show the power of contrastive pre-training at that scale.
—
## Locked-image Tuning (LiT): Design Choices for Contrastive Learning
And so now I want to talk about another paper, Locked-image Tuning (LiT). They discuss the different design choices for contrastive image-text training, in particular whether you freeze one of the towers and whether you start from pre-trained models.
LiT Design Choices Overview
### Three Approaches to Contrastive Training
And so the first, the blue, is what they call Locked pre-trained. They take a state-of-the-art pre-trained image encoder and freeze it; so they lock that. And then in the contrastive training, only the text tower, which will be something like BERT, is allowed to update.
A second approach is Unlocked pre-trained. Here they again use a pre-trained ImageNet backbone and a pre-trained language model like BERT, but now both are unlocked, so during contrastive training the parameters of both towers are updated with backprop.
And then the third approach, which would be like the approach of CLIP, is unlocked and trained from-scratch.
### How LiT Works
LiT freezes the vision encoder, so the pre-trained image encoder is not going to change. But the text encoder will update: given the paired image-caption data, the text tower is updated contrastively at each round.
LiT Training Process
The goal is to align the text embeddings to the vision embeddings. So the text embedding of ‘Hot Air Balloon’ will be similar to the vision embedding of an image of a hot air balloon. And the same thing for a variety of different paired image caption pairs.
So again this is data that comes from scraping the internet, so it’s probably pretty noisy, which is potentially important to keep in mind.
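Here is a minimal sketch of what 'locking' looks like in code, assuming generic image_tower and text_tower modules and reusing the contrastive loss sketched in the CLIP section:

```python
import torch

def make_lit_optimizer(image_tower, text_tower, lr=1e-4):
    """Locked-image Tuning setup: freeze the pre-trained image tower,
    update only the text tower during contrastive training."""
    for p in image_tower.parameters():
        p.requires_grad_(False)          # the vision embeddings stay fixed
    image_tower.eval()                   # also freeze things like batch-norm statistics
    return torch.optim.AdamW(text_tower.parameters(), lr=lr)

# Each training step then computes the same symmetric contrastive loss as CLIP,
# but gradients only flow into the text tower:
#   with torch.no_grad():
#       img_emb = image_tower(images)
#   txt_emb = text_tower(captions)
#   loss = clip_contrastive_loss(img_emb, txt_emb)
```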
—
## LiT Performance Results
And so they find good performance. They evaluate on ImageNet, on CIFAR-100 (an older, smaller image-classification dataset with 100 classes), and also on a Pets dataset, classifying different types of pets.
LiT Performance Comparison
The locked version of the vision encoder (LiT) outperforms the unlocked version (Uu) that is using a pre-trained ImageNet encoder, which in turn beats out the one that is not using a pre-trained ImageNet encoder (uu).
So I think there are no surprises that starting with a state-of-the-art pre-trained ImageNet vision encoder beats training from scratch. That's not at all surprising, but it is a little surprising at first glance that locking the vision model (LiT) is a good thing to do.
### Scaling Performance Analysis
And the LiT model gets an accuracy above 90% when trained on 20B image-text pairs, as shown in the left chart. And you can see that LiT (locked-image tuning) helps to close the gap between zero-shot CLIP and fine-tuned state-of-the-art models.
LiT Scaling Results
As I mentioned, CLIP is trained on a proprietary dataset. So they do another exercise where they take their own data and train a vision-text contrastive model from scratch ('From-scratch' in the right chart). Then they have an unlocked image encoder that starts from a pre-trained ImageNet encoder on the vision side ('Unlocked image encoder' in the right chart). And then they have the locked-image tuning (LiT) model, shown in both charts.
And as we saw in the previous table, even across a range of dataset sizes, from a modest couple of hundred million image-text pairs up to around a billion examples in the right chart, they find that LiT (keeping the image encoder locked) does better than fine-tuning the image encoder.
Design Choice Comparison
### Understanding Why LiT Works
So why is this the case? Why would locking the vision encoder help compared to an unlocked image encoder, as the design-choice comparison shows? One intuition is that the image tower already encodes a strong representation from clean supervised pre-training, and fine-tuning it on noisy web-scraped image-text pairs can degrade that representation; locking it preserves the good visual features and forces the text tower to do the work of aligning to them.
—
## CoCa: Contrastive Captioners Architecture
So that's Locked-image Tuning: for the image encoder, use a pre-trained state-of-the-art model and lock it, forcing the text tower to align with the space that was learned from ImageNet pre-training. Now let's turn to Contrastive Captioners (CoCa).
CoCa Architecture Overview
The slide compares the CoCa (Contrastive Captioners) approach to single-encoder, dual-encoder, and encoder-decoder models for image captioning and multimodal understanding tasks.
—
## Understanding CoCa: Combining Multiple Approaches
In the multimodal space, you could align the Image Encoder and the Unimodal Text Decoder, which is essentially cross-modal alignment. Or, for image captioning, you could have cross-attention between an Image Encoder and a Text Decoder.
CoCa Architecture Components
The captioning setup is essentially an encoder-decoder model. What Contrastive Captioners (CoCa) does is combine these different approaches.
### CoCa’s Dual Loss Function
And so there's a Contrastive Loss: the model is trained on image-text pairs, and a contrastive loss aligns the Image Encoder and the Unimodal Text Decoder. But then you also have a Captioning Loss, computed with a Multimodal Text Decoder that can attend to the Image Encoder outputs with cross-attention.
And so you have these different losses, and you combine them to get a single model that can handle multiple problems: it simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, which allows one model to be used for a variety of downstream tasks.
CoCa Multi-task Capabilities
And you'll see that this is kind of a theme in this literature: trying to get a single model that does a bunch of things.
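As a hedged sketch of how the two objectives combine (the module names and loss weights are placeholders, and the contrastive term reuses the clip_contrastive_loss sketch from the CLIP section):

```python
import torch.nn.functional as F

def coca_loss(image_encoder, unimodal_decoder, multimodal_decoder,
              images, caption_tokens, caption_weight=1.0, contrastive_weight=1.0):
    """Combine a CLIP-style contrastive loss with an autoregressive captioning loss."""
    image_tokens, image_embedding = image_encoder(images)    # per-patch tokens + pooled embedding
    text_embedding = unimodal_decoder(caption_tokens)        # pooled text embedding (no cross-attention)

    # Alignment objective between the two unimodal embeddings
    con_loss = clip_contrastive_loss(image_embedding, text_embedding)

    # Captioning objective: the multimodal decoder cross-attends to the image tokens
    # and predicts the next caption token at each position
    logits = multimodal_decoder(caption_tokens[:, :-1], image_tokens)  # [batch, seq-1, vocab]
    cap_loss = F.cross_entropy(logits.transpose(1, 2), caption_tokens[:, 1:])

    return contrastive_weight * con_loss + caption_weight * cap_loss
```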
—
## Summary of Contrastive Models
And so that was contrastive models. These are models that you train on paired images and captions (it doesn't even have to be images and captions, but that's been the majority of work in this space) to produce an aligned space where the embedding for a text is similar to the embedding for an image that the text describes.
Contrastive Models Overview
CLIP did it from scratch on a very large number of image-caption pairs. You could lock the vision encoder and use a state-of-the-art vision encoder, like Locked-image Tuning (LiT). Or you could combine a contrastive loss with an encoder-decoder loss, where you feed in an image and produce a caption as text generation, which is useful if you want a single model to do multiple tasks, like Contrastive Captioners (CoCa).
—
## Multimodal Classification: A Different Approach
So now I want to say a little bit about multimodal classification. What I mean by multimodal classification is that you have an image and you have a text, and you want to jointly classify that pair.
Multimodal Classification Task
So here I'm talking about jointly classifying an image-text pair, where you already have the image and its associated text and you want to feed both into a classifier. Unlike with the contrastive models we talked about, where you use nearest-neighbor or clustering approaches with the embeddings, for multimodal classification the spaces don't necessarily need to be aligned.
—
## Multimodal Bitransformers: Early Fusion Approach
And so the first model is called Multimodal Bitransformers (MMBT), from Facebook, all the way back in 2020. What they do is concatenate linear projections of a ResNet output with BERT token embeddings.
Multimodal Bitransformers Architecture
This was really before ViT was a thing. So they take linear projections of the ResNet output and concatenate them with BERT token embeddings. This concatenated sequence is put into a transformer, and they use the first output of the final layer, like a class token, as the input to their classification layer.
### How Multimodal Bitransformers Work
And so this is again just taking linear projections of the ResNet features, concatenating them with the text tokens, and passing the whole thing through a transformer, which allows for flexible cross-attention between the image inputs and the text token inputs, and then doing classification, just as you would if the whole sequence were text; in this case, part of the sequence is an image.
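A rough sketch of that early-fusion idea, with illustrative dimensions rather than the exact MMBT configuration (positional and segment embeddings are omitted for brevity):

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Project pooled image features to 'visual tokens', concatenate them with text
    token embeddings, and run a single transformer over the joint sequence."""

    def __init__(self, img_dim=2048, hidden=768, n_visual_tokens=4,
                 vocab_size=30522, num_classes=101, n_layers=4, n_heads=8):
        super().__init__()
        self.n_visual_tokens = n_visual_tokens
        self.visual_proj = nn.Linear(img_dim, hidden * n_visual_tokens)
        self.token_emb = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, image_features, token_ids):
        # image_features: [batch, img_dim] pooled CNN features
        # token_ids:      [batch, seq] text token ids
        b = image_features.size(0)
        visual = self.visual_proj(image_features).view(b, self.n_visual_tokens, -1)
        text = self.token_emb(token_ids)
        joint = torch.cat([visual, text], dim=1)   # one sequence, both modalities
        hidden = self.encoder(joint)
        # Classify from the first position of the sequence (a class-token-like readout)
        return self.classifier(hidden[:, 0])
```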
### Benchmark Datasets and Examples
So their benchmarks are an IMDb dataset (MM-IMDB), classifying movie genres where you have an image and a text description, and Food 101, which we saw before when CLIP classified its images zero-shot.
Dataset Examples
In this benchmark, you have an image of a cupcake and the text of that cupcake's recipe, and you can input both of those into the transformer. There's also the V-SNLI dataset, which is natural language inference (NLI), but in addition to the text premise and hypothesis you have an image, for example of children smiling and waving at the camera.
—
## Performance Results and Comparisons
And these are the results. BOW is a bag-of-words model, Img is a ResNet model alone, and BERT is BERT alone.
Performance Comparison Results
Let's focus on the middle column, which shows results on the MM-IMDB dataset; the other columns show a similar picture. The unimodal baselines, Bow and Img, don't do great, and the image alone actually does less well on this particular dataset, as shown by the Img row.
### Fusion Methods Comparison
Late Fusion takes the softmax scores from a BERT model and a separate ResNet model and averages them to make predictions. It actually does pretty well, scoring 91.1 on the MM-IMDB dataset.
ConcatBert concatenates the class token from BERT with the ResNet features and sends that to a classification head (ConcatBow does the same with bag-of-words features). And then MMBT is their transformer model, where they pass both the text and the image representations through the transformer.
And you can see that MMBT performs better than Late Fusion, where they just average the softmax scores, and better than ConcatBert, where they concatenate the representations, though not dramatically better than ConcatBert.
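Late fusion in this sense is simple enough to sketch in a few lines (the logits are assumed to come from the separately trained text and image classifiers):

```python
import torch

@torch.no_grad()
def late_fusion_predict(text_logits, image_logits):
    """Average the per-model class probabilities, then pick the argmax."""
    probs = (text_logits.softmax(dim=-1) + image_logits.softmax(dim=-1)) / 2
    return probs.argmax(dim=-1)
```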
—
## Performance Analysis: Food101 Dataset Results
And so again, let's look at the Food101 example, which shows Banana Bread Pancakes with Cinnamon Cream Cheese Syrup labeled as 'Pancakes'. We see a similar picture to the other paper: ViT alone doesn't do that great, scoring 81.8 on Food101; BERT alone does okay (87.2) but not fantastic; and CLIP does a bit better (88.8).
Food101 Performance Comparison
Then there's MMBT, which we just saw, and CMA-CLIP beats it, scoring 93.1 to MMBT's 92.1. That's not a huge margin, but if you look at the roughly 8% of examples that MMBT gets wrong, CMA-CLIP is knocking off a fair number of those.
And I think there are lots of interesting applications like this to documents.
—
## Multimodal Bottleneck Transformer: Sparse Attention
The models we saw so far allowed for cross-attention between all vision and text tokens, but there are sparser ways to parametrize attention. This is a paper by Nagrani et al., 2021, called the Multimodal Bottleneck Transformer.
Fusion Methods Comparison
What they call late fusion is when you don't allow any cross-attention between modalities; you just combine the modality representations at the end.
### Bottleneck Fusion Mechanism
And bottleneck fusion is where you allow for attention between modalities like video (the blue circles) and audio (the pink circles), but only through a bottleneck: you force the model to condense the representations from one modality into a small set of shared bottleneck tokens before the other modality can see them.
They find that bottleneck mid fusion, which applies both forms of restriction in conjunction (exchanging information only through the bottleneck tokens, and only from a middle layer onward), works well for video applications.
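Here is a simplified sketch of a single bottleneck-fusion step; the real model interleaves this inside a ViT-style stack, and the tensor names are placeholders:

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Each modality attends only to its own tokens plus a small set of shared
    bottleneck tokens; the bottlenecks are the only channel between modalities."""

    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens_a, tokens_b, bottleneck):
        # tokens_a: e.g. video tokens; tokens_b: e.g. audio tokens
        # bottleneck: [batch, n_bottleneck, dim] shared fusion tokens
        ctx_a = torch.cat([tokens_a, bottleneck], dim=1)
        ctx_b = torch.cat([tokens_b, bottleneck], dim=1)

        # Each modality updates its own tokens (and its copy of the bottleneck)
        # using only its own tokens plus the bottleneck as keys/values
        out_a, _ = self.attn_a(ctx_a, ctx_a, ctx_a)
        out_b, _ = self.attn_b(ctx_b, ctx_b, ctx_b)

        na, nb = tokens_a.size(1), tokens_b.size(1)
        new_a, bottleneck_a = out_a[:, :na], out_a[:, na:]
        new_b, bottleneck_b = out_b[:, :nb], out_b[:, nb:]

        # One option (roughly following the MBT paper) is to average the two
        # modality-specific bottleneck updates into the shared bottleneck
        new_bottleneck = (bottleneck_a + bottleneck_b) / 2
        return new_a, new_b, new_bottleneck
```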
—
## Advanced Applications and AudioSet
Bottleneck attention forces the model, within a given layer, to condense information from one modality before sharing it with the other, while still allowing attention to flow freely within a modality. They apply this to sound classification and action recognition; the sound classification dataset is called AudioSet.
AudioSet Dataset Details
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
—
## Multimodal Classification vs Zero-Shot Learning
So in multimodal classification you're using both the image and the text as inputs to the classifier, versus the zero-shot classification that we saw contrastive models like CLIP, Locked-image Tuning (LiT), and Contrastive Captioners (CoCa) allow for.
Classification Approaches Overview
With the contrastive models, you have a meaningful metric space, and you can use whatever class you want at inference time, because you can just embed the text of that class.
—
## Prefix Tuning for Multimodal Learning
So I just finally wanted to say a word about work in the prefix tuning space, which we also talked a little bit about last class. So this was the paper ‘Few-Shot Learning with Frozen Language Models’ by Tsimpoukelli et al., 2021.
Few-Shot Learning Architecture
And here they take a language model with self-attention layers and freeze it, but they train the vision encoder. So they're doing backprop through the frozen language model into the vision encoder, and the frozen language model's self-attention can attend over the vision encodings alongside the text.
Then at test time, they can use this to do zero-shot visual question answering and few-shot image classification, which is pretty cool: they give the model a few examples of things with made-up words and see whether the model is able to learn them from just those few examples.
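A minimal sketch of that setup, assuming a generic frozen causal LM that accepts input embeddings and a vision encoder exposing an output_dim attribute (both placeholders):

```python
import torch
import torch.nn as nn

class VisualPrefixModel(nn.Module):
    """'Frozen'-style setup: the language model is frozen, and only the vision
    encoder (which emits a few prefix embeddings) is trained."""

    def __init__(self, vision_encoder, language_model, n_prefix=2, lm_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder            # trainable
        self.language_model = language_model            # frozen
        for p in self.language_model.parameters():
            p.requires_grad_(False)
        self.n_prefix = n_prefix
        self.to_prefix = nn.Linear(vision_encoder.output_dim, n_prefix * lm_dim)

    def forward(self, images, text_embeddings):
        # text_embeddings: [batch, seq, lm_dim] embeddings of the caption/question
        b = images.size(0)
        prefix = self.to_prefix(self.vision_encoder(images)).view(b, self.n_prefix, -1)
        # Prepend the visual prefix; the frozen LM's self-attention can attend to it,
        # and gradients flow back through the frozen weights into the vision encoder
        inputs = torch.cat([prefix, text_embeddings], dim=1)
        return self.language_model(inputs)
```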
—
## ClipCap: Mapping Vision to Language Space
Instead of freezing the LM and learning the vision encoder, you could freeze both and learn a mapping function from the pre-trained vision embeddings into the LM space. ClipCap maps CLIP embeddings into a prefix of embedding vectors, each with the same dimension as a word embedding in GPT-2.
ClipCap Architecture
Increasing the prefix size leads to better performance. When the LM is frozen, the preferred mapping function is a transformer with 8 multi-head self-attention layers of 8 heads each; when the LM can be fine-tuned, an MLP is enough.
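A hedged sketch of the MLP variant of that mapping network (the dimensions are illustrative, and the frozen CLIP and GPT-2 models are assumed to live elsewhere):

```python
import torch.nn as nn

class ClipCapMapper(nn.Module):
    """Map a single frozen CLIP embedding to a prefix of `prefix_len` vectors,
    each the size of a GPT-2 token embedding, to be prepended to the caption."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        hidden = (clip_dim + gpt_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):
        # clip_embedding: [batch, clip_dim] from the frozen CLIP image encoder
        prefix = self.mlp(clip_embedding)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)
```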
—
## Conclusion
All right, so that's all I have on multimodal learning. I know it's a bit of a grab bag of models, but hopefully it introduces everyone to the basic ideas behind how we can combine information from multiple modalities like image and text, which I think is going to become an increasingly prevalent application.
Multimodal Learning Summary
The field of multimodal deep learning continues to evolve rapidly, with researchers exploring various approaches to effectively combine information from different data types. From the foundational work of CLIP demonstrating large-scale contrastive learning, to sophisticated architectures like CoCa that combine multiple loss functions, and innovative approaches like LiT that strategically freeze certain components while training others, we’ve seen how the field is moving toward more efficient and effective multimodal systems.
The exploration of different fusion strategies – from early concatenation in Multimodal Bitransformers to the sparse attention mechanisms in Bottleneck Transformers – shows that there’s no one-size-fits-all solution. The choice of architecture depends heavily on the specific application, available computational resources, and the nature of the modalities being combined.
As we move forward, the trend toward unified models that can handle multiple tasks simultaneously, combined with techniques like prefix tuning that allow for efficient adaptation without full retraining, suggests that multimodal AI systems will become increasingly practical and widely deployed across various domains – from document understanding and visual question answering to complex reasoning tasks that require integrating information from multiple sources.