# dH #006: Understanding CLIP: Bridging Images and Text with Contrastive Learning

*Exploring how OpenAI’s CLIP revolutionizes multimodal AI through contrastive language-image pre-training*

## Introduction

CLIP stands for Contrastive Language-Image Pre-training, a model that learns transferable visual representations from natural language supervision.

CLIP Introduction

In this post we will look at how CLIP models are trained using contrastive learning on image-text pairs, classify images with a zero-shot approach, and create image and text embeddings with a pretrained CLIP model. We will also discuss CLIP’s use cases and limitations.

## Understanding Context and Embeddings

Context is essential for the embedding of a word like ‘bank’. A contextual encoder represents ‘bank’ differently depending on its surroundings: in ‘Bank of England’ it relates to finance, institutions, and money; in ‘a plane banks’ it relates to aeroplanes and flight; and in ‘a grassy bank’ it relates to grass and fields.

Contextual Word Embeddings

Good embeddings place semantically similar words close to each other in the embedding space. We can then evaluate this similarity with metrics such as the Euclidean distance and the cosine similarity, shown in the formulas below, exactly as we do with image embeddings.
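For reference, these are the standard definitions for two embeddings $u, v \in \mathbb{R}^d$:

$$
d(u, v) = \lVert u - v \rVert_2 = \sqrt{\sum_{k=1}^{d} (u_k - v_k)^2},
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$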

## CLIP Architecture: Two-Encoder Design

The CLIP architecture has two components: a text encoder and an image encoder. Each has its own weights, but both are updated by the same loss function, a symmetric cross-entropy loss that we will look at next.

Both encoders, the text encoder and the image encoder (a ResNet here), map their respective inputs to 512-dimensional embeddings. Because the text embeddings have the same size as the image embeddings, we can compare them directly with the cosine similarity formula shown in the image.

CLIP Architecture
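As a minimal sketch of comparing two such 512-dimensional embeddings (random vectors stand in for real encoder outputs):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for the 512-dimensional outputs of the text and image encoders.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)
image_embedding = rng.normal(size=512)

print(cosine_similarity(text_embedding, image_embedding))  # value in [-1, 1]
```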

## Training Process: Contrastive Learning in Action

This is how training works: we take a batch of images and their corresponding texts as inputs, and CLIP maximizes the cosine similarity between the embeddings of matching text and image pairs, as shown by the cosine similarity formula in the image.

Training Matrix

Consider the matrix of text and image embeddings, with text embeddings T1, T2, T3 along one axis and image embeddings I1, I2, I3 along the other. The elements on the diagonal represent matching text-image pairs and are pushed to have very high cosine similarity, while the non-matching pairs are pushed to have low similarity.

## Symmetric Cross-Entropy Loss Function

The loss is a symmetric cross-entropy, computed as (H(p, q) + H(q, p)) / 2, as shown in the formula in the image. We first compute the image-to-text loss H(p, q) and then the text-to-image loss H(q, p). Cross-entropy is not commutative, which is why we need both losses and then average them.
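Spelled out for a batch of $N$ pairs with similarity logits $S_{kj} = \cos(I_k, T_j)/\tau$ (notation mine, matching the loss described above), the two directional losses and their average are:

$$
\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{k=1}^{N} \log \frac{e^{S_{kk}}}{\sum_{j=1}^{N} e^{S_{kj}}},
\qquad
\mathcal{L}_{\text{txt}\to\text{img}} = -\frac{1}{N}\sum_{k=1}^{N} \log \frac{e^{S_{kk}}}{\sum_{j=1}^{N} e^{S_{jk}}},
\qquad
\mathcal{L} = \frac{\mathcal{L}_{\text{img}\to\text{txt}} + \mathcal{L}_{\text{txt}\to\text{img}}}{2}
$$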

Symmetric Cross-Entropy Loss

The training algorithm, as shown in the pseudocode, is relatively simple: extract features from the image and text inputs, project them into the joint embedding space, and optimize the symmetric cross-entropy loss.
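Here is a runnable PyTorch sketch of that training step, with random tensors standing in for real encoder features (the feature sizes and initial temperature are illustrative assumptions, not the exact values from the paper):

```python
import torch
import torch.nn.functional as F

batch_size, d_img, d_txt, d_emb = 8, 768, 512, 512

# Stand-ins for the raw features produced by the image and text encoders.
image_features = torch.randn(batch_size, d_img)
text_features = torch.randn(batch_size, d_txt)

# Learned projections into the joint embedding space, plus a learned temperature.
W_image = torch.randn(d_img, d_emb, requires_grad=True)
W_text = torch.randn(d_txt, d_emb, requires_grad=True)
log_temperature = torch.tensor(0.07).log().requires_grad_()

# Joint multimodal embeddings, L2-normalized so dot products are cosine similarities.
image_emb = F.normalize(image_features @ W_image, dim=-1)
text_emb = F.normalize(text_features @ W_text, dim=-1)

# Pairwise cosine similarities divided by the temperature: [batch_size, batch_size].
logits = image_emb @ text_emb.t() / log_temperature.exp()

# The matching pair for row k is column k, so the targets are just 0..N-1.
targets = torch.arange(batch_size)
loss_img_to_txt = F.cross_entropy(logits, targets)      # rows: one image vs. all texts
loss_txt_to_img = F.cross_entropy(logits.t(), targets)  # columns: one text vs. all images
loss = (loss_img_to_txt + loss_txt_to_img) / 2
loss.backward()
print(loss.item())
```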

## Zero-Shot Classification: CLIP’s Superpower

CLIP enables something very powerful: zero-shot classification. To classify an image, you simply create a text prompt for each candidate class, such as ‘a photo of a [object]’, as shown in the list of example prompts.

Zero-Shot Classification

We encode each prompt with the text encoder and the image with the image encoder, as illustrated in the diagram. We then calculate the cosine similarity between every candidate text description and the encoded image representation.

Applying a softmax over these similarities gives a probability distribution over the candidate classes, with a high probability for the correct one, as represented by the softmax equation in the diagram. This lets us classify new images without any training specific to the new classification task, which is what we call zero-shot classification.
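In formula form (notation mine), with image embedding $I$, candidate prompt embeddings $T_1, \dots, T_K$, and temperature $\tau$:

$$
p(\text{class } k \mid \text{image}) = \frac{\exp\!\big(\cos(I, T_k)/\tau\big)}{\sum_{j=1}^{K} \exp\!\big(\cos(I, T_j)/\tau\big)}
$$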

## Practical Implementation with Hugging Face

Here we use OpenAI’s CLIP model, loaded from the ‘openai/clip-vit-base-patch32’ checkpoint. The code shows how to produce the embeddings for an image (‘cats.jpg’) and a set of text prompts (‘a cat’, ‘a dog’, ‘two cats’, etc.), and how to calculate the cosine similarities between them.

Code Implementation
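A minimal sketch of that code using the Hugging Face transformers library (the prompts mirror the ones in the example; ‘cats.jpg’ is assumed to be a local file):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("cats.jpg")
prompts = ["a cat", "a dog", "two cats",
           "two cats lying on a sofa next to a TV controller"]

with torch.no_grad():
    # 512-dimensional embeddings for the image and for each text prompt.
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=prompts, return_tensors="pt", padding=True))

    # Normalize so that dot products are cosine similarities.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    similarities = (image_emb @ text_emb.T).squeeze(0)                # one score per prompt
    probs = (similarities * model.logit_scale.exp()).softmax(dim=-1)  # scaled by the learned temperature

for prompt, sim, prob in zip(prompts, similarities, probs):
    print(f"{prompt!r}: cosine={sim:.3f}  prob={prob:.3f}")
```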

The bar chart displays the resulting probabilities, derived from the cosine similarity scores between the image and text embeddings: the same image of cats gets a different probability for each prompt.

The more descriptive the prompt and the better it matches what is in the image, the higher the matching probability. The prompt ‘two cats lying on a sofa next to a TV controller’ gets the highest probability, much higher than ‘two cats’, ‘a cat’, or, of course, ‘a dog’.

## Performance and Limitations

In the comparison shown, the CLIP model, trained on a dataset of 400 million image-text pairs scraped from the internet, is pitted against a ResNet101 pretrained on the much smaller ImageNet dataset of 1.28 million training images. On the original ImageNet test set, CLIP performs about as well as the ResNet101, and on the other ImageNet variations it shows some remarkable performance gains. Keep in mind, however, that the CLIP vision transformer has around 428 million parameters while ResNet101 has around 44.5 million, and CLIP was trained on roughly 300 times more data, so this superior performance is to be expected.

Performance Comparison

That said, CLIP usually does not compete with a fully supervised model that has been fine-tuned for a specific task. In this example, we see CLIP struggling with object counting on the CLEVR Count dataset, and there are many other cases where CLIP is not as good as a fine-tuned model; the most glaring example is MNIST.

Even so, the model is extremely useful and powerful, and it is particularly interesting for semantic search. We can compare the text embeddings of the terms we want to search for with the CLIP image embeddings of our image collection, and it performs much as you would expect from something like Google Image Search for finding semantically similar images.
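A sketch of how such a semantic search could look, reusing the same checkpoint and assuming a hypothetical local folder of images called ‘photos’:

```python
from pathlib import Path
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Hypothetical local image collection to index.
image_paths = sorted(Path("photos").glob("*.jpg"))
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Index step: embed and normalize every image once, then keep these vectors around.
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Query step: embed the search text and rank images by cosine similarity.
    query = "two cats lying on a sofa"
    query_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

    scores = (query_emb @ image_emb.T).squeeze(0)

# Print the five best matches for the query.
for idx in scores.argsort(descending=True)[:5].tolist():
    print(image_paths[idx], f"{scores[idx]:.3f}")
```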

## Real-World Applications

CLIP guides the image generation of diffusion models such as DALL·E, DALL·E 2, and Stable Diffusion, which use CLIP to guide the generation process. When we write a prompt like ‘Create an image of an astronaut riding a horse in pencil drawing style’, it is CLIP that enables the model to relate what we want to see to the prompt we are writing.

Diffusion Model Integration

CLIP is very useful for semantic search involving images and text. We can also control how sharp the output probabilities are with the temperature parameter of the softmax, and we can use the CLIP model to create candidate labels for our dataset.
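To illustrate the effect of the temperature (illustrative similarity values, not real model outputs): a lower temperature sharpens the distribution, a higher one flattens it.

```python
import torch

# Illustrative cosine similarities between one image and four candidate prompts.
similarities = torch.tensor([0.21, 0.18, 0.27, 0.31])

# Dividing by a smaller temperature makes the softmax output more peaked.
for temperature in (1.0, 0.1, 0.01):
    probs = torch.softmax(similarities / temperature, dim=-1)
    print(f"T={temperature}:", [round(p.item(), 3) for p in probs])
```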

## Summary

CLIP learns joint embeddings: it creates a shared embedding space for images and text, and these multimodal embeddings have applications in semantic search and image generation. We also saw that CLIP excels at zero-shot classification, meaning it performs well on unseen tasks without additional training.

Summary Points

But we also saw that CLIP has limitations. It is not always better than models fine-tuned for specific tasks, and retraining CLIP with a ViT backbone is expensive both computationally and in terms of data. Even so, the multimodal embeddings it learns are very useful and have many applications.

CLIP represents a significant breakthrough in multimodal AI, demonstrating how contrastive learning can create powerful connections between visual and textual information. While it has limitations compared to task-specific models, its versatility in zero-shot scenarios and applications in semantic search and image generation make it an invaluable tool in the modern AI toolkit.