dH #004: CLIP: Bridging Images and Language with Contrastive Learning
*How Contrastive Language-Image Pre-training revolutionizes multimodal AI by learning shared representations without explicit supervision*
---
## The Foundation: Transformers for Multimodal Understanding
We've seen how language can be modeled with transformer architectures, and that images can be modeled with the same architecture. Transformers are probably the most powerful models we have for understanding both text and images.
A joint model of the two is useful for image-to-text generation, image captioning, and so on. So in this video we're going to see how to build a joint model that relates images and language together.
## The Challenge: Training Data for Joint Models
So the question is: where do we get training data to train a joint model? One option is to mine images from the internet along with whatever surrounding text they have, such as captions or other descriptive text near the image.
*[Image: internet data mining concept]*
But there's a problem with this. It's not realistic to expect a model to produce the exact caption that came with an image. For any given image there are probably hundreds of plausible captions, and there's nothing to say that a different caption is wrong; it's just different from whatever the original one was.
## CLIP: A Clever Solution
So this paper, CLIP, which stands for Contrastive Language-Image Pre-training, has a really neat way of getting around this problem and learning a really powerful joint model of images and language.
*[Figure: CLIP architecture diagram]*
## The Architecture: Dual Encoders in Shared Space
So the idea is to simultaneously train a text encoder and an image encoder. In practice these are transformer models, very similar to what we've seen before: you feed in either a sequence of text tokens or a sequence of image patches represented like tokens.
*[Figure: text and image encoder architecture]*
This text encoder takes a sentence and encodes it into some vector, and the image encoder takes an image and also encodes it into a vector.
During training we force these two vectors to lie in the same space, so they share a common embedding space. That means if I embed a caption and I embed an image, and the two have the same meaning (the caption is a good description of the image), then they should embed close together. And if they're different, they should embed far apart.
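To make that concrete, here is a minimal sketch of a dual encoder in PyTorch. The `TextEncoder`/`ImageEncoder` modules, the dimensions, and the projection layers are illustrative assumptions rather than CLIP's exact implementation; the point is just that both modalities get projected and normalized into one shared embedding space, where a dot product measures similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Sketch of a CLIP-style dual encoder (dimensions and modules are illustrative)."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module,
                 text_dim: int = 512, image_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder    # any transformer that returns one vector per text
        self.image_encoder = image_encoder  # any transformer that returns one vector per image
        # Linear projections map both modalities into the same shared space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor):
        t = self.text_proj(self.text_encoder(text_tokens))
        i = self.image_proj(self.image_encoder(image_patches))
        # L2-normalize so the dot product between a caption and an image behaves
        # like a cosine similarity in the shared embedding space.
        return F.normalize(t, dim=-1), F.normalize(i, dim=-1)
```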
## The Training Process: Contrastive Learning at Scale
So how can we make that happen? What we do is choose a batch of n images along with their corresponding captions.
*[Figure: batch processing diagram]*
Then our loss function tries to maximize the similarity of the correct pairs, where similarity is measured by the dot product between the two vectors.
*[Figure: similarity matrix visualization]*
There are n of those correct pairs in my batch, and they lie along the diagonal of the similarity matrix: image one corresponds to caption one, and I want that dot product to be high. At the same time, I want to minimize the similarity of all the remaining n squared minus n incorrect pairings. Those are the off-diagonals, where the caption doesn't correspond to the image, and I want those to have dissimilar embeddings.
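Here is roughly what that loss looks like in PyTorch. This is a sketch of the standard symmetric contrastive (InfoNCE-style) objective rather than CLIP's exact code; in the real model the temperature is a learned parameter, whereas here it is a fixed illustrative value.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of n matched (image, caption) pairs.

    text_emb, image_emb: [n, d] L2-normalized embeddings; row i of each is a matching pair.
    """
    # n x n matrix of dot-product similarities; entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature
    # The correct pairs sit on the diagonal: image i should match caption i.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy pushes the n diagonal similarities up and the n^2 - n off-diagonals down,
    # once over rows (images -> captions) and once over columns (captions -> images).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```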
## Scale and Impact
We use this loss at very large scale, training on perhaps billions of images with very large text and image encoders, similar in scale to the sort of GPT models we've talked about before.
Once you've trained a model like this, you can use it for something very powerful immediately, without any additional training.
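For example, you can do zero-shot image classification just by embedding a few candidate captions and picking whichever one lands closest to the image embedding. The sketch below assumes OpenAI's open-source `clip` package is installed; the label prompts and the image filename are made up for illustration.

```python
import torch
import clip  # OpenAI's open-source CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Turn candidate class names into captions and embed them alongside the image.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_tokens = clip.tokenize(labels).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Normalize, then compare: the caption whose embedding is closest wins.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({label: f"{p:.2%}" for label, p in zip(labels, probs[0].tolist())})
```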
## From Language Models to Practical Applications
*[Figure: language model basics]*
The goal is to look back at the source material, count which words followed this stem ("it was the"), and return a probability dictionary that describes what word you might predict next. Based purely on the training data, which word followed this stem most frequently?
We can then turn that language model into a generative model by randomly sampling from this probability dictionary. We use the same autoregressive approach: generate one word or token, append it to the end, update the context window, and slide over to generate new text sampled from this distribution.
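Here is a toy sketch of that idea in Python: build the probability dictionary by counting which word follows each stem, then sample from it autoregressively. The corpus, the context length, and the function names are all made up for illustration; notice that with a tiny, repetitive corpus the output quickly starts to loop, which sets up the next point.

```python
import random
from collections import defaultdict, Counter

def build_lm(corpus: str, context: int = 3) -> dict:
    """Count which word follows each `context`-word stem in the training text."""
    words = corpus.split()
    counts = defaultdict(Counter)
    for i in range(len(words) - context):
        stem = tuple(words[i:i + context])
        counts[stem][words[i + context]] += 1
    return counts

def generate(counts: dict, stem: tuple, length: int = 20) -> str:
    """Autoregressively sample: predict a word, append it, slide the context window."""
    out = list(stem)
    for _ in range(length):
        options = counts.get(tuple(out[-len(stem):]))
        if not options:
            break  # stem never seen in training data
        next_words, freqs = zip(*options.items())
        out.append(random.choices(next_words, weights=freqs)[0])
    return " ".join(out)

# Illustrative corpus: a tiny, repetitive training set.
lm = build_lm("it was the best of times it was the worst of times " * 3)
print(generate(lm, ("it", "was", "the")))
```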
## The Problem with Simple Language Models
A model this simple gets stuck and repeats itself. Keep this example in mind, because it's one of the simplest examples I could come up with to illustrate some of what's going on when you hear people talking about a language model hallucinating.
## Building Better Chatbots
So that's a basic language model. Now I'd like to jump forward to how we take one of these and build something that looks like a chatbot.
*[Figure: model comparison chart]*
In this case we're using one of Google's older language models, called LaMDA. There are differences between how LaMDA behaves and how something like Gemini or ChatGPT or any other more modern model will behave.
But it's useful because it doesn't yet have a lot of the post-training that changes the language model's behavior. And I think this part is the most telling.
## Understanding Training Data Influence
It is just trying to recreate the training data that it saw. Hopefully this gives a glimpse into a phenomenon we'll talk about a lot: language models are like fuzzy lookups back into their training data.
That text was probably totally consistent every time it appeared in the training data.
---
## Conclusion
CLIP represents a fundamental breakthrough in multimodal AI by solving the challenge of learning joint representations of images and text without requiring exact caption matching. Through contrastive learning at massive scale, CLIP creates a shared embedding space where semantically similar images and captions cluster together, while dissimilar pairs are pushed apart. This approach enables powerful zero-shot capabilities and has become a foundation for many modern multimodal applications.
The key insight is that instead of trying to predict exact captions, CLIP learns to distinguish between correct and incorrect image-text pairs, creating robust representations that generalize well to new tasks without additional training. This methodology has influenced countless subsequent developments in multimodal AI and continues to be a cornerstone of modern computer vision and natural language processing systems.