dH #002: Multimodal AI: Building the Future of Human-Computer Interaction
Exploring three powerful approaches to creating AI systems that see, hear, and understand like humans
________________
What Are Multimodal Models?
Let's start with the basic question: what is a multimodal model? A common shorthand is "LLMs that can see and hear," but here I'll define them more broadly as AI systems that can process multiple data modalities, like text, images, and audio, as input, output, or both.
Multimodal definition
A notable example of a multimodal model is GPT-4o, which can take in text, images, and audio and return text. In other words, it can perform X-to-text tasks, such as taking in an image and generating a text description of it.
Multimodal examples
Other examples include FLUX, which generates images from text, and Suno AI, which generates music from text. Although these are all multimodal models, they are developed in very different ways: GPT-4o handles X-to-text, FLUX does text-to-image, and Suno AI does text-to-music.
Different multimodal approaches
Why Build on Large Language Models?
Here, I'm going to talk about a specific type of multimodal model, one that builds on top of large language models. These are sometimes called multimodal (large) language models, or simply multimodal LLMs.
Multimodal LLMs definition
The basic idea is that we take a large language model, either a pre-trained model or just its architecture, and augment it in some way so that it can process multiple data modalities.
Here's what that might look like: we have our LLM acting as a reasoning engine, we augment it so that it can also take in images and audio as inputs, and then we add components to the end so that it can also generate images and audio as outputs.
LLM augmentation concept
The key idea is that these multimodal (large) language models use an LLM as their core reasoning engine.
LLM as reasoning engine
We might ask ourselves: why use language models to go multimodal when they are optimized specifically for text? There are a couple of reasons.
The first is world knowledge. An LLM trained on massive amounts of text picks up a tremendous amount of knowledge about the world, including concepts that we might not initially associate with text.
LLM world knowledge
For example, researchers have shown that GPT-4 could draw images and perform spatial reasoning even though it was trained strictly on text. So it seems that even though large language models are only trained on text, they also learn concepts that may be helpful for processing other types of data, like images and audio.
LLM conceptual understanding
The second reason is that text prompts are a natural way to enable zero-shot capabilities. Zero-shot means using a model to perform a task it was never specifically trained on. Text prompts, which are native to large language models, are a natural way to steer a model toward these novel tasks through plain text descriptions.
Zero-shot capabilities
Three Paths to Multimodal AI
With that motivation in mind, let's look at three ways to build multimodal (large) language models on top of LLMs, with the LLM acting as a reasoning engine that brings world knowledge and zero-shot capabilities.
The first is augmenting LLMs with Tools. The second is augmenting LLMs with so-called Adapters. The third and final approach is to create Unified Models that are trained from scratch.
Three paths overview
________________
Path 1: LLM + Tools – The Simple Approach
Starting with Path 1: this consists of adding external modules that do X-to-text or text-to-X, where X represents any arbitrary data modality.
LLM + Tools approach
The way this works is that we have our large language model, which is only capable of taking text in and spitting text out, and we bolt external modules onto it to handle X-to-text or text-to-X. If we wanted to, say, add the ability to process audio or speech, a very simple way of doing this would be to pass any input speech through a speech-to-text model.
Speech-to-text pipeline
Concretely, we use a speech-to-text model like Whisper to transcribe audio into text and then pass that text into the large language model (LLM). From there, we can prompt the LLM with the transcribed text to accomplish various reasoning tasks.
Whisper integration
For example, if you wanted to summarize YouTube videos, you could take the audio, transcribe it with Whisper, pass the transcription to the large language model, and prompt it to summarize the transcript, as in the sketch below.
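Below is a minimal sketch of that pipeline, assuming the open-source whisper package and a model served locally through the ollama Python library; the audio file path and model names are placeholders, not a prescribed setup.

```python
# Path 1 sketch: transcribe audio with Whisper, then summarize it with an LLM.
# Assumes `pip install openai-whisper ollama`; file path and model names are placeholders.
import whisper
import ollama

# Step 1: speech-to-text with a pre-trained Whisper model
stt_model = whisper.load_model("base")
transcript = stt_model.transcribe("youtube_audio.mp3")["text"]

# Step 2: prompt the LLM with the transcribed text
response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": f"Summarize the following video transcript:\n\n{transcript}",
    }],
)
print(response["message"]["content"])
```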
Text-to-Image Generation
So let’s say we wanted to generate an image using a large language model and an external image generation tool. We can tell the large language model to generate a prompt that will be passed into an image generation model like DALL-E or Stable Diffusion.
Text-to-image workflow
For example, it could generate a prompt for an image generation tool like Stable Diffusion, and then that tool will generate an image based on the prompt.
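As a rough sketch of that text-to-X direction, here is one way to wire it up, assuming the ollama library for the LLM and the Hugging Face diffusers library for the image model; the model IDs are illustrative choices, not the only options.

```python
# Path 1 sketch: have the LLM write an image prompt, then hand it to a diffusion model.
# Assumes `pip install ollama diffusers torch transformers`; model IDs are illustrative.
import ollama
from diffusers import StableDiffusionPipeline

# Step 1: ask the LLM to craft a detailed image-generation prompt
response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": "Write a short, vivid prompt for an image of a dog silhouette at sunset.",
    }],
)
image_prompt = response["message"]["content"]

# Step 2: pass that prompt to a text-to-image model
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
image = pipe(image_prompt).images[0]
image.save("generated_image.png")
```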
Advantages and Limitations
And indeed, this augmentation of large language models with external tools like Whisper for speech-to-text is how early versions of ChatGPT were able to process multiple data modalities.
Early ChatGPT implementation
One pro of this approach is that it is very simple to implement. You don't have to train any models, and you don't have to be a machine learning engineer; you can get by with basic software engineering and connecting some APIs together.
The other pro is that no training data is needed. You can use pre-trained models out of the box, like Whisper for speech-to-text, an LLM for reasoning, and FLUX for text-to-image, and stitch them together to fit your needs.
Pre-trained models
But of course, there are some cons. One is that this type of system has limited capabilities: there may not be a pre-trained model that can reasonably translate an arbitrary modality into text. For example, if you have sensor data collecting information about the weather, there may not be a natural way to translate it into text and pass it into the large language model.
Another limitation is that you can only customize the system by prompting the large language model, which may make it challenging to steer the model's performance in a particular direction. And since these are all independently developed modules, there's no guarantee that they will play nicely together.
________________
Path 2: LLM + Adapters – Better Integration
A way to unlock better customization of this type of multimodal model is to train so-called adapters that glue together the different components of the system. Specifically, we add encoders and decoders to the LLM, which are aligned via fine-tuning.
LLM + Adapters architecture
To show what this looks like: we still have our LLM, which only takes in text and outputs text. If we want to add images, we can use a so-called image encoder like CLIP, which takes an image and translates it into a dense vector representation.
We then train a relatively small number of parameters, called an adapter, to translate the image encoder's dense vector representation into one that is compatible with the LLM's internal concept space. This can be done in a few different ways; a popular one uses image-caption pairs.
So you can train this adapter such that the image representation is similar to the LLM’s representation of the corresponding caption. And then we can do an analogous thing on the output side.
Let's say we want to generate an image. We can take an image decoder like FLUX, which translates a dense vector representation into an actual image, and glue the vector output of the large language model to the vector input of that image decoder using a decoding adapter.
Training Requirements
A key point is that there is no need to retrain the image encoder, the LLM, or the image decoder. The only things that need to be trained are the adapter weights.
Adapter training
That means either the encoder adapter, the decoder adapter, or both. In the diagram, fire emojis indicate the trainable parameters while snowflake emojis indicate the frozen parameters.
Trainable vs frozen parameters
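To make this concrete, here is a minimal sketch of an encoder adapter: a small trainable projection that maps frozen CLIP image features toward a frozen LLM's representation of the matching caption. The model choices (CLIP ViT-B/32, GPT-2 standing in for the LLM) and the simple MSE alignment loss are illustrative assumptions, not the recipe of any particular paper.

```python
# Adapter sketch: train a small projection between a frozen CLIP vision encoder
# and a frozen LLM, using image-caption pairs. Illustrative, not a specific paper's recipe.
import torch
import torch.nn as nn
from transformers import (CLIPVisionModel, CLIPImageProcessor,
                          AutoModelForCausalLM, AutoTokenizer)

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
llm = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for a larger LLM
tok = AutoTokenizer.from_pretrained("gpt2")

# Freeze the pre-trained components (the "snowflakes")
for p in list(vision.parameters()) + list(llm.parameters()):
    p.requires_grad = False

# The adapter (the "fire emoji"): project CLIP features to the LLM's hidden size
adapter = nn.Linear(vision.config.hidden_size, llm.config.hidden_size)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def training_step(image, caption):
    # Encode the image with the frozen vision encoder (pooled feature)
    pixels = processor(images=image, return_tensors="pt").pixel_values
    img_feat = vision(pixel_values=pixels).pooler_output            # (1, d_vision)

    # Represent the caption with the frozen LLM (mean-pooled last hidden states)
    ids = tok(caption, return_tensors="pt").input_ids
    txt_feat = llm(ids, output_hidden_states=True).hidden_states[-1].mean(dim=1)

    # Train only the adapter so the projected image feature matches the caption feature
    loss = nn.functional.mse_loss(adapter(img_feat), txt_feat)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```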
Advantages and Challenges
One pro of this approach is better customization: through adapters, you have a way of aligning the different pre-trained models via fine-tuning so that they work nicely together.
Adapter benefits
Another upside is data efficiency. Since we're only training a relatively small number of parameters in these encoder and decoder adapters, you don't need massive training datasets; you can probably get good performance with on the order of hundreds or thousands of examples.
The con, of course, is that this requires training data, whereas the previous approach required none. It's also more work on the machine learning engineering side, and a bit more technically sophisticated, since it involves fine-tuning to align the encoders and decoders with the LLM.
Technical complexity
So now you have the pros of better customization and data efficiency, but the con of requiring training data.
Advanced Considerations
You can't really get by with just basic software engineering and API calls. You're going to need to put on your data engineering, data science, and machine learning engineering hats to develop a system like this, which adds fine-tuned encoders, decoders, and modality adapters to a large language model.
Required expertise
There is a wide range of models in the literature that follow this approach of adding encoders, decoders, and modality adapters to a large language model. Sometimes they take it one step further: after training the adapters, there is a second stage of end-to-end fine-tuning of the whole system.
Another nuance is that sometimes people add cross-attention layers inside the large language model, which allow the model to mix the different modalities together.
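As an illustration (not any specific model's architecture), a cross-attention block like the sketch below lets text hidden states attend to image features inside the LLM; the dimensions are arbitrary.

```python
# Illustrative cross-attention block: text tokens (queries) attend to image features.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Text tokens are the queries; image features supply the keys and values
        mixed, _ = self.attn(query=text_hidden, key=image_feats, value=image_feats)
        return self.norm(text_hidden + mixed)   # residual connection + layer norm

# Example shapes: 1 sample, 16 text tokens, 50 image patches, width 768
block = CrossAttentionBlock()
out = block(torch.randn(1, 16, 768), torch.randn(1, 50, 768))
print(out.shape)   # torch.Size([1, 16, 768])
```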
________________
Real-World Example: LLaMA 3.2 Vision in Action
With LLaMA 3.2 Vision, we can pass both an image and a question to the model and then simply print the model's response, as in the sketch below.
LLaMA Vision example
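A minimal sketch of that call, assuming the model is served locally and accessed through the ollama Python library; the image path is a placeholder.

```python
# Visual question answering sketch with a locally served LLaMA 3.2 Vision model.
# Assumes `pip install ollama`; the image path is a placeholder.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What is in this image?",
        "images": ["images/man_on_bench.jpeg"],   # placeholder path
    }],
)
print(response["message"]["content"])
```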
I then asked LLaMA 3.2 Vision, "What is in this image?" and this is what it said: the image shows a man wearing a uniform sitting on a bench with his hands clasped.
Image description demo
I would say it does a pretty good job just from this image, without any additional context. This is a simple example of visual question answering.
Streaming Capabilities
Another thing we can do is streaming. On my machine, which has an M1 chip, it still took somewhere between 30 seconds and a minute to generate that previous response. To make the waiting a bit more bearable, we can enable streaming.
Enabling streaming is easy: we make the same chat call with the same exact input, but add a stream=True argument, and then print text from the stream in a for loop, as shown below.
Streaming implementation
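A sketch of that streaming pattern, again assuming the ollama library; the model name and image path are placeholders.

```python
# Streaming sketch: same chat call, but stream=True yields the response chunk by chunk.
import ollama

stream = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Can you write a caption for this image?",
        "images": ["images/man_on_bench.jpeg"],   # placeholder path
    }],
    stream=True,
)

# Print each chunk as it arrives instead of waiting for the full response
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```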
I then asked it, "Can you write a caption for this image?", passing the same image via images: ['images/show_sitting_jpeg'], and it gave a different answer this time, which is pretty interesting.
Meme Understanding
Describing images objectively is one task, but understanding humor and memes requires a different level of, let's say, intelligence. So let's see how well the model can explain memes to me.
It's basically the same exact code as before, but now we pass it a different image and a different prompt. I'm asking it to explain this meme to me.
Meme analysis setup
So I asked LLaMA 3.2 Vision to explain this meme to me, and this is what it said: the meme depicts Patrick Star from SpongeBob SquarePants, surrounded by various AI tools and symbols, with the text "Trying to build with AI today…" at the top.
Meme explanation
It added that the image humorously illustrates the challenges of using AI in building projects, implying that it can be overwhelming and frustrating. I think it got the gist of the meme: it recognized Patrick Star, the surrounding AI tools and symbols, and the caption, and it picked up on the overwhelm and frustration the meme is conveying.
Meme understanding result
Optical Character Recognition
And then the final thing is optical character recognition. The code looks exactly the same, but now we prompt it with "Can you transcribe the text from this screenshot in markdown format?" and pass in a different image.
OCR example setup
So not only do I want it to parse this text and write it out, I want it in markdown format; a sketch of the call follows.
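The call itself is the same pattern as before, sketched here with the ollama library; the screenshot path is a placeholder.

```python
# OCR sketch: same chat pattern, now prompting for a markdown transcription.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Can you transcribe the text from this screenshot in markdown format?",
        "images": ["images/slide_screenshot.png"],   # placeholder path
    }],
)
print(response["message"]["content"])
```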
The transcription it returned says "Example code: LLaMA 3.2 Vision for Image-based Tasks" and lists "OCR (Optical Character Recognition)" as one of the tasks. The only thing missing is that "OCR (Optical Character Recognition)" should be a header, so there should be a hashtag or two in front of it.
OCR results
I didn't see any spelling mistakes, and it got the numbering and the bullet points right.
OCR formatting accuracy
The Power of Zero-Shot Capabilities
The amazing thing about these emerging multimodal systems is really this zero-shot capability. Before, if I wanted to do OCR, I needed an OCR-specific model, and that model might not output markdown. Now I have a single model that can not only produce markdown but could probably also output HTML or plain text.
You can easily customize these models simply by changing the prompt, no fine-tuning or model training required. I barely scratched the surface with these three examples, but hopefully this gives you an idea of what these models can do.
________________
What’s Next: Multimodal Embedding Models
This was actually the first installment in a series on multimodal AI. In the next one, we're going to talk about multimodal embedding models.
Future topics
Basically, the way CLIP works is that it takes in both text and images and represents them in a shared vector space, such that the text "a silhouette of a dog," for example, will appear close to an image of a dog silhouette.
So I'll talk about how we can train these multimodal embedding models using a CLIP-style architecture.
CLIP architecture preview
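As a small preview, here is a sketch of using a pre-trained CLIP model to score text against an image in that shared vector space, via the Hugging Face transformers library; the image path is a placeholder.

```python
# CLIP sketch: embed text and an image into a shared space and compare them.
# Assumes `pip install transformers torch pillow`; the image path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_silhouette.png")   # placeholder image
texts = ["a silhouette of a dog", "a silhouette of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means the text sits closer to the image in the shared space
print(outputs.logits_per_image.softmax(dim=-1))
```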
________________
Conclusion
The journey from simple API-based multimodal systems to sophisticated adapter-based architectures represents a fundamental shift in how we build AI that can truly understand and interact with the world across multiple modalities. Each approach offers distinct advantages: tools-based systems provide simplicity and immediate deployment, adapter-based models offer better customization and integration, while unified models promise the ultimate in performance and capability.
As we’ve seen through practical examples with LLaMA 3.2 Vision, these systems are already demonstrating remarkable zero-shot capabilities across image description, meme understanding, and optical character recognition. The ability to customize behavior simply through prompting, without requiring additional training, opens up countless possibilities for developers and researchers.
The future of multimodal AI lies not just in building systems that can process multiple data types, but in creating truly intelligent assistants that can reason across modalities, understand context and nuance, and adapt to new tasks with minimal guidance. As these technologies continue to evolve, we’re moving closer to AI systems that interact with the world as naturally and flexibly as humans do.