dH #009: Understanding Diffusion Models: From Theory to Stable Diffusion
*A deep dive into the revolutionary generative AI technology that’s transforming image synthesis*
—
## Introduction
The concept of a diffusion model dates back to a 2015 paper by researchers at Stanford and Berkeley titled “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics”. This foundational work introduced a fascinating approach to generative modeling that would eventually revolutionize the field of AI-generated imagery.
Original diffusion paper
I feel like there's an interesting tradition of people with an applied physics background bringing some of the fundamental basic-science concepts to deep learning. But actually, a lot of the original ideas date back to academics who just have a really deep understanding of applied math.
## The Core Concept: Forward and Reverse Diffusion
The idea is to “slowly destroy structure in a data distribution through an iterative forward diffusion process”. Then “learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data”.
Diffusion process illustration
And so you have this forward process where you’re adding noise to this image and this backward process where the original image is reconstructed kind of one step at a time, as illustrated by the arrow pointing from the noisy images to the clean image of a dog.
The second part of this is really hard: it's going to depend on the entire data distribution. And when we have to learn some really hard, complicated process like that, how do we do it with a neural network?
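For reference, here is the standard DDPM notation for the forward process (my addition; the notation isn't spelled out above). Each step adds a little Gaussian noise, and composing the steps gives a closed form for the noisy image at any step $t$:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
$$

with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. The forward direction is trivial; the hard part is learning the reverse conditionals $p_\theta(x_{t-1} \mid x_t)$, which is where the neural network comes in.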
## Early Results and Initial Reception
The first paper doesn’t yield particularly impressive results, and my impression is that people didn’t pay a huge amount of attention to this initially. You see the training data from the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset on the left (a) and random samples generated by the diffusion model on the right (b).
Early diffusion results
## The 2021 Breakthrough: Improved Denoising Diffusion Probabilistic Models
And then in 2021, there's this paper, 'Improved Denoising Diffusion Probabilistic Models'. This paper observes that predicting the reverse diffusion process is difficult because it depends on the entire data distribution.
DDPM paper insights
We use a neural network for this, but what should the neural network predict? It can predict the original image, or it can predict the noise and then remove it. What this paper shows is that it works much better to predict the noise.
## Technical Innovations: From VAE to Denoising Autoencoders
The original literature trained a neural network to optimize a variational lower bound, as in a variational autoencoder. DDPM instead trains a denoising autoencoder.
Technical comparison
And in particular, it can predict the original image, or it can predict the noise and then subtract the noise out to get the original image, which is an equivalent problem. But what this paper shows is that it works much better to predict the noise.
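Concretely, the noise-prediction objective boils down to the simplified DDPM loss (standard notation, added here for reference rather than quoted from the talk): the network $\epsilon_\theta$ sees a noised image and the step $t$, and is trained with a plain regression on the noise that was added,

$$
L_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon\sim\mathcal{N}(0,\mathbf{I}),\ t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2\right],
$$

rather than with the full variational lower bound on the image itself.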
## Optimization Discoveries
If you use more diffusion steps, it works a lot better. Previously, people had used around 1,000 diffusion steps, but as shown in the plot, using 4,000 diffusion steps results in a significantly lower loss.
Diffusion steps comparison
And they also find that the noise schedule matters, with the latents in the last quarter of the linear schedule being almost purely noise, whereas the cosine schedule adds noise more slowly.
Noise schedules
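For the curious, here is a minimal sketch of the two schedules being compared, following the published formulas (linear betas from the original DDPM paper, cosine schedule from Improved DDPM); the endpoints are those papers' defaults, not something specific to this post:

```python
import numpy as np

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule from the original DDPM paper."""
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T=1000, s=0.008):
    """Cosine schedule from Improved DDPM: derive betas from a cosine-shaped alpha_bar."""
    t = np.arange(T + 1)
    f = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

# alpha_bar_t is the fraction of signal surviving at step t. The linear schedule has driven
# it to nearly zero well before the last quarter of the steps; the cosine schedule decays
# more slowly, which is exactly the observation in the paper.
for name, betas in [("linear", linear_betas()), ("cosine", cosine_betas())]:
    alpha_bar = np.cumprod(1 - betas)
    print(name, round(alpha_bar[750], 4), round(alpha_bar[-1], 6))
```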
## Diffusion Models Beat GANs
So there's this paper, and then the same authors, a couple of months later, had a paper called 'Diffusion Models Beat GANs on Image Synthesis'. GANs can employ synthetic labels to improve performance (see Lucic et al. (2019), 'High-Fidelity Image Generation With Fewer Labels'). They incorporate similar strategies into DDPM.
Diffusion vs GANs paper
## Enter Stable Diffusion: A Game Changer
Rather than training diffusion models on pixel images, you can train on the latent image space created by perceptual image compression. This is much more computationally tractable; you can run Stable Diffusion on Colab. It's also completely open-source, versus having to go through the OpenAI interface for DALL-E 2. As usual, the most instructive way to get a high-level overview of what it is doing is to talk through Jay Alammar's blog post on the topic.
Stable diffusion overview
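To make the "runs on Colab, fully open-source" point concrete, here is a minimal sketch using Hugging Face's `diffusers` library (the model id is one public v1.5 checkpoint, used here only as an example):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint in half precision so it fits on a Colab-class GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the text encoder, the 50-step denoising loop, and the image decoder.
image = pipe("a photo of a dog in the snow", num_inference_steps=50).images[0]
image.save("dog.png")
```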
## Technical Architecture of Stable Diffusion
Instead of reconstructing the pixels, or predicting the noise at the pixel level and subtracting it out, you can train on the latent image space created by something called perceptual image compression, with the noise added more slowly according to a cosine schedule, as shown in the bottom image.
Latent space visualization
And this is important because it's way more tractable: you're training on this compressed latent space instead of on raw pixels.
## The Three-Component Architecture
The image generator consists of an Image Information Creator (UNet and scheduler – where diffusion takes place) and an Image Decoder. And for the text encoder, it's going to use the text encoder from CLIP.
Architecture components
The CLIP text encoder (CLIPText) takes a text prompt of up to 77 tokens and produces token embeddings; the Image Information Creator takes the encoded text and a noise tensor as input and creates the processed image information tensor, which the Image Decoder then takes as input to generate the image.
Stable diffusion pipeline
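As a rough sketch, the three components map onto three separately loadable modules in the open-source release; here is how they can be pulled in with Hugging Face's `diffusers` and `transformers` libraries (the checkpoint id is an example, and any v1.x checkpoint with the same layout should work):

```python
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"

# Text encoder: CLIPText (tokenizer + transformer producing the 77 token embeddings).
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Image Information Creator: the UNet noise predictor plus the scheduler.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Image Decoder: the autoencoder whose decoder maps latents back to pixels.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
```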
## The Diffusion Process in Detail
So you’re going to have a random image information tensor and you’re going to combine that with the token embeddings from the Text Encoder (CLIPText) and feed that into this Image Information Creator (UNet + Scheduler). It’s going to give you again this processed image information tensor, which is then going to be decoded into the image by the Image Decoder (Autoencoder decoder).
Tensor flow process
And so let’s kind of delve into this pink box, which is where the diffusion process happens, as illustrated by the sequence of UNet steps from 1 to 50 operating on the input latents array to produce another latents array.
Diffusion steps detail
And so each step of diffusion operates on the input latents array, producing another latents array. So remember that one of the big insights of stable diffusion is you can do this diffusion process on the latents, instead of doing it on the pixels.
Latent arrays progression
And so you can see each of these latent arrays; if you take one and decode it, it will be a more or less noisy version of the image depending on which diffusion step you are at.
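Putting the pieces together, here is a condensed sketch of that 50-step loop on the latents, reusing the tokenizer, text encoder, UNet, scheduler, and VAE loaded in the earlier sketch; it is a simplification (no classifier-free guidance or other production details), not the full pipeline:

```python
import torch

prompt = ["a photo of a dog"]
text_input = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids)[0]        # (1, 77, 768) token embeddings

# The random image information tensor lives in latent space, not pixel space.
latents = torch.randn(1, unet.config.in_channels, 64, 64)
scheduler.set_timesteps(50)
latents = latents * scheduler.init_noise_sigma

for t in scheduler.timesteps:                                  # steps 1..50
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # a slightly less noisy latents array

# Image Decoder: map the final latents back to pixel space (0.18215 is the v1 latent scaling factor).
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample
```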
## Training Process: Forward Diffusion
Training examples are created by generating noise and adding an amount of it to the images in the training dataset (forward diffusion). So you’re adding different amounts of noise. And you can generate many training examples from a single image by just adding different amounts of noise.
Forward diffusion training
So you pick an amount of noise and you know how much noise you’ve added. And so this allows you to create many training examples from your input training data set.
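A few lines make the point: one image, several noise amounts, several (input, label) training pairs. The alpha_bar values come from the same linear schedule sketched earlier, and the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                  # stand-in for one training image (or its latent)
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))

examples = []
for t in (10, 100, 500, 900):                    # four different noise amounts
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1 - alpha_bar[t]) * noise
    examples.append((noisy, noise, t))           # input, label (the noise we added), step number
```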
## The Training Process: Step by Step
So you pick a training example from the training dataset. And so that’s an image that has had a certain amount of noise added to it, with the noise amount being 3 in this example.
Training example selection
But remember, we’re actually operating on the latent space and not on the pixel space. And so that’s going to go into this UNet model and we’ll talk about what UNets are in a minute. And so that’s going to give a prediction for the noise.
And then we have our actual noise (Label), right? Because we’re the ones that created the noise to noise the image. And we use that to compute our loss. And then we update the model parameters of the UNet through backpropagation.
And so that's how we train this thing: super, super straightforward compared to GANs, which seemed, in many ways, quite a lot more convoluted.
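Here is a minimal PyTorch sketch of that training step, with a stand-in module in place of the real UNet (which also takes the step number and, later, the text conditioning as inputs); it is meant only to show the shape of the loss and the update:

```python
import torch
import torch.nn.functional as F

unet = torch.nn.Conv2d(4, 4, 3, padding=1)         # stand-in for the real UNet noise predictor
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)

latents = torch.randn(8, 4, 64, 64)                # a batch of (compressed) training examples
t = torch.randint(0, 1000, (8,))                   # a noise amount per example
noise = torch.randn_like(latents)                  # the label: the noise we create ourselves
ab = alpha_bar[t].view(-1, 1, 1, 1)
noisy = ab.sqrt() * latents + (1 - ab).sqrt() * noise   # forward diffusion

noise_pred = unet(noisy)                           # the UNet's prediction of the noise
loss = F.mse_loss(noise_pred, noise)               # compare the prediction to the actual noise
loss.backward()                                    # backpropagation updates the UNet parameters
optimizer.step()
optimizer.zero_grad()
```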
## Reverse Diffusion: From Noise to Image
The noise is predicted such that if we subtract it from the slightly de-noised image, we get an image that’s closer to the distribution of images that the model was trained on.
Denoising process
This process generates images by reverse diffusion (denoising), as illustrated in the diagram. And so far, what I've described hasn't been conditional, right? It hasn't described yet how we can condition it to control what is generated.
Image generation process
And so we essentially start with, you know, a noise sample with noise amount 2, whose noise is then predicted by the Noise Predictor in Step 1. The noise is predicted such that if we subtract it from the slightly de-noised image, we get an image that's closer to the distribution of images that the model was trained on.
Noise prediction step
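For reference, the standard DDPM sampling step that implements this "subtract the predicted noise" idea is (again in textbook notation rather than anything quoted from the post):

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, \mathbf{I}),
$$

so each step removes a scaled version of the predicted noise and mixes back in a small amount of fresh noise (except at the final step).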
## Unconditional Image Generation
Let’s see how stable diffusion generates a training example with a different image, noise sample and noise amount (forward diffusion) by: 1) Picking an image, 2) Generating some random noise, 3) Picking an amount of noise, and 4) Adding noise to the image in that amount.
Training example generation
Okay, so this illustrates the stable diffusion process for unconditional image generation. And I want to give a few more details on this stable diffusion process before I discuss how we would incorporate a text prompt to condition it.
Stable diffusion illustration
## The Key Innovation: Latent Space Compression
A major innovation of Stable Diffusion is to speed up the diffusion process by running it on a compressed, latent version of the image. The compression and decompression are done via an autoencoder, which compresses the image into latent space with an encoder and reconstructs the compressed information in pixel space using a decoder.
Autoencoder architecture
The image encoder compresses the image into latent space, and the decoder then reconstructs the compressed information in pixel space.
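Here is a sketch of that round trip with the `diffusers` `AutoencoderKL` (the 0.18215 scaling factor is the one used by the v1 checkpoints; the random tensor stands in for a real 512x512 image):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

pixels = torch.randn(1, 3, 512, 512)                              # stand-in for a 512x512 RGB image
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215   # encoder: pixels -> 4x64x64 latent
    recon = vae.decode(latents / 0.18215).sample                  # decoder: latent -> pixels
print(latents.shape, recon.shape)                                 # (1, 4, 64, 64), (1, 3, 512, 512)
```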
And so we have this image encoder and then we generate training examples with the forward process by adding different amounts of noise to their compressed/latent version, as shown by the ‘Compressed image (latent)’ and ‘Latent + noise’ boxes with increasing noise levels.
Latent space noise addition
And once we’ve trained the model, we generate images with this reverse process of Image Generation by Reverse Diffusion (Denoising) as shown in the diagram.
Reverse diffusion generation
## Text Conditioning: The Missing Piece
So how are we going to condition on text? I mean, because this is all good and well, and I can generate pictures of random things, but what I really want is to condition on a text prompt perhaps. And so how are we going to do that?
Well, we're going to take the Noise Predictor (UNet) with Text Conditioning, which is the UNet that we've trained, and we want to train it not just to take noise as input, but also to take the prompt text information (token embeddings) as input.
Text conditioning architecture
And so remember here we’re working in latent space, so both the input noise images and predicted noise sample are in latent space, even though in this illustration it is showing like a picture of the noise.
## Training with Text Inputs
And so you see here, we have our inputs, which include the step number, the image, and now also a text encoding, and we have the noise sample as the output or label, which the Noise Predictor (UNet) with Text Conditioning learns to predict so that we can generate our image.
Training inputs and outputs
And now we want the UNet to take text conditioning.
## UNet Architecture: The Heart of Diffusion
So let’s first of all just explain how this Noise Predictor (UNet) module works without text conditioning. So just doing unconditional image generation.
And so you can see that the UNet consists of these ResNets and they’re going to have residual connections for the reasons we’ve seen earlier in the course. Each layer of the network uses the previous layer’s output as input.
UNet architecture
And so essentially this has like ResNets in it that look somewhat kind of akin to stuff that we’ve seen earlier in the course. And now we want to be able to condition on the text.
## Adding Text Conditioning with Attention
To incorporate text conditioning, attention layers are added between the ResNet blocks. The ResNets don't look directly at the text, but the attention layers merge the text representations into the latent image representations.
Text conditioning with attention
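Here is a deliberately simplified PyTorch sketch of that interleaving (not the actual Stable Diffusion implementation, and the sizes are illustrative): a ResNet block that never sees the text, followed by a cross-attention block in which the latent positions attend over the 77 CLIP token embeddings.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Convolutional block with a residual connection; it only sees the latent image."""
    def __init__(self, channels):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        h = self.conv1(self.act(self.norm1(x)))
        h = self.conv2(self.act(self.norm2(h)))
        return x + h                                   # residual connection

class CrossAttentionBlock(nn.Module):
    """Latent positions attend to the text token embeddings, merging text into the latents."""
    def __init__(self, channels, text_dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, x, text_emb):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (b, h*w, c): one token per latent position
        attended, _ = self.attn(self.norm(tokens), text_emb, text_emb)
        tokens = tokens + attended
        return tokens.transpose(1, 2).reshape(b, c, h, w)

res, attn = ResBlock(64), CrossAttentionBlock(64, text_dim=768)
latents = torch.randn(1, 64, 32, 32)
text_emb = torch.randn(1, 77, 768)                     # stand-in for the CLIPText token embeddings
out = attn(res(latents), text_emb)
```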
## Real-World Applications and Limitations
Shao-Yu's artwork on the left depicts a digital illustration of a person with green and yellow hair, which appears to be an artistic interpretation rather than a likeness of a famous Japanese actress. But Stable Diffusion is essentially trained on things like image-caption pairs in English, and it has no idea who this illustrated person is.
Stable diffusion examples
You know, even if it sees something occasionally in training, we know it’s way more able to memorize its training data if it sees something many, many times. And so stable diffusion knows who Donald Trump is. That is not a problem for stable diffusion. There are many, many images and captions of him on the English speaking internet. But that’s not sort of universally true.
## The GlyphDraw Innovation
So the final thing I want to talk about is a paper called GlyphDraw that just came out a few days ago. This paper was released by one of the largest telecom companies in China.
Recall that previous approaches employed a four-network architecture: a Generator network; a Discriminator network to keep images realistic; a Style network to classify styles; and a Recognizer network to promote accuracy of the image texts. And so this, you know, felt very convoluted.
Previous approaches
And it felt like, if we wanted to extend this to Chinese handwriting, we'd worry about how complicated this thing would be to train and to work with. But it's kind of the best that exists.
## Simplified Approach with Massive Scale
They continue pre-training a Chinese Stable Diffusion, pre-training on a dataset of 100M Chinese image-text pairs. This takes 80 (!!!!!) A100 GPUs.
Training specifications
And so, you know, even eighty A100 GPUs is going to put you out, I don't know, several hundred thousand dollars, maybe half a million dollars. So this is a really, really compute-intensive task: you could do this if you're the largest telecom company in China or you're OpenAI, but certainly none of us are going to do this.
## Technical Modifications for Chinese Text
Then they keep training a modified Stable Diffusion, with two key modifications: (1) replace the image latent with a concatenation of the image latent vector, a text mask, and a glyph image; (2) whereas the original condition (text prompt) in Stable Diffusion just uses a CLIP text encoder, they have a fusion module that combines the CLIP image and text encodings.
Modified architecture
There’s an image of a man wearing a red shirt with the Chinese characters ‘中国’ (meaning ‘China’) printed on it, and there’s a glyph showing those same characters, along with a mask highlighting the glyph’s location, and a caption describing the image.
Training data example
And then, for the input, they're concatenating these three things: the image latent, the text mask, and the glyph image.
Input concatenation
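As a very rough illustration of that concatenation (my own sketch, with made-up shapes and the assumption that the mask and rendered glyph are downsampled to the latent resolution, none of which is spelled out above):

```python
import torch

latent = torch.randn(1, 4, 64, 64)        # Stable Diffusion image latent
mask = torch.zeros(1, 1, 64, 64)          # 1 where the text should appear in the scene
mask[:, :, 20:40, 10:50] = 1.0
glyph = torch.rand(1, 1, 64, 64)          # rendered glyph image, downsampled to latent size

unet_input = torch.cat([latent, mask, glyph], dim=1)   # (1, 6, 64, 64)
# The UNet's first convolution would need to be widened to accept the extra channels.
print(unet_input.shape)
```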
## Limitations and Future Directions
And due to the nature of the training data, it only generates short text, because their whole goal is to generate these scene-text and glyph representations. It is very impressive how it does this, but it's not necessarily something we would use to create synthetic training data for an OCR module; handwriting transformers are still kind of the way to go on that.
I still think there's potential promise in this approach. We were talking in a reading group about how, instead of having the edge detection of a deer, maybe you could detect key points from Chinese characters or from document layouts and use those for conditioning.
Key point detection concept
So as long as you didn’t have too many key points in a character that would give you a lot of scope to have variation in the font.
## Training Efficiency Considerations
And the good thing about this is that it's incredibly efficient to train, because you're totally locking everything about Stable Diffusion and just training this ControlNet portion of it. In the approach shown in these images, they're using Canny edge detection on the source image to condition the generation of images depicting deer in various environments. So it's quite efficient to train, whereas retraining the whole model, as above, is very compute intensive.
Efficiency comparison
They trained this model on 80 A100 GPUs, but due to the nature of their training data, it only generates short texts.
Training challenges
—
## Conclusion
Diffusion models represent a fundamental shift in generative AI, moving from the complex adversarial training of GANs to a more straightforward and intuitive approach based on noise prediction. The journey from the 2015 Stanford-Berkeley paper introducing non-equilibrium thermodynamics to modern implementations like Stable Diffusion showcases remarkable progress in both theoretical understanding and practical applications.
The key breakthrough came with the realization that predicting noise rather than reconstructing original images leads to superior results, combined with the innovation of operating in compressed latent space rather than pixel space. This approach not only makes diffusion models computationally tractable but also opens the door to powerful text-to-image generation through attention-based conditioning mechanisms.
Stable Diffusion’s three-component architecture—the CLIP text encoder, UNet-based Image Information Creator, and autoencoder decoder—demonstrates how modular design can create flexible and powerful generative systems. The UNet’s use of ResNet blocks with residual connections, enhanced by attention layers for text conditioning, provides both the representational power and the conditioning capabilities needed for high-quality, controllable generation.
Recent innovations like GlyphDraw highlight both the potential and challenges of extending diffusion models to specialized domains. While the computational requirements remain substantial—requiring dozens of top-end GPUs and massive datasets—the results demonstrate that diffusion models can adapt to complex tasks like Chinese text generation with appropriate architectural modifications.
Looking forward, the field continues to evolve rapidly. The tension between training efficiency and model capability remains a central challenge, but approaches like ControlNet show promise for achieving sophisticated conditioning without prohibitive computational costs. The potential for detecting key points from characters or document layouts for conditioning suggests exciting possibilities for document AI and multilingual applications.
As diffusion models continue to mature, they’re not just changing how we generate images—they’re reshaping our understanding of what’s possible in AI-driven content creation, from art and design to technical applications like OCR and document processing. The simplicity of their training paradigm, combined with their impressive results, positions diffusion models as a cornerstone technology in the evolving landscape of generative artificial intelligence.