LLM_log #014: Stable Diffusion & Conditional Latent Diffusion — From VAE Compression to Cross-Attention Conditioning

datahacker.rs | 12.03.2026

Highlights:

Stable Diffusion doesn’t paint an image in one shot — it sculpts one from static, guided by your words. In this post we disassemble the entire machine. We start with the VAE that compresses pixels into a tractable latent space, walk through the forward and reverse diffusion processes, open up the UNet to see how cross-attention physically connects text tokens to spatial regions, and finish with the complete Latent Diffusion architecture diagram that ties every piece together. By the end, you will be able to trace a single forward pass — from a text prompt and a seed of Gaussian noise to a generated 512×512 image — and explain exactly what happens at every tensor boundary. So let’s begin!

Tutorial Overview:

The Generative Model Landscape
DDPM: Forward and Reverse Markov Chains
The VAE — A Neural JPEG Compressor
VAE Quality Proof: Original vs. Reconstruction
Forward Diffusion in Latent Space
The Complete Forward + Reverse Process
Single Denoising Step — The Core Subtraction
Progressive Denoising — Image Emerges from Noise
The Three Components of Stable Diffusion
The Pipeline with Tensor Dimensions
CLIP Training: Three Steps to a Shared Embedding Space
UNet: Three Inputs, One Output
Inside the UNet — ResNet + Cross-Attention
The UNet Training Loop
Training Dataset Structure
Conditional Generation on MNIST
Classifier-Free Guidance
LDM Complete Architecture
The Full Inference Pipeline
LDM Architecture — QKV Cross-Attention Focus
Summary

1. The Generative Model Landscape

Before we build Stable Diffusion, let’s orient ourselves. The generative model landscape is crowded, but fundamentally, every model here is trying to solve the exact same mathematical problem: mapping a simple, sampleable distribution (like a standard Gaussian $\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})$) to the impossibly complex distribution of natural images $p(\mathbf{x})$.

Four families have emerged, each with a radically different strategy:

GANs learn this mapping implicitly through an adversarial game — a counterfeiter vs. police dynamic — bypassing density estimation entirely.
VAEs act as a “neural JPEG,” forcing data through a low-dimensional bottleneck and optimizing a variational lower bound on the log-likelihood.
Flow models use strict mathematical bijectivity to guarantee exact reconstruction, but at the cost of constrained, carefully designed architectures.
Diffusion models break from the single-pass paradigm. Instead of mapping $\mathbf{z}$ to $\mathbf{x}$ in one leap, they define a Markov chain that slowly diffuses data into noise, and train a neural network to learn the incremental reverse steps.

Taxonomy of 4 generative model families: GAN, VAE, Flow, Diffusion — Fig 1. The four families of generative models. Each solves the same core problem — mapping noise to data — but with fundamentally different mechanisms. Diffusion models (bottom-right, highlighted) are the focus of this post.

Diffusion is the youngest of these families, but it has rapidly become dominant. The key insight: by decomposing generation into hundreds of small, easy steps instead of one massive leap, we sidestep the training instabilities of GANs, avoid the blurry outputs of VAEs, and bypass the architectural constraints of flows. The cost? We need multiple neural network passes at inference time — but as we’ll see, Latent Diffusion makes this cost manageable.

2. DDPM: Forward and Reverse Markov Chains

The mathematical core of a diffusion model is two opposing Markov chains. The forward chain is fixed and cheap. The reverse chain is learned and expensive. Understanding this asymmetry is essential.

The forward process $q$ takes a clean image $\mathbf{x}_0$ and sequentially injects small amounts of Gaussian noise at each step until, after $T$ steps (typically 1,000), the signal is completely destroyed and we’re left with pure static $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1},\; \beta_t \mathbf{I})$$

Think of this as dissolving a sugar cube in water — deterministic in its overall direction toward maximum entropy, even though the exact molecular path is stochastic. No neural network is needed here; it’s pure scheduled noise injection.
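To make the forward step concrete, here is a minimal sketch in plain Python. It assumes a linear $\beta_t$ schedule running from $10^{-4}$ to $0.02$ (the values used in the original DDPM paper, not something this post specifies), and `forward_step` and `linear_betas` are illustrative names rather than library functions.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), element by element."""
    keep = math.sqrt(1.0 - beta_t)   # how much of the previous state survives
    spread = math.sqrt(beta_t)       # standard deviation of the fresh noise
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x_prev]

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration)."""
    span = beta_end - beta_start
    return [beta_start + span * i / (num_steps - 1) for i in range(num_steps)]

rng = random.Random(0)
x = [1.0, -0.5, 0.25, 0.0]           # toy stand-in for a flattened image
betas = linear_betas(1000)
for beta_t in betas:                 # walk the whole chain, one step at a time
    x = forward_step(x, beta_t, rng)

# Fraction of the original signal still present in the mean after T steps
signal_left = math.prod(math.sqrt(1.0 - b) for b in betas)
```

Under this assumed schedule, `signal_left` comes out well below 1%, which is the sense in which the signal is destroyed: what remains in `x` is essentially a fresh Gaussian sample.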

The reverse process $p_\theta$ is where the learning happens. Starting from pure noise $\mathbf{x}_T$, a neural network learns to undo each corruption step, predicting a slightly cleaner version at each iteration:

$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\; \mu_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I})$$

A critical empirical insight: instead of predicting the final clean image $\mathbf{x}_0$ from a noisy state, it works better in practice to train the network to predict the noise $\varepsilon$ contained in the current noisy sample, and then remove it.
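For readers who want the link between the predicted noise and the mean $\mu_\theta$, here is the parameterization from the original DDPM formulation, given as a reference sketch. The symbols $\alpha_t$ and $\bar{\alpha}_t$ are not defined at this point in the post; they are introduced here only for this formula.

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \varepsilon_\theta(\mathbf{x}_t, t) \right), \qquad \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

In words: scale the predicted noise, subtract it from the current sample, and rescale. Predicting $\varepsilon$ and predicting the mean are therefore two views of the same quantity.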

DDPM forward and reverse Markov chain diagram — Fig 2. The DDPM framework: a fixed forward process (dashed arrows) adds noise step by step, while a learned reverse process (solid arrows) removes it. The network predicts the noise at each step rather than the final clean image.

While the mathematics of this Markov chain are elegant, running up to a thousand sequential neural network passes directly on high-resolution $3 \times 512 \times 512$ pixel tensors is computationally prohibitive. This is what motivates Latent Diffusion.

3. The VAE — A Neural JPEG Compressor

Before we can generate images, we need to solve the curse of dimensionality. A $512 \times 512$ RGB image is a vector of 786,432 numbers. Self-attention cost grows as $O(N^2)$ in the number of positions $N$, so attending over all $512 \times 512 = 262{,}144$ pixel locations is impractical. The solution: compress first, diffuse later.

The Variational Autoencoder (VAE) acts as a neural JPEG. The encoder applies a lossy compression that shrinks a $3 \times 512 \times 512$ pixel image down to a $4 \times 64 \times 64$ latent tensor — discarding imperceptible high-frequency noise while retaining the core semantic structure. This is a 48× compression ratio (786,432 values → 16,384 values).

VAE encoder-decoder pipeline showing 48x compression — Fig 3. The VAE autoencoder: a neural JPEG compressor. The encoder squeezes 786K pixel values into 16K latent values (48× compression), and the decoder faithfully reconstructs the image.

The spatial resolution drops by a factor of 8 ($512 \div 8 = 64$) while the channel depth increases from 3 to 4. Once the generative process is complete, the decoder simply decompresses that latent tensor back into a visually faithful $512 \times 512$ output image. This compression is the key to Stable Diffusion’s efficiency — it makes consumer-GPU inference possible.
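The compression numbers above are easy to check by hand. The short sketch below just redoes the arithmetic; the variable names are illustrative.

```python
import math

pixel_shape = (3, 512, 512)    # channels, height, width in pixel space
latent_shape = (4, 64, 64)     # channels, height, width in latent space

pixel_values = math.prod(pixel_shape)      # total numbers in the image
latent_values = math.prod(latent_shape)    # total numbers in the latent

value_ratio = pixel_values / latent_values                 # overall compression
downsample = pixel_shape[1] // latent_shape[1]             # per-side reduction
spatial_ratio = (pixel_shape[1] * pixel_shape[2]) / (latent_shape[1] * latent_shape[2])

print(pixel_values, latent_values, value_ratio, downsample, spatial_ratio)
```

Two different ratios fall out of this: 48× fewer values overall, and 64× fewer spatial positions, because the channel count grows from 3 to 4 while each side shrinks by 8.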

4. VAE Quality Proof: Original vs. Reconstruction

Before we trust our entire generative pipeline to the latent space, we need to verify that the VAE preserves what actually matters. The test is simple: encode an image into the $4 \times 64 \times 64$ latent bottleneck, decode it back, and compare.

Original photograph — VAE quality proof — Fig 4. The original $3 \times 512 \times 512$ photograph: sharp feather barbs, specular highlights in the eye, natural background bokeh.

VAE reconstruction — visually indistinguishable from original — Fig 5. The VAE reconstruction after passing through the $4 \times 64 \times 64$ bottleneck. The result is visually indistinguishable — global composition, color palette, and semantic identity are perfectly preserved.

Think of the VAE as a highly optimized neural JPEG: it strategically discards pixel-level high-frequency detail that our visual system barely notices, while closely preserving global composition, background bokeh, and semantic identity. The reconstruction isn’t mathematically pixel-perfect (if you subtract the two tensors, the delta won’t be zero), and it does not need to be. What matters is that the visual features carrying meaning survived the 48× compression, which is what lets us perform the computationally expensive diffusion math entirely in the latent space.
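One hedged way to put a number on "close but not identical" is a reconstruction metric such as PSNR. The sketch below uses four made-up pixel values in the range 0 to 1, not measurements from the actual Stable Diffusion VAE, and `mse` and `psnr` are plain helper functions written for this example.

```python
import math

def mse(a, b):
    """Mean squared error between two equally long lists of values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    err = mse(a, b)
    if err == 0.0:
        return math.inf              # identical inputs: no error at all
    return 10.0 * math.log10(max_value ** 2 / err)

original = [0.10, 0.50, 0.90, 0.30]          # illustrative pixel values
reconstruction = [0.11, 0.49, 0.90, 0.31]    # slightly off, as after a lossy round trip

quality_db = psnr(original, reconstruction)
```

A nonzero error with a high PSNR is exactly the situation described above: the delta between the tensors is not zero, yet the difference is small relative to the signal.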

5. Forward Diffusion in Latent Space

Here is where Stable Diffusion fundamentally diverges from standard pixel-space diffusion models. Instead of adding noise directly to a $3 \times 512 \times 512$ image, we first pass the image through the VAE encoder to extract a compressed $4 \times 64 \times 64$ latent representation $z_0$.

The forward diffusion process then injects Gaussian noise exclusively into this latent tensor at various timesteps $t$. By constructing our training data entirely within this compressed space (mapping a noisy latent $z_t$ to the noise $\varepsilon$ that was mixed into it), we reduce the number of spatial positions by 64× (from $512 \times 512$ to $64 \times 64$), making training vastly cheaper and faster while keeping the semantic content.

Forward diffusion in latent space — noise added to compressed latent, not pixels — Fig 6. Forward diffusion in latent space. The Image Encoder compresses, then Gaussian noise is progressively injected into the small latent tensor — not the full-resolution image. Each (noisy latent, timestep, noise) triplet becomes a training example.

A critical subtlety: because we have a closed-form expression for the noise schedule, we can compute $z_t$ at any timestep directly from $z_0$ without iterating through all previous steps:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \mathbf{I})$$

This is fast, requires no neural network, and allows random sampling of timesteps during training — a key efficiency trick.
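Here is a minimal sketch of that shortcut, together with a check that it agrees with the step-by-step chain. It assumes $\bar{\alpha}_t$ is the running product of $(1 - \beta_s)$ and reuses an illustrative linear schedule; indices are 0-based in the code, and all function names are made up for this example.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration)."""
    span = beta_end - beta_start
    return [beta_start + span * i / (num_steps - 1) for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s up to and including t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def noise_latent(z0, t, alpha_bar, rng):
    """Jump straight to timestep t without visiting the earlier steps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, s = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    z_t = [a * z + s * e for z, e in zip(z0, eps)]
    return z_t, eps                  # eps is kept because it is the training target

betas = linear_betas(1000)
alpha_bar = cumulative_alphas(betas)

# Iterating the one-step rule tracks a mean coefficient and a noise variance.
# After 500 steps they should match the closed-form values at index 499.
mean_coeff, variance = 1.0, 0.0
for b in betas[:500]:
    mean_coeff *= math.sqrt(1.0 - b)
    variance = (1.0 - b) * variance + b

z_t, eps = noise_latent([0.5, -1.0, 0.25], 499, alpha_bar, random.Random(0))
```

The loop never touches a sample; it only tracks the two numbers that define the Gaussian at step $t$, which is why a single multiply-and-add can replace the whole chain during training.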

6. The Complete Forward + Reverse Process

Let’s bring the whole architecture together. The system has two paths — one for training, one for generation — and they are heavily asymmetric:

Forward (training): Original Image → VAE Encoder → clean latent $z_0$ → add noise (closed-form) → noisy latent $z_t$. This is fast and trivial.

Reverse (generation): Pure noise $z_T$ → UNet Step 1 → UNet Step 2 → … → clean latent $z_0$ → VAE Decoder → Generated Image. This requires ~50 sequential neural network passes.

Complete forward and reverse process in latent space — Fig 7. The complete Latent Diffusion pipeline. Forward (top): the VAE compresses, then noise is injected. Reverse (bottom): the UNet iteratively denoises, then the VAE decodes back to pixels. All the heavy computation happens in the compressed $4 \times 64 \times 64$ space.

Because this intensive iterative loop happens in the highly compressed latent space rather than high-resolution pixel space, we avoid catastrophic memory limits and can generate novel scenes on a consumer GPU with 8GB VRAM. This spatial compression is the central insight of the Latent Diffusion Model paper (Rombach et al., 2022).

7. Single Denoising Step — The Core Subtraction

Every UNet step performs the same fundamental operation. This is the “aha moment” of diffusion models — the mechanism that makes it all work:

Take the current noisy latent $z_t$ and the timestep $t$
Feed them through the UNet → it outputs a predicted noise pattern $\hat{\varepsilon}$
Subtract the predicted noise from the noisy latent

$$z_{t-1} \approx z_t - \hat{\varepsilon}_\theta(z_t, t, c)$$

(Here $c$ is the conditioning signal, such as the text embeddings introduced later. The actual scheduler formula includes scaling factors, but this is the conceptual core.)
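As one concrete instance of those scaling factors, here is the deterministic DDIM update. This is an assumption for illustration, since the post does not commit to a particular scheduler at this point. It first estimates the clean latent from the predicted noise, then re-noises that estimate to the next, lower noise level:

$$\hat{z}_0 = \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\varepsilon}}{\sqrt{\bar{\alpha}_t}}, \qquad z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{z}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, \hat{\varepsilon}$$

The first expression is just the closed-form forward equation solved for $z_0$, so the "subtraction" is still there, only weighted by the noise schedule.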

Single denoising step — the subtraction operation — Fig 8. The core mechanism: the UNet predicts the noise pattern, and we subtract it from the noisy image. The minus sign is the most important operation in diffusion models. Repeat ~50 times and a coherent image emerges from pure static.

The elegance here is that the network doesn’t need to hallucinate the final image from scratch. It only needs to answer a much simpler question: “given this noisy blob and the current noise level, what does the noise look like?” The image emerges naturally from repeated subtraction.

8. Progressive Denoising — Image Emerges from Noise

If we force the VAE decoder to render intermediate latent tensors back into pixel space at various timesteps, we can observe the true mechanics of the reverse process. The result is one of the most satisfying visualizations in machine learning.

Progressive denoising steps — image gradually emerging from noise — Fig 9. Progressive denoising visualized. Steps 1–2: chaotic noise. Steps 4–5: blurry color blobs. Step 10: recognizable composition. Steps 30–50: fine detail refinement. The process is like a sculptor — early steps define the silhouette, later steps polish the texture.

Notice how the denoising schedule behaves like a sculptor. The early UNet passes — operating where the noise variance $\beta_t$ is highest — hack away massive “chunks” of noise to establish the global composition and color palette. By step 10, the layout of the scene is already locked in. The remaining dozens of steps are dedicated almost entirely to high-frequency refinement — sharpening textures and polishing edges as $\beta_t$ smoothly anneals toward zero.

Key insight: Diffusion is heavily front-loaded. Global structure is decided in roughly the first 20% of steps (about 10 of 50 in the figure above). The remaining 80% is fine-grit sandpaper. This is part of the intuition for why accelerated samplers like DDIM can use far fewer steps with little quality loss.
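To see what "fewer steps" means in practice, here is a sketch of one simple way to pick 50 inference timesteps out of 1,000 training timesteps. Even spacing is an assumption; real schedulers offer several spacing strategies, and `subsample_timesteps` is a hypothetical helper, not a library function.

```python
def subsample_timesteps(num_train_steps=1000, num_inference_steps=50):
    """Evenly spaced timesteps, highest noise level first (0-based indices)."""
    stride = num_train_steps // num_inference_steps
    ascending = list(range(0, num_train_steps, stride))
    return ascending[::-1][:num_inference_steps]

timesteps = subsample_timesteps()
print(timesteps[:3], timesteps[-1], len(timesteps))
```

The sampler then visits only these noise levels, jumping 20 training steps at a time instead of walking all 1,000.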

9. The Three Components of Stable Diffusion

To deconstruct Stable Diffusion, we must view it not as a monolith, but as an assembly of three distinct, composable modules. Each is trained separately, and the text encoder and VAE are kept frozen while the UNet is trained:

Text Encoder (CLIPText) — maps a raw string of text into a dense semantic embedding. It understands the prompt.
Image Information Creator (UNet + Scheduler) — the core engine that does the heavy lifting: iteratively sculpting the image structure by removing noise, entirely within latent space.
Image Decoder (VAE decoder) — projects the final latent tensor back into pixel space. It decompresses the result.

Three components of Stable Diffusion: Text Encoder, UNet, Image Decoder — Fig 10. Stable Diffusion’s three modules. The UNet (center) is noticeably larger — it does the heavy computational work. The other two are relatively lightweight: one translates text, the other decompresses the result.

Think of this as a translating assembly line. The Text Encoder acts as the translator, the UNet is the factory floor, and the Image Decoder is the packaging department. Each module has a well-defined input and output; they communicate through precisely shaped tensors.

10. The Pipeline with Tensor Dimensions

To appreciate why Stable Diffusion can run on consumer hardware, you have to look closely at the tensor dimensions flowing through the architecture:

Input text → tokenized to 77 tokens → Text Encoder (CLIPText) → $[77 \times 768]$ token embeddings
UNet + Scheduler operates on $[4 \times 64 \times 64]$ latent tensors — 48× smaller than pixel space
Image Decoder outputs the final $[3 \times 512 \times 512]$ RGB image

Stable Diffusion pipeline with explicit tensor dimensions at each stage — Fig 11. The tensor dimensions at every stage. The $[77 \times 768]$ text tensor, the $[4 \times 64 \times 64]$ latent, and the final $[3 \times 512 \times 512]$ output. Notice how the UNet operates in a dramatically smaller space than the final image.

A crucial detail: Stable Diffusion keeps CLIP’s full unpooled $77 \times 768$ token sequence rather than collapsing it into a single summary vector. This allows the UNet’s cross-attention mechanisms to dynamically map specific textual concepts (like the word “cosmic”) to localized spatial regions (the sky area) as the latent image takes shape.

11. CLIP Training: Three Steps to a Shared Embedding Space

To bridge the gap between pixels and language, Stable Diffusion relies on CLIP (Contrastive Language-Image Pre-training) as its translation layer. CLIP was trained on roughly 400 million image-text pairs scraped from the internet, using a three-step contrastive learning loop:

Step 1: Embed — A Vision Transformer encodes the image into a fixed-length vector. Independently, a text Transformer encodes the caption into a vector of the same dimension. At this stage, the two encoders know nothing about each other.

CLIP Step 1: embed image and text independently — Fig 12. CLIP Step 1: Two independent encoders produce fixed-length embedding vectors — one for the image, one for the text caption.

Step 2: Compare — We compute cosine similarity between every image embedding and every text embedding in the batch, forming an $N \times N$ matrix. Matching pairs (the diagonal) should score high; mismatched pairs should score low.

CLIP Step 2: compare embeddings via cosine similarity matrix — Fig 13. CLIP Step 2: The contrastive comparison. Matching image-text pairs (diagonal) are pushed toward similarity score 1, while all mismatched pairs are pushed toward 0.

Step 3: Backpropagate — Gradients flow backward from the contrastive loss into both encoders simultaneously, geometrically aligning the shared embedding space. Matched pairs are pulled closer; mismatched pairs are pushed apart.

CLIP Step 3: backpropagation updates both encoders — Fig 14. CLIP Step 3: Gradient arrows flow backward from the loss to both encoders, aligning the shared embedding space. After training on 400M pairs, “cute dog” lives near “fluffy puppy” and far from “toaster.”

Because this contrastive training forces the text encoder to deeply understand visual concepts, we can safely discard the image encoder, freeze the text encoder, and use its rich $77 \times 768$ embeddings to condition our diffusion model. The text encoder becomes the semantic steering wheel for image generation.
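The three steps above can be sketched in a few lines of plain Python. This follows the symmetric cross-entropy formulation described in the CLIP paper; the temperature of 0.07 and the tiny 2-dimensional embeddings are assumptions for illustration, and the function names are made up.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Step 2: cosine similarity between every image and every caption."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)                                   # subtract max for stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: images pick their caption, captions pick their image."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

image_embs = [[1.0, 0.0], [0.0, 1.0]]        # Step 1 output for two images
matched_texts = [[0.9, 0.1], [0.1, 0.9]]     # captions in the right order
shuffled_texts = matched_texts[::-1]         # captions paired with the wrong image

loss_matched = clip_loss(image_embs, matched_texts)
loss_shuffled = clip_loss(image_embs, shuffled_texts)
```

Step 3 is whatever optimizer minimizes this loss: lowering it is what pulls the diagonal of the similarity matrix up and pushes the off-diagonal entries down.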

12. UNet: Three Inputs, One Output

Here we see the complete input-output signature of the conditional UNet $\varepsilon_\theta$. It takes three distinct inputs:

Noisy latent $z_t$ — a $[4 \times 64 \times 64]$ spatial tensor representing the current state of the image
Timestep $t$ — encoded using sinusoidal positional embeddings (like a Transformer) so the network knows the current noise scale
Text embeddings — a $[77 \times 768]$ tensor of token embeddings from the frozen CLIP text encoder

And produces one output: the predicted noise $\hat{\varepsilon}$ — a $[4 \times 64 \times 64]$ tensor of the exact same shape as the input latent.
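The timestep input deserves a quick illustration, since a bare integer is a poor signal for a neural network. Below is a minimal sketch of a sinusoidal embedding. The dimension of 8 is purely for readability (real models use far larger embeddings), and the ordering of the sine and cosine halves differs between implementations.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map an integer timestep to a vector of sines and cosines at several frequencies."""
    half = dim // 2
    # Frequencies fall off geometrically from 1 toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = timestep_embedding(0)
emb_early = timestep_embedding(10)
emb_late = timestep_embedding(500)
```

Fast frequencies distinguish neighboring timesteps while slow ones encode the coarse position in the schedule, so every noise level gets a distinct, smoothly varying code.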

UNet noise predictor — three inputs converging, one output — Fig 15. The UNet’s input-output signature. Three inputs converge from the left (noisy latent, timestep, text embeddings), the UNet processes them, and outputs a single predicted noise tensor on the right.

The network explicitly needs the timestep because a heavily noised input requires entirely different feature extraction logic than a nearly clean input near the end of the chain. The training objective is surprisingly simple:

$$\mathcal{L} = \| \varepsilon - \varepsilon_\theta(z_t, t, c) \|^2$$

Here $c$ denotes the text conditioning. Minimize the squared difference between the actual noise added and the network’s prediction. That’s it: a straightforward MSE loss.

13. Inside the UNet — ResNet + Cross-Attention

To understand how text mathematically guides image generation, we must look inside the UNet’s alternating layers. The architecture repeats two types of blocks:

ResNet blocks (spatial processing): these process the latent’s spatial features and absorb the timestep embedding, which is projected and injected into the feature maps. It is typically added as a per-channel bias, with FiLM-style (Feature-wise Linear Modulation) scale-and-shift as a closely related variant. Think of them as the sculptor’s hands: they shape the material based on how much noise remains.
Cross-Attention blocks (text conditioning) — these are where the text embeddings physically intersect with the spatial features. The spatial features become the Query (Q), while the text embeddings provide the Keys (K) and Values (V):

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$

Inside UNet: alternating ResNet and Attention blocks with Q, K, V labels — Fig 16. Inside the UNet. ResNet blocks (teal) process spatial features with timestep injection. Attention blocks (purple) perform cross-attention where spatial features query the text embeddings. Skip connections bridge the encoder and decoder halves.

This cross-attention mechanism acts as a GPS: it allows specific regions of the image to “look up” matching semantic concepts from specific text tokens. When generating a “cosmic beach,” the sky region’s Query vectors will attend strongly to the “cosmic” token’s Key/Value, while the foreground attends to “beach.” The text doesn’t control the image globally — it guides it locally through attention weights.
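A toy version of this lookup can be written without any tensor library. The sketch below uses three spatial positions and two text tokens with 2-dimensional features and identity projection matrices; all of these sizes and values are made up for readability (the real layers work with up to $64 \times 64 = 4{,}096$ positions and 77 tokens, using learned projections).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(spatial, text, w_q, w_k, w_v):
    q = matmul(spatial, w_q)                  # queries come from the image side
    k = matmul(text, w_k)                     # keys come from the text side
    v = matmul(text, w_v)                     # values come from the text side
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]      # transpose keys: [d, tokens]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]    # one distribution per position
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0],      # toy embedding for the token "cosmic"
        [0.0, 1.0]]      # toy embedding for the token "beach"
spatial = [[4.0, 0.0],   # a "sky" position, aligned with "cosmic"
           [0.0, 4.0],   # a "foreground" position, aligned with "beach"
           [2.0, 2.0]]   # an ambiguous position on the horizon

out, weights = cross_attention(spatial, text, identity, identity, identity)
```

Each row of `weights` is a probability distribution over tokens for one spatial position. The sky row concentrates on "cosmic", the foreground row on "beach", and the ambiguous row splits evenly, which is the local guidance described above.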

14. The UNet Training Loop

The training loop is elegantly simple — four steps, repeated millions of times:

Pick a training example — take a real image, encode it via the VAE, sample a random timestep $t$, and add the corresponding amount of noise to get $z_t$
Predict noise — feed $z_t$, $t$, and the text embeddings through the UNet → get predicted noise $\hat{\varepsilon}$
Compare to actual noise — compute $\text{MSE}(\varepsilon, \hat{\varepsilon})$
Update model — backpropagate and adjust UNet weights

UNet 4-step training loop in circular arrangement — Fig 17. The four-step training loop. Because we injected the noise ourselves, we always have a mathematically perfect ground-truth label — making this a self-supervised regression task.

Key insight: The noise target is self-supervised. We added the noise ourselves, so we always have perfect labels for it, and no human annotation of the target is needed (the captions used for conditioning come paired with the images). A finite set of images effectively becomes an infinite dataset: each image can be corrupted at any of 1,000 timesteps with any random noise sample.
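The four steps can be sketched end to end with a deliberately tiny stand-in for the UNet: a single learnable number `weight`, with the prediction defined as `weight * z_t`. That toy model, the three-entry `alpha_bar` table, and the learning rate are all assumptions chosen so the example runs in plain Python. Text conditioning is omitted to keep it short.

```python
import math
import random

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_step(z0, weight, alpha_bar, rng, lr=0.05):
    # 1. Pick a training example: random timestep, fresh noise, closed-form noising
    t = rng.randrange(len(alpha_bar))
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, s = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    z_t = [a * z + s * e for z, e in zip(z0, eps)]

    # 2. Predict noise (toy stand-in for the UNet)
    eps_hat = [weight * v for v in z_t]

    # 3. Compare to the actual noise
    loss_before = mse(eps, eps_hat)

    # 4. Update the model: analytic gradient of the MSE with respect to weight
    grad = sum(-2.0 * v * (e - weight * v) for v, e in zip(z_t, eps)) / len(z_t)
    new_weight = weight - lr * grad

    # Re-evaluate on the same example to see the effect of the update
    loss_after = mse(eps, [new_weight * v for v in z_t])
    return new_weight, loss_before, loss_after

rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 0.0, 0.8, -0.3, 0.1, -0.6]   # toy clean latent
alpha_bar = [0.9, 0.5, 0.1]                          # toy three-level schedule

weight, loss_before, loss_after = training_step(z0, 0.0, alpha_bar, rng)
```

A real training loop differs in scale, not in shape: the one-number model becomes a UNet with hundreds of millions of parameters, the analytic gradient becomes backpropagation, and the single example becomes a batch.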

15. Training Dataset Structure

Let’s look at how a training batch is actually constructed. Each row is one training example with:

Input: a random timestep $t$ + the noisy latent $z_t$ + the text embeddings from the caption
Target/Label: the exact noise tensor $\varepsilon$ that was added

Training dataset structure showing multiple examples with inputs and noise targets — Fig 18. The training dataset structure. Each row combines a random timestep, noisy latent, and text embeddings as input, with the known noise sample as the target. The model learns to predict noise — not images.

By randomly sampling $t$ and $\varepsilon$ dynamically during training, a finite set of images effectively becomes an infinite dataset of unique noisy training pairs. This is what makes diffusion models scale so elegantly without requiring massive human-annotated datasets — the generation problem becomes a robust, self-supervised regression task.

16. Conditional Generation on MNIST

Before tackling the complexity of $3 \times 512 \times 512$ text-to-image generation, it is instructive to validate conditional diffusion on a simpler $1 \times 28 \times 28$ problem like MNIST.

Conditional MNIST generation grid — 5 rows of digits with intra-class variety — Fig 19. Conditional MNIST generation. Each row is conditioned on a different class label (1–5). The eight columns show diverse samples within each class — note the healthy intra-class variance (different styles of “4”) proving the model learned the true distribution, not a single prototype.

The class embedding successfully biases the reverse diffusion trajectory, steering the denoising process down five completely distinct paths without bleeding into adjacent classes. Think of the conditioning label as a GPS destination — the model finds multiple valid, distinct routes from various noisy starting points to that destination.

17. Classifier-Free Guidance

Conditioning is not binary — it’s a continuous dial. The Classifier-Free Guidance (CFG) scale $w$ controls how strictly the model follows the conditioning signal:

Effect of guidance weight w on MNIST generation — w=0.0, 0.5, 2.0 — Fig 20. The guidance weight as a strictness dial. At $w=0.0$: the model barely follows instructions (a “6” appears when a “9” was requested). At $w=2.0$: maximum fidelity, but diversity drops to “average-looking” digits.

At $w = 0$, the model is effectively ignoring the conditioning signal — high diversity but poor fidelity (wrong classes appear). At $w = 2.0$, the model becomes extremely strict: every digit is bold and unmistakable, but stylistic diversity collapses. In practice, Stable Diffusion uses $w \approx 7.5$ — a carefully chosen balance between prompt adherence and creative freedom.

Mathematically, CFG works by running the UNet twice per step — once with the conditioning and once without — then extrapolating in the direction of the conditioned output:

$$\hat{\varepsilon} = \varepsilon_\text{uncond} + w \cdot (\varepsilon_\text{cond} - \varepsilon_\text{uncond})$$

This doubles the compute per step but dramatically improves output quality. It’s why Stable Diffusion inference is slower than you might expect from the step count alone.
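The extrapolation itself is a one-liner. The sketch below applies the formula to two made-up three-element noise predictions; `apply_cfg` is an illustrative name, and in a real pipeline the two inputs would be full latent-shaped UNet outputs.

```python
def apply_cfg(eps_uncond, eps_cond, w):
    """Start at the unconditional prediction and move w times toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]    # prediction with the conditioning dropped
eps_cond = [1.0, 1.0, 0.0]       # prediction with the conditioning present

ignore_prompt = apply_cfg(eps_uncond, eps_cond, 0.0)   # same as eps_uncond
plain_cond = apply_cfg(eps_uncond, eps_cond, 1.0)      # same as eps_cond
amplified = apply_cfg(eps_uncond, eps_cond, 7.5)       # overshoots past eps_cond
```

Where the two predictions agree (the middle element), guidance changes nothing. Where they differ, a weight above 1 pushes past the conditional prediction, exaggerating exactly the part of the noise estimate that the prompt is responsible for.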

18. LDM Complete Architecture

This is the diagram you come back to. Everything we’ve built — the VAE, the diffusion process, the CLIP encoder, the cross-attention conditioning — lives inside this one figure.

LDM complete architecture — Pixel Space, Latent Space, Conditioning zones — Fig 21. The Latent Diffusion Model architecture (after Rombach et al., 2022). Three color-coded zones: Pixel Space (left, VAE), Latent Space (center, UNet with cross-attention), and Conditioning (right, domain encoder feeding K and V). The legend at bottom defines all symbols.

Reading left to right:

Pixel Space (left): Our VAE acts as a neural JPEG, compressing $x$ into latent $z$ via encoder $\mathcal{E}$. The decoder $\mathcal{D}$ reconstructs $z$ back to pixel space as $\tilde{x}$.
Latent Space (center): The forward diffusion process corrupts $z$ into $z_T$. The Denoising U-Net $\varepsilon_\theta$ iteratively recovers it through $T{-}1$ reverse steps. Inside the UNet, QKV cross-attention blocks connect the spatial features to external conditioning.
Conditioning (right): Text, semantic maps, or any other modality is processed by a domain-specific encoder $\tau_\theta$. The outputs feed the U-Net’s cross-attention as Keys (K) and Values (V), while the spatial features provide Queries (Q).

The “switch” icon at the junction indicates that conditioning can either enter via cross-attention (multiplicative, attention-weighted) or via direct concatenation with the latent tensor. Text conditioning uses cross-attention; spatial conditioning (like segmentation maps) typically uses concatenation.

19. The Full Inference Pipeline

Now let’s trace a single inference run from start to finish. Two inputs enter the system:

Latent Seed: a $4 \times 64 \times 64$ tensor of Gaussian noise $z_T \sim \mathcal{N}(0, \mathbf{I})$
User Prompt: “An astronaut riding a horse” → Frozen CLIP Text Encoder → $[77 \times 768]$ embeddings

Complete Stable Diffusion inference pipeline with tensor shapes and scheduler loop — Fig 22. The full inference pipeline. The UNet + Scheduler loop is the computational core, iterating N times while the text embeddings remain frozen. Only at the very end does the VAE Decoder convert the clean latent to pixels.

The UNet + Scheduler loop is where nearly all the computation happens. Over $N$ scheduler iterations (typically around 50 with a sampler such as DDIM, which visits only a subset of the 1,000 training timesteps), the UNet progressively predicts and removes noise. The $77 \times 768$ CLIP text embeddings are computed once and stay fixed throughout: a static conditioning anchor that guides the UNet at every iteration.

Only at the very end does the VAE Decoder transform the clean $4 \times 64 \times 64$ latent tensor $z_0$ into a $3 \times 512 \times 512$ RGB image. This final decode happens once, not at every step — another key efficiency gain of latent diffusion.
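The control flow of this pipeline fits in a short sketch. Everything model-like here is a stand-in: the "UNet" is an oracle that already knows the target latent and returns the exact noise, the text embedding is a placeholder, and the sampler is a deterministic DDIM-style update chosen for illustration. The point is the loop structure, not the networks.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction under an assumed linear beta schedule."""
    out, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        out.append(running)
    return out

def ddim_step(z_t, eps_hat, ab_t, ab_prev):
    """Estimate the clean latent, then re-noise it to the previous (lower) noise level."""
    z0_hat = [(z - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for z, e in zip(z_t, eps_hat)]
    return [math.sqrt(ab_prev) * z0 + math.sqrt(1.0 - ab_prev) * e for z0, e in zip(z0_hat, eps_hat)]

def generate(predict_noise, text_emb, z_T, alpha_bar, num_inference_steps=50):
    stride = len(alpha_bar) // num_inference_steps
    timesteps = list(range(0, len(alpha_bar), stride))[::-1]
    z = z_T
    for i, t in enumerate(timesteps):
        eps_hat = predict_noise(z, t, text_emb)       # one "UNet" call per step (no CFG here)
        last = i + 1 == len(timesteps)
        ab_prev = 1.0 if last else alpha_bar[timesteps[i + 1]]
        z = ddim_step(z, eps_hat, alpha_bar[t], ab_prev)
    return z      # clean latent; a real pipeline would now run the VAE decoder once

alpha_bar = make_alpha_bar()
target = [0.3, -0.7, 1.2, 0.0]    # the latent our oracle "wants" to produce

def oracle(z_t, t, text_emb):
    """Stand-in for a trained UNet: the exact noise implied by the known target."""
    ab = alpha_bar[t]
    return [(z - math.sqrt(ab) * z0) / math.sqrt(1.0 - ab) for z, z0 in zip(z_t, target)]

rng = random.Random(0)
z_T = [rng.gauss(0.0, 1.0) for _ in target]    # the latent seed
text_emb = None                                 # placeholder; the oracle ignores it
result = generate(oracle, text_emb, z_T, alpha_bar)
```

With a perfect noise predictor the loop lands on the target latent from any seed. A trained UNet is an imperfect version of that oracle, steered by the text embeddings instead of a hard-coded answer.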

20. LDM Architecture — QKV Cross-Attention Focus

Our final diagram provides an alternative perspective on the LDM architecture, zooming into the cross-attention plumbing that makes flexible conditioning possible.

LDM architecture with explicit QKV cross-attention blocks and conditioning path — Fig 23. The LDM architecture with cross-attention emphasis (after Weng, 2021). Multiple QKV cross-attention blocks are shown explicitly inside the UNet. The domain-flexible encoder $\tau_\theta$ can process any modality — text, images, semantic maps — as long as it outputs a sequence of vectors.

The key insight of this view: the conditioning encoder $\tau_\theta$ is highly flexible. Whether you feed it text via CLIP to get a $77 \times 768$ tensor, a segmentation map via a CNN, or even another image via a vision encoder, as long as $\tau_\theta$ formats the output into a sequence of vectors, the UNet’s cross-attention layers can query it without changing their underlying architecture. This modular design is one reason Stable Diffusion has been extended so readily: IP-Adapter, for example, feeds image features through additional cross-attention layers, while ControlNet and many other conditioning mechanisms attach to the same UNet backbone.

The entire framework boils down to two ideas, unified in one pipeline. The VAE compresses the problem into a space where diffusion is computationally tractable, crunching $3 \times 512 \times 512$ images into $4 \times 64 \times 64$ latents. CLIP and cross-attention give that denoising process a semantic compass. Together, they are Stable Diffusion.

21. Summary

Let’s crystallize what we’ve built in this post:

| Component | Role | Key Tensor |
| --- | --- | --- |
| VAE Encoder $\mathcal{E}$ | Compress pixels → latent | $3 \times 512 \times 512 \to 4 \times 64 \times 64$ |
| VAE Decoder $\mathcal{D}$ | Decompress latent → pixels | $4 \times 64 \times 64 \to 3 \times 512 \times 512$ |
| CLIP Text Encoder | Translate prompt → embeddings | $\text{string} \to [77 \times 768]$ |
| UNet $\varepsilon_\theta$ | Predict noise (conditioned) | $[4 \times 64 \times 64] \to [4 \times 64 \times 64]$ |
| Scheduler | Control denoising trajectory | Timesteps + scaling factors |

The generation process:

Sample Gaussian noise $z_T \sim \mathcal{N}(0, \mathbf{I})$ as a $4 \times 64 \times 64$ tensor
Encode the text prompt via frozen CLIP → $[77 \times 768]$ embeddings
For each scheduler step $t = T, T{-}1, \ldots, 1$: feed $z_t$, $t$, and text embeddings into the UNet → predict $\hat{\varepsilon}$ → subtract to get $z_{t-1}$
Decode the final clean latent $z_0$ via the VAE Decoder → $3 \times 512 \times 512$ image

That’s it. Every piece fits. The VAE makes the problem tractable. CLIP makes it steerable. Cross-attention makes it precise. And the iterative denoising makes it beautiful.

In the next post, we’ll open the PyTorch source code and trace these tensors through actual function calls.

Take care! 🙂
