TL;DR
- Latent Diffusion Models (Rombach et al., 2021, arXiv:2112.10752) perform diffusion in the latent space of a pretrained autoencoder, not on raw pixels.
- A VAE compresses images to a small latent (typically 8× downsampled, 4 channels); diffusion operates on this latent; the decoder reconstructs pixels at the end.
- Compute drops roughly 50× versus pixel-space diffusion at 512×512, enabling training on academic budgets and deployment on commodity GPUs.
- Every modern open image generator — Stable Diffusion 1/2/3, SDXL, FLUX.1, Playground — is a latent diffusion model.
The Problem Latent Diffusion Solves#
Diffusion in pixel space is expensive. A 512×512 RGB image is 786,432 dimensions; the U-Net or DiT that denoises it must process that full tensor at every sampling step, every training step. For a frontier text-to-image model at 1024×1024, the cost roughly quadruples.
Rombach et al.'s 2021 paper observed that most of the perceptually relevant information in an image lives in a much lower-dimensional manifold. A well-trained autoencoder can compress 512×512×3 images into a 64×64×4 latent — 48× smaller — and decode them with negligible perceptual loss. Why diffuse in the high-dimensional pixel space when the low-dimensional latent suffices?
The Architecture#
Latent Diffusion has three components, trained in two stages:
Sampling: start with random latent noise z_T, run T denoising steps via the U-Net/DiT to obtain z_0, then decode the image as x = D(z_0). All the expensive iterative computation happens on the small latent; pixel decoding is a single forward pass at the end.
- Encoder E and Decoder D — a VAE (or VQ-VAE) trained on the target image distribution. E maps images to latents; D maps latents back to images. Trained first, then frozen.
- Diffusion U-Net or DiT — operates entirely on the VAE latent space. Trained second, with the VAE frozen.
- Conditioning encoder — typically a CLIP or T5 text encoder whose embeddings feed cross-attention layers in the U-Net/DiT.
The VAE#
The VAE is trained with a combination of reconstruction loss (L1 or L2 in pixel space), perceptual loss (LPIPS), KL divergence to a standard normal prior, and often an adversarial term to sharpen outputs. The KL prevents the latent from being arbitrary and ensures Gaussian-noise diffusion is well-defined on top.
Stable Diffusion 1.x and 2.x use an 8× downsampling VAE (512×512×3 → 64×64×4). SDXL uses the same factor. SD3 and FLUX use 8× with 16 channels — a higher-bandwidth latent that captures finer detail. The VAE choice trades reconstruction fidelity against latent dimensionality, and is one of the levers tuned at each model generation.
Why It Was the Stable-Diffusion Moment#
Pixel-space diffusion at 256×256 took roughly the compute of training a small LLM. Latent diffusion at 512×512 fits on a single 8-GPU node for the U-Net training. Stable Diffusion's August 2022 release was the first open frontier-quality image generator that fit in 4 GB of VRAM at inference and could be fine-tuned on a single consumer GPU.
The result was a Cambrian explosion of fine-tunes, LoRAs, ControlNets and community models. The open ecosystem that grew around Stable Diffusion would have been impossible without the compute reduction latent diffusion enabled.
When inspecting a latent-diffusion checkpoint, the 'VAE' and 'U-Net' (or 'transformer') weights are separable. You can swap a sharper VAE into an older U-Net to improve detail without retraining the diffusion model itself — the SDXL VAE upgrade trick.
Frontier Iterations#
| Model | Year | Latent shape (512²) | Backbone |
|---|---|---|---|
| Stable Diffusion 1.5 | 2022 | 64×64×4 | U-Net (~860M) |
| Stable Diffusion XL | 2023 | 128×128×4 (1024² input) | U-Net (~3.5B) |
| Stable Diffusion 3 | 2024 | 128×128×16 | MM-DiT (~2-8B) |
| FLUX.1 [dev] | 2024 | 128×128×16 | MM-DiT (12B) |
| FLUX.1 [pro] | 2024 | 128×128×16 | MM-DiT (closed) |
Limits#
The VAE introduces a ceiling on reconstruction quality: even with a perfect diffusion model, output cannot be sharper than what D(z) reconstructs from the optimal latent. For very fine detail (small text, fine textures), this is a real limit. Pixel-space cascaded diffusion (Imagen, eDiff-I) chained a small pixel diffuser at the end to add detail, but latent + better VAE has become the dominant recipe.