The forward process: scheduled destruction
Take a data point \(x_0\) — an image, an audio clip, a latent — and corrupt it over \(T\) steps with small additions of Gaussian noise. Each step is trivial; the composition has a closed form, so any noise level is reachable in one jump:
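In the standard DDPM parameterization, which the listing that follows implements and labels EQ 10.1, the one-jump closed form can be written as below. The symbol \( \bar\alpha_t \) (the cumulative signal fraction, `abar` in the code) follows Ho et al. (2020) and is an assumption about the original notation:

\[
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
\]

Equivalently, \( q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right) \): the signal shrinks by \( \sqrt{\bar\alpha_t} \) while the noise standard deviation grows to \( \sqrt{1-\bar\alpha_t} \).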
```python
# Forward noising in 1-D: a bimodal dataset dissolving (EQ 10.1)
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Two tight modes at -2 and +2, n/2 samples each
x0 = np.concatenate([rng.normal(-2, 0.3, n//2), rng.normal(2, 0.3, n//2)])
ts = [0.0, 0.25, 0.5, 1.0]
xs, rows, labels = [], [], []
for i, t in enumerate(ts):
    abar = np.exp(-6 * t)  # noise schedule: abar 1 -> ~0
    xt = np.sqrt(abar)*x0 + np.sqrt(1-abar)*rng.normal(0, 1, n)
    # Gap between the two mode means, grouped by each point's original mode
    sep = xt[x0 > 0].mean() - xt[x0 < 0].mean()
    print(f"t={t:4.2f} abar={abar:5.3f} mode separation {sep:5.2f} "
          f"noise sd {np.sqrt(1-abar):4.2f}")
    xs += list(xt); rows += [i]*n; labels += [i]*n

print("\nthe modes start 4.0 apart; separation shrinks as 4*sqrt(abar)")
print("while the noise floor grows to sd 1. by t=1 both modes have")
print("melted into N(0,1) -- the state every sampler will start from.")

# plot_scatter is assumed to be a helper supplied by the page's runtime;
# the call is skipped when the script runs anywhere else.
if "plot_scatter" in globals():
    plot_scatter(xs, rows, labels)
```
[Interactive demo: EQ 10.1 applied to a 2-D dataset of six Gaussian clusters arranged in a ring; a slider sets the noise level, and the structure dissolves as it increases.]
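A static sketch of the same experiment, for illustration only: six Gaussian clusters on a ring are pushed through the closed-form jump \( x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon \) using the same `abar = exp(-6 t)` schedule as the 1-D listing. The ring radius, cluster spread, and the helper names `ring_dataset`, `noise_to`, and `mean_cluster_radius` are assumptions, not taken from the original demo.

```python
import numpy as np

def ring_dataset(n_per_cluster=200, k=6, radius=2.0, sd=0.15, seed=0):
    """k Gaussian clusters evenly spaced on a circle (illustrative values)."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(k) / k
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    labels = np.repeat(np.arange(k), n_per_cluster)
    x0 = centers[labels] + rng.normal(0, sd, (k * n_per_cluster, 2))
    return x0, labels

def noise_to(x0, abar, rng):
    """Closed-form forward jump to the level with signal fraction abar."""
    return np.sqrt(abar) * x0 + np.sqrt(1 - abar) * rng.normal(0, 1, x0.shape)

def mean_cluster_radius(x, labels):
    """Average distance of the per-cluster means from the origin."""
    means = np.stack([x[labels == j].mean(axis=0) for j in np.unique(labels)])
    return np.linalg.norm(means, axis=1).mean()

x0, labels = ring_dataset()
rng = np.random.default_rng(1)
radii = {}
for t in [0.0, 0.25, 0.5, 1.0]:
    abar = np.exp(-6 * t)
    xt = noise_to(x0, abar, rng)
    radii[t] = mean_cluster_radius(xt, labels)
    print(f"t={t:4.2f}  abar={abar:5.3f}  ring radius of cluster means {radii[t]:5.2f}")
```

The ring of cluster means contracts toward the origin as `abar` falls, while each cluster's spread grows toward unit variance; at `t = 1` the six clusters are no longer distinguishable.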
Learning to reverse: noise prediction = score matching
Reversing the corruption requires only one ingredient: at every noise level, which direction points toward the data? That direction is the score, \( \nabla_x \log p_t(x) \). The DDPM training objective looks almost embarrassingly simple — predict the noise that was added:
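Written out under the usual DDPM conventions, the noise-prediction loss and its link to the score look roughly as follows. The notation is an assumption about the original: \( \bar\alpha_t \) is the cumulative signal fraction, \( \varepsilon \sim \mathcal{N}(0, I) \) is the noise that was added, and \( \varepsilon_\theta \) is the network.

\[
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\varepsilon,\,t}\left[\left\lVert \varepsilon - \varepsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\ t\right)\right\rVert^2\right],
\qquad
\nabla_x \log p_t(x) \approx -\frac{\varepsilon_\theta(x, t)}{\sqrt{1-\bar\alpha_t}}
\]

The second relation is why the two tasks coincide up to a known scale: the added noise points away from the data, so its rescaled negative points toward it.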
```python
# Reverse diffusion for real: annealed Langevin, analytic score
import numpy as np

rng = np.random.default_rng(0)
mu, s0 = np.array([-2.0, 2.0]), 0.3  # the true two-mode GMM

def score(x, sig):  # exact grad log p of smoothed GMM
    s2 = s0**2 + sig**2                   # per-mode variance after adding noise of sd sig
    d = x[:, None] - mu[None, :]          # offset of every point from every mode
    w = np.exp(-d**2 / (2*s2)); w /= w.sum(1, keepdims=True)  # posterior mode weights
    return -(w * d).sum(1) / s2

x = rng.normal(0, 3.0, 600)  # start from pure noise
for sig in np.geomspace(3.0, 0.05, 25):  # anneal the noise level down
    eps = 0.5 * sig**2  # step size shrinks with the noise level
    for _ in range(20):  # Langevin steps at this level
        x += eps * score(x, sig) + np.sqrt(2*eps) * rng.normal(0, 1, len(x))

m_lo, m_hi = x[x < 0].mean(), x[x > 0].mean()
print(f"recovered mode means : {m_lo:+.3f} {m_hi:+.3f} (true -2.000 +2.000)")
print(f"both within +/-0.05 : {bool(abs(m_lo+2) < 0.05 and abs(m_hi-2) < 0.05)}")
print(f"mass split lo/hi : {np.mean(x < 0):.2f} / {np.mean(x > 0):.2f}")
print("\n600 points walked from N(0,9) back to the bimodal density using")
print("nothing but the score -- what a U-Net/DiT learns to estimate.")

# plot_scatter is assumed to be a helper supplied by the page's runtime;
# the call is skipped when the script runs anywhere else.
if "plot_scatter" in globals():
    plot_scatter(x, rng.normal(0, 0.2, len(x)), (x > 0).astype(int))
```
Autoregression factors over sequence positions; diffusion factors over noise levels. An LLM makes one sequential decision per token, each conditioned on a growing prefix. A diffusion model makes roughly 50 global refinements, each touching every pixel or token at once. That difference (local and sequential versus global and parallel) underlies the trade-offs in §10.5.
Sampling & classifier-free guidance
Raw conditional diffusion follows the prompt loosely. The fix used by essentially every image generator is classifier-free guidance: train the model with the condition randomly dropped (so it learns both \( \varepsilon_\theta(x, c) \) and \( \varepsilon_\theta(x, \varnothing) \)), then at sampling time exaggerate the difference:
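A minimal sketch of that combination step, using the guidance-scale convention common in open-source samplers: `w = 0` gives the unconditional prediction, `w = 1` the conditional one, and `w > 1` extrapolates past it. The function name `cfg_combine` and the toy arrays are illustrative assumptions, not part of any library; Ho & Salimans write the same idea with a differently offset weight.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional noise
    prediction toward (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the two network outputs at one sampling step.
eps_uncond = np.array([0.20, -0.10, 0.05])
eps_cond = np.array([0.50, -0.40, 0.05])

for w in (0.0, 1.0, 3.0, 7.5):
    print(f"w={w:4.1f}  guided eps = {cfg_combine(eps_uncond, eps_cond, w)}")
```

Components where the condition changes nothing (the third entry here) are left alone at any scale; only the directions in which the prompt matters get amplified. In a real sampler both predictions come from the same network, evaluated once with the condition and once with it dropped.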
Step-count compression is the active frontier: consistency models, progressive distillation, and adversarial distillation (SDXL-Turbo, SD3-Turbo lineage) collapse 50 steps into 1–4 by training a student to jump straight along the probability-flow ODE — diffusion's own version of Chapter 07.
Latent diffusion & flow matching
Two upgrades define the modern stack:
- Latent diffusion (Stable Diffusion's move): a VAE compresses 1024² RGB into a ~128² latent; diffusion runs there, 50–100× cheaper, and the VAE decoder restores pixels. Practically all production image/video diffusion is latent. The backbone meanwhile migrated from U-Nets to DiT — diffusion transformers — so the two halves of this manual now share an architecture.
- Flow matching / rectified flow (SD3, Flux, much of video): skip the stochastic-process scaffolding; define a straight path between noise and data and regress its constant velocity:
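The straight-path regression target from the flow-matching bullet can be sketched as follows. This assumes the convention where `t = 0` is data and `t = 1` is noise (papers differ on the direction), and it only builds training pairs and the loss; no network is trained. The names `data`, `noise`, and `fm_loss` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy 1-D bimodal "data" and a matching batch of Gaussian noise.
data = np.concatenate([rng.normal(-2, 0.3, n // 2), rng.normal(2, 0.3, n // 2)])
noise = rng.normal(0, 1, n)
t = rng.uniform(0, 1, n)            # one random time per training example

x_t = (1 - t) * data + t * noise    # point on the straight path at time t
v_target = noise - data             # d x_t / d t, constant along each path

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target path velocity."""
    return np.mean((v_pred - v_target) ** 2)

# Sanity checks: the target itself scores zero; predicting zero does not.
print("loss, target fed back in:", fm_loss(v_target, v_target))
print("loss, zero predictor    :", fm_loss(np.zeros(n), v_target))

# With the true per-pair velocity, one step of size t lands back on the data.
recovered = x_t - t * v_target
print("max reconstruction error:", np.abs(recovered - data).max())
```

A trained model sees only `x_t` and `t`, not the pair that produced them, so it learns the average velocity over all paths through that point and cannot reach zero loss; sampling then integrates that learned field from noise to data, typically in a handful of steps.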
Text diffusion: the parallel challenger
Gaussian noise has no natural meaning for discrete token ids, so discrete diffusion corrupts differently: it masks tokens with a probability that grows over the schedule (absorbing-state diffusion). The model, a plain bidirectional transformer, learns to fill every mask simultaneously; generation runs the corruption backwards:
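A toy sketch of that forward and backward pair. The forward function masks each token independently with probability `t`; the reverse loop starts fully masked and, at each step, commits the most confident predictions so that the masked count shrinks linearly. Everything here is an illustrative assumption: the `MASK` id, the linear schedule, confidence-ordered unmasking (one common variant), and `toy_denoiser`, which stands in for the transformer by peeking at the answer.

```python
import numpy as np

MASK = -1  # hypothetical id for the absorbing [MASK] state

def forward_mask(tokens, t, rng):
    """Corrupt: each token independently becomes MASK with probability t."""
    out = tokens.copy()
    out[rng.random(tokens.shape) < t] = MASK
    return out

def toy_denoiser(x, target, rng):
    """Stand-in for the bidirectional transformer (it peeks at `target`).
    Returns a guess and a random confidence score for every position."""
    return target.copy(), rng.random(x.shape)

def generate(target, steps, rng):
    """Reverse process: start fully masked, unmask a fraction per step."""
    n = len(target)
    x = np.full(n, MASK)
    for k in range(steps):
        masked = np.flatnonzero(x == MASK)
        guess, conf = toy_denoiser(x, target, rng)   # one parallel pass
        still_masked_after = n * (steps - k - 1) // steps  # linear schedule
        n_commit = masked.size - still_masked_after
        # Commit the most confident guesses among currently masked positions.
        commit = masked[np.argsort(-conf[masked])[:n_commit]]
        x[commit] = guess[commit]
        print(f"step {k + 1}/{steps}: {np.sum(x == MASK):2d} positions still masked")
    return x

rng = np.random.default_rng(0)
target = rng.integers(0, 100, size=16)
print("half-corrupted:", forward_mask(target, 0.5, rng))
out = generate(target, steps=4, rng=rng)
```

Sixteen tokens are produced in four passes instead of sixteen, which is the source of the throughput claims; the price is that each pass attends over the full sequence.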
State of play: LLaDA-8B showed masked diffusion matching same-size autoregressive models on standard benchmarks; Mercury (Inception Labs) and Gemini Diffusion demonstrated 5–10× decode throughput on code; open efforts (Dream, LLaDA-MoE) are scaling the recipe. The honest ledger:
| | Autoregressive | Masked diffusion |
|---|---|---|
| Generation order | strictly left→right | any order, parallel |
| Passes for n tokens | n (cheap each, KV-cached) | K ≈ 4–64 (expensive each: full bidirectional attention, no trivial KV cache) |
| Infilling / editing | awkward (needs special training) | native — it is the training task |
| Reversal curse, planning | struggles | bidirectional context helps measurably |
| Streaming UX, ecosystem, scaling proof | mature at 10²⁶ FLOPs | young — largest public runs ~10²³–10²⁴ |
Where diffusion meets the LLM stack
- The output side of multimodality. "Native image generation" in frontier assistants is predominantly a diffusion (or flow) decoder conditioned on the LLM's hidden states or generated tokens — the LLM plans, diffusion paints. Same pattern for music and increasingly video (Sora-class models: DiT over spacetime latent patches).
- Speech: flow-matching vocoders and TTS (Voicebox/F5 lineage) deliver the naturalness; the LLM supplies the words and prosody plan.
- Robotics & agents: diffusion policies generate action trajectories (smooth, multimodal distributions over continuous controls) while a VLM/LLM does the task reasoning — the same division of labor.
- Drafting hybrids: a diffusion LM is a natural parallel drafter for speculative decoding (Chapter 08) — propose a whole block in one pass, let the AR model verify.
- World models: interactive video generation (Genie-class) uses diffusion as a learned simulator — a possible training ground for agents beyond text.
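The drafting-hybrid idea in the list above can be sketched with greedy verification: the drafter proposes a block, and the autoregressive model keeps the longest prefix it would have produced itself, then contributes one token of its own. This is a simplified, hypothetical illustration. `verify_block`, `toy_ar_next`, and the token values are made up, the diffusion drafter is replaced by fixed lists, and a real system would score the whole block in one batched AR pass instead of a Python loop.

```python
def verify_block(prefix, draft, ar_next_token):
    """Greedy check of a drafted block against an autoregressive model.

    Accepts draft tokens while they match what the AR model would emit,
    then appends the AR model's own token: a correction on the first
    mismatch, or a bonus token if the whole block was accepted.
    """
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        expected = ar_next_token(ctx)
        if tok != expected:
            accepted.append(expected)    # replace the first wrong token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(ar_next_token(ctx))  # whole block accepted: one extra token
    return accepted

def toy_ar_next(ctx):
    """Hypothetical AR model: always continues by counting up modulo 10."""
    return (ctx[-1] + 1) % 10

prefix = [1, 2]
partly_right = verify_block(prefix, [3, 4, 9, 6], toy_ar_next)
all_right = verify_block(prefix, [3, 4, 5], toy_ar_next)
print("partly right draft ->", partly_right)
print("fully right draft  ->", all_right)
```

Either way the output matches what greedy AR decoding alone would have produced; the drafter only changes how many AR passes it takes to get there.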
Every piece is now on the table — two generative families, the full training stack, alignment, compression, serving. The Capstone assembles all of it: design a frontier model end-to-end with live numbers, then watch a prompt travel the entire pipeline you've just read.
Further reading
- Sohl-Dickstein, Weiss, Maheswaranathan & Ganguli (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. — the original forward-noise / reverse-denoise framework.
- Ho, Jain & Abbeel (2020). Denoising Diffusion Probabilistic Models (DDPM). — the noise-prediction objective and the practical recipe this chapter follows.
- Song et al. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. — unifies diffusion with score matching and the SDE view.
- Ho & Salimans (2022). Classifier-Free Diffusion Guidance. — the guidance trick behind controllable, high-fidelity sampling.
- Rombach, Blattmann, Lorenz, Esser & Ommer (2022). High-Resolution Image Synthesis with Latent Diffusion Models. — latent diffusion (Stable Diffusion).
- Lipman, Chen, Ben-Hamu, Nickel & Le (2023). Flow Matching for Generative Modeling. — the continuous-flow reformulation now common in frontier image/video models.
- Lou, Meng & Ermon (2024). Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD). — a leading approach to diffusion language models.