CHAPTER 10 / 10

Diffusion

The second major family of generative models works differently from next-token prediction. The procedure is to destroy data with noise, then learn to walk the destruction backwards. Diffusion dominates images, audio, and video, drives the output side of most multimodal models, and in masked form now poses a credible challenge to autoregressive text generation.

READING TIME ≈ 25 MIN · BUILDS ON CH 01, 09 · INSTRUMENTS: REVERSE-DIFFUSION SANDBOX · MASKED dLLM
10.1

The forward process: scheduled destruction

Take a data point \(x_0\) — an image, an audio clip, a latent — and corrupt it over \(T\) steps with small additions of Gaussian noise. Each step is trivial; the composition has a closed form, so any noise level is reachable in one jump:

EQ 10.1 — FORWARD (NOISING) PROCESS $$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(\sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\big), \qquad q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t)\, I\big) $$
\(\beta_t\) is the noise schedule; \(\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)\) decays from 1 to ≈0. Equivalently: \(x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\). At \(t = T\) every dataset becomes the same boring isotropic Gaussian — which is the point: we know how to sample that.
The sandbox uses the schedule \(\bar\alpha(t) = e^{-6t}\). At noise level \(t = 0.5\), what is \(\bar\alpha\)?
\(\bar\alpha(0.5) = e^{-6\times 0.5} = e^{-3} \approx \) 0.0498. The signal coefficient \(\sqrt{\bar\alpha}\approx 0.22\) is already small: halfway through the schedule the data is mostly noise.
Using \(x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\varepsilon\) with \(\bar\alpha = 0.36\), a clean value \(x_0 = 2\), and noise draw \(\varepsilon = 1\), what is the noised value \(x_t\)?
\(\sqrt{0.36} = 0.6\) and \(\sqrt{1-0.36} = \sqrt{0.64} = 0.8\). So \(x_t = 0.6\times 2 + 0.8\times 1 = 1.2 + 0.8 = \) 2.0.
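The claim that the step-by-step chain and the one-jump closed form agree can be checked numerically. The sketch below is illustrative only: the linear \(\beta_t\) schedule, the step count, and the sample size are assumptions, not values taken from the sandbox.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule (not from the text): DDPM-style linear betas over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)          # abar[t-1] = prod_{s<=t} (1 - beta_s)

x0 = 2.0                                # one clean value, copied n times
n = 20_000
t = 300                                 # target noise level (1-indexed step)

# Route 1: apply the single-step kernel q(x_t | x_{t-1}) t times.
x = np.full(n, x0)
for s in range(t):
    x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.normal(size=n)

# Route 2: one jump with the closed form q(x_t | x_0).
a = abar[t - 1]
x_jump = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.normal(size=n)

print(f"closed form : mean {np.sqrt(a) * x0:.3f}  sd {np.sqrt(1 - a):.3f}")
print(f"iterated    : mean {x.mean():.3f}  sd {x.std():.3f}")
print(f"one jump    : mean {x_jump.mean():.3f}  sd {x_jump.std():.3f}")
```

All three rows should agree up to sampling error, which is why training can draw any noise level directly instead of simulating the chain.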
PYTHON · RUNNABLE IN-BROWSER
# Forward noising in 1-D: a bimodal dataset dissolving (EQ 10.1)
import numpy as np
rng = np.random.default_rng(0)
n = 300
x0 = np.concatenate([rng.normal(-2, 0.3, n//2), rng.normal(2, 0.3, n//2)])

ts = [0.0, 0.25, 0.5, 1.0]
xs, rows, labels = [], [], []
for i, t in enumerate(ts):
    abar = np.exp(-6 * t)                # noise schedule: abar 1 -> ~0
    xt = np.sqrt(abar)*x0 + np.sqrt(1-abar)*rng.normal(0, 1, n)
    sep = xt[x0 > 0].mean() - xt[x0 < 0].mean()
    print(f"t={t:4.2f}  abar={abar:5.3f}  mode separation {sep:5.2f}  "
          f"noise sd {np.sqrt(1-abar):4.2f}")
    xs += list(xt); rows += [i]*n; labels += [i]*n

print("\nthe modes start 4.0 apart; separation shrinks as 4*sqrt(abar)")
print("while the noise floor grows to sd 1. by t=1 both modes have")
print("melted into N(0,1) -- the state every sampler will start from.")
# plot_scatter is not defined here; it is assumed to be supplied by the in-browser runtime
plot_scatter(xs, rows, labels)

Drag the slider below to run EQ 10.1 on a real 2-D dataset — six Gaussian clusters arranged in a ring — and watch structure dissolve:

INSTRUMENT 10.1 — DIFFUSION SANDBOX · REAL REVERSE DIFFUSION · ANALYTIC SCORE
REVERSE PROCESS
The slider is the forward process. SAMPLE runs the genuine reverse process: 520 points drawn from pure noise follow the score field ∇log p (computable in closed form for this mixture, so no neural network is needed) through 60 annealed noise levels until the ring of clusters re-emerges. An image model runs the same procedure in a vastly higher-dimensional space, with a U-Net/DiT estimating the score instead.
10.2

Learning to reverse: noise prediction = score matching

Reversing the corruption requires only one ingredient: at every noise level, which direction points toward the data? That direction is the score, \( \nabla_x \log p_t(x) \). The DDPM training objective looks almost embarrassingly simple — predict the noise that was added:

EQ 10.2 — THE DIFFUSION LOSS $$ \mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t} \Big[ \big\lVert \varepsilon - \varepsilon_\theta\big( \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \varepsilon,\; t \big) \big\rVert^2 \Big] $$
Sample a training image, a noise level, a noise vector; corrupt; ask the network to recover the noise; L2 loss. This is secretly denoising score matching: the optimal noise predictor and the score differ only by scale, \( s_\theta(x_t, t) = -\,\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t} \). One network learns the score at every noise level, indexed by the timestep embedding.
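The link between noise prediction and the score can be verified in a case where both are known exactly. The sketch below assumes a toy setup that is not in the text: Gaussian data \(x_0 \sim \mathcal{N}(m, s_0^2)\) and a single fixed noise level. For jointly Gaussian variables the L2-optimal noise predictor is a straight line, so an ordinary least-squares fit stands in for a trained \(\varepsilon_\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: data x0 ~ N(m, s0^2), one fixed noise level abar.
m, s0, abar = 2.0, 0.5, 0.36
n = 200_000

x0 = rng.normal(m, s0, n)
eps = rng.normal(size=n)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Minimise the L2 noise-prediction loss over lines eps ~ a * x_t + b.
# E[eps | x_t] is linear for jointly Gaussian variables, so this fit
# approximates the optimal predictor.
a, b = np.polyfit(xt, eps, 1)
eps_hat = a * xt + b

# Exact score of the noised marginal p_t = N(sqrt(abar) m, abar s0^2 + 1 - abar).
var_t = abar * s0**2 + (1 - abar)
score = -(xt - np.sqrt(abar) * m) / var_t

# Relation from the text: s = -eps_hat / sqrt(1 - abar).
score_from_eps = -eps_hat / np.sqrt(1 - abar)
max_gap = np.max(np.abs(score_from_eps - score))
print(f"max |score_from_eps - score| = {max_gap:.4f}")
```

The gap should be small and shrink as `n` grows, consistent with the two quantities differing only by the factor \(-1/\sqrt{1-\bar\alpha_t}\).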
EQ 10.3 — REVERSE (DENOISING) STEP $$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \varepsilon_\theta(x_t, t) \right) + \sigma_t\, z, \qquad z \sim \mathcal{N}(0, I) $$
Here \(\alpha_t = 1 - \beta_t\), and \(\sigma_t\) sets the size of the injected noise (\(\sigma_t^2 = \beta_t\) is a common choice). Subtract the predicted noise (rescaled), inject a little fresh randomness, and repeat from \(t = T\) down to \(t = 1\), which yields \(x_0\). DDIM makes the walk deterministic and skippable (50 steps instead of 1,000). The whole process can equivalently be written as an SDE or a probability-flow ODE, the formulation modern samplers and distillation methods build on.
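EQ 10.3 can be run literally on a 1-D toy problem. The sketch below makes several assumptions not stated in the text: a linear \(\beta_t\) schedule with \(T = 1000\), \(\alpha_t = 1 - \beta_t\), \(\sigma_t^2 = \beta_t\), no noise on the final step, and an analytic `eps_star` (a hypothetical helper) standing in for a trained \(\varepsilon_\theta\) on a two-mode Gaussian mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: equal mixture of N(-2, 0.3^2) and N(+2, 0.3^2).
mu, s0 = np.array([-2.0, 2.0]), 0.3

# Assumed schedule: DDPM-style linear betas, with alpha_t = 1 - beta_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

def eps_star(x, a):
    """Ideal noise predictor for the toy mixture at cumulative signal level a.

    Stands in for a trained eps_theta: it equals -sqrt(1 - a) times the exact
    score of the noised mixture, whose components are N(sqrt(a) * mu_k, v).
    """
    v = a * s0**2 + (1.0 - a)
    d = x[:, None] - np.sqrt(a) * mu[None, :]
    logits = -d**2 / (2.0 * v)
    logits -= logits.max(axis=1, keepdims=True)   # stable softmax over modes
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    score = -(w * d).sum(axis=1) / v
    return -np.sqrt(1.0 - a) * score

x = rng.normal(size=4000)                         # x_T ~ N(0, 1)
for t in range(T - 1, -1, -1):                    # 0-indexed steps T-1 .. 0
    eps_hat = eps_star(x, abar[t])
    mean = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t]) if t > 0 else 0.0   # assumed sigma_t^2 = beta_t
    x = mean + sigma * rng.normal(size=x.shape)

m_lo, m_hi = x[x < 0].mean(), x[x > 0].mean()
frac_hi = np.mean(x > 0)
print(f"mode means {m_lo:+.2f} {m_hi:+.2f}, mass above zero {frac_hi:.2f}")
```

With an exact noise predictor the samples should settle near the two modes with roughly equal mass; a learned network replaces `eps_star` in practice and adds its own estimation error.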
PYTHON · RUNNABLE IN-BROWSER
# Reverse diffusion for real: annealed Langevin, analytic score
import numpy as np
rng = np.random.default_rng(0)
mu, s0 = np.array([-2.0, 2.0]), 0.3      # the true two-mode GMM

def score(x, sig):                       # exact grad log p of smoothed GMM
    s2 = s0**2 + sig**2
    d = x[:, None] - mu[None, :]
    w = np.exp(-d**2 / (2*s2)); w /= w.sum(1, keepdims=True)
    return -(w * d).sum(1) / s2

x = rng.normal(0, 3.0, 600)              # start from pure noise
for sig in np.geomspace(3.0, 0.05, 25):  # anneal the noise level down
    eps = 0.5 * sig**2
    for _ in range(20):                  # Langevin steps at this level
        x += eps * score(x, sig) + np.sqrt(2*eps) * rng.normal(0, 1, len(x))

m_lo, m_hi = x[x < 0].mean(), x[x > 0].mean()
print(f"recovered mode means : {m_lo:+.3f}  {m_hi:+.3f}   (true -2.000 +2.000)")
print(f"both within +/-0.05  : {bool(abs(m_lo+2) < 0.05 and abs(m_hi-2) < 0.05)}")
print(f"mass split lo/hi     : {np.mean(x < 0):.2f} / {np.mean(x > 0):.2f}")
print("\n600 points walked from N(0,9) back to the bimodal density using")
print("nothing but the score -- what a U-Net/DiT learns to estimate.")
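# plot_scatter is assumed to be supplied by the in-browser runtime;
# the y-values are random jitter for display only
plot_scatter(x, rng.normal(0, 0.2, len(x)), (x > 0).astype(int))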
CONTRAST

Autoregression factors over sequence positions; diffusion factors over noise levels. An LLM makes \(n\) sequential decisions, one per token, each conditioned on a growing prefix. A diffusion model makes ~50 global refinements, each touching every pixel/token at once. That difference, local-and-sequential vs global-and-parallel, drives the trade-offs in §10.5.

10.3

Sampling & classifier-free guidance

Raw conditional diffusion follows the prompt loosely. The fix used by essentially every image generator is classifier-free guidance: train the model with the condition randomly dropped (so it learns both \( \varepsilon_\theta(x, c) \) and \( \varepsilon_\theta(x, \varnothing) \)), then at sampling time exaggerate the difference:

EQ 10.4 — CLASSIFIER-FREE GUIDANCE $$ \tilde{\varepsilon} = \varepsilon_\theta(x_t, \varnothing) \;+\; w \cdot \big( \varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \varnothing) \big) $$
\(w = 1\) is plain conditioning; \(w \approx 5\text{–}10\) pushes samples toward regions where the condition is most informative — sharper prompt adherence, less diversity, and at extremes the over-saturated “AI look”. Guidance is diffusion's temperature dial: the single most user-visible sampling knob.
Classifier-free guidance forms \(\tilde\varepsilon = \varepsilon_\varnothing + w\,(\varepsilon_c - \varepsilon_\varnothing)\). With unconditional prediction \(\varepsilon_\varnothing = 0.2\), conditional \(\varepsilon_c = 0.5\), and guidance scale \(w = 5\), what is the guided prediction \(\tilde\varepsilon\)?
\(\tilde\varepsilon = 0.2 + 5\,(0.5 - 0.2) = 0.2 + 5\times 0.3 = 0.2 + 1.5 = \) 1.7. The guided estimate is extrapolated far past the conditional one — sharper adherence at the cost of diversity.
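The guidance rule of EQ 10.4 is a one-line extrapolation. The sketch below is a minimal illustration using the numbers from the worked example; the function name `cfg` is hypothetical, and in a real sampler the two inputs would be network outputs of the same shape as \(x_t\).

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance (EQ 10.4): start at the unconditional
    prediction and move w times the conditional-minus-unconditional gap."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond, eps_cond = np.array([0.2]), np.array([0.5])
for w in (0.0, 1.0, 5.0):
    print(f"w={w:3.1f}  guided eps = {cfg(eps_uncond, eps_cond, w)[0]:.2f}")
```

Setting `w = 0` returns the unconditional prediction (0.20), `w = 1` returns the conditional one (0.50), and `w = 5` gives the extrapolated 1.70 from the worked example.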

Step-count compression is the active frontier: consistency models, progressive distillation, and adversarial distillation (SDXL-Turbo, SD3-Turbo lineage) collapse 50 steps into 1–4 by training a student to jump straight along the probability-flow ODE — diffusion's own version of Chapter 07.

10.4

Latent diffusion & flow matching

Two upgrades define the modern stack:

  • Latent diffusion (Stable Diffusion's move): a VAE compresses 1024² RGB into a ~128² latent; diffusion runs there, 50–100× cheaper, and the VAE decoder restores pixels. Practically all production image/video diffusion is latent. The backbone meanwhile migrated from U-Nets to DiT — diffusion transformers — so the two halves of this manual now share an architecture.
  • Flow matching / rectified flow (SD3, Flux, much of video): skip the stochastic-process scaffolding; define a straight path between noise and data and regress its constant velocity:
EQ 10.5 — FLOW MATCHING (RECTIFIED FLOW) $$ x_t = (1-t)\, x_0 + t\, x_1, \qquad \mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, x_1, t} \Big[ \big\lVert v_\theta(x_t, t) - (x_1 - x_0) \big\rVert^2 \Big] $$
Note the convention flip relative to §10.1: here \(x_0 \sim \mathcal{N}(0, I)\) is noise and \(x_1\) is data, so the target velocity is just their difference. Straighter paths ⇒ fewer integration steps at sampling; a cleaner objective ⇒ easier scaling. Diffusion's EQ 10.2 is recoverable as a special case with curved paths; flow matching is the simplification that won.
Rectified flow interpolates linearly: \(x_t = (1-t)\,x_0 + t\,x_1\). With noise endpoint \(x_0 = 0\), data endpoint \(x_1 = 8\), at \(t = 0.25\), what is \(x_t\)?
\(x_t = (1-0.25)\times 0 + 0.25\times 8 = 0 + 2 = \) 2. A quarter of the way along the straight path from noise to data.
In flow matching the network regresses the target velocity \(x_1 - x_0\). For data \(x_1 = 8\) and noise \(x_0 = 2\), what velocity should it predict?
\(x_1 - x_0 = 8 - 2 = \) 6. Because the path is a straight line, this velocity is constant along it — the property that makes sampling need so few steps.
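The interpolation, the regression target, and the few-step sampling claim can be sketched together. The example below assumes an idealised case that is not in the text: all data sits at the single point \(x_1 = 8\), so the velocity field `v_ideal` (a hypothetical stand-in for a trained \(v_\theta\)) is known exactly. Real learned fields are only approximately straight.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x0, x1, t):
    """Point on the straight path of EQ 10.5 (x0 = noise, x1 = data)."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(v_pred, x0, x1):
    """Flow-matching regression loss against the target velocity x1 - x0."""
    return np.mean((v_pred - (x1 - x0)) ** 2)

# The first worked value from the text.
print(interpolate(0.0, 8.0, 0.25))     # 2.0

# Idealised case: with point-mass data the exact velocity field is
# v(x, t) = (x1 - x) / (1 - t), which is constant along each straight path.
x1 = 8.0

def v_ideal(x, t):
    return (x1 - x) / (1.0 - t)

def euler_sample(x_start, steps):
    """Integrate dx/dt = v_ideal from t = 0 to t = 1 with fixed-size Euler steps."""
    x, h = x_start.copy(), 1.0 / steps
    for k in range(steps):
        x = x + h * v_ideal(x, k * h)
    return x

noise = rng.normal(size=5)
loss_at_ideal = fm_loss(v_ideal(interpolate(noise, x1, 0.3), 0.3), noise, x1)
one_step = euler_sample(noise, 1)
four_step = euler_sample(noise, 4)
print(f"loss of ideal field: {loss_at_ideal:.2e}")
print("1 step :", one_step)
print("4 steps:", four_step)
```

Because the velocity never changes along a straight path, one Euler step lands on the data point just as four steps do; curvature in a learned field is what forces more steps in practice.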
10.5

Text diffusion: the parallel challenger

Gaussian noise makes no sense for discrete tokens — so discrete diffusion corrupts differently: mask tokens with probability growing over the schedule (absorbing-state diffusion). The model — a plain bidirectional transformer — learns to fill every mask simultaneously; generation runs the corruption backwards:

EQ 10.6 — MASKED DIFFUSION LM OBJECTIVE $$ \mathcal{L} = \mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i \,:\, x_t^i = \texttt{[M]}} -\log p_\theta\big( x_0^i \mid x_t \big) \right] $$
Mask a random fraction \(t\) of positions, predict the originals given the rest, and weight by the inverse masking rate \(1/t\). The result is a principled likelihood bound (a negative ELBO), and recognizably BERT's objective put to generative work. At sampling time: start fully masked, predict everything, keep the most confident fraction, re-mask the rest, repeat for \(K \ll n\) steps.
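The role of the \(1/t\) weight in EQ 10.6 is easiest to see with a model that knows nothing. The sketch below uses assumptions throughout: a toy vocabulary, a hypothetical `MASK` id, and a hypothetical `uniform_model` in place of a trained transformer. On average a fraction \(t\) of positions is masked, so dividing by \(t\) should give about \(n \log V\) at every masking rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 64, 50                       # assumed toy sequence length and vocab size
MASK = -1                           # hypothetical id for the [M] token

def uniform_model(x_t):
    """Hypothetical stand-in for p_theta: uniform over the vocab everywhere."""
    return np.full((len(x_t), V), 1.0 / V)

def masked_diffusion_loss(x0, t, model):
    """One Monte Carlo sample of the EQ 10.6 integrand at masking rate t."""
    masked = rng.random(len(x0)) < t            # mask each position w.p. t
    x_t = np.where(masked, MASK, x0)
    probs = model(x_t)                          # (n, V) predictive distribution
    nll = -np.log(probs[np.arange(len(x0)), x0])
    return nll[masked].sum() / t                # masked positions only, weighted 1/t

x0 = rng.integers(0, V, size=n)
estimates = {}
for t in (0.25, 0.5, 1.0):
    draws = [masked_diffusion_loss(x0, t, uniform_model) for _ in range(2000)]
    estimates[t] = np.mean(draws)
    print(f"t={t:4.2f}  loss ~ {estimates[t]:7.2f}   (n log V = {n * np.log(V):.2f})")
```

The estimates should sit close to \(n \log V\) for every \(t\): the weight keeps lightly and heavily masked samples on the same per-sequence scale.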
To produce a \(64\)-token sequence, an autoregressive model needs one forward pass per token; a masked diffusion LM finishes in \(K = 8\) parallel passes. How many times fewer forward passes does the diffusion model use (the ratio \(n/K\))?
\(\dfrac{n}{K} = \dfrac{64}{8} = \) 8× fewer passes. Each diffusion pass is heavier (full bidirectional attention, no cheap KV cache), so wall-clock speedup is smaller than 8× — but the parallelism is real.
INSTRUMENT 10.2 — MASKED DIFFUSION LM · PARALLEL UNMASKING · K STEPS
FORWARD PASSES: DIFFUSION vs AUTOREGRESSIVE
An autoregressive model needs one forward pass per token — strictly left to right. The diffusion LM fills the whole sequence in K passes, easy tokens first, hard ones last, with full bidirectional context throughout. Fewer steps = faster but rougher: the same compute/quality dial as image diffusion.
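The unmasking loop can be sketched without a real model. Everything model-related below is an assumption: `toy_model` is a hypothetical stand-in that returns random tokens and confidences, `MASK` is a hypothetical id, and committing an equal share of positions per pass is one simple schedule among many.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, V = 64, 8, 50                 # sequence length, passes, assumed toy vocab
MASK = -1                           # hypothetical id for the [M] token

def toy_model(x):
    """Hypothetical stand-in for a bidirectional transformer: returns a
    random (token, confidence) guess for every position."""
    return rng.integers(0, V, size=len(x)), rng.random(len(x))

x = np.full(n, MASK)
passes = 0
for step in range(K):
    tokens, conf = toy_model(x)                  # one parallel forward pass
    passes += 1
    masked = np.flatnonzero(x == MASK)
    # Commit an equal share of the remaining masks at each step, most
    # confident first; everything else stays masked for the next pass.
    n_keep = int(np.ceil(len(masked) / (K - step)))
    keep = masked[np.argsort(-conf[masked])[:n_keep]]
    x[keep] = tokens[keep]
    print(f"pass {passes}: {np.sum(x != MASK):2d}/{n} tokens committed")

print(f"{passes} passes vs {n} for token-by-token decoding ({n // passes}x fewer)")
```

With a real model the confidences would come from the predictive distribution, so easy tokens get committed early and hard ones late.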

State of play: LLaDA-8B reported masked diffusion performing competitively with same-size autoregressive models on standard benchmarks; Mercury (Inception Labs) and Gemini Diffusion reported 5–10× decode throughput on code; open efforts (Dream, LLaDA-MoE) are scaling the recipe. The honest ledger:

| | Autoregressive | Masked diffusion |
|---|---|---|
| Generation order | strictly left→right | any order, parallel |
| Passes for n tokens | n (cheap each, KV-cached) | K ≈ 4–64 (expensive each: full bidirectional attention, no trivial KV cache) |
| Infilling / editing | awkward (needs special training) | native: it is the training task |
| Reversal curse, planning | struggles | bidirectional context helps measurably |
| Streaming UX, ecosystem, scaling proof | mature at 10²⁶ FLOPs | young: largest public runs ~10²³–10²⁴ FLOPs |
10.6

Where diffusion meets the LLM stack

  • The output side of multimodality. "Native image generation" in frontier assistants is predominantly a diffusion (or flow) decoder conditioned on the LLM's hidden states or generated tokens — the LLM plans, diffusion paints. Same pattern for music and increasingly video (Sora-class models: DiT over spacetime latent patches).
  • Speech: flow-matching vocoders and TTS (Voicebox/F5 lineage) deliver the naturalness; the LLM supplies the words and prosody plan.
  • Robotics & agents: diffusion policies generate action trajectories (smooth, multimodal distributions over continuous controls) while a VLM/LLM does the task reasoning — the same division of labor.
  • Drafting hybrids: a diffusion LM is a natural parallel drafter for speculative decoding (Chapter 08) — propose a whole block in one pass, let the AR model verify.
  • World models: interactive video generation (Genie-class) uses diffusion as a learned simulator — a possible training ground for agents beyond text.
NEXT

Every piece is now on the table — two generative families, the full training stack, alignment, compression, serving. The Capstone assembles all of it: design a frontier model end-to-end with live numbers, then watch a prompt travel the entire pipeline you've just read.

§

Further reading

  • Sohl-Dickstein, Weiss, Maheswaranathan & Ganguli (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. — the original forward-noise / reverse-denoise framework.
  • Ho, Jain & Abbeel (2020). Denoising Diffusion Probabilistic Models (DDPM). — the noise-prediction objective and the practical recipe this chapter follows.
  • Song et al. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. — unifies diffusion with score matching and the SDE view.
  • Ho & Salimans (2022). Classifier-Free Diffusion Guidance. — the guidance trick behind controllable, high-fidelity sampling.
  • Rombach, Blattmann, Lorenz, Esser & Ommer (2022). High-Resolution Image Synthesis with Latent Diffusion Models. — latent diffusion (Stable Diffusion).
  • Lipman, Chen, Ben-Hamu, Nickel & Le (2023). Flow Matching for Generative Modeling. — the continuous-flow reformulation now common in frontier image/video models.
  • Lou, Meng & Ermon (2024). Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD). — a leading approach to diffusion language models.