10 · Diffusion — LLM Field Manual

10.1

The forward process: scheduled destruction

Take a data point $x_0$ — an image, an audio clip, a latent — and corrupt it over $T$ steps with small additions of Gaussian noise. Each step is trivial; the composition has a closed form, so any noise level is reachable in one jump:

EQ 10.1 — FORWARD (NOISING) PROCESS $$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(\sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\big), \qquad q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t)\, I\big) $$

$\beta_t$ is the noise schedule; $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$ decays from 1 to ≈0. Equivalently: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. At $t = T$ every dataset becomes the same boring isotropic Gaussian — which is the point: we know how to sample that.
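The claim that the small steps compose into one closed-form jump can be checked numerically. The sketch below assumes a linear $\beta_t$ schedule over $T = 100$ steps (chosen only for illustration, not the sandbox's schedule) and compares $T$ applications of $q(x_t \mid x_{t-1})$ against a single draw from $q(x_T \mid x_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.05, T)     # assumed linear schedule, illustration only
abar = np.cumprod(1.0 - betas)         # abar_t = prod_{s<=t} (1 - beta_s)

x0 = np.full(100_000, 2.0)             # many copies of one clean value x0 = 2
x = x0.copy()
for beta in betas:                     # T small steps of q(x_t | x_{t-1})
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.size)

# One jump with q(x_T | x_0), using independent noise
x_jump = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * rng.normal(size=x0.size)

print(f"abar_T            : {abar[-1]:.4f}")
print(f"stepwise mean/var : {x.mean():.3f} / {x.var():.3f}")
print(f"one-jump mean/var : {x_jump.mean():.3f} / {x_jump.var():.3f}")
print(f"closed form       : {np.sqrt(abar[-1]) * 2.0:.3f} / {1.0 - abar[-1]:.3f}")
```

Both routes should agree with the closed-form mean $\sqrt{\bar\alpha_T}\,x_0$ and variance $1-\bar\alpha_T$ up to sampling error, which is why training can pick any noise level without simulating the chain.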

The sandbox uses the schedule $\bar\alpha(t) = e^{-6t}$. At noise level $t = 0.5$, what is $\bar\alpha$?

$\bar\alpha(0.5) = e^{-6\times 0.5} = e^{-3} = $ 0.0498. The signal coefficient $\sqrt{\bar\alpha}\approx 0.22$ is already small — halfway through the schedule the data is mostly noise.

Using $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\varepsilon$ with $\bar\alpha = 0.36$, a clean value $x_0 = 2$, and noise draw $\varepsilon = 1$, what is the noised value $x_t$?

$\sqrt{0.36} = 0.6$ and $\sqrt{1-0.36} = \sqrt{0.64} = 0.8$. So $x_t = 0.6\times 2 + 0.8\times 1 = 1.2 + 0.8 = $ 2.0.

PYTHON · RUNNABLE IN-BROWSER

# Forward noising in 1-D: a bimodal dataset dissolving (EQ 10.1)
import numpy as np
rng = np.random.default_rng(0)
n = 300
x0 = np.concatenate([rng.normal(-2, 0.3, n//2), rng.normal(2, 0.3, n//2)])

ts = [0.0, 0.25, 0.5, 1.0]
xs, rows, labels = [], [], []
for i, t in enumerate(ts):
    abar = np.exp(-6 * t)                # noise schedule: abar 1 -> ~0
    xt = np.sqrt(abar)*x0 + np.sqrt(1-abar)*rng.normal(0, 1, n)
    sep = xt[x0 > 0].mean() - xt[x0 < 0].mean()
    print(f"t={t:4.2f}  abar={abar:5.3f}  mode separation {sep:5.2f}  "
          f"noise sd {np.sqrt(1-abar):4.2f}")
    xs += list(xt); rows += [i]*n; labels += [i]*n

print("\nthe modes start 4.0 apart; separation shrinks as 4*sqrt(abar)")
print("while the noise floor grows to sd 1. by t=1 both modes have")
print("melted into N(0,1) -- the state every sampler will start from.")
plot_scatter(xs, rows, labels)


Drag the slider below to run EQ 10.1 on a real 2-D dataset — six Gaussian clusters arranged in a ring — and watch structure dissolve:

INSTRUMENT 10.1 — DIFFUSION SANDBOX · REAL REVERSE DIFFUSION · ANALYTIC SCORE

[Interactive widget: a slider for the forward noise level $t$ (starting at 0.00) and a SAMPLE control for the reverse process.]

The slider is the forward process. SAMPLE runs the genuine reverse process: 520 points drawn from pure noise follow the score field $\nabla \log p$ (computable in closed form for this mixture, so no neural network is needed) through 60 annealed noise levels until the ring of clusters re-emerges. An image model applies the same principle in a vastly higher-dimensional space, with a U-Net/DiT estimating the score instead.

10.2

Learning to reverse: noise prediction = score matching

Reversing the corruption requires only one ingredient: at every noise level, which direction points toward the data? That direction is the score, $ \nabla_x \log p_t(x) $. The DDPM training objective looks almost embarrassingly simple — predict the noise that was added:

EQ 10.2 — THE DIFFUSION LOSS $$ \mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t} \Big[ \big\lVert \varepsilon - \varepsilon_\theta\big( \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \varepsilon,\; t \big) \big\rVert^2 \Big] $$

Sample a training image, a noise level, a noise vector; corrupt; ask the network to recover the noise; L2 loss. This is secretly denoising score matching: the optimal noise predictor and the score differ only by scale, $ s_\theta(x_t, t) = -\,\varepsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t} $. One network learns the score at every noise level, indexed by the timestep embedding.
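The link between noise prediction and the score can be checked in a case where both are known exactly. The sketch below assumes toy 1-D Gaussian data $x_0 \sim \mathcal{N}(0, s_0^2)$ with $s_0 = 0.5$ and a single noise level $\bar\alpha = 0.36$ (both illustrative). For Gaussian data the best noise predictor is linear in $x_t$, so a least-squares fit stands in for the network.

```python
import numpy as np

rng = np.random.default_rng(0)
s0, abar = 0.5, 0.36                  # assumed toy values: data sd and noise level
n = 200_000

x0 = rng.normal(0.0, s0, n)           # toy data: x0 ~ N(0, s0^2)
eps = rng.normal(0.0, 1.0, n)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# Best linear noise predictor eps_hat = k * xt under the L2 loss of EQ 10.2
k_fit = (xt @ eps) / (xt @ xt)

# For this data p_t = N(0, var_t), so the true score is -x / var_t
var_t = abar * s0**2 + (1 - abar)
score_slope_true = -1.0 / var_t
score_slope_from_eps = -k_fit / np.sqrt(1 - abar)   # s = -eps_theta / sqrt(1 - abar)

print(f"score slope from fitted noise predictor : {score_slope_from_eps:+.4f}")
print(f"score slope from the true density       : {score_slope_true:+.4f}")
```

The two slopes should agree up to sampling error, which is the scale relation between $\varepsilon_\theta$ and $s_\theta$ in miniature.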

EQ 10.3 — REVERSE (DENOISING) STEP $$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \varepsilon_\theta(x_t, t) \right) + \sigma_t\, z, \qquad z \sim \mathcal{N}(0, I) $$

Here $\alpha_t = 1 - \beta_t$, and $\sigma_t$ sets the size of the injected noise (DDPM uses $\sigma_t^2 = \beta_t$ or the posterior variance). Subtract the predicted noise (rescaled), inject a little fresh randomness, and repeat from $t = T$ down to $t = 1$. DDIM makes the walk deterministic and skippable (50 steps instead of 1,000); the whole process can equivalently be written as an SDE or a probability-flow ODE, the formulation modern samplers and distillation methods build on.
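A minimal sketch of EQ 10.3 as a sampling loop, under stated assumptions: $\alpha_t = 1 - \beta_t$, $\sigma_t^2 = \beta_t$ (a common DDPM choice), a linear $\beta_t$ schedule with $T = 200$, and a two-mode 1-D mixture whose optimal noise prediction is available in closed form. The function `eps_star` is a hypothetical stand-in for a trained $\varepsilon_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s0 = np.array([-2.0, 2.0]), 0.3           # two-mode data distribution
T = 200
betas = np.linspace(1e-4, 0.05, T)            # assumed linear schedule
alphas = 1.0 - betas
abars = np.cumprod(alphas)

def eps_star(x, abar):
    """Closed-form optimal noise prediction for this mixture (stand-in for eps_theta)."""
    var = abar * s0**2 + (1.0 - abar)         # variance of each noised mode
    d = x[:, None] - np.sqrt(abar) * mu[None, :]
    logw = -d**2 / (2.0 * var)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # posterior weight of each mode
    score = -(w * d).sum(axis=1) / var        # grad log p_t(x)
    return -np.sqrt(1.0 - abar) * score       # eps = -sqrt(1 - abar) * score

x = rng.normal(0.0, 1.0, 2000)                # start from N(0, 1)
for t in range(T - 1, -1, -1):                # EQ 10.3, last noise level first
    z = rng.normal(0.0, 1.0, x.size) if t > 0 else 0.0   # no noise on the final step
    x = (x - betas[t] / np.sqrt(1.0 - abars[t]) * eps_star(x, abars[t])) / np.sqrt(alphas[t])
    x = x + np.sqrt(betas[t]) * z

m_lo, m_hi = x[x < 0].mean(), x[x > 0].mean()
print(f"recovered mode means : {m_lo:+.3f}  {m_hi:+.3f}")
print(f"mass split lo/hi     : {np.mean(x < 0):.2f} / {np.mean(x > 0):.2f}")
```

With a learned model, only `eps_star` changes; the loop itself is the sampler.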

PYTHON · RUNNABLE IN-BROWSER

# Reverse diffusion for real: annealed Langevin, analytic score
import numpy as np
rng = np.random.default_rng(0)
mu, s0 = np.array([-2.0, 2.0]), 0.3      # the true two-mode GMM

def score(x, sig):                       # exact grad log p of smoothed GMM
    s2 = s0**2 + sig**2
    d = x[:, None] - mu[None, :]
    w = np.exp(-d**2 / (2*s2)); w /= w.sum(1, keepdims=True)
    return -(w * d).sum(1) / s2

x = rng.normal(0, 3.0, 600)              # start from pure noise
for sig in np.geomspace(3.0, 0.05, 25):  # anneal the noise level down
    eps = 0.5 * sig**2
    for _ in range(20):                  # Langevin steps at this level
        x += eps * score(x, sig) + np.sqrt(2*eps) * rng.normal(0, 1, len(x))

m_lo, m_hi = x[x < 0].mean(), x[x > 0].mean()
print(f"recovered mode means : {m_lo:+.3f}  {m_hi:+.3f}   (true -2.000 +2.000)")
print(f"both within +/-0.05  : {bool(abs(m_lo+2) < 0.05 and abs(m_hi-2) < 0.05)}")
print(f"mass split lo/hi     : {np.mean(x < 0):.2f} / {np.mean(x > 0):.2f}")
print("\n600 points walked from N(0,9) back to the bimodal density using")
print("nothing but the score -- what a U-Net/DiT learns to estimate.")
plot_scatter(x, rng.normal(0, 0.2, len(x)), (x > 0).astype(int))


CONTRAST

Autoregression factors over sequence positions; diffusion factors over noise levels. An LLM makes $n$ sequential decisions, one per token, each conditioned on a growing prefix. A diffusion model makes ~50 global refinements, each touching every pixel/token at once. That difference (local-and-sequential vs global-and-parallel) drives the trade-offs in §10.5.

10.3

Sampling & classifier-free guidance

Raw conditional diffusion follows the prompt loosely. The fix used by essentially every image generator is classifier-free guidance: train the model with the condition randomly dropped (so it learns both $ \varepsilon_\theta(x, c) $ and $ \varepsilon_\theta(x, \varnothing) $), then at sampling time exaggerate the difference:

EQ 10.4 — CLASSIFIER-FREE GUIDANCE $$ \tilde{\varepsilon} = \varepsilon_\theta(x_t, \varnothing) \;+\; w \cdot \big( \varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \varnothing) \big) $$

$w = 1$ is plain conditioning; $w \approx 5\text{–}10$ pushes samples toward regions where the condition is most informative — sharper prompt adherence, less diversity, and at extremes the over-saturated “AI look”. Guidance is diffusion's temperature dial: the single most user-visible sampling knob.

Classifier-free guidance forms $\tilde\varepsilon = \varepsilon_\varnothing + w\,(\varepsilon_c - \varepsilon_\varnothing)$. With unconditional prediction $\varepsilon_\varnothing = 0.2$, conditional $\varepsilon_c = 0.5$, and guidance scale $w = 5$, what is the guided prediction $\tilde\varepsilon$?

$\tilde\varepsilon = 0.2 + 5\,(0.5 - 0.2) = 0.2 + 5\times 0.3 = 0.2 + 1.5 = $ 1.7. The guided estimate is extrapolated far past the conditional one — sharper adherence at the cost of diversity.
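EQ 10.4 is a one-line extrapolation, sketched here with the numbers from the worked example. The helper name `cfg` is illustrative; in a real sampler the two inputs would be the model's outputs with and without the condition at the same $x_t$.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """EQ 10.4: move from the unconditional prediction toward (and past) the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + w * (eps_cond - eps_uncond)

for w in (0.0, 1.0, 5.0):
    print(f"w = {w:3.1f}  guided eps = {float(cfg(0.2, 0.5, w)):.2f}")

guided = float(cfg(0.2, 0.5, 5.0))
```

At $w = 0$ the condition is ignored, at $w = 1$ the conditional prediction is returned unchanged, and larger $w$ extrapolates beyond it.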

Step-count compression is the active frontier: consistency models, progressive distillation, and adversarial distillation (SDXL-Turbo, SD3-Turbo lineage) collapse 50 steps into 1–4 by training a student to jump straight along the probability-flow ODE — diffusion's own version of Chapter 07.

10.4

Latent diffusion & flow matching

Two upgrades define the modern stack:

Latent diffusion (Stable Diffusion's move): a VAE compresses 1024² RGB into a ~128² latent; diffusion runs there, 50–100× cheaper, and the VAE decoder restores pixels. Practically all production image/video diffusion is latent. The backbone meanwhile migrated from U-Nets to DiT — diffusion transformers — so the two halves of this manual now share an architecture.
Flow matching / rectified flow (SD3, Flux, much of video): skip the stochastic-process scaffolding; define a straight path between noise and data and regress its constant velocity:

EQ 10.5 — FLOW MATCHING (RECTIFIED FLOW) $$ x_t = (1-t)\, x_0 + t\, x_1, \qquad \mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, x_1, t} \Big[ \big\lVert v_\theta(x_t, t) - (x_1 - x_0) \big\rVert^2 \Big] $$

$x_0 \sim \mathcal{N}(0, I)$ is noise, $x_1$ is data; the target velocity is just their difference. Straighter paths ⇒ fewer integration steps at sampling; a cleaner objective ⇒ easier scaling. Diffusion's EQ 10.2 is recoverable as a special case with curved paths — flow matching is the simplification that won.

Rectified flow interpolates linearly: $x_t = (1-t)\,x_0 + t\,x_1$. With noise endpoint $x_0 = 0$, data endpoint $x_1 = 8$, at $t = 0.25$, what is $x_t$?

$x_t = (1-0.25)\times 0 + 0.25\times 8 = 0 + 2 = $ 2. A quarter of the way along the straight path from noise to data.

In flow matching the network regresses the target velocity $x_1 - x_0$. For data $x_1 = 8$ and noise $x_0 = 2$, what velocity should it predict?

$x_1 - x_0 = 8 - 2 = $ 6. Because each noise-data pair is joined by a straight line, this target is constant along the path; the straighter the learned flow, the fewer integration steps sampling needs.
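Both halves of EQ 10.5, regression and sampling, can be sketched in 1-D. The example assumes toy Gaussian data $x_1 \sim \mathcal{N}(2, 0.5^2)$ drawn independently of the noise $x_0$; in that case the loss minimiser is linear in $x_t$ and has a closed form (`v_star` below), which stands in for a trained $v_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 2.0, 0.5                        # assumed toy data distribution: x1 ~ N(m, s^2)

def v_star(x, t):
    """Closed-form minimiser of the FM loss for Gaussian noise and Gaussian data."""
    var_t = (1 - t)**2 + (t * s)**2    # Var[x_t]
    cov = t * s**2 - (1 - t)           # Cov[x1 - x0, x_t]
    return m + cov / var_t * (x - t * m)

# Training view: regress the target x1 - x0 on x_t at one fixed t
t = 0.25
x0 = rng.normal(size=200_000)
x1 = rng.normal(m, s, size=x0.size)
xt = (1 - t) * x0 + t * x1
slope, intercept = np.polyfit(xt, x1 - x0, 1)
print(f"fitted v(x, 0.25)  = {intercept:.3f} + {slope:.3f} x")
print(f"closed-form v_star = {v_star(0.0, t):.3f} + {v_star(1.0, t) - v_star(0.0, t):.3f} x")

# Sampling view: Euler-integrate dx/dt = v(x, t) from noise (t=0) to data (t=1)
x = rng.normal(size=20_000)
steps = 100
for k in range(steps):
    x = x + v_star(x, k / steps) / steps
print(f"sample mean / sd   = {x.mean():.3f} / {x.std():.3f}")
```

The fitted line should match `v_star` up to sampling error, and the integrated samples should land near the data mean and spread.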

10.5

Text diffusion: the parallel challenger

Gaussian noise makes no sense for discrete tokens — so discrete diffusion corrupts differently: mask tokens with probability growing over the schedule (absorbing-state diffusion). The model — a plain bidirectional transformer — learns to fill every mask simultaneously; generation runs the corruption backwards:

EQ 10.6 — MASKED DIFFUSION LM OBJECTIVE $$ \mathcal{L} = \mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i \,:\, x_t^i = \texttt{[M]}} -\log p_\theta\big( x_0^i \mid x_t \big) \right] $$

Mask a random fraction $t$ of positions, predict the originals given the rest, and weight by the inverse masking rate $1/t$: a principled ELBO, and recognizably BERT's objective put to generative work. At sampling time: start fully masked, predict everything, keep the most confident fraction, re-mask the rest, repeat for $K \ll n$ steps.
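A single-sample estimate of EQ 10.6 can be sketched as follows. Everything here is a toy assumption: the vocabulary size, the reserved mask id `MASK`, and the two hypothetical "models" (one that outputs uniform logits, one that already knows $x_0$). The masking rate $t$ is drawn from $[0.1, 1]$ only to keep the $1/t$ weight well behaved in a small example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 50, 16                                   # toy vocabulary size and sequence length
MASK = V                                        # hypothetical id reserved for [M]
x0 = rng.integers(0, V, size=n)                 # clean token ids

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def masked_diffusion_loss(logits, x0, xt, t):
    """(1/t) * sum over masked positions of -log p(x0_i | xt), as in EQ 10.6."""
    masked = xt == MASK
    nll = -log_softmax(logits)[np.arange(x0.size), x0]
    return nll[masked].sum() / t

t = rng.uniform(0.1, 1.0)                       # masking rate for this sample
xt = np.where(rng.random(n) < t, MASK, x0)      # mask each position with probability t
n_masked = int((xt == MASK).sum())

uniform_logits = np.zeros((n, V))               # hypothetical model that knows nothing
oracle_logits = 50.0 * np.eye(V)[x0]            # hypothetical model that knows x0

loss_uniform = masked_diffusion_loss(uniform_logits, x0, xt, t)
loss_oracle = masked_diffusion_loss(oracle_logits, x0, xt, t)
print(f"t = {t:.2f}, masked positions = {n_masked}")
print(f"loss, uniform model : {loss_uniform:.3f}")
print(f"loss, oracle model  : {loss_oracle:.3f}")
```

Only masked positions contribute, so the uniform model pays $\log V$ per masked token (scaled by $1/t$) while the oracle pays essentially nothing.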

To produce a $64$-token sequence, an autoregressive model needs one forward pass per token; a masked diffusion LM finishes in $K = 8$ parallel passes. How many times fewer forward passes does the diffusion model use (the ratio $n/K$)?

$\dfrac{n}{K} = \dfrac{64}{8} = $ 8× fewer passes. Each diffusion pass is heavier (full bidirectional attention, no cheap KV cache), so wall-clock speedup is smaller than 8× — but the parallelism is real.
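The pass count can be traced with a toy version of the confidence-based unmasking loop. The denoiser below is a hypothetical stand-in that returns random tokens and random confidences; only the control flow (predict everything, commit the most confident positions, leave the rest masked) is meant to be illustrative, and the even unmasking schedule is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1                                  # hypothetical mask id
n, K, V = 64, 8, 100                       # sequence length, denoising steps, toy vocab

def toy_denoiser(tokens):
    """Stand-in for p_theta: one predicted token and one confidence per position."""
    pred = rng.integers(0, V, size=tokens.size)
    conf = rng.random(tokens.size)
    return pred, conf

x = np.full(n, MASK)                       # start fully masked
passes = 0
for step in range(K):
    pred, conf = toy_denoiser(x)           # one parallel forward pass
    passes += 1
    masked = np.flatnonzero(x == MASK)
    target_revealed = (n * (step + 1)) // K            # assumed even schedule
    n_keep = target_revealed - (n - masked.size)       # how many to commit this step
    keep = masked[np.argsort(conf[masked])[::-1][:n_keep]]
    x[keep] = pred[keep]                   # commit most confident; the rest stay masked

print(f"forward passes: diffusion {passes}, autoregressive {n}, ratio {n // passes}x")
```

A real sampler would replace `toy_denoiser` with the bidirectional transformer and might use a different unmasking schedule.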

INSTRUMENT 10.2 — MASKED DIFFUSION LM · PARALLEL UNMASKING · K STEPS

[Interactive widget: a denoising-steps control (default 4 steps) and a readout comparing forward passes, diffusion vs autoregressive.]

An autoregressive model needs one forward pass per token — strictly left to right. The diffusion LM fills the whole sequence in K passes, easy tokens first, hard ones last, with full bidirectional context throughout. Fewer steps = faster but rougher: the same compute/quality dial as image diffusion.

State of play: LLaDA-8B showed masked diffusion matching same-size autoregressive models on standard benchmarks; Mercury (Inception Labs) and Gemini Diffusion demonstrated 5–10× decode throughput on code; open efforts (Dream, LLaDA-MoE) are scaling the recipe. The honest ledger:

Property	Autoregressive	Masked diffusion
Generation order	strictly left→right	any order, parallel
Passes for n tokens	n (cheap each, KV-cached)	K ≈ 4–64 (expensive each: full bidirectional attention, no trivial KV cache)
Infilling / editing	awkward (needs special training)	native: it is the training task
Reversal curse, planning	struggles	bidirectional context helps measurably
Streaming UX, ecosystem, scaling proof	mature at 10²⁶ FLOPs	young (largest public runs ~10²³–10²⁴ FLOPs)

10.6

Where diffusion meets the LLM stack

The output side of multimodality. "Native image generation" in frontier assistants is predominantly a diffusion (or flow) decoder conditioned on the LLM's hidden states or generated tokens — the LLM plans, diffusion paints. Same pattern for music and increasingly video (Sora-class models: DiT over spacetime latent patches).
Speech: flow-matching vocoders and TTS (Voicebox/F5 lineage) deliver the naturalness; the LLM supplies the words and prosody plan.
Robotics & agents: diffusion policies generate action trajectories (smooth, multimodal distributions over continuous controls) while a VLM/LLM does the task reasoning — the same division of labor.
Drafting hybrids: a diffusion LM is a natural parallel drafter for speculative decoding (Chapter 08) — propose a whole block in one pass, let the AR model verify.
World models: interactive video generation (Genie-class) uses diffusion as a learned simulator — a possible training ground for agents beyond text.

Every piece is now on the table — two generative families, the full training stack, alignment, compression, serving. The Capstone assembles all of it: design a frontier model end-to-end with live numbers, then watch a prompt travel the entire pipeline you've just read.
