The text-to-image landscape
Text-to-image is a conditional generative modeling problem: given a caption \(c\), sample an image \(x\) from \(p(x \mid c)\). Three families have taken turns owning it, and the order matters because each fixed the previous one's failure.
GANs (Chapter on adversarial networks, Vol N) produced the first sharp synthetic faces but were notoriously hard to scale to open-vocabulary prompts: the discriminator's signal degrades as the conditioning space grows, and mode collapse drops whole concepts. Autoregressive models (DALL·E 1, Parti) reframed an image as a sequence of discrete tokens and predicted them like text; they scale well but need \(O(N)\) sequential decoding steps for an image of \(N\) tokens, often thousands. Diffusion models won the open-domain prize after 2021: stable to train, mode-covering by construction, and, once moved into a latent space (§3.3), fast enough to run on a laptop GPU.
| Family | How it generates | Strengths | Weaknesses |
|---|---|---|---|
| GAN | one forward pass G(z) | Instant sampling; razor-sharp at narrow domains | Training instability, mode collapse; hard to scale to arbitrary text |
| Autoregressive | predict image tokens one by one | Reuses the LLM stack; clean likelihood; unifies with text | Slow (\(O(N)\) steps); needs a good discrete tokenizer (VQ) |
| Diffusion | iterative denoising, ~20–50 steps | Stable training, full mode coverage, controllable via guidance | Many sequential steps; pixel-space versions are compute-hungry |
| Masked / parallel | unmask token batches in a few rounds | Far fewer steps than AR; token-native editing | Slightly behind diffusion on raw fidelity at scale |
The boundaries blur in 2026: state-of-the-art systems are increasingly diffusion transformers (DiT) that borrow the transformer backbone from the autoregressive camp, and frontier multimodal models (covered in MM 02) fold image generation into a single token stream. The denoising idea is the connective tissue, so we develop it first.
Diffusion for images, in one page
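A diffusion model is defined by two opposing processes. The forward process takes a clean image \(x_0\) and adds Gaussian noise in \(T\) small steps until nothing but static remains. It is fixed, has no parameters, and, crucially, admits a closed form that jumps straight to any timestep \(t\):
\[ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s) \tag{MM3.1} \]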
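The reverse process is what we learn. A neural network \(\varepsilon_\theta(x_t, t)\) is trained to predict the noise that was added, and the entire loss is a denoising regression:
\[ \mathcal{L} = \mathbb{E}_{x_0,\, \varepsilon,\, t} \left[ \lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2 \right] \tag{MM3.2} \]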
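As a rough illustration of the forward jump and the noise-prediction loss described above, the sketch below noises a toy 2D latent and evaluates the loss for two stand-in predictors. The schedule, the helper names `q_sample` and `denoising_loss`, and both predictors are assumptions made for illustration; a real system would train a neural network on this loss.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 40
betas = np.linspace(1e-3, 0.08, T)   # assumed toy noise schedule
abar = np.cumprod(1.0 - betas)       # cumulative signal fraction per step

def q_sample(x0, t, eps):
    """Jump from x0 straight to x_t with the closed form of the forward process."""
    return np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

def denoising_loss(predict_eps, x0):
    """One Monte Carlo sample of the loss: MSE between true and predicted noise."""
    t = int(rng.integers(0, T))      # random timestep index
    eps = rng.normal(size=x0.shape)  # the noise the predictor must recover
    xt = q_sample(x0, t, eps)
    return float(np.mean((eps - predict_eps(xt, t)) ** 2))

x0 = np.array([1.5, -0.8])

# Hypothetical stand-in predictors; a trained network would take their place.
def zero_predictor(xt, t):
    return np.zeros_like(xt)

def oracle_predictor(xt, t):
    # Cheats by using the known x0 to solve the closed form for the noise.
    return (xt - np.sqrt(abar[t]) * x0) / np.sqrt(1.0 - abar[t])

loss_zero = float(np.mean([denoising_loss(zero_predictor, x0) for _ in range(2000)]))
loss_oracle = float(np.mean([denoising_loss(oracle_predictor, x0) for _ in range(2000)]))
print("loss, always-zero predictor:", round(loss_zero, 3))
print("loss, oracle predictor     :", loss_oracle)
```

The always-zero predictor scores a loss near 1, the variance of the unit Gaussian noise, while the oracle scores essentially 0. Training moves a network from the first regime toward the second.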
Conditioning on a caption turns \(\varepsilon_\theta(x_t, t)\) into \(\varepsilon_\theta(x_t, t, c)\): the text embedding \(c\) (from a frozen text encoder such as CLIP or T5) is injected into the denoiser, classically through cross-attention layers in its blocks. The loss and the sampler are otherwise unchanged. The architecture that carries the denoiser was historically a U-Net; since 2023 the field has shifted to diffusion transformers (DiT), which tile the latent into patches and process them with a plain transformer, the design behind Stable Diffusion 3 and Sora.
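To make the conditioning path concrete, here is a minimal sketch of a single cross-attention layer in which image latent tokens query the caption tokens. It assumes single-head attention, random untrained weights, and made-up sizes; the function name `cross_attention` and all dimensions are hypothetical, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv):
    """Queries come from the image latents; keys and values come from the caption."""
    q = latent_tokens @ Wq                   # (n_latent, d_head)
    k = text_tokens @ Wk                     # (n_text, d_head)
    v = text_tokens @ Wv                     # (n_text, d_model)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_latent, n_text)
    # Residual connection: the caption information is added to each latent token.
    return latent_tokens + weights @ v, weights

# Hypothetical sizes: 16 latent tokens, 5 caption tokens.
n_latent, n_text, d_model, d_text, d_head = 16, 5, 8, 12, 8
latent_tokens = rng.normal(size=(n_latent, d_model))
text_tokens = rng.normal(size=(n_text, d_text))
Wq = rng.normal(size=(d_model, d_head)) * 0.1
Wk = rng.normal(size=(d_text, d_head)) * 0.1
Wv = rng.normal(size=(d_text, d_model)) * 0.1

out, weights = cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv)
print("output shape :", out.shape)
print("weights shape:", weights.shape)
```

Each row of `weights` is a distribution over caption tokens, so every latent position decides for itself which words to read from.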
# Toy latent diffusion: denoise a 2D latent toward a target (EQ MM3.1-3.2)
import numpy as np

rng = np.random.default_rng(0)
x0 = np.array([1.5, -0.8])            # the "clean" latent we want to recover
T = 40
betas = np.linspace(1e-3, 0.08, T)    # noise schedule
abar = np.cumprod(1 - betas)          # signal fraction at each step (EQ MM3.1)

# An ORACLE denoiser: a real net learns eps_theta; here we know the true noise.
xt = rng.normal(0, 1, 2)              # start from pure noise x_T ~ N(0,I)
for t in range(T - 1, -1, -1):
    eps_hat = (xt - np.sqrt(abar[t]) * x0) / np.sqrt(1 - abar[t])  # implied noise
    x0_hat = (xt - np.sqrt(1 - abar[t]) * eps_hat) / np.sqrt(abar[t])
    a_prev = abar[t - 1] if t > 0 else 1.0
    xt = np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * rng.normal(0, 1, 2) * (t > 0)

print("target latent x0 :", x0.round(3))
print("recovered x_hat:", xt.round(3))
print("L2 error :", float(np.linalg.norm(xt - x0).round(4)))
print("signal fraction abar: start", round(float(abar[-1]), 3),
      "-> end", round(float(abar[0]), 3))
Latent diffusion — DALL·E, Imagen, Stable Diffusion
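Pixel-space diffusion works but is brutally expensive: a single denoising step on a \(512\times512\times3\) image touches ~786k values, and you need dozens of steps. The 2022 breakthrough, latent diffusion, was to stop diffusing pixels. First train a variational autoencoder (VAE) that compresses an image into a small latent \(z = \mathcal{E}(x)\) and decodes it back, \(x \approx \mathcal{D}(z)\). Then run the entire diffusion process in that latent space, decoding only once at the end. The loss is the same denoising regression, taken over noised latents \(z_t\) instead of noised pixels:
\[ \mathcal{L}_{\text{LDM}} = \mathbb{E}_{x,\, c,\, \varepsilon,\, t} \left[ \lVert \varepsilon - \varepsilon_\theta(z_t, t, c) \rVert^2 \right], \qquad z_t = \sqrt{\bar\alpha_t}\, \mathcal{E}(x) + \sqrt{1 - \bar\alpha_t}\, \varepsilon \tag{MM3.3} \]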
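The savings are easy to put numbers on. The sketch below compares the pixel tensor with the latent tensor for a spatial downsampling factor of 8, the value quoted for Stable Diffusion in this section. The 4 latent channels are an assumption matching the original Stable Diffusion autoencoder, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Shape of the latent for spatial downsampling factor f (assumed values)."""
    return (height // f, width // f, latent_channels)

pixel_values = 512 * 512 * 3
h, w, c = latent_shape(512, 512)
latent_values = h * w * c

print("pixel tensor :", (512, 512, 3), "->", pixel_values, "values")
print("latent tensor:", (h, w, c), "->", latent_values, "values")
print("reduction    :", pixel_values // latent_values, "x fewer values per denoising step")
```

Under these assumptions each denoising step works on 16,384 values rather than 786,432, a 48-fold reduction before any architectural savings are counted.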
The three landmark systems differ mostly in where they spend their compute:
| System | Diffusion space | Text encoder | Signature idea |
|---|---|---|---|
| DALL·E 2 (2022) | CLIP latent → pixels | CLIP | A prior maps text → CLIP image embedding, then a diffusion decoder renders it ("unCLIP") |
| Imagen (2022) | pixel-space, cascaded | T5-XXL (frozen) | A large frozen language model gives the best prompt fidelity; super-resolution cascade 64→256→1024 |
| Stable Diffusion (2022) | VAE latent (f=8) | CLIP / T5 (SD3) | Latent diffusion (EQ MM3.3) — the open-weights model that democratized the field |
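Quality at sampling time is shaped by classifier-free guidance (CFG). The model is trained to denoise both with the caption and, for a small fraction of training examples, with the caption dropped (the unconditional case). At sampling you run it both ways and extrapolate along the direction the caption points, where \(s\) is the guidance scale and \(\varnothing\) denotes the dropped caption:
\[ \hat\varepsilon = \varepsilon_\theta(x_t, t, \varnothing) + s \left( \varepsilon_\theta(x_t, t, c) - \varepsilon_\theta(x_t, t, \varnothing) \right) \tag{MM3.4} \]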
# Classifier-free guidance: interpolate/extrapolate cond vs uncond (EQ MM3.4)
import numpy as np
import matplotlib.pyplot as plt

eps_uncond = np.array([0.20, -0.10, 0.05])  # prediction with prompt dropped
eps_cond = np.array([0.50, 0.30, 0.05])     # prediction with the prompt
direction = eps_cond - eps_uncond           # the "conditional direction"
scales = [0.0, 1.0, 3.0, 7.5, 15.0]
cond_norm = np.linalg.norm(eps_cond)

print("    s  guided eps  ||guided||  vs cond")
norms = []
for s in scales:
    guided = eps_uncond + s * direction     # EQ MM3.4
    norms.append(np.linalg.norm(guided))
    print(f"{s:5.1f}  {np.round(guided, 3)}  {norms[-1]:7.3f}  {norms[-1] / cond_norm:5.2f}x")
print("\ns=0 -> ignores prompt (uncond); s=1 -> plain conditional;")
print("s>1 EXTRAPOLATES past the conditional prediction: stronger prompt adherence,")
print("but the norm keeps growing -> the oversaturation you see at high CFG.")

# Plot the norm of the guided prediction against the guidance scale.
plt.plot(scales, norms, marker="o")
plt.xlabel("guidance scale s")
plt.ylabel("||guided eps||")
plt.show()
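The training half of classifier-free guidance, dropping the caption for a fraction of examples, can be sketched in a few lines. The drop rate of 0.1, the empty-string null caption, and the helper `maybe_drop_caption` are assumptions for illustration; real systems typically substitute the embedding of an empty prompt or a learned null token.

```python
import numpy as np

rng = np.random.default_rng(0)

NULL_CAPTION = ""   # assumed placeholder for "no caption"
P_DROP = 0.1        # assumed drop rate

def maybe_drop_caption(caption, p_drop=P_DROP):
    """Return the null caption with probability p_drop, else the real caption."""
    return NULL_CAPTION if rng.random() < p_drop else caption

batch = ["a red fox in snow"] * 10000
train_captions = [maybe_drop_caption(c) for c in batch]
dropped = sum(c == NULL_CAPTION for c in train_captions) / len(batch)
print(f"fraction trained unconditionally: {dropped:.3f}")
```

Because the same network sees both kinds of example, one set of weights can serve as the conditional and the unconditional predictor at sampling time.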
Autoregressive & masked image models
Diffusion is not the only route. The autoregressive camp treats an image the way a language model treats text: tokenize it, then predict the tokens. The tokenizer is a VQ-VAE / VQGAN, an autoencoder whose bottleneck snaps each latent vector to the nearest entry in a learned codebook, turning a \(32\times32\) latent grid into a sequence of \(N = 1024\) discrete indices \(z_1, \dots, z_N\). A transformer then models that sequence exactly like language:
\[ p(z_1, \dots, z_N \mid c) = \prod_{i=1}^{N} p_\theta(z_i \mid z_{<i}, c) \]
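The quantization step of the tokenizer, snapping each latent vector to its nearest codebook entry, can be sketched as follows. The codebook here is random rather than learned, and the codebook size, vector dimension, and function name `quantize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

codebook_size, dim = 512, 16                       # hypothetical sizes
codebook = rng.normal(size=(codebook_size, dim))   # stands in for a learned codebook
latents = rng.normal(size=(32, 32, dim))           # stands in for the encoder output

def quantize(latents, codebook):
    """Map every latent vector to the index of its nearest codebook entry."""
    flat = latents.reshape(-1, codebook.shape[1])  # (1024, dim)
    # Squared Euclidean distance via ||a||^2 - 2 a.b + ||b||^2
    d2 = ((flat ** 2).sum(axis=1, keepdims=True)
          - 2.0 * flat @ codebook.T
          + (codebook ** 2).sum(axis=1))
    indices = d2.argmin(axis=1)                    # one discrete token per position
    return indices, codebook[indices].reshape(latents.shape)

tokens, quantized = quantize(latents, codebook)
print("token sequence length:", tokens.shape[0])
print("first eight tokens   :", tokens[:8])
```

The integer array `tokens` is what the transformer models; the decoder later looks the indices back up in the codebook to rebuild the latent grid.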
Masked image modeling (MaskGIT, Muse) breaks the sequential bottleneck. Instead of one token at a time, the model is trained like a bidirectional masked autoencoder and, at generation, unmasks many tokens in parallel over a handful of rounds: predict all tokens, keep the most confident, re-mask the rest, repeat. A \(256\times256\) image tokenized to a \(16\times16\) grid costs 256 autoregressive steps; parallel decoding cuts that to ~8–12 rounds, comparable to a fast diffusion sampler, with the editing convenience of a token grid.
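The predict, keep, re-mask loop can be sketched as below. This is a simplified illustration, not MaskGIT's exact procedure: the "model" is a hypothetical stand-in that returns random logits, tokens are chosen greedily rather than sampled, and the cosine schedule for how many tokens stay masked is one common choice assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TOKENS, VOCAB, ROUNDS = 256, 1024, 8   # assumed sizes: 16x16 grid, 8 rounds
MASK = -1                                # sentinel for a still-masked position

def toy_model(tokens):
    """Hypothetical stand-in for the bidirectional transformer: random logits."""
    return rng.normal(size=(tokens.shape[0], VOCAB))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.full(N_TOKENS, MASK)
history = []                             # masked count after each round
for r in range(ROUNDS):
    probs = softmax(toy_model(tokens))
    pred = probs.argmax(axis=-1)         # greedy token choice per position
    conf = probs.max(axis=-1)            # confidence of that choice
    masked = tokens == MASK
    # Cosine schedule: how many positions should still be masked after this round.
    n_keep_masked = int(np.floor(N_TOKENS * np.cos(np.pi / 2 * (r + 1) / ROUNDS)))
    n_commit = int(masked.sum()) - n_keep_masked
    conf = np.where(masked, conf, -np.inf)   # only masked positions compete
    commit = np.argsort(-conf)[:n_commit]    # most confident masked positions
    tokens[commit] = pred[commit]
    history.append(int((tokens == MASK).sum()))
    print(f"round {r + 1}: committed {n_commit:3d}, still masked {history[-1]:3d}")
```

Early rounds commit only a few tokens and later rounds commit many, so the model fixes its most confident choices first and fills in the rest with that context in place.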
Where it stands in 2026. Pure autoregressive image models lost the fidelity crown to diffusion, but the architecture is resurgent inside unified multimodal models, where generating image tokens in the same stream as text is worth the speed penalty. Masked/parallel decoding sits between the two and increasingly underpins fast in-context image editing.
Video generation — Sora-class models
Video is an image problem with one extra axis: time. The naïve approach — generate frames independently — fails immediately, because nothing forces frame \(k+1\) to resemble frame \(k\); objects flicker, identities swap, motion is incoherent. The whole challenge is temporal coherence: the model must denoise the entire clip jointly so that pixels are consistent across both space and time.
The Sora-class recipe (2024 onward) is latent diffusion lifted into space-time. A video autoencoder compresses the clip into a grid of spacetime latent patches, small cubes spanning a few pixels and a few frames, and a diffusion transformer denoises the whole grid at once, attending across space and time together.
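A back-of-the-envelope count shows why joint space-time attention is expensive. Every number below is a hypothetical assumption chosen for illustration (compression factors, patch sizes, clip length, and resolution); published details of production systems differ and are often not disclosed.

```python
def spacetime_tokens(frames, height, width,
                     t_down=4, s_down=8, patch_t=1, patch_hw=2):
    """Count transformer tokens for a clip under assumed compression settings."""
    lat_t = frames // t_down      # temporal compression by the video autoencoder
    lat_h = height // s_down      # spatial compression
    lat_w = width // s_down
    return (lat_t // patch_t) * (lat_h // patch_hw) * (lat_w // patch_hw)

# Hypothetical clip: 120 frames at 1280x720.
n_tokens = spacetime_tokens(frames=120, height=720, width=1280)
print("spacetime tokens       :", n_tokens)
print("attention pairs / layer:", n_tokens ** 2)
```

Under these assumptions a short clip already yields 108,000 tokens, and full self-attention scales with the square of that count at every layer and every denoising step.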
Two honest caveats, current as of 2026. First, these systems are sometimes marketed as "world simulators": by predicting coherent future frames they implicitly learn 3D structure, object permanence, and crude physics. That is real but partial — they still violate physics (objects pass through each other, liquids defy conservation, counts drift) because nothing enforces a physical prior; coherence is learned statistically, not derived. Second, costs are steep: a few seconds of high-resolution video is many thousands of spacetime patches denoised over dozens of steps, which is why generation latency and price remain the binding constraints, not quality. Audio is usually generated by a separate model and aligned afterward (MM 04).
You can now paint a still or a moving image from a sentence — but the world also makes sound. Chapter 04 turns the denoising and tokenization machinery toward audio: text-to-speech, neural codecs, and music generation, where the time axis is everything and a single second is tens of thousands of samples.
References
- Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.
- Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents.
- Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
- Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.
- Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers.
- Chang, H., Zhang, H., Jiang, L., Liu, C. & Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer.
- Brooks, T. et al. (2024). Video Generation Models as World Simulators.