MULTIMODAL & WORLD MODELS · CHAPTER 03 / 06

Image & Video Generation

Text-to-image generation cycled through adversarial, autoregressive, and energy-based models before diffusion displaced them. One denoising procedure now produces photorealistic images and coherent video from a text prompt. A network is trained to remove a small amount of noise; run a few dozen times, it condenses a structured image out of static. Moving the process into a compressed latent space makes it run on a single GPU, and a transformer over space and time extends it to video.

LEVEL: CORE
READING TIME: ≈ 24 MIN
BUILDS ON: Vol II · CH 10 · MM 02
INSTRUMENTS: LATENT vs PIXEL · CFG SCALE · VIDEO COHERENCE
3.1 The text-to-image landscape

Text-to-image is a conditional generative modeling problem: given a caption \(c\), sample an image \(x\) from \(p(x \mid c)\). Three families have taken turns owning it, and the order matters because each fixed the previous one's failure.

GANs (Chapter on adversarial networks, Vol N) produced the first sharp synthetic faces but proved hard to scale to open-vocabulary prompts: adversarial training becomes unstable as the conditioning space grows, and mode collapse drops whole concepts. Autoregressive models (DALL·E 1, Parti) reframed an image as a sequence of discrete tokens and predicted them like text; they scale well with model size but generate sequentially, paying \(O(N)\) forward passes for an image of \(N\) tokens, typically a thousand or more. Diffusion models took the lead in open-domain generation after 2021: they are stable to train, mode-covering by construction, and, once moved into a latent space (§3.3), fast enough to run on a single consumer GPU.

| Family | How it generates | Strengths | Weaknesses |
| --- | --- | --- | --- |
| GAN | one forward pass G(z) | Instant sampling; razor-sharp at narrow domains | Training instability, mode collapse; hard to scale to arbitrary text |
| Autoregressive | predict image tokens one by one | Reuses the LLM stack; clean likelihood; unifies with text | Slow (\(O(N)\) steps); needs a good discrete tokenizer (VQ) |
| Diffusion | iterative denoising, ~20–50 steps | Stable training, full mode coverage, controllable via guidance | Many sequential steps; pixel-space versions are compute-hungry |
| Masked / parallel | unmask token batches in a few rounds | Far fewer steps than AR; token-native editing | Slightly behind diffusion on raw fidelity at scale |

The boundaries blur in 2026: state-of-the-art systems are increasingly diffusion transformers (DiT) that borrow the transformer backbone from the autoregressive camp, and frontier multimodal models (covered in MM 02) fold image generation into a single token stream. The denoising idea is the connective tissue, so we develop it first.

3.2 Diffusion for images, in one page

A diffusion model is defined by two opposing processes. The forward process takes a clean image \(x_0\) and adds Gaussian noise in \(T\) small steps until nothing but static remains. It is fixed, has no parameters, and — crucially — admits a closed form that jumps straight to any timestep \(t\):

EQ MM3.1 — FORWARD (NOISING) PROCESS $$ x_t \;=\; \sqrt{\bar\alpha_t}\, x_0 \;+\; \sqrt{1 - \bar\alpha_t}\;\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad \bar\alpha_t = \prod_{s=1}^{t}\alpha_s $$
\(\bar\alpha_t\) is the surviving signal fraction: it slides from \(\bar\alpha_0 \approx 1\) (the clean image) to \(\bar\alpha_T \approx 0\) (pure noise). The whole forward chain collapses into one reparameterized draw — you never simulate \(T\) steps to make a training example, you sample a random \(t\), corrupt \(x_0\) once, and ask the network to undo it. This is the DDPM training trick. Full derivation: Vol II · Ch 10.
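To make the closed form concrete, the following sketch compares the one-shot jump of EQ MM3.1 with an explicit step-by-step simulation of the forward chain. It assumes a linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) over \(T = 1000\) steps (the default in the DDPM paper) with \(\alpha_s = 1 - \beta_s\); the function names `q_sample` and `q_simulate` are illustrative and do not come from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)               # abar[t - 1] holds \bar\alpha_t for t = 1..T

def q_sample(x0, t, eps):
    """Closed-form jump to timestep t (1-indexed), EQ MM3.1."""
    return np.sqrt(abar[t - 1]) * x0 + np.sqrt(1.0 - abar[t - 1]) * eps

def q_simulate(x0, t, rng):
    """Reference path: apply t single-step noisings one after another."""
    x = x0.copy()
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

x0 = np.full(100_000, 2.0)              # many copies of one "pixel" with value 2.0
t = 300
jump = q_sample(x0, t, rng.standard_normal(x0.shape))
walk = q_simulate(x0, t, rng)

print("abar_t           :", round(float(abar[t - 1]), 4))
print("mean jump / walk :", round(float(jump.mean()), 3), round(float(walk.mean()), 3))
print("std  jump / walk :", round(float(jump.std()), 3), round(float(walk.std()), 3))
```

Both routes should report the same mean, \(\sqrt{\bar\alpha_t}\,x_0\), and the same standard deviation, \(\sqrt{1-\bar\alpha_t}\), up to sampling error, which is why training never needs the loop.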

The reverse process is what we learn. A neural network \(\varepsilon_\theta(x_t, t)\) is trained to predict the noise that was added, and the entire loss is a denoising regression:

EQ MM3.2 — DENOISING OBJECTIVE $$ \mathcal{L} \;=\; \mathbb{E}_{x_0,\,\varepsilon,\,t}\Big[\,\big\lVert\, \varepsilon - \varepsilon_\theta\!\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,\; t\big) \big\rVert^2\,\Big] $$
Predict the noise, not the image — equivalent up to the affine relation in EQ MM3.1, but far easier to optimize because the target \(\varepsilon\) is unit-variance at every \(t\). At sampling time you start from pure noise \(x_T \sim \mathcal{N}(0,I)\) and walk backward, subtracting a slice of the predicted noise at each step. The same trained \(\varepsilon_\theta\) is reused at every timestep — depth in time comes from iteration, not from more parameters.
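A minimal sketch of how one batch of EQ MM3.2 could be evaluated, assuming the same linear \(\beta\) schedule as the DDPM paper. The function `denoising_loss` and the stand-in `zero_model` are hypothetical names used only for illustration; a real system would pass a neural network in place of `zero_model`.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule

def denoising_loss(eps_model, x0_batch, rng):
    """Monte Carlo estimate of EQ MM3.2 for one batch of flattened images."""
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)             # one random timestep per example
    eps = rng.standard_normal(x0_batch.shape)          # the regression target
    a = abar[t - 1][:, None]                           # broadcast over the feature axis
    xt = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps   # corrupt x0 once (EQ MM3.1)
    pred = eps_model(xt, t)
    return float(np.mean(np.sum((eps - pred) ** 2, axis=1)))

def zero_model(xt, t):
    """Hypothetical stand-in for a network: always predicts zero noise."""
    return np.zeros_like(xt)

x0 = rng.standard_normal((4096, 8))                    # toy "images" with 8 values each
loss = denoising_loss(zero_model, x0, rng)
print("loss of the all-zeros predictor:", round(loss, 3))
```

Because the target has unit variance in every dimension regardless of \(t\), the all-zeros predictor scores close to the number of dimensions (8 here). That gives a fixed baseline any trained denoiser has to beat at every noise level.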

Conditioning on a caption turns \(\varepsilon_\theta(x_t, t)\) into \(\varepsilon_\theta(x_t, t, c)\): the text embedding \(c\) (from a frozen text encoder such as CLIP or T5) is injected through cross-attention at every block. Everything else is identical. The architecture that carries the denoiser is historically a U-Net; since 2023 the field has shifted to diffusion transformers (DiT), which tile the latent into patches and process them with a plain transformer — the design behind Stable Diffusion 3 and Sora.
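The cross-attention injection can be sketched in a few lines of NumPy. This is a single-head toy under stated assumptions: the dimensions, the random weight matrices, and the random "text embeddings" are placeholders, and a real denoiser would use learned multi-head layers fed by the frozen text encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: latent positions query the caption tokens."""
    Q = latent_tokens @ Wq                               # (n_latent, d_head)
    K = text_tokens @ Wk                                 # (n_text,   d_head)
    V = text_tokens @ Wv                                 # (n_text,   d_latent)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n_latent, n_text)
    return latent_tokens + weights @ V, weights          # residual update

n_latent, n_text = 64, 6           # e.g. an 8x8 latent grid and a 6-token caption
d_latent, d_text, d_head = 16, 32, 8

latent = rng.standard_normal((n_latent, d_latent))
text = rng.standard_normal((n_text, d_text))             # stand-in for encoder output
Wq = rng.standard_normal((d_latent, d_head)) * 0.1
Wk = rng.standard_normal((d_text, d_head)) * 0.1
Wv = rng.standard_normal((d_text, d_latent)) * 0.1

out, weights = cross_attention(latent, text, Wq, Wk, Wv)
print("output shape :", out.shape)
print("weight shape :", weights.shape)
print("rows sum to 1:", bool(np.allclose(weights.sum(axis=1), 1.0)))
```

Each latent position ends up with its own distribution over caption tokens, so different regions of the image can attend to different words while the latent keeps its shape.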

Question. At timestep \(t\) you measure the signal fraction \(\bar\alpha_t = 0.36\). Using EQ MM3.1, what is the coefficient \(\sqrt{\bar\alpha_t}\) that multiplies the clean image \(x_0\)?
Answer. \(\sqrt{\bar\alpha_t} = \sqrt{0.36} = 0.6\). The clean image contributes 60% of its amplitude, and the noise term carries \(\sqrt{1-0.36}=\sqrt{0.64}=0.8\). The two coefficients satisfy \(0.6^2 + 0.8^2 = 1\), so an \(x_0\) normalized to unit variance keeps unit total variance at every step.
PYTHON
# Toy latent diffusion: denoise a 2D latent toward a target (EQ MM3.1-3.2)
import numpy as np
rng = np.random.default_rng(0)

x0 = np.array([1.5, -0.8])                 # the "clean" latent we want to recover
T  = 40
betas = np.linspace(1e-3, 0.08, T)         # noise schedule
abar  = np.cumprod(1 - betas)              # signal fraction at each step (EQ MM3.1)

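# An ORACLE denoiser: a real net learns eps_theta; here we cheat and use the known x0,
# so the loop illustrates the mechanics of the reverse walk, not learning.
# With this short schedule abar[-1] is still well above 0, so x_T ~ N(0,I) only
# approximates the forward marginal; the oracle makes that harmless here.
xt = rng.normal(0, 1, 2)                   # start from pure noise  x_T ~ N(0,I)
for t in range(T - 1, -1, -1):
    eps_hat = (xt - np.sqrt(abar[t]) * x0) / np.sqrt(1 - abar[t])   # implied noise
    x0_hat  = (xt - np.sqrt(1 - abar[t]) * eps_hat) / np.sqrt(abar[t])  # clean-latent estimate
    a_prev  = abar[t - 1] if t > 0 else 1.0                         # signal fraction one step earlier
    # Re-noise the estimate to the previous level; no noise is added on the final step.
    xt = np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * rng.normal(0, 1, 2) * (t > 0)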

print("target latent  x0   :", x0.round(3))
print("recovered      x_hat:", xt.round(3))
print("L2 error            :", float(np.linalg.norm(xt - x0).round(4)))
print("signal fraction abar: start", round(float(abar[-1]),3),
      "-> end", round(float(abar[0]),3))
3.3 Latent diffusion: DALL·E, Imagen, Stable Diffusion

Pixel-space diffusion works but is expensive: a single denoising step on a \(512\times512\times3\) image touches ~786k values, and you need dozens of steps. The 2022 breakthrough, latent diffusion, was to stop diffusing pixels. First train a variational autoencoder (VAE) that compresses an image into a small latent \(z = \mathcal{E}(x)\) and decodes it back, \(x \approx \mathcal{D}(z)\). Then run the entire diffusion process in that latent space, decoding only once at the end.

EQ MM3.3 — LATENT COMPRESSION RATIO $$ z = \mathcal{E}(x),\quad x \in \mathbb{R}^{H \times W \times 3},\quad z \in \mathbb{R}^{\frac{H}{f} \times \frac{W}{f} \times c_z}, \qquad \text{ratio} = \frac{3\,H\,W}{c_z\,(H/f)(W/f)} = \frac{3 f^2}{c_z} $$
With the canonical Stable Diffusion settings — downsampling factor \(f = 8\) and latent channels \(c_z = 4\) — the spatial grid shrinks 64× and the value count drops by \(3 f^2 / c_z = 3\cdot 64 / 4 = 48\times\). Diffusion runs on the 48×-smaller tensor; the VAE handles all the perceptual detail. This single move is what put a text-to-image model on consumer hardware. (SD3 uses 16 latent channels — better fidelity, a smaller 12× ratio.)
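The arithmetic of EQ MM3.3 is easy to check directly. The helper names `latent_shape` and `compression_ratio` below are illustrative, and the sketch assumes the image sides divide evenly by \(f\).

```python
def latent_shape(H, W, f, c_z):
    """Shape of z = E(x) for an H x W x 3 image, downsampling factor f, c_z channels."""
    if H % f or W % f:
        raise ValueError("H and W must be divisible by the downsampling factor f")
    return (H // f, W // f, c_z)

def compression_ratio(H, W, f, c_z):
    """Pixel value count divided by latent value count (EQ MM3.3)."""
    h, w, c = latent_shape(H, W, f, c_z)
    return (3 * H * W) / (h * w * c)

settings = [
    ("f=8, c_z=4", 8, 4),
    ("f=8, c_z=16", 8, 16),
    ("f=1, c_z=3 (pixels)", 1, 3),
]
for name, f, c_z in settings:
    shape = latent_shape(512, 512, f, c_z)
    ratio = compression_ratio(512, 512, f, c_z)
    print(f"{name:22s} latent {shape}  ratio {ratio:.0f}x")
```

For a \(512\times512\) image the \(f = 8\), \(c_z = 4\) setting gives a \(64\times64\times4\) latent of 16,384 values against 786,432 pixel values, which is the 48× figure quoted above.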

The three landmark systems differ mostly in where they spend their compute:

| System | Diffusion space | Text encoder | Signature idea |
| --- | --- | --- | --- |
| DALL·E 2 (2022) | CLIP latent → pixels | CLIP | A prior maps text → CLIP image embedding, then a diffusion decoder renders it ("unCLIP") |
| Imagen (2022) | pixel-space, cascaded | T5-XXL (frozen) | A large frozen language model gives the best prompt fidelity; super-resolution cascade 64→256→1024 |
| Stable Diffusion (2022) | VAE latent (f=8) | CLIP / T5 (SD3) | Latent diffusion (EQ MM3.3): the open-weights model that democratized the field |

Quality at sampling time is shaped by classifier-free guidance (CFG). The model is trained to denoise both with the caption and, occasionally, with the caption dropped (the unconditional case). At sampling you run it both ways and extrapolate along the direction the caption points:

EQ MM3.4 — CLASSIFIER-FREE GUIDANCE $$ \tilde\varepsilon_\theta(x_t, t, c) \;=\; \varepsilon_\theta(x_t, t, \varnothing) \;+\; s\,\big[\,\varepsilon_\theta(x_t, t, c) - \varepsilon_\theta(x_t, t, \varnothing)\,\big] $$
\(s\) is the guidance scale. \(s = 1\) is ordinary conditional sampling; \(s = 0\) ignores the prompt entirely. Pushing \(s\) up (typical range 5–12) increases prompt adherence at the cost of diversity and, when pushed too high, of realism: colors oversaturate and textures fry. The term in brackets is the "conditional direction"; starting from the unconditional estimate, CFG walks \(s\) times as far along it as plain conditional sampling (\(s = 1\)) would. No separate classifier is needed, hence the name.
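The training side of CFG, dropping the caption for a fraction of examples, can be sketched as follows. The drop rate of 0.1, the all-zeros null embedding, and the function name `drop_captions` are assumptions made for illustration; the text above only says the caption is dropped "occasionally".

```python
import numpy as np

rng = np.random.default_rng(0)
P_DROP = 0.1   # assumed drop rate, not taken from the text

def drop_captions(text_emb, null_emb, p_drop, rng):
    """Swap each example's caption embedding for the null embedding with probability p_drop."""
    drop = rng.random(text_emb.shape[0]) < p_drop            # (batch,) boolean mask
    mixed = np.where(drop[:, None, None], null_emb[None], text_emb)
    return mixed, drop

batch, n_tokens, d = 1000, 6, 32
text_emb = rng.standard_normal((batch, n_tokens, d))         # stand-in for encoder output
null_emb = np.zeros((n_tokens, d))                            # hypothetical "empty caption" embedding

mixed, drop = drop_captions(text_emb, null_emb, P_DROP, rng)
print("fraction trained unconditionally:", float(drop.mean()))
```

Because the same network sees both kinds of examples, it can produce \(\varepsilon_\theta(x_t, t, c)\) and \(\varepsilon_\theta(x_t, t, \varnothing)\) at sampling time, which is all EQ MM3.4 requires.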
Question. A pixel's unconditional noise estimate is \(\varepsilon_\varnothing = 0.20\) and its conditional estimate is \(\varepsilon_c = 0.50\). At guidance scale \(s = 7.5\), what guided noise value \(\tilde\varepsilon\) does EQ MM3.4 produce?
Answer. \(\tilde\varepsilon = 0.20 + 7.5\,(0.50 - 0.20) = 0.20 + 7.5 \times 0.30 = 0.20 + 2.25 = 2.45\). The guided estimate is pushed far beyond either endpoint; that extrapolation is exactly why high CFG sharpens the prompt but can overshoot into oversaturation.
Question. True or false: raising the classifier-free guidance scale \(s\) increases how closely the sample obeys the prompt, but reduces the diversity of samples drawn from the same caption.
Answer. True. EQ MM3.4 extrapolates \(s\) times along the conditional direction \([\varepsilon_c - \varepsilon_\varnothing]\). A larger \(s\) pulls every sample harder toward what the caption specifies, improving adherence, while collapsing the spread of outcomes, since all samples are dragged in the same direction; push it too far and realism degrades into oversaturation.
Question. True or false: latent diffusion runs the noising/denoising process in a compressed latent space produced by a VAE encoder, not directly on pixels, and that is what makes EQ MM3.3's 48× reduction possible.
Answer. True. This is precisely the latent-diffusion construction. The VAE encoder \(\mathcal{E}\) compresses the image to \(z\), the U-Net/DiT denoises \(z\), and the decoder \(\mathcal{D}\) renders pixels only once at the end. Diffusing the 48×-smaller latent, rather than the full pixel grid, is what dropped the compute enough to run on a single GPU.
PYTHON
# Classifier-free guidance: interpolate/extrapolate cond vs uncond (EQ MM3.4)
import numpy as np

eps_uncond = np.array([0.20, -0.10, 0.05])    # prediction with prompt dropped
eps_cond   = np.array([0.50,  0.30, 0.05])    # prediction with the prompt
direction  = eps_cond - eps_uncond            # the "conditional direction"

cond_norm = np.linalg.norm(eps_cond)          # reference norm for the "vs cond" column
print(" s     guided eps              ||guided||   vs cond")
for s in [0.0, 1.0, 3.0, 7.5, 15.0]:
    guided = eps_uncond + s * direction       # EQ MM3.4
    norm = np.linalg.norm(guided)
    print(f"{s:5.1f}  {np.round(guided,3)}   {norm:7.3f}   {norm / cond_norm:6.2f}x")

print("\ns=0 -> ignores prompt (uncond); s=1 -> plain conditional;")
print("s>1 EXTRAPOLATES past the conditional sample: stronger prompt adherence,")
print("but the norm keeps growing -> the oversaturation you see at high CFG.")
# plot_xy is the page's in-browser plotting helper; skip the plot where it is not defined.
if "plot_xy" in globals():
    plot_xy([0,1,3,7.5,15],
            [np.linalg.norm(eps_uncond + s*direction) for s in [0,1,3,7.5,15]])
INSTRUMENT MM3.1: LATENT vs PIXEL DIFFUSION (EQ MM3.3 · COMPRESSION ↔ COST)
Readouts: pixel tensor size (3·H·W), latent tensor size, compression (3f²/c_z).
At f = 1 the latent is the pixel grid: this is pixel-space diffusion, and the cost is at its maximum. At f = 8 with 4 channels (Stable Diffusion) the per-step value count drops ~48×. Raising the channel count (SD3 uses 16) trades a little compression back for fidelity. In attention layers the cost grows roughly with the square of the number of latent positions, so a smaller grid pays off more than linearly.
INSTRUMENT MM3.2: GUIDANCE SCALE (CFG) (EQ MM3.4 · ADHERENCE ↔ DIVERSITY)
Readouts: regime, prompt adherence, sample diversity.
The grey cloud is unconditional samples; the mint arrow is the conditional direction; the bright dot is where guidance lands a sample at scale s. At s = 0 samples ignore the prompt and scatter widely; the sweet spot is ~5–9; past ~15 the dot shoots far outside the data cloud, which is the oversaturated, fried look of excessive CFG.
3.4 Autoregressive & masked image models

Diffusion is not the only route. The autoregressive camp treats an image the way a language model treats text: tokenize it, then predict the tokens. The tokenizer is a VQ-VAE / VQGAN — an autoencoder whose bottleneck snaps each latent vector to the nearest entry in a learned codebook, turning a \(32\times32\) latent grid into a sequence of \(1024\) discrete indices. A transformer then models that sequence exactly like language:

EQ MM3.5 — AUTOREGRESSIVE IMAGE LIKELIHOOD $$ p_\theta(z_{1:N} \mid c) \;=\; \prod_{i=1}^{N} p_\theta\!\big(z_i \mid z_{<i},\, c\big), \qquad z_i \in \{1,\dots,K\}\ \text{(codebook of size } K) $$
Identical to next-token prediction (Vol II · Ch 01), just over image tokens. This is how DALL·E 1 and Parti work, and it unifies cleanly with text — a single transformer can interleave words and image tokens. The cost is \(N\) sequential steps; for \(N = 1024\) that is 1024 forward passes per image, the autoregressive tax. Fidelity is bounded by the codebook reconstruction quality.
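Both halves of this pipeline, the codebook lookup and the token-by-token sampling loop of EQ MM3.5, fit in a short sketch. Everything here is a toy: the random codebook, the 4×4 grid, and `toy_logits` (a hypothetical stand-in for the transformer) are assumptions chosen only to make the control flow visible.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 16, 4                                   # toy codebook: K entries of dimension d
codebook = rng.standard_normal((K, d))

def quantize(latents, codebook):
    """Snap each latent vector (a row of `latents`) to its nearest codebook index."""
    dist2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (N, K)
    return dist2.argmin(axis=1)

def sample_autoregressive(logits_fn, n_tokens, rng):
    """Draw z_1..z_N one at a time from p(z_i | z_<i), as in EQ MM3.5."""
    tokens = []
    for _ in range(n_tokens):
        logits = logits_fn(tokens)             # one forward pass per generated token
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return np.array(tokens)

counter = {"calls": 0}                         # counts forward passes

def toy_logits(prefix):
    """Hypothetical stand-in for the transformer: mildly prefers repeating the last token."""
    counter["calls"] += 1
    logits = np.zeros(K)
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits

grid = rng.standard_normal((4 * 4, d))         # a 4x4 latent grid, flattened in raster order
indices = quantize(grid, codebook)
sampled = sample_autoregressive(toy_logits, n_tokens=16, rng=rng)
print("encoded indices:", indices)
print("sampled tokens :", sampled)
print("forward passes :", counter["calls"])
```

The pass counter equals the number of tokens, which is the autoregressive tax in miniature: a 32×32 grid would need 1024 such calls.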

Masked image modeling (MaskGIT, Muse) breaks the sequential bottleneck. Instead of predicting one token at a time, the model is trained like a bidirectional masked autoencoder and, at generation, unmasks many tokens in parallel over a handful of rounds: predict all tokens, keep the most confident, re-mask the rest, repeat. In MaskGIT, a \(256\times256\) image is a \(16\times16\) grid of 256 tokens, so the ~256 autoregressive steps drop to ~8–12 parallel rounds. That is comparable to a fast diffusion sampler, with the editing convenience of a token grid.
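The predict, keep, re-mask loop can be sketched as below. This is a simplified illustration, not the published algorithm in full: it takes the arg-max token instead of sampling, uses a cosine schedule for how many positions stay masked after each round, and replaces the bidirectional transformer with `toy_predictor`, a hypothetical stand-in that returns random logits.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, ROUNDS = 16, 256, 8          # codebook size, tokens per image, decoding rounds
MASK = -1                          # sentinel for "not decided yet"

def toy_predictor(tokens):
    """Hypothetical stand-in for the bidirectional transformer: random logits of shape (N, K)."""
    return rng.standard_normal((tokens.shape[0], K))

def parallel_decode(predict, n_tokens, rounds):
    tokens = np.full(n_tokens, MASK)
    history = []                                   # masked count after each round
    for r in range(rounds):
        logits = predict(tokens)                   # one forward pass covers every position
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        guess = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        masked = tokens == MASK
        tokens = np.where(masked, guess, tokens)   # tentatively fill every masked slot
        conf = np.where(masked, conf, np.inf)      # committed tokens are never re-masked
        # Cosine schedule: how many positions remain masked after this round.
        n_mask = int(np.floor(n_tokens * np.cos(np.pi / 2 * (r + 1) / rounds)))
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK   # re-mask the least confident
        history.append(int((tokens == MASK).sum()))
    return tokens, history

tokens, history = parallel_decode(toy_predictor, N, ROUNDS)
print("masked after each round:", history)
print("forward passes         :", ROUNDS, "instead of", N)
```

The masked count falls to zero in the final round, so the whole grid is produced in `ROUNDS` forward passes rather than one per token.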

Where it stands in 2026. Pure autoregressive image models lost the fidelity crown to diffusion, but the architecture is resurgent inside unified multimodal models, where generating image tokens in the same stream as text is worth the speed penalty. Masked/parallel decoding sits between the two and increasingly underpins fast in-context image editing.

3.5 Video generation: Sora-class models

Video is an image problem with one extra axis: time. The naïve approach — generate frames independently — fails immediately, because nothing forces frame \(k+1\) to resemble frame \(k\); objects flicker, identities swap, motion is incoherent. The whole challenge is temporal coherence: the model must denoise the entire clip jointly so that pixels are consistent across both space and time.

The Sora-class recipe (2024 onward) is latent diffusion lifted into space-time. A video autoencoder compresses the clip into a grid of spacetime latent patches — small cubes spanning a few pixels and a few frames — and a diffusion transformer denoises the whole grid at once, attending across space and time together:

EQ MM3.6 — SPACETIME PATCH COUNT $$ N \;=\; \underbrace{\frac{T_f}{p_t}}_{\text{temporal}} \times \underbrace{\frac{H}{p_h}\,\frac{W}{p_w}}_{\text{spatial}}, \qquad \text{cost}_{\text{attn}} \sim O(N^2) $$
\(T_f\) frames are cut into chunks of \(p_t\); each frame's \(H\times W\) latent is cut into \(p_h\times p_w\) patches. Joint space-time attention is what enforces coherence — but its \(O(N^2)\) cost is why video is so much harder than images: doubling the clip length quadruples the attention bill. Production systems use factorized or windowed attention (separate spatial and temporal passes) to tame it. The same model handles variable resolution and duration by simply changing \(N\).
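A quick calculation shows how fast the attention bill grows and how much a factorized layout saves. The latent clip sizes and patch sizes below are made-up illustrative values, not the settings of any production system, and "pairs" counts query-key pairs as a proxy for attention cost.

```python
def spacetime_patches(T_f, H, W, p_t, p_h, p_w):
    """Temporal count, spatial count, and total N from EQ MM3.6 for a latent clip."""
    if T_f % p_t or H % p_h or W % p_w:
        raise ValueError("clip dimensions must be divisible by the patch sizes")
    n_temporal = T_f // p_t
    n_spatial = (H // p_h) * (W // p_w)
    return n_temporal, n_spatial, n_temporal * n_spatial

def joint_pairs(n_temporal, n_spatial):
    """Query-key pairs when every patch attends to every other patch: N^2."""
    n = n_temporal * n_spatial
    return n * n

def factorized_pairs(n_temporal, n_spatial):
    """Pairs for a spatial pass within each time slice plus a temporal pass per location."""
    spatial_pass = n_temporal * n_spatial ** 2
    temporal_pass = n_spatial * n_temporal ** 2
    return spatial_pass + temporal_pass

for frames in (16, 32):                        # doubling the clip length
    n_t, n_s, n = spacetime_patches(frames, 32, 32, p_t=2, p_h=2, p_w=2)
    print(f"T_f={frames:2d}  N={n:5d}  joint={joint_pairs(n_t, n_s):>10,d}"
          f"  factorized={factorized_pairs(n_t, n_s):>9,d}")
```

Doubling the frame count doubles \(N\) and quadruples the joint pair count, while the factorized count grows much more slowly. The trade-off is that factorized attention no longer lets every patch see every other patch in a single layer.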

Two honest caveats, current as of 2026. First, these systems are sometimes marketed as "world simulators": by predicting coherent future frames they implicitly learn 3D structure, object permanence, and crude physics. That is real but partial — they still violate physics (objects pass through each other, liquids defy conservation, counts drift) because nothing enforces a physical prior; coherence is learned statistically, not derived. Second, costs are steep: a few seconds of high-resolution video is many thousands of spacetime patches denoised over dozens of steps, which is why generation latency and price remain the binding constraints, not quality. Audio is usually generated by a separate model and aligned afterward (MM 04).

INSTRUMENT MM3.3: FRAME COHERENCE (JOINT vs PER-FRAME DENOISING)
Readouts: regime, mean frame-to-frame jump, attention cost ~\(O(N^2)\).
Each strip is one frame; the dot is a tracked object. At coherence 0 the model denoises every frame independently and the object teleports, the classic flicker failure. As coherence rises, joint space-time attention pins the trajectory into a smooth path. Adding frames makes the \(O(N^2)\) cost climb: that quadratic is why long video is the hard part.
NEXT

You can now paint a still or a moving image from a sentence — but the world also makes sound. Chapter 04 turns the denoising and tokenization machinery toward audio: text-to-speech, neural codecs, and music generation, where the time axis is everything and a single second is tens of thousands of samples.

3.R References

  1. Ho, J., Jain, A. & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020 — the noise-prediction objective and forward reparameterization behind EQ MM3.1–MM3.2.
  2. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — Stable Diffusion: diffusion in a VAE latent (EQ MM3.3), the move that put text-to-image on a single GPU.
  3. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. DALL·E 2 — the CLIP-latent prior plus diffusion decoder ("unCLIP") of §3.3.
  4. Ho, J. & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop — the conditional/unconditional extrapolation of EQ MM3.4.
  5. Saharia, C. et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Imagen — a large frozen T5 text encoder plus a pixel-space super-resolution cascade.
  6. Peebles, W. & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023 — DiT, the transformer backbone that replaced the U-Net and scaled to SD3 and Sora.
  7. Chang, H., Zhang, H., Jiang, L., Liu, C. & Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022 — parallel masked-token decoding, the few-round alternative to autoregression in §3.4.
  8. Brooks, T. et al. (2024). Video Generation Models as World Simulators. OpenAI (Sora) — spacetime latent patches and a diffusion transformer over video, the basis of §3.5 and EQ MM3.6.