DEEP LEARNING · CHAPTER 06 / 07

Generative Adversarial Networks

Most generative models estimate how likely the data is and climb that gradient. GANs discard the likelihood entirely. Adversarial training pits a generator against a discriminator so that the two improve together, and it produced the first photorealistic generators. The generator never sees a real image directly; it learns from the verdicts of an opponent that is itself learning to catch it. A learned, moving loss function is what brought faces, fonts, and textures into focus where fixed objectives had blurred.

LEVEL: ADVANCED | READING TIME: ≈ 26 MIN | BUILDS ON: DL 03 · 05 | INSTRUMENTS: TRAINING SIM · MODE COLLAPSE · LATENT WALK
6.1

The adversarial game

A generative model wants to turn cheap noise into samples that look like real data. The autoencoders of the previous chapter did this by reconstructing inputs through a bottleneck and minimizing a pixel-wise reconstruction loss; the trouble is that pixel-wise losses reward blur — averaging two plausible faces gives a low error and a smeared ghost. GANs replace that hand-chosen loss with a second network whose only job is to tell real from fake, and they train the generator to defeat it.

The setup is two players. A generator \(G\) maps a latent vector \(z\), drawn from a fixed simple prior \(p_z\) (usually a unit Gaussian), to a sample \(G(z)\) in data space. A discriminator \(D\) takes any sample \(x\) and outputs \(D(x) \in (0,1)\), its estimated probability that \(x\) is real rather than generated. Goodfellow's 2014 metaphor has stuck because it is exact: \(G\) is a counterfeiter printing banknotes, \(D\) is the police learning to spot forgeries, and the two improve in lockstep until the fakes are indistinguishable from currency.

[Figure: noise z ~ p(z) feeds the GENERATOR, producing G(z); generated samples G(z) and REAL DATA x both feed the DISCRIMINATOR, which outputs D(x) ∈ (0,1): real or fake?]

The asymmetry of information is the whole trick. \(D\) is trained on labelled examples — it sees real data and generated data and knows which is which. \(G\) is never shown a single real example directly; its only learning signal is the gradient that flows back through \(D\) telling it which direction would have made its sample look more real. The loss function is therefore not fixed: it is \(D\) itself, and \(D\) is moving. This is the conceptual leap that separates GANs from everything before — the objective the generator climbs is learned and adversarial, sharpening exactly where the generator is currently weak.

Because there is no explicit density, a vanilla GAN cannot tell you the likelihood of a held-out image — it is an implicit generative model. You can sample from it freely but cannot score samples. That is a feature for image realism (no blur-inducing likelihood term) and a liability for evaluation, which is why the field leans on proxy metrics like the Fréchet Inception Distance instead of held-out log-likelihood.

6.2

The minimax objective

Write down what each player wants and you get a single two-player value function. The discriminator wants \(D(x)\) near 1 on real data and near 0 on fakes; the generator wants the opposite. Goodfellow et al. packaged both into one minimax game:

EQ N6.1 — THE MINIMAX GAME $$ \min_{G}\,\max_{D}\; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D(G(z))\big)\big] $$
The first term rewards \(D\) for scoring real data high; the second rewards \(D\) for scoring fakes low and rewards \(G\) (which minimizes) for pushing \(D(G(z))\) back toward 1. It is one objective, optimized in opposite directions by the two networks. In practice you alternate: one or a few gradient ascent steps on \(D\), then one gradient descent step on \(G\), each on a fresh minibatch of noise and data.

For a fixed generator, the inner maximization has a closed-form optimum. Treating \(V\) pointwise in \(x\), the optimal discriminator is the posterior probability that a sample is real under the two densities:

EQ N6.2 — THE OPTIMAL DISCRIMINATOR $$ D^{*}_{G}(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)} $$
\(p_g\) is the (implicit) density the generator induces over data space. Where real and fake densities are equal, \(D^{*}(x) = \tfrac{1}{2}\): the detective is reduced to a coin flip. This is the fixed point of the whole game — when the generator's distribution matches the data, no discriminator can do better than chance, and the most informed possible verdict on every input is exactly 0.5.
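EQ N6.2 is easy to check numerically. The sketch below evaluates \(D^{*}\) at a few points for two 1D Gaussians. The densities (data at mean 2, generator at mean 0, both with unit standard deviation) and the helper names `gauss_pdf` and `d_star` are illustrative assumptions, not part of the chapter's instruments.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def d_star(x, p_data, p_g):
    """Optimal discriminator of EQ N6.2 for two given density functions."""
    pd, pg = p_data(x), p_g(x)
    return pd / (pd + pg)

# Assumed toy densities: data ~ N(2, 1), generator ~ N(0, 1).
p_data = lambda x: gauss_pdf(x, 2.0, 1.0)
p_g = lambda x: gauss_pdf(x, 0.0, 1.0)

# Left of x = 1 the generator density dominates, right of it the data density does.
xs = np.array([-1.0, 1.0, 3.0])
print("D*(x) for a mismatched generator:", d_star(xs, p_data, p_g))

# Once the generator density equals the data density, the verdict is 0.5 everywhere.
matched = d_star(np.linspace(-3.0, 7.0, 11), p_data, p_data)
print("D*(x) for a matched generator:   ", matched)
```

With equal standard deviations the two densities cross at the midpoint \(x = 1\), which is where the mismatched case also returns one half.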

Substitute \(D^{*}_G\) back into \(V\) and the generator's objective collapses to a recognizable distance between distributions. Up to constants it becomes the Jensen–Shannon divergence:

EQ N6.3 — WHAT G ACTUALLY MINIMIZES $$ \max_{D} V(D,G) \;=\; 2\,\mathrm{JSD}\!\big(p_{\text{data}} \,\|\, p_g\big) - \log 4, \qquad \mathrm{JSD}(p\|q) = \tfrac{1}{2}\mathrm{KL}\!\big(p \,\|\, m\big) + \tfrac{1}{2}\mathrm{KL}\!\big(q \,\|\, m\big),\;\; m = \tfrac{p+q}{2} $$
With the optimal \(D\) plugged in, the generator is minimizing the Jensen–Shannon divergence between the data and its own samples. The global minimum is \(\mathrm{JSD} = 0\), reached only when \(p_g = p_{\text{data}}\), giving value \(-\log 4 \approx -1.386\). The objective is principled — but JSD is also where the trouble starts (§6.3): when the two distributions barely overlap, JSD saturates to the constant \(\log 2\) and its gradient vanishes.
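The identity in EQ N6.3 can be verified exactly on small discrete distributions, where expectations are finite sums. The following is a minimal sketch; the three four-bin example distributions and the function names are made up for illustration.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; entries with p = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence as defined in EQ N6.3."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D* = p_data / (p_data + p_g) from EQ N6.2."""
    d = p_data / (p_data + p_g)
    real_term = np.sum(p_data[p_data > 0] * np.log(d[p_data > 0]))
    fake_term = np.sum(p_g[p_g > 0] * np.log(1 - d[p_g > 0]))
    return float(real_term + fake_term)

# Hypothetical 4-bin distributions: identical, partly overlapping, and disjoint.
cases = {
    "matched": (np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.25, 0.25, 0.25, 0.25])),
    "partial overlap": (np.array([0.7, 0.2, 0.1, 0.0]), np.array([0.1, 0.2, 0.3, 0.4])),
    "disjoint": (np.array([0.5, 0.5, 0.0, 0.0]), np.array([0.0, 0.0, 0.5, 0.5])),
}
for name, (p_data, p_g) in cases.items():
    lhs = value_at_optimal_d(p_data, p_g)
    rhs = 2 * jsd(p_data, p_g) - np.log(4)
    print(f"{name:16s} V(D*,G) = {lhs:+.6f}   2*JSD - log 4 = {rhs:+.6f}")
```

The matched case lands on \(-\log 4\) and the disjoint case on 0, which is \(2\log 2 - \log 4\): the two ends of the range the generator's objective can take.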

One practical wrinkle ships in every real implementation. The generator term \(\log(1 - D(G(z)))\) has almost no gradient early in training, exactly when \(D\) easily rejects the generator's garbage (\(D(G(z)) \approx 0\)). So Goodfellow recommended the non-saturating reformulation: instead of minimizing \(\log(1 - D(G(z)))\), the generator maximizes \(\log D(G(z))\). Same fixed point, far stronger gradients when the generator is losing — the version everyone actually trains.
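The size of the effect is easy to see by differentiating both generator losses with respect to the discriminator's logit \(\ell\), where \(D(G(z)) = \sigma(\ell)\). The sketch below assumes that parameterization; the logit values are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Discriminator logits on fake samples, from "confidently fake" to "looks real".
logits = np.array([-6.0, -2.0, 0.0, 2.0])
d_fake = sigmoid(logits)

# Saturating loss  log(1 - sigmoid(l)):  d/dl = -sigmoid(l)
grad_saturating = -d_fake
# Non-saturating loss  -log(sigmoid(l)): d/dl = -(1 - sigmoid(l))
grad_non_saturating = -(1.0 - d_fake)

for l, d, gs, gn in zip(logits, d_fake, grad_saturating, grad_non_saturating):
    print(f"logit {l:+.1f}  D(G(z)) {d:.4f}  saturating {gs:+.4f}  non-saturating {gn:+.4f}")
```

Both gradients are negative, so a descent step raises the logit either way. The difference is magnitude: when \(D(G(z))\) is near 0 the saturating gradient is near 0 while the non-saturating one is near \(-1\).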

At the global optimum of the GAN game the generator has matched the data distribution, \(p_g = p_{\text{data}}\). Using EQ N6.2, what value does the optimal discriminator \(D^{*}(x)\) output for every input \(x\)?
Substituting \(p_g = p_{\text{data}}\) into EQ N6.2 gives \(D^{*}_G(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \dfrac{p_{\text{data}}(x)}{2\,p_{\text{data}}(x)} = \dfrac{1}{2}\), that is, 0.5. The discriminator is reduced to a coin flip on every input: it can no longer tell real from fake, which is the definition of the generator having won.
When the data and generator distributions have disjoint support, the Jensen–Shannon divergence \(\mathrm{JSD}(p_{\text{data}}\|p_g)\) hits its maximum value. What is that maximum, in nats? (It is \(\ln 2\).)
For disjoint \(p\) and \(q\), the mixture \(m = (p+q)/2\) equals \(p/2\) wherever \(p\) lives and \(q/2\) wherever \(q\) lives, so each KL term is \(\int p \log\frac{p}{p/2} = \log 2\), giving \(\mathrm{JSD} = \tfrac12\log 2 + \tfrac12\log 2 = \log 2 = \ln 2 \approx \) 0.693 nats. Because this is a flat ceiling, its gradient is zero — the vanishing-gradient failure of §6.3.
PYTHON · RUNNABLE IN-BROWSER
# 1D GAN toy: a 2-param generator (mu, s) fits a target distribution; print D accuracy
import numpy as np
rng = np.random.default_rng(0)

def sig(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))
real = rng.normal(2.0, 0.5, 2000)            # target: mean 2.0, std 0.5
a, b, mu, s = 1.0, 0.0, 0.0, 1.0             # D(x)=sig(a x + b); G(z)=mu + s z
lr = 0.05

for it in range(400):
    z = rng.normal(0, 1, 2000); fake = mu + s * z
    pr, pf = sig(a*real + b), sig(a*fake + b)            # D step: ascend V
    a += lr * (np.mean((1-pr)*real) - np.mean(pf*fake))
    b += lr * (np.mean(1-pr)        - np.mean(pf))
    z = rng.normal(0, 1, 2000); fake = mu + s * z         # G step: non-saturating
    pf = sig(a*fake + b)
    mu += lr * np.mean((1-pf) * a)
    s  += lr * np.mean((1-pf) * a * z)

print(f"target  : mean 2.00  std 0.50")
print(f"learned : mean {mu:5.2f}  std {abs(s):4.2f}")
zf = rng.normal(0, 1, 2000); fake = mu + s*zf
acc = 0.5*(np.mean(sig(a*real+b) > 0.5) + np.mean(sig(a*fake+b) < 0.5))
print(f"D accuracy at the end : {acc:.3f}   (0.5 = can't tell real from fake)")
INSTRUMENT N6.1 — ADVERSARIAL TRAINING SIMULATOR · G & D LOSSES · EQ N6.1–N6.3 · DETERMINISTIC
[Interactive figure; readouts: G loss (final), D loss (final), D(G(z)) detector confidence]
A deterministic simulation of the two losses as the game runs (fixed seed, so it renders identically with zero interaction). The mint curve is the generator loss, the blue curve the discriminator loss; both oscillate around an equilibrium rather than monotonically falling — that is healthy adversarial training, not divergence. Crank the D learning rate or D-steps-per-G-step and watch the discriminator overpower the generator: D loss crashes toward 0, D(G(z)) toward 0, and G's gradient starves. Switch to SATURATING to see the early-training flat spot the non-saturating loss was invented to fix.
6.3

Instability & mode collapse

The theory of §6.2 assumes the inner maximization is solved exactly and the two distributions overlap. Reality grants neither, and the gap is where GANs earned their reputation for being temperamental. Three failure modes dominate.

Vanishing gradients. EQ N6.3 says \(G\) minimizes a JSD. But early on, \(p_g\) and \(p_{\text{data}}\) live on nearly disjoint low-dimensional manifolds inside a high-dimensional space — natural images occupy a vanishingly thin sliver of pixel space, and a fresh generator's outputs occupy a different one. Where the supports do not overlap, JSD is pinned at its maximum \(\log 2\) and is locally flat, so a near-optimal discriminator hands the generator a gradient of essentially zero. The detective becomes too good, and the forger stops learning. This is the precise sense in which a perfectly trained discriminator is bad for training.
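The plateau can be seen directly by sliding one distribution away from another and recording the JSD. The sketch below does this for two equal-width Gaussians discretized on a grid. The grid, the width of 0.5, and the list of shifts are illustrative assumptions.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; entries with p = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def discretized_gaussian(grid, mean, std):
    """Gaussian weights on a grid, normalized to sum to 1."""
    w = np.exp(-0.5 * ((grid - mean) / std) ** 2)
    return w / w.sum()

grid = np.linspace(-6.0, 14.0, 4001)
p_data = discretized_gaussian(grid, 0.0, 0.5)

shifts = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]   # how far the generator's bump sits from the data
values = [jsd(p_data, discretized_gaussian(grid, s, 0.5)) for s in shifts]
for s, v in zip(shifts, values):
    print(f"shift {s:4.1f}   JSD {v:.6f}   (ln 2 = {np.log(2):.6f})")
```

Between the last two shifts the distance between the bumps doubles while the JSD barely moves, so a generator sitting that far out receives almost no signal about which way to go.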

Mode collapse. The minimax objective rewards the generator for fooling \(D\) on the current batch, not for covering the whole data distribution. A generator can win by mapping every \(z\) to a single hyper-realistic output — one perfect "7" for an MNIST GAN, one face. \(D\) eventually learns to reject that point, the generator hops to another single mode, \(D\) chases, and the two play whack-a-mole forever. The pathology is that the generator's loss has no term demanding diversity; covering one mode flawlessly scores as well as covering all of them.

WHY COLLAPSE IS STRUCTURAL

Mode collapse is not a bug in the optimizer — it is in the objective. Compare to maximum likelihood, whose KL term \(\mathrm{KL}(p_{\text{data}}\|p_g)\) is mode-covering: it explodes if \(p_g\) assigns near-zero probability anywhere the data has mass, forcing the model to spread out (and blur). The adversarial game has no such penalty for dropping a mode entirely, so it is free to be mode-seeking — crisp where it commits, blind to what it abandons. Sharper samples and dropped modes are two faces of the same coin.
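The asymmetry between mode-covering and mode-seeking objectives can be made concrete with the two directions of the KL divergence. The sketch below scores two candidate single-Gaussian generators against a 70/30 two-mode target: one parked on the major mode, one spread across both (the moment-matched Gaussian, mean 1.2 and standard deviation about 2.78). Reverse KL is used here only as a stand-in for mode-seeking behavior; it is an assumption of this illustration, not a claim that the GAN objective equals reverse KL.

```python
import numpy as np

def normalize(w):
    return w / w.sum()

def gauss(grid, mean, std):
    """Unnormalized Gaussian bump on a grid."""
    return np.exp(-0.5 * ((grid - mean) / std) ** 2)

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; entries with p = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

grid = np.linspace(-8.0, 8.0, 1601)
# Target: 70% of the mass near +3, 30% near -3.
p_data = normalize(0.7 * gauss(grid, 3.0, 0.4) + 0.3 * gauss(grid, -3.0, 0.4))
one_mode = normalize(gauss(grid, 3.0, 0.4))   # crisp, but drops the minor mode
spread = normalize(gauss(grid, 1.2, 2.78))    # covers both modes, but blurs them

for name, q in [("one mode", one_mode), ("spread", spread)]:
    print(f"{name:9s} forward KL(p_data||q) = {kl(p_data, q):7.3f}   "
          f"reverse KL(q||p_data) = {kl(q, p_data):7.3f}")
```

Forward KL punishes the one-mode candidate heavily for the missing 30% and prefers the blurry spread; reverse KL prefers the crisp one-mode candidate and barely registers the abandoned mode.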

Non-convergence and oscillation. Even with overlapping supports, simultaneous gradient descent-ascent on a minimax game (descent for \(G\), ascent for \(D\)) is not guaranteed to converge. The dynamics can orbit the equilibrium indefinitely, like two players in rock-paper-scissors each best-responding to the other's last move. The losses you watch during GAN training oscillate by design; a generator loss that falls smoothly to zero usually means the discriminator has collapsed, not that you have won.
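The smallest example of this behavior is the bilinear game \(V(x, y) = x\,y\), with \(x\) minimizing and \(y\) maximizing; its only equilibrium is the origin. The sketch below is a toy stand-in for GAN dynamics rather than a GAN. It runs simultaneous gradient steps and tracks the distance from the equilibrium. The step size and starting point are arbitrary.

```python
import numpy as np

def simultaneous_gda(x, y, lr, steps):
    """Simultaneous descent on x and ascent on y for V(x, y) = x * y.

    Returns the distance from the equilibrium (0, 0) after every step.
    """
    radii = [np.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # both players move at once
        radii.append(np.hypot(x, y))
    return np.array(radii)

radii = simultaneous_gda(1.0, 1.0, lr=0.1, steps=200)
print(f"start radius {radii[0]:.3f}, end radius {radii[-1]:.3f}")
print("radius grew at every step:", bool(np.all(np.diff(radii) > 0)))
```

Each update multiplies the squared radius by \(1 + \text{lr}^2\), so with a finite step size the iterates circle the equilibrium while slowly spiraling away from it; only in the limit of vanishing step size does the path close into an orbit.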

A fresh generator's samples and the real data occupy near-disjoint manifolds, so \(\mathrm{JSD}(p_{\text{data}}\|p_g)\) sits at its ceiling and the generator's effective loss \(2\,\mathrm{JSD} - \log 4\) is flat. What constant value (in nats) is \(\mathrm{JSD}\) stuck at, killing the gradient?
Disjoint support pins \(\mathrm{JSD}\) at its maximum \(\log 2 = \ln 2 \approx \) 0.693 nats — a flat plateau whose derivative is zero. No matter how the generator nudges its output, the loss does not move, so no useful gradient flows back. This is the mathematical core of the vanishing-gradient failure, and the precise problem the Wasserstein distance (§6.4) was designed to remove.
PYTHON · RUNNABLE IN-BROWSER
# Mode collapse: a single-Gaussian generator covers only ONE mode of a mixture
import numpy as np
rng = np.random.default_rng(1)

# target: 70% mass near +3, 30% near -3  (two well-separated modes)
heavy = rng.random(3000) < 0.7
real  = np.where(heavy, 3.0, -3.0) + rng.normal(0, 0.4, 3000)
def sig(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))

a, b, mu, s = 0.3, 0.0, 0.0, 1.0            # G can only make ONE bump (1 Gaussian)
for it in range(800):
    z = rng.normal(0, 1, 3000); fake = mu + s*z
    pr, pf = sig(a*real + b), sig(a*fake + b)
    a += 0.02 * (np.mean((1-pr)*real) - np.mean(pf*fake))
    b += 0.02 * (np.mean(1-pr)        - np.mean(pf))
    z = rng.normal(0, 1, 3000); fake = mu + s*z; pf = sig(a*fake + b)
    mu += 0.02 * np.mean((1-pf)*a);  s += 0.02 * np.mean((1-pf)*a*z)

g = mu + abs(s)*rng.normal(0, 1, 8000)
near_major = np.mean(np.abs(g - 3) < 1.2)
near_minor = np.mean(np.abs(g + 3) < 1.2)
print(f"generator: mean {mu:5.2f}  std {abs(s):4.2f}")
print(f"mass near +3 (major mode): {near_major:.2f}")
print(f"mass near -3 (minor mode): {near_minor:.2f}  <- collapsed: mode abandoned")
INSTRUMENT N6.2 — MODE-COLLAPSE DEMO · 2D MIXTURE OF 8 GAUSSIANS · GENERATOR COVERAGE
[Interactive figure; readouts: modes covered, coverage, regime]
Eight real modes sit on a ring (grey rings); the generator's samples are the mint cloud. With diversity pressure off, scrub training forward and watch the generator hop from mode to mode — at any moment it parks on one or two and abandons the rest, the signature of collapse. Raise diversity pressure (a stand-in for minibatch discrimination / unrolled-GAN style fixes) and the cloud spreads to cover the full ring. The lesson: nothing in the bare objective rewards coverage; you have to add it.
6.4

DCGAN, WGAN & the Wasserstein fix

The original GAN paper used multilayer perceptrons on small images and trained precariously. Two papers turned the idea into something that worked reliably — one architectural, one about the loss.

DCGAN (Radford, Metz & Chintala, 2015) is the architecture that made image GANs reproducible. Its recipe became boilerplate: replace pooling with strided convolutions (let the network learn its own up/down-sampling); use transposed convolutions in \(G\) to grow spatial resolution; apply batch normalization in both networks to stabilize activations; drop fully-connected hidden layers; use ReLU in \(G\) and LeakyReLU in \(D\). Beyond crisp 64×64 samples, DCGAN demonstrated that the learned latent space was structured — vector arithmetic on \(z\) (the famous "man with glasses − man + woman = woman with glasses") moved meaningfully in image space, the first hint that GANs learn a disentangled representation, not just a lookup table.
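The "learn your own up/down-sampling" part of the recipe comes down to simple size arithmetic. The sketch below traces feature-map side lengths through a stack of stride-2 layers. The specific hyperparameters (kernel 4, stride 2, padding 1, a 4×4 starting map, four layers) are a commonly used configuration assumed here for illustration; the chapter itself does not specify them.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Output side length of an ordinary strided convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: grow a 4x4 map with four transposed convolutions.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(conv_transpose_out(g_sizes[-1], kernel=4, stride=2, padding=1))
print("generator    :", g_sizes)

# Discriminator: shrink a 64x64 image with four strided convolutions.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out(d_sizes[-1], kernel=4, stride=2, padding=1))
print("discriminator:", d_sizes)
```

With these settings each transposed convolution exactly doubles the resolution and each strided convolution exactly halves it, so the two networks mirror each other without any pooling layers.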

WGAN (Arjovsky, Chintala & Bottou, 2017) attacked the loss. The diagnosis was exactly §6.3: JSD gives no usable gradient when supports are disjoint. The fix is to measure the distance between distributions with the Wasserstein (earth-mover's) distance instead, which stays smooth and informative even when the distributions do not overlap.

EQ N6.4 — WASSERSTEIN-1 (EARTH MOVER'S) DISTANCE $$ W_1(p_{\text{data}}, p_g) \;=\; \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \;\mathbb{E}_{(x,y)\sim\gamma}\big[\,\lVert x - y \rVert\,\big] $$
\(\Pi\) is the set of all transport plans \(\gamma\) with the right marginals; \(W_1\) is the minimum average "dirt × distance" to reshape one pile of probability into the other. Unlike JSD, \(W_1\) varies continuously with the generator's parameters even when supports are disjoint — move a far-away blob closer and \(W_1\) drops smoothly, giving a gradient where JSD gave a flat plateau.
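In one dimension the optimal transport plan simply pairs sorted samples, so \(W_1\) between two equal-size samples is the mean gap between their sorted values. The sketch below uses that fact to show how \(W_1\) responds as a far-away blob moves closer. The sample size, widths, and shifts are illustrative assumptions.

```python
import numpy as np

def w1_empirical(x, y):
    """W1 between two equal-size 1D samples: mean gap between sorted values."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.5, 5000)    # stand-in for the data distribution
noise = rng.normal(0.0, 0.5, 5000)   # generator samples before shifting

shifts = [8.0, 4.0, 2.0, 1.0, 0.5]   # generator blob moving toward the data
values = [w1_empirical(real, noise + s) for s in shifts]
for s, v in zip(shifts, values):
    print(f"shift {s:4.1f}   W1 ~ {v:.3f}")
```

The estimate tracks the shift almost one for one, including at shifts of 4 and 8 where the two blobs share essentially no support, so every step toward the data is rewarded.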

The infimum over transport plans is intractable, so WGAN uses the Kantorovich–Rubinstein duality, which turns it into a maximization over 1-Lipschitz functions \(f\). The discriminator is repurposed as this \(f\) — now called a critic, because it outputs an unbounded real score, not a probability:

EQ N6.5 — KANTOROVICH–RUBINSTEIN DUAL (THE CRITIC OBJECTIVE) $$ W_1(p_{\text{data}}, p_g) \;=\; \sup_{\lVert f \rVert_{L} \le 1}\; \mathbb{E}_{x \sim p_{\text{data}}}[\,f(x)\,] \;-\; \mathbb{E}_{z \sim p_z}\big[\,f(G(z))\,\big] $$
The critic \(f\) maximizes the gap between its average score on real and on fake; the generator minimizes it. The constraint \(\lVert f\rVert_L \le 1\) (1-Lipschitz: \(f\) cannot change faster than its input) is what makes the dual equal \(W_1\). The original WGAN enforced it crudely by weight clipping; WGAN-GP replaced that with a gradient penalty pushing \(\lVert \nabla f \rVert\) toward 1 — far more stable, and the version in wide use.
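A deliberately tiny version of the critic objective fits in a few lines. The sketch below restricts the critic to a linear function \(f(x) = w\,x\) and enforces the Lipschitz bound by clipping \(w\) to \([-1, 1]\), in the spirit of the original weight clipping. The linear family, the learning rate, and the two shifted Gaussians are assumptions made for illustration; a restricted critic family generally gives only a lower bound on \(W_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(2.0, 0.5, 5000)
fake = rng.normal(0.0, 0.5, 5000)

# Linear critic f(x) = w * x (a bias would cancel in the gap, so it is omitted).
w, lr, clip = 0.0, 0.05, 1.0
for _ in range(200):
    grad_w = real.mean() - fake.mean()                 # d/dw of mean(f(real)) - mean(f(fake))
    w = float(np.clip(w + lr * grad_w, -clip, clip))   # ascent step, then weight clipping

critic_gap = w * (real.mean() - fake.mean())
# Reference value: in 1D, W1 between equal-size samples is the mean sorted gap.
w1_sorted = float(np.mean(np.abs(np.sort(real) - np.sort(fake))))

print(f"clipped weight      : {w:.2f}")
print(f"critic gap (dual)   : {critic_gap:.3f}")
print(f"sorted-sample W1    : {w1_sorted:.3f}")
```

Because these two distributions differ only by a shift, the best 1-Lipschitz function is itself linear and the clipped critic recovers \(W_1\). For differently shaped distributions a linear critic would undershoot, which is why practical critics are deep networks with a softer constraint such as the gradient penalty.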

The payoff is practical. Because EQ N6.5 estimates a genuine distance, the critic's value correlates with sample quality — for the first time a GAN's loss curve meant something you could read. WGAN tolerates a strong critic (train it to near-optimality between generator steps, the opposite of the vanilla advice), is far less prone to mode collapse, and removed much of the black-magic hyperparameter fiddling. It did not make GANs trivial, but it made them debuggable.

Variant | Distribution distance | Output network | Headline contribution
Vanilla GAN | Jensen–Shannon | discriminator → (0,1) | The adversarial game itself (2014)
DCGAN | Jensen–Shannon | conv discriminator | Stable conv architecture; structured latent space
WGAN | Wasserstein-1 | critic → ℝ (clipped) | Meaningful loss; far less collapse
WGAN-GP | Wasserstein-1 | critic + gradient penalty | Lipschitz via penalty, not clipping
True or false: WGAN replaces the Jensen–Shannon divergence of the original GAN objective with the Wasserstein (earth-mover's) distance, precisely because the latter gives useful gradients even when the real and generated distributions do not overlap. (Answer true or false.)
This is exactly WGAN's thesis. Vanilla GANs minimize JSD (EQ N6.3), which is flat at \(\log 2\) for disjoint supports and hands the generator no gradient. \(W_1\) (EQ N6.4) instead varies continuously with how far apart the distributions are, so moving generated mass toward real mass always lowers the loss. The discriminator becomes a 1-Lipschitz critic (EQ N6.5). The statement is true.
INSTRUMENT N6.3 — LATENT-INTERPOLATION VISUALIZER · WALK z FROM A → B IN LATENT SPACE · DCGAN-STYLE
[Interactive figure; readouts: |z| (latent norm), decoded pattern, endpoints A · B]
Two latent codes \(z_A, z_B\) decode (via a small fixed toy generator) to two distinct procedural "textures"; slide \(t\) to walk between them. A smooth, gradual morph with no jumps is the signature of a well-trained generator: the latent space is continuous, so nearby codes give nearby images. Switch LINEAR → SLERP: straight-line interpolation in a Gaussian latent cuts inside the typical-radius shell toward the origin, a region that holds almost no probability mass in high dimensions even though the density peaks there (the \(|z|\) readout sags at \(t=0.5\)), giving washed-out midpoints. Spherical interpolation keeps \(|z|\) on the typical-radius shell and the morph stays crisp, which is the reason practitioners slerp.
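The sag in \(|z|\) is easy to reproduce without any generator. The sketch below draws two Gaussian latent codes and compares the norm along a linear and a spherical path. The latent size of 512 and the function names are assumptions for illustration.

```python
import numpy as np

def lerp(a, b, t):
    """Straight-line interpolation between a and b."""
    return (1 - t) * a + t * b

def slerp(a, b, t):
    """Spherical interpolation along the arc between a and b."""
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(a, b, t)          # nearly parallel vectors: lerp is fine
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
dim = 512                              # assumed latent dimensionality
z_a, z_b = rng.normal(size=dim), rng.normal(size=dim)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    n_lin = np.linalg.norm(lerp(z_a, z_b, t))
    n_sph = np.linalg.norm(slerp(z_a, z_b, t))
    print(f"t = {t:.2f}   |lerp| = {n_lin:6.2f}   |slerp| = {n_sph:6.2f}")
```

Two independent high-dimensional Gaussian codes are nearly orthogonal, so the linear midpoint has a norm of roughly \(1/\sqrt{2}\) times the endpoints', while the spherical path stays near the endpoint radius throughout.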
6.5

StyleGAN & where GANs went

By 2018 GANs could generate small images reliably; the open question was control and resolution. Progressive growing (Karras et al., 2017) trained GANs by adding resolution layers one at a time, reaching 1024×1024. StyleGAN (Karras, Laine & Aila, 2019) then redesigned the generator itself and produced the photorealistic faces — thispersondoesnotexist.com — that put GANs in the popular imagination.

StyleGAN's central move was to stop feeding the latent code in at the bottom. Instead a learned mapping network turns \(z\) into an intermediate latent \(w\), and \(w\) controls the image by modulating the statistics of feature maps at every resolution via adaptive instance normalization. Coarse layers set pose and face shape; middle layers set smaller-scale facial features and hairstyle; fine layers set color and micro-texture. Injecting a different \(w\) at different layers ("style mixing") cleanly transplants, say, hair color without touching pose: disentanglement by construction. Per-pixel noise inputs supply the stochastic detail (freckles, stray hairs) that the structured \(w\) need not encode.

EQ N6.6 — STYLE MODULATION (AdaIN) $$ \mathrm{AdaIN}(x_i, w) \;=\; y_{s,i}(w)\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} \;+\; y_{b,i}(w) $$
Each feature map \(x_i\) is normalized to zero mean and unit variance, then re-scaled and re-shifted by a per-channel style \((y_{s,i}, y_{b,i})\) computed from \(w\). The style controls the image purely through these scale/shift statistics — applied independently at each resolution, which is what separates coarse structure from fine texture. StyleGAN2 later replaced AdaIN with weight demodulation to remove the characteristic "droplet" artifacts, and StyleGAN3 fixed aliasing so features stick to surfaces under motion.
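EQ N6.6 translates almost line for line into array code. The sketch below applies it to a small random feature tensor. In StyleGAN the per-channel styles are computed from \(w\); here `y_s` and `y_b` are passed in directly as hypothetical values, and the small `eps` guard against division by zero is also an addition of this sketch.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive instance normalization for one sample.

    x   : feature maps, shape (channels, height, width)
    y_s : per-channel style scale, shape (channels,)
    y_b : per-channel style bias,  shape (channels,)
    """
    mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)     # per-channel standard deviation
    normalized = (x - mu) / (sigma + eps)
    return y_s[:, None, None] * normalized + y_b[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(3.0, 2.0, size=(4, 8, 8))   # 4 channels of 8x8 activations
y_s = np.array([0.5, 1.0, 2.0, 3.0])              # hypothetical style scales
y_b = np.array([-1.0, 0.0, 1.0, 2.0])             # hypothetical style biases

out = adain(features, y_s, y_b)
print("channel means:", np.round(out.mean(axis=(1, 2)), 4))
print("channel stds :", np.round(out.std(axis=(1, 2)), 4))
```

Whatever statistics the incoming features had, each output channel ends up with mean `y_b` and standard deviation `y_s`, which is the sense in which the style controls the image purely through scale and shift.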

Where GANs stand in 2026 is an honest mixed picture. For unconditional and class-conditional image synthesis, diffusion models largely displaced GANs after 2021: they are far easier to train (a stable denoising regression, no adversary), cover modes better, and scale to text-to-image systems where GANs never caught up. The 2021 "diffusion beats GANs on image synthesis" result marked the turn. GANs did not vanish, though. Their one decisive advantage is speed: a GAN generates in a single forward pass, while diffusion needs many denoising steps — so GANs and GAN-style adversarial losses survive wherever latency matters: real-time super-resolution, image-to-image translation, neural vocoders for speech, and as the distillation target that compresses slow diffusion models into one-step generators. Adversarial training is now a component in a larger toolbox rather than the whole story.

CONTESTED

"GANs are obsolete" is too strong. The claim holds for large-scale text-to-image, where diffusion (and autoregressive token models) clearly won on quality and trainability. It does not hold for latency-bound generation, and the line is blurring: state-of-the-art few-step diffusion distillation often adds an adversarial loss to keep one-step samples sharp. The adversarial idea outlived the pure-GAN architecture. Treat anyone who says GANs are simply dead, or simply fine, with equal suspicion.

NEXT

Every model in this volume — autoencoder, GAN, the deep classifier — is only as good as the optimization that fits it. Chapter 07 leaves architectures behind for the craft of training deep nets: initialization, normalization, the vanishing/exploding-gradient problem these adversarial games quietly battle, learning-rate schedules, and the regularization that decides whether a network that can fit the data actually generalizes.

6.R

References

  1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014 — the original adversarial game, the optimal discriminator (EQ N6.2), and the JSD reduction (EQ N6.3).
  2. Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR 2016 — DCGAN: the stable convolutional architecture and latent-space vector arithmetic.
  3. Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML 2017 — replacing JSD with the Wasserstein distance (EQ N6.4–N6.5) and the critic.
  4. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017 — WGAN-GP: the gradient penalty that replaced weight clipping.
  5. Karras, T., Laine, S. & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019 — StyleGAN: the mapping network, AdaIN style modulation (EQ N6.6), and style mixing.
  6. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016). Improved Techniques for Training GANs. NeurIPS 2016 — minibatch discrimination and feature matching, the classic anti-collapse fixes behind Instrument N6.2.
  7. Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021 — the result marking diffusion's displacement of GANs for large-scale image generation (§6.5).