Games in AI — Self-Play, GANs & Multi-Agent

3.1

When learning becomes a game

The first two chapters treated games as a model of the world: rational players, payoff matrices, equilibria you solve for. This chapter inverts the relationship. Here the game is a training objective — a structure we impose on optimization so that the loss surface is no longer fixed but co-created by the learner itself. The defining feature is a moving target: the thing a model is trying to beat improves whenever the model does.

Static supervised learning has a ceiling. The objective is a frozen dataset, and the best you can do is fit it; once you match the labels, the gradient goes quiet. A game-based objective never goes quiet, because the opponent (an adversary, a past version of yourself, a population of peers) keeps raising the bar. Three families dominate modern practice:

Setup	The two sides	What the game produces	Canonical system
Adversarial	generator vs critic	A learned loss function that sharpens as samples improve	GANs
Self-play	agent vs its own past	An automatic curriculum of ever-stronger opponents	AlphaZero
Multi-agent	N agents in a shared world	Emergent strategy, cooperation, and convention	Pluribus, MADDPG

What unifies them is the minimax skeleton from Chapter 01: a value that one party maximizes and another minimizes. The mathematics of saddle points, best responses and equilibria — built for analyzing rational agents — turns out to be exactly the mathematics of training them. The catch, returned to throughout, is that gradient descent was designed to find minima, not saddle points, so these games are notoriously harder to optimize than ordinary losses.

FRAME

A useful slogan: supervised learning imitates a teacher; a game manufactures one. Everything below is a different answer to the question "where does the next, slightly-harder training example come from?"

3.2

GANs as a minimax game

A Generative Adversarial Network pits a generator $G$, which maps noise $z \sim p_z$ to fake samples $G(z)$, against a discriminator $D$, which outputs the probability that a sample is real. $D$ wants to label reals as 1 and fakes as 0; $G$ wants $D(G(z))$ to read as 1. Goodfellow et al. (2014) wrote this as a single two-player zero-sum game on one value function:

EQ G3.1 — THE GAN MINIMAX $$ \min_{G}\,\max_{D}\; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D(G(z))\big)\big] $$

$D$ maximizes $V$ (push $D(x)\to 1$ on reals, $D(G(z))\to 0$ on fakes); $G$ minimizes it by fooling $D$. It is zero-sum in spirit: every bit $D$ gains, $G$ loses. The generator never sees the data directly — its only teacher is the gradient flowing back through $D$. That is the whole trick: the loss function is itself learned, and it grows more discerning exactly as the generator improves.

Fix $G$ and ask for the best discriminator. For any $x$, $V$ is maximized pointwise, and calculus gives the optimal critic in closed form:

EQ G3.2 — THE OPTIMAL DISCRIMINATOR $$ D^{*}_{G}(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)} $$

where $p_g$ is the generator's induced distribution. When the generator has won — $p_g = p_{\text{data}}$ everywhere — the optimal discriminator reads $D^{*}(x) = \tfrac{1}{2}$ for every input: it can do no better than a coin flip. That fixed point is the Nash equilibrium of the game.

Substituting $D^{*}_G$ back collapses the game onto a divergence between the two distributions:

EQ G3.3 — VALUE AT THE GENERATOR'S OPTIMUM $$ C(G) \;=\; \max_{D} V(D,G) \;=\; -\log 4 \;+\; 2\cdot \mathrm{JSD}\!\big(p_{\text{data}}\,\|\,p_g\big) $$

$\mathrm{JSD}\ge 0$ is the Jensen–Shannon divergence, zero only when $p_g = p_{\text{data}}$. So the global minimum of $C(G)$ is $-\log 4 \approx -1.386$, attained exactly when the generator matches the data. Training a GAN is, in this idealized analysis, minimizing JSD by a game instead of by an explicit formula — which matters because JSD itself is intractable to compute on real high-dimensional data.

The honest caveats. The clean theory assumes $D$ is trained to optimality at every step and that both networks have unlimited capacity. Neither holds. In practice GANs are infamous for training instability and mode collapse (the generator parks all its mass on a few outputs that reliably fool the current $D$). JSD also saturates — its gradient vanishes when the distributions barely overlap — which motivated Wasserstein GANs (Arjovsky et al., 2017), replacing JSD with an Earth-Mover distance whose gradient stays informative. The minimax framing is the right mental model; the optimization is genuinely hard, and as of 2026 diffusion and autoregressive models have largely displaced GANs for frontier image and audio synthesis, even as the adversarial idea persists everywhere from super-resolution to robustness training.

At the GAN's Nash equilibrium the generator matches the data, so $p_g(x) = p_{\text{data}}(x)$ for every $x$. Plug this into EQ G3.2: what value does the optimal discriminator $D^{*}(x)$ output everywhere?

$D^{*}(x) = \dfrac{p_{\text{data}}}{p_{\text{data}} + p_g} = \dfrac{p_{\text{data}}}{2\,p_{\text{data}}} = \dfrac{1}{2} = $ 0.5. The perfect critic is reduced to a coin flip — it can no longer tell real from fake.

Using EQ G3.3, what is the value $C(G)$ at the global optimum, where $\mathrm{JSD} = 0$? (Give the natural-log value of $-\log 4$, to three decimals.)

$C(G) = -\log 4 + 2\cdot 0 = -\log 4 = -1.38629\ldots \approx $ −1.386. This is the floor of the game; a generator that has matched the data can drive the value no lower.

A GAN is a zero-sum (minimax) game between the generator and the discriminator over a single value function $V(D,G)$. True or false? (Answer true or false.)

EQ G3.1 is literally $\min_G \max_D V(D,G)$: the discriminator maximizes the same quantity the generator minimizes, so it is a two-player minimax (zero-sum) game. The answer is true.

PYTHON · RUNNABLE IN-BROWSER

# GAN minimax value on a toy: two distributions over 5 discrete bins.
# Optimal D is closed-form (EQ G3.2); the game value collapses to JSD (EQ G3.3).
import numpy as np

p_data = np.array([0.05, 0.15, 0.40, 0.25, 0.15])      # the real distribution
def value(p_g):
    p_g = np.asarray(p_g, float); p_g /= p_g.sum()
    D = p_data / (p_data + p_g)                          # EQ G3.2, optimal critic
    V = (p_data * np.log(D) + p_g * np.log(1 - D)).sum() # EQ G3.1 at D*
    m = 0.5 * (p_data + p_g)                             # JSD, base-e
    jsd = 0.5*(p_data*np.log(p_data/m)).sum() + 0.5*(p_g*np.log(p_g/m)).sum()
    return V, jsd, D

for name, pg in [("bad   ", [0.40,0.30,0.10,0.10,0.10]),
                 ("closer", [0.10,0.20,0.30,0.25,0.15]),
                 ("matched", p_data.copy())]:
    V, jsd, D = value(pg)
    print(f"{name}: value V={V:+.4f}  JSD={jsd:.4f}  check(-log4+2*JSD)={-np.log(4)+2*jsd:+.4f}")
print(f"\nfloor of the game: -log 4 = {-np.log(4):+.4f}  (reached only when p_g == p_data)")
print("at the match, every D* entry equals 0.5:", np.round(value(p_data.copy())[2], 3))

edits are live — break it on purpose

INSTRUMENT G3.1 — GAN PAYOFF VISUALIZEREQ G3.1–G3.3 · ONE BIN, CLOSED-FORM D*

GENERATOR MASS p_g (vs fixed p_data = 0.50) 0.20

OPTIMAL D*(x)

—

GAME VALUE C(G)

—

JSD(p_data ‖ p_g)

—

A single point with $p_{\text{data}} = 0.5$; slide the generator's mass $p_g$. The curve is the game value $C(G)$ over all $p_g$; the dot is where you are. It bottoms out at $p_g = 0.5$ where $D^{*} = 0.5$, JSD $= 0$ and $C(G) = -\log 4$. Push $p_g$ to either extreme and the discriminator wins decisively — exactly the regime where $G$'s gradient (the slope of the curve) goes flat and learning stalls.

3.3

Self-play — AlphaZero & beyond

The cleanest game-as-curriculum is an agent playing against itself. There is no human data, no teacher, no fixed opponent: the agent's current policy is both the player and the environment it must beat. Because the opponent is a copy of you, the difficulty tracks your skill automatically — a perfectly calibrated curriculum that needs no designer.

AlphaGo Zero (Silver et al., 2017) made this concrete for Go and then chess and shogi (AlphaZero). A single network $f_\theta(s) = (\boldsymbol{p}, v)$ outputs a move-probability vector $\boldsymbol{p}$ and a scalar value $v \in [-1, 1]$ estimating who wins from state $s$. Monte-Carlo Tree Search (MCTS) uses the network to look ahead, producing improved move counts $\boldsymbol{\pi}$; the game is then played to a result $z \in \{-1, +1\}$. Training pulls the network toward its own searched-and-played behavior:

EQ G3.4 — ALPHAZERO'S SELF-PLAY LOSS $$ \ell(\theta) \;=\; (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \boldsymbol{p} \;+\; c\,\lVert \theta \rVert^2 $$

First term: regress the value head toward the actual game outcome $z$. Second term: cross-entropy pulling the raw policy $\boldsymbol{p}$ toward the search-improved policy $\boldsymbol{\pi}$ — the network distills its own lookahead back into its instincts. Third: weight decay. The data is generated entirely by the current network playing itself; tomorrow's training set is produced by today's model, which is what makes the curriculum self-generating.

Each generation is stronger, so each generation's self-play games are harder, so the next network must improve to keep winning — a ratchet. The same ratchet drives AlphaStar (StarCraft II), OpenAI Five (Dota 2), and the policy-improvement loops inside RLHF, where a reward model plays the critic. The mechanism that powers Pluribus (Brown & Sandholm, 2019) — superhuman six-player poker — is self-play too, but in an imperfect-information game, so it computes a blueprint via counterfactual regret minimization and refines it with real-time search; its solution concept is approximate Nash rather than a hard win/loss value.

The minimal engine behind self-play improvement is a value bootstrap: a state's value is estimated from the values of the states it leads to, and those estimates pull each other toward consistency. In a two-player zero-sum game the backup is a minimax — you assume the opponent (your own copy) plays its best reply:

EQ G3.5 — MINIMAX VALUE BACKUP $$ V(s) \;\leftarrow\; \max_{a}\; \Big[\, r(s,a) \;-\; \gamma\, V\big(s'(s,a)\big) \,\Big] $$

The sign flips on the child value because what is good for you is bad for the opponent who moves next (a "negamax" backup, $\gamma$ the discount). Iterating this map is a contraction: the values converge to the game's true minimax values regardless of where you start. No labels were ever supplied — the targets are bootstrapped from the agent's own evolving estimates.

In self-play, the agent's own games against copies of itself produce the training data, so each generation's opponents are as strong as the current model — i.e. self-play generates its own training curriculum. True or false? (Answer true or false.)

There is no external dataset; the agent plays itself, and as it improves, the games it generates get harder, supplying a steadily-harder curriculum with no human in the loop. The answer is true.

PYTHON · RUNNABLE IN-BROWSER

# Tiny self-play value bootstrap on a toy game.
# A 6-node game tree: leaves have true outcomes; internal nodes back up by
# negamax (EQ G3.5). We start from a WRONG guess and let it self-correct.
import numpy as np

# children[node] = list of child indices ([] means a leaf)
children = {0:[1,2], 1:[3,4], 2:[4,5], 3:[], 4:[], 5:[]}
leaf_val = {3:+1.0, 4:-1.0, 5:+1.0}   # zero-sum outcomes from mover's view
gamma = 1.0

V = {n: (leaf_val[n] if n in leaf_val else 0.7) for n in children}  # bad init
print("init   :", {k: round(v,3) for k,v in V.items()})
for sweep in range(5):
    for n in [2,1,0]:                  # back up internal nodes, leaves to root
        if children[n]:
            V[n] = max(-gamma*V[c] for c in children[n])   # negamax backup
    print(f"sweep {sweep}:", {k: round(v,3) for k,v in V.items()})

best = max(children[0], key=lambda c: -V[c])
print(f"\nroot value V(0) = {V[0]:+.0f}; mover should play toward child {best}.")
print("targets were never labeled -- they bootstrapped from the leaves up.")

edits are live — break it on purpose

INSTRUMENT G3.2 — SELF-PLAY LADDER SIMULATORELO RATCHET · GENERATION-OVER-GENERATION

LEARNING RATE (ELO GAINED PER WIN-MARGIN) 32

NOISE (TRAINING VARIANCE) 20

OPPONENT

FINAL ELO

—

VS FIXED TEACHER

—

GENERATIONS

Forty training generations. In SELF-PLAY the opponent is always the latest version, so each win nudges the rating up and the bar rises with it — open-ended growth, the AlphaZero ratchet. Switch to FIXED TEACHER and watch the curve flatten the moment the agent matches the teacher: a frozen opponent is a ceiling. The dashed line marks the teacher's strength — self-play sails past it without ever being shown a stronger example.

3.4

Multi-agent reinforcement learning

Two players is the easy case. Multi-agent reinforcement learning (MARL) drops $N$ learners into a shared environment, each with its own policy $\pi_i$ and reward $r_i$. The hard part is structural: from any single agent's view, the others are part of the environment, and they are changing as they learn. The world is non-stationary — the ground that gradient descent assumes is fixed is, in fact, moving under every step.

The right object is the Markov (stochastic) game: state transitions and each agent's reward depend on the joint action $(a_1,\ldots,a_N)$. The solution concept is a Nash equilibrium of policies — no agent can improve by unilaterally changing its own. Cooperation, competition and mixtures all live here, distinguished only by how the reward functions relate:

Reward structure	Game	What agents learn	Example
Fully aligned	cooperative	Coordination, role assignment, shared conventions	Team play, traffic
Fully opposed	zero-sum	Robust, minimax-optimal strategies	Go, poker
Mixed	general-sum	Negotiation, reciprocity, social dilemmas	Markets, Diplomacy

The workhorse algorithmic idea is centralized training, decentralized execution (CTDE). During training a critic may see everyone's observations and actions — making its target stationary — while each agent's actor learns a policy that runs on its own local view alone. MADDPG (Lowe et al., 2017) is the canonical instance. The key intuition is that one agent's policy-gradient sign depends on what the others do:

EQ G3.6 — DECENTRALIZED POLICY GRADIENT (CTDE) $$ \nabla_{\theta_i} J_i \;=\; \mathbb{E}\!\left[\, \nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\; Q_i^{\boldsymbol{\pi}}\!\big(s,\, a_1,\ldots,a_N\big) \,\right] $$

Agent $i$'s actor depends only on its local observation $o_i$, but its centralized critic $Q_i$ is conditioned on the joint action — so it can attribute outcomes correctly even when the cause was a teammate's move. This is what tames non-stationarity at training time. At deployment the critic is discarded and each policy acts on $o_i$ alone.

The deepest lessons in MARL come from the simplest games. A coordination game can have several equilibria, and which one a population lands on is a matter of risk and history, not just payoff. The textbook case is the Stag Hunt: hunting a stag together pays best but only if your partner also commits; hunting hare is a safe solo payoff. There are two pure Nash equilibria — (stag, stag), which is payoff-dominant, and (hare, hare), which is risk-dominant — and learners frequently converge to the safe-but-worse one.

EQ G3.7 — STAG HUNT & RISK DOMINANCE $$ \begin{array}{c|cc} & \text{Stag} & \text{Hare}\\\hline \text{Stag} & (4,4) & (0,3)\\ \text{Hare} & (3,0) & (3,3) \end{array} \qquad \mathbb{E}[\text{Stag}] = 4q,\;\; \mathbb{E}[\text{Hare}] = 3 $$

With partner probability $q$ of choosing Stag, hunting stag beats hunting hare only when $4q > 3$, i.e. $q > 0.75$. So you must believe your partner cooperates more than three-quarters of the time before cooperation is rational. (stag, stag) earns more but demands trust; (hare, hare) is safe. This single threshold — not a payoff comparison — is why coordination is hard, and why learned conventions and communication matter.

In the Stag Hunt of EQ G3.7, your partner plays Stag with probability $q$. Your expected payoff is $4q$ for Stag and $3$ for Hare. At what $q$ are you exactly indifferent (the threshold above which cooperating is rational)?

Set $4q = 3$: $q = 3/4 = $ 0.75. Below this you should defect to Hare; above it, hunt Stag. Cooperation requires believing your partner cooperates more than 75% of the time.

INSTRUMENT G3.3 — MULTI-AGENT COORDINATION TOYSTAG HUNT · BEST-RESPONSE DYNAMICS · EQ G3.7

INITIAL FRACTION HUNTING STAG 0.60

EXPLORATION (MUTATION RATE) 0.03

BASIN THRESHOLD q*

0.75

CONVERGES TO

—

FINAL STAG FRACTION

—

A population plays Stag Hunt and each round shifts toward the better reply (replicator-style best response). The basin boundary sits at $q^{*} = 0.75$: start above it and the population climbs to the payoff-dominant all-Stag equilibrium; start below and it collapses to the risk-dominant all-Hare. Set the initial fraction near 0.75 to feel the knife-edge — a tiny change in starting beliefs flips the entire outcome. That sensitivity is the core difficulty of cooperative MARL.

Where the field actually is (2026). MARL works well in two-team zero-sum settings (it inherits self-play's stability) and in tightly cooperative ones with CTDE. General-sum, partially-observable, many-agent settings remain hard: equilibria may not be unique or even exist in tractable form, credit assignment across agents is brittle, and emergent behavior is difficult to specify or guarantee. The standout recent result, Meta's CICERO playing Diplomacy, needed to fuse a planning engine with a language model precisely because raw self-play does not by itself produce the negotiation and trust-building a mixed-motive game demands.

3.5

Mechanism design & adversarial robustness

Two more places where the game frame is load-bearing — one about designing games, one about defending against them.

Mechanism design: the inverse game

Ordinary game theory takes the rules as given and predicts behavior. Mechanism design runs the arrow backward: choose the rules so that self-interested play produces the outcome you want. It is the theory behind auctions, voting, and — increasingly — AI training. RLHF is a mechanism: the reward model is an incentive structure designed so that maximizing it yields helpful behavior, and reward hacking is what happens when the mechanism is mis-specified and the agent finds an unintended winning strategy. A central result is incentive compatibility — make truth-telling a dominant strategy — exemplified by the second-price (Vickrey) auction, where bidding your true value is optimal no matter what others do.

Adversarial robustness: the game against your inputs

A deployed model faces an implicit adversary: an attacker choosing the worst input within a small budget. Training a model to survive this is, again, a minimax game — but now the inner maximizer perturbs the data, not a network:

EQ G3.8 — ADVERSARIAL TRAINING (ROBUST OPTIMIZATION) $$ \min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\, \max_{\lVert \delta \rVert_p \le \epsilon}\; L\big(f_\theta(x + \delta),\, y\big) \,\Big] $$

The inner $\max$ (Madry et al., 2018) finds the most damaging perturbation $\delta$ inside an $\epsilon$-ball; the outer $\min$ hardens $\theta$ against it. It is a self-generating curriculum of hard examples — the model manufactures its own worst case at every step. The persistent caveat: robustness usually costs clean accuracy, and an $\ell_p$-ball is a narrow proxy for the open-ended threats a real deployment faces.

The same shape recurs across modern safety work: red-teaming a model is an adversary searching for a prompt that breaks it; constitutional and debate-style training pit models against each other to surface flaws; GAN-style discriminators reappear as learned detectors. The lesson of the chapter, stated once more: whenever you want a system to be robust, train it against an adversary that improves alongside it. A fixed test set is a teacher with a ceiling; a learning opponent is a teacher without one.

You have now seen the through-line of the whole volume: from the rational agents of Chapter 01, to repeated cooperation in Chapter 02, to games as the engine of modern AI here. The minimax skeleton that began as a way to analyze strategic behavior turned out to be the way to create it. Return to the Index to branch into the deep-learning and reinforcement-learning volumes where these games are implemented at scale.

3.R

References

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS — the minimax game of EQ G3.1–G3.3.
Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature 550 — AlphaGo Zero / AlphaZero self-play (EQ G3.4).
Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science 365 — Pluribus, self-play in imperfect information.
Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML — replacing JSD with an Earth-Mover distance for stable gradients.
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS — MADDPG and the CTDE gradient of EQ G3.6.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR — adversarial training as the robust min-max of EQ G3.8.
Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO). Science 378 — mixed-motive multi-agent play with negotiation.