When learning becomes a game
The first two chapters treated games as a model of the world: rational players, payoff matrices, equilibria you solve for. This chapter inverts the relationship. Here the game is a training objective — a structure we impose on optimization so that the loss surface is no longer fixed but co-created by the learner itself. The defining feature is a moving target: the thing a model is trying to beat improves whenever the model does.
Static supervised learning has a ceiling. The objective is a frozen dataset, and the best you can do is fit it; once you match the labels, the gradient goes quiet. A game-based objective never goes quiet, because the opponent (an adversary, a past version of yourself, a population of peers) keeps raising the bar. Three families dominate modern practice:
| Setup | The two sides | What the game produces | Canonical system |
|---|---|---|---|
| Adversarial | generator vs critic | A learned loss function that sharpens as samples improve | GANs |
| Self-play | agent vs its own past | An automatic curriculum of ever-stronger opponents | AlphaZero |
| Multi-agent | N agents in a shared world | Emergent strategy, cooperation, and convention | Pluribus, MADDPG |
What unifies them is the minimax skeleton from Chapter 01: a value that one party maximizes and another minimizes. The mathematics of saddle points, best responses and equilibria — built for analyzing rational agents — turns out to be exactly the mathematics of training them. The catch, returned to throughout, is that gradient descent was designed to find minima, not saddle points, so these games are notoriously harder to optimize than ordinary losses.
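The saddle-point catch can be seen in the smallest possible game. The sketch below assumes a toy value function \(f(x, y) = xy\) (not from the text above), where \(x\) minimizes and \(y\) maximizes, and runs plain simultaneous gradient descent-ascent; the function name `simultaneous_gda` and the step size are illustrative choices.

```python
import numpy as np

def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y.

    Returns the distance from the saddle point (0, 0) after every step.
    """
    radii = [float(np.hypot(x, y))]
    for _ in range(steps):
        grad_x, grad_y = y, x                        # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y      # x descends, y ascends
        radii.append(float(np.hypot(x, y)))
    return radii

radii = simultaneous_gda(1.0, 1.0)
print(f"distance from saddle at start:   {radii[0]:.3f}")
print(f"distance from saddle after 100:  {radii[-1]:.3f}")
```

Each update multiplies the squared distance from the saddle by \(1 + \text{lr}^2\), so the iterates spiral outward instead of settling at \((0, 0)\). Ordinary losses do not behave this way, which is one reason game objectives usually need extra machinery such as averaging, alternating updates, or smaller steps.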
A useful slogan: supervised learning imitates a teacher; a game manufactures one. Everything below is a different answer to the question "where does the next, slightly-harder training example come from?"
GANs as a minimax game
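A Generative Adversarial Network pits a generator \(G\), which maps noise \(z \sim p_z\) to fake samples \(G(z)\), against a discriminator \(D\), which outputs the probability that a sample is real. \(D\) wants to label reals as 1 and fakes as 0; \(G\) wants \(D(G(z))\) to read as 1. Goodfellow et al. (2014) wrote this as a single two-player zero-sum game on one value function, where \(p_{\text{data}}\) is the distribution of real samples:
\[
\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \tag{G3.1}
\]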
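Fix \(G\) and ask for the best discriminator. Write \(p_g\) for the distribution of \(G(z)\), so that \(V\) becomes an integral over \(x\) of \(p_{\text{data}}(x)\log D(x) + p_g(x)\log(1 - D(x))\). For any \(x\) this integrand is maximized pointwise: a function \(a \log d + b \log(1 - d)\) peaks at \(d = a/(a+b)\), which gives the optimal critic in closed form:
\[
D^{*}_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \tag{G3.2}
\]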
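Substituting \(D^{*}_G\) back collapses the game onto a divergence between the data distribution \(p_{\text{data}}\) and the generator's distribution \(p_g\), namely the Jensen-Shannon divergence (JSD):
\[
V(D^{*}_G, G) = -\log 4 + 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big) \tag{G3.3}
\]
Because JSD is non-negative and zero only when the two distributions coincide, the generator's best achievable value is \(-\log 4\), reached exactly when \(p_g = p_{\text{data}}\); at that point \(D^{*}_G(x) = 1/2\) everywhere.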
The honest caveats. The clean theory assumes \(D\) is trained to optimality at every step and that both networks have unlimited capacity. Neither holds. In practice GANs are infamous for training instability and mode collapse (the generator parks all its mass on a few outputs that reliably fool the current \(D\)). JSD also saturates — its gradient vanishes when the distributions barely overlap — which motivated Wasserstein GANs (Arjovsky et al., 2017), replacing JSD with an Earth-Mover distance whose gradient stays informative. The minimax framing is the right mental model; the optimization is genuinely hard, and as of 2026 diffusion and autoregressive models have largely displaced GANs for frontier image and audio synthesis, even as the adversarial idea persists everywhere from super-resolution to robustness training.
```python
# GAN minimax value on a toy: two distributions over 5 discrete bins.
# Optimal D is closed-form (EQ G3.2); the game value collapses to JSD (EQ G3.3).
import numpy as np

p_data = np.array([0.05, 0.15, 0.40, 0.25, 0.15])  # the real distribution

def value(p_g):
    p_g = np.asarray(p_g, float)
    p_g /= p_g.sum()
    D = p_data / (p_data + p_g)                           # EQ G3.2, optimal critic
    V = (p_data * np.log(D) + p_g * np.log(1 - D)).sum()  # EQ G3.1 at D*
    m = 0.5 * (p_data + p_g)                              # JSD, base-e
    jsd = 0.5*(p_data*np.log(p_data/m)).sum() + 0.5*(p_g*np.log(p_g/m)).sum()
    return V, jsd, D

for name, pg in [("bad ", [0.40, 0.30, 0.10, 0.10, 0.10]),
                 ("closer", [0.10, 0.20, 0.30, 0.25, 0.15]),
                 ("matched", p_data.copy())]:
    V, jsd, D = value(pg)
    print(f"{name}: value V={V:+.4f} JSD={jsd:.4f} check(-log4+2*JSD)={-np.log(4)+2*jsd:+.4f}")

print(f"\nfloor of the game: -log 4 = {-np.log(4):+.4f} (reached only when p_g == p_data)")
print("at the match, every D* entry equals 0.5:", np.round(value(p_data.copy())[2], 3))
```
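The saturation caveat above, and the motivation for Wasserstein GANs, can be checked on a toy. This sketch assumes two point masses on a 1-D grid of unit-spaced bins (an illustrative setup, not from the text); the helpers `jsd`, `w1` and `point_mass` are hypothetical names. It compares JSD with the 1-D Earth-Mover distance, computed as the summed absolute difference of the two CDFs.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in nats, treating 0 * log 0 as 0."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # skip empty bins
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q):
    """Earth-Mover distance on unit-spaced 1-D bins: sum of |CDF_p - CDF_q|."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def point_mass(index, n_bins=10):
    """All probability on a single bin."""
    p = np.zeros(n_bins)
    p[index] = 1.0
    return p

real = point_mass(0)
for shift in (1, 3, 6, 9):
    fake = point_mass(shift)
    print(f"shift={shift}: JSD={jsd(real, fake):.4f}  W1={w1(real, fake):.1f}")
```

With no overlap, JSD is stuck at \(\log 2\) however far apart the two masses are, so it gives the generator no hint about which direction to move. The Earth-Mover distance grows with the shift, so it still carries a usable signal.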
Self-play — AlphaZero & beyond
The cleanest game-as-curriculum is an agent playing against itself. There is no human data, no teacher, no fixed opponent: the agent's current policy is both the player and the environment it must beat. Because the opponent is a copy of you, the difficulty tracks your skill automatically — a perfectly calibrated curriculum that needs no designer.
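AlphaGo Zero (Silver et al., 2017) made this concrete for Go, and its successor AlphaZero carried the same recipe over to chess and shogi. A single network \(f_\theta(s) = (\boldsymbol{p}, v)\) outputs a move-probability vector \(\boldsymbol{p}\) and a scalar value \(v \in [-1, 1]\) estimating who wins from state \(s\). Monte-Carlo Tree Search (MCTS) uses the network to look ahead, producing a visit-count distribution over moves \(\boldsymbol{\pi}\) that is stronger than the raw prior \(\boldsymbol{p}\); the game is then played to a result \(z \in \{-1, +1\}\) (with \(0\) for a draw where draws are possible). Training pulls the network toward its own searched-and-played behavior:
\[
\ell(\theta) = (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \boldsymbol{p} \;+\; c\,\lVert \theta \rVert^2
\]
The first term fits the value head to the game result, the second is the cross-entropy that moves the policy head toward the search distribution, and the third is L2 weight regularization with coefficient \(c\).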
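As a numerical illustration of that training signal, the sketch below evaluates a per-position loss of the form used by Silver et al. (2017): squared value error, plus cross-entropy between the search distribution \(\boldsymbol{\pi}\) and the network prior \(\boldsymbol{p}\), plus an optional L2 penalty. The function name `alphazero_loss`, the three-move position, and all numbers are assumptions made for illustration.

```python
import numpy as np

def alphazero_loss(p, v, pi, z, theta=None, c=1e-4):
    """Value error + policy cross-entropy + optional L2 penalty for one position."""
    p = np.asarray(p, float)
    pi = np.asarray(pi, float)
    value_term = (z - v) ** 2
    policy_term = -float(np.sum(pi * np.log(p + 1e-12)))   # small constant guards log(0)
    reg_term = 0.0 if theta is None else c * float(np.sum(np.square(theta)))
    return value_term + policy_term + reg_term

pi = [0.7, 0.2, 0.1]   # hypothetical MCTS visit distribution over 3 moves
z = 1.0                # the player to move went on to win

untrained = alphazero_loss(p=[1/3, 1/3, 1/3], v=0.0, pi=pi, z=z)
improved = alphazero_loss(p=[0.65, 0.25, 0.10], v=0.8, pi=pi, z=z)
print(f"untrained network loss: {untrained:.4f}")
print(f"improved network loss:  {improved:.4f}")
```

The loss falls as the prior approaches the search distribution and the value estimate approaches the actual result. No term refers to human games or external labels: both targets come from the agent's own search and play.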
Each generation is stronger, so each generation's self-play games are harder, so the next network must improve to keep winning: a ratchet. The same ratchet drives AlphaStar (StarCraft II) and OpenAI Five (Dota 2), and a looser version of it appears in the policy-improvement loops inside RLHF, where a reward model plays the critic. Pluribus (Brown & Sandholm, 2019), the superhuman six-player poker agent, is built on self-play too, but poker is an imperfect-information game, so it computes a blueprint strategy via counterfactual regret minimization (CFR) and refines it with real-time search. With more than two players, CFR's guarantee of convergence to a Nash equilibrium no longer applies, so Pluribus is judged by its empirical strength against top humans rather than by a hard win/loss value or a provable equilibrium.
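The minimal engine behind self-play improvement is a value bootstrap: a state's value is estimated from the values of the states it leads to, and those estimates pull each other toward consistency. In a two-player zero-sum game the backup is a minimax, because you assume the opponent (your own copy) plays its best reply. Written in negamax form, with every value taken from the viewpoint of the player to move:
\[
V(s) = \max_{a} \big[ -\gamma\, V(s'_a) \big] \tag{G3.5}
\]
where \(s'_a\) is the state reached by playing \(a\) in \(s\), \(\gamma\) is a discount factor, and the value of a terminal state is its outcome for the player to move there.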
```python
# Tiny self-play value bootstrap on a toy game.
# A 6-node game tree: leaves have true outcomes; internal nodes back up by
# negamax (EQ G3.5). We start from a WRONG guess and let it self-correct.
import numpy as np

# children[node] = list of child indices ([] means a leaf)
children = {0: [1, 2], 1: [3, 4], 2: [4, 5], 3: [], 4: [], 5: []}
leaf_val = {3: +1.0, 4: -1.0, 5: +1.0}  # zero-sum outcomes from mover's view
gamma = 1.0

V = {n: (leaf_val[n] if n in leaf_val else 0.7) for n in children}  # bad init
print("init :", {k: round(v, 3) for k, v in V.items()})

for sweep in range(5):
    for n in [2, 1, 0]:  # back up internal nodes, leaves to root
        if children[n]:
            V[n] = max(-gamma*V[c] for c in children[n])  # negamax backup
    print(f"sweep {sweep}:", {k: round(v, 3) for k, v in V.items()})

best = max(children[0], key=lambda c: -V[c])
print(f"\nroot value V(0) = {V[0]:+.0f}; mover should play toward child {best}.")
print("targets were never labeled -- they bootstrapped from the leaves up.")
```
Multi-agent reinforcement learning
Two players is the easy case. Multi-agent reinforcement learning (MARL) drops \(N\) learners into a shared environment, each with its own policy \(\pi_i\) and reward \(r_i\). The hard part is structural: from any single agent's view, the others are part of the environment, and they are changing as they learn. The world is non-stationary — the ground that gradient descent assumes is fixed is, in fact, moving under every step.
The right object is the Markov (stochastic) game: state transitions and each agent's reward depend on the joint action \((a_1,\ldots,a_N)\). The solution concept is a Nash equilibrium of policies — no agent can improve by unilaterally changing its own. Cooperation, competition and mixtures all live here, distinguished only by how the reward functions relate:
| Reward structure | Game | What agents learn | Example |
|---|---|---|---|
| Fully aligned | cooperative | Coordination, role assignment, shared conventions | Team play, traffic |
| Fully opposed | zero-sum | Robust, minimax-optimal strategies | Go, poker |
| Mixed | general-sum | Negotiation, reciprocity, social dilemmas | Markets, Diplomacy |
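The workhorse algorithmic idea is centralized training, decentralized execution (CTDE). During training a critic may see everyone's observations and actions, which makes the environment stationary from the critic's point of view, while each agent's actor learns a policy that runs on its own local view alone. MADDPG (Lowe et al., 2017) is the canonical instance. The key intuition is that one agent's policy-gradient sign depends on what the others do, because the gradient for agent \(i\) flows through a critic evaluated at the joint action:
\[
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x,\,a \sim \mathcal{D}}\Big[ \nabla_{\theta_i}\, \mu_i(o_i)\; \nabla_{a_i} Q_i(x, a_1, \ldots, a_N)\Big|_{a_i = \mu_i(o_i)} \Big]
\]
Here \(\mu_i\) is agent \(i\)'s deterministic actor with parameters \(\theta_i\) acting on its local observation \(o_i\), \(x\) is the joint information available during training, \(\mathcal{D}\) is the replay buffer, and \(Q_i\) is agent \(i\)'s centralized critic.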
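The dependence of one agent's gradient on the others' actions can be made concrete with a toy. The sketch below assumes a hypothetical centralized critic \(Q(a_1, a_2) = a_1 a_2\) for two agents with scalar actions, which rewards agent 1 when the two actions share a sign. This critic and the helper names are illustrative and are not taken from MADDPG itself.

```python
def critic(a1, a2):
    """Hypothetical centralized critic for agent 1: high when the actions share a sign."""
    return a1 * a2

def grad_wrt_own_action(a1, a2, eps=1e-6):
    """Central finite difference of the critic in agent 1's own action."""
    return (critic(a1 + eps, a2) - critic(a1 - eps, a2)) / (2 * eps)

a1 = 0.2
for a2 in (-1.0, 0.0, 1.0):
    g = grad_wrt_own_action(a1, a2)
    if g > 1e-9:
        direction = "increase a1"
    elif g < -1e-9:
        direction = "decrease a1"
    else:
        direction = "no signal"
    print(f"other agent plays a2={a2:+.1f}: dQ/da1={g:+.3f} -> {direction}")
```

Agent 1's action is the same in every row, yet the recommended update flips with the other agent's action. A critic that ignored \(a_2\) would average these cases together, which is the failure that conditioning the critic on the joint action is meant to avoid.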
The deepest lessons in MARL come from the simplest games. A coordination game can have several equilibria, and which one a population lands on is a matter of risk and history, not just payoff. The textbook case is the Stag Hunt: hunting a stag together pays best but only if your partner also commits; hunting hare is a safe solo payoff. There are two pure Nash equilibria — (stag, stag), which is payoff-dominant, and (hare, hare), which is risk-dominant — and learners frequently converge to the safe-but-worse one.
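The two equilibria and the pull toward the safe one can be verified by brute force. The payoff numbers below are an assumption chosen for illustration (stag together pays 4, hare always pays 3, a lone stag hunter gets 0); `is_pure_nash` and `expected` are hypothetical helper names.

```python
import itertools

ACTIONS = ("stag", "hare")
# PAYOFF[(mine, theirs)] = my payoff; the game is symmetric and the numbers are illustrative.
PAYOFF = {
    ("stag", "stag"): 4, ("stag", "hare"): 0,
    ("hare", "stag"): 3, ("hare", "hare"): 3,
}

def is_pure_nash(a, b):
    """True if neither player gains by unilaterally switching actions."""
    a_ok = all(PAYOFF[(a, b)] >= PAYOFF[(alt, b)] for alt in ACTIONS)
    b_ok = all(PAYOFF[(b, a)] >= PAYOFF[(alt, a)] for alt in ACTIONS)
    return a_ok and b_ok

pure_nash = [pair for pair in itertools.product(ACTIONS, repeat=2) if is_pure_nash(*pair)]
print("pure Nash equilibria:", pure_nash)

def expected(action, p_stag):
    """Expected payoff when the partner hunts stag with probability p_stag."""
    return p_stag * PAYOFF[(action, "stag")] + (1 - p_stag) * PAYOFF[(action, "hare")]

# Stag is a best response only if the partner commits with probability >= threshold.
gain_stag = PAYOFF[("stag", "stag")] - PAYOFF[("hare", "stag")]   # loss from leaving (stag, stag)
gain_hare = PAYOFF[("hare", "hare")] - PAYOFF[("stag", "hare")]   # loss from leaving (hare, hare)
threshold = gain_hare / (gain_stag + gain_hare)
print(f"stag is worth hunting only if P(partner hunts stag) >= {threshold:.2f}")
print("hare/hare is risk-dominant:", threshold > 0.5)
```

With these payoffs a learner needs to believe its partner will hunt stag at least three times out of four before stag is worth the risk. An agent that starts out uncertain will rationally pick hare, which is how a population ends up at the safe-but-worse equilibrium.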
Where the field actually is (2026). MARL works well in two-team zero-sum settings (it inherits self-play's stability) and in tightly cooperative ones with CTDE. General-sum, partially-observable, many-agent settings remain hard: equilibria may not be unique or even exist in tractable form, credit assignment across agents is brittle, and emergent behavior is difficult to specify or guarantee. The standout recent result, Meta's CICERO playing Diplomacy, needed to fuse a planning engine with a language model precisely because raw self-play does not by itself produce the negotiation and trust-building a mixed-motive game demands.
Mechanism design & adversarial robustness
Two more places where the game frame is load-bearing — one about designing games, one about defending against them.
Mechanism design: the inverse game
Ordinary game theory takes the rules as given and predicts behavior. Mechanism design runs the arrow backward: choose the rules so that self-interested play produces the outcome you want. It is the theory behind auctions, voting, and — increasingly — AI training. RLHF is a mechanism: the reward model is an incentive structure designed so that maximizing it yields helpful behavior, and reward hacking is what happens when the mechanism is mis-specified and the agent finds an unintended winning strategy. A central result is incentive compatibility — make truth-telling a dominant strategy — exemplified by the second-price (Vickrey) auction, where bidding your true value is optimal no matter what others do.
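The incentive-compatibility claim for the second-price auction can be checked exhaustively on a small grid. This sketch assumes a single rival bidder, integer bids from 0 to 10, and ties resolved against us; `second_price_utility` is a hypothetical helper name.

```python
def second_price_utility(my_bid, my_value, rival_bids):
    """Utility in a sealed-bid second-price auction (ties are resolved against us)."""
    top_rival = max(rival_bids)
    if my_bid > top_rival:
        return my_value - top_rival   # winner pays the second-highest bid
    return 0.0

my_value = 7
bid_grid = range(0, 11)
rival_grid = range(0, 11)

# Truthful bidding must do at least as well as every alternative, for every rival bid.
truthful_is_weakly_dominant = all(
    second_price_utility(my_value, my_value, [r]) >= second_price_utility(b, my_value, [r])
    for b in bid_grid
    for r in rival_grid
)
print("truthful bid is weakly dominant on the grid:", truthful_is_weakly_dominant)
print("overbid 10 vs rival 9 :", second_price_utility(10, my_value, [9]))
print("underbid 3 vs rival 5 :", second_price_utility(3, my_value, [5]))
print("truthful 7 vs rival 5 :", second_price_utility(7, my_value, [5]))
```

The price paid never depends on your own bid, only on whether you win. Overbidding can only add wins that cost more than the item is worth, and underbidding can only forfeit wins that would have been profitable.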
Adversarial robustness: the game against your inputs
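A deployed model faces an implicit adversary: an attacker choosing the worst input within a small budget. Training a model to survive this is, again, a minimax game, but now the inner maximizer perturbs the data, not a network. In the formulation of Madry et al. (2018):
\[
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in \mathcal{S}} \; L\big(f_\theta(x + \delta),\, y\big) \Big]
\]
where \(f_\theta\) is the model, \(L\) is the training loss, and \(\mathcal{S}\) is the set of allowed perturbations, typically an \(\ell_\infty\) ball of radius \(\epsilon\).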
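For a linear classifier the attacker's inner maximization has a closed form, which makes the game easy to inspect. The sketch below assumes a logistic loss, labels in \(\{-1, +1\}\), and an \(\ell_\infty\) budget `eps`; the weights, the input, and the helper names are illustrative. It compares the closed-form worst case against 1000 random perturbations drawn from the same budget.

```python
import numpy as np

def logistic_loss(w, x, y):
    """Loss log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * np.dot(w, x))))

def worst_case_delta(w, y, eps):
    """Inner maximizer for a linear model: push every coordinate against the margin."""
    return -eps * y * np.sign(w)

w = np.array([1.5, -2.0, 0.5])    # illustrative weights
x = np.array([1.0, -1.0, 0.5])    # illustrative input, correctly classified
y = 1
eps = 0.3

clean = logistic_loss(w, x, y)
adversarial = logistic_loss(w, x + worst_case_delta(w, y, eps), y)

rng = np.random.default_rng(0)
random_losses = [
    logistic_loss(w, x + rng.uniform(-eps, eps, size=x.shape), y)
    for _ in range(1000)
]

print(f"clean loss:               {clean:.4f}")
print(f"worst random perturbation: {max(random_losses):.4f}")
print(f"closed-form adversary:     {adversarial:.4f}")
```

The adversary lowers the margin by exactly `eps` times the \(\ell_1\) norm of `w`, which no random perturbation inside the same budget can beat. For deep networks no such closed form exists, so the inner maximum is usually approximated, for example with projected gradient steps as in Madry et al. (2018).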
The same shape recurs across modern safety work: red-teaming a model is an adversary searching for a prompt that breaks it; constitutional and debate-style training pit models against each other to surface flaws; GAN-style discriminators reappear as learned detectors. The lesson of the chapter, stated once more: whenever you want a system to be robust, train it against an adversary that improves alongside it. A fixed test set is a teacher with a ceiling; a learning opponent is a teacher without one.
You have now seen the through-line of the whole volume: from the rational agents of Chapter 01, to repeated cooperation in Chapter 02, to games as the engine of modern AI here. The minimax skeleton that began as a way to analyze strategic behavior turned out to be the way to create it. Return to the Index to branch into the deep-learning and reinforcement-learning volumes where these games are implemented at scale.
References
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks.
- Silver, D. et al. (2017). Mastering the game of Go without human knowledge.
- Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker.
- Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN.
- Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.
- Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
- Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO).