Repeated & Cooperative Games

2.1

Repeated games & the shadow of the future

Chapter 01 left us with a uncomfortable fact: a Nash equilibrium can be jointly terrible. Two players, each best-responding to the other, can lock into an outcome that both would gladly trade away if only they could trust one another. The escape is not a cleverer one-shot argument — there is none. The escape is repetition. When the same players meet again and again, today's defection can be punished tomorrow, and the prospect of that punishment makes cooperation a credible, self-enforcing equilibrium.

A repeated game takes a one-shot game — the stage game — and plays it over and over, the same opponents each round. What changes is that a player's strategy is no longer a single action; it is a plan that can condition on history. "Cooperate, but defect forever the moment you defect on me" is only expressible when there is a future to threaten. The value of that future is governed by a single number, the discount factor $\delta$: how much a payoff one round from now is worth today.

EQ G2.1 — DISCOUNTED PAYOFF OF A REPEATED GAME $$ U \;=\; \sum_{t=0}^{\infty} \delta^{t}\, u_t \;=\; u_0 + \delta\,u_1 + \delta^{2} u_2 + \cdots, \qquad \delta \in [0, 1) $$

$u_t$ is the stage payoff in round $t$; $\delta$ discounts the future. $\delta$ is the "shadow of the future" — it can be read as patience, or as the per-round probability the relationship continues. A constant stream of $c$ per round is worth $c/(1-\delta)$, the same geometric sum that tames returns in reinforcement learning (Vol RL · EQ R1.3). The larger $\delta$, the more a future punishment outweighs a one-round gain from cheating — which is exactly the lever that makes cooperation rational.

This is not a vague hope; it is a theorem. The Folk Theorem says that in an infinitely repeated game with players patient enough ($\delta$ close to 1), any outcome in which every player does at least as well as their guaranteed minimum (their minmax payoff) can be sustained as a subgame-perfect equilibrium. Cooperation is one such outcome — but so are many others, which is both the power and the embarrassment of the result: repetition explains how cooperation can arise, not which equilibrium will be selected. We will be honest about that gap throughout.

The contrast with the one-shot world is stark. In a single play, a strategy is just an action and the only stable thing is mutual defection. In the repeated world, a strategy is a policy over histories and the set of equilibria explodes. The rest of this chapter is about which of those equilibria are robust — to a clever opponent, to mutation, to noise.

A relationship yields a constant payoff of $c = 5$ every round, discounted at $\delta = 0.75$. Using $U = c/(1-\delta)$ (EQ G2.1 with constant $u_t$), what is the total discounted value $U$ of cooperating forever?

$U = \dfrac{c}{1-\delta} = \dfrac{5}{1 - 0.75} = \dfrac{5}{0.25} = $ 20. The future is worth four rounds of present payoff — patient players have a lot to lose, which is precisely what deters them from cheating.

2.2

The Prisoner's Dilemma

The Prisoner's Dilemma is the cleanest specimen of the conflict between individual and collective rationality. Two suspects, held separately, can each cooperate (stay silent) or defect (betray the other). The payoffs are usually written with four letters: Temptation (defect on a cooperator), Reward (mutual cooperation), Punishment (mutual defection), and Sucker (cooperate against a defector).

You ↓ / Them →	Cooperate	Defect
Cooperate	R, R = 3, 3	S, T = 0, 5
Defect	T, S = 5, 0	P, P = 1, 1

A game is a Prisoner's Dilemma whenever the payoffs obey two inequalities. The first makes defection dominant; the second makes mutual cooperation the socially better outcome:

EQ G2.2 — WHAT MAKES IT A DILEMMA $$ T > R > P > S \qquad\text{and}\qquad 2R > T + S $$

The first chain, $T > R > P > S$, means that whatever the opponent does, defecting pays more: against a cooperator $T > R$; against a defector $P > S$. So defect strictly dominates cooperate — and two rational players both defect, landing on $(P,P) = (1,1)$. The second condition, $2R > T + S$, ensures that mutual cooperation $(R,R)$ beats the average of taking turns exploiting, so alternating is not a way out. With our numbers: $5 > 3 > 1 > 0$ ✓ and $6 > 5$ ✓. The tragedy is exact: the unique Nash equilibrium $(1,1)$ is the one outcome both players would pay to avoid $(3,3)$.

In one shot, that is the end of the story. There is no trick of reasoning that recovers cooperation, no "if I cooperate maybe they will too" — the dominance argument is airtight, and the chapter on equilibria proved it. Defection is not a failure of intelligence; it is what intelligence prescribes when the game is played exactly once. The dilemma is real, and it is why the one-shot answer to the headline true/false below is unambiguous.

True or false: in a one-shot Prisoner's Dilemma satisfying EQ G2.2, the dominant strategy for a rational player is to defect. (Answer true or false.)

Against a cooperating opponent, defect pays $T = 5$ versus cooperate's $R = 3$; against a defecting opponent, defect pays $P = 1$ versus cooperate's $S = 0$. Defecting beats cooperating in both columns, so it strictly dominates and a rational one-shot player defects. The statement is true. (The whole point of this chapter is that repetition overturns this — but only when the game repeats.)

Now repeat the game. Suppose both players adopt Grim Trigger: cooperate every round, but if the opponent ever defects, defect forever after. Is mutual cooperation stable? A player tempted to defect once gains $T - R$ this round, then is punished with $P$ instead of $R$ for the rest of time. Cooperation is an equilibrium exactly when the one-time gain does not beat the discounted stream of forfeited rewards:

EQ G2.3 — WHEN DOES COOPERATION HOLD? $$ \underbrace{T - R}_{\text{tempting gain now}} \;\le\; \underbrace{\frac{\delta}{1-\delta}\,(R - P)}_{\text{discounted future loss}} \qquad\Longleftrightarrow\qquad \delta \;\ge\; \frac{T - R}{T - P} $$

Deviating buys $T-R$ once, but trades a future of $R$ for a future of $P$, starting next round — a loss of $(R-P)$ per round discounted by $\delta/(1-\delta)$. Cooperation survives iff the future loss dominates the present gain. For our payoffs the threshold is $\delta \ge (5-3)/(5-1) = 0.5$: any pair patient enough to value tomorrow at least half as much as today can sustain cooperation forever. Below $\delta = 0.5$, the future is too faint a threat and the relationship collapses back to mutual defection. This single inequality is the engine of the whole chapter.

With $T = 5,\ R = 3,\ P = 1$, what is the minimum discount factor $\delta$ at which Grim Trigger sustains cooperation, $\delta = \dfrac{T-R}{T-P}$ (EQ G2.3)?

$\delta = \dfrac{T-R}{T-P} = \dfrac{5-3}{5-1} = \dfrac{2}{4} = $ 0.5. At or above $\delta = 0.5$ the future is heavy enough that the threat of "defect forever" deters a one-round betrayal; below it, cooperation unravels.

PYTHON · RUNNABLE IN-BROWSER

# Grim-Trigger: when is "cooperate forever" worth more than "defect once"?
import numpy as np
T, R, P, S = 5, 3, 1, 0                      # PD payoffs (EQ G2.2)

# value of always cooperating vs deviating once then being punished forever
def coop_value(delta):  return R / (1 - delta)            # R, R, R, ...
def defect_value(delta): return T + delta * P / (1 - delta)  # T now, then P forever

thresh = (T - R) / (T - P)                   # closed-form threshold (EQ G2.3)
print(f"theoretical cooperation threshold: delta >= {thresh:.3f}\n")

print(" delta   coop_value   defect_value   cooperation holds?")
for d in (0.30, 0.49, 0.50, 0.75, 0.95):
    c, x = coop_value(d), defect_value(d)
    print(f" {d:4}    {c:9.2f}    {x:11.2f}      {c >= x}")

print("\ncooperation becomes the better choice exactly at delta = 0.5,")
print("matching (T-R)/(T-P). The shadow of the future has a sharp edge.")
plot_xy([0.3,0.49,0.5,0.75,0.95],
        [coop_value(d)-defect_value(d) for d in (0.3,0.49,0.5,0.75,0.95)])

edits are live — break it on purpose

2.3

Tit-for-tat & Axelrod's tournament

Theory tells us cooperation can be an equilibrium. It does not tell us which strategy a real population will land on. In 1980 the political scientist Robert Axelrod ran an experiment to find out: he invited game theorists to submit computer strategies for the iterated Prisoner's Dilemma, then played them all against each other in a round-robin and summed each one's score. The winner — submitted by Anatol Rapoport, and the shortest program entered — was Tit-for-Tat (TFT): cooperate on the first move, then on every move after, simply copy what the opponent did last round.

EQ G2.4 — TIT-FOR-TAT $$ a^{\text{TFT}}_{t} = \begin{cases} \texttt{C} & t = 0 \\ a^{\text{opp}}_{t-1} & t \ge 1 \end{cases} $$

One line, no memory beyond the last round. Axelrod distilled its success into four properties. It is nice — never the first to defect; retaliatory — it punishes a defection immediately, so it is not a patsy; forgiving — it returns to cooperation the instant the opponent does, so grudges do not spiral; and clear — its behavior is trivially legible, which lets opponents learn that cooperating pays. Tit-for-tat never beats any single opponent — it can only tie or lose by one defection — yet it wins the tournament, because it is not trying to beat opponents, it is trying to elicit cooperation.

That last point is the deep lesson, and it is genuinely counter-intuitive. In a zero-sum world you win by making the other player lose. The iterated PD is not zero-sum: two cooperators each score $R$ per round, far more than two defectors' $P$. TFT racks up its total not by exploiting anyone but by spending most of its rounds in the lucrative $(R,R)$ groove with other nice strategies. Strategies that tried to be clever — probing for weakness, defecting "just once" — poisoned their own relationships and scored worse. Greed was self-defeating in a way that only repetition makes visible.

TFT is not a flawless oracle, and honesty requires the caveats. First, it is fragile to noise: if a single move is misimplemented — a cooperate flips to a defect by error — two TFT players fall into an endless echo of mutual recrimination, each punishing the other's punishment. Variants like Tit-for-Two-Tats (retaliate only after two defections) and Generous TFT (forgive a defection with some probability) were designed to break that echo. Second, TFT's victory is tournament-dependent: change the population of opponents and a different strategy can top the table. Against a field with no exploitable cooperators, the relentless defector ALLD can win; Axelrod's result holds because his fields were rich in nice, retaliatory strategies. The instrument below lets you feel exactly this — and the one after it shows what happens when strategies must survive, not just score.

PYTHON · RUNNABLE IN-BROWSER

# A mini Axelrod tournament: round-robin, sum each strategy's total score
import numpy as np
T, R, P, S = 5, 3, 1, 0
rng = np.random.default_rng(0)

def TFT(me, opp):   return 'C' if not opp else opp[-1]          # copy last move
def ALLD(me, opp):  return 'D'                                  # always defect
def ALLC(me, opp):  return 'C'                                  # always cooperate
def GRIM(me, opp):  return 'D' if 'D' in opp else 'C'           # never forgive
def RAND(me, opp):  return 'C' if rng.random() < 0.5 else 'D'

def match(a, b, rounds=200):
    ha, hb, sa, sb = [], [], 0, 0
    pay = {('C','C'):(R,R), ('C','D'):(S,T), ('D','C'):(T,S), ('D','D'):(P,P)}
    for _ in range(rounds):
        x, y = a(ha, hb), b(hb, ha)
        pa, pb = pay[(x, y)]; sa += pa; sb += pb; ha.append(x); hb.append(y)
    return sa, sb

strats = {'TFT':TFT, 'ALLD':ALLD, 'ALLC':ALLC, 'GRIM':GRIM, 'RAND':RAND}
score = {n: 0 for n in strats}
for i in strats:
    for j in strats:                      # round-robin, includes self-play
        score[i] += match(strats[i], strats[j])[0]

for n in sorted(score, key=lambda k: -score[k]):
    print(f"{n:5s} total score {score[n]:5d}")
print("\nTFT and GRIM win the lucrative (R,R) groove with nice strategies;")
print("ALLC is exploited, ALLD poisons its own matches. Niceness pays.")

edits are live — break it on purpose

INSTRUMENT G2.1 — ITERATED PRISONER'S-DILEMMA ARENATIT-FOR-TAT · ALWAYS-DEFECT · RANDOM · ROUND-ROBIN

ROUNDS PER MATCH 200

NOISE (MOVE FLIP %) 0%

TOURNAMENT WINNER

—

TFT TOTAL

—

ALLD TOTAL

—

Three strategies — Tit-for-Tat, Always-Defect, and Random — meet in a full round-robin (each plays every strategy, itself included), and the bars show total score. With no noise, TFT wins: it ties ALLD to within a single defection, mutually cooperates with itself, and harvests Random. Now drag the noise slider up. A single mistaken move sends two TFTs into a retaliation echo, their score collapses, and the unforgiving defector climbs the table — the precise fragility that motivated Generous-TFT. Cooperation is robust, but not unconditionally.

2.4

Evolutionary game theory & ESS

Axelrod's tournament scored strategies once. But in biology — and in any population of learning agents — a successful strategy does more than score: it reproduces. Strategies that earn more payoff become more common; strategies that earn less die out. This shift in perspective, from a rational chooser to a population under selection, is evolutionary game theory, introduced by John Maynard Smith and George Price in 1973. It needs no assumption that players are rational — only that fitter strategies spread.

The central solution concept is the Evolutionarily Stable Strategy (ESS): a strategy that, if adopted by the whole population, cannot be invaded by a small group of mutants playing anything else. Formally, an incumbent strategy $x$ is an ESS if, against itself, it does at least as well as any mutant $y$ — and, in the knife-edge case where they tie against the incumbent, $x$ beats $y$ when the mutant has to play against other mutants.

EQ G2.5 — EVOLUTIONARILY STABLE STRATEGY $$ \text{either } u(x, x) > u(y, x), \quad\text{or}\quad u(x, x) = u(y, x) \ \text{ and } \ u(x, y) > u(y, y) \qquad \forall\, y \neq x $$

$u(a, b)$ is the payoff to a player using $a$ against an opponent using $b$. The first clause says the incumbent strictly out-earns the mutant in the prevailing population (which is almost all incumbents). The second handles the tie: if the mutant matches the incumbent against incumbents, it must lose against itself, so a rare mutant cluster cannot get a foothold. Every ESS is a Nash equilibrium, but not every Nash equilibrium is an ESS — ESS is a strict refinement that adds robustness to invasion, which is exactly what "stable" should mean.

How a population gets to an ESS is described by the replicator dynamics: the share of each strategy grows in proportion to how much its fitness beats the population average. Strategies above average expand; strategies below average shrink. The rest points of this flow are the Nash equilibria, and its stable rest points are the ESSs.

EQ G2.6 — REPLICATOR DYNAMICS $$ \dot{x}_i \;=\; x_i\,\big(\, f_i(x) - \bar{f}(x) \,\big), \qquad f_i(x) = (A x)_i, \quad \bar{f}(x) = x^{\top} A x $$

$x_i$ is the fraction of the population playing strategy $i$; $A$ is the payoff matrix ($A_{ij}$ = payoff to $i$ against $j$); $f_i$ is strategy $i$'s fitness against the current mix and $\bar f$ the mean fitness. The bracket is strategy $i$'s advantage over the average — positive shares grow, negative shrink, and the simplex $\sum_i x_i = 1$ is preserved. Note the form: it is the soft, population-level cousin of a policy-gradient step (Vol RL), pushing mass toward above-average strategies. Run it on a Hawk–Dove game and it spirals into the mixed ESS, never to a pure one.

The canonical illustration is Hawk–Dove, Maynard Smith's own example. Animals contest a resource of value $V$. Hawks escalate and risk injury of cost $C$; Doves display and retreat. Two Hawks split the resource minus the expected injury, $(V-C)/2$; a Hawk meeting a Dove takes the whole $V$; two Doves share it, $V/2$. When $C > V$, neither pure strategy is stable — a population of all Hawks is invadable by Doves (who avoid the ruinous fights) and vice versa. The unique ESS is a mixed population with a Hawk fraction $p^{*} = V/C$, and the replicator dynamics converges there from almost any start.

True or false: every Nash equilibrium is automatically an Evolutionarily Stable Strategy. (Answer true or false.)

The implication runs the other way. Every ESS is a Nash equilibrium (an ESS must be a best response to itself), but ESS adds a strict invasion-resistance condition (EQ G2.5) that some Nash equilibria fail — for example, equilibria sustained only by weakly-best-response ties can be invaded by neutral mutants. So the statement is false: ESS is a strict refinement of Nash.

PYTHON · RUNNABLE IN-BROWSER

# Replicator dynamics on Hawk-Dove: converge to the mixed ESS p* = V/C
import numpy as np
V, C = 2.0, 4.0                              # resource value, injury cost (C > V)
# rows/cols: 0 = Hawk, 1 = Dove ; A[i,j] = payoff to i against j  (EQ G2.6)
A = np.array([[(V - C) / 2,  V   ],
              [ 0.0,         V / 2]])
p_star = V / C                              # predicted ESS Hawk fraction
print(f"predicted mixed ESS: Hawk fraction p* = V/C = {p_star:.3f}\n")

x = np.array([0.90, 0.10])                  # start: mostly Hawks
dt = 0.05
traj = []
for step in range(600):
    f = A @ x                              # fitness of each strategy
    phi = x @ f                            # mean fitness  x^T A x
    x = x + dt * x * (f - phi)             # replicator update
    x = x / x.sum()                        # renormalize onto the simplex
    if step % 120 == 0:
        print(f"step {step:3d}: Hawk={x[0]:.3f}  Dove={x[1]:.3f}")
    traj.append(x[0])

print(f"\nconverged Hawk fraction: {x[0]:.3f}  (matches V/C = {p_star:.3f})")
print("neither pure strategy survives: the population settles at the mix.")
plot_xy(list(range(len(traj))), traj)

edits are live — break it on purpose

INSTRUMENT G2.2 — REPLICATOR-DYNAMICS SIMULATORHAWK–DOVE · CONVERGENCE TO THE MIXED ESS · EQ G2.6

RESOURCE VALUE V 2.0

INJURY COST C 4.0

INITIAL HAWK SHARE 0.90

PREDICTED ESS p* = V/C

—

CONVERGED HAWK SHARE

—

REGIME

—

The red curve is the Hawk fraction over time; the mint dashed line is the predicted ESS $p^{*} = V/C$. Start the population anywhere and watch it flow to the same interior mix — that is what makes the ESS attracting, not merely an equilibrium. Push $C$ below $V$ and the math changes character entirely: injuries become cheap, $V/C$ exceeds 1, and Hawk becomes a pure ESS that sweeps the population — the regime readout flips to tell you which world you are in. No interaction is needed to see the answer: the default $V=2,\,C=4$ already converges to a 50/50 mix.

2.5

Cooperative games & the Shapley value

Everything so far has been non-cooperative game theory: players choose actions independently and we ask what they will do. Cooperative (or coalitional) game theory asks a different question. Suppose the players can form binding agreements and pool their efforts — the only question is how to divide the joint payoff fairly. The primitive is no longer a payoff matrix but a characteristic function $v(S)$: for every possible coalition $S$ of players, the total value that coalition can guarantee on its own.

The most celebrated answer to "what is each player's fair share?" is the Shapley value, introduced by Lloyd Shapley in 1953. Its idea is disarmingly simple: a player's worth is their average marginal contribution across every order in which the coalition could have been assembled. Imagine the players walking into a room one at a time in a random order; each player is credited with how much they add to the value of those already present. Average that credit over all $n!$ orders, and you have the Shapley value.

EQ G2.7 — THE SHAPLEY VALUE $$ \phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\big(n - |S| - 1\big)!}{n!}\,\Big[\, v(S \cup \{i\}) - v(S) \,\Big] $$

$N$ is the set of all $n$ players; the sum runs over every coalition $S$ that excludes $i$; the bracket $[v(S\cup\{i\}) - v(S)]$ is $i$'s marginal contribution to that coalition; and the weight $\tfrac{|S|!\,(n-|S|-1)!}{n!}$ is the probability that, in a uniformly random arrival order, $i$ arrives exactly after the players in $S$. So $\phi_i$ is precisely the expected marginal contribution over a random ordering of arrivals. The shares always exhaust the grand coalition's value, $\sum_i \phi_i = v(N)$ — nothing is created or lost in the division.

The Shapley value is the unique allocation satisfying four axioms that any reasonable notion of fairness should demand. Efficiency: the shares sum to the total value $v(N)$. Symmetry: two players who contribute identically to every coalition get equal shares. Null player: a player who adds nothing to any coalition gets zero. Additivity: the value of two games played together is the sum of the values played separately. That a single formula is forced by these four innocuous requirements is the result's quiet power — and it is why the Shapley value reaches far beyond economics.

That reach is the bridge to the next chapter. In machine learning, SHAP (SHapley Additive exPlanations) treats a model's input features as "players" cooperating to produce a prediction, and uses the Shapley value to attribute the prediction fairly among them — the dominant method for feature attribution in 2026, and the rare interpretability tool with an axiomatic guarantee. The same idea credits data points for a model's accuracy (data valuation) and apportions cost in shared infrastructure. Fair division, it turns out, is a computational problem the field cannot stop needing.

True or false: the Shapley value distributes a coalition's total payoff to each player according to their average marginal contribution across all possible orderings of the players. (Answer true or false.)

EQ G2.7 is exactly an average of the marginal contributions $v(S\cup\{i\}) - v(S)$, weighted by the probability of each arrival order — i.e. the expected marginal contribution over a uniformly random ordering. So the statement is true; this is the defining intuition of the Shapley value.

Three players have a characteristic function $v$ with $v(\{0\})=10,\ v(\{1\})=20,\ v(\{2\})=30,\ v(\{0,1\})=50,\ v(\{0,2\})=60,\ v(\{1,2\})=70,\ v(\{0,1,2\})=100$ (and $v(\varnothing)=0$). What is player 0's Shapley value $\phi_0$? (Round to two decimals.)

Average player 0's marginal contribution over all $3! = 6$ orders. Orders and player 0's marginal: (0,1,2)→10; (0,2,1)→10; (1,0,2)→$50-20=30$; (2,0,1)→$60-30=30$; (1,2,0)→$100-70=30$; (2,1,0)→$100-70=30$. Sum $= 10+10+30+30+30+30 = 140$; divide by 6: $\phi_0 = 140/6 = $ 23.33. (Check: by symmetry of the arithmetic, $\phi_1 = 33.33$, $\phi_2 = 43.33$, and they sum to exactly $100 = v(N)$ — efficiency.)

PYTHON · RUNNABLE IN-BROWSER

# Shapley value of a 3-player game, by averaging over all arrival orders
import numpy as np
from itertools import permutations

# characteristic function v(S): value each coalition can secure (EQ G2.7)
v = {(): 0, (0,): 10, (1,): 20, (2,): 30,
     (0,1): 50, (0,2): 60, (1,2): 70, (0,1,2): 100}
def val(S): return v[tuple(sorted(S))]

n = 3
phi = np.zeros(n)
orders = list(permutations(range(n)))         # all n! = 6 arrival orders
for order in orders:
    present = set()
    for p in order:
        before = val(present)                 # value without player p
        present.add(p)
        phi[p] += val(present) - before       # p's marginal contribution
phi /= len(orders)                            # average over orderings

for i in range(n):
    print(f"player {i}: Shapley value phi = {phi[i]:.3f}")
print(f"\nsum of shares      = {phi.sum():.3f}")
print(f"grand-coalition v(N) = {val(range(n))}   <- efficiency: they match exactly")

edits are live — break it on purpose

INSTRUMENT G2.3 — SHAPLEY-VALUE CALCULATOR3-PLAYER COALITIONAL GAME · EQ G2.7

v(A) 10

v(B) 20

v(C) 30

v(AB) 50

v(AC) 60

v(BC) 70

v(ABC) — GRAND COALITION 100

φ(A)

—

φ(B)

—

φ(C)

—

Σ φ (SHOULD EQUAL v(ABC))

—

Set the value every coalition can secure on its own, and the calculator splits the grand-coalition payoff by each player's average marginal contribution across all 6 arrival orders (EQ G2.7). The bars are the three Shapley shares; the sum readout always equals $v(\text{ABC})$ — that is efficiency, baked in. Try making one player a null player (set every coalition's value identical with and without them) and watch their share fall to zero; make two players symmetric and their bars equalize. The defaults already give the worked-exercise answer: $\phi = (23.3,\,33.3,\,43.3)$.

We have seen how games stabilize cooperation and how to divide its rewards fairly. Now watch these ideas leave the blackboard. Chapter 03 follows game theory into modern AI: self-play that bootstrapped superhuman Go and poker, GANs as a literal two-player minimax, multi-agent reinforcement learning, RLHF as a game between a model and a reward, and SHAP — the Shapley value of this chapter — as the field's leading tool for explaining what a model decided and why.

2.R

References

Axelrod, R. (1984). The Evolution of Cooperation. Basic Books — the round-robin tournaments and the four properties of tit-for-tat (EQ G2.4, §2.3); the founding popular text on repeated cooperation.
Axelrod, R. & Hamilton, W. D. (1981). The Evolution of Cooperation. Science 211(4489) — the peer-reviewed account of the iterated-PD tournaments and the evolutionary stability of tit-for-tat.
Maynard Smith, J. & Price, G. R. (1973). The Logic of Animal Conflict. Nature 246 — introduces the Evolutionarily Stable Strategy (EQ G2.5) and the Hawk–Dove game (§2.4).
Shapley, L. S. (1953). A Value for n-Person Games. In Contributions to the Theory of Games II, Princeton University Press — defines the Shapley value (EQ G2.7) and its axiomatic characterization (§2.5).
Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press — the book-length development of ESS and the replicator perspective (EQ G2.6).
Friedman, J. W. (1971). A Non-cooperative Equilibrium for Supergames. Review of Economic Studies 38(1) — Grim-Trigger equilibria and an early form of the Folk Theorem behind EQ G2.3 (§2.1).
Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 30 (SHAP) — the Shapley value applied to machine-learning feature attribution, the §2.5 bridge into Chapter 03.