THE AI ENCYCLOPEDIA — FULL TEXT EXPORT
https://ai-encyclopedia.com
Generated for LLM consumption. Interactive instruments and Python cells are not representable in text — visit the site to use them.

========================================================================
MATHEMATICS & STATISTICS
========================================================================

## STATS · Probability (https://ai-encyclopedia.com/stats/01-probability.html)

MATHEMATICS & STATISTICS · CHAPTER 01 / 08
Probability — The Logic of Uncertainty

Probability gives degrees of belief an arithmetic, built on three axioms. Conditioning is the operation that turns prior belief into posterior knowledge: it governs how a diagnostic test revises a diagnosis, how a spam filter learns, and how any model reasons under doubt.

LEVEL: INTRO
READING TIME: ≈ 24 MIN
BUILDS ON: ALGEBRA
INSTRUMENTS: BAYES BOX · LLN · TREE

IN THIS CHAPTER
1.1 Sample spaces & axioms
1.2 Conditioning & independence
1.3 Bayes' theorem
1.4 Random variables & expectation
1.5 Pitfalls: base rates
1.R References

1.1 Sample spaces, events & Kolmogorov's axioms

Probability begins by naming everything that could happen. The sample space \(\Omega\) is the set of all possible outcomes of an experiment: for one die roll \(\Omega = \{1,2,3,4,5,6\}\); for a coin flip \(\Omega = \{H, T\}\). An event is any subset of \(\Omega\) — "the roll is even" is the event \(\{2,4,6\}\). Probability is then a single function that assigns each event a number between 0 and 1, measuring how much of the sample space it occupies. In 1933 Andrei Kolmogorov reduced the entire subject to three rules.
Every theorem in this chapter — every theorem in probability — is a consequence of just these: EQ S1.1 — KOLMOGOROV'S AXIOMS $$ \text{(1)}\;\; P(A) \ge 0 \qquad \text{(2)}\;\; P(\Omega) = 1 \qquad \text{(3)}\;\; P\!\Big(\bigcup_i A_i\Big) = \sum_i P(A_i) \;\;\text{for disjoint } A_i $$ Probabilities are non-negative, the certain event has probability one, and the probability of any of several mutually exclusive events is the sum of their probabilities. That is the whole foundation. From them follow the complement rule \(P(A^c) = 1 - P(A)\), monotonicity \(A \subseteq B \Rightarrow P(A) \le P(B)\), and — for events that can overlap — inclusion–exclusion. The third axiom only adds probabilities when events cannot happen together. When they can overlap, naively summing double-counts the intersection, so you subtract it back out: EQ S1.2 — ADDITION RULE (INCLUSION–EXCLUSION) $$ P(A \cup B) \;=\; P(A) + P(B) - P(A \cap B) $$ "Probability of A or B" — where "or" is inclusive. The overlap \(P(A \cap B)\) sits inside both \(P(A)\) and \(P(B)\), so it is counted twice and must be removed once. If \(A\) and \(B\) are disjoint, \(P(A \cap B) = 0\) and this collapses to axiom 3. This single correction is the seed of nearly every "but you forgot to subtract the overlap" mistake in applied probability. For a finite, equally-likely sample space — fair dice, shuffled cards, balanced coins — every outcome carries weight \(1/|\Omega|\), and the probability of an event reduces to counting: EQ S1.3 — THE CLASSICAL (COUNTING) DEFINITION $$ P(A) \;=\; \frac{|A|}{|\Omega|} \;=\; \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}} $$ Valid only when outcomes are equally likely — a modelling assumption, not a law. Most of real life is not equally likely (a biased coin, a loaded market), which is exactly why Kolmogorov's axioms are stated abstractly: they hold whether probabilities come from symmetry, from long-run frequency, or from a degree of belief. 
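The point that counting is a modelling assumption while the axioms are not can be made concrete with a small sketch. The loaded die below is hypothetical (its weights are an illustrative assumption, not taken from the text): counting favourable faces gives the wrong answer, while summing outcome weights, which is all axiom 3 asks for, still satisfies the addition rule of EQ S1.2.

```python
from fractions import Fraction

# Hypothetical loaded die (an assumption for illustration):
# face 6 is three times as likely as each of the other faces.
weights = {1: Fraction(1, 8), 2: Fraction(1, 8), 3: Fraction(1, 8),
           4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(3, 8)}

def prob(event):
    """P(event) as the sum of the weights of its outcomes (axiom 3)."""
    return sum((weights[w] for w in event), Fraction(0))

omega = set(weights)
even = {2, 4, 6}
high = {4, 5, 6}

# Axioms 1 and 2: weights are non-negative and the total mass is one.
assert all(w >= 0 for w in weights.values())
assert prob(omega) == 1

# EQ S1.3 (counting) assumes equal weights, so it misjudges the loaded die.
counting = Fraction(len(even), len(omega))
weighted = prob(even)
print("P(even) by counting:", counting)
print("P(even) by weights: ", weighted)

# EQ S1.2 (inclusion-exclusion) still holds, because it follows from the axioms.
lhs = prob(even | high)
rhs = prob(even) + prob(high) - prob(even & high)
print("P(even or high):", lhs, "vs rule:", rhs)
```

Exact fractions are used so the two sides of the addition rule can be compared with `==` rather than a floating-point tolerance.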
Frequentist vs. Bayesian — the honest caveat. The axioms say what a probability function must obey; they are silent on what a probability means. One school reads \(P(A)\) as a long-run frequency — the fraction of times \(A\) occurs in endless repetitions (§1.4 makes this precise). The other reads it as a degree of belief that can be updated by evidence (§1.3). Both satisfy EQ S1.1 identically, which is why the two camps share every equation and disagree only on interpretation. This chapter uses whichever lens is clearer and flags the switch.

For a single draw, suppose \( P(A) = 0.5 \), \( P(B) = 0.4 \), and \( P(A \cap B) = 0.2 \). What is \( P(A \cup B) \)? By EQ S1.2, \( P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.5 + 0.4 - 0.2 = 0.7 \). The \(0.2\) overlap was sitting inside both \(0.5\) and \(0.4\), so it is removed exactly once.

PYTHON · RUNNABLE IN-BROWSER

    # Axioms by counting: a fair die, the event "even", and the addition rule
    import numpy as np

    omega = np.array([1, 2, 3, 4, 5, 6])   # sample space
    A = omega[omega % 2 == 0]              # event A: roll is even  {2,4,6}
    B = omega[omega > 3]                   # event B: roll > 3      {4,5,6}

    def P(S):
        """EQ S1.3: classical definition, favorable outcomes over possible outcomes."""
        return len(S) / len(omega)

    inter = np.intersect1d(A, B)           # A and B -> {4, 6}
    union = np.union1d(A, B)               # A or B  -> {2,4,5,6}

    print(f"P(A)            = {P(A):.4f}")
    print(f"P(B)            = {P(B):.4f}")
    print(f"P(A and B)      = {P(inter):.4f}")
    print(f"P(A or B) count = {P(union):.4f}")

    addition = P(A) + P(B) - P(inter)      # EQ S1.2
    print(f"P(A or B) rule  = {addition:.4f}")

1.2 Conditional probability & independence

The single most important operation in the subject is conditioning: revising a probability once you learn that some event has occurred. Learning that \(B\) happened shrinks your world from all of \(\Omega\) down to just \(B\), and you renormalize so the new, smaller world again has total probability one.
EQ S1.4 — CONDITIONAL PROBABILITY $$ P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 $$ Read aloud: "the probability of \(A\) given \(B\)." You keep only the part of \(A\) that lives inside \(B\) — the numerator — and rescale by the size of the new universe \(B\). Conditioning is the engine of all learning from evidence: every belief update, every diagnosis, every filter is some instance of this one ratio. Rearranging EQ S1.4 gives the multiplication rule for the joint probability of two events, and applying it across a partition gives a way to assemble a total probability from its conditional pieces: EQ S1.5 — CHAIN RULE & LAW OF TOTAL PROBABILITY $$ P(A \cap B) = P(A \mid B)\,P(B), \qquad\qquad P(A) = \sum_i P(A \mid B_i)\,P(B_i) $$ Left: a joint probability factors into a marginal times a conditional. Right: if the \(B_i\) partition \(\Omega\) (mutually exclusive, collectively exhaustive), the overall chance of \(A\) is a weighted average of its chances within each slice, weighted by how likely each slice is. This averaging step is exactly the denominator of Bayes' theorem in §1.3. Two events are independent when knowing one tells you nothing about the other — conditioning leaves the probability unchanged, \(P(A \mid B) = P(A)\). Substituting into the chain rule gives the cleaner, symmetric test: EQ S1.6 — INDEPENDENCE $$ A \perp B \quad\Longleftrightarrow\quad P(A \cap B) = P(A)\,P(B) $$ Independence means the joint factors into the product of marginals. It is a property of the probabilities, not of the physical situation — two events can be independent under one distribution and dependent under another. A frequent trap: mutually exclusive events with positive probability are the opposite of independent — if \(A\) rules out \(B\), then learning \(A\) tells you \(B\) cannot happen, so \(P(B \mid A) = 0 \ne P(B)\). Conditioning is not symmetric. 
\(P(A \mid B)\) and \(P(B \mid A)\) are different numbers in general — most rain comes with clouds, but most clouds bring no rain. Confusing the two is the prosecutor's fallacy (§1.5), and correcting it is precisely what Bayes' theorem does. Roll a fair die. Given that the result is even (\(B = \{2,4,6\}\)), what is the probability it is greater than 3 (\(A = \{4,5,6\}\))? Compute \( P(A \mid B) \). \( A \cap B = \{4, 6\} \), so \( P(A \cap B) = 2/6 \) and \( P(B) = 3/6 \). By EQ S1.4, \( P(A \mid B) = \dfrac{2/6}{3/6} = \dfrac{2}{3} \approx \) 0.667. For contrast, the unconditional \(P(A) = 3/6 = 0.5\): conditioning on "even" raises the probability from 0.5 to 0.667, because the even faces lean high. INSTRUMENT S1.3 — CONDITIONAL-PROBABILITY EXPLORER VENN ⇄ TREE · EQ S1.4–S1.6 P(A) 0.50 P(B) 0.40 P(A ∩ B) 0.20 P(A ∪ B) — P(A | B) — P(B | A) — INDEPENDENT? — The two circles are events \(A\) and \(B\); their overlap is \(P(A \cap B)\). Drag the overlap slider toward \(P(A)\,P(B)\) and the verdict flips to INDEPENDENT — that is the exact point where conditioning stops changing anything (\(P(A\mid B) = P(A)\)). Push the overlap to zero and the events become mutually exclusive: \(P(A\mid B) = 0\), the opposite of independent. The slider is clamped so the overlap can never exceed either circle — an impossible probability the axioms forbid. 1.3 Bayes' theorem — inverting the condition We can usually measure \(P(\text{evidence} \mid \text{cause})\) — how often a disease produces a positive test, how often spam contains the word "free." But what we want is the reverse: \(P(\text{cause} \mid \text{evidence})\) — given a positive test, how likely is the disease? Bayes' theorem is the bridge. 
Start from the symmetry of the chain rule, \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), and solve for the conditional you don't have: EQ S1.7 — BAYES' THEOREM $$ P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad P(E) = P(E\mid H)P(H) + P(E\mid H^c)P(H^c) $$ \(H\) is a hypothesis, \(E\) the evidence. \(P(H)\) is the prior (belief before seeing \(E\)); \(P(E\mid H)\) the likelihood (how well \(H\) predicts \(E\)); \(P(H\mid E)\) the posterior (belief after). The denominator \(P(E)\) — the total probability of the evidence under all hypotheses, from EQ S1.5 — is just the normalizer that makes the posteriors sum to one. Prior, scaled by how well the data fit, renormalized: that is all learning is. The structure is clearer in odds form, which strips away the shared denominator. The posterior odds are the prior odds multiplied by the likelihood ratio — how much more probable the evidence is under \(H\) than under its negation: EQ S1.8 — BAYES IN ODDS FORM $$ \underbrace{\frac{P(H \mid E)}{P(H^c \mid E)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(H)}{P(H^c)}}_{\text{prior odds}} \;\times\; \underbrace{\frac{P(E \mid H)}{P(E \mid H^c)}}_{\text{likelihood ratio}} $$ Evidence enters as a multiplier on your odds — a likelihood ratio of 1 leaves belief untouched; 10 multiplies your odds tenfold; \(\tfrac{1}{10}\) divides them. This form makes the central lesson of §1.5 visible at a glance: a strong test (large likelihood ratio) applied to a rare hypothesis (tiny prior odds) can still leave the posterior small. The multiplier is powerful, but it multiplies a number that started near zero. THE BASE RATE Why a 99%-accurate test can be wrong most of the time it fires. Take a disease that afflicts 1 person in 100, a test with 99% sensitivity and 95% specificity. Of 10,000 people, ~100 are sick and ~99 test positive correctly; of the 9,900 healthy, 5% — about 495 — test positive falsely. 
A positive result therefore points to a sick person only \(99/(99 + 495) \approx 17\%\) of the time. The test is excellent; the disease is rarer than the test's error rate, and the rarity dominates. Conditioning forces you to confront the base rate the headline accuracy hides. A disease has prevalence \(P(D) = 0.01\). A test has sensitivity \(P(+\mid D) = 0.99\) and specificity \(P(-\mid D^c) = 0.95\) (so the false-positive rate is \(0.05\)). You test positive. What is \( P(D \mid +) \)? By EQ S1.7: numerator \( = P(+\mid D)P(D) = 0.99 \times 0.01 = 0.0099\). Denominator \( = 0.0099 + P(+\mid D^c)P(D^c) = 0.0099 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594\). So \( P(D \mid +) = 0.0099 / 0.0594 = \) 0.167 — about one in six. A near-perfect test on a rare disease still leaves five of six positives healthy. INSTRUMENT S1.1 — BAYES-BOX DISEASE-TEST CALCULATOR EQ S1.7 · LIVE · 10,000-PERSON COHORT PREVALENCE P(D) 1.00% SENSITIVITY P(+|D) 99% SPECIFICITY P(−|Dᶜ) 95% P(DISEASE | POSITIVE) — TRUE POS · FALSE POS — P(HEALTHY | NEGATIVE) — Each cell of the bar is the 10,000-person cohort split into true/false positives and negatives. At the defaults — prevalence 1%, sensitivity 99%, specificity 95% — a positive result means disease only ~16.7% of the time: the base-rate trap, made of red false positives swamping the green true ones. Now drag prevalence up to 20% and watch the posterior leap past 80% — the same test, a different population. The lesson the headline accuracy hides: a test's worth depends on who you give it to. 
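The odds form of EQ S1.8 can be sketched in a few lines and checked against the direct form of EQ S1.7. The function names below are illustrative choices, not part of the site's instruments; the test parameters (99% sensitivity, 95% specificity) are the ones used in this section.

```python
def posterior_from_odds(prior, sensitivity, specificity):
    """P(H | E) via EQ S1.8: posterior odds = prior odds x likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = sensitivity / (1.0 - specificity)  # P(E|H) / P(E|H^c)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)                  # convert odds back to a probability

def posterior_direct(prior, sensitivity, specificity):
    """P(H | E) via EQ S1.7, used here only as a cross-check."""
    numerator = sensitivity * prior
    return numerator / (numerator + (1.0 - specificity) * (1.0 - prior))

# Same test, different populations: only the prior (prevalence) changes.
for prev in (0.001, 0.01, 0.05, 0.20):
    via_odds = posterior_from_odds(prev, 0.99, 0.95)
    direct = posterior_direct(prev, 0.99, 0.95)
    print(f"prevalence {prev:6.3f} -> P(D | +) = {via_odds:.3f} (direct: {direct:.3f})")
```

The likelihood ratio is the same in every row; the posterior moves only because the prior odds it multiplies are different.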
PYTHON · RUNNABLE IN-BROWSER

    # Monte-Carlo a conditional probability and check it against exact Bayes
    import numpy as np

    rng = np.random.default_rng(0)
    prev, sens, spec = 0.01, 0.99, 0.95    # prevalence, sensitivity, specificity
    N = 2_000_000

    disease = rng.random(N) < prev         # who is actually sick
    positive = np.where(disease,
                        rng.random(N) < sens,         # sick -> true positive
                        rng.random(N) < (1 - spec))   # well -> false positive

    mc = disease[positive].mean()          # P(D | +) by simulation
    exact = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))   # EQ S1.7

    print(f"positives observed: {positive.sum():,} of {N:,}")
    print(f"P(D | +) Monte-Carlo:   {mc:.4f}")
    print(f"P(D | +) exact (Bayes): {exact:.4f}")
    print(f"gap:                    {abs(mc - exact):.4f}")
    print(f"\nbase-rate trap: a 99%/95% test is right only {exact*100:.1f}% of the time it fires.")

1.4 Random variables, expectation & variance

A random variable \(X\) is a function that attaches a number to each outcome — the value rolled, the count of heads in ten flips, tomorrow's return. It lets us do arithmetic with chance. The two numbers that summarize a random variable are its center of mass and its spread. The expectation (mean) is the probability-weighted average of the values \(X\) can take — the long-run average if you repeated the experiment forever:

EQ S1.9 — EXPECTATION
$$ \mathbb{E}[X] \;=\; \sum_x x\,P(X = x) \quad\text{(discrete)}, \qquad \mathbb{E}[X] = \int x\,f(x)\,\mathrm{d}x \quad\text{(continuous)} $$
Expectation is linear no matter what: \(\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]\), even when \(X\) and \(Y\) are dependent — a fact used constantly and far less restrictive than it looks. Note the expected value need not be an attainable outcome: a fair die's mean is 3.5, a face it can never show.

The variance measures how far values typically stray from the mean — the expected squared deviation.
Its square root, the standard deviation, restores the original units:

EQ S1.10 — VARIANCE
$$ \mathrm{Var}(X) \;=\; \mathbb{E}\!\big[(X - \mathbb{E}[X])^2\big] \;=\; \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2 $$
The right-hand form ("mean of the square minus the square of the mean") is the one you actually compute. Variance is not linear: \(\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)\), and in general \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)\), where \(\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\). The cross term vanishes whenever \(X\) and \(Y\) are uncorrelated, which independence (\(X \perp Y\)) guarantees. That additivity is what makes the average of \(n\) independent samples have variance \(\sigma^2/n\) — the mathematical reason averaging reduces noise.

That shrinking variance is what drives the law of large numbers: as you collect more independent samples, their running average converges to the true expectation. Probability, defined abstractly by Kolmogorov, finally reconnects to the frequentist intuition of "long-run frequency": the long-run frequency of an event provably converges to its probability.

EQ S1.11 — LAW OF LARGE NUMBERS
$$ \bar{X}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow[n \to \infty]{}\; \mathbb{E}[X] $$
The sample mean of i.i.d. draws converges to the population mean. The convergence is slow: the spread of \(\bar{X}_n\) shrinks like \(1/\sqrt{n}\), so cutting your error in half takes four times the data. This \(\sqrt{n}\) rate governs the width of every confidence interval and the cost of every Monte-Carlo estimate (and is why the simulations above use millions of samples for three-decimal accuracy).

Let \(X\) be the result of one roll of a fair six-sided die. Compute the expectation \( \mathbb{E}[X] \) using EQ S1.9. Each face has probability \(1/6\): \( \mathbb{E}[X] = \tfrac{1}{6}(1+2+3+4+5+6) = \tfrac{21}{6} = 3.5 \). The mean is exactly halfway between 3 and 4 — a value the die can never actually show, which is fine: expectation is a balance point, not an outcome.

For the same fair die, compute the variance \( \mathrm{Var}(X) \) using EQ S1.10.
(Note \( \mathbb{E}[X^2] = \tfrac{1}{6}(1+4+9+16+25+36) = \tfrac{91}{6} \).) \( \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \tfrac{91}{6} - 3.5^2 = 15.1\overline{6} - 12.25 = \tfrac{35}{12} \approx 2.917 \). The standard deviation is \( \sqrt{35/12} \approx 1.71 \), a natural "typical distance from 3.5."

INSTRUMENT S1.2 — LAW-OF-LARGE-NUMBERS SIMULATOR (EQ S1.11 · RUNNING AVERAGE → E[X]; experiments: FAIR COIN, FAIR DIE)

The mint line is the running average \(\bar{X}_n\); the dashed line is the true mean (0.5 for the coin, 3.5 for the die). Press RUN and watch the early average lurch wildly, then settle — the convergence visibly slows because error shrinks only as \(1/\sqrt{n}\). A meaningful baseline is drawn before you touch anything: the first ~120 samples render on load. This is the bridge from Kolmogorov's abstract \(P\) back to "long-run frequency."

PYTHON · RUNNABLE IN-BROWSER

    # Monty Hall: simulate stay vs switch and print the win rates
    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    car = rng.integers(0, 3, N)     # door hiding the car (0,1,2)
    pick = rng.integers(0, 3, N)    # contestant's first pick

    # stay wins exactly when the first pick was already the car
    stay_wins = (pick == car)
    # the host opens a goat door; switching wins whenever staying loses
    switch_wins = ~stay_wins

    print(f"trials: {N:,}")
    print(f"P(win | STAY):   {stay_wins.mean():.4f} (theory 1/3 = 0.3333)")
    print(f"P(win | SWITCH): {switch_wins.mean():.4f} (theory 2/3 = 0.6667)")
    print(f"switching advantage: {switch_wins.mean() / stay_wins.mean():.2f}x")
    print("\nWhy: your first pick is right 1/3 of the time, so the OTHER unopened")
    print("door carries the remaining 2/3 once the host reveals a goat. Conditioning,")
    print("not intuition. (Law of large numbers: rates lock in as N grows.)")

1.5 Pitfalls — base rates & the prosecutor's fallacy

Probability's hardest errors are not algebraic — they are interpretive. The mind reaches for the wrong conditional, ignores the denominator, or forgets how rare the thing it is reasoning about really is. Two traps cause most of the real-world damage.

Base-rate neglect

The §1.3 disease test is the canonical case: people quote a 99% test and conclude a positive result means 99% chance of disease, forgetting the 1% prevalence that makes false positives outnumber true ones. The base rate is the prior in Bayes' theorem, and dropping it is mathematically equivalent to setting \(P(H) = P(H^c)\) — assuming the disease is as common as health. Whenever someone reports a conditional probability without a base rate, the number is uninterpretable.

The prosecutor's fallacy

This is the confusion of \(P(E \mid H)\) with \(P(H \mid E)\) dressed in a courtroom. A forensic match has a one-in-a-million random-match probability: \(P(\text{match} \mid \text{innocent}) = 10^{-6}\). The prosecutor declares the chance the defendant is innocent is therefore one in a million — but that swaps the conditional. The quantity that matters is \(P(\text{innocent} \mid \text{match})\), and by Bayes it depends on the suspect pool. In a city of 10 million, roughly 10 innocent people also match by chance; with one true source, a bare match makes the defendant only ~1-in-11 likely to be the source. The likelihood ratio is enormous, but multiplied against a tiny prior, the posterior is far from certainty.

EQ S1.12 — THE FALLACY, STATED EXACTLY
$$ P(E \mid H) \;\ne\; P(H \mid E), \qquad\text{related only through}\quad P(H \mid E) = P(E \mid H)\,\frac{P(H)}{P(E)} $$
The two conditionals differ by the factor \(P(H)/P(E)\) — exactly the prior-over-evidence ratio that base-rate neglect throws away. They coincide only when \(P(H) = P(E)\), a coincidence, never a rule.
"The evidence is unlikely if innocent" is not "innocence is unlikely given the evidence." Real convictions (and acquittals) have turned on this single transposition. PITFALLS The recurring shapes of error: (1) base-rate neglect — quoting \(P(E\mid H)\) and ignoring how rare \(H\) is; (2) the prosecutor's fallacy — reading \(P(E\mid H)\) as \(P(H\mid E)\); (3) the conjunction fallacy — judging \(P(A \cap B) > P(A)\), impossible since an intersection can never be larger than either part (the "Linda is a bank teller and a feminist" experiment); (4) the gambler's fallacy — believing independent trials "are due" to correct, when by EQ S1.6 past flips tell a fair coin nothing. The unifying diagnosis. Every one of these is a failure to condition correctly — to track which event is given, which is uncertain, and what the base rates are. Bayes' theorem is not just a formula; it is the discipline that makes these errors impossible to commit if you actually write the ratio down. That is why §1.3 is the spine of this chapter and of everything statistical that follows. NEXT We have treated probabilities of events and the mean and spread of a random variable in the abstract. The next chapter gives those random variables names and shapes: the Bernoulli, binomial, Poisson, normal, and exponential distributions — the recurring "characters" of uncertainty — plus the central limit theorem that explains why the bell curve appears everywhere a sum or an average does. 1.R References Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). The axiomatic foundation of EQ S1.1; measure-theoretic probability. Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of EQ S1.7. Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC. 
Harvard Stat 110 — conditioning, Bayes, expectation, LLN; free course materials.
Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review 90(4) — the Linda problem (§1.5).
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Probability as extended logic — the Bayesian reading of §1.1 and §1.3.

## STATS · Distributions (https://ai-encyclopedia.com/stats/02-distributions.html)

MATHEMATICS & STATISTICS · CHAPTER 02 / 08
Distributions — The Shapes of Randomness

A handful of named distributions account for most randomness in practice: coin flips, queue arrivals, measurement noise, market returns. Each is fixed by one or two numbers. The Central Limit Theorem explains why the Normal curve recurs so often: average enough independent quantities and it appears, regardless of where you started.

LEVEL: INTRO
READING TIME: ≈ 24 MIN
BUILDS ON: STATS 01
INSTRUMENTS: EXPLORER · CLT · TAIL RISK

IN THIS CHAPTER
2.1 Discrete distributions
2.2 Continuous distributions
2.3 Moments
2.4 The Central Limit Theorem
2.5 Heavy tails for quants
2.R References

2.1 Discrete distributions: counting outcomes

A distribution is a complete accounting of how probability is spread over the possible outcomes of a random quantity. When the outcomes are countable — heads or tails, the number of emails arriving in an hour, the roll of a die — we describe it with a probability mass function (PMF): a rule \(p(x)\) that assigns each outcome a probability, with the masses summing to one. Four discrete families cover an astonishing share of real problems, and they are all secretly about the same atom: a single yes/no trial.
The atom is the Bernoulli distribution — one trial with success probability \(p\). Everything in this section is built by repeating it, counting it, or waiting on it. EQ S2.1 — BERNOULLI & BINOMIAL $$ \text{Bernoulli: } \; p(1) = p,\; p(0) = 1 - p \qquad\qquad \text{Binomial: } \; P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k} $$ A Bernoulli variable is a single coin flip scored 1 (success, probability \(p\)) or 0 (failure). The Binomial counts how many successes appear in \(n\) independent Bernoulli flips: \(\binom{n}{k}\) is the number of ways to place the \(k\) successes, times the probability of any one such arrangement. A Binomial is just a sum of \(n\) Bernoullis — which is exactly why §2.4 will make it look Normal as \(n\) grows. Its mean is \(np\) and its variance \(np(1-p)\). Two more families finish the toolkit, and both arise by pushing the Binomial to a limit: Poisson — the law of rare events spread over a continuum of opportunity. Take a Binomial with many trials (\(n \to \infty\)) each tiny in probability (\(p \to 0\)) but with a fixed expected count \(\lambda = np\), and you get \(P(X = k) = e^{-\lambda}\lambda^{k}/k!\). It models arrivals: photons on a sensor, customers at a till, mutations along a genome, requests at a server. Its defining quirk — mean equals variance equals \(\lambda\) — is a diagnostic: if your count data has variance much larger than its mean, it is over-dispersed and the Poisson is the wrong model. Geometric — the waiting time for the first success: \(P(X = k) = (1 - p)^{k - 1} p\) for \(k = 1, 2, \dots\) (the number of flips up to and including the first head). It is the discrete cousin of the Exponential (§2.2) and is memoryless: having already waited ten flips tells you nothing about how many more remain. 
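The Poisson limit described above can be watched numerically. The sketch below holds \(\lambda = np\) fixed and lets \(n\) grow, comparing the Binomial PMF of EQ S2.1 with the Poisson formula at a single count; the values \(\lambda = 2\) and \(k = 3\) are arbitrary choices for illustration.

```python
from math import comb, exp, factorial

lam = 2.0   # fixed expected count, lambda = n * p (illustrative value)
k = 3       # compare the two PMFs at P(X = 3)

def binom_pmf(k, n, p):
    """Binomial PMF from EQ S2.1."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson PMF: exp(-lambda) * lambda^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

target = poisson_pmf(k, lam)
gaps = []
for n in (10, 100, 1_000, 10_000):
    b = binom_pmf(k, n, lam / n)     # many trials, each with small p = lambda / n
    gaps.append(abs(b - target))
    print(f"n = {n:6d}  Binomial = {b:.6f}  Poisson = {target:.6f}  gap = {gaps[-1]:.6f}")
```

The gap shrinks by roughly a factor of ten each time \(n\) grows tenfold, which is the sense in which the Poisson is the Binomial's rare-event limit.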
EQ S2.2 — POISSON & GEOMETRIC
$$ \text{Poisson}(\lambda): \; P(X = k) = \frac{e^{-\lambda}\,\lambda^{k}}{k!} \qquad\qquad \text{Geometric}(p): \; P(X = k) = (1 - p)^{k - 1} p $$
Poisson: a single parameter \(\lambda > 0\) is at once the rate, the mean, and the variance. Geometric: \(\mathbb{E}[X] = 1/p\) — a fair coin (\(p = 0.5\)) takes 2 flips on average to land its first head; a rare success (\(p = 0.01\)) takes 100. Both inherit independence from the Bernoulli atom they are built from, which is what makes their formulas so clean.

A single trial succeeds with probability \(p = 0.3\). What is the variance of this \(\text{Bernoulli}(0.3)\) variable, \(p(1 - p)\)? A Bernoulli's variance is \(\mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1 - p)\). With \(p = 0.3\): \(0.3 \times 0.7 = 0.21\). Note it is maximised at \(p = 0.5\) (variance 0.25) — a fair coin is the most unpredictable, a near-certain trial the least.

Calls arrive at a desk at a rate of one per minute, \(\lambda = 1\). Using the Poisson PMF, what is \(P(X = 2)\) — the probability of exactly two calls in a minute? (Use \(e^{-1} \approx 0.368\).) \(P(X = 2) = \dfrac{e^{-1}\,1^{2}}{2!} \approx \dfrac{0.368}{2} = 0.184\). About one minute in five-and-a-half sees exactly two calls — even though one is the expected number.
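Two claims about the Geometric in this section, the mean \(1/p\) and memorylessness, can be checked directly from its PMF. This is a minimal sketch that sums the PMF over a truncated support; the cutoff `K` and the values of `p`, `s`, and `t` are illustrative assumptions, chosen so the neglected tail is far below floating-point precision.

```python
p = 0.3     # success probability (illustrative)
K = 500     # truncation point; the mass beyond it is about 0.7**500, negligible

def geom_pmf(k, p):
    """Geometric PMF from EQ S2.2: first success on trial k."""
    return (1 - p) ** (k - 1) * p

# Mean as the probability-weighted sum of the support.
mean = sum(k * geom_pmf(k, p) for k in range(1, K + 1))

def tail(m):
    """P(X > m): no success in the first m trials."""
    return sum(geom_pmf(k, p) for k in range(m + 1, K + 1))

# Memorylessness: P(X > s + t | X > s) should equal P(X > t).
s, t = 10, 4
conditional = tail(s + t) / tail(s)
unconditional = tail(t)

print(f"E[X] = {mean:.6f}   (1/p = {1 / p:.6f})")
print(f"P(X > {s + t} | X > {s}) = {conditional:.6f}")
print(f"P(X > {t})           = {unconditional:.6f}")
```

Having already waited `s` trials leaves the distribution of the remaining wait unchanged, so the two printed tail probabilities agree.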
PYTHON · RUNNABLE IN-BROWSER

    # Sample Binomial, Poisson, Normal -- empirical vs theoretical mean and var
    import numpy as np

    rng = np.random.default_rng(0)
    M = 200_000               # samples per family
    n, p = 10, 0.3            # Binomial(10, 0.3)
    lam = 4.0                 # Poisson(4)
    mu, sig = 0.0, 2.0        # Normal with mean 0 and standard deviation 2

    # family name -> (samples, theoretical mean, theoretical variance)
    draws = {
        "Binomial(10,0.3)": (rng.binomial(n, p, M), n*p, n*p*(1-p)),
        "Poisson(4)":       (rng.poisson(lam, M), lam, lam),   # mean == var == lambda
        "Normal(0,2)":      (rng.normal(mu, sig, M), mu, sig**2),
    }

    print(f"{'family':18}{'emp mean':>10}{'theory':>9}{'emp var':>10}{'theory':>9}")
    for name, (s, m, v) in draws.items():
        print(f"{name:18}{s.mean():10.3f}{m:9.3f}{s.var():10.3f}{v:9.3f}")

    print("\nempirical moments track the formulas to ~1% at M = 200k;")
    print("note Poisson's mean and variance are both 4 -- its signature.")

INSTRUMENT S2.1 — DISTRIBUTION EXPLORER (PMF / PDF + SAMPLED HISTOGRAM · 6 FAMILIES: BINOMIAL, POISSON, GEOMETRIC, UNIFORM, NORMAL, EXPONENTIAL; defaults: TRIALS n = 20, SUCCESS p = 0.40)

The mint curve is the exact theoretical PMF (bars, for discrete families) or PDF (continuous); the blue outline is a histogram of 4,000 fresh samples. Switch to Poisson and notice the readouts for mean and variance stay locked together. Drag a Binomial's \(n\) up and watch the discrete bars climb into a smooth bell — a preview of §2.4. The two sliders rename themselves to whatever the chosen family's parameters are.

2.2 Continuous distributions: spreading mass over a line

When outcomes form a continuum — a height, a temperature, a wait in seconds — no single point can carry positive probability (there are uncountably many points, so positive mass on each could not sum to one). Instead we use a probability density function (PDF) \(f(x)\): probability is area under the curve, so \(P(a \le X \le b) = \int_a^b f(x)\,\mathrm{d}x\) and the total area is one. Three continuous families dominate the introductory landscape.
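The idea that probability is area under a density, and that a single point carries none, can be checked with a simple numerical integral. The density below, \(f(x) = 2x\) on \([0, 1]\), is a hypothetical example chosen only because its areas are easy to verify by hand; it is not one of the named families in this chapter.

```python
def f(x):
    """Hypothetical density for illustration: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def area(a, b, steps=10_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

total = area(0.0, 1.0)    # total probability
upper = area(0.5, 1.0)    # P(0.5 <= X <= 1)
print(f"total area     = {total:.6f}")
print(f"P(X in [.5,1]) = {upper:.6f}")

# Shrinking an interval around x = 0.5 drives its probability to zero,
# even though the density at that point is f(0.5) = 1.
widths = (0.1, 0.01, 0.001)
slivers = [area(0.5 - w / 2, 0.5 + w / 2) for w in widths]
for w, s in zip(widths, slivers):
    print(f"width {w:6.3f} -> probability {s:.6f}")
```

The density value at a point is a rate of probability per unit length, not a probability, which is why it can exceed 1 elsewhere on the interval without breaking the axioms.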
The Uniform on \([a, b]\) is the flat distribution — every value in the interval equally likely. It is the bedrock of simulation: a computer's random-number generator produces \(\text{Uniform}(0, 1)\) draws, and every other distribution is manufactured from them by transformation. EQ S2.3 — THE NORMAL (GAUSSIAN) DENSITY $$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad x \in \mathbb{R} $$ The bell curve, fixed entirely by its mean \(\mu\) (where it is centred) and standard deviation \(\sigma\) (how wide). The exponent is a parabola in \(x\), so the log-density is a downward parabola — the source of the curve's symmetric, rapidly-decaying tails. The 68–95–99.7 rule: roughly 68% of mass lies within \(1\sigma\) of the mean, 95% within \(2\sigma\), 99.7% within \(3\sigma\). Standardising via \(z = (x - \mu)/\sigma\) collapses every Normal onto one standard Normal, \(\mathcal{N}(0, 1)\) — the reason a single z-table once sufficed for all of statistics. The Exponential is the continuous waiting time between Poisson events: if arrivals come at rate \(\lambda\), the gap until the next one is \(\text{Exp}(\lambda)\), with density \(f(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Like the Geometric, it is memoryless — the only continuous distribution that is. A bus that arrives "on average every 10 minutes" as a Poisson process gives you no credit for the 9 minutes you've already waited; your expected remaining wait is still 10. This is famously counter-intuitive and is exactly why memorylessness deserves a name. EQ S2.4 — UNIFORM & EXPONENTIAL $$ \text{Uniform}(a,b): \; f(x) = \frac{1}{b - a} \;\; (a \le x \le b) \qquad\qquad \text{Exponential}(\lambda): \; f(x) = \lambda e^{-\lambda x} \;\; (x \ge 0) $$ Uniform: \(\mathbb{E}[X] = \tfrac{a + b}{2}\), \(\operatorname{Var}(X) = \tfrac{(b - a)^2}{12}\) — that \(1/12\) returns in the CLT instrument. 
Exponential: \(\mathbb{E}[X] = 1/\lambda\), \(\operatorname{Var}(X) = 1/\lambda^2\); its standard deviation equals its mean, and the distribution is strongly right-skewed — many short waits, a few long ones. The Exponential is to the Poisson process what the Geometric is to a sequence of Bernoulli trials: the waiting time until the next event, measured in continuous rather than discrete time.

A random number is drawn uniformly from \([0, 1]\). What is its variance, \(\dfrac{(b - a)^2}{12}\)? With \(a = 0,\ b = 1\): \(\operatorname{Var}(X) = \dfrac{(1 - 0)^2}{12} = \dfrac{1}{12} \approx 0.0833\). This number returns in §2.4: the mean of \(n\) unit-uniform draws has variance \(1/(12n)\), which sets the width of the bell the Central Limit Theorem produces.

2.3 Moments: four numbers that describe a shape

You don't need the whole PDF to talk about a distribution; four summary numbers — the moments — capture its location, spread, lopsidedness, and tail-heaviness. They are how one distribution gets compared to another, and how you decide whether the Normal is a fair description of your data.

EQ S2.5 — THE FOUR MOMENTS
$$ \mu = \mathbb{E}[X], \quad \sigma^2 = \mathbb{E}\big[(X - \mu)^2\big], \quad \text{skew} = \mathbb{E}\!\left[\left(\tfrac{X - \mu}{\sigma}\right)^3\right], \quad \text{kurt} = \mathbb{E}\!\left[\left(\tfrac{X - \mu}{\sigma}\right)^4\right] $$
Mean \(\mu\) — the centre of mass, the balance point of the density.
Variance \(\sigma^2\) — the average squared distance from the mean; its square root \(\sigma\) is the standard deviation, in the same units as the data.
Skewness — the standardised third moment; \(0\) for any symmetric distribution, positive when the right tail is longer (incomes, wait times), negative when the left tail is.
Kurtosis — the standardised fourth moment; it measures how much mass sits in the tails. The Normal has kurtosis exactly \(3\), so practitioners quote excess kurtosis \(= \text{kurt} - 3\): zero for a Normal, positive for the heavy-tailed distributions of §2.5.

Each higher moment refines the picture.
Mean and variance alone cannot distinguish a symmetric bell from a lopsided ramp with the same centre and spread — you need skew. And two distributions can share mean, variance, and skew yet differ wildly in how often they throw extreme values — that difference lives in the kurtosis, which is the single most important number when randomness can hurt you (§2.5).
A caution that experts insist on. Higher moments are estimated from data far less reliably than lower ones: a sample skew or kurtosis is dominated by the few most extreme points you happened to observe, so it is noisy and, for genuinely heavy-tailed data, may not even converge. For some distributions in §2.5 the higher moments are infinite — they do not exist at all. Treat sample kurtosis as a hint, not a measurement.

| Distribution | Mean | Variance | Skew | Excess kurtosis |
|---|---|---|---|---|
| Normal (\(\mu, \sigma^2\)) | μ | σ² | 0 | 0 |
| Uniform (\(a, b\)) | (a+b)/2 | (b−a)²/12 | 0 | −1.2 |
| Exponential (\(\lambda\)) | 1/λ | 1/λ² | +2 | +6 |
| Poisson (\(\lambda\)) | λ | λ | 1/√λ | 1/λ |
| Student-t (\(\nu\)) | 0 (ν>1) | ν/(ν−2) (ν>2) | 0 (ν>3) | 6/(ν−4) (ν>4) |

Read the kurtosis column as a "danger gauge." The Uniform is platykurtic (negative excess) — bounded, no surprises. The Exponential and especially the Student-t are leptokurtic (positive excess) — far more prone to outliers than a Normal of the same variance. A Student-t with \(\nu = 5\) has excess kurtosis \(6/(5 - 4) = 6\), and for \(\nu \le 4\) its fourth moment, and hence its kurtosis, is infinite.
2.4 The Central Limit Theorem: why the Normal is everywhere
Here is the result that makes the whole subject hang together — and the reason the Normal earns its place at the centre of statistics. Take any distribution with a finite mean \(\mu\) and finite variance \(\sigma^2\). Draw \(n\) independent samples from it and average them. As \(n\) grows, the distribution of that average — properly recentred and rescaled — converges to a standard Normal, regardless of the shape you started from.
EQ S2.6 — THE CENTRAL LIMIT THEOREM $$ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\Longrightarrow\quad \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty $$ The sample mean \(\bar{X}_n\) is itself random; the CLT pins down its distribution. Two facts fall out for free. First, \(\bar{X}_n\) centres on \(\mu\) — the average is an unbiased estimate of the true mean. Second, its spread shrinks as \(\sigma/\sqrt{n}\): the standard error falls like \(1/\sqrt{n}\), so to halve your uncertainty you must quadruple your sample. The \(\xrightarrow{d}\) means "converges in distribution." The CLT does not require the \(X_i\) to be Normal — only that they share a distribution with finite variance, the one condition §2.5 will show is not always met. This is why the bell curve appears unbidden across nature and engineering: any quantity that is the sum of many small independent contributions — measurement error from countless tiny perturbations, a person's height from thousands of genetic and environmental nudges, the total noise on a sensor — is approximately Normal by construction. The Normal is not assumed; it is produced, again and again, by aggregation. KEY Convergence is fast for friendly shapes, slow for skewed ones. For a symmetric starting distribution like the Uniform, the average of just \(n = 5\)–\(10\) draws already looks convincingly bell-shaped. For a strongly skewed one like the Exponential you may need \(n = 30\)–\(50\) before the bell is clean — the textbook "\(n \ge 30\)" rule of thumb is a rough average, not a law. The shape of the parent distribution governs the rate of convergence, even though it never governs the limit. 
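The claim that the parent's shape sets the speed of convergence can be checked numerically. The sketch below is one way to do it, under stated assumptions: the seed, the number of repetitions `M`, and the helper names (`sample_skew`, `skew_of_means`, `uniform_draws`, `exponential_draws`) are all illustrative choices, not part of the chapter. It relies on the standard fact that the skewness of a mean of \(n\) iid draws is the parent's skewness divided by \(\sqrt{n}\), so an Exponential parent (skew 2) should give roughly \(2/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
M = 50_000                       # number of sample means per setting (illustrative)

def sample_skew(v):
    """Standardised third moment of a 1-D array."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

def uniform_draws(size):
    return rng.uniform(0.0, 1.0, size=size)

def exponential_draws(size):
    return rng.exponential(1.0, size=size)

def skew_of_means(draw, n):
    """Skewness of M sample means, each the average of n draws."""
    return sample_skew(draw((M, n)).mean(axis=1))

for n in (1, 5, 30):
    su = skew_of_means(uniform_draws, n)
    se = skew_of_means(exponential_draws, n)
    print(f"n={n:3d}  Uniform skew {su:+.3f}   "
          f"Exponential skew {se:+.3f}   (2/sqrt(n) = {2 / np.sqrt(n):.3f})")
```

The Uniform column should hover near zero at every \(n\), because a symmetric parent has no skew to lose; the Exponential column should shrink only at the \(1/\sqrt{n}\) rate.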
PYTHON · RUNNABLE IN-BROWSER

# CLT demo: average N Uniform(0,1) draws, M times, and histogram the means
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 40_000  # N draws per mean, M means
means = rng.uniform(0, 1, size=(M, N)).mean(axis=1)

# CLT prediction for the means: centre 0.5, variance (1/12)/N
print(f"empirical mean of means: {means.mean():.4f} (theory 0.5000)")
print(f"empirical var of means: {means.var():.5f} (theory {(1/12)/N:.5f})")
print(f"=> standard error shrinks like 1/sqrt(N): {(1/12/N)**0.5:.4f}")

# a bell emerges from a FLAT parent -- plot the density histogram
hist, edges = np.histogram(means, bins=45, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("\nthe parent Uniform is flat; the average of 30 of them is a clean bell.")
# plot_xy is not defined in this cell; it is presumably supplied by the site's in-browser runtime
plot_xy(centers, hist)

INSTRUMENT S2.2 — CENTRAL LIMIT THEOREM SIMULATOR · AVERAGE OF N IID DRAWS → NORMAL · EQ S2.6 (interactive; parent options: Uniform, Exponential, Bernoulli; readouts: parent shape, standard error σ/√N, sample skew)
At \(N = 1\) you see the raw parent — flat, skewed, or two-spiked. Drag \(N\) upward: 10,000 sample means are re-histogrammed each step and the mint Normal curve (mean \(\mu\), width \(\sigma/\sqrt{N}\)) is overlaid for comparison. Watch the Exponential's heavy right skew melt away far more slowly than the Uniform's — the parent shape sets the speed of convergence, never the destination. The sample-skew readout marches toward zero as the bell forms.
You average \(n = 4\) draws from \(\text{Uniform}(0, 1)\), whose standard deviation is \(\sigma = 1/\sqrt{12} = 0.2887\). By what factor \(1/\sqrt{n}\) does the standard error of the mean shrink relative to a single draw?
The standard error is \(\sigma/\sqrt{n}\), so relative to one draw it shrinks by \(1/\sqrt{n} = 1/\sqrt{4} = 1/2 = \) 0.5. Quadrupling the sample halves the error — the \(1/\sqrt{n}\) law that governs every poll, A/B test, and Monte-Carlo estimate.
2.5 Heavy tails for quants: when the Normal lies The CLT comes with fine print, and on a trading desk that fine print is the whole story. The theorem requires a finite variance. Many real processes — financial returns above all — produce extreme moves far more often than a Normal of the same everyday spread would ever allow. These are heavy-tailed (or "fat-tailed") distributions, and mistaking them for Normal is how risk models blow up. Three families matter, in rising order of danger: Student-t (\(\nu\)). A bell that looks Normal in the middle but decays far more slowly in the tails, governed by the degrees of freedom \(\nu\). Small \(\nu\) means fat tails; as \(\nu \to \infty\) it converges back to the Normal. It is the workhorse for daily and weekly asset returns, where \(\nu \approx 3\)–\(6\) typically fits — and where, below \(\nu = 4\), the kurtosis is infinite and below \(\nu = 2\) even the variance is infinite, voiding the CLT outright. Lognormal. If \(\log X\) is Normal, then \(X\) is lognormal: strictly positive, right-skewed, with a long upper tail. It is the natural model for quantities that grow multiplicatively — stock prices (Quant 03's geometric Brownian motion), income, city sizes, file sizes. Because it is a transformed Normal, the CLT applies to its logarithm, not to it. Power laws (Pareto). The heaviest tails of all: \(P(X > x) \propto x^{-\alpha}\). The tail decays only polynomially, so for small enough exponent \(\alpha\) the variance — or even the mean — fails to exist, and sample averages never settle down. Power laws describe wealth, city populations, word frequencies, network degrees, and the size of catastrophic losses. Whether they are the right model for financial returns, versus a Student-t with merely fattish tails, remains genuinely contested among quants: the data in the extreme tail is, by definition, sparse, and the two models are hard to tell apart from any finite sample. 
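To put a number on "decays far more slowly in the tails" for the Student-t family above, the following sketch compares two-sided tail probabilities of a standard Normal with those of a Student-t rescaled to unit variance. It is a minimal illustration under stated assumptions: the function names are invented here, the t density is the standard one (EQ S2.7), and the integral is done with a crude midpoint rule to avoid any dependency beyond the standard library.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard Normal, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

def t_density(x, nu):
    """Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_two_sided_tail(k, nu, steps=20_000):
    """P(|X| > k) for a Student-t rescaled to unit variance (requires nu > 2).

    A unit-variance t is T * sqrt((nu - 2) / nu), so the threshold k on X
    corresponds to k * sqrt(nu / (nu - 2)) on the unscaled T.
    """
    t = k * math.sqrt(nu / (nu - 2))
    h = t / steps
    # Midpoint rule for the mass on [0, t]; symmetry gives the mass on [-t, t].
    inside = sum(t_density((i + 0.5) * h, nu) for i in range(steps)) * h
    return 1.0 - 2.0 * inside

k = 4.0  # threshold in standard deviations
p_normal = normal_two_sided_tail(k)
for nu in (4, 10, 30):
    p_t = t_two_sided_tail(k, nu)
    print(f"nu={nu:2d}  P(|X|>{k:.0f} sd): t = {p_t:.2e}, "
          f"Normal = {p_normal:.2e}, ratio = {p_t / p_normal:.1f}")
```

The ratio should be large at small \(\nu\) and fall toward 1 as \(\nu\) grows, matching the statement that the Student-t converges back to the Normal.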
EQ S2.7 — STUDENT-t & POWER-LAW TAILS
$$ f_{t}(x;\nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \qquad\qquad P(X > x) \;\sim\; x^{-\alpha} \;\;\text{(power law)} $$
The Student-t density decays like \(x^{-(\nu + 1)}\) for large \(x\) — a polynomial tail, versus the Normal's \(e^{-x^2/2}\) which is astronomically thinner. Integrating that density gives a survival function \(P(X > x) \sim x^{-\nu}\), which is exactly a power-law tail with \(\alpha = \nu\). The practical consequence: a "six-sigma" daily move is a once-in-a-million-years event under a Normal, but happens every few years in real markets. Risk built on the Normal systematically under-prices the catastrophe; this miscalibration is the proximate cause of more than one financial crisis.
There is a deeper reason heavy tails persist. A generalised CLT says that sums of infinite-variance variables converge not to the Normal but to the stable family (of which the Normal is the lone finite-variance member). So heavy-tailedness is not a failure of aggregation to "kick in" — for these processes, aggregation has a different, fatter-tailed attractor. The Normal is the special case, not the rule.
INSTRUMENT S2.3 — TAIL-RISK OVERLAY · NORMAL vs STUDENT-t · TAIL-PROBABILITY READOUT (interactive; controls: degrees of freedom ν, threshold in σ, LINEAR / LOG-y view; readouts: P(|X| > t) under each model and their ratio)
Both curves are scaled to unit variance, so they agree in the bland centre — the danger hides in the tails. Switch to LOG-y to see the gap explode: the Normal plunges as a downward parabola while the Student-t falls far more gently, only logarithmically in \(x\) (a power-law tail, which would be a straight line on log-log axes). Push the threshold to 4–5σ and read the ratio: at \(\nu = 4\) a 4σ event is many times likelier under the t than the Normal. Raise \(\nu\) toward 30 and the two distributions merge — the Student-t becoming Normal in the limit.
CONTESTED How fat are the tails, really?
That financial returns are heavier-tailed than Normal is settled and uncontroversial. How heavy is not. One camp (after Mandelbrot) argues for true power laws with possibly infinite variance; another fits finite-variance Student-t or stochastic-volatility models that generate fat tails without abandoning the CLT. The disagreement is hard to resolve precisely because extreme events are rare, so the deciding data is scarce. The honest engineering posture: assume tails fatter than Normal, stress-test against several tail models, and never let a single distributional assumption carry your entire risk number. NEXT One distribution describes one quantity; the next chapter asks how two quantities move together. Stats 03: descriptive statistics and correlation — summarising real data with means, medians, and quantiles, and measuring the linear (Pearson) and rank (Spearman) association between variables, with the warning that opens every honest course: correlation is not causation. 2.R References Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer — the standard modern reference for the distributions, moments, and convergence results in this chapter. Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer — the full lineage of EQ S2.6, from de Moivre and Laplace to Lindeberg and Lévy. Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25. The original derivation of the Student-t distribution (EQ S2.7), written at the Guinness brewery. Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419. The founding argument that financial returns are heavy-tailed and possibly infinite-variance — the contested claim of §2.5. Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236. 
A careful survey of the fat-tail evidence and why pinning down the tail exponent is genuinely hard.
Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703. The methodological reference on fitting and — crucially — testing power-law tails against alternatives.
## STATS · Correlation & Causation (https://ai-encyclopedia.com/stats/03-descriptive-correlation.html)
MATHEMATICS & STATISTICS · CHAPTER 03 / 08
Correlation & Causation
Correlation measures how two variables move together. Moving from correlation to causation requires a causal model, not more data. This chapter builds the toolkit in order: the summaries that describe one variable, the coefficients that describe two, and the reason a tight correlation can still mislead about cause.
LEVEL INTRO READING TIME ≈ 24 MIN BUILDS ON STATS 01–02 INSTRUMENTS SCATTER · SIMPSON · DAG
IN THIS CHAPTER 3.1 Summarizing one variable 3.2 Covariance & Pearson 3.3 Rank correlation 3.4 Correlation ≠ causation 3.5 Causal thinking 3.R References
3.1 Summarizing one variable
Before two variables can be related, each must be described. A column of numbers is summarized along two axes: where it sits (location) and how spread out it is (scale). Get these two right and most of descriptive statistics follows.
The two headline measures of location are the mean and the median. The mean is the balance point; the median is the middle value once the data is sorted. They agree on symmetric data and disagree — sometimes wildly — on skewed data.
EQ S3.1 — MEAN & VARIANCE
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(x_i - \bar{x}\big)^2, \qquad s = \sqrt{s^2} $$
\(\bar{x}\) is the arithmetic mean; \(s^2\) the sample variance — the average squared distance from the mean; \(s\) the standard deviation, in the same units as the data. The divisor \(n-1\) (not \(n\)) is Bessel's correction: dividing by \(n\) systematically under-estimates the spread because the deviations are taken from the sample mean — which is itself fit to the data — so one degree of freedom is already spent. Variance is the engine of everything that follows: correlation is just shared variance, normalized.
The mean has one fatal weakness: it is not robust. A single extreme value drags it arbitrarily far, while the median barely flinches. This is the first lesson of robust statistics — and it returns the moment a single outlier hijacks a correlation in §3.2.
Quantiles generalize the median. The \(q\)-quantile is the value below which a fraction \(q\) of the data falls: the median is the \(0.5\)-quantile, the quartiles are the \(0.25\) and \(0.75\) quantiles, and the interquartile range (IQR \(= Q_3 - Q_1\)) is a robust measure of scale that ignores the tails entirely.

| Measure | What it captures | Robust to outliers? | Breakdown point |
|---|---|---|---|
| Mean | Location (balance point) | No | 0% |
| Median | Location (middle value) | Yes | 50% |
| Std. deviation | Scale (typical spread) | No | 0% |
| IQR | Scale (middle 50%) | Yes | 25% |

The breakdown point is the fraction of the data you can corrupt before the statistic becomes meaningless. The mean breaks with one bad point (0%); the median survives until half the data is corrupted (50%). When you do not yet trust your data, summarize it with the median and IQR first.
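Bessel's correction in EQ S3.1 is easy to verify by simulation. The sketch below is illustrative only: the sample size, number of trials, seed, and the Normal parent with true variance 4 are all assumptions chosen to make the bias visible. Dividing by \(n\) should land near \(\sigma^2 (n-1)/n\), while dividing by \(n-1\) should land near \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
n, trials = 5, 200_000                # a small n makes the bias easy to see
true_var = 4.0                        # assumed population variance (sd = 2)

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

var_div_n = samples.var(axis=1, ddof=0).mean()    # divisor n
var_div_n1 = samples.var(axis=1, ddof=1).mean()   # divisor n-1 (Bessel)

print(f"average of divide-by-n estimates:     {var_div_n:.3f} "
      f"(theory {(n - 1) / n * true_var:.3f})")
print(f"average of divide-by-(n-1) estimates: {var_div_n1:.3f} "
      f"(theory {true_var:.3f})")
```

With \(n = 5\) the uncorrected estimator is biased low by a factor of \(4/5\); the bias fades as \(n\) grows, which is why the correction matters most for small samples.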
PYTHON · RUNNABLE IN-BROWSER

# Location & scale: mean vs median, std vs IQR -- and how one outlier hits each
import numpy as np

x = np.array([2, 4, 4, 5, 5, 6, 7, 8, 9, 10], dtype=float)

def summarize(v):
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return dict(mean=v.mean(), median=med,
                std=v.std(ddof=1),  # ddof=1 => Bessel's n-1
                iqr=q3 - q1)

print("clean data:", {k: round(val, 2) for k, val in summarize(x).items()})

x_bad = x.copy()
x_bad[-1] = 1000.0  # one wild outlier
print("with outlier:", {k: round(val, 2) for k, val in summarize(x_bad).items()})

print("\nmean moved by", round(summarize(x_bad)['mean'] - summarize(x)['mean'], 2))
print("median moved by", round(summarize(x_bad)['median'] - summarize(x)['median'], 2))
print("=> the mean & std chase the outlier; median & IQR barely notice it.")

3.2 Covariance & Pearson correlation
With two variables \(X\) and \(Y\), the first question is whether they move together. Covariance answers it directly: when \(X\) is above its mean, is \(Y\) usually above its mean too? Multiply the two deviations and average — positive products dominate when they rise together, negative when one rises as the other falls.
EQ S3.2 — COVARIANCE
$$ \operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big) $$
Each term is the product of two signed deviations. Same side of the mean → positive; opposite sides → negative. The sum's sign tells you the direction of the association. But its magnitude is uninterpretable: covariance carries the units of \(X\) times the units of \(Y\), so rescaling height from metres to centimetres multiplies it by 100 without changing anything real. Covariance has the right sign but the wrong scale.
The fix is to divide out the scale. Normalize covariance by the two standard deviations and you get the Pearson correlation coefficient \(r\) — a pure, unitless number locked to \([-1, +1]\).
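The units problem is quick to see in numbers. In this sketch the heights and weights are made-up values for illustration, not data from the chapter; `np.cov` divides by \(n-1\) by default, matching EQ S3.2. Converting height from metres to centimetres should multiply the covariance by 100 and leave the correlation untouched.

```python
import numpy as np

# Hypothetical measurements, invented for this illustration
height_m = np.array([1.52, 1.60, 1.68, 1.75, 1.83, 1.90])
weight_kg = np.array([50.0, 58.0, 63.0, 72.0, 80.0, 91.0])
height_cm = 100.0 * height_m          # same heights, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]      # units: m * kg
cov_cm = np.cov(height_cm, weight_kg)[0, 1]    # units: cm * kg
r_m = np.corrcoef(height_m, weight_kg)[0, 1]   # unitless
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"cov (metres):      {cov_m:.4f}")
print(f"cov (centimetres): {cov_cm:.4f}   ratio = {cov_cm / cov_m:.1f}")
print(f"r   (metres):      {r_m:.4f}")
print(f"r   (centimetres): {r_cm:.4f}")
```

Dividing by the two standard deviations cancels the unit change exactly, because the standard deviation of the rescaled variable grows by the same factor of 100.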
EQ S3.3 — PEARSON CORRELATION $$ r = \frac{\operatorname{cov}(X, Y)}{s_X\, s_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}} $$ \(r = +1\) is a perfect increasing line, \(r = -1\) a perfect decreasing line, \(r = 0\) no linear association. Geometrically, \(r\) is the cosine of the angle between the two mean-centred data vectors — which is exactly why \(|r| \le 1\). The square, \(r^2\), is the fraction of \(Y\)'s variance a straight line through \(X\) explains. Pearson sees only straight lines: it can read \(r \approx 0\) off data that is perfectly but non-linearly related (a clean parabola), and it is dragged hard by a single outlier — both failures you can trigger in the instrument below. WORKED EXAMPLE ▾ 01 Three points: \((1,2),(2,4),(3,6)\). Means \(\bar{x}=2,\ \bar{y}=4\). Deviations in \(x\): \((-1,0,1)\); in \(y\): \((-2,0,2)\). 02 Cross-products \((x_i-\bar{x})(y_i-\bar{y})\): \((2,\ 0,\ 2)\), sum \(= 4\). So \(\operatorname{cov} = 4/(3-1) = 2\). 03 \(\sum(x-\bar{x})^2 = 2\), \(\sum(y-\bar{y})^2 = 8\). So \(s_X = \sqrt{2/2}=1\), \(s_Y = \sqrt{8/2}=2\). 04 \(r = \dfrac{\operatorname{cov}}{s_X s_Y} = \dfrac{2}{1 \times 2} = 1\). The three points lie exactly on the line \(y = 2x\), so the correlation is a perfect \(+1\). RESULT: cov = 2, r = +1 (perfect line) You measure five points that lie exactly on the increasing line \( y = 3x + 2 \): \((0,2),(1,5),(2,8),(3,11),(4,14)\). What is their Pearson correlation \( r \)? Every point sits on one straight increasing line, so the linear fit is perfect: \( r = \) 1.0. Pearson reaches \(+1\) for any increasing line, regardless of its slope — the slope (here 3) and intercept (here 2) do not affect \(r\); only the tightness and direction of the linear pattern do. Two variables have covariance \( \operatorname{cov}(X,Y) = 6 \), with standard deviations \( \sigma_X = 2 \) and \( \sigma_Y = 6 \). What is the Pearson correlation \( r \)? 
By EQ S3.3, \( r = \dfrac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \dfrac{6}{2 \times 6} = \dfrac{6}{12} = \) 0.5 — a moderate positive linear association. Notice the covariance alone (6) told you nothing until it was divided by the spreads.

PYTHON · RUNNABLE IN-BROWSER

# Pearson from scratch -- and a demonstration that it only sees straight lines (EQ S3.3)
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

x = np.linspace(-3, 3, 60)
print("y = 2x + 1 (perfect line) r =", round(pearson(x, 2*x + 1), 3))
print("y = -x (perfect down-line) r =", round(pearson(x, -x), 3))
print("y = x**2 (perfect parabola) r =", round(pearson(x, x**2), 3))
print("\nThe parabola is a *perfect* relationship -- yet Pearson reports ~0,")
print("because the symmetric U has no net linear trend. Always plot first.")
# plot_scatter is not defined in this cell; it is presumably supplied by the site's in-browser runtime
plot_scatter(x, x**2)  # see the U that Pearson is blind to

INSTRUMENT S3.1 — SCATTER & CORRELATION EXPLORER · DRAG THE OUTLIER · NOISE SLIDER · PEARSON vs SPEARMAN (interactive; controls: noise σ, true slope; readouts: Pearson r, Spearman ρ, r²)
Drag the single red point — the outlier — anywhere on the canvas. Watch Pearson r swing dramatically while Spearman ρ moves far less: Spearman works on ranks, so a wild value counts only through its rank, which is bounded by \(n\), rather than through its arbitrarily large magnitude. Now raise the noise slider toward 1.5 and both coefficients collapse toward 0; set slope to negative and both flip sign. The line is the least-squares fit.
3.3 Rank correlation: Spearman & Kendall
Pearson asks "do they fall on a line?" Often the better question is "do they move in the same order?" — a relationship can be reliably increasing without being straight. Rank correlation answers that softer question, and in doing so buys robustness for free.
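The robustness claim can be tested on a deliberately harsh case. The sketch below uses invented data: twenty points on a perfect line, with the last one replaced by a single wild value. Spearman's ρ is computed as Pearson on the ranks (the general definition given in this section), and the double-`argsort` ranking assumes there are no ties, which holds for this data.

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def spearman(x, y):
    """Pearson on the ranks; argsort-of-argsort ranking is valid only without ties."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

x = np.arange(20, dtype=float)
y_clean = x.copy()            # perfect increasing line
y_out = x.copy()
y_out[-1] = -100.0            # one wild outlier at the far right

print(f"clean:        pearson {pearson(x, y_clean):+.3f}   "
      f"spearman {spearman(x, y_clean):+.3f}")
print(f"with outlier: pearson {pearson(x, y_out):+.3f}   "
      f"spearman {spearman(x, y_out):+.3f}")
```

Here the outlier is placed where it does the most damage, so Spearman does drop, but by a bounded amount, while Pearson changes sign. Making the outlier ten times more extreme would move Pearson further and leave Spearman exactly where it is, since the ranks would not change.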
Spearman's ρ is breathtakingly simple: replace every value by its rank, then run ordinary Pearson on the ranks. Because ranks are bounded \(1,\dots,n\), no outlier can pull harder than one rank — and any strictly increasing relationship, line or not, gets \(\rho = 1\). EQ S3.4 — SPEARMAN'S RANK CORRELATION $$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}, \qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) $$ \(d_i\) is the difference between the two ranks of observation \(i\). This tidy formula is exact only when there are no ties; with ties you fall back to Pearson-on-the-ranks, which is the general definition. \(\rho\) measures monotonicity, not linearity: it reaches \(+1\) for \(y = e^{x}\), \(y = \log x\), or any other strictly increasing curve, where Pearson would report something less than 1. It inherits the median's robustness because ranks compress the tails. Kendall's τ attacks the same target — monotone agreement — from a different angle. It counts ordered pairs: a pair \((i,j)\) is concordant if \(x\) and \(y\) agree on which is larger, and discordant if they disagree. EQ S3.5 — KENDALL'S τ $$ \tau = \frac{C - D}{\binom{n}{2}} = \frac{(\text{concordant pairs}) - (\text{discordant pairs})}{\tfrac{1}{2}\,n(n-1)} $$ \(C\) counts pairs that move the same way, \(D\) pairs that move opposite ways, out of all \(\binom{n}{2}\) pairs. \(\tau = +1\) means every pair is concordant (perfect monotone increase); \(\tau = -1\) every pair discordant. Kendall's τ has a cleaner probabilistic meaning than Spearman — \(\tau = P(\text{concordant}) - P(\text{discordant})\) — is even more robust to outliers, and behaves better in small samples, at the cost of being more expensive to compute. For ranked data, Kendall is the statistician's default; Spearman is the more widely reported. WHICH TO USE Pearson when the relationship is plausibly linear and the data is clean and roughly normal. 
Spearman / Kendall when you only care about monotone direction, when outliers are present, when the data is ordinal (ratings, ranks), or when the relationship is curved but consistently increasing. A large gap between Pearson and Spearman is itself a diagnostic: it screams "non-linearity or outliers — go look at the scatter plot."

PYTHON · RUNNABLE IN-BROWSER

# Pearson vs Spearman on monotone-nonlinear data -- watch the gap open
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def spearman(x, y):  # Pearson on the ranks
    # argsort of argsort gives 0-based ranks; valid only when there are no ties
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

x = np.linspace(0.1, 4, 80)
for name, y in [("linear y=x", x), ("exp y=e^x", np.exp(x)),
                ("cubic y=x^3", x**3), ("log y=log x", np.log(x))]:
    print(f"{name:16s} pearson {pearson(x, y):+.3f} spearman {spearman(x, y):+.3f}")

print("\nEvery curve above is strictly increasing -> Spearman = +1.000 exactly.")
print("Pearson sags below 1 wherever the curve bends. The gap = non-linearity.")

3.4 Why correlation ≠ causation
Here is the cliff every analyst eventually walks off. You compute a strong \(r\), the p-value is tiny, the scatter is gorgeous — and you conclude that \(X\) causes \(Y\). The conclusion does not follow, and no amount of additional data fixes it. A correlation is consistent with at least four very different worlds.

| If X and Y correlate, it could be… | Structure | Example |
|---|---|---|
| X causes Y | X → Y | Smoking → lung cancer |
| Y causes X (reverse) | X ← Y | "Umbrellas → rain" read backwards |
| A confounder Z causes both | X ← Z → Y | Ice-cream sales & drownings ← summer heat |
| Pure coincidence | none | Spurious correlations in noisy, multiply-tested data |

The most dangerous of these is the confounder: a hidden variable \(Z\) that drives both \(X\) and \(Y\), manufacturing a correlation between them where no direct link exists.
Ice-cream sales and drowning deaths rise together — not because frozen dairy is lethal, but because hot weather \(Z\) independently boosts both. Condition on \(Z\) (compare days at the same temperature) and the correlation evaporates. EQ S3.6 — CONFOUNDER-INDUCED CORRELATION $$ X = aZ + \varepsilon_X, \quad Y = bZ + \varepsilon_Y, \quad (\text{no } X \to Y) \;\;\Longrightarrow\;\; \operatorname{corr}(X, Y) = \frac{ab\,\sigma_Z^2}{\sigma_X\,\sigma_Y} \neq 0 $$ Both \(X\) and \(Y\) are noisy copies of the same driver \(Z\); the noise terms \(\varepsilon_X, \varepsilon_Y\) are independent of each other and of \(Z\). There is no arrow from \(X\) to \(Y\) — yet they correlate, purely through their shared parent. With \(a=b=1\) and unit variances everywhere, \(\sigma_X^2 = \sigma_Y^2 = \sigma_Z^2 + 1 = 2\), giving \(\operatorname{corr}(X,Y) = 1/2\). An intervention on \(X\) would move nothing in \(Y\) — the spurious 0.5 would vanish the instant you set \(X\) by hand instead of letting \(Z\) set it. You can build this exact world in Instrument S3.3. Simpson's paradox Confounding has a spectacular special case. Simpson's paradox is when a trend that holds in every subgroup reverses when the groups are pooled. It is not a statistical glitch — both the aggregate and the per-group numbers are arithmetically correct. The aggregate is simply answering a different, usually wrong, question. THE 1973 BERKELEY CASE UC Berkeley's graduate admissions looked biased against women in aggregate (about 44% of men admitted vs 35% of women). But department by department, women were admitted at equal or higher rates than men. The resolution: women applied disproportionately to highly competitive departments with low admit rates for everyone. Department was the confounder. Pooling across it produced a reversal that defamed the wrong cause — a textbook reason never to aggregate blindly across a variable that drives both your exposure and your outcome. 
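The arithmetic behind a reversal like this fits in a few lines. The counts below are hypothetical, invented to make the mechanism obvious; they are not the actual Berkeley figures. Women have the higher admit rate in each department, yet the lower rate overall, because most of them applied to the more selective department.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration only.
data = {
    "A (less selective)": {"men": (80, 48), "women": (20, 14)},
    "B (more selective)": {"men": (20, 2), "women": (80, 16)},
}

def rate(applied, admitted):
    return admitted / applied

# Within-department admit rates
for dept, groups in data.items():
    men, women = rate(*groups["men"]), rate(*groups["women"])
    print(f"dept {dept}: men {men:.0%}, women {women:.0%}")

# Pooled admit rates: sum applicants and admissions across departments first
pooled = {}
for g in ("men", "women"):
    applied = sum(groups[g][0] for groups in data.values())
    admitted = sum(groups[g][1] for groups in data.values())
    pooled[g] = rate(applied, admitted)

print(f"pooled:  men {pooled['men']:.0%}, women {pooled['women']:.0%}")
```

The pooled rate is a weighted average of the department rates, and the weights differ by group. That difference in weights, not any difference in treatment, produces the reversal.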
INSTRUMENT S3.2 — SIMPSON'S PARADOX VISUALIZER · GROUP TREND vs POOLED TREND · WATCH IT FLIP (interactive; controls: GROUP SEPARATION, POOLED / BY GROUP view; readouts: pooled r over all points, average within-group r, verdict)
Each colour is one subgroup with a clear downward trend. Push GROUP SEPARATION up: the groups stagger diagonally so the pooled cloud trends upward even though every group trends down. Toggle POOLED vs BY GROUP to see the grey overall fit fight the coloured within-group fits. The verdict flags when the signs disagree — that is the paradox.
3.5 Causal thinking: DAGs, backdoor paths, the do-operator
If more data cannot turn correlation into causation, what can? A causal model — an explicit, falsifiable claim about which variable affects which. Judea Pearl's framework, the standard since the 2000s, draws these claims as a directed acyclic graph (DAG): nodes are variables, arrows are direct causal effects, "acyclic" means no variable causes itself through a loop.
FIG S3.1 THREE STRUCTURES · ONLY THE FORK CONFOUNDS
CHAIN X→Z→Y: Z is a mediator, do NOT control Z
FORK X←Z→Y: Z is a confounder, DO control Z
COLLIDER X→Z←Y: Z is a collider, do NOT control Z
The same three variables, three different DAGs. The fork is the confounder — to estimate X→Y you must adjust for Z. The chain has Z as a mediator (adjusting for it would erase the very effect you want), and the collider creates a spurious link only if you control Z. Whether to control a variable depends entirely on the graph, not on the data.
This figure carries the central, counter-intuitive lesson of causal inference: "control for everything" is wrong. The arrows decide. A fork (\(X \leftarrow Z \rightarrow Y\)) is the confounder you must adjust for. A chain (\(X \rightarrow Z \rightarrow Y\)) makes \(Z\) a mediator — adjust for it and you delete part of the real effect.
A collider (\(X \rightarrow Z \leftarrow Y\)) is the trap: \(X\) and \(Y\) are independent until you condition on \(Z\), which opens a fake association (this is collider / selection bias). A backdoor path is any non-causal route from \(X\) to \(Y\) that starts with an arrow into \(X\) — exactly the channel through which confounding leaks. The backdoor criterion says: to read the true causal effect of \(X\) on \(Y\), find a set of variables that blocks every backdoor path without opening a collider, and adjust for them. Do that, and observational data yields a causal answer. EQ S3.7 — THE do-OPERATOR & BACKDOOR ADJUSTMENT $$ P\big(Y \mid \mathrm{do}(X = x)\big) \;=\; \sum_{z} P\big(Y \mid X = x,\, Z = z\big)\, P(Z = z) $$ \(\mathrm{do}(X=x)\) means intervene — reach in and set \(X\) to \(x\), severing the arrows that normally point into \(X\) — as opposed to merely observing \(X = x\), which is \(P(Y \mid X=x)\). The two are equal only when nothing confounds \(X\) and \(Y\). When a sufficient adjustment set \(Z\) blocks the backdoors, this formula recovers the interventional distribution from purely observational data — the bridge from correlation to causation. A randomized controlled trial physically performs the \(\mathrm{do}\): randomizing \(X\) deletes every arrow into it, which is why an RCT needs no DAG to be valid. When you cannot randomize, the DAG plus EQ S3.7 is the next best thing. INSTRUMENT S3.3 — CONFOUNDER TOY TOGGLE Z → SPURIOUS r APPEARS · CONDITION → IT VANISHES CONFOUNDER STRENGTH (a=b) 1.0 ANALYSIS IGNORE Z CONDITION ON Z OBSERVED corr(X,Y) — TRUE X→Y EFFECT 0.00 STATUS — The data-generating truth is fixed: \(X \leftarrow Z \rightarrow Y\), with no direct \(X \to Y\) arrow. Raise CONFOUNDER STRENGTH and the naive scatter sprouts a strong upward correlation out of nothing — pure backdoor leakage through \(Z\). 
Now switch to CONDITION ON Z (the plot shows one narrow slice of \(Z\)): the spurious correlation collapses toward 0, exactly as EQ S3.6 predicts. This is backdoor adjustment, by hand. NEXT You can now describe data and reason about what does — and does not — cause what. The missing piece is uncertainty: every \(r\), every mean, every effect estimate is computed from a finite sample and could be a fluke. Chapter 04 — Inference & Testing — builds the machinery to ask "could this have happened by chance?": sampling distributions, confidence intervals, p-values, and the hypothesis tests that decide when a correlation is real enough to act on. 3.R References Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London 58 — the product-moment correlation coefficient (EQ S3.3). Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology 15(1) — rank correlation, EQ S3.4. Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika 30(1–2) — Kendall's τ, EQ S3.5. Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187(4175) — the canonical Simpson's paradox case study. Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society B 13(2) — the paradox's namesake paper. Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press — DAGs, the do-operator and the backdoor criterion (EQ S3.7). Pearl, J. (1995). Causal Diagrams for Empirical Research. Biometrika 82(4) — the foundational presentation of the backdoor criterion. 
## STATS · Statistical Inference & Hypothesis Testing (https://ai-encyclopedia.com/stats/04-inference-testing.html)
MATHEMATICS & STATISTICS · CHAPTER 04 / 08 Statistical Inference & Hypothesis Testing
You never observe the population, only a sample. Inference draws conclusions about the whole from the part, and it attaches a measure of its own reliability. Every inference turns a sample into a statement about the world together with an explicit accounting of how often that statement is wrong.
LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON STATS 01–03 INSTRUMENTS p-VALUE · CI COVERAGE · ANOVA F
IN THIS CHAPTER 4.1 Estimators 4.2 Sampling distributions & CIs 4.3 Hypothesis testing 4.4 t-tests 4.5 ANOVA 4.6 Multiple comparisons 4.R References
4.1 Estimators: bias, variance, consistency, MLE
An estimator is a recipe that turns data into a guess about a parameter — the sample mean \(\bar{X}\) estimating the population mean \(\mu\), the sample variance \(S^2\) estimating \(\sigma^2\). Because the data are random, the estimator is random too: it has its own distribution (§4.2). Two numbers summarize how good it is. Bias is how far its average lands from the truth; variance is how much it bounces from sample to sample.
EQ S4.1 — BIAS, VARIANCE, AND MSE
$$ \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta, \qquad \operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2 $$
The expected squared error of any estimator splits cleanly into variance plus bias-squared — the same decomposition that governs model generalization (ML · CH 06).
An estimator can be unbiased yet useless if its variance is huge, or slightly biased yet excellent if that buys a large drop in variance. "Unbiased" is a property worth wanting but never worth worshipping. Why does the sample variance divide by \(n-1\) instead of \(n\)? Because the deviations are measured from \(\bar{X}\), which was itself fit to the data and therefore sits closer to the points than the true \(\mu\) does. Dividing by \(n\) would systematically underestimate \(\sigma^2\); the correction \(n-1\) — one degree of freedom spent estimating the mean — makes \(S^2\) exactly unbiased. Bessel's correction is the cleanest example of a bias fix you can see in arithmetic. EQ S4.2 — UNBIASED SAMPLE VARIANCE $$ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \qquad \mathbb{E}[S^2] = \sigma^2 $$ The \(n-1\) is the first degree of freedom you will meet — a count of independent pieces of information left after estimating the mean. Degrees of freedom reappear in every test in this chapter: the \(t\) distribution's shape, the \(\chi^2\), and the two \(df\) of an \(F\)-ratio (§4.5) are all bookkeeping of how many free deviations remain. A third virtue is asymptotic. An estimator is consistent if it converges in probability to the truth as the sample grows: \(\hat{\theta}_n \xrightarrow{p} \theta\). The sample mean is consistent because its variance \(\sigma^2/n\) shrinks to zero — the weak law of large numbers in one line. Consistency is a floor, not a ceiling: it promises you get there eventually, but says nothing about the rate. Maximum likelihood The dominant general recipe for building estimators is maximum likelihood: choose the parameter value that makes the observed data most probable. Treating the joint density as a function of \(\theta\) (the data now fixed) gives the likelihood; maximizing its logarithm — a sum, far friendlier than a product — gives the MLE. 
EQ S4.3 — MAXIMUM LIKELIHOOD $$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\; \ell(\theta), \qquad \ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) $$ For an i.i.d. Gaussian sample, solving \(\partial\ell/\partial\theta = 0\) hands back the sample mean for \(\mu\) and the \(\tfrac{1}{n}\) ( not \(\tfrac{1}{n-1}\)) variance — so the Gaussian MLE of \(\sigma^2\) is slightly biased downward, the cleanest case where ML trades a little bias for the method's generality. MLEs are consistent, asymptotically unbiased, and asymptotically normal with the smallest possible variance (the Cramér–Rao bound), which is why they sit under logistic regression, GLMs, and most of deep learning's loss functions. Cross-reference: minimizing cross-entropy loss (ML · CH 03) is maximizing a likelihood, and the squared-error loss of linear regression (ML · CH 02) is the Gaussian MLE. The optimizers of machine learning are likelihood maximizers wearing different clothes. A population has variance \( \sigma^2 = 25 \). You draw \( n = 10 \) observations and average them. Since \(\bar{X}\) is unbiased, its MSE equals its variance (EQ S4.1). What is that MSE, \( \sigma^2/n \)? For an unbiased estimator the bias term vanishes, so \( \operatorname{MSE}(\bar{X}) = \operatorname{Var}(\bar{X}) = \sigma^2/n = 25/10 = \) 2.5. Averaging ten draws cuts the error variance by a factor of ten — the variance side of EQ S4.1 doing all the work. 4.2 Sampling distributions & confidence intervals The single most important object in inference is invisible in any one dataset: the sampling distribution — the distribution of an estimator across the many samples you could have drawn but didn't. Its spread is the standard error, and it is what converts a point estimate into an honest range. 
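The sampling distribution is invisible in a single dataset, but it can be approximated by brute force. The sketch below is an illustration, not part of the original chapter: it assumes a Gaussian population with \(\sigma = 5\) (so \(\sigma^2 = 25\), matching the exercise above) and \(n = 10\), draws many hypothetical samples, and compares the spread of their means with \(\sqrt{\sigma^2/n}\). The seed, the population, and the variable names are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable

sigma, n, n_samples = 5.0, 10, 100_000    # population SD, sample size, number of repeated samples

# Each row is one hypothetical sample of size n from a N(0, sigma^2) population.
samples = rng.normal(0.0, sigma, size=(n_samples, n))

# One mean per row: together they approximate the sampling distribution of the mean.
means = samples.mean(axis=1)

empirical_se = means.std(ddof=1)          # spread of the simulated sampling distribution
theoretical_se = sigma / np.sqrt(n)       # sqrt(sigma^2 / n), from Var(X-bar) = sigma^2 / n

print(f"empirical SE   = {empirical_se:.3f}")
print(f"theoretical SE = {theoretical_se:.3f}")
```

The two numbers should agree to about two decimal places; the small gap that remains is Monte Carlo noise from using a finite number of repeated samples.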
EQ S4.4 — STANDARD ERROR OF THE MEAN $$ \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} \quad\Longrightarrow\quad \operatorname{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} $$ The estimate gets sharper as \(1/\sqrt{n}\) — the iron law of statistics. To halve your error you must quadruple your data, which is why the marginal value of more samples falls off and why "big enough" arrives sooner than intuition expects. The \(\sqrt{n}\) is the lever every power calculation (§4.3) pulls. Why is the sampling distribution of a mean so often bell-shaped, whatever the data look like? The Central Limit Theorem: the standardized sum of i.i.d. variables with finite variance converges to a standard normal, regardless of the parent shape. EQ S4.5 — CENTRAL LIMIT THEOREM $$ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty $$ The CLT is why the normal distribution is the lingua franca of inference even for skewed or discrete data: averages forget their parent. The caveats experts insist on: it needs finite variance (it fails for heavy-tailed laws like the Cauchy), the approximation is poor in the tails at small \(n\), and strongly skewed parents need larger \(n\) before the bell sets in — the folk rule "\(n \ge 30\)" is a rough heuristic, not a theorem. Confidence intervals A confidence interval wraps the standard error around the estimate. For a mean with known \(\sigma\), the 95% interval is the estimate plus or minus \(1.96\) standard errors: EQ S4.6 — CONFIDENCE INTERVAL FOR A MEAN $$ \bar{X} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \qquad z_{0.975} = 1.96 \quad (\text{use } t_{n-1} \text{ when } \sigma \text{ is estimated}) $$ The interpretation is subtle and routinely mangled: the 95% refers to the procedure, not to any single interval. If you repeated the whole experiment forever, 95% of the intervals you build this way would contain the true \(\mu\). 
A given interval either covers the truth or it doesn't — there is no "95% probability" attached to it under the frequentist reading. The instrument below makes this concrete: watch the intervals dance and roughly one in twenty miss.
A COMMON ERROR "There is a 95% chance the true mean lies in [a, b]." This sentence is false under the frequentist definition: \(\mu\) is a fixed number, and the interval is the random thing. The correct statement is about the long-run coverage of the method. The probability-about-this-interval reading is exactly what Bayesian credible intervals deliver instead (STATS · CH 05) — which is one reason the two schools talk past each other.
INSTRUMENT S4.1 — CONFIDENCE-INTERVAL COVERAGE · REPEAT THE EXPERIMENT · ~95% COVER THE TRUTH · EQ S4.6 (interactive on the site; controls: CONFIDENCE LEVEL, SAMPLE SIZE n, DRAW, RESET; readouts: INTERVALS DRAWN, COVERED THE TRUTH, EMPIRICAL COVERAGE)
Each horizontal bar is one experiment's 95% interval; the dashed line is the true mean \(\mu = 0\). Mint intervals caught it, red ones missed. Keep pressing DRAW: the empirical coverage hovers near the nominal level. Lower the confidence to 80% and watch the red bars multiply — narrower intervals miss more often. The coverage is a property of the recipe, exactly as EQ S4.6's note warns.
A measurement has population standard deviation \( \sigma = 10 \). You average \( n = 100 \) independent measurements. What is the standard error of the mean, \( \sigma/\sqrt{n} \)?
\( \operatorname{SEM} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{10}{\sqrt{100}} = \dfrac{10}{10} = \) 1.0. A hundredfold averaging shrinks a spread of 10 down to a standard error of 1 — the \(1/\sqrt{n}\) law in a single step.
4.3 Hypothesis testing: null, p-value, errors, power
A hypothesis test is a formal courtroom for a claim. You state a null hypothesis \(H_0\) — the boring default, "no effect" — and ask: if \(H_0\) were true, how surprising is data at least this extreme? That surprise, measured in probability, is the p-value.
EQ S4.7 — THE p-VALUE
$$ p = \mathbb{P}\big(\,|T| \ge |t_{\text{obs}}| \;\big|\; H_0\,\big) $$
The p-value is the probability, computed under the null, of a test statistic as or more extreme than the one you observed. It is not the probability that \(H_0\) is true, nor the probability your result was a fluke, nor one minus the probability the alternative is true. It answers only: "is my data weird, assuming nothing is going on?" A small p means the data sit far in the tail of the null distribution.
The deepest fact about the p-value is also the least intuitive: when \(H_0\) is exactly true and the test statistic is continuous, the p-value is uniformly distributed on \([0,1]\). Every value is equally likely. That flatness is not an accident — it is the definition of a calibrated test, and it is why a threshold \(\alpha = 0.05\) yields a 5% false-positive rate. The Python cell later in this section demonstrates this directly by simulating ten thousand null experiments.
EQ S4.8 — THE TWO ERRORS, AND POWER
$$ \alpha = \mathbb{P}(\text{reject } H_0 \mid H_0 \text{ true}), \qquad \beta = \mathbb{P}(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad \text{power} = 1 - \beta $$
Type I error (\(\alpha\)) is a false alarm — convicting an innocent null. Type II error (\(\beta\)) is a miss — letting a real effect walk free. Power is the probability of detecting an effect that is genuinely there. The four levers are locked together: power rises with the true effect size, with the sample size \(n\) (through the \(\sqrt{n}\) of EQ S4.4), and with a more lenient \(\alpha\) — and falls with noisier data. An underpowered study is one designed to miss; §4.6 is the story of what happens when a whole field runs them.
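The way power moves with effect size, \(n\), and \(\alpha\) can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not the site's own simulator: it takes a two-sided, two-sample z-test with known unit variance in both groups, a standardized true effect `d`, and `n_per_group` observations per group. Under those assumptions the z statistic is normal with mean \(d\sqrt{n/2}\) and unit variance, so power is a tail area. The function name `z_test_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with known unit variance.

    d is the true difference in means measured in SD units.
    """
    std_normal = NormalDist()                     # mean 0, SD 1
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    shift = d * sqrt(n_per_group / 2)             # mean of the z statistic when the effect is d
    # Reject when |Z| >= z_crit, where Z ~ N(shift, 1): add both rejection tails.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

for n in (10, 30, 64, 100):
    print(f"n per group = {n:3d}   power = {z_test_power(0.5, n):.3f}")
```

With `d = 0` the function returns \(\alpha\) itself, which is the Type I error rate of EQ S4.8. With `d = 0.5`, power passes roughly 0.8 at about 64 observations per group, which is how a sample-size target is usually read off such a curve.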
| | \(H_0\) true (no effect) | \(H_0\) false (real effect) |
|---|---|---|
| Reject \(H_0\) | Type I error · prob \(\alpha\) | correct detection · prob \(1-\beta\) (power) |
| Fail to reject | correct · prob \(1-\alpha\) | Type II error · prob \(\beta\) |

A test does not tell you whether the effect is real; it controls the rate at which you cry wolf. Statistical significance is not practical importance: with a large enough \(n\), a trivially small, useless effect becomes "significant," because significance measures only whether the effect is distinguishable from zero, not whether it is big enough to care about. Always report the effect size and a confidence interval alongside the p-value.
INSTRUMENT S4.2 — p-VALUE & POWER SIMULATOR · TWO-SAMPLE z-TEST · NULL → ALTERNATIVE · EQ S4.7–S4.8 (interactive on the site; controls: TRUE EFFECT SIZE d, SAMPLE SIZE / GROUP n, SIGNIFICANCE α; readouts: REGIME, POWER (1 − β), P(p < α))
The histogram is the distribution of the p-value across hypothetical repetitions of the experiment. Start at effect d = 0: the bars are flat — the uniform null, with exactly an α-sized slice falling left of the threshold (Type I errors). Now crank d up: the distribution piles toward zero and the shaded region left of α swells — that growing slice is the power. Raise n to watch a small effect become detectable; the \(\sqrt{n}\) of EQ S4.4 is the engine.
PYTHON · RUNNABLE IN-BROWSER

```python
# 10,000 experiments under H0 (no real effect): the p-value is UNIFORM
import numpy as np
rng = np.random.default_rng(0)

def norm_cdf(x):  # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)

M, n = 10000, 40
A = rng.normal(0, 1, (M, n))  # both groups drawn from the SAME world
B = rng.normal(0, 1, (M, n))  # H0 is TRUE by construction
se = np.sqrt(A.var(1, ddof=1)/n + B.var(1, ddof=1)/n)
t = (A.mean(1) - B.mean(1)) / se
p = 2.0 * (1.0 - norm_cdf(np.abs(t)))  # 10,000 two-sided p-values

edges = np.linspace(0, 1, 11)
counts, _ = np.histogram(p, bins=edges)
print("p-value histogram (10 equal bins, ~1000 expected each):")
for i in range(10):
    print(f" [{edges[i]:.1f},{edges[i+1]:.1f}) {counts[i]:5d} " + "#"*int(counts[i]/25))
# The tail of this line was lost in the export; completing it as the
# share of p-values below 0.05 is an assumption.
print(f"\nfalse positives (p < 0.05): {np.mean(p < 0.05):.3f}")
```

A trial is designed with a Type II error rate of \( \beta = 0.20 \). What is its statistical power, \( 1 - \beta \)?
\( \text{power} = 1 - \beta = 1 - 0.20 = \) 0.8. 80% power is the conventional design target — a one-in-five chance of missing a real effect of the size you planned for.
4.4 t-tests: comparing means when σ is unknown
In practice you almost never know the population \(\sigma\) — you estimate it with \(S\), and that estimate is itself noisy, especially at small \(n\). William Gosset, brewing statistics at Guinness under the pen name "Student," worked out the exact distribution of the resulting ratio. The fix is to use a heavier-tailed reference curve, the \(t\) distribution, in place of the normal.
EQ S4.9 — THE ONE-SAMPLE t STATISTIC $$ t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \;\sim\; t_{n-1} \quad\text{under } H_0:\, \mu = \mu_0 $$ Same shape as a z-score, but \(\sigma\) is replaced by the sample \(S\) — so the denominator wobbles, fattening the tails. The \(t_{n-1}\) distribution has \(n-1\) degrees of freedom: at small \(df\) its tails are heavy (more extreme values are plausible, so critical values are larger than 1.96), and as \(df \to \infty\) it converges to the normal. The extra tail weight is the price of not knowing \(\sigma\) — and forgetting to pay it is why naive z-tests over-reject on small samples. Three flavors cover most uses. The one-sample test (EQ S4.9) compares a mean to a fixed value. The paired test applies the one-sample test to within-subject differences — before/after, left/right — and is far more powerful when it applies, because it cancels per-subject variation. The two-sample test compares two independent groups: EQ S4.10 — WELCH'S TWO-SAMPLE t $$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} $$ The denominator is the standard error of the difference of two means. Welch's version does not assume equal variances and is the modern default — Student's original pooled-variance test is a special case that fails, sometimes badly, when the groups have unequal spread or unequal size. Welch costs you only a (non-integer) degrees-of-freedom adjustment and is strictly safer; reach for it unless you have a strong reason not to. Assumptions, honestly stated: the \(t\)-test wants roughly normal data (or large \(n\), via the CLT) and independent observations. It is robust to mild non-normality but not to dependence or to extreme outliers, which inflate \(S\) and quietly kill power. For badly skewed or ordinal data, a rank-based test (Mann–Whitney, Wilcoxon) trades a little power for not caring about the distribution's shape. 
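The chapter mentions Welch's non-integer degrees-of-freedom adjustment without writing it out. The usual choice is the Welch–Satterthwaite approximation, sketched below as an addition to the text. The function name `welch_df` and the example numbers are hypothetical; the inputs are the sample variances \(S_1^2, S_2^2\) and group sizes \(n_1, n_2\) of EQ S4.10.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximate degrees of freedom for the statistic in EQ S4.10.

    var1, var2 are sample variances (computed with ddof=1); n1, n2 are group sizes.
    """
    v1, v2 = var1 / n1, var2 / n2    # the two terms under the square root in EQ S4.10
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Equal variances and equal sizes: recovers the pooled test's n1 + n2 - 2.
print(f"{welch_df(4.0, 10, 4.0, 10):.2f}")

# Very unequal variances: df drops toward the size of the noisier group minus one.
print(f"{welch_df(1.0, 10, 100.0, 10):.2f}")
```

The result always lies between \(\min(n_1, n_2) - 1\) and \(n_1 + n_2 - 2\). When the variances differ the reference \(t\) curve gets heavier tails than the pooled test would use, which is the price Welch charges for dropping the equal-variance assumption.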
PYTHON · RUNNABLE IN-BROWSER

```python
# Two-sample Welch t-test from scratch: t statistic + normal-approx p
import numpy as np
rng = np.random.default_rng(2)
a = rng.normal(100, 15, 30)  # control: true mean 100
b = rng.normal(106, 15, 30)  # treatment: true mean 106 (effect = 6)

def norm_cdf(x):  # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)

nx, ny = len(a), len(b)
se = np.sqrt(a.var(ddof=1)/nx + b.var(ddof=1)/ny)  # Welch SE of the difference
t = (a.mean() - b.mean()) / se
p = 2.0 * (1.0 - norm_cdf(abs(t)))  # two-sided, normal approximation

print(f"mean control = {a.mean():6.2f} mean treatment = {b.mean():6.2f}")
print(f"difference = {a.mean()-b.mean():+.2f} standard error = {se:.2f}")
print(f"t statistic = {t:.3f}")
print(f"two-sided p = {float(p):.4f} (normal approx; df > ~30 makes it tight)")
print("reject H0 at alpha = 0.05?", bool(p < 0.05))
```

A one-sample t-test has \( \bar{X} = 52 \), \( \mu_0 = 50 \), sample SD \( S = 8 \), and \( n = 100 \). What is the t statistic \( \dfrac{\bar{X}-\mu_0}{S/\sqrt{n}} \)?
Standard error \( = S/\sqrt{n} = 8/\sqrt{100} = 8/10 = 0.8 \). Then \( t = (52 - 50)/0.8 = 2/0.8 = \) 2.5 — comfortably past the \(\approx 1.98\) two-sided critical value at \(df = 99\), so reject \(H_0\) at the 5% level.
4.5 ANOVA: partitioning variance across groups
To compare three or more group means, running a \(t\)-test on every pair is a trap — it multiplies the false-positive rate (the very problem of §4.6). The Analysis of Variance sidesteps it with one omnibus test built from a beautiful identity: total variation decomposes exactly into variation between groups and variation within them.
EQ S4.11 — THE SUM-OF-SQUARES DECOMPOSITION $$ \underbrace{\sum_{j}\sum_{i}(x_{ij} - \bar{x})^2}_{SS_{\text{total}}} \;=\; \underbrace{\sum_{j} n_j (\bar{x}_j - \bar{x})^2}_{SS_{\text{between}}} \;+\; \underbrace{\sum_{j}\sum_{i}(x_{ij} - \bar{x}_j)^2}_{SS_{\text{within}}} $$ Every observation's distance from the grand mean splits, with no remainder, into "how far its group's mean is from the grand mean" plus "how far it is from its own group's mean." \(SS_{\text{between}}\) is signal (do the groups differ?); \(SS_{\text{within}}\) is noise (how much do individuals scatter inside a group?). This is the same orthogonal decomposition that underlies \(R^2\) in regression (STATS · CH 03). Sums of squares are not directly comparable — \(SS_{\text{between}}\) is built from \(k\) group means, \(SS_{\text{within}}\) from \(N\) observations. Dividing each by its degrees of freedom gives mean squares, and their ratio is the test statistic. Under \(H_0\) (all group means equal), both mean squares estimate the same noise variance, so their ratio sits near 1; a real difference inflates the numerator. EQ S4.12 — THE F-RATIO $$ F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k-1)}{SS_{\text{within}} / (N-k)} \;\sim\; F_{k-1,\,N-k} \quad\text{under } H_0 $$ \(k\) groups, \(N\) total observations. The numerator has \(k-1\) degrees of freedom, the denominator \(N-k\). \(F\) is a signal-to-noise ratio: large \(F\) means the spread between groups dwarfs the spread within them, which is hard to explain if the means are truly equal. For exactly two groups, \(F = t^2\) — ANOVA and the two-sample \(t\)-test agree. A significant \(F\) says "some means differ" but not which; post-hoc tests (Tukey's HSD) localize it while controlling the family-wise error of §4.6. 
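EQ S4.11 and EQ S4.12 translate almost line for line into code. The sketch below is an illustration on made-up toy data: it computes the two sums of squares, checks that they add up to the total, and forms the F-ratio. The helper name `one_way_anova` is hypothetical, and the p-value is left out because it needs the \(F\) distribution's CDF.

```python
import numpy as np

def one_way_anova(groups):
    """Return (ss_between, ss_within, F) for a list of 1-D samples (EQ S4.11, EQ S4.12)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    k, N = len(groups), all_obs.size              # number of groups, total observations

    # Signal: how far each group mean sits from the grand mean, weighted by group size.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Noise: how far each observation sits from its own group's mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    F = (ss_between / (k - 1)) / (ss_within / (N - k))
    return ss_between, ss_within, F

data = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]          # made-up toy data, three groups of three
ss_b, ss_w, F = one_way_anova(data)

flat = np.concatenate(data).astype(float)
ss_total = ((flat - flat.mean()) ** 2).sum()      # left-hand side of EQ S4.11

print(f"SS_between = {ss_b:.2f}  SS_within = {ss_w:.2f}  SS_total = {ss_total:.2f}")
print(f"F = {F:.2f} on ({len(data) - 1}, {flat.size - len(data)}) degrees of freedom")
```

On this toy data the decomposition is 26 + 6 = 32 and \(F = 13\), with \(k - 1 = 2\) and \(N - k = 6\) degrees of freedom.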
INSTRUMENT S4.3 — ANOVA F EXPLORER · 3 GROUPS · BETWEEN vs WITHIN VARIANCE · EQ S4.11–S4.12 (interactive on the site; controls: SPREAD OF GROUP MEANS, WITHIN-GROUP SD σ, GROUP SIZE n; readouts: MS BETWEEN, MS WITHIN, F = MSB / MSW, VERDICT at α = 0.05)
Three groups of dots, one per column; the mint diamonds are group means, the dashed line is the grand mean. Pull the group means apart (raise the spread) and watch \(MS_{\text{between}}\) and \(F\) climb. Raise the within-group SD and the dots smear vertically: the same separation now drowns in noise and \(F\) collapses. Shrinking the spread to zero leaves \(F \approx 1\) — pure noise over noise, the null. \(F\) is the ratio of the two, and the verdict flips when it crosses the critical value.
An ANOVA gives \( SS_{\text{between}} = 120 \) with \( df = 2 \), and \( SS_{\text{within}} = 300 \) with \( df = 27 \). Compute the F-ratio, \( \dfrac{SS_B/df_B}{SS_W/df_W} \).
\( MS_{\text{between}} = 120/2 = 60 \) and \( MS_{\text{within}} = 300/27 = 11.11 \). Then \( F = 60 / 11.11 = \) 5.4. Against \( F_{2,27} \) the 5% critical value is \(\approx 3.35\), so \(5.4\) clears it — at least one group mean differs.
4.6 Multiple comparisons & the replication crisis
Here is the dark side of the p-value, and the reason this chapter ends in a cautionary tale. A 5% false-positive rate per test compounds ruthlessly across many tests. Run twenty independent null tests and the probability that at least one hits "significance" by chance is not 5% — it is \(1 - 0.95^{20} \approx 64\%\).
EQ S4.13 — FAMILY-WISE ERROR INFLATION
$$ \text{FWER} = \mathbb{P}(\text{at least one false positive}) = 1 - (1 - \alpha)^{m} \approx m\alpha \;\;(\text{small } m\alpha) $$
Across \(m\) independent tests at level \(\alpha\), the chance of some spurious hit grows toward 1. The Bonferroni correction restores control by testing each hypothesis at \(\alpha/m\) — simple, conservative, and at the cost of power.
For large \(m\) (genomics, neuroimaging), controlling the false discovery rate instead (Benjamini–Hochberg) — the expected fraction of your "discoveries" that are false — keeps far more power. Either way, the unit of error control is the family of tests, not the single test. p-HACKING The same arithmetic, weaponized by flexibility. If you try many outcome variables, many subgroups, many covariate combinations, or peek at the data and stop when \(p < 0.05\), you are running dozens of hidden tests and reporting only the winner. This is p-hacking, and it manufactures significance from pure noise. The "garden of forking paths" makes it possible without any conscious cheating — every undocumented analytic choice is a degree of freedom that inflates the real \(\alpha\). This is not academic hygiene; it broke a field's confidence in itself. Beginning in the 2010s, large replication efforts found that a substantial share of published findings — in psychology, parts of biomedicine, and beyond — failed to reproduce. The diagnosis pointed straight at the machinery of this chapter: chronic underpowering (§4.3), undisclosed multiple comparisons (above), publication bias toward "significant" results, and the cult of the \(p < 0.05\) threshold. John Ioannidis's 2005 argument — that most published research findings are false — followed from a few lines of conditional probability: when power is low, priors are low, and bias and multiplicity are high, a "significant" result is more likely false than true. CONTESTED The reforms are real but not settled. Pre-registration, larger samples, reporting effect sizes with intervals, and sharing data are now mainstream and demonstrably help. Beyond that, consensus frays: some argue for lowering the threshold to \(p < 0.005\), some for abandoning fixed thresholds entirely, some for replacing significance testing with Bayesian model comparison (STATS · CH 05) or estimation-with-intervals. 
The honest summary in 2026: the p-value is a useful, badly abused tool, and "statistical significance" should be read as the start of an argument, never the end of one. You run \( m = 4 \) hypothesis tests and want a family-wise error rate of \( \alpha = 0.05 \). What per-test threshold does the Bonferroni correction use, \( \alpha/m \)? Bonferroni tests each hypothesis at \( \alpha/m = 0.05/4 = \) 0.0125. Only p-values below 0.0125 count as significant — the tighter bar that keeps the chance of any false positive at or below 5%. NEXT Frequentist inference controls error rates but cannot say "how probable is my hypothesis?" — only a Bayesian can. STATS · CH 05 turns the question around: instead of asking how surprising the data are under a fixed null, it puts a probability distribution on the parameter itself, updates it with Bayes' rule, and reads off credible intervals that mean exactly what the misread confidence interval of §4.2 was supposed to. 4.R References Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1) — the t distribution, derived for small Guinness brewing samples (EQ S4.9). Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A 231 — Type I/II error, power, and the framework of EQ S4.8. Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8) — the conditional-probability argument behind §4.6. Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349(6251) — the large-scale replication study that crystallized the crisis. Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate. J. R. Stat. Soc. B 57(1) — FDR control for the many-tests regime of EQ S4.13. Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2) — the profession's own caution on what a p-value is not. 
Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved. Biometrika 34 — the unequal-variance two-sample test of EQ S4.10.
## STATS · Bayesian Inference (https://ai-encyclopedia.com/stats/05-bayesian.html)
MATHEMATICS & STATISTICS · CHAPTER 05 / 08 Bayesian Inference
Frequentist statistics treats a parameter as a fixed unknown and the data as random; Bayesian inference reverses this. A parameter becomes a probability distribution that data updates, a prior turned by likelihood into a posterior. Bayes' theorem drives the entire procedure and resolves several frequentist paradoxes, at the cost of requiring you to state your assumptions explicitly as a prior.
LEVEL CORE READING TIME ≈ 24 MIN BUILDS ON STATS 01–04 INSTRUMENTS BETA UPDATER · PRIOR SENSITIVITY · MAP vs MLE
IN THIS CHAPTER 5.1 Prior · likelihood · posterior 5.2 Conjugate priors 5.3 MAP vs MLE 5.4 Credible vs confidence 5.5 When to go Bayesian 5.R References
5.1 Prior, likelihood, posterior
Start from the definition of conditional probability and read it as a learning rule. You hold a belief about a parameter \(\theta\) before seeing data — the prior \(p(\theta)\). The data \(D\) arrive with a likelihood \(p(D \mid \theta)\), the probability the model assigns to what you observed for each candidate value of \(\theta\).
Bayes' theorem combines them into the posterior \(p(\theta \mid D)\) — your updated belief: EQ S5.1 — BAYES' THEOREM $$ p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta $$ The denominator \(p(D)\) — the marginal likelihood or evidence — is just the constant that makes the posterior integrate to 1; it does not depend on \(\theta\). So for inference about \(\theta\) you can usually drop it and work with the proportionality posterior \(\propto\) likelihood \(\times\) prior. The hard part of Bayesian computation is almost always that integral over \(\theta\); conjugacy (§5.2) makes it disappear, and when it cannot, you reach for MCMC or variational methods. The unnormalized form is the one to keep in your head, because it shows exactly how the three pieces interact: EQ S5.2 — POSTERIOR AS A PRODUCT $$ \underbrace{p(\theta \mid D)}_{\text{posterior}} \;\propto\; \underbrace{p(D \mid \theta)}_{\text{likelihood}} \;\cdot\; \underbrace{p(\theta)}_{\text{prior}} $$ The prior is a soft starting point; the likelihood pulls it toward whatever the data support. With little data the prior dominates; as data accumulate the likelihood overwhelms any non-dogmatic prior and the posterior converges to the truth (the Bernstein–von Mises theorem makes this precise: the posterior becomes asymptotically Gaussian and prior-independent). A prior that puts zero mass on a value can never recover — Cromwell's rule: never assign probability exactly 0 or 1 to something you might be wrong about. Three properties make this more than a formula. First, it is sequential: yesterday's posterior is today's prior, and processing data in one batch or in a stream gives the identical result. Second, it returns a whole distribution, not a point estimate — uncertainty is first-class, not an afterthought computed from a sampling thought-experiment. 
Third, every quantity in it is a probability about \(\theta\) itself, which is what most people actually want to know and (contestably) believe a confidence interval already tells them. KEY The interpretive split is real, not cosmetic. To a frequentist, \(\theta\) is a fixed constant and probability statements about it are meaningless; randomness lives in the data and the procedure. To a Bayesian, \(\theta\) is uncertain and probability is the language of that uncertainty. Both camps agree on the math of EQ S5.1 — they disagree on what a probability is. Most working statisticians today are pragmatic: Bayesian when priors are defensible and uncertainty must be honest, frequentist when a guarantee over repeated use is what matters. 5.2 Conjugate priors: Beta–Binomial & Normal–Normal The integral in EQ S5.1 is what makes Bayes hard. A conjugate prior sidesteps it entirely: if the prior and posterior belong to the same family, updating is just arithmetic on the parameters. The canonical pair is the Beta prior with a Binomial likelihood — the model for "estimate a coin's bias from heads and tails." EQ S5.3 — BETA–BINOMIAL UPDATE $$ \theta \sim \mathrm{Beta}(\alpha, \beta), \quad k \mid \theta \sim \mathrm{Binomial}(n, \theta) \;\;\Longrightarrow\;\; \theta \mid k \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k) $$ Observe \(k\) successes in \(n\) trials and you simply add the successes to \(\alpha\) and the failures to \(\beta\). The prior parameters act as pseudo-counts: \(\mathrm{Beta}(1,1)\) is uniform (one imaginary head, one imaginary tail — total ignorance over \([0,1]\)); \(\mathrm{Beta}(2,2)\) is a gentle nudge toward a fair coin. The posterior mean is \(\dfrac{\alpha + k}{\alpha + \beta + n}\) — a data MLE \(k/n\) shrunk toward the prior mean \(\alpha/(\alpha+\beta)\), with the shrinkage fading as \(n\) grows. The same trick works for a Gaussian mean with known variance. 
A Normal prior on the mean, combined with Normal data, yields a Normal posterior whose mean is a precision-weighted average of prior and data: EQ S5.4 — NORMAL–NORMAL UPDATE (KNOWN σ²) $$ \mu \sim \mathcal{N}(\mu_0, \tau_0^2), \;\; x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2) \;\;\Longrightarrow\;\; \mu \mid \bar{x} \sim \mathcal{N}\!\left( \frac{\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar{x}}{\sigma^2}}{\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}},\; \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1} \right) $$ Work in precision (inverse variance) and it is beautifully simple: posterior precision = prior precision + data precision, and the posterior mean is the average of \(\mu_0\) and \(\bar{x}\) weighted by those precisions. Each observation adds \(1/\sigma^2\) of precision, so the posterior tightens like \(1/n\) and the data term \(n\bar{x}/\sigma^2\) eventually swamps the prior. This is the engine behind Kalman filters, ridge regression's Bayesian reading, and Gaussian hierarchical models. You start with a \( \mathrm{Beta}(2,2) \) prior on a coin's bias and observe 7 heads in 10 flips. What is the posterior mean ? (Use EQ S5.3, then \( \tfrac{\alpha'}{\alpha'+\beta'} \).) Update: \( \alpha' = 2 + 7 = 9 \), \( \beta' = 2 + (10-7) = 5 \), so the posterior is \( \mathrm{Beta}(9,5) \). Posterior mean \( = \dfrac{9}{9+5} = \dfrac{9}{14} = \) 0.643. Note it sits below the raw MLE of \(7/10 = 0.70\): the symmetric prior shrinks the estimate toward \(0.5\). A posterior comes out as \( \mathrm{Beta}(3, 1) \). What is its mean ? The mean of \( \mathrm{Beta}(\alpha,\beta) \) is \( \dfrac{\alpha}{\alpha+\beta} = \dfrac{3}{3+1} = \dfrac{3}{4} = \) 0.75. (Its mode, by contrast, is \( \tfrac{\alpha-1}{\alpha+\beta-2} = \tfrac{2}{2} = 1 \) — mean and mode part ways for skewed Betas.) 
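The precision bookkeeping of EQ S5.4 is short enough to write out directly. The sketch below is an illustration with made-up numbers: a \(\mathcal{N}(0, 1)\) prior, known \(\sigma^2 = 1\), and a sample mean of 2 observed at three different sample sizes. The function name `normal_update` is hypothetical.

```python
def normal_update(mu0, tau0_sq, xbar, n, sigma_sq):
    """Posterior mean and variance of mu under EQ S5.4 (Normal prior, known sigma^2)."""
    prior_precision = 1.0 / tau0_sq        # precision = inverse variance
    data_precision = n / sigma_sq          # each observation contributes 1 / sigma^2
    post_precision = prior_precision + data_precision
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = (prior_precision * mu0 + data_precision * xbar) / post_precision
    return post_mean, 1.0 / post_precision

for n in (1, 10, 100):
    mean, var = normal_update(mu0=0.0, tau0_sq=1.0, xbar=2.0, n=n, sigma_sq=1.0)
    print(f"n = {n:3d}   posterior mean = {mean:.3f}   posterior variance = {var:.4f}")
```

With one observation the prior and the data carry equal precision, so the posterior mean lands halfway at 1.0 with variance 0.5. By \(n = 100\) the posterior mean is within 0.02 of the sample mean and the variance is close to \(\sigma^2/n\), which is the data swamping the prior.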
PYTHON · RUNNABLE IN-BROWSER

    # EQ S5.3: Beta-Binomial conjugate update -- posterior mean vs MLE
    import numpy as np

    a0, b0 = 2.0, 2.0                 # Beta(2,2) prior: gentle nudge toward a fair coin
    k, n = 7, 10                      # observed: 7 heads in 10 flips
    a, b = a0 + k, b0 + (n - k)       # EQ S5.3: add heads to a, tails to b

    post_mean = a / (a + b)           # E[theta | data]
    post_var = a * b / ((a + b)**2 * (a + b + 1))
    prior_mean = a0 / (a0 + b0)
    mle = k / n

    print(f"prior: Beta({a0:.0f},{b0:.0f}) mean {prior_mean:.4f}")
    print(f"posterior: Beta({a:.0f},{b:.0f}) mean {post_mean:.4f} sd {post_var**0.5:.4f}")
    print(f"MLE k/n: {mle:.4f}")
    # signed offset of the posterior mean relative to the MLE (negative = below the MLE)
    print(f"shrinkage: posterior sits {post_mean - mle:+.4f} from the MLE,")
    print(f" pulled toward the prior mean {prior_mean:.2f}")

    # draw the posterior density on a grid (unnormalized Beta kernel, then normalize)
    grid = np.linspace(0.001, 0.999, 200)
    dx = grid[1] - grid[0]
    dens = grid**(a - 1) * (1 - grid)**(b - 1)
    dens /= dens.sum() * dx           # normalize to unit area (Riemann)
    plot_xy(grid, dens)               # plot_xy is the site's in-browser plotting helper

INSTRUMENT S5.1 — BETA–BINOMIAL UPDATER · EQ S5.3 · LIVE POSTERIOR
Controls (defaults): PRIOR α 2, PRIOR β 2, TRUE BIAS (COIN) 0.70, FLIP COINS +1 / +10 / +100, RESET. Readouts: DATA (HEADS / FLIPS), POSTERIOR Beta(α′,β′), POSTERIOR MEAN, 95% CREDIBLE.
The grey curve is the prior, the mint curve the posterior; the dashed line is the true bias. Flip a few coins and the posterior is broad and prior-shaped; flip a hundred and watch it collapse to a spike on the truth — the likelihood drowning the prior exactly as §5.1 promised. Set α = β = 1 for a flat prior and the posterior mean tracks the raw MLE; crank α and β up to feel a stubborn prior resist the data.

5.3 MAP vs MLE: two ways to pick one number
A full posterior is the honest answer, but engineering often wants a single estimate. Two point estimates dominate. Maximum likelihood (MLE) picks the \(\theta\) that makes the data most probable, ignoring any prior.
Maximum a posteriori (MAP) picks the mode of the posterior — the most probable \(\theta\) given the data and the prior: EQ S5.5 — MLE AND MAP $$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\; p(D \mid \theta), \qquad \hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta) $$ MAP is MLE with the prior multiplied back in — equivalently, with \(\log p(\theta)\) added to the log-likelihood. MAP collapses to MLE exactly when the prior is flat (\(p(\theta)\) constant), so MLE is the special case "I refuse to state a prior." Crucially, MAP is not the same as the posterior mean: for a skewed posterior the mode, mean, and median all differ, and MAP — being a single point — throws away the very uncertainty that motivated going Bayesian. The connection to machine learning is direct and worth internalizing: regularization is a prior. Adding an \(L_2\) penalty \(\lambda\lVert\theta\rVert^2\) to a loss is precisely MAP estimation under a Gaussian prior \(\mathcal{N}(0, 1/(2\lambda))\) on the weights; an \(L_1\) penalty is a Laplace prior. The penalty strength \(\lambda\) is the prior's tightness. Seen this way, "regularized MLE" and "MAP" are the same computation under two vocabularies. EQ S5.6 — MAP MEAN OF A BETA POSTERIOR $$ \hat{\theta}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}, \qquad \hat{\theta}_{\text{mean}} = \frac{\alpha + k}{\alpha + \beta + n}, \qquad \hat{\theta}_{\text{MLE}} = \frac{k}{n} $$ For the Beta–Binomial model all three estimators have closed forms, and comparing them is the cleanest way to feel the difference. With a flat \(\mathrm{Beta}(1,1)\) prior the MAP \(\tfrac{k}{n}\) coincides with the MLE (the "\(-1\)" terms cancel the "\(+1\)" pseudo-counts), while the posterior mean \(\tfrac{k+1}{n+2}\) is Laplace's rule of succession — still shrunk toward \(0.5\). On small \(n\), these can differ enough to matter; by large \(n\) they converge. 
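The three closed forms in EQ S5.6 are easy to compare side by side. The sketch below assumes prior parameters of at least 1 and at least one trial, so the mode formula is well defined; the helper name `beta_binomial_estimates` and the example counts are hypothetical.

```python
# Closed-form point estimates for the Beta-Binomial model (EQ S5.6).
# Assumes alpha >= 1, beta >= 1 and n >= 1 so the mode formula applies.

def beta_binomial_estimates(alpha, beta, k, n):
    """Return (MLE, MAP, posterior mean) after k successes in n trials."""
    mle = k / n
    map_est = (alpha + k - 1) / (alpha + beta + n - 2)  # posterior mode
    mean = (alpha + k) / (alpha + beta + n)             # posterior mean
    return mle, map_est, mean

# (alpha, beta, k, n): small sample with Beta(2,2), same data with a flat
# prior, then a larger sample with Beta(2,2) again.
for alpha, beta, k, n in [(2, 2, 2, 3), (1, 1, 2, 3), (2, 2, 28, 40)]:
    mle, map_est, mean = beta_binomial_estimates(alpha, beta, k, n)
    print(f"Beta({alpha},{beta}), k={k}, n={n}: "
          f"MLE {mle:.4f}  MAP {map_est:.4f}  mean {mean:.4f}")
```

With the flat prior the MAP equals the MLE while the posterior mean is still pulled toward 0.5; with n = 40 all three sit within about 0.02 of each other.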
PYTHON · RUNNABLE IN-BROWSER

    # Grid-approximate a Normal-mean posterior (EQ S5.4 shape, computed numerically)
    # then read off the 95% credible interval from the posterior CDF.
    import numpy as np

    rng = np.random.default_rng(0)
    mu_true, sigma = 5.0, 2.0                     # known data variance
    data = rng.normal(mu_true, sigma, size=8)     # a SMALL sample of 8
    mu0, tau0 = 0.0, 5.0                          # weak prior: N(0, 5^2) on the mean

    grid = np.linspace(-2, 12, 2000)              # candidate values of mu
    logprior = -0.5 * ((grid - mu0) / tau0)**2
    # log-likelihood of the sample under each candidate mu (sum over data points)
    loglik = -0.5 * (((data[:, None] - grid[None, :]) / sigma)**2).sum(0)
    logpost = logprior + loglik

    post = np.exp(logpost - logpost.max())        # stabilize, then normalize
    dx = grid[1] - grid[0]
    post /= post.sum() * dx                       # unit-area posterior
    cdf = np.cumsum(post) * dx                    # running probability mass

    lo = grid[np.searchsorted(cdf, 0.025)]        # 2.5th percentile
    hi = grid[np.searchsorted(cdf, 0.975)]        # 97.5th percentile
    mean = (grid * post).sum() * dx               # E[mu | data]

    print(f"sample mean (MLE): {data.mean():.3f}")
    print(f"posterior mean: {mean:.3f}")
    print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
    print(f"interpretation: P(mu in interval | data) = 0.95 -- a")
    print(f" statement about mu, not about the procedure")
    plot_xy(grid, post)                           # plot_xy is the site's in-browser plotting helper

INSTRUMENT S5.2 — MAP vs MLE ON A SMALL SAMPLE · EQ S5.6 · BETA–BINOMIAL
Controls (defaults): HEADS k 2, FLIPS n 3, PRIOR α=β 2.0. Readouts: MLE k/n, MAP (mode), POSTERIOR MEAN, GAP MLE − MAP.
Three vertical markers on the posterior: blue MLE, mint MAP (mode), and a paler mint posterior mean. With n = 3 and a \(\mathrm{Beta}(2,2)\) prior they sit far apart — small data is exactly where the prior earns its keep. Drag n up to 40 and the three markers march together, the posterior narrows, and the choice of estimator stops mattering. Set the prior to 1 and the MAP snaps onto the MLE (EQ S5.6).
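The "regularization is a prior" claim from this section can be checked numerically in the simplest possible setting. The sketch below assumes a one-parameter linear model with unit noise variance, so that the loss is exactly the negative log-likelihood; the data points, the value of `lam`, and the function name are all made up for illustration.

```python
# L2-penalized least squares vs MAP under a Gaussian prior, one parameter.
# Assumed model: y_i = theta * x_i + noise, noise ~ N(0, 1).
# Assumed prior: theta ~ N(0, 1 / (2 * lam)). Data are hypothetical.

x = [0.5, 1.0, 1.5, 2.0]
y = [0.4, 1.3, 1.2, 2.3]
lam = 0.5

def neg_log_posterior(theta):
    """Negative log-likelihood plus negative log-prior, constants dropped."""
    nll = 0.5 * sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y))
    prior_var = 1.0 / (2.0 * lam)
    return nll + theta ** 2 / (2.0 * prior_var)   # equals nll + lam * theta^2

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

theta_mle = sxy / sxx                  # unpenalized least squares
theta_ridge = sxy / (sxx + 2.0 * lam)  # closed-form minimizer of nll + lam * theta^2

# Brute-force MAP: minimize the negative log-posterior on a fine grid.
grid = [i / 10000 for i in range(20001)]          # theta in [0, 2]
theta_map = min(grid, key=neg_log_posterior)

print(f"MLE   {theta_mle:.4f}")
print(f"ridge {theta_ridge:.4f}")
print(f"MAP   {theta_map:.4f}  (grid search)")
```

The grid-search MAP lands on the ridge solution, and both sit closer to zero than the MLE: the penalty and the prior are the same pull toward the origin.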
5.4 Credible intervals vs confidence intervals
This is the comparison where the two philosophies collide most sharply, and where the difference is routinely misstated. Both produce an interval; they answer different questions.

EQ S5.7 — A 95% CREDIBLE INTERVAL
$$ \mathbb{P}\big(\theta \in [L, U] \,\big|\, D\big) = 0.95, \qquad \text{e.g. } \int_{L}^{U} p(\theta \mid D)\, \mathrm{d}\theta = 0.95 $$
A credible interval is a direct probability statement about the parameter: given the data you actually saw, there is a 95% probability \(\theta\) lies in \([L, U]\). Two common flavors: the equal-tailed interval (cut 2.5% off each tail of the posterior) and the highest-posterior-density (HPD) interval (the shortest interval containing 95% of the mass — every point inside is more probable than every point outside). For symmetric posteriors they coincide; for skewed ones the HPD is tighter and more honest.

A confidence interval makes no such statement. Its 95% is a property of the procedure across hypothetical repetitions: if you reran the whole experiment many times and computed an interval each time, about 95% of those intervals would contain the true (fixed) \(\theta\). For the one interval in front of you, \(\theta\) is either in it or not — the 95% does not transfer to this realization. The seductive sentence "there's a 95% chance the parameter is in this interval" is a credible-interval statement smuggled onto a confidence interval, and it is false under the frequentist definition.

| | Credible interval (Bayesian) | Confidence interval (frequentist) |
|---|---|---|
| What's random | θ (the belief) | the interval (the data) |
| The 95% means | P(θ in [L,U] given this data) = 0.95 | 95% of such intervals cover the fixed θ over repeats |
| Needs a prior | yes | no |
| Guarantee | conditional on the data you saw | long-run, over the procedure |
| "95% chance θ is inside" | correct | wrong |

For large samples with a flat prior the two intervals nearly coincide numerically — which is why the distinction looks pedantic until it isn't.
They diverge when the prior carries real information, when \(n\) is small, or when the parameter lives near a boundary (a near-zero proportion, a variance close to zero), where confidence intervals can behave pathologically — even extending below zero for a quantity that cannot be negative — while a properly bounded prior keeps the credible interval inside the feasible region. A Bayesian reports a 95% credible interval \([L, U]\) for \( \theta \). Reading it correctly: what is \( \mathbb{P}(\theta \in [L, U] \mid D) \), the probability the parameter lies inside given the observed data ? By definition (EQ S5.7), a 95% credible interval is exactly the interval whose posterior mass is 0.95, so \( \mathbb{P}(\theta \in [L,U] \mid D) = \) 0.95. This is the statement people want a confidence interval to make — and the reason credible intervals are easier to communicate honestly. CONTESTED The interpretation gap cuts both ways. Bayesians point out that the credible interval answers the question people actually ask. Frequentists counter that it only does so if you accept the prior — and that a confidence interval's coverage guarantee holds regardless of any belief, which is exactly what you want for a regulator or a referee. Neither is "more correct"; they optimize for different things. The honest summary: report a credible interval when a defensible prior exists and you owe a statement about this parameter; report a confidence interval when you owe a guarantee that survives an adversary's choice of θ. 5.5 When to go Bayesian: small data, real priors, hierarchy Bayesian inference is not a moral upgrade over frequentist statistics — it is a tool with a cost (you must specify a prior, and often pay for sampling) and three regimes where it clearly pays for itself. Small data. When \(n\) is tiny, the MLE is high-variance and can sit on the boundary (zero successes \(\Rightarrow\) "the true rate is 0"). 
A mild prior regularizes the estimate and, more importantly, the posterior reports its own width — you get calibrated uncertainty instead of an overconfident point. This is exactly the regime Instrument S5.2 dramatizes.

Genuine prior information. If you actually know something — a physical constraint, last quarter's conversion rate, a published effect size — discarding it to "let the data speak" is throwing away signal. A prior is the disciplined way to encode it, and the posterior shows precisely how much the new data revised it.

Hierarchy & partial pooling. When you estimate many related quantities at once — conversion rates for 500 stores, batting averages for 200 players, effects across 30 hospitals — a hierarchical model lets a shared hyper-prior borrow strength across groups. Each estimate is shrunk toward the population mean by an amount the data decide; noisy small-sample groups shrink a lot, well-measured groups barely move. This is the modern form of the James–Stein result that a shrinkage estimator dominates independent MLEs in total squared error when three or more means are estimated together.

EQ S5.8 — A HIERARCHICAL (PARTIAL-POOLING) MODEL
$$ \phi \sim p(\phi), \qquad \theta_j \mid \phi \sim p(\theta_j \mid \phi)\;\; (j = 1,\dots,J), \qquad y_{ij} \mid \theta_j \sim p(y \mid \theta_j) $$
A top-level hyperparameter \(\phi\) (e.g. the population mean and spread) generates group-level parameters \(\theta_j\), which generate the observations. Fitting \(\phi\) and all \(\theta_j\) jointly shrinks each group's estimate toward the population mean — the cure for both the "no pooling" extreme (every group on its own, wildly overfit) and the "complete pooling" extreme (one number for everyone, badly biased). Closed-form conjugacy rarely survives here, so these models are the daily reason practitioners reach for MCMC (Hamiltonian Monte Carlo / NUTS in Stan, PyMC, NumPyro) or variational inference.

When to stay frequentist.
If a defensible prior is genuinely unavailable and a referee will challenge whatever you pick; if you need a coverage guarantee that holds for an adversarially chosen θ (much regulatory and clinical work); or if the model is simple, data abundant, and the two answers coincide anyway — the frequentist route is cheaper and beyond dispute. The mature stance is fluency in both, not allegiance to one.

INSTRUMENT S5.3 — PRIOR-SENSITIVITY EXPLORER · SAME DATA · THREE PRIORS · EQ S5.3
Controls (defaults): HEADS k 3, FLIPS n 5. Readouts: posterior under FLAT Beta(1,1), FAIR-LEANING Beta(10,10), SKEPTIC Beta(2,8).
The identical data feed three posteriors: a flat prior (let the data speak), a strong fair-coin prior, and a skeptic who expects a low rate. With n = 5 the three posterior means disagree sharply — prior choice is doing real work. Now drag n toward 100: the curves converge and the disagreement evaporates. That convergence is the honest defence of priors — with enough data they wash out; with little data, stating yours is just being explicit about an assumption you were making anyway.

NEXT Every update in this chapter — precision-weighted means, conjugate sums, the integral in Bayes' theorem — is linear algebra in disguise. Chapter 06 lays the foundation those operations stand on: vectors and matrices, eigen-decomposition, the SVD, and the geometry of projections that turns "weighted average of prior and data" into a single matrix equation.

5.R References
Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of the theorem.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press — the standard modern reference for conjugacy, hierarchy, and computation.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — the objective-Bayesian case for probability as extended logic.
Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations. JASA — shrinkage and partial pooling as empirical Bayes.
Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. JASA — a careful account of where the two frameworks agree and diverge.
Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. J. Stat. Softw. — Hamiltonian Monte Carlo for the hierarchical models of §5.5.

## STATS · Linear Algebra for Machine Learning (https://ai-encyclopedia.com/stats/06-linear-algebra.html)
MATHEMATICS & STATISTICS · CHAPTER 06 / 08
Beneath the framework and the GPU, almost every model in this encyclopedia is the same object: a stack of linear maps with some nonlinearity between them. Eigenvectors and the singular value decomposition reveal the directions data actually occupies. This chapter develops the vocabulary of vectors, matrices, norms, and rank, then reaches the two factorizations behind PCA, embeddings, recommenders, and low-rank fine-tuning.
LEVEL CORE · READING TIME ≈ 24 MIN · BUILDS ON STATS 01–05 · INSTRUMENTS LINEAR MAP · SVD · EIGEN
IN THIS CHAPTER 6.1 Vectors & matrices 6.2 Norms, dot products, projection 6.3 Maps, rank & null space 6.4 Eigenvalues & eigenvectors 6.5 The SVD 6.R References

6.1 Vectors, matrices & the operations that matter
A vector is an ordered list of numbers, and you should hold two pictures of it at once: an arrow from the origin in \(\mathbb{R}^n\), and a single data point — one row of your dataset, one word embedding, one image flattened.
A matrix \(A \in \mathbb{R}^{m \times n}\) is a grid of numbers, and it too wears two hats: a table of data (rows are examples, columns are features), and — the view this chapter cares about — a function that takes a vector in and returns a vector out. Three operations carry essentially all the weight. Scaling and addition let you form linear combinations \(\alpha\mathbf{u} + \beta\mathbf{v}\); the set of all such combinations of some vectors is their span, a flat subspace through the origin. Matrix–vector multiplication applies the map. And matrix–matrix multiplication composes two maps — its only subtlety is that inner dimensions must agree and order matters: \(AB \neq BA\) in general. EQ S6.1 — MATRIX–VECTOR PRODUCT (TWO READINGS) $$ (A\mathbf{x})_i \;=\; \sum_{j=1}^{n} A_{ij}\, x_j \qquad\Longleftrightarrow\qquad A\mathbf{x} \;=\; \sum_{j=1}^{n} x_j\, \mathbf{a}_{:,j} $$ The left form is the textbook one: output entry \(i\) is the dot product of row \(i\) of \(A\) with \(\mathbf{x}\). The right form is the one that builds intuition: \(A\mathbf{x}\) is a linear combination of the columns of \(A\), weighted by the entries of \(\mathbf{x}\). That single re-reading explains rank, span, and the column space (§6.3) in one move — and it is exactly what a fully-connected layer \(\mathbf{y} = W\mathbf{x} + \mathbf{b}\) computes a few billion times a second. Matrix multiplication is associative and distributive but, crucially, not commutative: rotating then stretching is not the same as stretching then rotating. The transpose \(A^\top\) flips rows and columns and obeys \((AB)^\top = B^\top A^\top\). The identity \(I\) leaves vectors untouched, and an inverse \(A^{-1}\) (when it exists — only for square, full-rank \(A\)) undoes the map. Most matrices in machine learning are not invertible, which is the whole reason §6.5 exists. SHAPE DISCIPLINE Half of all numerical bugs are shape bugs. 
For \(AB\) to be defined, \(A\) must be \(m\times k\) and \(B\) must be \(k\times n\) — the inner \(k\) must match, and the result is \(m\times n\). When a model crashes at 3 a.m., the first thing a practitioner prints is .shape. Reading every product right-to-left as "a map applied to the output of another map" makes the dimensions self-checking.

Let \(A\) be \(2\times 3\) and \(B\) be \(3\times 4\). The inner dimensions match (both \(3\)), so \(AB\) is defined. How many entries does the resulting matrix \(AB\) have?
An \(m\times k\) times a \(k\times n\) matrix gives an \(m\times n\) result. Here \(AB\) is \(2\times 4\), so it has \(2 \times 4 = \) 8 entries. The shared inner dimension \(k=3\) is summed over and does not appear in the output shape.

6.2 Norms, dot products & projections
To do geometry you need length and angle. The dot product supplies both: it is the engine behind similarity scores, attention logits, least-squares, and the kernel trick. The L2 (Euclidean) norm is a vector's length; the dot product of two unit vectors is the cosine of the angle between them.

EQ S6.2 — DOT PRODUCT, NORM & COSINE
$$ \mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^{n} u_i v_i = \|\mathbf{u}\|\,\|\mathbf{v}\|\cos\theta, \qquad \|\mathbf{v}\|_2 = \sqrt{\mathbf{v}\cdot\mathbf{v}} = \sqrt{\textstyle\sum_i v_i^2} $$
Two vectors are orthogonal exactly when \(\mathbf{u}\cdot\mathbf{v}=0\) (\(\cos\theta = 0\)). Cosine similarity \(\dfrac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}\) is the dot product after normalizing length — the default way to compare embeddings, because it cares about direction, not magnitude. The L1 norm \(\|\mathbf{v}\|_1 = \sum_i|v_i|\) and the max norm \(\|\mathbf{v}\|_\infty = \max_i|v_i|\) are the other two you meet daily; L1 regularization owes its sparsity to the diamond shape of its unit ball.

WORKED EXAMPLE
01 Take \(\mathbf{v} = (3, 4)\).
Its L2 norm is \(\sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5\) — the classic 3-4-5 right triangle. 02 Its L1 norm is \(|3| + |4| = 7\); its max norm is \(\max(3,4) = 4\). Always \(\|\mathbf{v}\|_\infty \le \|\mathbf{v}\|_2 \le \|\mathbf{v}\|_1\). 03 Dot with \(\mathbf{u} = (4, -3)\): \(3\cdot 4 + 4\cdot(-3) = 12 - 12 = 0\) — orthogonal. Indeed \((3,4)\) and \((4,-3)\) meet at a right angle. RESULT: ‖(3,4)‖₂ = 5, and (3,4) ⟂ (4,−3) The geometric companion to the dot product is the projection. To project \(\mathbf{x}\) onto the direction of \(\mathbf{a}\) is to find the point on the line through \(\mathbf{a}\) closest to \(\mathbf{x}\) — the "shadow" \(\mathbf{x}\) casts on that line. This single idea, generalized to projecting onto the column space of a matrix, is least-squares regression. EQ S6.3 — PROJECTION ONTO A DIRECTION $$ \mathrm{proj}_{\mathbf{a}}(\mathbf{x}) \;=\; \frac{\mathbf{a}\cdot\mathbf{x}}{\mathbf{a}\cdot\mathbf{a}}\;\mathbf{a}, \qquad\text{and the residual } \mathbf{x} - \mathrm{proj}_{\mathbf{a}}(\mathbf{x}) \perp \mathbf{a} $$ The scalar \(\dfrac{\mathbf{a}\cdot\mathbf{x}}{\mathbf{a}\cdot\mathbf{a}}\) is "how many copies of \(\mathbf{a}\)" you need; multiplying by \(\mathbf{a}\) places the shadow on the line. The leftover, the residual, is orthogonal to \(\mathbf{a}\) by construction — the defining property that makes projection the unique closest point. Ordinary least squares is exactly this with \(\mathbf{a}\) replaced by a whole subspace: \(\hat{\mathbf{y}} = X(X^\top X)^{-1}X^\top\mathbf{y}\) projects the targets onto the column space of the design matrix. What is the L2 (Euclidean) norm of the vector \( \mathbf{v} = (3, 4) \)? Compute \( \|\mathbf{v}\|_2 = \sqrt{3^2 + 4^2} \). \( \|\mathbf{v}\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = \) 5. This is the length of the hypotenuse of a 3-4-5 right triangle. 
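The statement that ordinary least squares is a projection onto the column space can be verified directly. The sketch below builds a small design matrix from made-up data (an intercept column plus one feature, with an assumed noise level) and checks the three properties a projection must have; it forms \(X(X^\top X)^{-1}X^\top\) explicitly only because the matrix is tiny.

```python
# Least squares as projection onto the column space of X (generalizing EQ S6.3).
# The data are synthetic and the noise level is an arbitrary assumption.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(6.0)
X = np.column_stack([np.ones(6), t])               # design matrix: intercept + one feature
y = 1.0 + 2.0 * t + rng.normal(scale=0.3, size=6)  # noisy line

beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares coefficients
y_hat = X @ beta                                   # fitted values
residual = y - y_hat

H = X @ np.linalg.inv(X.T @ X) @ X.T               # projection ("hat") matrix

same_fit = np.allclose(H @ y, y_hat)               # H y reproduces the fit
orthogonal = np.allclose(X.T @ residual, 0.0, atol=1e-9)  # residual is orthogonal to every column
idempotent = np.allclose(H @ H, H)                 # projecting twice changes nothing

print("H y equals fitted values:", same_fit)
print("residual orthogonal to columns of X:", orthogonal)
print("H is idempotent:", idempotent)
```

For larger problems the explicit inverse is usually avoided in favour of a solver such as `np.linalg.lstsq`, which is what produces `beta` here.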
PYTHON · RUNNABLE IN-BROWSER

    # Norms, dot products, cosine similarity, and projection (EQ S6.2-S6.3)
    import numpy as np

    u = np.array([3.0, 4.0])
    v = np.array([4.0, -3.0])
    x = np.array([2.0, 1.0])

    print("||u||_2 (sqrt of sum of squares):", np.linalg.norm(u))    # -> 5.0
    print("||u||_1 (sum of abs):", np.linalg.norm(u, 1))             # -> 7.0
    print("||u||_inf (max abs):", np.linalg.norm(u, np.inf))

    dot = u @ v
    print("\nu . v:", dot, " -> orthogonal" if abs(dot) < 1e-12 else "")

    # Assumed continuation (the source cell is cut off at this point): the
    # cosine-similarity and projection steps named in the header comment.
    cos = (u @ x) / (np.linalg.norm(u) * np.linalg.norm(x))
    print("cosine similarity of u and x:", round(float(cos), 4))

    proj = (u @ x) / (u @ u) * u          # EQ S6.3: projection of x onto u
    resid = x - proj                      # residual, orthogonal to u
    print("proj_u(x):", proj, " residual:", resid)
    print("residual . u (~0):", round(float(resid @ u), 12))

6.3 A matrix as a linear map; rank & null space
Here is the conceptual hinge of the chapter. A matrix \(A\) is a linear map \(\mathbf{x}\mapsto A\mathbf{x}\): it sends lines to lines, keeps the origin fixed, and respects linear combinations, \(A(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha A\mathbf{u} + \beta A\mathbf{v}\). Geometrically a \(2\times 2\) matrix can rotate, scale, shear, reflect, or flatten the plane — and nothing else. The instrument below lets you grab the four entries and watch the unit grid deform.

Two subspaces describe everything a map does. The column space (range) is the set of all outputs \(A\mathbf{x}\) — by EQ S6.1 it is precisely the span of the columns. The null space (kernel) is the set of inputs that get crushed to zero, \(A\mathbf{x}=\mathbf{0}\). The dimension of the column space is the rank — the number of genuinely independent directions the map can produce.

EQ S6.4 — RANK–NULLITY THEOREM
$$ \underbrace{\operatorname{rank}(A)}_{\dim(\text{column space})} \;+\; \underbrace{\operatorname{nullity}(A)}_{\dim(\text{null space})} \;=\; n \quad (\text{number of columns of } A \in \mathbb{R}^{m\times n}) $$
Every input dimension is accounted for: it either survives into the output (contributing to rank) or is annihilated (contributing to nullity). A square matrix is invertible iff it is full rank, i.e. nullity \(=0\), i.e. \(\det A \neq 0\). Row rank always equals column rank — a small miracle worth pausing on.
Real data matrices are nearly low-rank: a thousand columns of survey answers might have effective rank 20, because the columns are correlated. That redundancy is exactly what §6.5 exploits. For a \(2\times 2\) map the determinant \(\det A = ad - bc\) (for \(A = \left[\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right]\)) is the signed area-scaling factor: a unit square of area 1 becomes a parallelogram of area \(|\det A|\). A determinant of zero means the map flattens the plane onto a line (or a point) — it loses a dimension, drops rank, and cannot be inverted. A negative determinant means the map also flips orientation (a reflection). INSTRUMENT S6.1 — 2×2 LINEAR-MAP VISUALIZER DRAG ENTRIES · UNIT GRID · EIGENVECTORS a (col 1, row 1) 2.0 b (col 2, row 1) 1.0 c (col 1, row 2) 1.0 d (col 2, row 2) 2.0 det A (AREA × ORIENT.) — RANK — REAL EIGENVALUES λ — The faint grid is the input plane; the mint grid is its image under \(A\). The blue arrows are the images of the standard basis — they are the columns of \(A\). When real eigenvalues exist, their eigenvectors are drawn as red rays: directions the map only stretches, never turns. Pull the determinant to zero (try a=2, b=1, c=2, d=1) and the whole plane collapses onto a single line — rank drops to 1, the map is no longer invertible. Default \(A = \left[\begin{smallmatrix}2&1\\1&2\end{smallmatrix}\right]\) has eigenvalues 3 and 1 along the diagonals \((1,1)\) and \((1,-1)\). 6.4 Eigenvalues & eigenvectors — the invariant directions Most directions get rotated when you apply a matrix. A precious few do not: the map only stretches them, leaving their line untouched. Those special directions are the eigenvectors, and the stretch factors are the eigenvalues. They are the natural coordinate system of the map — the axes along which its behavior is pure scaling. 
EQ S6.5 — THE EIGENVALUE EQUATION $$ A\mathbf{v} = \lambda\mathbf{v}, \quad \mathbf{v}\neq\mathbf{0} \qquad\Longleftrightarrow\qquad \det(A - \lambda I) = 0 $$ \(A\mathbf{v}=\lambda\mathbf{v}\) says: applying \(A\) to \(\mathbf{v}\) just rescales it by \(\lambda\). For a nonzero \(\mathbf{v}\) to exist, \(A-\lambda I\) must be singular, giving the characteristic polynomial \(\det(A-\lambda I)=0\). For a \(2\times 2\) matrix this is the tidy quadratic \(\lambda^2 - (\operatorname{tr}A)\,\lambda + \det A = 0\), where \(\operatorname{tr}A = a+d\). Two facts fall straight out: the eigenvalues sum to the trace and multiply to the determinant. Eigen-decomposition is everywhere once you know its face. PCA diagonalizes the covariance matrix; its eigenvectors are the principal axes of the data cloud and the eigenvalues are the variance along each. PageRank is the dominant eigenvector of the web's link matrix. The spectral theorem guarantees that any symmetric matrix \(A = A^\top\) — covariance matrices, graph Laplacians, Gram matrices, Hessians — has real eigenvalues and a full set of orthogonal eigenvectors, which is what makes those objects so well-behaved. A symmetric matrix is positive-definite exactly when all its eigenvalues are positive: the condition for a strictly convex quadratic and a unique loss minimum (Stats 02). CAVEAT Not every matrix is so tidy. A real matrix can have complex eigenvalues — a pure rotation \(\left[\begin{smallmatrix}0&-1\\1&0\end{smallmatrix}\right]\) has eigenvalues \(\pm i\) and no real eigenvector, because it turns every direction. Non-symmetric matrices may also be defective: they lack a full set of independent eigenvectors and cannot be diagonalized at all. The SVD (§6.5) sidesteps every one of these pathologies — it exists for any matrix, square or not, real or rank-deficient — which is why it, not eigen-decomposition, is the workhorse of applied linear algebra. 
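For a \(2\times 2\) matrix the characteristic quadratic of EQ S6.5 can be solved by hand or in a few lines of code. The sketch below uses `cmath` so that the rotation case from the caveat returns its complex pair instead of raising an error; the function name `eig2x2` is an assumption made for this example.

```python
# Eigenvalues of [[a, b], [c, d]] from trace and determinant (EQ S6.5).
import cmath

def eig2x2(a, b, c, d):
    """Roots of lambda^2 - tr*lambda + det = 0, returned as complex numbers."""
    tr = a + d
    det = a * d - b * c
    disc = tr * tr - 4 * det           # discriminant of the quadratic
    root = cmath.sqrt(disc)            # complex sqrt handles disc < 0
    return (tr + root) / 2, (tr - root) / 2

sym = eig2x2(2, 1, 1, 2)     # symmetric example used throughout the chapter
rot = eig2x2(0, -1, 1, 0)    # quarter-turn rotation

print("symmetric:", sym)
print("rotation: ", rot)

# Free correctness check: the roots sum to the trace and multiply to the determinant.
for (l1, l2), (tr, det) in [(sym, (4, 3)), (rot, (0, 1))]:
    print("sum:", l1 + l2, "(trace", tr, ")  product:", l1 * l2, "(det", det, ")")
```

A negative discriminant is the signal that no real eigenvector exists, which is exactly the rotational case described above.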
INSTRUMENT S6.2 — 2×2 EIGEN-DECOMPOSITION STEPPER · trace · det · CHARACTERISTIC ROOTS · EQ S6.5
Controls (defaults): a 2.0, b 1.0, c 1.0, d 2.0. Readouts: trace = λ₁ + λ₂, det = λ₁ · λ₂, EIGENVALUES.
Reads off the characteristic equation step by step: trace, determinant, discriminant \(\tau^2 - 4\delta\), then the roots \(\lambda = \tfrac{1}{2}\big(\tau \pm \sqrt{\tau^2 - 4\delta}\big)\). When the discriminant goes negative the roots turn complex (a rotational map — no real eigenvector). The default \(\left[\begin{smallmatrix}2&1\\1&2\end{smallmatrix}\right]\) gives \(\tau=4\), \(\delta=3\), discriminant \(4\), and eigenvalues 3 and 1. Notice trace and determinant always equal the sum and product of the eigenvalues — a free correctness check.

What is the largest eigenvalue of the diagonal matrix \( \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \)?
For a diagonal (or triangular) matrix the eigenvalues are exactly the diagonal entries: here \( \{2, 3\} \). The largest is 3. (Check: trace \(= 2+3 = 5 = \lambda_1 + \lambda_2\); determinant \(= 2\cdot 3 = 6 = \lambda_1\lambda_2\). ✓)

PYTHON · RUNNABLE IN-BROWSER

    # Power iteration: find the dominant eigenvector, compare to numpy.linalg.eig
    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # symmetric: eigenvalues 3 and 1

    v = rng.normal(size=2)                # random start
    v /= np.linalg.norm(v)
    for it in range(50):                  # repeatedly apply A and renormalize
        w = A @ v
        v = w / np.linalg.norm(w)
    lam = v @ (A @ v)                     # Rayleigh quotient -> the eigenvalue

    # numpy's reference decomposition
    vals, vecs = np.linalg.eig(A)
    top = np.argmax(np.abs(vals))

    print("power-iteration eigenvalue:", round(float(lam), 6))
    print("numpy top eigenvalue:", round(float(vals[top].real), 6))
    print("power-iteration eigenvector:", np.round(np.abs(v), 4))
    print("numpy top eigenvector:", np.round(np.abs(vecs[:, top]), 4))
    print("A v - lambda v (~0):", np.round(A @ v - lam * v, 6))

6.5 The Singular Value Decomposition — the master factorization
Eigen-decomposition is fussy: it wants square, ideally symmetric, non-defective matrices. The Singular Value Decomposition has no such demands. Every matrix — rectangular, rank-deficient, whatever — factors as a rotation, a pure axis-aligned scaling, and another rotation:

EQ S6.6 — THE SVD
$$ A = U\Sigma V^\top, \qquad A \in \mathbb{R}^{m\times n},\; U^\top U = I,\; V^\top V = I,\; \Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \cdots \ge 0) $$
\(V\) (right singular vectors) is an orthonormal basis of the input space, \(U\) (left singular vectors) of the output space, and the singular values \(\sigma_i\) on \(\Sigma\) say how much each direction is stretched. Read it as a recipe: \(A\) rotates by \(V^\top\), scales each axis by \(\sigma_i\), then rotates by \(U\). The rank of \(A\) is just the count of nonzero \(\sigma_i\). The singular values are the square roots of the eigenvalues of \(A^\top A\), connecting the SVD back to §6.4. Every matrix has one; it is the closest thing linear algebra has to a universal tool.
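Two claims from the SVD description above are quick to confirm numerically: the singular values are the square roots of the eigenvalues of \(A^\top A\), and the rank is the number of nonzero singular values. The matrices below are small hypothetical examples, and the tolerance used to decide "nonzero" is an assumption that would need tuning for real data.

```python
# Singular values vs eigenvalues of A^T A, and rank from the spectrum (EQ S6.6).
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0],
              [0.0, 0.0]])                      # hypothetical 3x2 matrix, full column rank

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]     # eigenvalues of A^T A, descending
eigvals = np.clip(eigvals, 0.0, None)           # guard against tiny negative rounding

print("singular values:        ", np.round(s, 6))
print("sqrt(eig(A^T A)):       ", np.round(np.sqrt(eigvals), 6))
print("U diag(s) V^T rebuilds A:", np.allclose(U @ np.diag(s) @ Vt, A))

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                      # second row is twice the first
s_B = np.linalg.svd(B, compute_uv=False)
tol = 1e-10                                     # assumed threshold for "nonzero"
rank_A = int((s > tol).sum())
rank_B = int((s_B > tol).sum())
print("rank(A) =", rank_A, " rank(B) =", rank_B)
```

`np.linalg.matrix_rank` applies the same idea with a tolerance scaled to the matrix, which is usually the safer choice outside of toy examples.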
The SVD's superpower is optimal compression. Keep only the largest \(k\) singular values — zero out the rest — and you get the best possible rank-\(k\) approximation of \(A\), in a precise sense. This is the Eckart–Young theorem, one of the most useful results in all of applied mathematics: EQ S6.7 — ECKART–YOUNG: BEST LOW-RANK APPROXIMATION $$ A_k = \sum_{i=1}^{k} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top \;=\; \arg\min_{\operatorname{rank}(B)\le k} \|A - B\|, \qquad \|A - A_k\|_F = \sqrt{\textstyle\sum_{i>k}\sigma_i^2} $$ No rank-\(k\) matrix approximates \(A\) better than truncating its SVD — true in both the spectral norm (error \(=\sigma_{k+1}\)) and the Frobenius norm (error \(=\sqrt{\sum_{i>k}\sigma_i^2}\)). The reconstruction error is governed entirely by the singular values you threw away. If the spectrum decays fast — as it does for almost all real data — a tiny \(k\) captures nearly everything. This single theorem underlies PCA, image compression, latent-semantic indexing, collaborative-filtering recommenders, and the low-rank update at the heart of LoRA fine-tuning (Vol II · CH 06). The connection to PCA is exact: center your data matrix, take its SVD, and the right singular vectors \(\mathbf{v}_i\) are the principal components, with variance \(\sigma_i^2/(N-1)\) along each. Computing PCA via the SVD rather than by forming \(X^\top X\) is also numerically preferable — squaring the matrix squares its condition number and throws away precision. INSTRUMENT S6.3 — SVD LOW-RANK APPROXIMATION RANK-k RECONSTRUCTION · ECKART–YOUNG · EQ S6.7 TARGET SMOOTH RAMP RING CHECKER KEPT RANK k 3 FULL RANK 16 RELATIVE ERROR ‖A−Aₖ‖/‖A‖ — STORAGE vs FULL — Left panel is the original \(16\times16\) matrix as a heatmap; right panel is its rank-\(k\) reconstruction \(A_k = \sum_{i\le k}\sigma_i\mathbf{u}_i\mathbf{v}_i^\top\). Drag \(k\) and watch the error fall. The smooth ramp is essentially rank 2 — by \(k=2\) the error is already near zero. 
The ring has a longer tail of singular values, so its error decays gradually. The blocky checkerboard hides a surprise: despite looking high-frequency it has only two nonzero singular values (both \(\approx 8\)), so \(k=2\) reconstructs it perfectly — a reminder that visual complexity and algebraic rank are different things. A rank-\(k\) factorization stores \(k(m+n)\) numbers instead of \(mn\): at \(k=3\) on a \(16\times16\) grid that is 96 vs 256 numbers, and the gap widens enormously at scale. The error reported is exactly \(\sqrt{\sum_{i>k}\sigma_i^2}/\|A\|_F\), as Eckart–Young promises.

PYTHON · RUNNABLE IN-BROWSER

    # SVD a small matrix, reconstruct with the top-k singular values, print the error
    import numpy as np

    rng = np.random.default_rng(1)
    # a 6x5 matrix with a deliberately low-rank core plus a little noise
    core = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))    # true rank 2
    A = core + 0.05 * rng.normal(size=(6, 5))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print("singular values:", np.round(s, 3))

    normA = np.linalg.norm(A)                                   # Frobenius norm
    for k in range(1, len(s) + 1):
        Ak = (U[:, :k] * s[:k]) @ Vt[:k]                        # rank-k reconstruction
        err_measured = np.linalg.norm(A - Ak) / normA
        err_formula = np.sqrt((s[k:] ** 2).sum()) / normA       # Eckart-Young, EQ S6.7
        print(f"k={k}: rel.error {err_measured:.4f} "
              f"(formula sqrt(sum sigma_i>k^2): {err_formula:.4f})")

    print("\nerror collapses after k=2 -- the matrix is essentially rank 2,")
    print("and the measured error matches the Eckart-Young formula exactly.")
    plot_xy(list(range(1, len(s) + 1)), list(s))                # the singular-value spectrum (site plotting helper)

NEXT You now have the static geometry; next comes the dynamics.
A linear map applied over and over — a stochastic transition matrix stepping a probability distribution forward — is a Markov chain, and its long-run behavior is decided by exactly the eigenstructure you just met: the dominant eigenvalue is 1, and its eigenvector is the stationary distribution. Chapter 07 turns the matrix loose in time. 6.R References Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. The standard intuition-first text for the column-space, rank, eigenvalue, and SVD material here. Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. The canonical numerical reference for the SVD, power iteration, and conditioning. Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218. The optimal low-rank approximation theorem (EQ S6.7). Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572. The origin of principal-component analysis as projection onto a best-fit subspace (§6.2, §6.5). Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566. Historical context tracing the SVD from Beltrami and Jordan to its modern role.
## STATS · Markov Chains & MCMC (https://ai-encyclopedia.com/stats/07-markov-chains.html)
MATHEMATICS & STATISTICS · CHAPTER 07 / 08 Markov Chains & MCMC A process that forgets its past, a property called memorylessness, is enough to model PageRank and language and to sample from distributions we cannot integrate.
The assumption that the next state depends only on the present, not the full history, turns a sequence into a matrix, gives long-run behavior a fixed point you can solve for, and lets us draw from any probability density by walking through it. LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON STATS 01 · 06 INSTRUMENTS TRANSITION · PAGERANK · METROPOLIS IN THIS CHAPTER 7.1 The Markov property 7.2 Stationarity & ergodicity 7.3 Hidden Markov Models 7.4 MCMC: Metropolis–Hastings 7.5 Gibbs & MCMC in ML 7.R References 7.1 The Markov property & transition matrices Most stochastic processes carry their whole history with them: tomorrow's weather might depend on the last week, a word on the whole paragraph. A Markov chain is the radical simplification that throws the history away. The future depends on the present state and nothing earlier — the chain is memoryless: EQ S7.1 — THE MARKOV PROPERTY $$ \Pr\!\big(X_{t+1} = j \mid X_t = i,\, X_{t-1},\, \ldots,\, X_0\big) \;=\; \Pr\!\big(X_{t+1} = j \mid X_t = i\big) \;=\; P_{ij} $$ Conditioning on the entire past collapses to conditioning on the single current state. The number \(P_{ij}\) — the probability of stepping from state \(i\) to state \(j\) — is independent of when or how you arrived at \(i\). This one line is the whole subject. Everything downstream (stationarity, HMMs, MCMC) is a consequence of replacing "history" with "current state". A chain that instead needs the last \(k\) states is order-\(k\); but any order-\(k\) chain over alphabet \(S\) is an ordinary order-1 chain over the enlarged state space \(S^k\), so we lose no generality studying order 1. Collect the \(P_{ij}\) into a transition matrix \(P\). It is row-stochastic: every entry is non-negative and every row sums to 1, because from state \(i\) the chain must go somewhere. 
If the distribution over states today is a row vector \(\pi^{(t)}\), then one step of the chain is one matrix multiply: EQ S7.2 — ONE STEP & THE CHAPMAN–KOLMOGOROV RELATION $$ \pi^{(t+1)} \;=\; \pi^{(t)} P, \qquad\Longrightarrow\qquad \pi^{(t)} \;=\; \pi^{(0)} P^{\,t}, \qquad \big(P^{\,n}\big)_{ij} \;=\; \Pr\!\big(X_{t+n}=j \mid X_t=i\big) $$ Distributions multiply \(P\) from the left (row vector times matrix). The \(n\)-step transition probabilities are just the \(n\)-th matrix power — the Chapman–Kolmogorov equation \(P^{m+n} = P^m P^n\) is nothing more than the associativity of matrix multiplication. So the long-run behavior of a Markov chain is an eigenvalue question about \(P\), which is exactly the linear algebra of Chapter 06 put to work. A canonical example is a tiny two-state weather model: Sunny and Rainy, with \(P(\text{stay sunny}) = 0.7\) and \(P(\text{stay rainy}) = 0.6\). The transition matrix and one step from "certainly sunny today" are: EQ S7.3 — A TWO-STATE CHAIN $$ P = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \qquad \pi^{(0)} = (1,\ 0), \qquad \pi^{(1)} = \pi^{(0)} P = (0.7,\ 0.3) $$ Rows are "from", columns are "to": row 1 is "from Sunny", so \(0.7\) stay sunny, \(0.3\) turn rainy. Iterate and the distribution marches toward a fixed point \((4/7,\ 3/7) \approx (0.571,\ 0.429)\) regardless of the starting day — the stationary distribution of §7.2. The whole future is encoded in this \(2\times2\) grid. A chain's next state depends only on the current state — that is exactly the Markov property of EQ S7.1. What is the order of such a chain (how many previous states the transition rule reads)? Order-\(k\) means the next state depends on the last \(k\) states. The Markov property says it depends on the current state alone, so \(k = \) 1. (An order-2 chain would read the last two states; it can always be re-encoded as an order-1 chain over pairs.)
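The re-encoding of an order-2 chain as an order-1 chain over pairs can be made concrete. The sketch below takes a made-up order-2 weather rule (the `p_sunny` table is an illustrative assumption, not part of the chapter's model) and builds the equivalent order-1 transition matrix over pairs of consecutive days. Only shifts of the form (a, b) → (b, c) receive nonzero probability.

```python
import itertools
import numpy as np

states = ["S", "R"]  # Sunny, Rainy

# Hypothetical order-2 rule: P(tomorrow = Sunny | day before, today).
p_sunny = {("S", "S"): 0.8, ("S", "R"): 0.4,
           ("R", "S"): 0.6, ("R", "R"): 0.2}

# Enlarged state space S^2: every ordered pair (previous day, current day).
pairs = list(itertools.product(states, repeat=2))
index = {pair: i for i, pair in enumerate(pairs)}

P2 = np.zeros((len(pairs), len(pairs)))
for (a, b) in pairs:
    for c in states:
        prob = p_sunny[(a, b)] if c == "S" else 1.0 - p_sunny[(a, b)]
        # The pair shifts by one day: (a, b) -> (b, c).
        P2[index[(a, b)], index[(b, c)]] = prob

print("pair states:", pairs)
print(np.round(P2, 2))
print("row-stochastic:", np.allclose(P2.sum(axis=1), 1.0))
```

The result is an ordinary 4×4 row-stochastic matrix, so the one-step update of EQ S7.2 applies to it unchanged.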
PYTHON · RUNNABLE IN-BROWSER # Iterate a 2-state chain (EQ S7.2) and watch it forget where it started import numpy as np P = np.array([[0.7, 0.3], # from Sunny -> [Sunny, Rainy] [0.4, 0.6]]) # from Rainy -> [Sunny, Rainy] for start in ([1.0, 0.0], [0.0, 1.0]): # two very different beginnings pi = np.array(start) print(f"start {start}:") for t in range(6): print(f" t={t} P(Sunny)={pi[0]:.4f} P(Rainy)={pi[1]:.4f}") pi = pi @ P # one step = one matrix multiply print() print("Both starts converge to (4/7, 3/7) =", np.round([4/7, 3/7], 4), "-- the chain forgets its initial state.") RUN ▶ edits are live — break it on purpose INSTRUMENT S7.1 — TRANSITION-MATRIX EXPLORER EDIT P · ITERATE TO STATIONARY · EQ S7.2 P(A → A) 0.70 P(B → B) 0.60 START P(A) 1.00 CURRENT P(A) · P(B) — STATIONARY π = (πA, πB) — STEPS TO CONVERGE (Δ<1e-4) — STEP ▸ RUN TO π ▶ RESET Off-diagonals are fixed by row-stochasticity: \(P(A\to B) = 1 - P(A\to A)\). Press STEP to apply \(\pi \leftarrow \pi P\) once and watch the bars march; RUN animates the whole walk to the fixed point. The dashed line is the solved stationary \(\pi_A = \tfrac{P(B\to A)}{P(A\to B)+P(B\to A)}\). Drag START P(A) anywhere — the chain lands on the same π, because it forgets its past. 7.2 Stationary distributions & ergodicity The fixed point the weather chain crawled toward is no accident. A distribution \(\pi\) is stationary if applying the chain leaves it unchanged: one step in, the same distribution out. It is the eigenvector of \(P^\top\) for eigenvalue 1: EQ S7.4 — STATIONARY DISTRIBUTION $$ \pi P = \pi, \qquad \sum_i \pi_i = 1, \qquad \pi_i \ge 0 \qquad\Longleftrightarrow\qquad P^\top \pi^\top = \pi^\top $$ A row-stochastic \(P\) always has eigenvalue 1 (its rows sum to 1, so the all-ones vector is a right eigenvector); the corresponding left eigenvector, normalized to sum to 1, is \(\pi\). For the two-state chain, solving \(\pi_A P_{AB} = \pi_B P_{BA}\) with \(\pi_A + \pi_B = 1\) gives \(\pi_A = P_{BA}/(P_{AB}+P_{BA})\). 
Stationary does not mean the chain stops — individual realizations keep hopping; it means the population of states stops shifting. When does iterating actually reach \(\pi\), and is \(\pi\) unique? That is the question of ergodicity. A finite chain converges to a single stationary distribution from every start exactly when it is: Irreducible — every state is reachable from every other (the chain is one connected piece, not separate islands). Otherwise each island has its own \(\pi\). Aperiodic — the chain is not trapped in a fixed cycle (e.g. a chain that strictly alternates A→B→A→B has period 2 and never settles; it oscillates forever). A single self-loop \(P_{ii} > 0\) anywhere breaks periodicity. ERGODIC THEOREM An irreducible, aperiodic finite chain has a unique stationary \(\pi\), and \(P^{\,t} \to \mathbf{1}\pi\) as \(t \to \infty\). Two consequences power the rest of the chapter. (1) Convergence: the distribution forgets its start at a rate governed by the second-largest eigenvalue \(|\lambda_2|\) — the spectral gap \(1 - |\lambda_2|\) is the mixing speed. (2) Time-averages = space-averages: the long-run fraction of time a single trajectory spends in state \(i\) equals \(\pi_i\). That second fact is what makes MCMC (§7.4) legal: run one walk long enough and its visit-frequencies are the target distribution. A sufficient (not necessary) condition that makes \(\pi\) easy to verify and is the cornerstone of MCMC is detailed balance — the chain is reversible, with as much probability flowing \(i \to j\) as \(j \to i\): EQ S7.5 — DETAILED BALANCE (REVERSIBILITY) $$ \pi_i\, P_{ij} \;=\; \pi_j\, P_{ji} \quad \text{for all } i, j \qquad\Longrightarrow\qquad \pi P = \pi $$ Sum the left identity over \(i\): \(\sum_i \pi_i P_{ij} = \pi_j \sum_i P_{ji} = \pi_j\), which is exactly \((\pi P)_j = \pi_j\). So detailed balance implies stationarity — a strictly stronger, local, pairwise condition that is far easier to engineer than the global \(\pi P = \pi\). 
Metropolis–Hastings (§7.4) is, in one sentence, a recipe for constructing a chain that satisfies EQ S7.5 for any target \(\pi\) you name. Stationary distributions are also the engine behind PageRank. Model a random surfer who, at each step, follows an outgoing link uniformly at random with probability \(1-\alpha\); with probability \(\alpha\) (the teleport probability, \(\alpha \approx 0.15\); the conventional damping factor is \(1-\alpha \approx 0.85\)) they instead teleport to a uniformly random page. That teleport term makes the chain irreducible and aperiodic on any web graph, so the ergodic theorem guarantees a unique stationary \(\pi\) — and \(\pi_i\), the long-run fraction of time the surfer sits on page \(i\), is its PageRank. For the two-state chain with \(P(\text{stay A}) = 0.7\) (so \(P(A\to B)=0.3\)) and \(P(\text{stay B}) = 0.6\) (so \(P(B\to A)=0.4\)), what is the stationary probability \(\pi_A\)? Use \(\pi_A = \dfrac{P(B\to A)}{P(A\to B) + P(B\to A)}\). \( \pi_A = \dfrac{0.4}{0.3 + 0.4} = \dfrac{0.4}{0.7} = \dfrac{4}{7} = \) 0.571. Equivalently, detailed balance \(\pi_A \cdot 0.3 = \pi_B \cdot 0.4\) with \(\pi_A + \pi_B = 1\) gives the same answer. The chain spends ~57% of its days sunny in the long run. INSTRUMENT S7.2 — RANDOM WALK & PAGERANK 5-NODE GRAPH · DAMPED SURFER · EQ S7.4 DAMPING α (teleport prob) 0.15 WALK SPEED MED STEPS WALKED 0 TOP PAGE (EMPIRICAL) — EMPIRICAL ≈ EXACT π? — WALK ▶ RESET COUNTS A single surfer hops the directed graph; each node's bar shows the empirical visit frequency, with the blue tick marking the exact stationary \(\pi\) (the eigenvector, solved by power iteration). Watch the time-average climb toward the space-average — that convergence is the ergodic theorem in action. Node C is a hub with many inbound links, so it wins. Set α high and authority flattens (everyone teleports everywhere); set α = 0 and a dangling/cyclic trap can starve the rest.
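The damped surfer can be sketched as a plain power iteration. The example below is a minimal illustration under stated assumptions: the 4-page link graph is hypothetical, every page has at least one outgoing link (so dangling pages need no special handling), and `alpha` is taken to be the teleport probability, as in the instrument's label.

```python
import numpy as np

# Hypothetical link graph: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
alpha = 0.15  # teleport probability (assumed convention)

# Link-following matrix: row i spreads its mass uniformly over i's out-links.
follow = np.zeros((n, n))
for i, outs in links.items():
    follow[i, outs] = 1.0 / len(outs)

# Damped surfer: follow a link w.p. 1 - alpha, teleport uniformly w.p. alpha.
G = (1 - alpha) * follow + alpha * np.ones((n, n)) / n

pi = np.full(n, 1.0 / n)          # any starting distribution works
for _ in range(200):
    pi = pi @ G                   # EQ S7.2, repeatedly

print("PageRank:", np.round(pi, 4))
print("fixed point holds:", np.allclose(pi @ G, pi))
print("top page:", int(pi.argmax()))
```

Page 2 collects links from three of the four pages, so it ends up with the largest stationary mass; page 3 has no inbound links and keeps only what teleportation hands it.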
PYTHON · RUNNABLE IN-BROWSER # Power-iterate a transition matrix to its stationary pi; verify pi @ P = pi import numpy as np P = np.array([[0.7, 0.3], [0.4, 0.6]]) assert np.allclose(P.sum(1), 1.0) # rows are valid distributions pi = np.array([1.0, 0.0]) # any start works for _ in range(200): pi = pi @ P # EQ S7.2, repeatedly pi = pi / pi.sum() print("stationary pi:", np.round(pi, 6)) print("pi @ P:", np.round(pi @ P, 6)) print("fixed point holds:", np.allclose(pi @ P, pi)) # cross-check against the dominant LEFT eigenvector of P (eigval 1) w, v = np.linalg.eig(P.T) # left eigvecs = right eigvecs of P^T k = np.argmin(np.abs(w - 1.0)) # the eigenvalue equal to 1 ev = np.real(v[:, k]); ev = ev / ev.sum() print("eigenvector pi:", np.round(ev, 6)) print("closed form (4/7):", round(4/7, 6)) RUN ▶ edits are live — break it on purpose 7.3 Hidden Markov Models So far the state was observable — we saw the weather directly. A Hidden Markov Model (HMM) adds one layer of indirection: the Markov chain runs underneath, but you never see the states. You see only emissions — noisy observations whose distribution depends on the hidden state. The classic toy: you cannot see the weather, but you see whether a friend carries an umbrella; you infer the hidden weather from the visible umbrellas. EQ S7.6 — HMM JOINT DISTRIBUTION $$ \Pr\!\big(x_{1:T},\, z_{1:T}\big) \;=\; \underbrace{\pi_{z_1}}_{\text{initial}} \;\prod_{t=2}^{T} \underbrace{A_{z_{t-1} z_t}}_{\text{transition}} \;\prod_{t=1}^{T} \underbrace{B_{z_t}(x_t)}_{\text{emission}} $$ \(z_{1:T}\) are the hidden states (the Markov chain), \(x_{1:T}\) the observations. \(A\) is the state-transition matrix (the chain of §7.1), \(B_{z}(x)\) the emission probability of seeing \(x\) when the hidden state is \(z\), and \(\pi\) the initial distribution. Two Markov assumptions are stacked: states depend only on the previous state, and each observation depends only on the current state. 
Summing this joint over all \(K^T\) hidden paths looks hopeless — but dynamic programming makes it \(O(K^2 T)\). Three canonical questions, each with a standard algorithm — all variants of the same dynamic program that re-uses the Markov factorization instead of enumerating paths:
| Question | Algorithm | Computes |
| --- | --- | --- |
| Likelihood — how probable is this observation sequence? | Forward | \(\Pr(x_{1:T})\) by summing over hidden paths |
| Decoding — what is the single most likely hidden path? | Viterbi | \(\arg\max_z \Pr(z_{1:T} \mid x_{1:T})\) |
| Learning — fit \(A, B, \pi\) from unlabeled data | Baum–Welch (EM) | parameters that locally maximize \(\Pr(x_{1:T})\) |
The forward algorithm carries a vector of "beliefs" \(\alpha_t(j) = \Pr(x_{1:t},\, z_t = j)\) — the joint probability of the observations so far and being in state \(j\) now — and updates it one observation at a time: EQ S7.7 — THE FORWARD RECURSION $$ \alpha_t(j) \;=\; B_j(x_t) \sum_{i=1}^{K} \alpha_{t-1}(i)\, A_{ij}, \qquad \Pr(x_{1:T}) \;=\; \sum_{j=1}^{K} \alpha_T(j) $$ Each step blends "where could I have come from" (\(\sum_i \alpha_{t-1}(i) A_{ij}\), a transition step exactly like EQ S7.2) with "does state \(j\) explain what I just saw" (\(B_j(x_t)\)). The whole sequence's likelihood is the final beliefs summed. Viterbi is the same recursion with \(\sum\) replaced by \(\max\) (and a back-pointer), turning "total probability of all paths" into "probability of the single best path". HMMs ruled speech recognition, part-of-speech tagging, and bioinformatics (gene finding, sequence alignment) for two decades. In modern deep learning their throne went to RNNs and then Transformers (Vol II · Ch 03), which drop the discrete-state and conditional-independence constraints. But the HMM's structure survives everywhere: linear-state-space models and Kalman filters are HMMs with continuous Gaussian states, and the forward–backward dynamic program is the direct ancestor of the message-passing that powers modern structured prediction.
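The forward recursion (EQ S7.7) and its max-product twin, Viterbi, fit in a few lines. The sketch below uses a hypothetical umbrella HMM: every number in `pi0`, `A`, `B` and the observation sequence `obs` is an illustrative assumption. It cross-checks both dynamic programs against brute-force enumeration of all \(K^T\) paths, which is feasible only because the example is tiny.

```python
import itertools
import numpy as np

# Hypothetical umbrella HMM. Hidden states: 0 = Sunny, 1 = Rainy.
# Observations: 0 = no umbrella, 1 = umbrella.
pi0 = np.array([0.5, 0.5])                 # initial distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[z, x] = P(observe x | state z)
obs = [1, 1, 0, 1]

def forward(obs):
    alpha = pi0 * B[:, obs[0]]             # alpha_1(j) = pi_j * B_j(x_1)
    for x in obs[1:]:
        alpha = B[:, x] * (alpha @ A)      # EQ S7.7
    return float(alpha.sum())

def viterbi(obs):
    delta = pi0 * B[:, obs[0]]
    back = []
    for x in obs[1:]:
        scores = delta[:, None] * A        # scores[i, j] = delta(i) * A_ij
        back.append(scores.argmax(axis=0)) # best predecessor of each state j
        delta = B[:, x] * scores.max(axis=0)   # sum replaced by max
    path = [int(delta.argmax())]
    for bp in reversed(back):              # follow back-pointers to t = 1
        path.append(int(bp[path[-1]]))
    return path[::-1], float(delta.max())

def joint(z, obs):                         # EQ S7.6 for one hidden path z
    p = pi0[z[0]] * B[z[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[z[t - 1], z[t]] * B[z[t], obs[t]]
    return float(p)

paths = list(itertools.product(range(2), repeat=len(obs)))
brute_likelihood = sum(joint(z, obs) for z in paths)
brute_best = max(joint(z, obs) for z in paths)

likelihood = forward(obs)
best_path, best_prob = viterbi(obs)
print("forward likelihood:", likelihood, "brute force:", brute_likelihood)
print("Viterbi path:", best_path, "prob:", best_prob, "brute force:", brute_best)
```

Enumeration costs \(K^T\) joint evaluations while each dynamic program does \(T\) steps of \(K^2\) work, which is why only the latter scales to long sequences.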
An HMM starts with \(\pi_{\text{Rainy}} = 0.6\). The emission probability of seeing an umbrella given Rainy is \(B_{\text{Rainy}}(\text{umbrella}) = 0.3\). Using EQ S7.7 at \(t=1\) (no transition yet), what is the forward value \(\alpha_1(\text{Rainy}) = \pi_{\text{Rainy}}\, B_{\text{Rainy}}(\text{umbrella})\)? \( \alpha_1(\text{Rainy}) = \pi_{\text{Rainy}} \cdot B_{\text{Rainy}}(\text{umbrella}) = 0.6 \times 0.3 = \) 0.18. This is the joint probability of "the weather is Rainy on day 1 and we saw an umbrella" — the seed the forward recursion then propagates forward. 7.4 Markov Chain Monte Carlo — Metropolis–Hastings Here the chapter turns inside-out. Until now \(P\) was given and we asked for \(\pi\). MCMC inverts the problem: you are given a target distribution \(\pi\) — typically a Bayesian posterior (Stats 05) you can evaluate up to a constant but cannot integrate or sample from directly — and you construct a Markov chain whose stationary distribution is exactly that \(\pi\). Run the chain; its trajectory becomes your sample. The ergodic theorem (§7.2) is the guarantee that this is allowed. WHY THIS MATTERS The defining pain of Bayesian inference is the normalizing constant. The posterior \(p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}\) has a denominator \(p(x) = \int p(x\mid\theta)p(\theta)\,\mathrm{d}\theta\) that is usually an intractable high-dimensional integral. MCMC sidesteps it entirely: every step compares two densities as a ratio, so the unknown constant cancels. You only ever need \(\pi\) up to proportionality. That single trick is why MCMC, not algebra, is how most real Bayesian models are fit. The Metropolis–Hastings algorithm is a constructive recipe for a chain obeying detailed balance (EQ S7.5) with respect to any \(\pi\). 
From the current point \(\theta\), propose a move to \(\theta'\) from a proposal density \(q(\theta' \mid \theta)\); then accept it with a probability designed to enforce reversibility: EQ S7.8 — THE METROPOLIS–HASTINGS ACCEPTANCE RULE $$ a \;=\; \min\!\left(1,\; \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} \right); \qquad \text{accept } \theta' \text{ with prob } a, \text{ else stay at } \theta $$ \(\pi\) appears only as the ratio \(\pi(\theta')/\pi(\theta)\) — the normalizing constant cancels, which is the whole point. The proposal ratio \(q(\theta\mid\theta')/q(\theta'\mid\theta)\) corrects for any asymmetry in how you propose. For a symmetric proposal (e.g. a Gaussian centered on the current point) those \(q\) terms cancel and the rule collapses to the original 1953 Metropolis form \(a = \min(1,\, \pi(\theta')/\pi(\theta))\): always move toward higher density, sometimes move toward lower. Plugging EQ S7.8 into detailed balance verifies \(\pi\) is stationary by construction. The logic is a biased random walk. With a symmetric proposal, uphill moves (to higher \(\pi\)) are always taken; downhill moves are taken with probability equal to the density ratio, so the walker explores the tails without getting stuck on the peak. Over many steps it visits each region in proportion to \(\pi\) — exactly the time-average = space-average promise. Two practical knobs dominate everything: Step size (proposal width). Too small and the walker shuffles, exploring slowly with high autocorrelation; too large and almost every proposal lands in the low-density wilderness and is rejected. The folklore target acceptance rate is ~0.234 for high-dimensional random-walk Metropolis (a Roberts–Gelman–Gilks result) — neither greedy nor timid. Burn-in & mixing. The chain starts wherever you put it, not at \(\pi\); the first stretch is transient and is discarded as burn-in. Consecutive samples are correlated, so the effective sample size is far below the raw count.
Diagnostics like the \(\hat{R}\) statistic (comparing several independent chains) and trace plots are how you decide it has converged — and "it looks converged" is famously not a proof. In symmetric-proposal Metropolis, the acceptance probability is \(a = \min\!\big(1,\, \pi(\theta')/\pi(\theta)\big)\). The current point has (unnormalized) density \(\pi(\theta) = 2\) and the proposed point has \(\pi(\theta') = 1\). What is the acceptance probability \(a\)? The proposal moves downhill (1 < 2), so \(a = \min\!\big(1,\, 1/2\big) = \) 0.5. The walker takes this downhill step half the time — accepting just enough bad moves to map the distribution's tails rather than collapsing onto its mode. (Had the move been uphill, \(a = \min(1,\, \text{ratio}\,>\,1) = 1\): always accept.) INSTRUMENT S7.3 — METROPOLIS SAMPLER 1-D TWO-PEAK TARGET · HISTOGRAM CONVERGES · EQ S7.8 PROPOSAL STEP σ 1.0 SAMPLE SPEED MED SAMPLES DRAWN 0 ACCEPTANCE RATE — HISTOGRAM vs TARGET (L1) — SAMPLE ▶ RESET The blue curve is the target — a two-peaked mixture you can evaluate but not easily sample. Mint bars are the running histogram of accepted samples; the dot is the current walker. Watch the bars grow into the curve. Now break it: shrink σ toward 0 and the walker can't cross the valley between peaks — it samples one mode and reports a confidently wrong distribution (the canonical MCMC failure). Blow σ up and the acceptance rate craters as proposals miss the target entirely. The healthy regime is in between. 
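The claim that the Metropolis rule produces a chain obeying detailed balance can be checked exactly on a small discrete state space, where the whole transition matrix can be written down. The sketch below is illustrative: the five unnormalized weights `w` and the ring-shaped neighbor proposal `Q` are assumptions chosen for the example, not anything from the instrument.

```python
import numpy as np

# Hypothetical unnormalized target on 5 states; only ratios of w are ever used.
w = np.array([1.0, 4.0, 2.0, 6.0, 3.0])
K = len(w)

# Symmetric proposal: step to the left or right neighbor on a ring, 1/2 each.
Q = np.zeros((K, K))
for i in range(K):
    Q[i, (i - 1) % K] = 0.5
    Q[i, (i + 1) % K] = 0.5

# Metropolis kernel: propose j from Q, accept with min(1, w_j / w_i).
P = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            P[i, j] = Q[i, j] * min(1.0, w[j] / w[i])
    P[i, i] = 1.0 - P[i].sum()    # rejected proposals leave the chain at i

pi = w / w.sum()                  # normalized target, used only for checking
flow = pi[:, None] * P            # flow[i, j] = pi_i * P_ij

print("rows sum to 1:   ", np.allclose(P.sum(axis=1), 1.0))
print("detailed balance:", np.allclose(flow, flow.T))   # EQ S7.5
print("pi is stationary:", np.allclose(pi @ P, pi))     # EQ S7.4
```

The kernel is built from `w` alone; the normalizing constant `w.sum()` enters only in the verification step, which mirrors how the acceptance ratio cancels it in practice.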
PYTHON · RUNNABLE IN-BROWSER
# Metropolis-Hastings for a 2-component Gaussian mixture target (EQ S7.8)
import numpy as np
rng = np.random.default_rng(0)

def target(x):  # unnormalized density: two peaks
    return 0.6*np.exp(-0.5*((x+2)/0.7)**2) + 0.4*np.exp(-0.5*((x-2)/1.0)**2)

x, step, n = 0.0, 1.5, 40000
samples, accepts = np.empty(n), 0
for i in range(n):
    xp = x + rng.normal(0, step)  # symmetric Gaussian proposal
    if rng.random() < target(xp) / target(x):  # accept w.p. min(1, ratio), EQ S7.8
        x, accepts = xp, accepts + 1
    samples[i] = x  # a rejected proposal repeats the current point
print("acceptance rate:", round(accepts / n, 3))
RUN ▶ edits are live — break it on purpose
7.5 Gibbs sampling & MCMC in modern ML Metropolis–Hastings is general but blunt: one proposal width for a whole high-dimensional space rarely fits. Gibbs sampling is the special case that exploits structure — when you cannot sample the joint \(p(\theta_1, \ldots, \theta_d)\) but can sample each variable from its full conditional given all the others. You then cycle through the coordinates, replacing each in turn by a fresh draw from its conditional: EQ S7.9 — THE GIBBS SWEEP $$ \theta_1^{(t+1)} \sim p\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_d^{(t)}\big), \;\; \theta_2^{(t+1)} \sim p\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots\big), \;\; \ldots $$ Each coordinate is updated from its conditional with the others held fixed (always using the freshest values). Gibbs is Metropolis–Hastings with acceptance probability always equal to 1: when you propose from the exact conditional, the MH ratio simplifies to one, so no move is ever rejected. The price is that you need those conditionals in closed form — which is exactly why conjugate priors (Stats 05) and graphical models are its natural habitat. It can also mix slowly when coordinates are strongly correlated, because it only ever moves axis-by-axis. Where MCMC sits in the 2026 landscape: Probabilistic programming. Stan, PyMC, and NumPyro let you declare a model and sample its posterior with no hand-derived math.
The default sampler is almost never vanilla Metropolis — it is the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that uses gradients of \(\log\pi\) to propose long, informed trajectories instead of a blind local jiggle, dramatically improving mixing in high dimensions. Vanilla random-walk Metropolis is now mostly pedagogy and a fallback. The gradient frontier. HMC/NUTS need \(\nabla \log \pi\), which auto-diff (Vol II · Ch 03 machinery) supplies for free. For massive datasets, stochastic-gradient MCMC (SGLD and kin) injects calibrated noise into SGD steps so the optimizer's trajectory itself samples a Bayesian posterior over weights — the bridge between deep learning and Bayesian inference. The diffusion connection. Modern image and video generators (Vol II · Ch 08) are, at heart, learned reverse-time Markov chains: a forward chain gradually adds Gaussian noise to data, and a neural network learns to run the chain backward, sampling images by walking from noise to signal. Langevin-style sampling — a gradient ascent on \(\log\pi\) with noise — is the direct intellectual descendant of the Metropolis walk you ran in Instrument S7.3. The honest caveats. Convergence is asymptotic and unprovable in finite time; multimodal targets (like Instrument S7.3 with small σ) can trap a chain in one mode forever; and effective sample sizes can be a tiny fraction of the raw count. MCMC is the workhorse of Bayesian computation precisely because, used with diagnostics and skepticism, it is the most general tool we have — not because it is foolproof. NEXT Markov chains gave us a way to sample what we cannot integrate; the next chapter asks how much a random outcome is worth knowing. Information theory measures uncertainty in bits — entropy, cross-entropy, KL divergence, mutual information — and turns out to be the language in which the acceptance ratios, the mixing rates, and the very loss functions of every model in this encyclopedia are most naturally written. 
Stats 08: Information Theory. 7.R References Norris, J. R. (1997). Markov Chains. Cambridge University Press — the standard rigorous treatment of transition matrices, stationarity, ergodicity, and reversibility. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6) — the original Metropolis acceptance rule (symmetric-proposal special case of EQ S7.8). Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1) — the generalization to asymmetric proposals, completing Metropolis–Hastings. Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2) — the canonical reference for the forward, Viterbi, and Baum–Welch algorithms (§7.3). Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE TPAMI 6(6) — introduced Gibbs sampling (EQ S7.9) to statistics and image analysis. Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab — the damped random-walk Markov chain whose stationary distribution is PageRank (§7.2). Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann. Appl. Probab. 7(1) — the ~0.234 optimal acceptance-rate result (§7.4). Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15 — NUTS, the adaptive HMC sampler behind Stan / PyMC / NumPyro (§7.5). 
## STATS · Information Theory (https://ai-encyclopedia.com/stats/08-information-theory.html)
MATHEMATICS & STATISTICS · CHAPTER 08 / 08 Information Theory In 1948 Claude Shannon laid a foundation that still governs machine learning. He measured surprise as a number, entropy, and proved it is the irreducible cost of communicating, compressing, or predicting a random source. Measured between what a model predicts and what actually happens, that same quantity is the cross-entropy loss that trains neural networks. This chapter builds entropy from a few simple requirements on surprise, derives cross-entropy and KL divergence, then connects them to the loss function. LEVEL CORE READING TIME ≈ 24 MIN BUILDS ON STATS 01–07 INSTRUMENTS ENTROPY · KL · HUFFMAN IN THIS CHAPTER 8.1 Entropy — measuring surprise 8.2 Cross-entropy & KL divergence 8.3 Mutual information 8.4 Source coding 8.5 The bridge to ML 8.R References 8.1 Entropy — measuring surprise Start with one demand: how surprised should you be by an outcome? A coin landing heads when you knew it was rigged to always land heads is no surprise at all. A fair coin landing heads is exactly one bit of surprise. Shannon insisted surprise depend only on the probability of the outcome, that a certain event (\(p = 1\)) carry zero surprise, and that the surprise of two independent events add. Among continuous functions, only one family satisfies all three: the negative logarithm, with the choice of base fixing the unit. EQ S8.1 — SURPRISAL (SELF-INFORMATION) $$ I(x) \;=\; \log_2 \frac{1}{p(x)} \;=\; -\log_2 p(x) \qquad [\text{bits}] $$ The surprisal of an outcome is how many bits it would take to encode it optimally. A coin flip (\(p = \tfrac12\)) costs \(1\) bit; a one-in-a-million event costs \(\approx 20\) bits; a certainty costs \(0\).
Independence forces additivity, and \(\log(ab) = \log a + \log b\) is the only function that turns the product of independent probabilities into a sum of surprises. Switch the log base to change the unit: base 2 → bits, base \(e\) → nats, base 10 → bans. Entropy is the average surprisal — the expected number of bits per outcome when the source emits symbols according to distribution \(p\). It is the single number that says how uncertain, how unpredictable, how compressible a source is. EQ S8.2 — SHANNON ENTROPY $$ H(p) \;=\; \mathbb{E}_{x \sim p}\big[\,I(x)\,\big] \;=\; -\sum_{x} p(x)\,\log_2 p(x) \qquad [\text{bits}] $$ By convention \(0 \log 0 = 0\) (an impossible symbol contributes nothing). Entropy is maximized by the uniform distribution — when every outcome is equally likely you cannot do better than guessing, so uncertainty is highest — and minimized (zero) by a point mass, where one outcome is certain. For \(K\) equally likely symbols, \(H = \log_2 K\): a fair die is \(\log_2 6 \approx 2.585\) bits, a fair coin exactly \(1\). WORKED EXAMPLE ▾ 01 Fair coin, \(p = (\tfrac12, \tfrac12)\): \(H = -\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12 = \tfrac12 + \tfrac12 = 1\) bit. Maximal for two outcomes. 02 Biased coin, \(p = (0.9, 0.1)\): \(H = -0.9\log_2 0.9 - 0.1\log_2 0.1 = 0.137 + 0.332 = 0.469\) bits. Knowing it usually lands heads removes more than half the uncertainty. 03 Certain coin, \(p = (1, 0)\): \(H = -1\log_2 1 - 0 = 0\) bits. No surprise, nothing to encode. RESULT: H sweeps 1.00 → 0.469 → 0 as the coin goes from fair to certain The two-outcome case has a name — the binary entropy function \(H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)\) — and a famous shape: a smooth arch peaking at exactly \(1\) bit when \(p = \tfrac12\) and collapsing to \(0\) at both ends. That arch is the first thing to internalize, because the curve of a training loss is the same idea wearing different clothes. 
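The surprisal arithmetic above is easy to check numerically. The helper `surprisal` below is a hypothetical name introduced for this sketch; it evaluates EQ S8.1 in an arbitrary base, confirms that independent events add, and recovers the biased-coin entropy as an average of surprisals.

```python
import math

def surprisal(p, base=2.0):
    """Surprisal -log_base(p) of an outcome with probability p (EQ S8.1)."""
    return -math.log(p) / math.log(base)

coin_bits = surprisal(0.5)                   # one fair coin flip
rare_bits = surprisal(1e-6)                  # a one-in-a-million event
two_heads = surprisal(0.5 * 0.5)             # two independent fair flips
coin_nats = surprisal(0.5, base=math.e)      # same flip, measured in nats

# Entropy is the probability-weighted average surprisal (EQ S8.2).
biased_entropy = 0.9 * surprisal(0.9) + 0.1 * surprisal(0.1)

print("fair coin:", coin_bits, "bits =", round(coin_nats, 4), "nats")
print("one in a million:", round(rare_bits, 2), "bits")
print("two heads:", two_heads, "bits (sum of two one-bit surprises)")
print("biased coin entropy:", round(biased_entropy, 3), "bits")
```

Changing `base` only rescales the answer by a constant, which is why bits, nats, and bans are interchangeable units rather than different quantities.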
What is the Shannon entropy of a fair coin — outcomes \( \{H, T\} \) each with probability \( \tfrac12 \) — measured in bits ? \( H = -\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12 = -\tfrac12(-1) - \tfrac12(-1) = \tfrac12 + \tfrac12 = \) 1 bit. This is the definition of the unit: one fair binary choice is exactly one bit of information. A source emits one of 4 equally likely symbols. What is its entropy in bits? (\( H = \log_2 K \).) \( H = \log_2 4 = \) 2 bits. Four equiprobable outcomes need two yes/no questions to pin down — and no coding scheme can average fewer than two bits per symbol. PYTHON · RUNNABLE IN-BROWSER # EQ S8.2: entropy of a distribution + the binary-entropy arch import numpy as np def entropy_bits(p): p = np.asarray(p, float) p = p[p > 0] # 0*log0 = 0 by convention return float(-(p * np.log2(p)).sum()) print("fair coin H =", round(entropy_bits([0.5, 0.5]), 4), "bits") print("biased coin H =", round(entropy_bits([0.9, 0.1]), 4), "bits") print("fair die H =", round(entropy_bits([1/6]*6), 4), "bits (= log2 6)") print("loaded die H =", round(entropy_bits([0.5,0.1,0.1,0.1,0.1,0.1]), 4), "bits") # the binary entropy function H_b(p): an arch peaking at p = 0.5 ps = np.linspace(0.001, 0.999, 200) Hb = -ps*np.log2(ps) - (1-ps)*np.log2(1-ps) print("\npeak of H_b at p =", round(float(ps[Hb.argmax()]), 3), "-> H =", round(float(Hb.max()), 4), "bit") plot_xy(ps, Hb) RUN ▶ edits are live — break it on purpose INSTRUMENT S8.1 — ENTROPY EXPLORER DRAG THE BARS · H PEAKS AT UNIFORM PROBABILITY OVER 5 SYMBOLS — DRAG A BAR (RENORMALIZES TO SUM 1) PRESETS UNIFORM SKEWED NEAR-CERTAIN ENTROPY H — MAX POSSIBLE (log₂ 5) — FRACTION OF MAX — Drag any bar up or down — the rest rescale so the distribution always sums to one. Watch the entropy readout: it is highest when all five bars are level (the uniform, \(\log_2 5 = 2.322\) bits) and falls toward zero as you pile all the mass onto one symbol. 
The mint guideline marks the uniform height; pull a bar above it and entropy drops, because concentration is the opposite of surprise. 8.2 Cross-entropy & KL divergence Entropy assumes you know the true distribution \(p\). But a model only ever has an estimate \(q\). Cross-entropy asks the practical question: if reality is \(p\) but you encode it using a code built for \(q\), how many bits per symbol do you actually pay? EQ S8.3 — CROSS-ENTROPY $$ H(p, q) \;=\; -\sum_{x} p(x)\,\log_2 q(x) \qquad [\text{bits}] $$ Outcomes still happen with the true frequency \(p(x)\), but each is charged at the wrong codeword length \(-\log_2 q(x)\). If your model is right (\(q = p\)), cross-entropy collapses to entropy, \(H(p,p) = H(p)\) — you cannot beat the source's own entropy. If your model is wrong, you pay strictly more. That excess is the entire point of the next equation. The gap between paying \(H(p,q)\) and the irreducible floor \(H(p)\) is the Kullback–Leibler divergence — the number of wasted bits caused by believing \(q\) when the truth is \(p\). EQ S8.4 — KL DIVERGENCE (RELATIVE ENTROPY) $$ D_{\mathrm{KL}}(p \,\Vert\, q) \;=\; \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)} \;=\; H(p, q) - H(p) \;\ge\; 0 $$ The decomposition \(H(p,q) = H(p) + D_{\mathrm{KL}}(p \Vert q)\) is the load-bearing identity of this chapter: the cross-entropy you minimize in training is the irreducible entropy of the data plus the divergence of your model from the truth. Since \(H(p)\) is a constant you cannot change, minimizing cross-entropy is exactly minimizing KL divergence. By Gibbs' inequality \(D_{\mathrm{KL}} \ge 0\), with equality iff \(q = p\) everywhere. WORKED EXAMPLE ▾ 01 Truth \(p = (0.7, 0.2, 0.1)\), model \(q = (0.2, 0.5, 0.3)\). First the entropy floor: \(H(p) = -0.7\log_2 0.7 - 0.2\log_2 0.2 - 0.1\log_2 0.1 = 1.157\) bits. 02 Cross-entropy: \(H(p,q) = -0.7\log_2 0.2 - 0.2\log_2 0.5 - 0.1\log_2 0.3 = 1.625 + 0.200 + 0.174 = 1.999\) bits. 
03 The waste: \(D_{\mathrm{KL}}(p\Vert q) = H(p,q) - H(p) = 1.999 - 1.157 = 0.842\) bits — the price of using the wrong code. 04 Reverse it: \(D_{\mathrm{KL}}(q\Vert p) = 0.775\) bits. Different number — KL is not symmetric, and is not a distance. RESULT: KL(p‖q) = 0.842 bits ≠ KL(q‖p) = 0.775 bits KL is not a metric. It is non-negative and zero only when the distributions match, but it is asymmetric — \(D_{\mathrm{KL}}(p\Vert q) \ne D_{\mathrm{KL}}(q\Vert p)\) in general — and it violates the triangle inequality. The asymmetry is not a flaw; it encodes a real modelling choice. Forward KL \(D_{\mathrm{KL}}(p\Vert q)\), the form inside maximum-likelihood training, is mass-covering: it punishes \(q\) heavily for assigning near-zero probability anywhere \(p\) has mass, so it spreads \(q\) to cover every mode. Reverse KL \(D_{\mathrm{KL}}(q\Vert p)\), the form inside variational inference (the ELBO, §8.5), is mode-seeking: it lets \(q\) ignore parts of \(p\) and lock onto a single mode. Which way you write the bars decides whether your model hedges or commits. What is \( D_{\mathrm{KL}}(p \,\Vert\, p) \) — the KL divergence of any distribution from itself ? Each term is \( p(x)\log_2\dfrac{p(x)}{p(x)} = p(x)\log_2 1 = p(x)\cdot 0 = 0 \), so the sum is 0. A perfect model wastes no bits — this is the floor every cross-entropy loss is descending toward, and the equality case of Gibbs' inequality. 
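The mass-covering versus mode-seeking contrast above can be checked numerically. The sketch below keeps the chapter's truth \(p = (0.7, 0.2, 0.1)\) and squeezes the model's third probability toward zero; the helper name `kl_bits` and the sweep values are illustrative assumptions, not part of the chapter's own cells.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits; terms with p(x) = 0 contribute 0 by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = p > 0
    return float((p[keep] * np.log2(p[keep] / q[keep])).sum())

p = np.array([0.7, 0.2, 0.1])            # the chapter's fixed truth

results = []
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:     # illustrative sweep, eps = q's third entry
    q = np.array([0.8 - eps, 0.2, eps])  # the model starves the third symbol
    fwd = kl_bits(p, q)                  # forward KL: truth first
    rev = kl_bits(q, p)                  # reverse KL: model first
    results.append((eps, fwd, rev))
    print(f"eps = {eps:.0e}   KL(p||q) = {fwd:7.3f} bits   KL(q||p) = {rev:.3f} bits")
```

At `eps = 0.1` the model equals the truth and both divergences vanish. As `eps` shrinks, the forward term \(0.1\log_2(0.1/\varepsilon)\) grows without bound, while the reverse term \(\varepsilon\log_2(\varepsilon/0.1)\) fades to zero, so reverse KL stays bounded.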
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.3 / S8.4: entropy, cross-entropy, KL -- and KL's asymmetry
# (these helpers assume every probability is strictly positive)
import numpy as np

def entropy(p):
    return float(-(p * np.log2(p)).sum())

def cross_entropy(p, q):
    return float(-(p * np.log2(q)).sum())

def kl(p, q):
    return float((p * np.log2(p/q)).sum())

p = np.array([0.7, 0.2, 0.1])   # the truth
q = np.array([0.2, 0.5, 0.3])   # a wrong model
H = entropy(p)
Hpq = cross_entropy(p, q)
print(f"H(p) = {H:.4f} bits (irreducible floor)")
print(f"H(p, q) = {Hpq:.4f} bits (what you actually pay)")
print(f"KL(p || q) = {kl(p, q):.4f} bits (wasted bits)")
print(f"identity check: H(p)+KL = {H + kl(p,q):.4f} == H(p,q)?", np.isclose(H + kl(p,q), Hpq))
print(f"\nKL(p || q) = {kl(p, q):.4f}")
print(f"KL(q || p) = {kl(q, p):.4f}")

INSTRUMENT S8.2 — KL ASYMMETRY VISUALIZER
3 SYMBOLS · KL(P‖Q) vs KL(Q‖P)
Sliders: q₁ (0.20) · q₂ (0.50). Readouts: KL(P‖Q) forward · KL(Q‖P) reverse · asymmetry gap.
The fixed truth is P = (0.7, 0.2, 0.1) (mint bars); drag the sliders to reshape your model Q (blue bars; the third bar fills the remainder). Forward and reverse KL are almost never equal — push a Q-bar toward zero where P has real mass and forward KL explodes (mass-covering punishes it), while reverse KL stays mild. Both readouts hit 0 only when the blue bars exactly overlay the mint ones.

8.3 Mutual information — shared surprise
So far, one variable. Mutual information asks how much knowing one random variable tells you about another: how many bits of \(Y\)'s uncertainty vanish once you observe \(X\). It is the KL divergence between the joint distribution and the product of the marginals — i.e. how far \(X\) and \(Y\) are from being independent.
EQ S8.5 — MUTUAL INFORMATION $$ I(X; Y) \;=\; \sum_{x, y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)} \;=\; H(Y) - H(Y \mid X) \;=\; D_{\mathrm{KL}}\big(p(x,y) \,\Vert\, p(x)p(y)\big) $$ The middle form is the most intuitive: \(H(Y)\) is your uncertainty about \(Y\) before, \(H(Y\mid X)\) is what remains after seeing \(X\), and the drop is the information \(X\) carried. \(I(X;Y) \ge 0\), and \(I(X;Y) = 0\) iff \(X\) and \(Y\) are independent (the joint factorizes, the KL vanishes). Unlike KL, mutual information is symmetric: \(I(X;Y) = I(Y;X)\). It captures arbitrary nonlinear dependence — where correlation sees only straight lines. Mutual information is the quiet workhorse behind a surprising amount of machine learning. Decision trees split on the feature with the highest information gain — mutual information between a feature and the label. Feature-selection ranks inputs by \(I(\text{feature}; \text{target})\). The information bottleneck frames representation learning as compressing \(X\) into \(Z\) while preserving \(I(Z; Y)\); InfoNCE and contrastive objectives are lower bounds on mutual information between views of the same datum. Wherever the question is "how related are these, beyond linear correlation?", mutual information is the honest answer — though estimating it from samples in high dimensions is notoriously hard and an active research area. KEY Correlation sees lines; mutual information sees structure. Two variables related by \(Y = X^2\) with \(X\) symmetric about zero have correlation exactly zero — yet \(X\) determines \(Y\) completely, so their mutual information is large. Any time you reach for "are these independent?", the bit-accurate test is \(I(X;Y) = 0\), not \(\rho = 0\). 
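The \(Y = X^2\) claim in the KEY note can be verified on a toy case. The sketch below assumes a hypothetical three-point variable, \(X\) uniform on \(\{-1, 0, 1\}\), builds the joint table by hand, and computes both the Pearson correlation and \(I(X;Y)\) from it; the specific support and variable names are illustrative choices.

```python
import numpy as np

# Hypothetical toy: X uniform on {-1, 0, 1}, Y = X**2, so Y takes values {0, 1}.
xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([0.0, 1.0])
joint = np.zeros((3, 2))
for i, x in enumerate(xs):
    joint[i, int(x**2)] = 1/3            # all mass of row x sits on y = x^2

px, py = joint.sum(1), joint.sum(0)      # marginals of X and Y

# Pearson correlation computed from the joint table
ex, ey = (px * xs).sum(), (py * ys).sum()
cov = (joint * np.outer(xs - ex, ys - ey)).sum()
var_x = (px * (xs - ex)**2).sum()
var_y = (py * (ys - ey)**2).sum()
rho = cov / np.sqrt(var_x * var_y)

# Mutual information: KL between the joint and the product of marginals
prod = np.outer(px, py)
nz = joint > 0
mi = float((joint[nz] * np.log2(joint[nz] / prod[nz])).sum())

print(f"correlation rho = {rho:.4f}")
print(f"I(X;Y)          = {mi:.4f} bits")
```

The correlation is zero up to rounding, yet the mutual information equals \(H(Y)\), about 0.918 bits, the largest value it can take here because \(X\) determines \(Y\) completely.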
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.5: mutual information from a joint table, three identities agree
import numpy as np

# joint p(x,y) over a 2x2 grid -- correlated, not independent
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])
px = P.sum(1, keepdims=True)    # marginal of X
py = P.sum(0, keepdims=True)    # marginal of Y
mask = P > 0
I = float((P[mask] * np.log2(P[mask] / (px @ py)[mask])).sum())

def H(p):                       # entropy of a flat distribution
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

HY = H(py.ravel())
HY_X = float(-(P[mask] * np.log2((P / px)[mask])).sum())    # H(Y|X)
print(f"I(X;Y) via KL of joint vs product: {I:.4f} bits")
print(f"I(X;Y) via H(Y) - H(Y|X): {HY - HY_X:.4f} bits")
print(f"H(Y) = {HY:.3f}, H(Y|X) = {HY_X:.3f} -> X removes that gap from Y")

indep = px @ py                 # what independence would look like
print("\nif X,Y were independent, I would be:",
      round(float((indep[indep>0]*np.log2((indep/(px@py))[indep>0])).sum()), 4))

8.4 Source coding — entropy as a compression limit
Entropy is not just a measure of uncertainty; it is a hard physical bound. Shannon's source coding theorem says: to encode symbols from a source with entropy \(H\) into bits without loss, you need on average at least \(H\) bits per symbol, and you can get arbitrarily close to \(H\) with a clever enough code. No lossless compressor — not ZIP, not a neural one — can beat the entropy of the source it is fed. Entropy is the compression limit.

EQ S8.6 — SOURCE CODING THEOREM (BOUNDS)
$$ H(p) \;\le\; L^{*} \;<\; H(p) + 1, \qquad L^{*} \;=\; \sum_{x} p(x)\,\ell(x) $$
\(L^{*}\) is the expected codeword length of the best possible prefix-free code; \(\ell(x)\) is the length assigned to symbol \(x\). The optimal length is \(\ell(x) = -\log_2 p(x)\) — the surprisal of EQ S8.1 — so common symbols get short codes and rare symbols get long ones.
The "\(+1\)" slack is the integer rounding penalty (you cannot use \(2.3\) bits for one symbol); coding many symbols at once, or arithmetic coding, drives the average down to \(H\) itself. Huffman coding is the classic constructive proof: repeatedly merge the two least-probable symbols into a subtree, and the resulting prefix code is provably optimal among integer-length codes. When all probabilities are negative powers of two — a dyadic distribution — Huffman hits the entropy bound exactly, with zero slack. A source emits four symbols with probabilities \( (\tfrac12, \tfrac14, \tfrac18, \tfrac18) \). What is its entropy — and the expected length of the optimal Huffman code — in bits per symbol ? Surprisals: \(-\log_2\tfrac12 = 1\), \(-\log_2\tfrac14 = 2\), \(-\log_2\tfrac18 = 3\), \(-\log_2\tfrac18 = 3\). So \( H = \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = 0.5 + 0.5 + 0.375 + 0.375 = \) 1.75 bits. Because every probability is a power of two (dyadic), Huffman assigns lengths \(1,2,3,3\) and achieves this entropy exactly — no rounding waste. 
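The integer-rounding penalty can be seen without building a Huffman tree. The sketch below swaps in a simpler technique, Shannon code lengths \(\lceil -\log_2 p(x) \rceil\), which are generally not optimal but are enough to land inside the \(H \le L < H + 1\) window. The Kraft sum \(\sum_x 2^{-\ell(x)} \le 1\) is an added check (not stated in the text above) that a prefix code with those lengths exists; the function names and the second source are illustrative.

```python
import math

def shannon_lengths(p):
    """Integer codeword lengths ceil(-log2 p(x)): the rounded-up surprisal."""
    return [math.ceil(-math.log2(pi)) for pi in p]

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

sources = {
    "dyadic": [0.5, 0.25, 0.125, 0.125],   # the quiz source above
    "skewed": [0.6, 0.2, 0.1, 0.1],        # illustrative non-dyadic source
}
report = {}
for name, p in sources.items():
    lengths = shannon_lengths(p)
    H = entropy_bits(p)
    L = sum(pi * li for pi, li in zip(p, lengths))   # expected code length
    kraft = sum(2.0 ** -li for li in lengths)        # <= 1: a prefix code exists
    report[name] = (lengths, H, L, kraft)
    print(f"{name}: lengths = {lengths}  H = {H:.4f}  L = {L:.4f}  "
          f"slack = {L - H:.4f}  Kraft sum = {kraft:.3f}")
```

On the dyadic source the surprisals are already whole numbers, so rounding costs nothing and \(L = H = 1.75\). On the skewed source rounding up wastes part of a bit per symbol; an optimal Huffman code can only do as well or better, but both stay under \(H + 1\).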
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.6: build a Huffman code, compare its length to the entropy bound
import numpy as np, heapq

def huffman_lengths(p):
    # heap of (prob, tie, node); node is leaf-id or (left,right)
    h = [(pi, i, i) for i, pi in enumerate(p)]
    heapq.heapify(h); nxt = len(p)
    while len(h) > 1:
        a = heapq.heappop(h); b = heapq.heappop(h)
        heapq.heappush(h, (a[0]+b[0], nxt, (a[2], b[2]))); nxt += 1
    lengths = {}
    def walk(node, d):
        if isinstance(node, tuple):
            walk(node[0], d+1); walk(node[1], d+1)
        else:
            lengths[node] = max(d, 1)
    walk(h[0][2], 0)
    return [lengths[i] for i in range(len(p))]

for name, p in [("dyadic ", [0.5, 0.25, 0.125, 0.125]),
                ("uniform", [0.25]*4),
                ("skewed ", [0.6, 0.2, 0.1, 0.1])]:
    p = np.array(p)
    H = float(-(p*np.log2(p)).sum())
    L = float((p * np.array(huffman_lengths(p))).sum())
    print(f"{name}: H = {H:.4f} Huffman L = {L:.4f} "
          f"slack = {L-H:+.4f} (theorem: 0 <= slack < 1)")

INSTRUMENT S8.3 — CODING-LENGTH / HUFFMAN DEMO
4 SYMBOLS · L vs ENTROPY BOUND · EQ S8.6
Source presets: dyadic (½ ¼ ⅛ ⅛) · uniform · skewed. Readouts: entropy H (floor) · Huffman length L · slack L − H.
Each row shows a symbol, its probability, the Huffman codeword built by merging the two rarest symbols, and that codeword's length. The DYADIC preset hits zero slack — \(L = H = 1.75\) bits — because every probability is a power of two and the surprisal \(-\log_2 p\) is already a whole number of bits. UNIFORM over four symbols also lands exactly on \(H = 2\); the SKEWED source pays a small rounding penalty (slack between 0 and 1), exactly as EQ S8.6 promises.

8.5 The bridge to ML — cross-entropy loss, perplexity, the ELBO
Here is the payoff. A classifier outputs a predicted distribution \(q\) over labels; the true label is a one-hot distribution \(p\) (all mass on the correct class \(c\)).
Plug into cross-entropy, EQ S8.3: EQ S8.7 — CROSS-ENTROPY LOSS = NEGATIVE LOG-LIKELIHOOD $$ \mathcal{L} \;=\; H(p, q) \;=\; -\sum_{k} p_k \log q_k \;=\; -\log q_c \qquad (p \text{ one-hot at the true class } c) $$ Because \(p\) is one-hot, every term vanishes except \(k = c\), and the loss reduces to the negative log-probability the model assigned to the correct answer — the negative log-likelihood (NLL). Minimizing it over a dataset is maximum-likelihood estimation; via EQ S8.4 it is also minimizing \(D_{\mathrm{KL}}(p \Vert q)\), pushing the model's distribution toward the data's. The softmax that produces \(q\) and this cross-entropy are paired precisely because the softmax's gradient through the loss is the clean \(q - p\) (see Vol I · EQ M2.3 and Vol II · EQ 4.1). Every neural classifier and every language model is trained by descending this single Shannon quantity. WORKED EXAMPLE ▾ 01 Logits \(z = (2.0,\ 1.0,\ 0.1)\), true class \(c = 0\). Softmax: \(e^{2}, e^{1}, e^{0.1} = 7.39, 2.72, 1.11\); sum \(= 11.21\). 02 Predicted probabilities \(q = (0.659,\ 0.242,\ 0.099)\). The model gives the right class \(0.659\). 03 Cross-entropy loss \(= -\log q_0 = -\log(0.659) = 0.417\) nats. (In bits, \(-\log_2 0.659 = 0.601\).) 04 A confident-correct model (\(q_0 \to 1\)) drives the loss to \(0\); a confident-wrong one (\(q_0 \to 0\)) sends it to \(+\infty\). That unbounded penalty for confident mistakes is why cross-entropy trains so well. RESULT: loss = −log(0.659) = 0.417 nats = 0.601 bits For language models, the same loss wears a friendlier name. The geometric-mean per-token uncertainty is perplexity — the exponential of the cross-entropy — interpreted as the effective number of equally likely choices the model faces at each step. 
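The worked example above, the \(q - p\) gradient, and the link from loss to perplexity can all be reproduced in a few lines. This is a minimal sketch: the helper names `softmax` and `nll`, the finite-difference step, and the single-token perplexity are illustrative choices rather than the chapter's own cell.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def nll(z, c):
    """Cross-entropy against a one-hot target at class c, in nats."""
    return float(-np.log(softmax(z)[c]))

z = np.array([2.0, 1.0, 0.1])            # logits from the worked example
c = 0                                    # true class
q = softmax(z)
loss_nats = nll(z, c)
loss_bits = loss_nats / np.log(2)        # convert nats to bits

# Gradient of the loss w.r.t. the logits: analytic q - p vs central differences
p = np.eye(len(z))[c]                    # one-hot target
analytic = q - p
h = 1e-6                                 # illustrative step size
basis = np.eye(len(z))
numeric = np.array([(nll(z + h*basis[k], c) - nll(z - h*basis[k], c)) / (2*h)
                    for k in range(len(z))])

ppl = float(np.exp(loss_nats))           # perplexity of this single prediction
print("q =", np.round(q, 3))
print(f"loss = {loss_nats:.3f} nats = {loss_bits:.3f} bits")
print("grad (analytic q - p):", np.round(analytic, 4))
print("grad (numeric):       ", np.round(numeric, 4))
print(f"one-token perplexity = {ppl:.3f}")
```

For a single prediction the exponential of the loss is simply \(1/q_c\), so a model that gives the right class probability 0.659 behaves as if it were choosing among roughly 1.5 equally likely options.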
EQ S8.8 — PERPLEXITY
$$ \mathrm{PPL} \;=\; b^{\,H(p, q)} \;=\; \exp\!\Big(\!-\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \log q(x_i \mid x_{<i})\Big) $$

[…]

[…] 0.5
    return (pred == y[te]).mean()

print(f"honest features only: {fit_score([honest]):.3f} test accuracy")
print(f"+ leaked 'future' feature: {fit_score([honest, leak]):.3f} test accuracy")
print("\nThe leak looks like a miracle feature -- because it IS the answer.")
print("On real holdout where 'leak' is unavailable, that gain evaporates.")

PYTHON · RUNNABLE IN-BROWSER
# Scaling-before-split leakage: same model, two preprocessing orders.
# Manual 5-fold CV; the only difference is WHERE the scaler is fit.
import numpy as np
rng = np.random.default_rng(1)
N, d = 400, 8
X = rng.normal(0, 1, (N, d))
y = (X[:, 0] + 0.5*rng.normal(0, 1, N) > 0).astype(float)   # weak signal

def cv(leaky):
    folds = np.array_split(rng.permutation(N), 5); accs = []
    for k in range(5):
        te = folds[k]; tr = np.concatenate([folds[j] for j in range(5) if j != k])
        rows = slice(None) if leaky else tr      # WRONG vs RIGHT: which rows scale?
        mu, sd = X[rows].mean(0), X[rows].std(0) + 1e-9
        Xb = np.column_stack([(X - mu) / sd, np.ones(N)])
        w = np.zeros(d + 1)
        for _ in range(300):
            p = 1/(1+np.exp(-Xb[tr] @ w)); w -= 0.1*Xb[tr].T @ (p - y[tr])/len(tr)
        accs.append(((1/(1+np.exp(-Xb[te] @ w)) > 0.5) == y[te]).mean())
    return np.mean(accs)

print(f"scaler fit on FULL data (leaky): CV acc {cv(True):.3f}")
print(f"scaler fit INSIDE each fold: CV acc {cv(False):.3f}")
print("\nThe gap is the leak. It is tiny per feature and pure illusion;")
print("a Pipeline re-fits the scaler per fold so the honest number is all you see.")

INSTRUMENT D1.1 — LEAKAGE DEMONSTRATOR
VALIDATION (LEAKY) vs TRUE HOLDOUT · EQ D1.3
Controls: leaky feature (off / on) · leak strength (0.90). Readouts: validation accuracy (reported) · true holdout accuracy · illusion (the drop).
With the leak OFF, both bars sit at the model's honest skill (~0.82).
Turn the leak ON and the validation bar climbs toward 1.0 as you raise leak strength — because the validation rows share the leaked feature — while the true holdout bar barely moves, since the leaked column is unavailable at real prediction time (EQ D1.3). The gap between the bars is the size of the lie you would have shipped. 1.4 Sampling, representativeness & distribution shift A split keeps the test data unseen, but it does not guarantee the data resembles the world the model will face. The whole evaluation rests on one assumption — that training, test, and deployment data are drawn from the same distribution. When that fails, even a flawless split measures the wrong thing. EQ D1.4 — THE i.i.d. ASSUMPTION (AND ITS FAILURE) $$ \text{evaluation is valid} \iff p_{\text{train}}(x, y) \approx p_{\text{test}}(x, y) \approx p_{\text{deploy}}(x, y) $$ "i.i.d." = independent and identically distributed. Covariate shift moves \(p(x)\) (the input mix changes — new users, new regions); label shift moves \(p(y)\) (the base rate changes — fraud surges); concept drift moves \(p(y\mid x)\) (the rule itself changes — last year's spam looks innocuous today). A random split hides all three, because it makes train and test identical by construction while deployment quietly diverges. The remedy depends on the structure of the data. When time matters — anything forecasting, anything where today's model predicts tomorrow — a random split is a lie, because it lets the model train on the future and test on the past. The honest protocol is a time-based split: train on the past, validate and test on strictly later periods, exactly as deployment will run. When records cluster by entity, use a grouped split so no entity straddles the boundary (§1.3). Sometimes you need both at once. Sampling bias is upstream of all of this. 
If the data was collected in a way that over- or under-represents part of the world — survivorship bias, self-selection, a sensor that only logged failures — no split or model can recover what was never sampled. Stratified sampling (preserving class or subgroup proportions in every split) protects measurement when classes are imbalanced, but it cannot conjure a population that was never observed. The cheapest fix to a representativeness problem is almost always collecting better data, not a cleverer estimator.

INSTRUMENT D1.2 — SPLIT VISUALIZER
RANDOM · TIME-BASED · GROUPED
Split strategy: random · time-based · grouped. Readouts: strategy · entities spanning the split · future→past order violations.
Each cell is one row, ordered left-to-right by time, its letter the entity it belongs to. RANDOM scatters train (mint) and test (grey) freely — and the readouts flag both group leakage (an entity in both sets) and time violations (test rows earlier than train rows). TIME-BASED puts every test row strictly after every train row: zero order violations. GROUPED keeps each lettered entity wholly on one side: zero spanning entities. Notice no single strategy zeroes out every risk — that is the real lesson.

1.5 Building the modeling dataset — a protocol
Pulling the pieces together, here is the order of operations that keeps every later number honest. The sequence matters more than any single step: most leakage is an ordering bug, a transformation that happened one line too early.

# A leakage-safe pipeline. The ORDER is the point, not any one line.
1  define:         the prediction target y AND the exact moment t_pred it is made
2  audit:          every feature against EQ D1.3 — knowable at t_pred? drop if not
3  dedup:          remove duplicate / near-duplicate rows BEFORE splitting
4  split:          choose random / time-based / grouped to match the real task
                   (split FIRST — everything below sees only its own partition)
5  fit prep:       fit scalers / imputers / encoders on TRAIN only
6  transform:      apply TRAIN-fitted statistics to val and test
7  decontaminate:  check no train row (or its duplicate) is in val/test
8  evaluate:       tune on val; touch test ONCE; report with EQ D1.2 error bars

Two habits make this durable. First, wrap steps 5–6 in a single object — an sklearn Pipeline or its equivalent — so the preprocessing is re-fit automatically inside every cross-validation fold and can never accidentally span the split. Second, treat decontamination as a first-class step: hash your rows and confirm no training example (or a trivial variant of one) appears in validation or test. This is the same discipline that fine-tuning a language model demands against its eval sets (Vol II · CH 06), and the same arithmetic that information theory gives the loss it minimizes (STATS · 08).

Take the same 70 / 15 / 15 split of the 1000-row dataset from §1.2. You split first, as you should, and will fit your scaler on training data only. How many rows is that scaler fit on, i.e. how many training rows are there?
The training fraction is \(70\% = 0.70\), so \(N_{\text{train}} = 0.70 \times 1000 = \) 700 rows. The scaler's mean and variance are computed from these 700 rows alone (step 5), then applied unchanged to the 150 validation and 150 test rows (step 6) — never the reverse.

NEXT
This chapter assumed your rows were at least present. They rarely are. The most common quality defect — the empty cell — turns out to carry information of its own: why a value is missing often predicts the value itself, and the wrong imputation quietly biases everything downstream. Next: Data · 02 — Missing Data, where we make the absence itself a feature.

1.R References
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012).
Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM TKDD 6(4) — the field's working definition and taxonomy of leakage (§1.3, EQ D1.3).
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; cross-validation, the train/test contract, and the right vs wrong way to cross-validate (§1.2, §1.5).
Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks — measured label-error rates in ImageNet and nine other canonical test sets (§1.1).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366(6464) — a real-world label / proxy that leaked the wrong target into a deployed model (§1.1, §1.4).
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning on the test set inflates results, and nested cross-validation as the fix (§1.2).
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press — covariate shift, label shift, and concept drift formalized (§1.4, EQ D1.4).

## DATA · Missing Data & Imputation (https://ai-encyclopedia.com/data/02-missing-data.html)
DATA & FEATURE ENGINEERING · CHAPTER 02 / 05
Missing Data & Imputation
Real datasets arrive with holes, and the holes are rarely random. How a value went missing constrains how you may fill it, and naive mean-imputation degrades the relationships a model depends on.
This chapter starts with Rubin's three missingness mechanisms, then works through the fixes: simple fills, kNN, multiple imputation by chained equations (MICE), and model-based strategies, noting where each one fails. LEVEL CORE READING TIME ≈ 22 MIN BUILDS ON DATA 01 INSTRUMENTS IMPUTATION COMPARATOR · MECHANISM TOY · VARIANCE SHRINKAGE IN THIS CHAPTER 2.1 Missingness mechanisms 2.2 Simple imputation 2.3 kNN imputation 2.4 MICE 2.5 Choosing in practice 2.R References 2.1 Missingness mechanisms: MCAR, MAR, MNAR Before you fill a single cell, ask why it is empty. Donald Rubin's 1976 framework — still the field's bedrock — sorts the reason into three mechanisms by asking what the probability of a value being missing depends on. Write \(R\) for the missingness indicator (\(R=1\) if a cell is observed, \(0\) if missing), \(X_{\text{obs}}\) for the data you can see, and \(X_{\text{mis}}\) for the values that are hidden. EQ D2.1 — THE THREE MECHANISMS $$ \begin{aligned} \textbf{MCAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R) \\ \textbf{MAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R \mid X_{\text{obs}}) \\ \textbf{MNAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) \text{ depends on } X_{\text{mis}} \end{aligned} $$ MCAR (missing completely at random): the holes are pure coincidence — a dropped sensor reading, a corrupted row. MAR (missing at random): the chance of missingness depends only on things you did observe — older respondents skip the income question, but you recorded age. MNAR (missing not at random): missingness depends on the hidden value itself — high earners refuse to state their income because it is high. The names are notoriously misleading: MAR is not "random", it is "explainable by observed data". 
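The three mechanisms are easiest to tell apart by simulating them. The sketch below reuses the age and income story from the caption above on synthetic data: the sample size, the slope linking income to age, and the sigmoid missingness curves are all hypothetical choices made for illustration. It reports how far the mean of the observed incomes drifts from the full-data mean under each mechanism, then shows one way conditioning on an observed predictor can repair the MAR case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
age = rng.normal(0, 1, n)                  # always observed (standardised)
income = age + rng.normal(0, 1, n)         # hypothetical link: income rises with age

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Probability that income is MISSING in each row, under each mechanism
p_missing = {
    "MCAR": np.full(n, 0.35),              # depends on nothing
    "MAR":  sigmoid(2 * age),              # depends only on observed age
    "MNAR": sigmoid(2 * income),           # depends on the hidden value itself
}

bias = {}
for name, pm in p_missing.items():
    R = rng.random(n) >= pm                # R = True means observed, as in EQ D2.1
    bias[name] = income[R].mean() - income.mean()
    print(f"{name}: {R.mean():.0%} observed, complete-case bias = {bias[name]:+.3f}")

# Under MAR, condition on age: fit income ~ age on the observed rows only,
# then average the fitted values over every row.
R = rng.random(n) >= p_missing["MAR"]
slope, intercept = np.polyfit(age[R], income[R], 1)
mar_adjusted = (slope * age + intercept).mean() - income.mean()
print(f"MAR after regressing on age: bias = {mar_adjusted:+.3f}")
```

The regression step works only because missingness here depends on age alone. Applying the same step to the MNAR rows would not remove the bias, since the rows that vanish differ in income even among people of the same age.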
The distinction is not academic — it dictates what is recoverable:

| Mechanism | Depends on | Complete-case analysis | Imputation |
|---|---|---|---|
| MCAR | nothing | Unbiased (just less efficient) | Optional; any sensible fill is safe |
| MAR | observed \(X_{\text{obs}}\) | Biased in general | Recoverable — condition on the observed predictors |
| MNAR | unseen \(X_{\text{mis}}\) | Biased | Not fixable from the data alone — needs a model of the missingness |

You cannot test MAR vs MNAR from the data. The only difference between them lives in the values you never saw, so no statistic computed on the observed data can distinguish them — this is the contested, uncomfortable heart of the field. In practice you assume MAR (it makes the math tractable and the assumption is often defensible once you condition on enough covariates), then probe sensitivity to MNAR with explicit what-if models. Honesty about this assumption is the difference between an imputation that helps and one that launders bias into a clean-looking table.

Under which missingness mechanism is a complete-case analysis (simply dropping rows with any missing value) guaranteed unbiased — losing only efficiency, not correctness? Answer with the acronym.
Only when missingness is independent of everything — observed and unobserved — are the complete cases a representative subsample of the full data. That is the definition of missing completely at random: MCAR. Under MAR or MNAR the surviving rows are a skewed slice, and analysing only those rows biases estimates.

INSTRUMENT D2.1 — MECHANISM TOY
MEAN-IMPUTATION BIAS BY MECHANISM
Mechanism: MCAR · MAR · MNAR. Missing fraction: 35%. Readouts: true mean · observed mean (fill) · bias.
Each column is a value of \(Y\); grey dots are hidden, mint dots observed. Mean-imputation fills holes at the observed mean (mint line) and reports it as the estimate. Under MCAR the observed mean tracks the true mean (dashed) — bias near zero.
Switch to MNAR, where the largest values hide themselves, and watch the fill collapse downward: the estimate is biased no matter how cleverly you fill, because the information is gone. 2.2 Simple imputation — and what it costs The reflex fix is to replace every missing entry of a column with a single constant: the mean (numeric, roughly symmetric), the median (numeric, skewed or outlier-prone), or the mode (categorical). It is one line of code and it is the most over-used tool in applied machine learning. EQ D2.2 — MEAN IMPUTATION $$ \hat{x}_i = \bar{x}_{\text{obs}} = \frac{1}{|O|}\sum_{j \in O} x_j, \qquad O = \{\, j: x_j \text{ observed} \,\} $$ Every hole in the column gets the same number. This is unbiased for the column mean only under MCAR, and even then it commits two quieter crimes: it shrinks the variance (every imputed point sits exactly on the mean, contributing zero spread) and it destroys correlations (a flat fill is unrelated to every other column). Your downstream model sees a column that is artificially calm and artificially independent. The variance damage is exact and worth committing to memory. If you mean-impute \(m\) of the \(n\) entries in a column, the population variance of the completed column is the original observed variance scaled down by the fraction of real data: EQ D2.3 — VARIANCE SHRINKAGE $$ \mathrm{Var}_{\text{filled}} \;=\; \frac{n-m}{n}\,\mathrm{Var}_{\text{obs}}, \qquad \text{so } 40\% \text{ missing} \Rightarrow \text{variance} \times 0.6 $$ The mean of the column is preserved, but the spread is not: \(m\) points contribute a squared deviation of exactly zero. Standard errors computed downstream are too small, confidence intervals too narrow, and significance overstated — the model is confident about data it never had. The collapse is linear in the missing fraction, which is why mean-imputing a column that is 50% empty halves its variance. You mean-impute the column \([\,2,\ 4,\ \text{NA},\ 8\,]\) per EQ D2.2. 
What single value fills the missing entry?
Average the observed entries only: \(\bar{x}_{\text{obs}} = \dfrac{2 + 4 + 8}{3} = \dfrac{14}{3} = 4.6\overline{6} \approx\) 4.67. Every hole in the column would be filled with this same number.

A column has \(n = 1000\) rows, of which \(m = 400\) are missing and get mean-imputed. By what factor is the column's variance multiplied, relative to the variance of the observed values (EQ D2.3)?
The multiplier is \(\dfrac{n-m}{n} = \dfrac{1000 - 400}{1000} = \dfrac{600}{1000} = \) 0.6. The filled column keeps 60% of its true spread; the missing 40% all pile onto the mean and contribute nothing.

PYTHON · RUNNABLE IN-BROWSER
# Mean vs kNN imputation: RMSE-to-truth on a masked, correlated column
import numpy as np
rng = np.random.default_rng(0)
n = 300
x = rng.normal(0, 1, n)                  # a predictor we always observe
y = 2.0 * x + rng.normal(0, 0.4, n)      # truth: y is strongly tied to x
mask = rng.random(n) < 0.35              # 35% of y goes missing; the mask depends on nothing, i.e. MCAR
y_obs = y.copy(); y_obs[mask] = np.nan

# (1) mean imputation: one flat number for every hole
y_mean = y_obs.copy()
y_mean[mask] = np.nanmean(y_obs)

# (2) kNN imputation in x-space: average the k nearest observed neighbours' y
def knn_impute(x, y_obs, mask, k=7):
    out = y_obs.copy()
    obs = ~mask
    for i in np.where(mask)[0]:
        d = np.abs(x[obs] - x[i])        # distance in the observed feature
        nn = np.argsort(d)[:k]
        out[i] = y_obs[obs][nn].mean()
    return out

y_knn = knn_impute(x, y_obs, mask)
rmse = lambda a: np.sqrt(np.mean((a[mask] - y[mask])**2))
print(f"mean-impute RMSE to truth: {rmse(y_mean):.3f}")
print(f"kNN-impute RMSE to truth: {rmse(y_knn):.3f}")
print(f"std(observed y): {np.nanstd(y_obs):.3f}")
print(f"std(after mean-impute): {np.std(y_mean):.3f} <- shrunk")
plot_scatter(x[mask], y[mask], [0]*mask.sum())   # the points we had to guess; plot_scatter is not defined in this cell

INSTRUMENT D2.2 — VARIANCE SHRINKAGE EQ D2.3 · MEAN-FILL COLLAPSES A DISTRIBUTION MISSING FRACTION 40% ORIGINAL
VARIANCE — AFTER MEAN-FILL — MULTIPLIER (n−m)/n — The mint curve is the true distribution; the bar at the mean is the spike of imputed points that mean-fill manufactures. As you raise the missing fraction, real spread is replaced by a stack of identical values at the center — the variance multiplier drops exactly as \((n-m)/n\). At 80% missing, four-fifths of the column is a single repeated number masquerading as data. Median and mode share the same structural flaw — a single constant per column — but resist outliers (median) and apply to categories (mode). They are reasonable defaults for a quick baseline or a column you do not believe carries much signal; they are never the right answer for a feature whose relationships matter. 2.3 kNN imputation: borrow from your neighbours The first real upgrade is to stop filling with a global constant and start filling with a local one. k-nearest-neighbour imputation finds the \(k\) most similar rows (by distance over the columns you do observe) and fills each hole with their average — a weighted average if you weight by distance. It made its name imputing DNA microarray expression matrices, where it beat row-average filling decisively. EQ D2.4 — WEIGHTED kNN FILL $$ \hat{x}_{ic} \;=\; \frac{\sum_{j \in N_k(i)} w_{ij}\, x_{jc}}{\sum_{j \in N_k(i)} w_{ij}}, \qquad w_{ij} = \frac{1}{d(i,j) + \varepsilon}, \quad d(i,j) = \!\!\sqrt{\sum_{c' \in O_{ij}} (x_{ic'} - x_{jc'})^2} $$ \(N_k(i)\) are the \(k\) donors nearest to row \(i\); the distance \(d(i,j)\) is computed only over columns \(O_{ij}\) that both rows observe (so missingness does not poison the metric). Because the fill is conditioned on a row's own neighbourhood, kNN preserves local structure and inter-column correlation that a flat mean erases. The price: distances need features on a comparable scale (Chapter 03), it is sensitive to the curse of dimensionality, and scoring is \(O(n^2)\) in the naive form. Two parameters decide its behaviour. 
Small \(k\) is flexible but noisy — a single odd neighbour swings the fill; large \(k\) smooths toward the global mean and re-introduces the very shrinkage you were trying to avoid. As always with kNN, you must scale your features first: an unscaled distance is dominated by whichever column happens to have the largest units, and the "nearest" neighbours become an artifact of measurement choice rather than similarity. INSTRUMENT D2.3 — IMPUTATION COMPARATOR MEAN vs kNN vs REGRESSION · RMSE TO TRUTH METHOD MEAN kNN REGRESSION NEIGHBOURS k 7 METHOD — RMSE TO TRUTH — CORR x↔ŷ RECOVERED — A fixed scatter of \(y\) against \(x\); the largest-\(x\) points have their \(y\) hidden (open circles) and each method guesses them (mint crosses). MEAN fills a flat horizontal line — RMSE high, the \(x\)–\(y\) correlation gone. kNN tracks the local trend; the \(k\) slider trades noise for over-smoothing. REGRESSION fits the line and lands the crosses on it — lowest RMSE here precisely because the truth is linear. Change the method and read how RMSE and recovered correlation move. 2.4 MICE: multiple imputation by chained equations Every method so far fills a single best guess and then proceeds as if it were ground truth — which pretends the imputed values carry no uncertainty. Multiple imputation fixes that at the root: generate several complete datasets, each with plausibly different fills, analyse each, and pool the results so the extra variance from imputation flows into your final standard errors. MICE (also called fully conditional specification) is the dominant way to generate those datasets. The chained-equations idea is elegant. Initialize every hole with a simple fill, then sweep the columns one at a time: for each column with missing data, regress it on all the others using the currently-filled rows, and draw new imputations from that conditional model. Repeat the sweep until the fills stop changing. 
EQ D2.5 — ONE MICE SWEEP (FULLY CONDITIONAL) $$ \text{for each } c:\quad x^{(t+1)}_{\cdot c\,\in\,\text{mis}} \;\sim\; P\!\left(x_{\cdot c} \,\middle|\, x_{\cdot 1}, \ldots, x_{\cdot c-1}, x_{\cdot c+1}, \ldots, x_{\cdot p};\ \hat{\theta}_c\right) $$ Each column gets its own conditional model \(\hat{\theta}_c\) (linear regression for a continuous column, logistic for binary, and so on) fit on the other columns. Sweeping cycles until convergence — a Gibbs-sampler-style procedure that, under MAR, draws from the joint posterior of the missing data. Run it \(M\) times with different random draws to get \(M\) complete datasets. Drawing from the conditional distribution — not just its mean — is what injects honest uncertainty: take the conditional mean instead and you get a sharper point estimate but lose the variance MICE exists to preserve. The payoff is the pooling step, Rubin's rules: average the \(M\) point estimates, and combine their variances so the total reflects both within-imputation and between-imputation uncertainty: EQ D2.6 — RUBIN'S POOLING $$ \bar{Q} = \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m, \qquad T = \underbrace{\frac{1}{M}\sum_{m=1}^{M} U_m}_{\text{within } \bar{U}} \;+\; \underbrace{\left(1 + \tfrac{1}{M}\right) \frac{1}{M-1}\sum_{m=1}^{M}\!\big(\hat{Q}_m - \bar{Q}\big)^2}_{\text{between } B} $$ \(\hat{Q}_m\) is your estimate (a coefficient, a mean) from imputed dataset \(m\); \(U_m\) is its own variance. The total variance \(T\) adds the between-imputation spread \(B\) — the part single-imputation throws away. The \((1 + 1/M)\) factor corrects for using a finite number of imputations. This is why \(M = 5\!-\!20\) imputations beat one perfect-looking fill: the disagreement between them is the uncertainty you would otherwise hide. 
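The pooling step in EQ D2.6 is short enough to write out directly. The following is a minimal sketch, assuming you already hold the \(M\) point estimates and their variances; the function name `pool_rubin` and the numbers are illustrative and do not come from any library or dataset.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M imputed-dataset results with Rubin's rules (EQ D2.6)."""
    M = len(estimates)
    if M < 2:
        raise ValueError("need at least two imputations to estimate B")
    q_bar = mean(estimates)            # pooled point estimate, Q-bar
    u_bar = mean(variances)            # within-imputation variance, U-bar
    b = variance(estimates)            # between-imputation variance B (1/(M-1) divisor)
    total = u_bar + (1 + 1 / M) * b    # total variance T
    return q_bar, u_bar, b, total

# Hypothetical coefficient estimates and their variances from M = 5 imputed datasets
q_hat = [2.0, 2.2, 1.8, 2.1, 1.9]
u_hat = [0.04, 0.05, 0.04, 0.05, 0.04]

q_bar, u_bar, b, total = pool_rubin(q_hat, u_hat)
print(f"pooled estimate = {q_bar:.3f}")
print(f"within = {u_bar:.4f}, between = {b:.4f}, total = {total:.4f}")
print(f"pooled SE = {total ** 0.5:.3f} vs single-fill SE = {u_bar ** 0.5:.3f}")
```

The gap between the two standard errors on the last line is the between-imputation term: it is the part a single imputation cannot report.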
```python
# Mini-MICE: iteratively regress each column on the others; watch it converge
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# correlated columns: a shared factor plus noise
f = rng.normal(0, 1, (n, 1))
X = f * np.array([1.0, 0.8, -0.6]) + rng.normal(0, 0.5, (n, p))

M = rng.random((n, p)) < 0.20            # 20% missing, scattered
Xm = X.copy(); Xm[M] = np.nan

col_mean = np.nanmean(Xm, axis=0)
Xf = Xm.copy()
for c in range(p):                       # step 0: mean-init every hole
    Xf[M[:, c], c] = col_mean[c]

for sweep in range(8):                   # chained equations
    prev = Xf.copy()
    for c in range(p):                   # regress column c on the rest
        rows = M[:, c]
        if not rows.any():
            continue
        others = [k for k in range(p) if k != c]
        A = np.column_stack([np.ones(n), Xf[:, others]])
        beta, *_ = np.linalg.lstsq(A[~rows], Xf[~rows, c], rcond=None)
        Xf[rows, c] = A[rows] @ beta     # conditional-mean fill
    delta = np.abs(Xf - prev)[M].mean()
    print(f"sweep {sweep+1}: mean change in filled cells = {delta:.5f}")

err = np.sqrt(np.mean((Xf[M] - X[M])**2))
print(f"\nfinal RMSE of MICE fills to truth: {err:.3f}")
print("change shrinks toward 0 -> the chained equations reached a fixed point.")
```
The honest caveat. Chained equations specify each column's conditional separately, so there is no guarantee that a coherent joint distribution exists that matches all of them. Even so, the procedure is remarkably robust in practice, and it is the approach that R's mice and scikit-learn's IterativeImputer implement. Convergence is monitored by eye (trace plots of imputed means across sweeps), not by a hard stopping rule.
2.5 Model-based & indicator strategies; choosing in practice
Two more tools round out the kit.
Model-based imputation fits a single probabilistic model of the whole feature matrix and reads the missing values off it — Gaussian/EM imputation (maximum-likelihood under a multivariate-normal assumption), low-rank matrix completion (SVD/soft-impute, the engine behind recommender systems), and increasingly tree- and neural-network-based imputers. The missing-indicator method adds a binary "was-this-missing" column alongside the (imputed) feature, letting a flexible model learn whether the fact of missingness is itself predictive — which it very often is under MNAR. Strategy Preserves variance Preserves correlation Quantifies uncertainty Reach for it when… Mean / median / mode no no no Quick baseline; a low-signal column; MCAR and you only need a point estimate kNN partly yes (local) no Nonlinear local structure, modest dimensionality, features already scaled MICE yes yes yes Inference, reported standard errors, MAR data — the statistical gold standard Model-based (EM / low-rank) yes yes partly A defensible global model; wide sparse matrices (completion) Missing-indicator n/a adds signal no Suspected MNAR; tree/GBM models that can use the flag directly A few rules survive contact with reality. Impute inside the cross-validation fold, never before — fitting the imputer on the full dataset leaks test information into training and inflates your scores. Match the method to the mechanism: MCAR forgives anything, MAR rewards conditioning on observed predictors (kNN, MICE, model-based), MNAR demands you model the missingness explicitly and report a sensitivity analysis. And when you need honest standard errors, single imputation is not enough — multiple imputation is the only one of these that carries the uncertainty of the guess into the final answer. Some learners (notably gradient-boosted trees like XGBoost and LightGBM) handle NaN natively by learning a default split direction, which is frequently the strongest baseline of all — try it before you impute. 
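The "impute inside the fold" rule and the missing-indicator idea can be sketched together. This is a hedged, dependency-free illustration, assuming a mean fill and a simple interleaved fold assignment; the helper names `fit_mean` and `transform` and the toy column are hypothetical.

```python
import math

def fit_mean(column):
    """Learn the fill value from the observed (non-NaN) entries of the training rows only."""
    seen = [v for v in column if not math.isnan(v)]
    return sum(seen) / len(seen)

def transform(column, fill):
    """Apply a frozen fill value; also return a was-this-missing flag per row."""
    flags = [1 if math.isnan(v) else 0 for v in column]
    filled = [fill if math.isnan(v) else v for v in column]
    return filled, flags

nan = math.nan
x = [1.0, nan, 3.0, 5.0, nan, 7.0, 9.0, nan]    # toy feature with holes
k = 4                                           # number of folds (illustrative)

for j in range(k):
    train = [v for i, v in enumerate(x) if i % k != j]
    valid = [v for i, v in enumerate(x) if i % k == j]
    fill = fit_mean(train)                      # statistic comes from the training fold only
    filled, flags = transform(valid, fill)      # validation rows never influence the fill
    print(f"fold {j}: fill={fill:.3f} valid={filled} missing_flags={flags}")
```

The fill value changes from fold to fold because each one is fit on different training rows; computing it once on the full column before splitting is exactly the leak described above.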
PITFALLS The four ways imputation goes wrong: (1) imputing before the train/test split — leakage that makes offline metrics fiction; (2) mean-filling a feature whose correlations matter — quiet variance collapse and washed-out relationships; (3) assuming MAR when the value hides itself — MNAR bias dressed up as a tidy table; (4) reporting single-imputation standard errors as if the fill were certain — overconfident intervals. NEXT Once the holes are filled, the values still need to be made comparable. kNN and most distance- or gradient-based methods assume features share a scale and that categories are numbers a model can read — Chapter 03 covers encoding categoricals and scaling numerics, the step that makes everything in this chapter actually work. 2.R References Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63(3):581–592 — the paper that defined MCAR, MAR, and MNAR. Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley — the canonical textbook on mechanisms, likelihood-based, and multiple imputation. van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press — the practical, freely-readable reference for MICE / chained equations. Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525 — the kNN-impute (KNNimpute) paper. White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4):377–399 — practical guidance on running and pooling MICE. scikit-learn developers. Imputation of missing values (User Guide). Official docs — SimpleImputer, KNNImputer, and IterativeImputer (MICE). 
## DATA · Encoding, Scaling & Transforms (https://ai-encyclopedia.com/data/03-encoding-scaling.html)
DATA & FEATURE ENGINEERING · CHAPTER 03 / 05
Encoding, Scaling & Transforms
Models consume numbers, so the encoding of categories and the scaling of features often matters more than the choice of model. A linear model, an SVM, or a k-NN classifier given raw, unscaled, poorly encoded columns will lose to a mediocre model given clean ones. This chapter covers the arithmetic of turning messy columns into the well-behaved numeric matrix every estimator assumes it was handed.
LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DATA 01–02 INSTRUMENTS ENCODER · SCALER · BOX-COX
IN THIS CHAPTER 3.1 Categorical encoding 3.2 Target & WOE encoding 3.3 Scaling features 3.4 Distribution transforms 3.5 Binning & discretization 3.R References
3.1 Categorical encoding: one-hot, ordinal, frequency
Almost every real dataset arrives with columns that are not numbers: a country, a product category, a browser, a job title. A model cannot multiply a weight by the string "Berlin". Encoding is the map from categories to numbers, and the wrong map silently injects assumptions the data never made. The first thing to settle is whether a categorical variable is nominal (unordered: colours, cities, payment methods) or ordinal (genuinely ranked: small < medium < large, bronze < silver < gold). That single distinction decides almost everything that follows.
One-hot encoding
The default for nominal variables.
A column with \(K\) distinct levels becomes \(K\) binary indicator columns, exactly one of which is hot (1) per row: EQ D3.1 — ONE-HOT ENCODING $$ \text{onehot}(x_i)_j \;=\; \mathbb{1}\!\left[\, x_i = c_j \,\right], \qquad j = 1, \ldots, K, \qquad \sum_{j=1}^{K} \text{onehot}(x_i)_j = 1 $$ \(c_1, \ldots, c_K\) are the \(K\) distinct categories; \(\mathbb{1}[\cdot]\) is the indicator (1 if true, 0 otherwise). Every row becomes a unit vector pointing at its category — all categories sit at equal, unit distance from one another, so no false ordering is implied. The cost is width: a 50-state column becomes 50 columns, a ZIP-code column becomes tens of thousands. For linear models with an intercept you often drop one level (dummy encoding, \(K-1\) columns) to avoid perfect collinearity; tree models and regularized models can keep all \(K\). You one-hot encode a single categorical column that has 4 distinct categories. How many new indicator columns does the encoding add? One-hot creates exactly one binary indicator per distinct level, so \(K = 4\) categories produce \(K = \) 4 columns. (If you instead used dummy encoding and dropped one level to avoid collinearity, you would add \(K-1 = 3\) — but plain one-hot adds the full 4.) Ordinal encoding When the categories really are ordered, map them to ascending integers — small → 0, medium → 1, large → 2. This keeps the column to a single feature and tells the model that large is "more" than small. Applied to a nominal variable, though, ordinal encoding is a trap: labelling {red, green, blue} as {0, 1, 2} tells a linear model that blue is twice green and green sits exactly between red and blue — pure fiction the model will dutifully exploit. Ordinal encoding is correct only when the order is real. Frequency / count encoding A cheap, single-column escape from one-hot's width problem: replace each category by how often it appears (its count or its relative frequency). 
It collapses \(K\) levels into one numeric feature, which suits high-cardinality columns and tree models well. The implicit claim is that rarity carries signal — often true (rare merchant codes correlate with fraud), sometimes meaningless, and it deliberately collapses two equally-frequent-but-different categories onto the same value. Encoding New columns Best for Footgun One-hot K (or K−1) Nominal, low cardinality, linear/SVM/k-NN Cardinality blow-up; sparse, wide matrices Ordinal 1 Genuinely ranked categories Invents an order on nominal data Frequency 1 High cardinality, tree models Distinct-but-equally-common levels collide Target (§3.2) 1 High cardinality + a target Leakage if fit on the same rows it encodes The cardinality wall. One-hot is the textbook default precisely because it is honest about nominal structure, but it scales linearly with the number of levels. At a few dozen categories it is fine; at thousands (user IDs, product SKUs, ZIP codes) the matrix becomes enormous and sparse, distances degrade, and you reach for frequency or target encoding instead. That trade-off — fidelity vs. width — is the whole game, and the instrument below lets you feel it. 
```python
# One-hot, ordinal & frequency encoding of a small categorical column (numpy only)
import numpy as np

col = np.array(["red","green","blue","red","blue","red","green","red"])
cats, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
K = len(cats)
print("categories:", list(cats), " (K =", K, ")")

onehot = np.eye(K, dtype=int)[inv]        # EQ D3.1: K indicator columns
print("\none-hot matrix (rows = samples, cols =", list(cats), "):")
print(onehot)
print("one-hot adds", K, "columns; every row sums to", set(onehot.sum(1).tolist()))

ordinal = inv                             # integer code per category (order = alpha here)
freq = counts[inv] / len(col)             # frequency encoding: share of each category
print("\nordinal codes:", ordinal.tolist())
print("frequency codes:", np.round(freq, 3).tolist(), " (1 column, not", K, ")")
# The tail of the next line was lost in extraction; it is completed from the
# alphabetical codes above (blue=0, green=1, red=2).
print("\nNote: ordinal would falsely tell a linear model blue(0) < green(1) < red(2).")
```
INSTRUMENT D3.1 · ENCODING EXPLORER (one-hot vs target, cardinality blow-up, EQ D3.1 / D3.2; interactive on the site)
Drag K from 2 to 40 in ONE-HOT mode and watch the matrix grow one column per category: at 40 levels and 10K rows you are storing 400K cells, of which only 10K (2.5%) are non-zero. That is the sparse, wide blow-up. Switch to TARGET and the whole thing collapses to a single dense column whatever K is. The canvas shows the actual encoded matrix; the bars show one column being added per category in one-hot, versus one fixed column in target.
3.2 Target & WOE encoding, and how to keep them leakage-safe
When a categorical column has hundreds or thousands of levels, one-hot is unwieldy and frequency throws away the relationship with the label.
Target encoding (also "mean encoding", introduced by Micci-Barreca in 2001) replaces each category with the average value of the target for that category — one informative numeric column, regardless of cardinality. EQ D3.2 — SMOOTHED TARGET ENCODING $$ \hat{t}(c) \;=\; \frac{n_c\, \bar{y}_c \;+\; m\, \bar{y}}{n_c + m}, \qquad \bar{y}_c = \frac{1}{n_c}\sum_{i:\,x_i = c} y_i $$ \(\bar{y}_c\) is the target mean inside category \(c\); \(n_c\) is how many rows fall in \(c\); \(\bar{y}\) is the global target mean; \(m\) is a smoothing strength. The encoded value is a credibility-weighted blend: a category seen thousands of times trusts its own mean (\(n_c \gg m\)); a category seen twice is pulled toward the global prior \(\bar{y}\) (\(n_c \ll m\)). Without smoothing, a category that appears once would be encoded as exactly its single row's label — a perfect, useless memory of the answer. This shrinkage toward the prior is the entire reason target encoding generalizes. Weight of evidence (WOE) For binary classification, the closely-related weight of evidence encoding — a staple of credit scoring — replaces each category with the log-odds it contributes: EQ D3.3 — WEIGHT OF EVIDENCE $$ \mathrm{WOE}(c) \;=\; \ln\!\left( \frac{\Pr(x = c \mid y = 1)}{\Pr(x = c \mid y = 0)} \right) \;=\; \ln\!\left( \frac{\text{(events in }c)\,/\,\text{(total events)}}{\text{(non-events in }c)\,/\,\text{(total non-events)}} \right) $$ WOE is the log-ratio of the share of positives to the share of negatives within a category. It is monotonic in the target rate, lives on the natural log-odds scale a logistic regression already speaks, and the associated Information Value \(\mathrm{IV} = \sum_c (\text{share}_1 - \text{share}_0)\,\mathrm{WOE}(c)\) gives a single number for how predictive the whole feature is. Like target encoding, WOE must be computed with smoothing (and a small \(\varepsilon\) to avoid \(\ln 0\)) and on held-out folds. 
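EQ D3.3 and the Information Value sum can be computed in a few lines. The sketch below is one possible reading, not a reference implementation: the additive `eps` smoothing (half a pseudo-count per cell) is an assumption chosen to avoid \(\ln 0\), the name `woe_iv` is hypothetical, and the toy data is invented.

```python
import math
from collections import Counter

def woe_iv(categories, labels, eps=0.5):
    """Per-category weight of evidence (EQ D3.3) plus the feature's Information Value."""
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = sorted(set(categories))
    # eps pseudo-counts keep every share strictly positive (assumed smoothing scheme)
    n_pos = sum(pos.values()) + eps * len(levels)
    n_neg = sum(neg.values()) + eps * len(levels)
    woe, iv = {}, 0.0
    for c in levels:
        share_pos = (pos[c] + eps) / n_pos      # Pr(x = c | y = 1)
        share_neg = (neg[c] + eps) / n_neg      # Pr(x = c | y = 0)
        woe[c] = math.log(share_pos / share_neg)
        iv += (share_pos - share_neg) * woe[c]
    return woe, iv

# Toy data: "a" is mostly positive, "b" mostly negative, "c" balanced
cats = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
ys   = [1, 1, 1, 0,  1, 0, 0, 0,  1, 0]

woe, iv = woe_iv(cats, ys)
for c, w in woe.items():
    print(f"WOE({c}) = {w:+.3f}")
print(f"IV = {iv:.3f}")
```

In a real pipeline the counts would come from training folds only, for the same leakage reason that applies to target encoding.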
THE LEAKAGE TRAP Target encoding looks at the label — so if you fit the encoding on the same rows you then train on, every row gets to peek at its own answer. The model sees a feature that is partly a copy of \(y\), validation scores soar, and production collapses. This is the single most common way a leaderboard-topping pipeline dies on real data. The fix is never to encode a row using its own target. The disciplined remedy is out-of-fold (cross-fitted) encoding: split the training data into \(k\) folds; to encode the rows in fold \(j\), compute the category means using only the other \(k-1\) folds. No row ever contributes to its own encoded value, so the feature carries the category's signal without memorizing the answer. The test set is then encoded from statistics computed on the full training set. Smoothing (EQ D3.2) and out-of-fold computation are complementary, not alternatives — serious pipelines use both. A category appears \(n_c = 4\) times with a positive rate \(\bar{y}_c = 0.75\). The global mean is \(\bar{y} = 0.5\) and the smoothing strength is \(m = 4\). What smoothed target-encoded value \(\hat{t}(c)\) does EQ D3.2 give? \(\hat{t}(c) = \dfrac{n_c\,\bar{y}_c + m\,\bar{y}}{n_c + m} = \dfrac{4 \times 0.75 + 4 \times 0.5}{4 + 4} = \dfrac{3 + 2}{8} = \dfrac{5}{8} = \) 0.6. With \(n_c = m\) the encoding is the simple average of the category mean (0.75) and the prior (0.5) — exactly halfway, because the category has been seen just as often as the smoothing strength assumes. PYTHON · RUNNABLE IN-BROWSER # Target encoding: naive (LEAKS) vs out-of-fold (safe). Watch the leak signal. 
```python
import numpy as np

rng = np.random.default_rng(0)
# 600 rows, a high-cardinality column with 200 levels, target unrelated to it
n, K = 600, 200
cat = rng.integers(0, K, n)
y = rng.integers(0, 2, n).astype(float)        # pure coin flips: TRUE signal = 0

def naive_encode(cat, y):                      # fit on the SAME rows -> leak
    enc = np.zeros(len(cat))
    for c in np.unique(cat):
        enc[cat == c] = y[cat == c].mean()
    return enc

def oof_encode(cat, y, k=5, m=20.0):           # out-of-fold + smoothing (safe)
    enc, gm = np.zeros(len(cat)), y.mean()
    fold = np.arange(len(cat)) % k
    for j in range(k):
        tr, te = fold != j, fold == j
        for c in np.unique(cat[te]):
            mask = tr & (cat == c); nc = mask.sum()
            enc[te & (cat == c)] = (nc*y[mask].mean() + m*gm)/(nc+m) if nc else gm
    return enc

def corr(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a*b).sum() / np.sqrt((a*a).sum() * (b*b).sum()))

# The end of this cell was lost in extraction; the two prints below are
# reconstructed from the paragraph that follows.
print("corr(naive encoding, y):", round(corr(naive_encode(cat, y), y), 3), " <- manufactured by the leak")
print("corr(OOF encoding,   y):", round(corr(oof_encode(cat, y), y), 3))
```
In the cell above the target is literally a coin flip, with no real relationship to the category, yet naive encoding manufactures a sizeable correlation with \(y\) out of thin air, because each rare category memorized its own rows. Out-of-fold encoding reports the true near-zero. Change the seed and run it a few times: the leak is consistent, the honest version is consistently honest.
3.3 Scaling: standardize, min-max, robust
Once everything is numeric, the columns still live on wildly different scales: age in years (0–100), income in dollars (0–\(10^6\)), a fraction in [0, 1]. Any algorithm that measures distance or sums weighted features will let the large-magnitude column dominate purely by accident of units. Feature scaling puts every column on comparable footing.
Who cares about scale, and who does not? It is worth memorizing the split, because scaling a tree model is wasted effort and not scaling a k-NN model is a bug.
Scaling matters Why Scaling is irrelevant k-NN, k-means Euclidean distance Decision trees, random forests, gradient-boosted trees — they split on thresholds within a single feature, so monotone rescaling changes nothing. SVM (RBF), PCA dot products / variance Linear/logistic + regularization, neural nets gradient conditioning; L1/L2 penalize raw coefficients Standardization (z-score) Subtract the mean, divide by the standard deviation. Every column ends up centered at 0 with unit variance: EQ D3.4 — STANDARDIZATION (z-SCORE) $$ z = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_i x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_i (x_i - \mu)^2} $$ \(z\) is the number of standard deviations \(x\) sits from the mean. The transformed column has mean 0 and standard deviation 1, but its shape is unchanged — standardizing a skewed column gives a skewed column with nicer units (that is what §3.4 is for). It is the default for most linear models, SVMs, PCA and neural nets. It does not bound the range and it is not robust: a single huge outlier inflates \(\sigma\) and squashes everyone else toward zero. Standardize the value \( x = 8 \) for a feature whose mean is \( \mu = 5 \) and standard deviation is \( \sigma = 3 \). What is the z-score \( z \)? \( z = \dfrac{x - \mu}{\sigma} = \dfrac{8 - 5}{3} = \dfrac{3}{3} = \) 1.0. The value sits exactly one standard deviation above the mean — which is precisely what a z-score of 1 means. Min-max scaling Linearly squeeze the column into a fixed interval, usually [0, 1]: EQ D3.5 — MIN-MAX SCALING $$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \;\in\; [0, 1] $$ The minimum maps to 0, the maximum to 1, everything else lands proportionally between. It preserves the exact shape of the distribution and the relative spacing of points, which is why it is favoured for image pixels and for inputs to bounded activations. 
Its weakness is the mirror image of standardization's: it is defined by the extremes, so one outlier at \(x_{\max}\) compresses every real value into a thin band near 0. Use it when you know the bounds and trust them. Robust scaling When outliers are a fact of life, scale by quantities that ignore the tails — the median for centering, the interquartile range (IQR) for spread: EQ D3.6 — ROBUST SCALING $$ x'' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}, \qquad \mathrm{IQR}(x) = Q_3 - Q_1 $$ The median has a 50% breakdown point and the IQR uses only the middle half of the data, so a handful of extreme values barely move either statistic. Robust scaling therefore keeps the bulk of the data on a sensible scale even when 10–20% of it is garbage — at the cost of the clean "mean 0, var 1" guarantee. Reach for it whenever a histogram shows fat tails or known measurement errors; reach for standardization when the data is roughly Gaussian and clean. FIT ON TRAIN ONLY Every scaler has parameters learned from data — \(\mu, \sigma\) for z-score, \(x_{\min}, x_{\max}\) for min-max, median/IQR for robust. Fit those parameters on the training set, then apply the frozen transform to validation and test. Recomputing the mean on the test set leaks test information into preprocessing and quietly inflates your scores — the scaling-stage twin of the target-encoding leak in §3.2. 
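The two points above, robustness to outliers and fitting on the training set only, can be seen side by side. This is a small standard-library sketch under stated assumptions: the data is invented, the helper names are hypothetical, and `statistics.quantiles` uses its default quartile convention, which may differ slightly from other libraries.

```python
from statistics import mean, pstdev, median, quantiles

def fit_standard(train):
    return mean(train), pstdev(train)           # mu, sigma as in EQ D3.4

def fit_robust(train):
    q1, _, q3 = quantiles(train, n=4)           # quartiles (default convention)
    return median(train), q3 - q1               # median, IQR as in EQ D3.6

train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]   # one gross outlier
test = [2.0, 5.0, 8.0]

mu, sigma = fit_standard(train)                 # parameters learned on train only
med, iqr = fit_robust(train)

z_test = [(v - mu) / sigma for v in test]       # frozen transform applied to test
r_test = [(v - med) / iqr for v in test]

print("z-score :", [round(v, 3) for v in z_test])
print("robust  :", [round(v, 3) for v in r_test])
```

With the outlier in the training data, the three z-scores land almost on top of each other, while the robust-scaled values keep a usable spread; in both cases the test rows are transformed with parameters they did not help estimate.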
```python
# Standardize vs min-max, and a Box-Cox normality gain on skewed data
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(2.0, 4000) + 0.5            # right-skewed, strictly positive

def stats(name, v):
    print(f"{name:11s} mean {v.mean():7.3f} std {v.std():6.3f} "
          f"min {v.min():7.3f} max {v.max():8.3f}")

stats("raw", x)
z = (x - x.mean()) / x.std()                    # EQ D3.4: mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())        # EQ D3.5: [0, 1]
stats("z-score", z); stats("min-max", mm)

def skew(v):                                    # 0 = symmetric
    v = (v - v.mean()) / v.std()
    return float((v**3).mean())

print("\nscaling does NOT change shape -> skew(raw)=%.2f skew(z)=%.2f" % (skew(x), skew(z)))

# Box-Cox (lambda chosen by a small grid) pulls the skew toward 0.
# The middle of this statement was lost in extraction; the 1e-9 threshold,
# the `bc` line and the print label are reconstructed.
best = min(np.linspace(-1, 1, 41),
           key=lambda L: abs(skew(np.log(x) if abs(L) < 1e-9 else (x**L - 1) / L)))
bc = np.log(x) if abs(best) < 1e-9 else (x**best - 1) / best
print(f"Box-Cox lambda={best:+.2f} -> skew={skew(bc):+.2f} (much closer to normal)")
```
INSTRUMENT D3.2 · SCALING VISUALIZER (two feature clouds; standardize / min-max / robust; outlier toggle; interactive on the site)
Two clouds on very different native scales (A wide, B narrow). In RAW, feature A dominates any distance. Cycle the scalers: STANDARD and MIN-MAX equalize them, until you hit INJECT, which drops one extreme outlier. Now watch min-max crush the real data into a sliver near 0 and STANDARD squeeze it as the outlier inflates \(\sigma\), while ROBUST barely flinches because the median and IQR ignore the rogue point. Grid lines mark the target scale of each method.
3.4 Distribution transforms: log, Box-Cox, Yeo-Johnson
Scaling moves and stretches a column but never changes its shape. Yet many real features are badly skewed (incomes, prices, durations, counts) and many estimators (linear regression, anything assuming Gaussian-ish residuals, distance methods) work best on roughly symmetric inputs.
Distribution transforms are nonlinear maps that pull a long right tail back toward symmetry. The log transform The workhorse. For strictly positive, right-skewed data, \(x \mapsto \ln x\) compresses large values far more than small ones, taming multiplicative spread into additive spread. It is the right move when a variable is naturally relative — a doubling of income matters the same whether from $10K or $1M. Use \(\ln(1+x)\) ( log1p) when the column contains exact zeros. Box-Cox Box and Cox (1964) generalized the log into a one-parameter family and let the data choose the exponent: EQ D3.7 — BOX-COX TRANSFORM $$ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[6pt] \ln x & \lambda = 0 \end{cases} \qquad (x > 0) $$ A single knob \(\lambda\) sweeps a whole spectrum of shapes: \(\lambda = 1\) is (almost) the identity, \(\lambda = 0.5\) a square root, \(\lambda = 0\) the log, \(\lambda = -1\) a reciprocal. The \(-1\) and division by \(\lambda\) make the family continuous at \(\lambda = 0\), where it smoothly becomes the log. \(\lambda\) is chosen by maximum likelihood — the value that makes the transformed data most Gaussian. The hard constraint: Box-Cox requires strictly positive inputs. Apply the Box-Cox transform (EQ D3.7) with \( \lambda = 1 \) to the value \( x = 2 \). What is \( x^{(\lambda)} \)? For \( \lambda \neq 0 \), \( x^{(\lambda)} = \dfrac{x^{\lambda} - 1}{\lambda} = \dfrac{2^{1} - 1}{1} = \dfrac{1}{1} = \) 1.0. At \( \lambda = 1 \) the transform is just \( x - 1 \), a pure shift — it leaves the distribution's shape untouched, which is exactly why \( \lambda = 1 \) is the "do nothing" point of the family. Yeo-Johnson Box-Cox's positivity requirement is a real nuisance — temperatures, profits, and standardized features all go negative. 
Yeo-Johnson (2000) extends the same idea to the whole real line by applying mirrored power transforms on each side of zero: EQ D3.8 — YEO-JOHNSON TRANSFORM $$ x^{(\lambda)} = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \neq 0 \\[4pt] \ln(x+1) & x \ge 0,\ \lambda = 0 \\[4pt] -\dfrac{(-x+1)^{2-\lambda} - 1}{2 - \lambda} & x < 0,\ \lambda \neq 2 \\[4pt] -\ln(-x+1) & x < 0,\ \lambda = 2 \end{cases} $$ For non-negative \(x\) it is essentially Box-Cox on \(x+1\); for negative \(x\) it mirrors the transform with exponent \(2-\lambda\). The result is one continuous, differentiable function over all of \(\mathbb{R}\) — no positivity constraint, no \(+\)constant hacks. \(\lambda\) is again fit by maximum likelihood for maximal normality. Default to Yeo-Johnson when the column can be zero or negative; reach for plain Box-Cox or log only when you know the data is strictly positive and want the cleaner interpretation. Honest caveats. These transforms optimize for marginal normality, which is neither necessary nor sufficient for a good model — modern gradient-boosted trees are invariant to any monotone transform of a feature, so this whole section is largely moot for them. Transforms also distort interpretability (a coefficient on \(\ln(\text{income})\) is an elasticity, not a dollar effect) and they extrapolate dangerously outside the fitted range. They earn their keep most for linear models, classical statistics, and any pipeline where Gaussian-ish inputs genuinely help. 
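EQ D3.8 translates almost line for line into code. The sketch below evaluates the transform for a given \(\lambda\) only; it does not fit \(\lambda\) by maximum likelihood, and the `1e-12` tolerance for the two special cases is an assumption of this example.

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value (EQ D3.8)."""
    if x >= 0:
        if abs(lam) < 1e-12:                     # lambda = 0 branch
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if abs(lam - 2) < 1e-12:                     # lambda = 2 branch
        return -math.log1p(-x)
    return -((-x + 1) ** (2 - lam) - 1) / (2 - lam)

values = (-3.0, -1.0, 0.0, 1.0, 3.0)
for lam in (0.0, 0.5, 1.0, 2.0):
    row = [round(yeo_johnson(v, lam), 4) for v in values]
    print(f"lambda={lam}: {row}")
```

Two properties are easy to check from the output: zero always maps to zero, and \(\lambda = 1\) returns every input unchanged on both sides of zero, so it plays the same "do nothing" role that \(\lambda = 1\) plays for Box-Cox (without the shift by one).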
```python
# Box-Cox: scan lambda, pick the most-Gaussian, quantify the normality gain
import numpy as np
rng = np.random.default_rng(1)
x = rng.lognormal(0.0, 0.9, 5000) + 0.2         # heavy right skew, all positive
# NOTE: everything after "abs(lam)" was lost in extraction; the rest of this cell is a
# reconstruction that uses min |skew| on a grid as the normality criterion.
def boxcox(x, lam): return np.log(x) if abs(lam) < 1e-9 else (x**lam - 1) / lam   # EQ D3.7
def skew(v): v = (v - v.mean()) / v.std(); return float((v**3).mean())            # 0 = symmetric
best = min(np.linspace(-1, 1, 41), key=lambda L: abs(skew(boxcox(x, L))))
print(f"raw skew = {skew(x):+.2f}")
print(f"best lambda = {best:+.2f} -> skew = {skew(boxcox(x, best)):+.2f}")
```
INSTRUMENT D3.3 · BOX-COX TRANSFORMER (skewed distribution, λ slider, live skewness, EQ D3.7; interactive on the site)
The histogram is a strongly right-skewed (log-normal) feature. Drag λ from 1 (identity, the raw skew) down toward 0 (the log) and watch the long tail fold back into a near-symmetric bell as the skewness readout drives toward zero. Press SNAP TO λ* to jump to the maximum-normality value computed live. Push λ past the sweet spot toward −1 and you over-correct into a left skew: the transform is a dial, not a switch.
3.5 Binning & discretization
The opposite move from a smooth transform: binning chops a continuous variable into a handful of discrete intervals, for example age → {child, adult, senior} or income → deciles. You deliberately throw away resolution to buy something else: robustness to outliers, the ability to capture a non-monotonic effect with a linear model, interpretable "score bands", or a categorical handoff into the encoders of §3.1–§3.2.
There are two everyday strategies, and the difference is whether the bin widths or the bin counts are held constant:
Strategy Edges chosen by Each bin has… Good / bad Equal-width range / k equal interval, unequal counts Simple & interpretable; empty bins on skewed data Equal-frequency quantiles equal counts, unequal widths Robust to skew; edges shift with the data Supervised (e.g.
tree / MDL) target purity edges where the label changes Most predictive; can overfit & leak — fit on train EQ D3.9 — EQUAL-WIDTH vs EQUAL-FREQUENCY BIN EDGES $$ \text{equal-width: } e_j = x_{\min} + j\,\frac{x_{\max} - x_{\min}}{k}; \qquad \text{equal-frequency: } e_j = Q_{j/k}(x), \quad j = 0, \ldots, k $$ Equal-width splits the value axis into \(k\) equal pieces — trivial to read ("ages 0–20, 20–40, …") but on a skewed column most points pile into one or two bins and the rest sit empty. Equal-frequency splits the data into \(k\) equal piles using quantiles, so every bin is equally populated, at the price of uneven, data-dependent widths. Equal-frequency is the safer default for skewed real-world data; equal-width wins when the bin boundaries themselves must be round, fixed, human numbers. Binning is genuinely contested. It can rescue a linear model from a U-shaped relationship and it makes credit-scorecards legible — but it discards information, plants artificial discontinuities at the bin edges, and (when bins are chosen using the target) leaks exactly like target encoding. The modern view: prefer letting a flexible model learn the nonlinearity (splines, gradient-boosted trees) over hand-binning, and reserve discretization for interpretability, regulatory, or robustness reasons rather than raw accuracy. PITFALLS Four ways encoding & scaling silently break a model: (1) fitting any data-dependent transform — scaler, target encoder, supervised bins — on the full dataset instead of train-only, leaking test/label information; (2) ordinal-encoding a nominal variable and inventing an order; (3) min-max scaling in the presence of outliers, crushing the real data to a sliver; (4) unseen categories at inference time that the encoder has no value for — always reserve an "unknown" bucket and a global-mean fallback. NEXT Encoding and scaling make the columns you have well-behaved; feature engineering creates the columns you wish you had. 
Chapter 04, Feature Engineering, covers interactions, polynomial and spline bases, date/time and cyclical features, aggregations and lag features, and the discipline of building them without leaking the future into the past.
3.R References
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations 3(1). The smoothed target/mean encoding of EQ D3.2.
Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society B 26(2). The Box-Cox power-transform family, EQ D3.7.
Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika 87(4). The Yeo-Johnson extension to real-valued data, EQ D3.8.
Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (full text online). Encoding, scaling, transforms and leakage-safe resampling.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12. The reference implementations of StandardScaler, MinMaxScaler, RobustScaler and PowerTransformer.
Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6. A survey of binning / discretization methods (EQ D3.9).
## DATA · Feature Engineering & Selection (https://ai-encyclopedia.com/data/04-feature-engineering.html)
DATA & FEATURE ENGINEERING · CHAPTER 04 / 05
Feature Engineering & Selection
The right feature can let a linear model beat a neural net. Feature engineering is the point where domain knowledge enters the math.
This chapter works from both ends: how to create features that expose structure a model cannot find on its own, and how to select from the resulting flood the few that carry signal, without contaminating your own evaluation in the process.

LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DATA 01–03 INSTRUMENTS INTERACTION · SELECTION RACE · VIF

IN THIS CHAPTER 4.1 Creating features 4.2 Datetime, text & aggregation 4.3 Selection: filter · wrapper · embedded 4.4 Importance & redundancy 4.5 Selection bias & nested CV 4.R References

4.1 Creating features: interactions, ratios, polynomials

A model can only learn relationships its inputs make expressible. A linear model on raw columns \(x_1, x_2\) can fit only \(w_0 + w_1 x_1 + w_2 x_2\), a flat hyperplane. If the truth lives on a curve, or in the product of two variables, no amount of training data and no clever optimizer will recover it: the hypothesis class simply does not contain the answer.

Feature engineering changes the hypothesis class by changing the inputs. You are doing, by hand and with domain knowledge, the representation learning that a deep network would otherwise have to discover from scratch. On tabular data you will frequently win, because you know things about the problem that the data alone does not say. The three workhorse transforms each inject a specific kind of structure:

| Transform | New feature | What it expresses | Reach for it when… |
|---|---|---|---|
| Interaction | x₁ · x₂ | The effect of one variable depends on another (non-additivity) | Effects are conditional: a drug works only at a certain dose and age |
| Ratio | x₁ / x₂ | Scale-free intensity; a rate rather than a level | Density, price-per-area, debt-to-income: the meaningful quantity is normalized |
| Polynomial | x², x³, … | Smooth curvature in a single variable | Diminishing or accelerating returns; a clear bend in the partial-dependence plot |

The interaction is the most important and the most underused.
Consider the exclusive-or pattern: a point is positive when its two coordinates share a sign and negative otherwise. The two classes are perfectly determined, yet no line in the \((x_1, x_2)\) plane separates them: every straight cut through the origin puts half of each class on each side, and shifting the cut off-centre helps only a little. Add one feature, the product \(x_1 x_2\), and the problem collapses to a single threshold: \(x_1 x_2 > 0\). A linear model, with no hidden layer and no kernel, now solves it exactly. That is the whole thesis of this chapter in one example.

EQ D4.1 — INTERACTION LINEARIZES XOR $$ y = \operatorname{sign}(x_1 x_2), \qquad \hat{y} = \operatorname{sign}\!\big(w\,(x_1 x_2)\big) \text{ with } w > 0 \quad\text{is exact, while}\quad \hat{y} = \operatorname{sign}(w_1 x_1 + w_2 x_2 + b) \text{ is never exact and typically stays near } 50\% \text{ when fitted.} $$ The raw inputs carry all the information, but in a form no linear decision boundary can read. The engineered product \(z = x_1 x_2\) is a change of coordinates in which the same data becomes linearly separable on one axis. The model did not get smarter; the representation did. A neural net would have learned an equivalent product inside a hidden layer; you supplied it directly, with one multiplication and zero training.

Polynomials generalize this. Polynomial feature expansion of degree \(d\) emits every monomial up to total degree \(d\): for two inputs at degree 2 that is \(\{1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2\}\). The number of terms grows combinatorially, and that growth is the central danger of the technique.

EQ D4.2 — SIZE OF A POLYNOMIAL EXPANSION $$ \#\text{terms} = \binom{n + d}{d}, \qquad \#\text{(degree-2, no bias, } n \text{ inputs)} = \underbrace{n}_{\text{linear}} + \underbrace{\binom{n}{2}}_{\text{interactions}} + \underbrace{n}_{\text{squares}} $$ \(\binom{n+d}{d}\) counts all monomials of total degree \(\le d\) in \(n\) variables, including the constant.
For \(n=2,\ d=2\) that is \(\binom{4}{2}=6\) terms; the squared-plus-interaction part alone (dropping the bias and the two linear terms) is \(x_1^2, x_1 x_2, x_2^2\) — exactly 3. At \(n=100,\ d=2\) you already have \(\binom{102}{2}=5151\) features; at degree 3 the count explodes into the hundreds of thousands. Curvature is cheap; the curse of dimensionality it buys is not. This is precisely why §4.3 (selection) is not optional once you start §4.1 (creation). You take two raw features \( x_1, x_2 \) and apply a degree-2 polynomial expansion. Excluding the bias term and the two original linear terms, how many squared-plus-interaction features does it add? The degree-2 monomials beyond linear are the two squares \( x_1^2,\ x_2^2 \) and the single interaction \( x_1 x_2 \). That is \( 2 + 1 = \) 3 new features — matching the squares-plus-interactions decomposition in EQ D4.2 with \( n = 2 \): \( n + \binom{n}{2} = 2 + 1 = 3 \). PYTHON · RUNNABLE IN-BROWSER # EQ D4.1: one interaction feature lets a LINEAR model solve XOR-like data. 
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-1, 1, (n, 2))                 # two raw features in [-1, 1]
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # XOR pattern: same-sign => class 1

def fit_logreg(F, y, steps=400, lr=0.5):
    # plain gradient descent on the mean logistic loss
    w = np.zeros(F.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(F @ w + b)))
        g = p - y
        w -= lr * (F.T @ g) / len(y); b -= lr * g.mean()
    return w, b

def acc(F, w, b):
    return ((F @ w + b > 0) == (y > 0.5)).mean()

raw = X                                        # [x1, x2] -> a flat plane
poly = np.column_stack([X, X[:, 0]*X[:, 1]])   # add the x1*x2 interaction
wr, br = fit_logreg(raw, y); wp, bp = fit_logreg(poly, y)
print(f"linear on [x1, x2]: accuracy {acc(raw, wr, br):.3f} (~chance)")
print(f"linear on [x1, x2, x1*x2]: accuracy {acc(poly, wp, bp):.3f} (solved)")
print(f"learned weight on x1*x2: {wp[2]:+.2f}")

INSTRUMENT D4.1 — THE INTERACTION FEATURE (XOR data · toggle x₁·x₂ · EQ D4.1). Controls: label noise (shown at 0.05); features the model sees, [ x₁, x₂ ] or [ x₁, x₂, x₁·x₂ ]. Readouts: train accuracy, decision rule, separable?

The four quadrants form a checkerboard: same-sign points are one class, opposite-sign the other. With [ x₁, x₂ ] the best straight boundary the logistic model can draw is hopeless: accuracy hovers near 50%, and the canvas shows the flat shaded half-planes failing. Flip to [ x₁, x₂, x₁·x₂ ] and the very same model snaps to ~100%: the engineered product turns the checkerboard into a single threshold (the boundary becomes the two axes). Raising label noise is the only thing that can hurt it now.

A practical warning. Engineered features are not free: each one is another dimension in which the model can overfit, another column to compute and store at serving time, and, for ratios, another place a zero denominator can blow up your pipeline. The discipline is to create with intent (a hypothesis about why this feature should matter) and then prune hard (§4.3). Create generously in the lab; ship parsimoniously.
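The term counts in EQ D4.2 are easy to check by brute force. The sketch below is an illustration, not one of the chapter's own cells: the helper names `monomials` and `n_terms` are hypothetical, and it uses only the Python standard library to enumerate every monomial of total degree at most \(d\) and compare the count with the binomial formula.

```python
from itertools import combinations_with_replacement
from math import comb

def monomials(n, d):
    """Enumerate monomials of total degree <= d in n variables.

    Each monomial is a tuple of variable indices: (0, 1) means x0*x1,
    (1, 1) means x1**2, and the empty tuple () is the constant term.
    """
    terms = []
    for degree in range(d + 1):
        # multisets of size `degree` drawn from the n variable indices
        terms.extend(combinations_with_replacement(range(n), degree))
    return terms

def n_terms(n, d):
    """Closed-form count from EQ D4.2: C(n + d, d), constant included."""
    return comb(n + d, d)

if __name__ == "__main__":
    print(monomials(2, 2))                  # the six terms for n = 2, d = 2
    for n, d in [(2, 2), (100, 2), (100, 3)]:
        print(f"n={n:3d} d={d}: {n_terms(n, d)} terms")
```

For 100 inputs the formula gives 5151 terms at degree 2 and 176,851 at degree 3, which is why an expansion is usually followed by the selection step of §4.3.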
4.2 Datetime, text & aggregation features

Most real-world signal does not arrive as tidy numeric columns. It arrives as timestamps, free text, and one-to-many relationships between tables. Each demands its own family of feature transforms, and each is where domain knowledge pays off most.

Datetime

A raw timestamp is nearly useless to a model: as a single monotonically increasing integer it can only express "later". The information lives in its components (hour of day, day of week, month, is-weekend, is-holiday, days-since-last-event), extracted into separate features. The subtlety is that several of these are cyclical: hour 23 and hour 0 are adjacent, not maximally distant, yet a plain integer encoding tells the model they are 23 apart. The fix is a sine/cosine pair that wraps the cycle onto a circle.

EQ D4.3 — CYCLICAL ENCODING $$ x_{\sin} = \sin\!\left(\frac{2\pi\,t}{P}\right), \qquad x_{\cos} = \cos\!\left(\frac{2\pi\,t}{P}\right) $$ \(t\) is the position within the cycle (e.g. the hour, \(0\ldots23\)) and \(P\) its period (here \(24\)). The pair places each time on a unit circle, so the Euclidean distance between hour 23 and hour 0 is small, as it should be, while opposite hours sit far apart. One number cannot encode a cycle without a discontinuity; two can. Both features are needed: \(\sin\) alone is ambiguous (it gives the same value, zero, at 0:00 and 12:00), and the \(\cos\) component breaks the tie (\(+1\) at 0:00, \(-1\) at 12:00). Tree models, which split on thresholds, often do fine with the raw integer components and need this trick less than linear models and neural nets.

Text

Free text is turned into features along a spectrum of sophistication. The classical baseline is the bag of words / TF-IDF representation: count each term, then down-weight terms that appear in many documents so that common words contribute little and distinctive words contribute much.
EQ D4.4 — TF-IDF $$ \text{tfidf}(t, d) = \underbrace{\text{tf}(t, d)}_{\text{count in doc}} \times \underbrace{\log\!\frac{N}{1 + \text{df}(t)}}_{\text{inverse doc frequency}} $$ \(\text{tf}(t,d)\) is how often term \(t\) appears in document \(d\); \(\text{df}(t)\) is how many of the \(N\) documents contain it. A term in every document (\(\text{df}\approx N\)) gets an IDF near zero and is effectively ignored; a rare, document-specific term gets a large weight. The \(1+\) in the denominator avoids division by zero for unseen terms. TF-IDF is still a strong, cheap, fully interpretable baseline for classification; dense embeddings (Vol II) beat it on meaning but lose the per-term transparency that makes TF-IDF easy to debug. Simpler text features — length, digit count, punctuation ratio, sentiment lexicon hits — are often surprisingly predictive and cost almost nothing. Aggregation When the unit of prediction (a customer) maps to many rows in another table (their transactions), you must aggregate the many into features of the one: count, sum, mean, min, max, standard deviation, recency, and ratios of these over time windows. "Mean transaction value over the last 30 days," "number of distinct merchants this week," "ratio of this month's spend to the trailing-6-month average" — these grouped statistics are typically the most predictive features in churn, fraud, and recommendation systems, and they are exactly what automated tooling (featuretools' deep feature synthesis, modern feature stores) was built to manufacture and serve consistently between training and production. LEAKAGE Aggregation and time features are the two richest sources of target leakage. If an aggregate is computed over a window that includes the prediction moment — "average outcome for this customer," "total refunds including the one you are trying to predict" — your offline metric will be spectacular and your production model will fail. 
Every windowed feature must be computed strictly from information available before the prediction timestamp. The discipline is a point-in-time correct join: as of each event, use only rows that existed then. Leakage through aggregation is one of the most common reasons a model that "worked" in a notebook collapses on deployment.

PYTHON · RUNNABLE IN-BROWSER
# EQ D4.3: cyclical hour encoding keeps midnight next to 11pm.
import numpy as np

hours = np.arange(24)
P = 24
hs = np.sin(2*np.pi*hours/P)
hc = np.cos(2*np.pi*hours/P)

def dist(a, b, vec):                      # Euclidean distance in feature space
    return np.hypot(vec[0][a]-vec[0][b], vec[1][a]-vec[1][b])

cyc = (hs, hc)                            # sin/cos pair (2-D, on a circle)
print(f"{'pair':14s} {'raw dist':>8s} {'cyc dist':>8s}")
for a, b, name in [(23, 0, "23h -> 00h"), (0, 12, "00h -> 12h"), (6, 18, "06h -> 18h")]:
    # the raw-integer distance is just the absolute difference of the hours
    print(f"{name:14s} {abs(hours[a]-hours[b]):>8.2f} {dist(a, b, cyc):>8.3f}")
print("\nraw says 23h and 00h are 23 apart (max); cyclical says they are adjacent.")
print("00h -> 12h and 06h -> 18h are the true opposites -> largest cyclical distance.")
plot_xy(hs, hc)   # the 24 hours laid out on a circle (plot_xy is assumed to be a helper of the in-browser runtime, not part of NumPy)

4.3 Feature selection: filter, wrapper, embedded

§4.1 generates features by the hundred; §4.3 throws most of them away. Selection matters for three reasons that compound: fewer features means less overfitting (especially when \(p\) approaches or exceeds \(n\)), faster and cheaper models in training and serving, and, often most valuable, a model a human can actually read. The three families of methods trade compute against fidelity to the final model.
| Family | How it scores features | Cost | Blind spot |
|---|---|---|---|
| Filter | Univariate statistic vs the target (correlation, MI, χ², ANOVA F), model-agnostic | cheap | Judges each feature alone; misses interactions and redundancy |
| Wrapper | Train the model on candidate subsets, search for the best (forward, backward, RFE) | expensive | Combinatorial; prone to overfitting the search itself |
| Embedded | Selection happens inside training (L1/Lasso zeros weights; trees rank by gain) | moderate | Tied to one model family; unstable under collinearity |

Filter methods score every feature against the target independently and keep the top \(k\). They are very fast and a fine first pass, but their independence assumption is exactly their weakness: a filter ranks each feature in isolation, so it will happily keep ten copies of the same signal and discard a feature that is useless alone yet decisive in combination (in the XOR data of §4.1, \(x_1\) and \(x_2\) each have essentially zero univariate correlation with the label, yet together they determine it completely).

Wrapper methods close that gap by judging features through the actual model. Recursive feature elimination (RFE) is the canonical example: train the model on all features, drop the least important one, refit, and repeat until the target count remains. Because the model sees the features jointly at every step, RFE can keep a feature that matters only in combination (provided the model can express that combination) and discard the redundant copies, at the cost of training the model many times.

EQ D4.5 — RECURSIVE FEATURE ELIMINATION $$ S_0 = \{1,\dots,p\}, \qquad j^\star = \arg\min_{j \in S_t} \text{importance}_j\big(\text{fit on } S_t\big), \qquad S_{t+1} = S_t \setminus \{j^\star\} $$ Start with all \(p\) features. At each step, fit the model on the surviving set \(S_t\), find the feature \(j^\star\) the refit model ranks lowest, and remove it. Repeat until \(|S| = k\). Importance is whatever the model exposes: \(|\text{coefficient}|\) for a linear model, split-gain for a tree.
The recursion is the point: a feature that looks weak among all \(p\) may become essential once its redundant partners are gone, so importances are recomputed after every elimination rather than ranked once. RFE is \(O(p)\) model fits — far cheaper than the \(2^p\) of exhaustive subset search, far more faithful than a single filter pass. Embedded methods fold selection into the fit itself. L1 regularization (the Lasso) adds a penalty proportional to the sum of absolute weights; the geometry of that penalty drives many coefficients to exactly zero, performing selection and fitting in a single optimization. EQ D4.6 — LASSO: SELECTION BY L1 PENALTY $$ \hat{\beta} = \arg\min_{\beta}\ \tfrac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert $$ The squared-error loss pulls toward the least-squares fit; the L1 term \(\lambda\sum|\beta_j|\) pulls toward zero. Because the L1 ball has corners on the axes (unlike the round L2 ball of ridge regression), the optimum tends to land on an axis — i.e. with some \(\beta_j = 0\) exactly. Larger \(\lambda\) zeros more coefficients; sweep \(\lambda\) and you trace a selection path. L1 selects; L2 only shrinks. The caveat experts will raise: among a group of highly correlated features the Lasso tends to pick one arbitrarily and zero the rest, which is unstable — the elastic net (L1+L2) was invented precisely to tame that, and tree-based importances (§4.4) suffer a related instability. PYTHON · RUNNABLE IN-BROWSER # EQ D4.5: recursive feature elimination by hand on a linear model. # 3 of 12 features are real signal; RFE should recover exactly those 3. 
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 12, 3
X = rng.normal(0, 1, (n, p))
X /= X.std(0)                          # standardize so |coef| is comparable
true = [2, 5, 9]                       # the only features that matter
y = 3.0*X[:, 2] - 2.0*X[:, 5] + 1.5*X[:, 9] + 0.3*rng.normal(0, 1, n)

def ridge_coef(Xs, y, lam=1.0):        # closed-form ridge => stable importances
    A = Xs.T @ Xs + lam*np.eye(Xs.shape[1])
    return np.linalg.solve(A, Xs.T @ y)

kept = list(range(p))
while len(kept) > k:
    w = ridge_coef(X[:, kept], y)
    drop = int(np.argmin(np.abs(w)))   # j*: smallest |coef| in the refit model
    print(f"have {len(kept):2d} -> drop original feature #{kept[drop]:2d} (|coef|={abs(w[drop]):.3f})")
    kept.pop(drop)

print("\nRFE kept:", sorted(kept))
print("truth:", true)
print("match:", sorted(kept) == true)

INSTRUMENT D4.2 — FEATURE-SELECTION RACE (filter vs wrapper vs L1 · 5 signal features + noise). Controls: noise features (shown at 25); signal strength (shown at 1.8). Readout: recall, the number of true features recovered in each method's top 5 (5 = perfect), for FILTER (|corr|), WRAPPER (RFE) and EMBEDDED (L1).

Five true features drive the target; the rest are pure noise. Each method picks its top 5; the bars show how many of the real ones it recovered. With strong signal and little noise all three score 5/5. Drive NOISE FEATURES up and SIGNAL STRENGTH down: the univariate filter degrades first, because it cannot tell a true feature from a noise feature that happens to correlate by chance, while the model-aware RFE and L1 hold on longer. This is the cheap-vs-faithful trade-off made visible.

4.4 Importance & redundancy: MI, correlation, VIF

"Is this feature useful?" splits into two distinct questions that beginners conflate. Importance: how much does this feature tell me about the target? Redundancy: how much of this feature is already told by the others? You want features that score high on the first and low on the second: informative and non-overlapping. Three measures cover the ground.
Correlation is the cheap importance measure, but it sees only linear association (Vol & DATA 03). A feature with a perfect quadratic relationship to the target can have correlation zero. Mutual information fixes this: it measures any statistical dependence, linear or not, in bits. EQ D4.7 — MUTUAL INFORMATION $$ I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} \;=\; H(Y) - H(Y \mid X) $$ MI is zero if and only if \(X\) and \(Y\) are statistically independent, and it grows with any form of dependence — capturing the curved relationships correlation is blind to. The second form reads it as the reduction in the uncertainty (entropy) of \(Y\) once you know \(X\): how many bits the feature buys you about the target. MI catches non-linear importance that correlation misses; the price is that estimating it from continuous data needs binning or a \(k\)-nearest-neighbour estimator, and noisy estimates can over-rank features in small samples. Used as a filter score, MI is strictly more general than correlation. Redundancy is the other axis, and it has its own canonical diagnostic. When several features are linear combinations of one another — multicollinearity — a linear model can still predict fine, but its coefficients become unstable and uninterpretable: the model cannot decide how to split credit between the duplicates, so tiny data changes swing the weights wildly (and sometimes flip their signs). The variance inflation factor (VIF) quantifies exactly how badly each feature is explained by the rest. EQ D4.8 — VARIANCE INFLATION FACTOR $$ \text{VIF}_j = \frac{1}{1 - R_j^2}, \qquad R_j^2 = \text{the } R^2 \text{ from regressing feature } x_j \text{ on all the other features.} $$ Regress feature \(x_j\) on every other feature and read off the \(R_j^2\). If the others explain none of \(x_j\) (\(R_j^2 = 0\)) then \(\text{VIF}_j = 1\) — no inflation. 
As \(R_j^2 \to 1\) the feature becomes a near-perfect combination of the others and \(\text{VIF}_j \to \infty\). The name is literal: the variance of the estimated coefficient \(\hat\beta_j\) is multiplied by exactly \(\text{VIF}_j\) relative to the no-collinearity case. Rules of thumb: VIF > 5 warrants a look, VIF > 10 signals serious collinearity — though these thresholds are conventions, not laws, and high VIF only hurts coefficient interpretation, not pure predictive accuracy. At \(R_j^2 = 0.8\), \(\text{VIF}_j = 1/(1-0.8) = 5\): the borderline case. Regressing feature \( x_j \) on all the other features gives \( R_j^2 = 0.8 \) — the rest of the design explains 80% of its variance. What is its variance inflation factor \( \text{VIF}_j \)? By EQ D4.8, \( \text{VIF}_j = \dfrac{1}{1 - R_j^2} = \dfrac{1}{1 - 0.8} = \dfrac{1}{0.2} = \) 5. The coefficient's variance is inflated 5×, and \( R_j^2 = 0.8 \) sits right at the conventional "warrants a look" threshold — a feature this redundant is a prime candidate to drop or combine. PYTHON · RUNNABLE IN-BROWSER # EQ D4.8: variance inflation factor, computed directly from R^2. 
import numpy as np

rng = np.random.default_rng(0)
n = 600
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = 0.9*x1 + 0.1*rng.normal(0, 1, n)   # x3 is almost a copy of x1 -> high VIF
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3 (~x1)"]

def vif(X, j):                          # regress column j on the others, read R^2
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2), r2

for j, nm in enumerate(names):
    v, r2 = vif(X, j)
    flag = "  (high)" if v > 5 else ""  # mark features past the VIF > 5 rule of thumb
    print(f"{nm:10s} R^2={r2:5.3f} VIF={v:6.2f}{flag}")
print("\nx1 and x3 inflate each other; x2 is independent and sits near VIF=1.")
print("check: R^2=0.80 -> VIF = 1/(1-0.80) =", round(1/(1-0.80), 2))

INSTRUMENT D4.3 — MULTICOLLINEARITY / VIF EXPLORER (two features · tune their correlation · EQ D4.8). Control: corr(x₁, x₂) = ρ (shown at 0.80). Readouts: R² (x₁ ~ x₂), VIF = 1/(1−R²), verdict.

Two features with a tunable correlation ρ. With two features, \(R^2 = \rho^2\) exactly, so VIF \(= 1/(1-\rho^2)\), the curve plotted on the canvas. Slide ρ up: at ρ = 0 the cloud is round and VIF = 1 (no inflation); at ρ = 0.80, R² = 0.64 and VIF ≈ 2.8; past ρ ≈ 0.89 you cross VIF = 5 and the verdict flips to the warning zone; as ρ → 1 the cloud collapses to a line and VIF blows up toward infinity. The reading is the coefficient-variance multiplier the collinearity is costing you.

4.5 Selection bias & nested cross-validation

Here is one of the most expensive mistakes in applied machine learning, and it is committed daily by people who know better. You have 10,000 features and 200 samples. You score every feature against the target on the full dataset, keep the 20 that correlate best, then run cross-validation on those 20, and report a beautiful cross-validated accuracy. The number is a fiction.
You have already let the test folds influence which features survive, so every fold's "held-out" data was used to choose the model. This is feature-selection bias, and with enough noise features it can manufacture impressive cross-validated accuracy out of pure noise.

EQ D4.9 — WHY SELECTION-ON-ALL-DATA LEAKS $$ \widehat{\text{acc}}_{\text{biased}} = \text{CV}\Big(\text{model} \,\big|\, \underbrace{\text{features chosen using all } (X, y)}_{\text{test folds already seen}}\Big) \;\gg\; \widehat{\text{acc}}_{\text{honest}} $$ The selection step is part of the model-fitting procedure, so it must live inside the cross-validation loop, not before it. Pick features on the full data and the labels of every eventual test fold have leaked into the choice; the CV estimate is then biased upward, sometimes wildly. With \(p \gg n\) and only noise, selecting the top-\(k\) "best" features on all the data and then cross-validating can report accuracy far above chance for data with no signal whatsoever. The rule is absolute: every data-dependent decision (imputation statistics, scaling parameters, feature selection, hyperparameters) must be fit on the training portion of each fold alone.

The fix has two layers. First, put feature selection inside the cross-validation: each fold selects its own features from its own training data, and the held-out fold judges that whole pipeline honestly. Second, when you are also tuning something (which \(k\), which \(\lambda\)), you need nested cross-validation: an inner loop selects features and tunes hyperparameters, an outer loop estimates the performance of that entire selection-and-tuning procedure. The inner loop never sees the outer loop's held-out fold.

THE RULE Anything you learn from the data is part of the model and must be cross-validated as a unit. If a step learns anything from the data, whether by looking at \(y\) (selecting features, tuning \(\lambda\)) or only at \(X\) (fitting an imputer's means, choosing a scaling), it belongs inside the resampling loop.
Fit it once on the whole dataset "to save time" and you have leaked the test set into training. The honest pipeline is more code and a smaller, truer number; the biased one is less code and a lie. Nested CV is simply this rule applied twice: once for selection/tuning (inner), once for honest performance estimation (outer).

PYTHON · RUNNABLE IN-BROWSER
# EQ D4.9: selection bias on PURE NOISE. There is no signal at all,
# yet selecting features on all the data fakes high CV accuracy.
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 120, 4000, 20                    # p >> n: a leakage trap
X = rng.normal(0, 1, (n, p))
y = (rng.random(n) > 0.5).astype(float)    # label is a COIN FLIP -- zero signal

def cv_acc(Xs, y, folds=4):
    idx = np.array_split(rng.permutation(len(y)), folds); accs = []
    for f in range(folds):
        te = idx[f]; tr = np.concatenate([idx[g] for g in range(folds) if g != f])
        w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), Xs[tr]]), y[tr]-0.5, rcond=None)[0]
        pred = (np.column_stack([np.ones(len(te)), Xs[te]]) @ w > 0)
        accs.append((pred == (y[te] > 0.5)).mean())
    return np.mean(accs)

# BIASED: rank features on ALL rows, then cross-validate the survivors.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
top = np.argsort(corr)[-k:]                # chosen with every test fold's labels in view
print(f"BIASED (select on all data):  CV acc = {cv_acc(X[:, top], y):.3f} (inflated by the leak)")

# HONEST: repeat the selection inside each fold, on that fold's training rows only.
idx = np.array_split(rng.permutation(n), 4); acc_h = []
for f in range(4):
    te = idx[f]; tr = np.concatenate([idx[g] for g in range(4) if g != f])
    c = np.array([abs(np.corrcoef(X[tr, j], y[tr])[0, 1]) for j in range(p)])
    sel = np.argsort(c)[-k:]               # features picked without seeing the test fold
    w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), X[tr][:, sel]]), y[tr]-0.5, rcond=None)[0]
    pred = (np.column_stack([np.ones(len(te)), X[te][:, sel]]) @ w > 0)
    acc_h.append((pred == (y[te] > 0.5)).mean())
print(f"HONEST (select inside folds): CV acc = {np.mean(acc_h):.3f} (~0.50, the truth)")

NEXT Good features and honest selection assume your classes are balanced enough to learn from. They often are not: fraud, disease, and defaults are rare by definition, and a 99%-accurate model that always predicts "no" is worthless. Chapter 05, Imbalanced Data, covers resampling (SMOTE and friends), class weighting, threshold moving, and the precision/recall-based metrics that tell the truth when accuracy lies.

4.R References

Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection.
Journal of Machine Learning Research 3 — the canonical survey of filter, wrapper and embedded selection (§4.3). Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press / open web edition — interactions, encodings, resampling-aware selection and leakage (§4.1–4.5). Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58(1) — the L1 penalty that performs embedded selection (EQ D4.6). Ambroise, C. & McLachlan, G. J. (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. PNAS 99(10) — the definitive demonstration of feature-selection bias and why selection must sit inside cross-validation (§4.5). Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46 — recursive feature elimination, EQ D4.5. Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B 67(2) — the L1+L2 fix for Lasso's instability under collinearity (EQ D4.6 note). Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Physical Review E 69(6) — the k-nearest-neighbour estimator behind practical MI feature scores (EQ D4.7). ← PREVIOUS 03 Encoding & Scaling NEXT CHAPTER 05 Imbalanced Data AI // ENCYCLOPEDIA — DATA · CH 04 FULL CONTENTS ↗ ## DATA · Imbalanced Data (https://ai-encyclopedia.com/data/05-imbalanced.html) Imbalanced Data — Resampling & SMOTE — AI Encyclopedia AI // ENCYCLOPEDIA / DATA / 05 / IMBALANCED DATA INDEX NEXT: LEARNING FROM DATA → DATA & FEATURE ENGINEERING · CHAPTER 05 / 05 Imbalanced Data — Resampling & SMOTE When 1 case in 1000 is the one that matters, as with fraud, disease, or default, accuracy stops being informative. Accuracy misleads under imbalance, and the rebalancing method you choose determines what the model learns. 
This chapter moves from the failure of naive training through resampling, SMOTE and its descendants, loss-level fixes such as class weights and focal loss, and the metrics that remain meaningful at a 99:1 split. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DATA 04 · ML 03 INSTRUMENTS IMBALANCE · SMOTE · THRESHOLD IN THIS CHAPTER 5.1 Why imbalance breaks training 5.2 Resampling 5.3 SMOTE & variants 5.4 Algorithm-level fixes 5.5 Evaluating under imbalance 5.R References 5.1 Why imbalance breaks training & metrics A dataset is imbalanced when one class vastly outnumbers another. The ratio is not a curiosity — it is the whole problem. Credit-card fraud runs near 1 transaction in 1,000; a screening test for a rare cancer might see 1 case in 10,000; a churn flag fires for a few percent of users. In every case the class you actually care about is the rare one, and the loss function — left to its own devices — barely notices it exists. Start with the metric everyone reaches for. Accuracy is the fraction of predictions that are correct, and on imbalanced data it is worse than useless — it is actively misleading. Consider the majority-class baseline: a "model" that ignores its input and always predicts the common class. EQ D5.1 — THE ACCURACY TRAP $$ \text{Acc}_{\text{majority}} \;=\; \frac{N_{\text{maj}}}{N_{\text{maj}} + N_{\text{min}}} \;=\; 1 - \pi, \qquad \pi \;=\; \frac{N_{\text{min}}}{N} $$ \(\pi\) is the minority prevalence — the base rate of the positive class. A constant predictor that always says "majority" scores \(1-\pi\) accuracy while detecting nothing. At \(\pi = 0.001\) it reads 99.9% accurate; at \(\pi = 0.05\), 95%. The number is real and the model is useless — accuracy measures the imbalance, not the model. The honest signals are recall (of the real positives, how many did you catch?) and precision (of your alarms, how many were real?), defined in §5.5. 
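A few lines of plain Python make EQ D5.1 tangible. This is a sketch under assumed numbers (10,000 rows at a prevalence of 1%), and the function name `score_constant_majority` is hypothetical; it scores a predictor that always outputs the majority class and reports accuracy next to recall and precision.

```python
def score_constant_majority(n_total, n_minority):
    """Score a predictor that always outputs the majority (negative) class."""
    y_true = [1] * n_minority + [0] * (n_total - n_minority)
    y_pred = [0] * n_total                      # ignores its input entirely

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / n_total              # equals 1 - pi for this predictor
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None   # None: no alarms were raised
    return accuracy, recall, precision

if __name__ == "__main__":
    acc, rec, prec = score_constant_majority(n_total=10_000, n_minority=100)
    print(f"accuracy={acc:.3f}  recall={rec}  precision={prec}")
```

Accuracy comes out as \(1 - \pi\) whatever the data look like, while recall is zero and precision is undefined, which is the trap stated in EQ D5.1.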
WORKED EXAMPLE ▾ 01 A fraud set: \(N = 100{,}000\) transactions, of which \(N_{\text{min}} = 100\) are fraudulent. So \(\pi = 100 / 100{,}000 = 0.001\). 02 The always-legitimate predictor is correct on all 99,900 legit rows and wrong on all 100 frauds: accuracy \(= 99{,}900 / 100{,}000 = 0.999\). 03 Its recall on fraud is \(0/100 = 0\): it has never caught a single case. Precision is undefined (no positive predictions). Accuracy applauds; the bank is robbed. 04 To beat 99.9% accuracy a real model must make almost no false alarms — yet catching frauds inevitably costs some. This is why accuracy is the wrong objective here: it punishes the very behavior you want. RESULT: 99.9% accurate, 0% recall — the trap in one line A dataset has a 95:5 class split (95% negative, 5% positive). A model that always predicts the majority (negative) class achieves what accuracy? (Give a decimal.) By EQ D5.1, accuracy \(= N_{\text{maj}}/N = 95/100 = \) 0.95. The constant predictor scores 95% while catching zero positives — which is exactly why accuracy cannot be trusted under imbalance. The damage runs deeper than the scorecard. Most classifiers are trained by minimizing an average loss over examples (cross-entropy, Vol I · EQ M3.3). With 999 majority examples for every minority one, the gradient is dominated by the easy majority: the model can drive total loss down by becoming an excellent detector of the common class and a blind one for the rare class. The decision boundary is pushed into the minority region — the cheapest way to shave the average loss is to misclassify the few. Imbalance is therefore not just an evaluation headache; it is an optimization bias baked into the objective. The instrument below makes this concrete. Dial the minority ratio down and watch accuracy march toward 100% while recall on the rare class collapses — the model has stopped learning the thing you built it for. 
INSTRUMENT D5.1 — IMBALANCE PLAYGROUND TWO GAUSSIAN CLOUDS · LOGISTIC FIT · EQ D5.1 MINORITY SHARE π 5.0% TRAINING DATA AS-IS OVERSAMPLE UNDERSAMPLE ACCURACY — RECALL (MINORITY) — MAJORITY BASELINE ACC — Mint = minority (the class that matters), blue = majority; the white line is the fitted boundary. Drag π toward 0.5%: accuracy climbs past 99% as the boundary swallows the minority cloud and recall craters — the model is acing the wrong test. Now switch to OVERSAMPLE or UNDERSAMPLE and watch the boundary swing back to bisect the clouds: accuracy dips, recall jumps. Rebalancing trades a meaningless metric for a meaningful one. 5.2 Resampling — random over- and under-sampling The simplest cure operates on the data, before any model sees it: change the class proportions so the loss can no longer ignore the minority. Two opposite moves achieve the same balanced ratio. Random over-sampling (ROS). Duplicate minority examples (sampling with replacement) until the classes match. Keeps all majority information, but the copies are exact — the model can memorize them, inflating training scores and inviting overfitting to the few real minority points. Random under-sampling (RUS). Discard majority examples until the classes match. Fast, light, and a strong baseline — but it throws away potentially useful majority data, which hurts when the majority class is itself varied or the dataset is small. To reach a target minority share \(\rho\) (with \(\rho = 0.5\) meaning a balanced 1:1 set) by over-sampling, the minority class must be grown to match. The arithmetic is worth internalizing because every resampling library is doing exactly this under the hood: EQ D5.2 — RESAMPLING TO A TARGET RATIO $$ N_{\text{min}}^{\text{target}} \;=\; \frac{\rho}{1-\rho}\, N_{\text{maj}}, \qquad \text{(1:1 balance)} \;\;\rho = \tfrac12 \;\Rightarrow\; N_{\text{min}}^{\text{target}} = N_{\text{maj}} $$ \(\rho\) is the desired minority fraction of the resampled set. 
Over-sampling duplicates the minority up to \(N_{\text{min}}^{\text{target}}\) (total grows); under-sampling instead cuts the majority down to \(\frac{1-\rho}{\rho} N_{\text{min}}\) (total shrinks). Cardinal rule: resample the training fold only. Touching validation or test data — or resampling before the train/test split — leaks information and manufactures fictional scores. The held-out set must keep the real-world prevalence \(\pi\), because that is the distribution your model will actually face. A training fold has 50 minority and 950 majority examples. You over-sample the minority up to 950 (a 1:1 balance). What is the minority's share of the resampled set ? (Give a decimal.) After over-sampling, minority \(= 950\) and majority \(= 950\), so the new total is \(950 + 950 = 1900\). Minority share \(= 950 / 1900 = \) 0.5. Note the original prevalence was \(50/1000 = 0.05\); over-sampling to 1:1 has moved it to exactly one half — by construction (EQ D5.2 with \(\rho = \tfrac12\)). Resampling does not add information. Duplicating a point tells the model nothing it did not already know; it only reweights how loudly that point speaks in the loss — which is mathematically close to the class-weighting of §5.4. The honest framing: resampling and reweighting both move the effective prevalence the optimizer sees, nudging the decision threshold without changing the underlying separability of the classes. That realization is what motivates SMOTE — a way to add genuinely new minority points instead of mere copies. 
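The cardinal rule (split first, resample the training fold only) can be sketched as follows. This is an illustration under stated assumptions: a plain random 80/20 split, synthetic labels at roughly 5% prevalence, and index-level over-sampling; none of the variable names come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic labels at ~5% prevalence; features are omitted to keep the sketch short
y = (rng.random(2000) < 0.05).astype(int)
idx = rng.permutation(len(y))

# 1) split FIRST, on the untouched data
n_test = len(y) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 2) resample the TRAINING indices only (random over-sampling to 1:1)
train_min = train_idx[y[train_idx] == 1]
train_maj = train_idx[y[train_idx] == 0]
extra = rng.choice(train_min, size=len(train_maj) - len(train_min), replace=True)
train_bal = np.concatenate([train_idx, extra])

print(f"train minority share after over-sampling: {y[train_bal].mean():.3f}")
print(f"test minority share (left at real prevalence): {y[test_idx].mean():.3f}")
print("rows shared between train and test:", len(np.intersect1d(train_bal, test_idx)))
```

Because the duplicates are drawn from training indices only, no copy of a test row can appear in the training fold, and the test fold keeps the prevalence the model will face in deployment.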
PYTHON · RUNNABLE IN-BROWSER
# EQ D5.2: random over- vs under-sampling to a 1:1 balance
import numpy as np
rng = np.random.default_rng(0)

# a 95:5 training fold: 950 majority (label 0), 50 minority (label 1)
maj = rng.normal(0.0, 1.0, (950, 2))
mn = rng.normal(2.4, 1.0, (50, 2))
X = np.vstack([maj, mn]); y = np.array([0]*950 + [1]*50)
print(f"before: maj={np.sum(y==0):4d} min={np.sum(y==1):4d} "
      f"min-share={np.mean(y==1):.3f}")

# random OVER-sampling: duplicate minority (with replacement) up to majority count
idx_min = np.where(y == 1)[0]
extra = rng.choice(idx_min, size=950 - 50, replace=True)  # 900 duplicates
Xo, yo = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
print(f"oversampled: maj={np.sum(yo==0):4d} min={np.sum(yo==1):4d} "
      f"min-share={np.mean(yo==1):.3f} (rho=0.5, EQ D5.2)")

# random UNDER-sampling: keep all 50 minority, randomly keep 50 majority
keep_maj = rng.choice(np.where(y == 0)[0], size=50, replace=False)
Xu = np.vstack([X[keep_maj], X[idx_min]]); yu = np.array([0]*50 + [1]*50)
print(f"undersampled: maj={np.sum(yu==0):4d} min={np.sum(yu==1):4d} "
      f"min-share={np.mean(yu==1):.3f} (kept only 100 of 1000 rows)")
RUN ▶ edits are live — try a different rho by changing the target counts

5.3 SMOTE & variants
Random over-sampling copies points; SMOTE — Synthetic Minority Over-sampling Technique (Chawla et al., 2002) — invents them. Instead of duplicating a minority example, it draws a brand-new point along the line segment connecting that example to one of its minority near neighbors. The result is a denser, smoother minority region rather than a stack of identical copies, which forces the classifier to carve out broader minority territory instead of memorizing isolated dots.
EQ D5.3 — SMOTE INTERPOLATION $$ x_{\text{new}} \;=\; x_i \;+\; \lambda \,\bigl(x_{nn} - x_i\bigr), \qquad \lambda \sim \mathcal{U}(0,1), \qquad x_{nn} \in \text{kNN}_{\text{min}}(x_i) $$ \(x_i\) is a minority example; \(x_{nn}\) is one of its \(k\) nearest minority neighbors (typically \(k = 5\)), chosen at random; \(\lambda\) is a uniform random step along the segment between them. \(\lambda = 0\) returns \(x_i\); \(\lambda = 1\) lands on the neighbor; in between you get a convex blend — a new, plausible minority point. The synthetic point lives inside the convex hull of the minority class, never extrapolating outside it. Caveat: in regions where minority and majority overlap, SMOTE happily interpolates across the gap and plants synthetic points in majority territory — which is exactly what the Borderline and ADASYN variants try to fix. WORKED EXAMPLE ▾ 01 Two 1-D minority points: \(x_i = 2\), and a chosen neighbor \(x_{nn} = 6\). The gap is \(x_{nn} - x_i = 4\). 02 Draw \(\lambda = 0.25\). The synthetic point is \(x_{\text{new}} = 2 + 0.25 \times 4 = 2 + 1 = 3\). 03 Draw \(\lambda = 0.75\) instead: \(x_{\text{new}} = 2 + 0.75 \times 4 = 5\). Every \(\lambda\) yields a different point on the segment \([2, 6]\) — never outside it. 04 In 2-D the same formula runs componentwise: with \(x_i = (2, 1)\), \(x_{nn} = (6, 5)\), \(\lambda = 0.25\) gives \((3, 2)\). The new point sits a quarter of the way along the connecting line. RESULT: λ = 0.25 between 2 and 6 → x_new = 3 SMOTE picks a minority point \(x_i = 2\), a neighbor \(x_{nn} = 6\), and draws \(\lambda = 0.25\). By EQ D5.3, what is the synthetic point \(x_{\text{new}}\)? \(x_{\text{new}} = x_i + \lambda(x_{nn} - x_i) = 2 + 0.25\,(6 - 2) = 2 + 0.25\times 4 = 2 + 1 = \) 3. The new point lies a quarter of the way from \(x_i\) toward its neighbor — inside the segment, never beyond it. Plain SMOTE treats every minority point equally. 
Its two most-used descendants spend their synthetic budget where it helps most — near the decision boundary, where errors actually happen:

Variant | Where it synthesizes | Intuition
SMOTE | uniformly across all minority points | Densifies the whole minority region; simple, strong default.
Borderline-SMOTE | only from minority points near the boundary | A point is "in danger" if most of its neighbors are majority; reinforce exactly those frontier cases.
ADASYN | more for minority points that are harder to learn | Generate in proportion to the share of majority points among each minority point's neighbors, so synthetic mass goes where the minority is most outnumbered.

Honest caveats. SMOTE assumes the space between two minority neighbors is itself minority — true for smooth, continuous features, false for categorical ones (use SMOTE-NC) and shaky in high dimensions, where "near neighbor" loses meaning and interpolation can land in nonsense regions. It can amplify noise (a mislabeled minority point spawns a cluster of synthetic noise) and, by design, blurs the boundary in overlapping classes. Modern practice often pairs it with a cleaning step: SMOTE-Tomek removes Tomek links (nearest-neighbor pairs from opposite classes), and SMOTE-ENN removes points whose label disagrees with most of their nearest neighbors. And on large deep-learning problems, loss-level fixes (§5.4) frequently beat resampling outright. SMOTE is a sharp tool, not a magic wand.

INSTRUMENT D5.2 — SMOTE VISUALIZER
EQ D5.3 · k-NN INTERPOLATION · SEEDED
Controls: MINORITY NEIGHBORS k (5) · SYNTHETIC POINTS (60). Readouts: REAL MINORITY · SYNTHETIC (SMOTE) · EFFECTIVE MIN-SHARE.
Solid mint dots are the 14 real minority examples; faint blue dots are majority; hollow mint dots are synthetic points, each drawn on a segment between a real minority point and one of its k neighbors (the thin connecting line shows the parent pair). Raise k and the synthetic cloud reaches farther between sub-clusters; raise the count and the minority region fills in. Watch the effective minority share climb toward balance — without a single duplicated point.
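A hedged sketch of the "in danger" test behind Borderline-SMOTE, following the rule as we read it in Han et al. (2005): look at each minority point's m nearest neighbors in the whole training set, treat it as noise if all of them are majority, as in danger if at least half (but not all) are majority, and as safe otherwise. The function name and the 1-D toy data are illustrative assumptions.

```python
import numpy as np

def borderline_labels(X, y, m=5):
    """Label each minority point 'safe', 'danger' or 'noise' from its m nearest neighbors."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))   # pairwise distances, all points
    labels = {}
    for i in np.where(y == 1)[0]:
        nn = np.argsort(D[i])[1:m + 1]                   # skip the point itself
        n_maj = int((y[nn] == 0).sum())                  # majority neighbors among the m
        if n_maj == m:
            labels[int(i)] = "noise"                     # surrounded by majority
        elif n_maj >= m / 2:
            labels[int(i)] = "danger"                    # frontier case
        else:
            labels[int(i)] = "safe"                      # deep inside minority territory
    return labels

# toy 1-D data: four majority points, one stray minority point inside them (1.4),
# one minority point at the frontier (3.6), and a clean minority cluster near 10-12
X = np.array([[0.0], [1.0], [2.0], [3.0], [1.4], [3.6], [10.0], [10.5], [11.0], [11.6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(borderline_labels(X, y, m=3))
```

A Borderline-style sampler would then run the EQ D5.3 interpolation only from the points labeled danger, leaving the safe interior and the suspected noise alone.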
PYTHON · RUNNABLE IN-BROWSER
# SMOTE in pure numpy: interpolate between minority k-NN (EQ D5.3)
import numpy as np
rng = np.random.default_rng(1)

# a 90:10 fold: 90 majority, 10 minority, 2 features
maj = rng.normal(0.0, 1.0, (90, 2))
mn = rng.normal(2.6, 0.7, (10, 2))
X, y = np.vstack([maj, mn]), np.array([0]*90 + [1]*10)
P = X[y == 1]  # minority points only

def smote(P, n_new, k=5):
    out = []
    D = np.sqrt(((P[:, None] - P[None]) ** 2).sum(-1))  # pairwise distances
    for _ in range(n_new):
        i = rng.integers(len(P))                 # a random minority point
        nn = np.argsort(D[i])[1:k+1]             # its k nearest minority neighbors
        j = nn[rng.integers(len(nn))]            # pick one neighbor
        lam = rng.random()                       # lambda ~ U(0,1)
        out.append(P[i] + lam * (P[j] - P[i]))   # the interpolated synthetic point
    return np.array(out)

S = smote(P, n_new=80, k=5)
before = y.mean()
after = (y.sum() + len(S)) / (len(y) + len(S))
print(f"minority before SMOTE: {y.sum():2d} / {len(y)} = {before:.3f}")
print(f"synthetic generated: {len(S)}")
print(f"minority after SMOTE: {y.sum()+len(S):2d} / {len(y)+len(S)} = {after:.3f}")
inside = bool((S.min(0) >= P.min(0)).all() and (S.max(0) <= P.max(0)).all())
print("every synthetic point sits inside the real-minority box:", inside)
plot_scatter(np.r_[X[:,0], S[:,0]], np.r_[X[:,1], S[:,1]],
             np.r_[y, np.full(len(S), 2)])  # 0 maj, 1 real-min, 2 synthetic
RUN ▶ edits are live — set k=1 (nearest only) or push n_new to 200

5.4 Algorithm-level fixes — class weights, focal loss, threshold moving
Resampling rewrites the data; the alternative is to leave the data alone and rewrite the objective. Three loss- and decision-level levers do this without touching a single row.

Class weights (cost-sensitive learning)
Scale each example's contribution to the loss by a class-dependent weight, so a minority mistake costs more than a majority one.
The standard inverse-frequency weighting gives each class influence proportional to its rarity: EQ D5.4 — WEIGHTED CROSS-ENTROPY $$ \mathcal{L} \;=\; -\frac{1}{N}\sum_{i=1}^{N} w_{y_i}\,\log p_{i,\,y_i}, \qquad w_c \;=\; \frac{N}{C\,N_c} $$ \(w_c\) is the weight for class \(c\), \(N_c\) its count, \(C\) the number of classes; the formula (scikit-learn's class_weight="balanced") makes each class contribute equally to the total loss in expectation. A class \(10\times\) rarer gets \(\sim\!10\times\) the per-example weight. This is the loss-level twin of over-sampling — both inflate the minority's voice in the gradient — but it adds no rows and no duplicates, so it is cheaper and overfits less. It moves the effective decision threshold toward the minority class, trading precision for recall. A binary problem has \(N = 1000\) examples: 950 majority and 50 minority (\(C = 2\) classes). Using balanced weighting \(w_c = N/(C\,N_c)\), what weight does the minority class receive? \(w_{\text{min}} = \dfrac{N}{C\,N_{\text{min}}} = \dfrac{1000}{2 \times 50} = \dfrac{1000}{100} = \) 10. Each minority example counts ten times as heavily in the loss as it would unweighted — and the majority gets \(1000/(2\times 950) \approx 0.53\), so the two classes contribute equally overall. Focal loss Class weights up-weight a whole class; focal loss (Lin et al., 2017, for dense object detection) up-weights the hard examples within it — the ones the model still gets wrong — and lets the easy, already-correct majority examples fade from the gradient automatically: EQ D5.5 — FOCAL LOSS $$ \mathrm{FL}(p_t) \;=\; -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t, \qquad p_t \;=\; \begin{cases} p & y = 1 \\ 1 - p & y = 0 \end{cases} $$ \(p_t\) is the probability assigned to the true class; \(\alpha_t\) is an optional class weight as in EQ D5.4; \(\gamma \ge 0\) is the focusing parameter. 
The modulating factor \((1-p_t)^{\gamma}\) is the whole idea: for a well-classified example (\(p_t \to 1\)) it \(\to 0\), nearly deleting that example's gradient; for a hard one (\(p_t\) small) it stays near 1. At \(\gamma = 0\) focal loss is exactly cross-entropy; the paper used \(\gamma = 2\). The effect: a flood of easy majority examples no longer drowns out the rare, hard minority ones — imbalance is handled inside the loss, no resampling required. An easy example is classified with \(p_t = 0.9\). Using focal loss with \(\gamma = 2\), what is the modulating factor \((1 - p_t)^{\gamma}\) that scales its loss? \((1 - p_t)^{\gamma} = (1 - 0.9)^2 = (0.1)^2 = \) 0.01. This easy example's contribution to the loss is cut to 1% of its cross-entropy value — so the gradient budget flows to the hard cases instead. A hard example at \(p_t = 0.1\) keeps a factor of \((0.9)^2 = 0.81\), almost untouched. Threshold moving The cheapest fix of all changes nothing about training. A probabilistic classifier outputs \(p = P(y=1 \mid x)\); the default rule "predict positive if \(p > 0.5\)" is a convention, not a law. Under imbalance — or under asymmetric costs, where a missed fraud dwarfs a false alarm — the optimal cut sits elsewhere. Sweep the threshold \(\tau\) and you trace the entire precision/recall trade-off from a single trained model: EQ D5.6 — COST-OPTIMAL THRESHOLD $$ \hat{y} = \mathbb{1}\!\left[\,p > \tau\,\right], \qquad \tau^{\star} \;=\; \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}} \quad \text{(Bayes-optimal cut for costs } C_{\text{FP}},\, C_{\text{FN}}) $$ Lower \(\tau\) below 0.5 to catch more positives (recall ↑, precision ↓); raise it to flag only the confident ones (precision ↑, recall ↓). 
The Bayes-optimal \(\tau^{\star}\) depends only on the relative cost of a false positive versus a false negative: if missing a fraud is 9× costlier than a false alarm (\(C_{\text{FN}} = 9, C_{\text{FP}} = 1\)), then \(\tau^{\star} = 1/(1+9) = 0.1\) — flag anything over 10% probability. Threshold moving and proper probability calibration together often recover most of what resampling promised, with none of its risks. INSTRUMENT D5.3 — THRESHOLD & COST EXPLORER EQ D5.6 · 1000 SCORED CASES · 95:5 PREVALENCE THRESHOLD τ 0.50 COST OF A MISS (FN) 10× PRECISION — RECALL — TOTAL COST (FP + c·FN) — The two curves are precision (mint) and recall (blue) as the threshold sweeps; the white line is your current τ. The dashed mint marker is the cost-optimal cut \(\tau^{\star} = 1/(1 + c)\) from EQ D5.6. Slide τ left and recall rises while precision falls; raise the miss-cost c and watch \(\tau^{\star}\) march left — when a miss costs 10× a false alarm, the optimal threshold drops to 0.09. The "TOTAL COST" readout is minimized near that marker, not at 0.5. 
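EQ D5.6 can be checked with a few lines. The sketch assumes the score p is a calibrated probability and that correct decisions cost nothing; the function names are illustrative.

```python
def tau_star(c_fp, c_fn):
    """EQ D5.6: flag a case when p exceeds C_FP / (C_FP + C_FN)."""
    return c_fp / (c_fp + c_fn)

def expected_costs(p, c_fp, c_fn):
    """Expected cost of each action for a case whose calibrated probability is p."""
    flag = (1 - p) * c_fp     # pay C_FP if the case turns out to be negative
    clear = p * c_fn          # pay C_FN if the case turns out to be positive
    return flag, clear

# at p = tau* the two actions cost the same; above it, flagging is cheaper
for c_fn in (1, 9, 10):
    t = tau_star(1, c_fn)
    flag, clear = expected_costs(t, 1, c_fn)
    print(f"C_FN={c_fn:>2}  tau*={t:.3f}  cost(flag)={flag:.3f}  cost(clear)={clear:.3f}")
```

The threshold is simply the probability at which the expected cost of flagging equals the expected cost of clearing, which is why only the ratio of the two costs matters.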
PYTHON · RUNNABLE IN-BROWSER
# Accuracy lies; recall/precision trade off as the threshold moves (99:1)
import numpy as np
rng = np.random.default_rng(3)
n = 10000; n_pos = 100  # 1% prevalence -> 99:1

# simulate classifier scores (not calibrated probabilities):
# positives skew high, negatives skew low
s_pos = np.clip(rng.beta(5, 2, n_pos), 0, 1)    # actual positives, scores skew high
s_neg = np.clip(rng.beta(2, 6, n-n_pos), 0, 1)  # actual negatives, scores skew low
score = np.r_[s_pos, s_neg]
y = np.r_[np.ones(n_pos), np.zeros(n-n_pos)].astype(int)

def report(tau):
    yhat = (score > tau).astype(int)
    tp = int(((yhat==1)&(y==1)).sum()); fp = int(((yhat==1)&(y==0)).sum())
    fn = int(((yhat==0)&(y==1)).sum()); tn = int(((yhat==0)&(y==0)).sum())
    acc = (tp+tn)/n
    prec = tp/(tp+fp) if tp+fp else float('nan')
    rec = tp/(tp+fn) if tp+fn else 0.0
    return acc, prec, rec, tp, fp, fn

print(" tau acc prec recall TP FP FN")
for tau in (0.5, 0.3, 0.1):
    acc, prec, rec, tp, fp, fn = report(tau)
    print(f"{tau:.2f} {acc:.4f} {prec:.3f} {rec:.3f} {tp:4d} {fp:4d} {fn:4d}")
print("\nalways-predict-negative: acc =", round((n-n_pos)/n, 4), " recall = 0.0 (caught nothing)")
print("dropping tau 0.5 -> 0.1 trades precision for the recall you actually need.")
RUN ▶ edits are live — add tau=0.05, or change n_pos to make it 99.9:0.1

5.5 Evaluating under imbalance — PR curves, the right metric
Every prediction lands in one of four cells of the confusion matrix, and every honest metric is built from them:

CONFUSION MATRIX
                  | PREDICTED + (alarm) | PREDICTED − (clear)
ACTUAL + (rare)   | TP · caught it      | FN · a miss
ACTUAL − (common) | FP · false alarm    | TN · correct all-clear

From these, two questions — and they are genuinely different questions:

EQ D5.7 — PRECISION, RECALL, F1
$$ \text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad F_1 = \frac{2\,\text{P}\,\text{R}}{\text{P}+\text{R}} $$
Precision = of everything you flagged, how much was real (the false-alarm tax).
Recall = of everything real, how much you caught (the miss rate's complement). Crucially, neither uses TN — so the giant pile of easy true negatives that inflates accuracy simply cannot rig these numbers. \(F_1\) is their harmonic mean, harsh on any large gap between the two. When false-negative and false-positive costs differ, use the weighted \(F_\beta\) (β > 1 favors recall) instead of \(F_1\). On a 1000-row test set with 10 true positives, a model flags all 10 (\(\mathrm{TP}=10\), \(\mathrm{FN}=0\)) plus 90 negatives by mistake (\(\mathrm{FP}=90\)). What is its precision ? Precision \(= \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \dfrac{10}{10 + 90} = \dfrac{10}{100} = \) 0.1. Recall is a perfect \(10/10 = 1.0\), yet 9 of every 10 alarms are false — the classic rare-event ambush, and exactly the trade-off the threshold of §5.4 controls. Sweeping the threshold turns these point metrics into curves. Two summaries dominate, and the choice between them is the single most important evaluation decision under imbalance: ROC curve (TPR vs. FPR) and its area, ROC-AUC. Because FPR = FP/(FP+TN) has the huge TN count in its denominator, ROC is insensitive to prevalence — which sounds like a virtue but is the opposite here. On a 99:1 problem a model can post a flattering 0.95 ROC-AUC while its precision is dismal, because thousands of false positives barely dent the FPR. Precision–Recall curve and its area, PR-AUC (a.k.a. average precision). Precision does feel every false positive directly, so the PR curve exposes exactly the failure ROC hides. On imbalanced problems, prefer PR-AUC. The base-rate ambush, in numbers. Screen 10,000 people for a condition with 1% prevalence (100 positives). A genuinely good test — 90% recall, 8% false-positive rate — catches 90 of the 100 cases but also flags 8% of 9,900 healthy people = 792 false alarms. Precision is \(90 / (90 + 792) \approx 10.2\%\): nine of every ten alarms are wrong, even though the test is "90% accurate" by recall. 
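The arithmetic of the screening example can be reproduced with a short sketch. The function name and the extra prevalence values in the loop are illustrative assumptions; the 90% recall and 8% false-positive rate are the figures from the text.

```python
def screening_precision(n, prevalence, recall, fpr):
    """Expected confusion counts and precision for a test applied at a given prevalence."""
    pos = n * prevalence          # actual positives
    neg = n - pos                 # actual negatives
    tp = recall * pos             # positives the test catches
    fp = fpr * neg                # negatives the test flags by mistake
    return tp, fp, tp / (tp + fp)

tp, fp, prec = screening_precision(10_000, 0.01, recall=0.90, fpr=0.08)
print(f"TP={tp:.0f}  FP={fp:.0f}  precision={prec:.3f}")

# the same test at other prevalences: precision moves although the test did not change
for prev in (0.001, 0.01, 0.10, 0.50):
    p = screening_precision(10_000, prev, 0.90, 0.08)[2]
    print(f"prevalence={prev:<5}  precision={p:.3f}")
```

With recall and false-positive rate held fixed, precision is driven by prevalence alone.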
No amount of resampling fixes this — it is the prevalence speaking. The defenses are honest metrics (PR-AUC, precision at fixed recall), explicit cost modeling, and a calibrated threshold.

SCREENED: 10,000 · prevalence 1% → 100 actually positive
RECALL 90%: 90 TP · 10 real cases slip through (FN)
FP RATE 8%: 792 FP · 8% of 9,900 healthy people flagged
PRECISION 10.2%: 90 / 882 alarms are real

Beyond curves, two more metrics earn their place: balanced accuracy (the mean of recall on each class — the right "accuracy" when you must report one number) and Matthews correlation coefficient (MCC), a single value in \([-1, 1]\) that uses all four confusion cells and stays honest across any prevalence. Whatever you choose, the iron rule from §5.2 holds: measure on data at the real prevalence. Resample to train; never resample to evaluate.

PITFALLS
The four classic imbalance mistakes: (1) reporting accuracy — it grades the base rate, not the model; (2) resampling before the train/test split, leaking synthetic minority points into the test fold and inventing scores; (3) trusting ROC-AUC on a 99:1 problem while precision quietly collapses; (4) shipping the default \(\tau = 0.5\) when your costs are asymmetric — the threshold is a free dial you forgot to turn.

NEXT
You now know how to prepare and weigh data so a model learns what matters. The Machine Learning volume opens by stepping back to first principles — what it even means to learn from data, the bias–variance decomposition, and why every technique in this volume is ultimately a bet about generalization. Volume I · Chapter 01: Learning from Data.

5.R References
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 — the interpolation method of EQ D5.3.
He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) — the canonical survey of resampling, cost-sensitive learning, and evaluation.
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017 (RetinaNet) — focal loss, EQ D5.5.
Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. ICIC 2005 — synthesizing only near the decision boundary (§5.3).
He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IJCNN 2008 — density-adaptive synthetic generation (§5.3).
Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006 — why PR-AUC, not ROC-AUC, is the metric to trust under imbalance (§5.5).
Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy. BMC Genomics 21 — the case for MCC on imbalanced binary problems (§5.5).

========================================================================
MACHINE LEARNING
========================================================================

## VOL I · 01 · Learning from Data (https://ai-encyclopedia.com/ml/01-learning-from-data.html)

01 · Learning from Data — AI Encyclopedia
VOLUME I — FOUNDATIONS OF ML · CHAPTER 01 / 08
Learning from Data
Regression lines, neural networks, and trillion-parameter language models all run on one idea. Instead of writing the rules yourself, you write a score and let the data turn the knobs. This chapter builds that idea from three pieces: a function with two knobs, a number that measures disagreement, and the test that separates learning from memorizing.
LEVEL INTRO READING TIME ≈ 18 MIN BUILDS ON NOTHING — START HERE INSTRUMENTS HAND-FIT · TRAIN/TEST IN THIS CHAPTER 1.1 The trick behind all of it 1.2 A function with knobs 1.3 Loss: keeping score 1.4 Generalization 1.5 The loop § Further reading

1.1 The trick behind all of it
For seventy years, making a computer do something meant one thing: a person figures out the rules, writes them down precisely, and the machine follows them. This works beautifully when the rules are knowable — payroll, physics simulations, chess-piece movement. It collapses when they are not. Nobody can write down the rules for recognizing a face, transcribing mumbled speech, or deciding whether an email is spam. We do these things effortlessly, and we cannot say how.

Machine learning inverts the contract. The human supplies three things: a pile of examples of the job done correctly, a flexible function with adjustable numbers inside it, and a score that measures how badly the function currently does the job. The machine's only task is to adjust the numbers until the score improves. The rules are never written by anyone. They condense out of the data, the way a curve condenses out of scattered points.

                 | Classical programming        | Machine learning
Human writes     | the rules, by hand           | examples + a score + a flexible function
Machine produces | answers                      | the rules (as knob settings)
Wins when        | rules are crisp and known    | rules are unknown, fuzzy, or drift over time
Fails by         | crashing — loudly, traceably | being statistically wrong — quietly, sometimes confidently

Take spam. The hand-written version — if subject contains "FREE!!!" then spam — was the actual state of the art in the 1990s, and it aged badly: spammers read the rules too. The learned version is handed two million emails that humans already labeled spam or not spam and tunes itself to agree with those labels. When spammers adapt, you don't rewrite code; you feed in fresh examples and tune again.
The maintenance burden moves from logic to data — which is the real reason this paradigm conquered the industry. The last table row is not a throwaway. A learned system's failures are statistical: it will be wrong on some inputs, with no stack trace pointing at the offending line, because there is no offending line. Knowing how often it is wrong — and on which inputs — is most of the discipline you are about to learn. 1.2 A model is a function with knobs To make "adjust the numbers until the score improves" precise, we need names for the pieces. The cleanest setting — and the one this whole volume lives in — is supervised learning: each example is a pair \((x, y)\), where \(x\) is the input and \(y\) is the correct answer, the label. Square footage and sale price. Email text and spam-or-not. A photo and the word "cat". Someone, somewhere, supervised: they supplied the right answers. A model (the older literature says hypothesis, hence the letter \(h\)) is a function that takes \(x\) and emits a guess for \(y\). What makes it a learnable function is that its behavior depends on adjustable numbers — its parameters, also called weights. The simplest interesting model on Earth has exactly two: EQ M1.1 — A FUNCTION WITH TWO KNOBS $$ h_{w,b}(x) \;=\; w\,x + b $$ A straight line. \(w\) is the slope — how much the prediction rises per unit of input — and \(b\) is the intercept, the prediction at \(x = 0\). The subscript records the central fact: pick different numbers \(w, b\) and you get a different function. Learning means searching the space of knob settings for the function that fits. Parameters are written collectively as \(\theta\) (theta), so you will see \(h_\theta\) everywhere; here \(\theta = (w, b)\). WORKED EXAMPLE ▾ 01 Set the knobs: \(w = 2\), \(b = 1\). The model is now the concrete function \(h(x) = 2x + 1\). 02 Feed it an input: \(h(3) = 2\cdot 3 + 1 = \) 7. 03 Now turn the knobs to \(w = -1\), \(b = 5\): same formula, different function. 
\(h(3) = -1\cdot 3 + 5 = \) 2. 04 Same input, two answers — because the pair \((w, b)\) is the model. Learning is choosing between these two (and every other) knob setting. RESULT: h(3) = 7 under θ = (2, 1) · h(3) = 2 under θ = (−1, 5) Set the knobs to \(w = 3\) and \(b = -2\), giving the model \(h(x) = 3x - 2\). What does it predict for the input \(x = 4\)? \(h(4) = 3\cdot 4 + (-2) = 12 - 2 = \) 10. The two knobs and one input fully determine the output — that is all a model does at prediction time. Hold onto the geometry of that sentence: two knobs define a two-dimensional space of candidate lines, and "learning" is a search through that space. Every model in this encyclopedia is the same object scaled up. A frontier language model is a function with roughly \(10^{12}\) knobs instead of two — harder to search, impossible to visualize, but not a different kind of thing. The vocabulary you are acquiring on this page transfers without modification. An honest caveat before we proceed. Not all learning is supervised. Models can learn structure from unlabeled data (unsupervised), from data that labels itself (self-supervised — how language models pre-train, Vol II Ch 04), or from trial-and-error reward (reinforcement learning). Supervised learning is where the vocabulary is cleanest, and the other regimes reuse nearly all of it. 1.3 Loss: keeping score "Fits the data" must become a number, or the machine has nothing to improve. For one example, the natural measure of failure is the residual — the gap between prediction and truth, \(h(x_i) - y_i\). To grade the model on the whole dataset, square each residual and average: EQ M1.2 — MEAN SQUARED ERROR $$ \mathcal{L}(w, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big( h_{w,b}(x_i) - y_i \big)^{2} $$ \(n\) examples; the \(\Sigma\) just means "add them all up". 
Squaring does three jobs at once: it kills the sign (overshoot and undershoot both count), it punishes large misses far more than small ones (a residual of 4 costs 16; two residuals of 2 cost 8), and it leaves a smooth bowl-shaped surface with no kinks — which is what makes the automatic tuning of Chapter 02 possible. The loss is a function of the knobs, not of the data: the data is fixed; \(w\) and \(b\) move; \(\mathcal{L}\) reports disagreement at every setting. WORKED EXAMPLE ▾ 01 Data: three points \((1,3)\), \((2,5)\), \((3,4)\). Knobs: \(w = 1\), \(b = 2\), so \(h(x) = x + 2\). 02 Predict: \(h(1)=3\), \(h(2)=4\), \(h(3)=5\). Residuals (prediction − truth): \(3-3 = 0\), \(4-5 = -1\), \(5-4 = +1\). 03 Square each: \(0^2 = 0\), \((-1)^2 = 1\), \(1^2 = 1\). The signs are gone; both misses cost the same. 04 Average over \(n = 3\): \((0 + 1 + 1)/3 = 2/3 \approx 0.67\). RESULT: MSE = 0.67 SLOPE w 1.00 INTERCEPT b 2.0 h(x) = 1.00x + 2.0 → MSE = 0.67 A model makes three predictions \(\hat y = (5, 8, 6)\) for three points whose true labels are \(y = (4, 6, 9)\). Using EQ M1.2, what is the mean squared error? Residuals (prediction − truth): \(5-4 = 1\), \(8-6 = 2\), \(6-9 = -3\). Square each: \(1, 4, 9\). Sum \(= 14\); average over \(n = 3\): \(14/3 \approx \) 4.667. The single residual of \(-3\) contributes 9 — more than the other two combined, which is exactly the disproportionate punishment squaring is designed to deliver. Now feel it in your hands. Below are 25 measurements from a noisy linear process. Your job is the machine's job: turn the two knobs and drive the disagreement down. The red stalks are the residuals — the exact quantities EQ M1.2 squares and averages. INSTRUMENT M1.1 — HAND-FIT 25 NOISY POINTS · EQ M1.2 LIVE · TARGET: MSE BELOW 4.00 SLOPE w 1.00 INTERCEPT b 0.0 MSE — EQ M1.2 — CHALLENGE: BEAT 4.00 — WORST SINGLE MISS — Drive the MSE below 4.00 — it is possible, but only just: the best achievable on these points is 3.57, at w ≈ 2.19, b ≈ 0.31. 
Notice the strategy your hands discover: big slope moves first, small intercept corrections after, ever-finer wiggles as you close in. That instinct — large steps far from the answer, small steps near it — is precisely what Chapter 02 turns into an algorithm.

And here is the same arithmetic with the curtain pulled back — the identical 25 points, two candidate knob settings, scored in four lines of numpy. The second candidate beats the instrument's target; neither is optimal.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

# The exact 25 points behind Instrument M1.1
x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212, 4.253,
              4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486, 7.243, 8.232,
              3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660, 7.241,
              12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813, 17.723, 17.977,
              5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

def mse(w, b):  # EQ M1.2, verbatim
    return np.mean((w * x + b - y) ** 2)

candidates = [(1.0, 0.0), (2.0, 1.0)]
for w, b in candidates:
    print(f"h(x) = {w:.2f}x + {b:.2f} -> MSE = {mse(w, b):6.2f}")
plot_scatter(x, y)
RUN ▶ edit the candidates — can you beat the instrument by hand?

Units, briefly. Squaring changes units: if \(y\) is in dollars, MSE is in dollars-squared, which no human can feel. Practitioners report \(\sqrt{\mathrm{MSE}}\) (RMSE) when they want interpretability. And MSE is one loss among many — classification tasks use cross-entropy (Chapter 04), and the freedom to choose the score is a design lever, not a footnote. What never changes: some single number measures disagreement, and learning means pushing it down.

1.4 Generalization: the only thing that matters
Here is the trap at the heart of the field. If low loss on the examples were the goal, the perfect model would be a lookup table: store every \((x_i, y_i)\) pair, return \(y_i\) when asked about \(x_i\), achieve a loss of exactly zero.
It is also perfectly useless — ask it about any \(x\) it hasn't stored and it has nothing to say. Zero training loss, zero learning. Memorization is not the goal. The goal is performance on data the model has never seen. That property is called generalization, and it is the only thing anyone is ever actually paying for. The defense is almost embarrassingly simple, and it is the single most important habit in machine learning: before doing anything else, split the data. Tune the knobs on one part (the training set) and measure on a part the model never touched (the test set). The held-out score is a rehearsal for the future; the training score is just a record of the past. Formally, the quantity we minimize is a stand-in for the quantity we want: EQ M1.3 — EMPIRICAL RISK STANDS IN FOR TRUE RISK $$ \hat{R}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i),\, y_i\big) \qquad\text{approximates}\qquad R(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \ell\big(h(x),\, y\big) \Big] $$ \(\ell\) is any per-example loss (squared error, here). \(\mathcal{D}\) is the unseen process that generates the data — houses being sold, emails being sent — and \(\mathbb{E}\) means "the average over everything that process will ever produce". We can never compute \(R\), so we minimize \(\hat{R}\) on a sample and hope the sample speaks for the population. Training loss measures fit. Test loss estimates risk. Only the second predicts the future. The gap between them is overfitting, made visible. WORKED EXAMPLE ▾ 01 Let \(\ell\) be squared error. A model scores four training examples with per-example losses 1.0, 0.5, 0.3, 0.2. 02 Empirical risk: \(\hat{R} = (1.0 + 0.5 + 0.3 + 0.2)/4 = 2.0/4 = \) 0.50. 03 True risk \(R\) averages over everything \(\mathcal{D}\) will ever produce — uncomputable. The four-sample average is our stand-in. 04 A held-out sample of four (never trained on) gives losses 1.2, 0.9, 1.1, 0.8 → test estimate \(4.0/4 = \) 1.00. 
The gap \(1.00 - 0.50 = 0.50\) is the overfitting EQ M1.3 warned about. RESULT: train R̂ = 0.50 · test estimate of R = 1.00 · gap = 0.50 A model scores five held-out examples with per-example losses \(0.4,\ 1.2,\ 0.6,\ 0.8,\ 1.0\). What is the empirical risk \(\hat R\) (EQ M1.3) on this sample? \(\hat R = \tfrac{1}{5}(0.4 + 1.2 + 0.6 + 0.8 + 1.0) = 4.0/5 = \) 0.8. Empirical risk is nothing more exotic than the average loss over the sample — our computable stand-in for the uncomputable true risk \(R\). A model reaches training loss \(0.30\) but its held-out test loss is \(0.95\). How large is the generalization gap (test − train)? Gap \(= 0.95 - 0.30 = \) 0.65. A model that fits the training data far better than the test data is overfitting; the gap is that failure made into a number. To see the gap open wide, give a model too much flexibility. The instrument below fits the same 25 points two ways: a straight line (two knobs), and a degree-9 polynomial (ten knobs — enough to snake through nearly every training point individually). Both are fitted to the same 18 training points; 7 points are held out. Watch what each extra knob buys, and what it costs. INSTRUMENT M1.2 — TRAIN/TEST SPLIT SAME DATA · 18 TRAIN / 7 HELD OUT · EQ M1.3 MODEL CAPACITY DEGREE 1 — STRAIGHT LINE DEGREE 9 — MEMORIZER ● TRAIN (18) ● TEST (7) — HELD OUT, NEVER FITTED TRAIN MSE (18 PTS) — TEST MSE (7 PTS) — TEST / TRAIN GAP — Flip to DEGREE 9. Train MSE collapses from 3.13 to 0.87 — by the training score, the wiggly curve is the better model, and it always will be: more knobs can never fit the training data worse. But test MSE detonates from 5.0 to 1,373, because between the memorized points the polynomial swings wildly through territory no data constrains. The degree-9 coefficients are precomputed (an exact least-squares fit to the 18 training points); both MSE readouts are computed live from them. Run the same experiment yourself — a 90/10 split this time, fits via np.polyfit. 
With only 3 points held out the verdict is noisier than the instrument's (small test sets are unreliable juries — that is a real lesson, not an apology), but it points the same way:
PYTHON · RUNNABLE IN-BROWSER
import numpy as np

x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212, 4.253, 4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486, 7.243, 8.232, 3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660, 7.241, 12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813, 17.723, 17.977, 5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

perm = np.random.default_rng(0).permutation(len(x))
train, test = perm[:22], perm[22:]  # 22 train / 3 test, roughly a 90 / 10 split
z = x / 10  # rescale so degree 9 stays well-conditioned

for deg in (1, 9):
    c = np.polyfit(z[train], y[train], deg)
    mse_tr = np.mean((np.polyval(c, z[train]) - y[train]) ** 2)
    mse_te = np.mean((np.polyval(c, z[test]) - y[test]) ** 2)
    print(f"degree {deg}: train MSE = {mse_tr:5.2f} test MSE = {mse_te:5.2f}")

held_out = np.isin(np.arange(len(x)), test).astype(int)
plot_scatter(x, y, held_out)  # blue = the 3 points the fit never saw
Change the rng seed and watch the 3-point test verdict wobble.
FINE PRINT The split certifies less than it seems to.
(1) It assumes test data is drawn from the same process \(\mathcal{D}\) as training data — but the world drifts, and a model certified on last year's emails meets next year's spammers. This failure mode, distribution shift, is endemic in deployment.
(2) The certificate expires with use: every time you peek at the test score and adjust your model in response, information leaks, and the test set quietly becomes training signal. Serious practice holds out a final untouched set and looks at it once.
(3) For language models this discipline has a sharper name — contamination — because when your training set is the internet, your test set is usually in it somewhere (Vol II, Ch 04).
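Point (2) of the fine print can be sketched in a few lines. The example below is an illustration under stated assumptions, not the encyclopedia's own procedure: the synthetic data, the 60/20/20 proportions, and the candidate degrees are all made up for the demonstration. Model choice happens on a validation set, and the test set is scored exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
y = 2.0 * x + 0.5 + rng.normal(0, 0.3, n)   # synthetic noisy line (assumed data)

# Disjoint index sets: 60% train, 20% validation, 20% final test.
perm = rng.permutation(n)
train, val, test = perm[:120], perm[120:160], perm[160:]

def mse(coeffs, idx):
    """Mean squared error of a fitted polynomial on the rows in idx."""
    return float(np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2))

# Fit every candidate on the training rows only.
fits = {deg: np.polyfit(x[train], y[train], deg) for deg in (1, 3, 9)}

# Choose the degree using the validation rows only.
best_deg = min(fits, key=lambda d: mse(fits[d], val))

# Touch the test rows once, after all choices are frozen.
final_test_mse = mse(fits[best_deg], test)
print(f"chosen degree = {best_deg}, test MSE = {final_test_mse:.3f}")
```

The design point is the ordering: any decision that reads the test score and then changes the model would turn `test` into a second validation set.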
1.5 The loop you will see thirty more times
Assemble the pieces and you get the universal cadence of machine learning — the loop every chapter in this encyclopedia will replay at larger scale:
FIG M1.1 PREDICT → MEASURE → ADJUST [Diagram: TRAINING DATA ((x, y) pairs) sends x to MODEL (ŷ = wx + b), which predicts; LOSS (mean (ŷ − y)²) measures each prediction against y, the right answers, for comparison; ADJUST (nudge w, b) returns new knob settings to the model; repeat until the score stops falling.]
The loop. Predict with the current knobs, measure disagreement against the labels, adjust the knobs, repeat. Everything else in machine learning is a refinement of one of these four boxes.
In Instrument M1.1, you were the ADJUST box — eyes on the residuals, hands on the sliders. That works for two knobs. It does not work for ten, and it is unthinkable for \(10^{12}\). The entire next chapter is about firing you from the job: calculus can read the slope of the loss surface and announce, for every knob simultaneously, which direction reduces disagreement. That announcement is called the gradient, and following it is called gradient descent — the algorithm that trains essentially everything, from the straight line above to the largest models ever built.
What will change as this volume proceeds: the model grows from a line to a network of millions of units; the loss changes shape for new tasks; the data swells from 25 points to trillions of tokens; ADJUST acquires momentum, schedules, and tricks. What will never change: predict, measure, adjust. When the architecture of Volume II towers over you, find the four boxes. They are always there.
NEXT You fit the line by feel; the machine fits it by calculus. Chapter 02: the loss surface as a landscape, the gradient as a compass pointing downhill, the learning rate as stride length — and why the exact solution to linear regression exists yet almost nobody uses it.
§ Further reading
Mitchell, T. (1997). Machine Learning.
— the cleanest formal statement of "learning = improving at a task from experience," and the source of the task/experience/performance framing.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). — the canonical reference for supervised learning, loss functions, and the train/test split.
Domingos, P. (2012). A Few Useful Things to Know About Machine Learning. — distils the field's hard-won folk wisdom: generalization, overfitting, and "data beats a cleverer algorithm."
Vapnik, V. (1995). The Nature of Statistical Learning Theory. — the formal account of why minimizing training error is not the same as learning, and what closes the gap.
Wolpert, D. (1996). The Lack of A Priori Distinctions Between Learning Algorithms. — the "no free lunch" result: no learner is best across all problems, so assumptions are unavoidable.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, Ch. 5. — a modern, self-contained primer on the learning-algorithm anatomy: model, loss, optimizer, generalization.

## VOL I · 02 · Linear Regression & Gradient Descent (https://ai-encyclopedia.com/ml/02-linear-regression.html)
VOLUME I — FOUNDATIONS OF ML · CHAPTER 02 / 08
Linear Regression & Gradient Descent
One model, a weighted sum. One loss, squared error. One algorithm, step downhill. Linear regression is the smallest setting in which the full machinery of modern machine learning runs end to end. The training loop you learn here is, line for line, the loop that trains GPT; every later model swaps in a richer function and a different error.
LEVEL INTRO READING TIME ≈ 24 MIN BUILDS ON CH 01 INSTRUMENTS DESCENT STEPPER · LR SWEEP IN THIS CHAPTER 2.1 The model & the loss surface 2.2 The closed form 2.3 Gradient descent 2.4 The learning rate 2.5 Features & scaling § Further reading 2.1 The model and the loss surface A linear model predicts by taking a weighted sum of the input features: each feature gets one learned number saying how much it matters and in which direction. Stack all \(n\) training examples as rows of a matrix \(X\), and the whole dataset's predictions become a single matrix–vector product. One bookkeeping trick makes the intercept disappear as a special case: append a constant feature of 1 to every example, and the bias \(b\) becomes just another weight. EQ M2.1 — THE MODEL, VECTORIZED $$ \hat{y} \;=\; X w, \qquad X \in \mathbb{R}^{n \times (d+1)}, \quad w \in \mathbb{R}^{d+1} $$ Row \(i\) of \(X\) is example \(i\)'s features plus the appended 1; \(\hat{y}_i = x_i^\top w\) is its prediction. The entire model is \(d+1\) numbers. Everything this volume builds — and everything Volume II builds — replaces this product with a richer function \(h(x; w)\), but the surrounding machinery never changes. How wrong is a given \(w\)? Chapter 01's answer was a loss function; here the natural one is mean squared error — average the squared miss over the dataset. Squaring does three jobs at once: misses in both directions count, large misses count disproportionately, and the result is smooth everywhere, so it has a well-defined slope we can follow. EQ M2.2 — MSE: THE LOSS IS A BOWL $$ \mathcal{L}(w) \;=\; \frac{1}{n}\,\lVert Xw - y \rVert^2 \;=\; \frac{1}{n}\sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2 $$ Because \(\hat{y}\) is linear in \(w\) and the error is squared, \(\mathcal{L}\) is a quadratic bowl in weight space: slice it at any height and you get nested ellipses. 
The bowl may be stretched, squashed, or tilted — that shape decides everything in §2.4 and §2.5 — but it has no ripples and no false valleys. A linear model produces predictions \(Xw = (2, 5, 7)\) for three examples whose targets are \(y = (3, 3, 8)\). What is the MSE \(\tfrac{1}{n}\lVert Xw - y\rVert^2\) (EQ M2.2)? Residuals \(Xw - y = (2-3,\ 5-3,\ 7-8) = (-1,\ 2,\ -1)\). Squared: \(1, 4, 1\). Sum \(= 6\); divide by \(n = 3\): \(6/3 = \) 2. Convexity, in words. A loss surface is convex when the straight line between any two points on it never dips below the surface — no hidden dimples for an optimizer to fall into. The practical guarantee is the one that matters: every downhill path leads to the same lowest point. Wherever you start and however clumsily you descend, if you keep going down, you arrive. Linear regression with MSE is convex; deep networks are emphatically not — which makes this chapter the one place you can watch optimization work with the guarantee switched on, before later chapters take it away. One honest wrinkle: if two features are exact copies of each other (or one is a linear combination of others), the bowl's floor becomes a flat trench — infinitely many \(w\) achieve the same minimum loss. Convexity still holds; uniqueness doesn't. Real pipelines hit this with duplicated columns and one-hot encodings more often than you'd think. 2.2 The closed form — and why we don't use it at scale A smooth bowl has exactly one flat point: the bottom. Setting the gradient of EQ M2.2 to zero and solving gives the minimizer outright — no iteration, no hyperparameters, no luck. 
This is the normal equation, and linear regression is nearly alone among learning algorithms in having one:
EQ M2.3 — THE NORMAL EQUATION $$ \nabla \mathcal{L}(w^\star) = 0 \quad\Longrightarrow\quad w^\star \;=\; \left( X^\top X \right)^{-1} X^\top y $$
\(X^\top X\) is the \((d{+}1)\times(d{+}1)\) matrix of feature co-occurrences; \(X^\top y\) measures how each feature co-varies with the target. In one matrix solve, the exact bottom of the bowl. In practice you never form the inverse — np.linalg.solve, or a least-squares routine such as np.linalg.lstsq (SVD-based in NumPy), does the same job with far better numerical behavior.
So why does the rest of machine learning iterate instead? Five concerns, roughly in increasing order of importance:
Concern | Normal equation | Gradient descent
Compute | O(nd² + d³) — the d³ solve is fatal once d hits 10⁵+ | O(nd) per pass; scales to billions of parameters
Memory | must hold the d×d matrix XᵀX | only w and one gradient
Numerics | conditioning of XᵀX is the square of X's — correlated features amplify rounding error | tolerant; error shrinks geometrically
Data access | needs all data at once | works on streams and mini-batches
Generality | exists only because h is linear and the loss quadratic | works for any differentiable h — including a transformer
The last row is the real verdict. The normal equation is a one-off gift of linear algebra: change the model to anything nonlinear — a logistic curve, a two-layer network — and the closed form vanishes forever. Gradient descent asks only that the loss have a slope. We learn the closed form not to use it, but to keep it as an oracle: in §2.5 we'll let it grade gradient descent's answer.
2.3 Gradient descent: feel the slope, step downhill
Imagine standing on the bowl blindfolded. You can't see the bottom, but at your feet you can feel which way is steepest. The gradient \(\nabla \mathcal{L}\) is that feeling, made precise: the vector of partial derivatives, pointing in the direction of steepest increase. So walk the other way.
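"Feeling the slope" can be taken literally. The sketch below estimates each partial derivative of the MSE by nudging one weight at a time (central differences), then takes a small step against the result. The two-example dataset, the helper name `numerical_gradient`, and the step size 0.1 are assumptions chosen for illustration.

```python
import numpy as np

# Tiny hypothetical dataset with the bias trick: columns are [feature, 1].
X = np.array([[1.0, 1.0],
              [2.0, 1.0]])
y = np.array([2.0, 3.0])

def loss(w):
    """Mean squared error of the linear model X @ w against y."""
    r = X @ w - y
    return float(r @ r) / len(y)

def numerical_gradient(f, w, h=1e-5):
    """Central-difference estimate of every partial derivative of f at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h                          # nudge only coordinate j
        g[j] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = np.array([1.0, 0.0])
g = numerical_gradient(loss, w)           # direction of steepest increase
w_down = w - 0.1 * g                      # so walk the other way
print("gradient estimate:", np.round(g, 4))
print("loss before:", round(loss(w), 4), "loss after:", round(loss(w_down), 4))
```

Finite differences cost two loss evaluations per weight, which is hopeless for large models. They remain a common sanity check for an analytic gradient.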
For MSE the gradient comes out of the chain rule in one line, and it has a shape worth memorizing: EQ M2.4 — THE GRADIENT OF MSE $$ \nabla_w \mathcal{L} \;=\; \frac{2}{n}\, X^\top \left( Xw - y \right) $$ Read it from the inside out: \(Xw - y\) is the vector of residuals — current prediction minus truth, one per example. \(X^\top(\cdot)\) then credits each feature with the residuals of the examples where it was active. The gradient is the data's errors, projected back onto the features that caused them. That sentence, generalized through many layers by backpropagation, is all of deep learning's credit assignment. WORKED EXAMPLE ▾ 01 Two examples with the bias trick: rows \(x_1 = (1, 1)\), \(x_2 = (2, 1)\); targets \(y = (2, 3)\). Current weights \(w = (1, 0)\): predictions \(Xw = (1, 2)\). 02 Residuals \(Xw - y = (1-2,\; 2-3) = (-1, -1)\) — both predictions are too low. 03 Project onto features, \(X^\top r\): feature column \((1, 2)\cdot(-1, -1) = -3\); bias column \((1, 1)\cdot(-1, -1) = -2\). 04 Scale by \(2/n = 2/2 = 1\): \(\nabla \mathcal{L} = (-3, -2)\). Both components negative → both weights should rise. EQ M2.5 does exactly that. RESULT: ∇𝓛 = (−3, −2) Two examples with the bias trick: rows \(x_1 = (1, 1)\), \(x_2 = (3, 1)\); targets \(y = (4, 6)\); current weights \(w = (1, 1)\). Using EQ M2.4, what is the first component of the gradient \(\nabla_w\mathcal{L}\) (the feature weight)? Predictions \(Xw = (1\cdot1 + 1\cdot1,\ 1\cdot3 + 1\cdot1) = (2, 4)\). Residuals \(Xw - y = (2-4,\ 4-6) = (-2, -2)\). Project onto the feature column \((1, 3)\): \(1\cdot(-2) + 3\cdot(-2) = -8\). Scale by \(2/n = 2/2 = 1\): the first gradient component is −8. Negative, so the update rule will push this weight up. EQ M2.5 — THE UPDATE RULE $$ w \;\leftarrow\; w \;-\; \eta\, \nabla_w \mathcal{L}(w) $$ \(\eta\) (eta) is the learning rate: how far to step in the downhill direction. 
This single line — compute the slope, take a step — is the update inside every optimizer in this encyclopedia. Adam, momentum, and friends decorate it; none replace it. WORKED EXAMPLE ▾ 01 Continue from EQ M2.4's example: \(w = (1, 0)\), \(\nabla \mathcal{L} = (-3, -2)\), current MSE \(= ((-1)^2 + (-1)^2)/2 = 1.00\). Pick \(\eta = 0.1\). 02 Step: \(w' = (1, 0) - 0.1\,(-3, -2) = (1 + 0.3,\; 0 + 0.2) = (1.3,\; 0.2)\). Subtracting a negative gradient pushes the weights up. 03 Re-score: predictions \(1.3 + 0.2 = 1.5\) and \(2.6 + 0.2 = 2.8\); residuals \(-0.5\) and \(-0.2\). 04 New MSE: \((0.25 + 0.04)/2 = 0.145\). One step cut the loss by 85%. Drag \(\eta\) below — past ≈ 0.29 the same step makes things worse. RESULT: w′ = (1.3, 0.2) · MSE 1.000 → 0.145 LEARNING RATE η 0.10 w′ = (1.30, 0.20) · MSE 1.000 → 0.145 ↓ Take \(w = (1, 0)\), gradient \(\nabla\mathcal{L} = (-3, -2)\), and learning rate \(\eta = 0.2\). After one gradient-descent step, what is the new first weight \(w_1'\)? \(w_1' = w_1 - \eta\,\nabla_1 = 1 - 0.2\cdot(-3) = 1 + 0.6 = \) 1.6. Subtracting a negative gradient pushes the weight up — exactly what a too-low prediction should do. That's the whole algorithm: predict, measure residuals, push the error back through the features, step, repeat. Watch it run on a real (toy) regression below — the left panel shows the bowl from above as loss contours in \((w, b)\) space; the right panel shows what each position means: a candidate line through the data. INSTRUMENT M2.1 — DESCENT STEPPER EQ M2.4 + M2.5 · LIVE ON 60 POINTS LEARNING RATE η 0.120 CONTROL STEP AUTO ▶ RESET STEP 0 MSE — η STATUS — η CRITICAL (THIS DATA) — STEP applies EQ M2.5 once; AUTO loops it. At the default η the path makes an L: it drops fast down the steep wall, then crawls along the valley floor. Raise η toward the critical value and the path starts ricocheting across the valley; push past it and every step lands higher than the last — divergence, live. 
The critical value isn't a constant of nature: it is computed from this dataset's curvature (§2.4).
The same loop in real code — plain numpy, no machine-learning libraries, nothing hidden. Run it, then break it: try eta = 0.9, or delete the 2.0 / n and watch the effective step size change.
PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(0)
n = 80
x = rng.uniform(-2, 2, n)
y = 1.7 * x - 0.4 + rng.normal(0, 0.5, n)  # truth: slope 1.7, intercept -0.4
X = np.column_stack([x, np.ones(n)])  # bias trick: w = [slope, intercept]

w = np.zeros(2)
eta = 0.1
steps, losses = [], []
for t in range(201):
    r = X @ w - y  # residuals
    mse = (r @ r) / n
    steps.append(t)
    losses.append(mse)
    if t % 20 == 0:
        print(f"step {t:3d} mse {mse:7.4f} w = {np.round(w, 3)}")
    grad = 2.0 / n * (X.T @ r)  # EQ M2.4
    w = w - eta * grad  # EQ M2.5

plot_xy(steps, losses)  # plotting helper, assumed to be supplied by the site's in-browser runner
The loss curve you just plotted has the signature shape of healthy gradient descent on a convex problem: steep early progress, then a long geometric glide toward the noise floor — it never reaches zero, because the data contains genuine noise no line can explain. A training curve that does hit zero should make you suspicious (Chapter 01: memorization).
2.4 The learning rate: the most important hyperparameter
Everything about gradient descent's behavior is decided by one number. Too small, and you inch downhill for geological time. Too large, and each step overshoots the bottom, lands on the far wall higher than it started, and the loss explodes exponentially. The boundary between those fates is sharp, and on a quadratic bowl you can compute it exactly. Take the one-dimensional case: a parabola with curvature \(\lambda\) (its second derivative).
One algebra step shows each update multiplies the remaining error by a fixed factor: EQ M2.6 — THE CONVERGENCE BAND $$ w_{t+1} - w^\star \;=\; \left( 1 - \eta\lambda \right) \left( w_t - w^\star \right) \qquad\Longrightarrow\qquad \text{stable} \iff 0 < \eta < \frac{2}{\lambda} $$ If \(|1-\eta\lambda| < 1\) the error shrinks every step; the moment \(\eta\) exceeds \(2/\lambda\), the factor passes \(-1\) and the error flips sign and grows — the overshooting spiral you saw in Instrument M2.1. Between \(1/\lambda\) and \(2/\lambda\) the factor is negative with magnitude below 1: the iterate converges while hopping from side to side of the minimum. With many dimensions, each direction of the bowl has its own curvature, and η must respect the steepest one — while progress is paced by the shallowest. That tension is the whole story of §2.5. WORKED EXAMPLE ▾ 01 One-dimensional bowl \(\mathcal{L}(w) = (w - 3)^2\): curvature \(\lambda = 2\), minimum \(w^\star = 3\), critical rate \(2/\lambda = 1\). Start at \(w_0 = 0\), so the error is \(w_0 - w^\star = -3\). 02 \(\eta = 0.4\): factor \(1 - \eta\lambda = 1 - 0.8 = 0.2\). Error per step: \(-3 \to -0.6 \to -0.12\) — smooth geometric convergence. 03 \(\eta = 0.9\): factor \(1 - 1.8 = -0.8\). Error: \(-3 \to +2.4 \to -1.92\) — converging while hopping sides of the minimum. 04 \(\eta = 1.2\): factor \(-1.4\). Error: \(-3 \to +4.2 \to -5.88\) — every step lands farther away. Divergence. RESULT: stable here iff η < 1 (= 2/λ) RATE η 0.40 CURVATURE λ 2.0 1 − ηλ = 0.20 · error ×0.20 per step · CONVERGES A one-dimensional loss bowl has curvature \(\lambda = 5\). By EQ M2.6, what is the critical learning rate — the largest \(\eta\) for which gradient descent still converges? Stability requires \(\eta < 2/\lambda\). The boundary is \(2/\lambda = 2/5 = \) 0.4. Any \(\eta\) above this multiplies the error by a factor below \(-1\) every step, and the loss explodes. 
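The three regimes of EQ M2.6 can be replayed in code. This is a minimal sketch on the parabola \(\mathcal{L}(w) = (w - 3)^2\), which has curvature \(\lambda = 2\); the helper name `error_trajectory` and its defaults are assumptions made for the illustration.

```python
def error_trajectory(eta, lam=2.0, w_star=3.0, w0=0.0, steps=2):
    """Run gradient descent on L(w) = (lam / 2) * (w - w_star)**2.

    Returns the signed errors w_t - w_star for t = 0..steps.
    """
    w = w0
    errors = [w - w_star]
    for _ in range(steps):
        grad = lam * (w - w_star)   # derivative of the parabola at w
        w = w - eta * grad          # one gradient-descent step
        errors.append(w - w_star)
    return errors

for eta in (0.4, 0.9, 1.2):
    factor = 1 - eta * 2.0          # EQ M2.6's per-step multiplier
    status = "converges" if abs(factor) < 1 else "diverges"
    errs = [round(e, 2) for e in error_trajectory(eta)]
    print(f"eta = {eta}: factor = {factor:+.2f}, errors = {errs}, {status}")
```

Each printed error list should be the previous error multiplied by the factor, so the sign pattern alone tells you which band of EQ M2.6 a given \(\eta\) sits in.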
With learning rate \(\eta = 0.3\) and curvature \(\lambda = 2\), what is the convergence factor \(1 - \eta\lambda\) by which each step shrinks the remaining error? \(1 - \eta\lambda = 1 - 0.3\cdot 2 = 1 - 0.6 = \) 0.4. Its magnitude is below 1, so the error shrinks 60% per step — smooth, monotone convergence. Four learning rates, one problem, eighty steps each — the canonical picture every practitioner carries in their head: INSTRUMENT M2.2 — LR SWEEP SAME DATA AS M2.1 · 80 GD STEPS PER η · LOG SCALE Each curve is gradient descent actually run in your browser on the M2.1 dataset, from the same starting point. η = 0.01 descends — but after 80 steps it still sits an order of magnitude above the noise floor. η = 0.1 glides down and settles on it. η = 0.5 lives in EQ M2.6's zigzag band: on this perfectly quadratic bowl its loss still falls every step (fast, even), but its parameter path in M2.1 ricochets wall to wall — and on real, non-convex surfaces that ricochet turns into loss spikes. η = 1.1 is past critical: every step multiplies the error, and the curve exits the chart within ten steps. Beyond the bowl. On a deep network's non-convex surface no single \(\lambda\) exists — curvature varies wildly across the landscape and across training time. That's why real recipes use learning-rate schedules (a gentle warmup so early chaotic gradients don't launch the weights, then a long decay) and per-coordinate adaptive optimizers like Adam, which effectively give every weight its own η. The full machinery appears with pre-training at scale in Vol II · Chapter 04. But every schedule and every optimizer is still negotiating with EQ M2.6's constraint — they never escape it. 2.5 Features and preprocessing: why scale decides convergence Here is the trap every beginner falls into once. Suppose one feature is "number of rooms" (range 1–10) and another is "square footage" (range 500–5,000). 
The loss bowl in those two weight directions has wildly different curvatures — the square-footage direction is roughly \((500)^2\) times steeper, because curvature scales with the square of the feature's magnitude. EQ M2.6 says η must stay below \(2/\lambda\) for the steepest direction; at that η, the shallow direction's error shrinks by a factor so close to 1 that convergence takes millions of steps. One η, shared by all weights, can only be as brave as the steepest direction allows. Geometrically: the bowl is a canyon, and gradient descent ping-pongs between its walls while drifting imperceptibly along its floor. The fix is almost embarrassingly simple — standardization: shift each feature to zero mean and rescale to unit variance, \(x' = (x - \mu)/\sigma\). All directions of the bowl now have comparable curvature, the contours become near-circles, and a single η serves every weight. For gradient descent this is not a nicety; it is frequently the difference between converging in a hundred steps and not converging at all. (Its descendants — BatchNorm, LayerNorm — apply the same idea inside deep networks, and Volume II leans on them constantly.) The leakage rule, again. Compute \(\mu\) and \(\sigma\) on the training set only, then apply those frozen values to validation and test data. Estimating them on the full dataset leaks test-set statistics into training — Chapter 01's cardinal sin, in its most common disguise. Proof, in code: the normal equation and gradient descent — run on standardized features, then un-standardized back — agree to four decimals. The oracle approves of the iterator. Then sabotage it: set eta = 0.1 on the raw features instead of the standardized ones, and watch the explosion EQ M2.6 predicts (the raw feature's curvature puts critical η near 0.03). 
PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n)  # raw, unscaled feature
y = 3.0 * x + 7.0 + rng.normal(0, 2.0, n)  # truth: slope 3, intercept 7
X = np.column_stack([x, np.ones(n)])

# --- oracle: normal equation (EQ M2.3), via solve, never the inverse ---
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# --- iterator: GD on standardized x, then map weights back ---
mu, sd = x.mean(), x.std()
Xs = np.column_stack([(x - mu) / sd, np.ones(n)])
w = np.zeros(2)
for t in range(500):
    grad = 2.0 / n * (Xs.T @ (Xs @ w - y))
    w = w - 0.1 * grad
w_gd = np.array([w[0] / sd, w[1] - w[0] * mu / sd])  # un-standardize

print("normal equation:", np.round(w_exact, 4))
print("gradient descent:", np.round(w_gd, 4))
print("max difference:", float(np.abs(w_exact - w_gd).max()))
# now try GD with eta=0.1 directly on raw X — it diverges. that gap is this section.
NEXT You now own the loop: predict with \(h\), measure the loss, follow the gradient, repeat. Every model for the rest of this encyclopedia — logistic regression next chapter, neural networks after that, GPT in Volume II — is exactly this loop with a fancier \(h\) and a loss to match. Chapter 03 makes the first swap: when the target is a category instead of a number, the line becomes a sigmoid, squared error becomes cross-entropy, and the gradient — remarkably — keeps the same residual-times-features shape you memorized in EQ M2.4.
§ Further reading
Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. — the first published statement of the method of least squares, the loss this whole chapter minimizes.
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. — ties least squares to the normal distribution and the maximum-likelihood justification for squared-error loss.
Cauchy, A.-L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées.
— the origin of gradient descent: follow the slope downhill to a minimum.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, Ch. 3. — the modern reference for the normal equations and why the closed form is rarely used at scale.
Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. — rigorous treatment of why a convex loss surface has one minimum and how step size governs convergence.
Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 3. — connects linear models, basis functions, and feature scaling to the probabilistic view.

## VOL I · 03 · Classification: Logistic & Softmax (https://ai-encyclopedia.com/ml/03-classification.html)
VOLUME I — FOUNDATIONS OF ML · CHAPTER 03 / 08
Classification: Logistic & Softmax
Chapter 02 predicted numbers. Most tasks instead ask for a choice: spam or not, benign or malignant, which of 100,000 tokens comes next. The bridge from lines to choices is probability. Pass a linear score through a sigmoid and you have logistic regression, whose loss, cross-entropy, is the exact loss that trains GPT.
LEVEL INTRO READING TIME ≈ 18 MIN BUILDS ON CH 01–02 INSTRUMENTS BOUNDARY EXPLORER · SIGMOID TEMPERATURE
IN THIS CHAPTER 3.1 Why a line is not enough 3.2 Sigmoid & logistic regression 3.3 Cross-entropy loss 3.4 Decision boundaries 3.5 Many classes: softmax 3.6 Metrics beyond accuracy § Further reading
3.1 Why a line is not enough
The obvious move is to recycle Chapter 02: code the two classes as \(y = 0\) and \(y = 1\), fit a straight line by least squares, and call anything above 0.5 a positive. This is called the linear probability model, and it fails in three instructive ways.
First, the outputs aren't probabilities. A line is unbounded: feed it an extreme input and it cheerfully predicts 1.4, or −0.3. There is no reading of "140% probability of spam" that survives contact with arithmetic — downstream decisions (expected costs, thresholds, calibration) all need outputs that live in \((0, 1)\) and behave like degrees of belief.
Second, squared error punishes being right. Take an email that is so obviously spam the line scores it 1.8. It is correctly classified by any threshold — yet squared error charges \((1.8 - 1)^2\) for it and drags the line back toward the pack, moving the boundary toward the mistakes. A loss for classification should reward confident correctness, not fine it.
Third, the geometry is brittle. Add a few far-away but trivially easy points to one class and the least-squares line pivots to appease them, misclassifying points near the frontier — where classification is actually decided.
The fix is not to abandon the linear score \(z = \mathbf{w}^{\top}\mathbf{x} + b\); it is too useful. The fix is to stop treating \(z\) as the answer and start treating it as evidence — a quantity on an unbounded scale that we convert into a probability.
3.2 The sigmoid & logistic regression
The converter is the sigmoid (logistic) function — an S-shaped squash that maps any real score to a probability:
EQ M3.1 — THE SIGMOID $$ \sigma(z) \;=\; \frac{1}{1 + e^{-z}}, \qquad \sigma: \mathbb{R} \to (0,1), \qquad \sigma(-z) \;=\; 1 - \sigma(z) $$
Strong positive evidence \(\to\) probability near 1; strong negative \(\to\) near 0; zero evidence \(\to\) exactly ½. The symmetry means "evidence for class 1" and "evidence for class 0" are the same number with the sign flipped: the probability of class 0 is \(1 - \sigma(z) = \sigma(-z)\). Its derivative is \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\) — largest at the midpoint (¼), vanishing in the tails. The sigmoid is the exchange rate between evidence and probability.
WORKED EXAMPLE ▾ 01 Take evidence \(z = 2\).
First the exponential: \(e^{-2} \approx 0.135\). 02 Then \(\sigma(2) = 1/(1 + 0.135) = 1/1.135 \approx \) 0.881 — strong belief, not certainty. 03 Symmetry check: \(\sigma(-2) = 1 - 0.881 = 0.119\). And zero evidence: \(\sigma(0) = 1/(1+1) = 0.5\) exactly. 04 Slope at \(z = 2\): \(\sigma'(2) = 0.881 \times 0.119 \approx 0.105\) — well below the midpoint maximum of 0.25. The tails flatten fast, which is where gradients go to die. RESULT: σ(2) ≈ 0.881 EVIDENCE z 2.0 σ(2.0) = 0.881 · odds eᶻ = 7.39: 1 A logistic model emits the evidence (logit) \(z = 1\). What probability does the sigmoid assign, \(\sigma(1)\)? (Use \(e^{-1} \approx 0.368\).) \(\sigma(1) = \dfrac{1}{1 + e^{-1}} = \dfrac{1}{1 + 0.368} = \dfrac{1}{1.368} \approx \) 0.731. One unit of positive evidence buys about 73% belief — confident, but a long way from certain. Bolting the sigmoid onto the linear score gives logistic regression — still a linear model, but linear in the right place: EQ M3.2 — LOGISTIC REGRESSION $$ p(y = 1 \mid \mathbf{x}) \;=\; \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right) \qquad \Longleftrightarrow \qquad \log \frac{p}{1 - p} \;=\; \mathbf{w}^{\top}\mathbf{x} + b $$ Read it right-to-left: the model is linear in the log-odds. Each unit increase in feature \(x_j\) multiplies the odds \(p/(1-p)\) by \(e^{w_j}\) — which is why logistic regression is still the lingua franca of medicine and credit scoring: every weight is a legible odds multiplier. The pre-sigmoid score \(z\) is called a logit — the same word, and the same object, as the raw scores an LLM emits before its softmax (Vol II · EQ 1.2). Unlike least squares, there is no closed-form solution — logistic regression is trained by gradient descent (Chapter 02) on the loss of the next section. The consolation prize is substantial: that loss is convex for this model, so gradient descent finds the global optimum. It is the last model in this volume for which that is true. 
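The arithmetic of EQ M3.1 and the odds reading of EQ M3.2 can be checked in a few lines. The sketch below is illustrative only: the weights `w`, bias `b`, and input `x` are made-up values, not taken from any model in this chapter.

```python
import math

def sigmoid(z):
    """EQ M3.1: map an unbounded score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds p / (1 - p) for a probability p."""
    return p / (1.0 - p)

# Worked-example values from the text: sigma(2), sigma(-2), and the slope at z = 2.
p2 = sigmoid(2.0)
slope2 = p2 * (1.0 - p2)  # sigma'(z) = sigma(z) * (1 - sigma(z))
print(round(p2, 3), round(sigmoid(-2.0), 3), round(slope2, 3))

# Hypothetical two-feature logistic model (values chosen only for illustration).
w = [0.8, -0.5]
b = 0.1
x = [1.0, 2.0]

def predict(features):
    """EQ M3.2: probability of class 1 for one input."""
    z = sum(wj * xj for wj, xj in zip(w, features)) + b
    return sigmoid(z)

# Raise feature 1 by one unit and compare the odds before and after.
p_before = predict(x)
p_after = predict([x[0] + 1.0, x[1]])
odds_ratio = odds(p_after) / odds(p_before)
print(odds_ratio, math.exp(w[0]))  # the two numbers should agree
```

The odds ratio matches \(e^{w_1}\) whatever starting input is chosen, which is the "legible odds multiplier" property described above.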
3.3 Cross-entropy: the loss that trains GPT What should the model pay when it predicts probability \(p\) and the truth is \(y\)? The principled answer comes from maximum likelihood: choose the weights that make the observed labels most probable. Taking the negative log (sums beat products; minimizing beats maximizing) yields cross-entropy, also called log loss: EQ M3.3 — BINARY CROSS-ENTROPY $$ \mathcal{L}(\mathbf{w}, b) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log p_i \;+\; (1 - y_i) \log (1 - p_i) \,\Big], \qquad p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i + b) $$ Per example, only one term survives: you pay \(-\log(\text{probability you gave the truth})\) — the surprisal. Assign 0.99 to what happens, pay 0.01 nats; assign 0.001, pay 6.9; the bill for confident wrongness is unbounded. Generalized from 2 classes to \(|V| \approx 100\mathrm{K}\) token classes and averaged over positions, this is exactly Vol II · EQ 1.6 — the pre-training loss of GPT. Next-token prediction is this chapter, scaled up. WORKED EXAMPLE ▾ 01 Three predictions, three truths: \(p = 0.9\) with \(y = 1\); \(p = 0.6\) with \(y = 1\); \(p = 0.2\) with \(y = 0\). 02 Each example pays −log of the probability given to the truth: \(-\ln 0.9 = 0.105\); \(-\ln 0.6 = 0.511\); \(-\ln(1 - 0.2) = -\ln 0.8 = 0.223\). 03 Average over \(N = 3\): \((0.105 + 0.511 + 0.223)/3 = 0.839/3 \approx \) 0.280 nats. 04 The unbounded bill: had the first prediction been \(p = 0.01\) (confidently wrong about a true 1), that single term is \(-\ln 0.01 = 4.61\) — over 16× the entire average above. RESULT: BCE ≈ 0.28 nats PREDICTED p 0.90 if y = 1: pay 0.105 nats · if y = 0: pay 2.303 nats The true label is \(y = 1\) and the model predicts \(p = 0.25\). By EQ M3.3, how many nats of cross-entropy does this single example cost? Only the \(y = 1\) term survives: cost \(= -\log p = -\ln(0.25) = \ln 4 \approx \) 1.386 nats. 
The model put just a quarter of its belief on what actually happened, and the surprisal bills it accordingly. Why not just use squared error on \(p\)? Two reasons. Through a sigmoid, squared error becomes non-convex — gradient descent can stall in flat regions, and precisely when the model is confidently wrong the \(\sigma'\) factor crushes the gradient toward zero. Cross-entropy's gradient cancels that factor exactly, leaving the cleanest possible signal: \(\nabla_{\mathbf{w}} \mathcal{L} = \tfrac{1}{N}\sum_i (p_i - y_i)\,\mathbf{x}_i\) — error times input, the same form as linear regression's. The worse the miss, the louder the correction. There is a second, subtler reason to descend cross-entropy rather than the thing you ostensibly care about: accuracy is a staircase. Nudge the boundary and accuracy doesn't move at all — until a point crosses the line and it jumps. Zero gradient almost everywhere, undefined at the jumps: useless for optimization. Cross-entropy is the smooth ramp that gradient descent can actually walk. Feel the difference yourself: INSTRUMENT M3.1 — BOUNDARY EXPLORER EQ M3.2 LIVE · 140 SEEDED POINTS · TWO GAUSSIAN CLOUDS WEIGHT w₁ 0.90 WEIGHT w₂ -0.60 BIAS b 0.40 ACCURACY (STAIRCASE) — CROSS-ENTROPY (SMOOTH) — MISCLASSIFIED — Points are colored by the model's prediction (mint = class 1, blue = class 0); red rings mark misclassifications; the white line is the decision boundary, the mint arrow is w. The default boundary is tilted the wrong way — drag w₂ positive and watch the rings vanish. Two lessons: (1) cross-entropy keeps improving between accuracy's jumps, which is why training descends the loss, not the metric; (2) past ≈98% you cannot win — the few remaining rings live in the overlap, and no line can claim them. Gradient descent does the same steering automatically. 
The cell below trains logistic regression on the same two clouds — twelve lines of numpy, the gradient from this section, nothing else:

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(7)
n = 80
# two gaussian clouds, as in Instrument M3.1
A = rng.normal([-1.5, -1.0], 1.05, size=(n, 2))   # class 0
B = rng.normal([ 1.5,  1.1], 1.05, size=(n, 2))   # class 1
X, y = np.vstack([A, B]), np.array([0]*n + [1]*n)

w, b, lr = np.zeros(2), 0.0, 0.5
for step in range(400):
    p = 1 / (1 + np.exp(-(X @ w + b)))    # EQ M3.2
    w -= lr * (X.T @ (p - y)) / len(y)    # gradient of EQ M3.3: error x input
    b -= lr * np.mean(p - y)

p = np.clip(1 / (1 + np.exp(-(X @ w + b))), 1e-12, 1 - 1e-12)
ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
acc = np.mean((p > 0.5) == y)
print("w =", np.round(w, 3), " b =", round(b, 3))
print(f"cross-entropy = {ce:.4f} accuracy = {acc:.1%}")
# plot_scatter is not a numpy function; presumably a helper supplied by the site's in-browser runtime
plot_scatter(X[:, 0], X[:, 1], (p > 0.5).astype(int))

Try lr = 5.0, or 10 steps instead of 400.

3.4 Decision boundaries: linear in input space

Where does the model actually decide? At \(p = 0.5\) — which by EQ M3.1 happens exactly where the evidence is zero: \(\mathbf{w}^{\top}\mathbf{x} + b = 0\). In two dimensions that is a straight line; in \(d\) dimensions, a flat hyperplane. The sigmoid bends probabilities, never the boundary: logistic regression is a linear classifier, however smoothly its confidence shades from one side to the other.

The parameters split into three legible roles. The direction of \(\mathbf{w}\) sets the boundary's orientation (\(\mathbf{w}\) is perpendicular to it — the mint arrow in Instrument M3.1). The bias \(b\) slides the boundary without rotating it. And the magnitude \(\lVert\mathbf{w}\rVert\) controls how fast probability ramps as you walk away from the line: the score is \(z = \lVert\mathbf{w}\rVert \cdot d(\mathbf{x})\), where \(d(\mathbf{x})\) is the signed distance to the boundary.
Direction says what the model believes; magnitude says how hard. That magnitude acts as a steepness dial — an inverse temperature — on the probability curve: INSTRUMENT M3.2 — SIGMOID TEMPERATURE p = σ(k·z) · STEEPNESS k ≡ ‖w‖ ≡ 1/τ STEEPNESS k 1.00 SLOPE AT MIDPOINT (k/4) — GREY ZONE (0.2 < p < 0.8) — p AT z = +1 — The dashed curve is k = 1 for reference; the shaded band is the "grey zone" where the model is genuinely unsure. Crank k toward 8: the sigmoid hardens into a step — decisive, but with vanishing gradients and zero humility. Drop it toward 0.2: every answer is a shrug near 50%. This is the same dial as sampling temperature in Vol II — there you divide logits by τ; here k multiplies the score, so k ≡ 1/τ. Training sets it implicitly through ‖w‖. A real failure mode hides in that dial. If the training data is perfectly separable, cross-entropy keeps paying the model to grow \(\lVert\mathbf{w}\rVert\) forever — every doubling sharpens probabilities toward 0/1 and shaves a little more loss, without moving the boundary at all. The result is a wildly overconfident model. The standard fixes are L2 regularization or early stopping, both of which cap \(\lVert\mathbf{w}\rVert\); Chapter 06 treats this properly. What a line cannot do is bend. XOR-style data (positives in opposite corners) and concentric rings defeat every choice of \(\mathbf{w}\) and \(b\) — no straight boundary separates them. The classical remedy is feature engineering: feed the model \(x_1 x_2\), or \(x_1^2 + x_2^2\), and the boundary becomes linear in the new features while curving in the original space. The modern remedy is to learn those features — which is precisely what neural networks do (Chapter 07). Either way, the lesson stands: a linear classifier is only as good as the space you hand it. 3.5 Many classes: softmax Two classes needed one score. 
For \(K\) classes, give each class its own linear score \(z_i = \mathbf{w}_i^{\top}\mathbf{x} + b_i\) and normalize the lot with softmax — exponentiate (so everything is positive and ratios are preserved on the log scale), then divide by the sum (so everything adds to one): EQ M3.4 — SOFTMAX $$ \mathrm{softmax}(\mathbf{z})_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \mathrm{softmax}(\mathbf{z} + c)_i \;=\; \mathrm{softmax}(\mathbf{z})_i \;\;\text{for any constant } c $$ For \(K = 2\) it collapses to EQ M3.1: \(p_1 = \sigma(z_1 - z_0)\) — sigmoid is softmax for two. The shift invariance says softmax reads differences between scores, not their absolute values — which doubles as the standard numerical-stability trick: subtract \(\max_j z_j\) before exponentiating, for free. The loss generalizes too: pay \(-\log p_{\text{true class}}\). And the gradient stays beautiful: predicted probabilities minus the one-hot truth. WORKED EXAMPLE ▾ 01 Three class scores: \(\mathbf{z} = (2, 1, 0)\). 02 Exponentiate: \(e^2 \approx 7.39\), \(e^1 \approx 2.72\), \(e^0 = 1\). Sum \(\approx 11.11\). 03 Divide each by the sum: \(p \approx (7.39/11.11,\; 2.72/11.11,\; 1/11.11) = (0.665,\; 0.245,\; 0.090)\). Adds to 1, as promised. 04 Shift-invariance check: \(\mathbf{z} = (102, 101, 100)\) gives the identical answer — the common factor \(e^{100}\) cancels top and bottom. Only the gaps between scores matter (here 1 and 1). RESULT: softmax(2, 1, 0) ≈ (0.665, 0.245, 0.090) Three class scores are \(\mathbf{z} = (1, 0, 0)\). By EQ M3.4, what probability does softmax assign to the first class? Exponentiate: \(e^1 \approx 2.718\), \(e^0 = 1\), \(e^0 = 1\). Sum \(\approx 4.718\). First probability \(= 2.718 / 4.718 \approx \) 0.576. A one-unit lead over the other two scores translates into a clear, but not crushing, majority of the probability mass. 
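Two claims in the caption of EQ M3.4 are easy to verify numerically: that the gradient of the softmax cross-entropy with respect to the scores is "predicted probabilities minus the one-hot truth", and that softmax with two classes collapses to the sigmoid. The sketch below uses the worked-example scores \((2, 1, 0)\); the choice of true class and the two-class scores `z0`, `z1` are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # max-subtraction for numerical stability
    return e / e.sum()

def cross_entropy(z, k):
    """Loss for one example: -log of the probability given to true class k."""
    return -np.log(softmax(z)[k])

z = np.array([2.0, 1.0, 0.0])  # scores from the worked example
true_class = 1                 # arbitrary choice for this check
p = softmax(z)
print(np.round(p, 3))

# Analytic gradient: probabilities minus the one-hot truth.
one_hot = np.zeros_like(z)
one_hot[true_class] = 1.0
analytic = p - one_hot

# Numerical gradient by central differences, one score at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, true_class)
                  - cross_entropy(z - step, true_class)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))

# Two-class collapse: softmax over (z0, z1) gives p1 = sigmoid(z1 - z0).
z0, z1 = 0.3, 1.7
p_two = softmax(np.array([z0, z1]))
p_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))
print(p_two[1], p_sigmoid)
```

A finite-difference check like this is a cheap way to catch sign or indexing mistakes when the gradient is written by hand.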
You will meet this exact function three more times in this encyclopedia, doing three different jobs:

Where                                Softmax over                    Producing
This chapter                         K class scores                  p(class | input)
LLM output head (Vol II · EQ 1.2)    |V| ≈ 100K token logits         p(next token | context)
Attention (Vol II · EQ 3.1)          T relevance scores per query    mixing weights over values
Sampling with temperature τ          logits / τ                      sharpened or flattened p

One function, one identity: turn arbitrary scores into a probability distribution, differentiably. Every time a network must choose softly among options — classes, tokens, positions to attend to — softmax is the mechanism. Run it:

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max: free, by shift invariance
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -1.7])   # 4 classes, raw scores
p = softmax(logits)
for name, pi in zip("ABCD", p):
    print(f"class {name}: {pi:.4f}")
print("sum =", round(p.sum(), 6))
print()
print("logits + 100:", np.round(softmax(logits + 100), 4))
print("identical — softmax reads DIFFERENCES, not absolute scores")
with np.errstate(over="ignore"):   # what the max-trick prevents:
    print("naive exp(z+1000):", np.exp(logits + 1000.0))

Try logits / 0.1 (cold) or logits / 10 (hot).

3.6 Metrics beyond accuracy

A disease afflicts 1 person in 1,000. A classifier that always returns "healthy" scores 99.9% accuracy and has never detected anything. Under class imbalance, accuracy measures the imbalance, not the model. The honest accounting starts by splitting the four ways a binary prediction can land:

CONFUSION MATRIX
            PREDICTED +                            PREDICTED −
ACTUAL +    TP · true positive                     FN · false negative — a miss
ACTUAL −    FP · false positive — a false alarm    TN · true negative

Two questions matter, and they are different questions. Precision \(= \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})\): of everything I flagged, how much was real?
Recall \(= \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\): of everything real, how much did I flag? They pull against each other through a dial you already own: the decision threshold. Nothing forces the cut at \(p = 0.5\) — lower it and you catch more positives (recall ↑) while flagging more junk (precision ↓); raise it and the reverse. A single model traces an entire precision–recall curve as the threshold sweeps; the F1 score \(= 2PR/(P+R)\), a harmonic mean, condenses one operating point into one number — harsh on imbalance between the two, as a harmonic mean should be. A classifier records \(\mathrm{TP} = 30\), \(\mathrm{FP} = 10\), \(\mathrm{FN} = 20\). What is its precision? Precision \(= \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \dfrac{30}{30 + 10} = \dfrac{30}{40} = \) 0.75. Of everything it flagged, three in four were real — recall (which uses FN) answers the different question of how many real cases it caught. The same classifier (\(\mathrm{TP} = 30\), \(\mathrm{FP} = 10\), \(\mathrm{FN} = 20\)) has precision \(0.75\) and recall \(30/50 = 0.6\). What is its F1 score? \(F1 = \dfrac{2PR}{P + R} = \dfrac{2\cdot 0.75\cdot 0.6}{0.75 + 0.6} = \dfrac{0.9}{1.35} \approx \) 0.667. The harmonic mean sits below the arithmetic mean of \(0.675\) — its way of penalizing the imbalance between precision and recall. Where you sit on that curve is a question about costs, not statistics: a missed tumor and a false alarm are not the same price, and no metric chooses for you. Worse, base rates ambush intuition. 
Run a genuinely good screening test on a rare condition: SCREENED 10,000 prevalence 1% → 100 actually positive RECALL 90% 90 TP 10 real cases slip through (FN) FP RATE 8% 792 FP 8% of 9,900 healthy people flagged PRECISION 10.2 % 90 / 882 flags are real — 9 in 10 alarms are false Working under imbalance, in practice: judge models on precision/recall (or the PR curve), never raw accuracy; consider reweighting the loss so rare-class errors cost more, or resampling the data; and remember the cheapest fix is often just moving the threshold after training. The probabilities logistic regression emits are exactly what make that last move possible — a hard classifier offers no dial at all. NEXT Logistic regression draws one straight, confident line. Chapter 04 takes the opposite bet: models with no line, no sigmoid, and barely any equations — decision trees that carve the space into boxes, forests that vote, and nearest neighbors that just ask, "what did similar points do?" § Further reading Cox, D. R. (1958). The Regression Analysis of Binary Sequences. — the founding paper of logistic regression and the log-odds (logit) link. Berkson, J. (1944). Application of the Logistic Function to Bio-Assay. — introduced the "logit" and popularized the sigmoid as a response curve. Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 4. — the clearest modern treatment of logistic regression, cross-entropy, and the softmax for multiclass. Bridle, J. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs. — names and justifies the softmax as a normalized-exponential probability layer. Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. — why accuracy misleads on imbalanced data and when to read PR vs ROC. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, Ch. 4. — linear methods for classification, decision boundaries, and maximum-likelihood fitting. 
## VOL I · 04 · Trees, Forests & Neighbors (https://ai-encyclopedia.com/ml/04-trees-and-neighbors.html)

VOLUME I — FOUNDATIONS OF ML · CHAPTER 04 / 08
Trees, Forests & Neighbors

Not every model is a curve bent by gradient descent. This chapter covers methods that keep the training data instead of compressing it away: k-NN, which is its training set plus a voting rule, and decision trees, which carve the input space into rectangles by asking greedy yes/no questions. Ensembled into random forests and gradient-boosted stacks, these remain, in 2026, the methods to beat on tabular data.

LEVEL INTRO READING TIME ≈ 25 MIN BUILDS ON VOL I · CH 01–03 INSTRUMENTS DEPTH DIAL · k-NN k-SLIDER

IN THIS CHAPTER 4.1 k-NN: memory as a model 4.2 Decision trees 4.3 Overfitting a tree 4.4 Bagging & random forests 4.5 Gradient boosting 4.6 Trees vs deep learning § Further reading

4.1 k-NN: memory as a model

The k-nearest-neighbors classifier has no training step, no parameters, and no loss function. The "model" is the training set itself, a distance function, and one integer. To classify a new point \(x\): find the \(k\) training points closest to it, and let them vote.

EQ M4.1 — MAJORITY VOTE OF THE k NEAREST
$$ \hat{y}(x) \;=\; \operatorname*{arg\,max}_{c}\; \sum_{i \,\in\, N_k(x)} \mathbf{1}\!\left[\, y_i = c \,\right] $$
\(N_k(x)\) is the set of the \(k\) training points nearest to \(x\) — usually under Euclidean distance — and \(\mathbf{1}[\cdot]\) counts a vote when neighbor \(i\) carries label \(c\). All the cost moves to query time: \(O(nd)\) distance computations per prediction against \(n\) stored examples in \(d\) dimensions.
Every modeling assumption hides inside the distance function — if one feature is measured in millimeters and another in kilometers, the millimeters decide everything, which is why k-NN demands scaled features while trees (§4.2) do not.

WORKED EXAMPLE ▾
01 Query point \(x\); set \(k = 5\). The five nearest training points carry labels ●, ●, ○, ●, ○.
02 Count the votes — that is all \(\mathbf{1}[y_i = c]\) does: class ● gets \(1+1+0+1+0 = 3\); class ○ gets \(2\).
03 \(\operatorname{arg\,max}\): predict ●. (Odd \(k\) guarantees no tie in a two-class problem.)
04 The hidden assumption: "nearest" used Euclidean distance. If feature 1 were in kilometers and feature 2 in millimeters, the vote would be decided by feature 2 alone, because the millimeter numbers dwarf the kilometer ones — rescale first.
RESULT: ŷ(x) = ● by 3 votes to 2

Classify a query with \(k = 7\). Its seven nearest training points carry labels \(1, 0, 1, 1, 0, 1, 0\). By EQ M4.1, which class does k-NN predict (answer \(0\) or \(1\))?
Count the votes: class \(1\) appears \(4\) times, class \(0\) appears \(3\) times. The \(\arg\max\) picks the majority, so the prediction is class 1. With odd \(k\) in a two-class problem, ties are impossible.

For something this naive, k-NN is theoretically respectable: a classic result (Cover & Hart, 1967) shows that with unlimited data, the humble 1-NN rule's error is at most twice the best achievable error of any classifier. Memory is a legitimate model. And it scaled further than anyone guessed: vector search over learned embeddings — the retrieval step inside every RAG system and vector database — is k-NN run at billion-document scale. The algorithm never died; it just changed feature spaces.

CURSE The curse of dimensionality. "Nearest" degrades as dimensions grow. In a 100-dimensional unit cube, a sub-cube that wants to capture just 1% of uniformly spread points needs edge length \(0.01^{1/100} \approx 0.955\) — it must span 95% of every axis to hold 1% of the data.
Worse, pairwise distances concentrate: the gap between the nearest and farthest neighbor shrinks toward nothing, and the vote becomes noise. The honest footnote: real high-dimensional data (images, text embeddings) usually lies near much lower-dimensional structure, which is why k-NN on a good embedding still works — a thread picked up in Chapter 05.

Run it from scratch. The cell below builds two overlapping Gaussian blobs, classifies with brute-force distance + vote, and scores \(k=1\) against \(k=15\). Note the signature of memorization: \(k=1\) is perfect on the training set (every point is its own nearest neighbor) and the worst of the pair on the test set.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(7)
# two overlapping Gaussian blobs — 240 points, 160 train / 80 test
n = 120
X0 = rng.normal([-0.9, -0.6], 1.1, (n, 2))
X1 = rng.normal([ 0.9,  0.7], 1.1, (n, 2))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n, int), np.ones(n, int)]
idx = rng.permutation(2 * n)
Xtr, ytr = X[idx[:160]], y[idx[:160]]
Xte, yte = X[idx[160:]], y[idx[160:]]

def knn_predict(Xq, k):
    d = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)   # all pairwise dist²
    votes = ytr[np.argsort(d, axis=1)[:, :k]]               # labels of k nearest
    return (votes.mean(axis=1) > 0.5).astype(int)           # majority (k is odd)

for k in (1, 15):
    print(f"k={k:2d} train acc = {(knn_predict(Xtr, k) == ytr).mean():.3f}"
          f" test acc = {(knn_predict(Xte, k) == yte).mean():.3f}")
# plot_scatter is not a numpy function; presumably a helper supplied by the site's in-browser runtime
plot_scatter(X[:, 0], X[:, 1], y)

Try k = 75, or shrink the blob spread to 0.5.

4.2 Decision trees: greedy questions, rectangular answers

A decision tree classifies by interrogation: is feature 2 below 0.21? yes → left, no → right, repeated until a leaf, where the majority class of the training points that landed there becomes the prediction. Geometrically, every internal node slices the space with an axis-aligned cut, so every leaf is a rectangle (a box, in higher dimensions). The model is a partition.
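To make "the model is a partition" concrete, a fitted tree can be written out as nested conditionals. The sketch below hard-codes the depth-2 tree of FIG M4.A from this section, assuming the flattened figure reads as: root question \(x_2 < 0.21\) with its true branch ending in R1, and the false branch asking \(x_1 < 1.04\). The function name, the label strings, and the query points are hypothetical.

```python
def tree_predict(x1, x2):
    """Depth-2 tree as nested questions; returns (region, predicted label).

    Labels: "filled" stands for the ● class, "hollow" for the ○ class.
    """
    if x2 < 0.21:        # root question: a horizontal cut at x2 = 0.21
        return "R1", "filled"
    if x1 < 1.04:        # second question: a vertical cut at x1 = 1.04
        return "R2", "hollow"
    return "R3", "filled"

# Made-up query points, one per rectangle.
queries = [(0.0, -1.0), (0.5, 1.0), (2.0, 1.0)]
for x1, x2 in queries:
    region, label = tree_predict(x1, x2)
    print(f"({x1}, {x2}) lands in {region}, predict {label}")
```

Every input follows exactly one path, so the three return statements correspond one-to-one with the three rectangles of the partition.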
Which question to ask first? Finding the globally optimal tree is NP-complete (Hyafil & Rivest, 1976), so CART — the algorithm running live in Instrument M4.1 — is greedy: at each node, try every feature and every threshold, keep the single split that most purifies the two children, and recurse. Purity is measured by Gini impurity or entropy: EQ M4.2 — GINI IMPURITY AND SPLIT GAIN $$ G(S) \;=\; 1 - \sum_{c} p_c^{\,2}, \qquad \Delta \;=\; G(S) \;-\; \frac{|S_L|}{|S|}\,G(S_L) \;-\; \frac{|S_R|}{|S|}\,G(S_R) $$ \(p_c\) is the fraction of class \(c\) among the samples \(S\) at the node; \(G\) is the probability that two random draws from the node disagree — 0 when pure, 0.5 at a two-class coin flip. The split \(s\) sends samples to children \(S_L, S_R\), and CART picks the feature–threshold pair maximizing the gain \(\Delta\). The entropy alternative \(H(S) = -\sum_c p_c \log_2 p_c\) ("information gain") almost never produces a different tree — Gini wins on speed, not principle. WORKED EXAMPLE ▾ 01 Parent node \(S\): 8 samples, 4 ● and 4 ○. So \(p = (0.5, 0.5)\) and \(G(S) = 1 - (0.25 + 0.25) = 0.5\) — maximal two-class impurity. 02 A candidate split sends 5 samples left (4 ●, 1 ○) and 3 right (0 ●, 3 ○). 03 \(G(S_L) = 1 - (0.8^2 + 0.2^2) = 1 - 0.68 = 0.32\). \(G(S_R) = 1 - (0 + 1) = 0\) — perfectly pure. 04 Gain: \(\Delta = 0.5 - \tfrac{5}{8}(0.32) - \tfrac{3}{8}(0) = 0.5 - 0.2 = 0.3\). CART runs this arithmetic for every feature–threshold pair and keeps the winner. RESULT: Δ = 0.30 ● SENT LEFT (OF 10) 8 ○ SENT LEFT (OF 10) 2 G_L = 0.320 · G_R = 0.320 · Δ = 0.180 A tree node holds 8 samples: 6 of class ● and 2 of class ○. What is its Gini impurity \(G = 1 - \sum_c p_c^2\) (EQ M4.2)? Fractions: \(p_● = 6/8 = 0.75\), \(p_○ = 2/8 = 0.25\). \(G = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 1 - 0.625 = \) 0.375. Less impure than the 0.5 of a 50/50 node, but not yet pure. 
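The arithmetic of EQ M4.2 is short enough to check directly. This is a minimal sketch that works from class counts; the function names `gini` and `split_gain` are illustrative choices, and the counts are the ones used in the worked example and the quiz above.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_c^2) from a tuple of per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0  # an empty node contributes no impurity
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gain(parent, left, right):
    """Gain: parent impurity minus the size-weighted child impurities."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))

# Worked example: 8 samples (4, 4) split into left (4, 1) and right (0, 3).
parent, left, right = (4, 4), (4, 1), (0, 3)
gain = split_gain(parent, left, right)
print(round(gini(parent), 3), round(gini(left), 3), round(gini(right), 3))
print(round(gain, 3))

# Quiz node: 6 of one class and 2 of the other.
quiz = gini((6, 2))
print(round(quiz, 3))
```

CART repeats the `split_gain` calculation for every feature and threshold at a node and keeps the split with the largest value.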
A parent node of 8 samples (4 ●, 4 ○, so \(G = 0.5\)) is split into a left child of 4 (3 ●, 1 ○) and a right child of 4 (1 ●, 3 ○). Each child has Gini \(0.375\). What is the split gain \(\Delta\) (EQ M4.2)? \(\Delta = G(S) - \tfrac{|S_L|}{|S|}G(S_L) - \tfrac{|S_R|}{|S|}G(S_R) = 0.5 - \tfrac{4}{8}(0.375) - \tfrac{4}{8}(0.375) = 0.5 - 0.1875 - 0.1875 = \) 0.125. A modest, balanced improvement — CART would still prefer this over any split that purifies less. FIG M4.A A DEPTH-2 TREE IS A PARTITION INTO THREE RECTANGLES x₂ < 0.21 ? T F R1 · PREDICT ● x₁ < 1.04 ? T F R2 · PREDICT ○ R3 · PREDICT ● x₂ = 0.21 x₁ = 1.04 R1 · ● R2 · ○ R3 · ● x₁ → x₂ Same object, two views. The tree (left) and the partition (right) are identical. Note that class ● already owns a non-convex region — two splits suffice — and that the boundaries are axis-parallel by construction: a tree can only approximate a diagonal frontier with a staircase. Trees buy three practical superpowers that gradient-trained models lack. They are invariant to any monotone rescaling of a feature (only the ordering of values matters, so no normalization, ever); they ingest mixed numeric and categorical columns without ceremony; and a small tree is genuinely readable — you can print it as a flowchart and hand it to an auditor. Their weakness is the same as their strength: predictions are piecewise-constant, so a lone tree is jagged, unstable, and cannot extrapolate a trend beyond the data it saw. The atomic unit is the decision stump — a depth-1 tree, one question, two answers. The cell below performs the exact inner-loop search that CART runs at every node: scan every threshold, score every split by Gini gain, keep the best. Boosting (§4.5) will assemble hundreds of barely-better-than-chance learners like this one into a precision instrument. 
PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(3)
# 1-D labels with a true step at x = 0.35, plus 10% label noise
n = 200
x = rng.uniform(0, 1, n)
y = (x > 0.35).astype(int)
flip = rng.random(n) < 0.10
y[flip] = 1 - y[flip]

def gini(labels):
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)   # = 1 - p² - (1-p)² for two classes

parent = gini(y)
best_gain, best_t = -1.0, None
for t in np.sort(x)[1:]:   # candidate thresholds: every sorted x except the smallest (left child never empty)
    L, R = y[x < t], y[x >= t]
    w = len(L) / n
    gain = parent - (w * gini(L) + (1 - w) * gini(R))
    if gain > best_gain:
        best_gain, best_t = gain, t

pred = (x >= best_t).astype(int)
print(f"parent Gini: {parent:.4f}")
print(f"best threshold: x = {best_t:.4f} (true step at 0.35)")
print(f"Gini gain: {best_gain:.4f}")
print(f"stump accuracy: {(pred == y).mean():.3f} (noise ceiling ≈ 0.90)")

Try raising the label noise to 0.3 and watch the gain, and the recovered threshold, degrade.

4.3 Overfitting a tree: depth is capacity

Nothing in CART tells it when to stop. Left alone, it splits until every leaf is pure — and a tree of depth \(d\) can carve up to \(2^d\) rectangles, so by depth 20 it has a private box for every training point, noise included. Train accuracy climbs monotonically with depth; test accuracy rises, peaks where the tree has captured the signal, then falls as additional splits start chiseling the noise. That divergence is the whole concept of overfitting, and you can watch it happen below: the instrument fits a real CART (the exact greedy Gini search of EQ M4.2) on 160 seeded "two-moons" points, paints its decision regions, and scores it against 80 held-out test points.

INSTRUMENT M4.1 — DEPTH DIAL REAL CART · REFIT LIVE IN JS · EQ M4.2 MAX DEPTH 3 ACCURACY VS DEPTH — TRAIN · TEST TRAIN ACC — TEST ACC — GENERALIZATION GAP — LEAVES —
Solid dots are training points, hollow rings are the held-out test set; the painted regions are the tree's verdict at every pixel block.
Depth 1–2 underfits — one axis cut cannot bend around a moon. Depth 3–4 is honest. By depth 9 the tree is perfect on train and busy fencing off single noisy points; on this seed, test accuracy slides from 85% back to 80% while train hits 100%. The curves below the map plot the full sweep — the widening mint–blue scissors after depth ~3 is overfitting, drawn live. Production practice caps capacity directly — maximum depth, minimum samples per leaf, minimum gain to split, or post-hoc pruning — and Chapter 06 treats this trade-off in its full generality as bias versus variance. But the idea is bigger than trees, and k-NN makes the point beautifully because its capacity dial runs backwards: small \(k\) means high capacity. At \(k=1\) the model memorizes — every training point wins its own private island of space — while large \(k\) averages over wide neighborhoods and the boundary relaxes into something smooth. Same data below; watch the islands dissolve. INSTRUMENT M4.2 — k-NN k-SLIDER BRUTE-FORCE VOTE · SAME DATA · EQ M4.1 NEIGHBORS k 7 TRAIN ACC (SELF-VOTE) — TEST ACC — REGIME — Every painted block is a genuine brute-force vote over all 160 training points. At k = 1 train accuracy reads 100% — each point votes for itself, which is exactly how memorization flatters itself — yet test accuracy is the worst on the dial (81% on this seed vs 89% at k = 7). Slide right and the speckled islands vanish; push to k = 31 and watch the thin tails of each moon get annexed by the opposing majority — global accuracy holds steady here, but the local geometry is visibly wrong. 4.4 Ensembles I: bagging and random forests A deep tree is a low-bias, high-variance learner: it can represent almost anything, but refit it on a slightly different sample and you get a visibly different partition. 
Bagging (bootstrap aggregating, Breiman 1996) exploits that instability instead of fighting it: draw \(B\) bootstrap samples (sample \(n\) points with replacement — each bag contains about \(1 - 1/e \approx 63.2\%\) of the unique points), grow a deep, deliberately unpruned tree on each, and average their votes. Why averaging helps is one line of statistics: EQ M4.3 — VARIANCE OF AN AVERAGE $$ \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \hat{f}_i(x) \right) \;=\; \frac{\sigma^2}{n} \qquad \text{(uncorrelated predictors)} $$ Each tree's prediction errs with variance \(\sigma^2\); averaging \(n\) independent errors divides the variance by \(n\). The catch: trees grown on overlapping bootstrap samples are correlated, and with pairwise correlation \(\rho\) the variance is \(\rho\sigma^2 + \tfrac{1-\rho}{n}\sigma^2\). Averaging annihilates the second term but the \(\rho\sigma^2\) floor survives no matter how many trees you add — so the entire engineering problem becomes: decorrelate the trees. Four uncorrelated trees each predict with error variance \(\sigma^2 = 0.8\). By EQ M4.3, what is the variance of their averaged prediction? \(\dfrac{\sigma^2}{n} = \dfrac{0.8}{4} = \) 0.2. Averaging four independent learners quarters the variance — the entire reason bagging works, and the reason random forests fight so hard to keep their trees uncorrelated. A random forest is bagging plus one decorrelation trick: at every split, the tree may only consider a random subset of the features (classically \(\sqrt{p}\) of \(p\) for classification). Strong features stop dominating every tree, \(\rho\) drops, and the variance floor drops with it. Two free gifts follow. Because each tree never saw ~37% of the data, scoring each point with only the trees that missed it yields out-of-bag error — an honest validation estimate with no held-out split. And since trees are independent, training parallelizes perfectly. 
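Both numbers in this section, the variance floor \(\rho\sigma^2\) and the roughly 63.2% of unique points per bootstrap bag, can be checked with a short sketch. The variance is computed exactly from the covariance matrix of equally correlated predictors rather than by simulation; the correlation value 0.3, the dataset size, and the seed are arbitrary illustrative choices.

```python
import numpy as np

def var_of_average(sigma2, rho, n):
    """Variance of the mean of n predictors with variance sigma2 and
    pairwise correlation rho, computed as w^T Cov w with w = 1/n."""
    cov = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))
    w = np.full(n, 1.0 / n)
    return float(w @ cov @ w)

def closed_form(sigma2, rho, n):
    """The formula from the text: rho*sigma^2 + (1 - rho)/n * sigma^2."""
    return rho * sigma2 + (1 - rho) / n * sigma2

sigma2 = 0.8
print(var_of_average(sigma2, 0.0, 4))  # quiz case: four uncorrelated trees

# With correlated trees the variance approaches the floor rho * sigma2.
rho = 0.3
for n in (1, 10, 100, 1000):
    print(n, round(var_of_average(sigma2, rho, n), 4),
          round(closed_form(sigma2, rho, n), 4))
print("floor:", rho * sigma2)

# Bootstrap: fraction of distinct points in one bag of size n_points.
rng = np.random.default_rng(0)
n_points = 100_000
bag = rng.integers(0, n_points, size=n_points)  # sample with replacement
unique_frac = np.unique(bag).size / n_points
print(round(unique_frac, 3), round(1 - 1 / np.e, 3))
```

Adding trees only shrinks the \((1-\rho)/n\) term, which is why the feature subsampling in a random forest targets \(\rho\) instead.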
The honest ledger: random forests are astonishingly hard to break — near-default hyperparameters land within a few percent of optimal on most tabular tasks — but they pay in memory and latency (hundreds of deep trees), their predictions remain step functions, and like all trees they cannot extrapolate beyond the convex hull of what they saw: a forest trained on 2019 prices will never predict a 2026 price above the 2019 maximum. 4.5 Ensembles II: gradient boosting Bagging builds strong learners in parallel and averages away variance. Boosting does the opposite: build weak learners — shallow trees, depth 4–8 — in sequence, each one trained on what the ensemble so far still gets wrong, attacking bias instead. After \(m-1\) rounds the model is \(F_{m-1}\); the next tree fits its mistakes: EQ M4.4 — THE BOOSTING UPDATE $$ F_m(x) \;=\; F_{m-1}(x) \;+\; \eta\, h_m(x), \qquad h_m \;\text{ fit to the residuals }\; r_i = y_i - F_{m-1}(x_i) $$ For squared loss, the residuals \(r_i\) literally are the errors left over. The general recipe — fit \(h_m\) to the negative gradient of any differentiable loss, evaluated at the current predictions — is what makes it gradient boosting: the same descent idea as Chapter 02, except each "step" is an entire tree, taken in function space rather than parameter space. The shrinkage \(\eta\) (typically 0.01–0.3) deliberately under-commits to each tree; small \(\eta\) plus more rounds almost always generalizes better. Unlike bagging, boosting will overfit as rounds accumulate — early stopping on a validation set is part of the algorithm, not an optional extra. WORKED EXAMPLE ▾ 01 Targets \(y = (3, 5, 10)\). Round 0: \(F_0 = \) the mean \(= 18/3 = 6\) for every input. 02 Residuals \(r = y - F_0 = (-3, -1, 4)\). Fit tree \(h_1\) to these; suppose it nails them exactly. 03 With shrinkage \(\eta = 0.3\): \(F_1 = 6 + 0.3\,(-3, -1, 4) = (5.1,\; 5.7,\; 7.2)\). 
04 New residuals: \((3 - 5.1,\; 5 - 5.7,\; 10 - 7.2) = (-2.1,\; -0.7,\; 2.8)\) — each exactly \(0.7\times\) the old. Every round multiplies what's left by \((1 - \eta)\): deliberate under-commitment, many rounds.
RESULT: residuals shrink ×0.70 per round at η = 0.3

For one input: target \(y = 10\), current prediction \(F_{m-1} = 6\), and the new tree fits the residual exactly (\(h_m = 4\)). With shrinkage \(\eta = 0.3\), what residual \(y - F_m\) remains after this boosting round (EQ M4.4)?

Update: \(F_m = F_{m-1} + \eta\,h_m = 6 + 0.3\cdot 4 = 7.2\). New residual: \(y - F_m = 10 - 7.2 = \) 2.8 — exactly \((1 - \eta) = 0.7\) times the old residual of \(4\). Shrinkage means each round deliberately leaves most of the error for the next tree.

The modern implementations are the tabular kings. XGBoost (2016) added second-order gradients and explicit regularization on leaf weights; LightGBM made training fast at scale with histogram-binned splits and leaf-wise growth; CatBoost specializes in categorical columns via ordered target statistics. A decade of Kaggle tabular leaderboards is, to a first approximation, a history of these three libraries.

| Property | Random forest | Gradient boosting |
| --- | --- | --- |
| Trees built | in parallel, independent | in sequence, each fixing the last |
| Error attacked | variance (EQ M4.3) | bias (EQ M4.4) |
| Tree shape | deep, unpruned | shallow (depth 4–8 / 31–255 leaves) |
| Tuning sensitivity | low — defaults nearly optimal | moderate — η, rounds, depth interact |
| Overfits with more trees? | no (variance floor, never worse) | yes — early stopping required |
| Typical accuracy on tables | strong | strongest — the default to beat |

4.6 When trees beat deep learning

This encyclopedia spends three volumes on neural networks, so honesty demands this section. On medium-sized tables — the ~1K-to-1M-row, heterogeneous-column datasets that constitute most of applied machine learning in industry — tuned gradient-boosted trees have beaten tuned deep models for most of the past decade.
The careful benchmark of Grinsztajn et al. (NeurIPS 2022) compared XGBoost and random forests against MLPs, ResNets, and tabular transformers across ~45 datasets with equal tuning budgets, and the trees won outright on most of them. Three reasons survive scrutiny: Tabular targets are irregular. Neural networks carry a smoothness prior; real-world table columns (thresholded business rules, saturation effects, encoded categories) produce jagged, discontinuous target functions that piecewise-constant trees fit natively. Uninformative features are everywhere. A tree simply never splits on a useless column. An MLP must spend capacity learning to ignore it, and in low-data regimes it often fails to. Axis alignment is the right prior. Table columns are individually meaningful — "age", "income" — and the correct decision boundaries really do tend to run parallel to them. The rotation invariance that serves vision models is exactly the wrong inductive bias here. The frontier is genuinely moving, and the honest 2026 picture is contested at the edges: TabPFN-style models — transformers pre-trained on millions of synthetic tables that "fit" a new dataset by in-context learning, no gradient steps at all — now beat tuned GBDTs on many small tables (roughly ≤10K rows), as published in Nature in 2025. Deep models also win wherever the table stops being a table: text or image columns, massive row counts, transfer from pretrained representations, or the need for embeddings that feed a larger system. But the default has not changed: faced with a fresh tabular problem, the strong move is still a LightGBM or XGBoost baseline, trained in minutes on a CPU. Anything more exotic must beat that number to earn its complexity — and most of the time, it doesn't. NEXT Every method so far was handed labels. 
Chapter 05 takes them away: k-means clustering stepped one Lloyd iteration at a time, PCA as organized variance-hunting, and the idea that data can describe itself — the first step toward the embeddings that power everything in Volumes II and beyond.

§ Further reading
Cover, T. & Hart, P. (1967). Nearest Neighbor Pattern Classification. — the founding analysis of k-NN, including the bound that 1-NN error is at most twice the Bayes error.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. — the CART monograph that defines the greedy splitting, impurity, and pruning used by every modern tree.
Breiman, L. (2001). Random Forests. — introduces bagging plus random feature subsets and explains why averaging decorrelated trees lowers variance.
Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. — the paper that frames boosting as gradient descent in function space.
Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. — the engineering and regularization advances that made gradient boosting the default for tabular data.
Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? — the empirical case for trees over neural nets on heterogeneous tables.

## VOL I · 05 · Clustering & Dimensionality (https://ai-encyclopedia.com/ml/05-unsupervised.html)

VOLUME I — FOUNDATIONS OF ML · CHAPTER 05 / 08
Clustering & Dimensionality
Every method so far was handed the right answers. This chapter takes them away.
With no labels, the model must find structure the data carries on its own: groups that belong together, directions that matter, coordinates worth keeping. Data can describe itself, and a model that compresses data well has, in a measurable sense, understood it. Scaled up, that is how every modern language model learns.

LEVEL CORE READING TIME ≈ 22 MIN BUILDS ON CH 01 · 02 · 04 INSTRUMENTS LLOYD STEPPER · VARIANCE HUNT
IN THIS CHAPTER
5.1 Taking the labels away
5.2 k-means: Lloyd's loop
5.3 What k-means cannot see
5.4 PCA: variance hunting
5.5 From compression to embeddings
§ Further reading

5.1 Taking the labels away

Supervised learning runs on a luxury: someone already did the job correctly, two million times, and wrote the answers down. That luxury is expensive — labels cost human hours, expert hours, sometimes biopsy results — and it is scarce in exactly the places data is abundant. Server logs, transaction streams, sensor traces, the text of the internet: almost everything ever recorded arrives unlabeled. Unsupervised learning is the discipline of extracting structure from \(x\) alone, with no \(y\) anywhere in the problem. What can structure mean, when nobody defines success for you? Three recurring answers, two of which this chapter builds from scratch:

| Task | The question it asks | Canonical method | You have already met it as… |
| --- | --- | --- | --- |
| Clustering | Which points belong together? | k-means (§5.2) | k-NN's neighborhoods (Ch 04), minus the labels |
| Dimensionality reduction | Which directions carry the signal? | PCA (§5.4) | feature scaling's smarter sibling (Ch 02) |
| Density estimation | What does typical look like — and what doesn't? | histograms, mixtures, … | the distribution \(\mathcal{D}\) behind EQ M1.3 |

The honest difficulty is not algorithmic — both algorithms below fit in a dozen lines. It is that without labels there is no ground truth to be scored against.
A supervised model is wrong when it disagrees with \(y\); a clustering is "wrong" only relative to a purpose nobody encoded in the data. Every unsupervised method therefore optimizes a proxy — compactness, variance, reconstruction — and the practitioner's job is judging whether the proxy matches the purpose. Keep that skepticism switched on for the rest of the chapter; both instruments below are built to reward it. One taxonomy note before we start. The regime that ate the world is a hybrid: self-supervised learning manufactures labels out of the raw data itself — mask a word and predict it, crop an image and match it, take a prefix and predict the next token. The loss is supervised machinery (cross-entropy, Ch 03); the labels cost nothing, exactly like the methods here. Section 5.5 traces the line from this chapter to that one. 5.2 k-means: Lloyd's loop The oldest serious answer to "which points belong together" is geometric: choose \(k\) cluster centers, assign every point to its nearest center, and call a clustering good when points sit close to their assigned centers. That sentence is already the objective — the unsupervised stand-in for a loss function, with no label anywhere in it: EQ M5.1 — THE K-MEANS OBJECTIVE (INERTIA) $$ J\big(c_{1..n},\, \mu_{1..k}\big) \;=\; \sum_{i=1}^{n} \big\lVert x_i - \mu_{c_i} \big\rVert^{2}, \qquad c_i \in \{1, \dots, k\} $$ \(\mu_j\) are the \(k\) centroids; \(c_i\) names the cluster point \(i\) is assigned to; \(J\) — called inertia, or within-cluster sum of squares — totals every point's squared distance to its own centroid. Minimizing \(J\) jointly over assignments and centroids is NP-hard even for \(k = 2\). Nobody minimizes it exactly; everybody uses the 1957 heuristic below. WORKED EXAMPLE ▾ 01 Six points on a line: 1, 2, 3, 8, 9, 10. Set \(k = 2\) and place the centroids badly: \(\mu_1 = 4\), \(\mu_2 = 10\). 
02 ASSIGN: point 8 is 4 away from \(\mu_1\) but only 2 from \(\mu_2\), so \(S_1 = \{1,2,3\}\), \(S_2 = \{8,9,10\}\). Inertia: \(J = (1{-}4)^2 + (2{-}4)^2 + (3{-}4)^2 + (8{-}10)^2 + (9{-}10)^2 + 0 = 9{+}4{+}1{+}4{+}1 = 19\). 03 UPDATE: \(\mu_1 \leftarrow \mathrm{mean}(1,2,3) = 2\), \(\mu_2 \leftarrow \mathrm{mean}(8,9,10) = 9\). New \(J = 1{+}0{+}1{+}1{+}0{+}1 = 4\). 04 Re-ASSIGN moves nothing — converged. One Lloyd sweep cut \(J\) from 19 to 4, this data's global optimum. Now drag the centroids yourself and try to beat it. RESULT: J = 19 → 4 IN ONE SWEEP CENTROID μ₁ 4.0 CENTROID μ₂ 10.0 J = 19.00 Four 1-D points: 0, 2, 10, 12, with two centroids fixed at \(\mu_1 = 1\) and \(\mu_2 = 11\). Each point joins its nearest centroid. What is the inertia \(J = \sum_i (x_i - \mu_{c_i})^2\)? Assign by nearness: 0 and 2 go to \(\mu_1 = 1\); 10 and 12 go to \(\mu_2 = 11\). Squared distances: \((0-1)^2 = 1\), \((2-1)^2 = 1\), \((10-11)^2 = 1\), \((12-11)^2 = 1\). Each centroid sits exactly one unit from both of its points, so \(J = 1+1+1+1 = \) 4. Lloyd's algorithm attacks \(J\) the way you would untangle any two-variable problem: freeze one variable, optimize the other, alternate. Both half-steps have closed forms, and both can only push \(J\) down: EQ M5.2 — LLOYD'S TWO STEPS $$ \textbf{ASSIGN:}\;\; c_i \,\leftarrow\, \arg\min_{j} \,\lVert x_i - \mu_j \rVert^{2} \qquad\qquad \textbf{UPDATE:}\;\; \mu_j \,\leftarrow\, \frac{1}{\lvert S_j \rvert} \sum_{i \in S_j} x_i $$ ASSIGN is optimal for the current centroids by definition of "nearest"; UPDATE is optimal for the current assignments because the mean is the point minimizing total squared distance to a set — the same fact that made squared error pick the average in Chapter 01. Each half-step lowers \(J\) or leaves it fixed, \(J \ge 0\), and only finitely many assignments exist, so the loop must converge — in remarkably few sweeps, in practice. 
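To make the two half-steps concrete, here is a minimal sketch that replays the six-point worked example above (points 1, 2, 3, 8, 9, 10 with starting centroids 4 and 10). The helper names `assign`, `update`, and `inertia` are hypothetical, and the sketch assumes every cluster keeps at least one point, so it does not handle empty clusters.

```python
import numpy as np

def assign(x, mu):
    """ASSIGN: index of the nearest centroid for each 1-D point."""
    return np.abs(x[:, None] - mu[None, :]).argmin(axis=1)

def update(x, c, k):
    """UPDATE: each centroid moves to the mean of its assigned points
    (assumes no cluster is empty)."""
    return np.array([x[c == j].mean() for j in range(k)])

def inertia(x, mu, c):
    """EQ M5.1: total squared distance from each point to its own centroid."""
    return float(((x - mu[c]) ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
mu = np.array([4.0, 10.0])              # the deliberately bad start

c = assign(x, mu)
J_before = inertia(x, mu, c)            # inertia after ASSIGN, before UPDATE
mu_new = update(x, c, k=2)
c_new = assign(x, mu_new)               # re-ASSIGN against the moved centroids
J_after = inertia(x, mu_new, c_new)

print("assignments:", c, "J =", J_before)
print("new centroids:", mu_new, "J =", J_after)
print("assignments unchanged:", np.array_equal(c, c_new))
```

Running it reproduces the hand calculation: inertia goes from 19 to 4, and the second ASSIGN changes nothing, which is the stopping condition.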
The fine print: it converges to a local minimum that depends entirely on where the centroids started. After ASSIGN, one cluster holds these four points (x-coordinate only): 3, 4, 6, 7. The UPDATE step moves the centroid to \(\mu = \frac{1}{|S|}\sum_{i \in S} x_i\). What is the new centroid's x-coordinate? UPDATE replaces a centroid with the mean of its assigned points: \(\mu = (3 + 4 + 6 + 7)/4 = 20/4 = \) 5. This is the point minimizing total squared distance to the set — the closed form that makes UPDATE optimal for fixed assignments. Run the loop by hand. The instrument seeds three well-separated blobs, drops \(k\) centroids on random data points, and gives you the two half-steps as buttons. Watch \(J\) after every press — it never rises, which is the convergence proof happening in front of you. INSTRUMENT M5.1 — LLOYD STEPPER 180 SEEDED POINTS · EQ M5.2 BY HAND · ✕ = CENTROID CLUSTERS k 3 CONTROL STEP AUTO ▶ RE-INIT ↻ NEXT HALF-STEP ASSIGN FULL SWEEPS 0 INERTIA J — EQ M5.1 — Step until the readout says CONVERGED — with k = 3 it takes only a handful of sweeps, and most starts land at J ≈ 47, the (lucky) global optimum. Now press RE-INIT a few times and re-run: some draws put two centroids in the same blob, and Lloyd converges — fully, honestly converged — at J ≈ 480, one centroid straddling two blobs forever. Convergence is not correctness. Then sweep k: at k = 2 a blob pair is forcibly merged; at k = 6 real blobs get carved into fragments — and J still goes down, because adding centroids always lowers inertia. J can compare runs at the same k; it cannot choose k for you (§5.3). The same loop in numpy — eight lines of algorithm, fully vectorized. The printed inertia drops hard, then freezes: that plateau is convergence (no assignment changed, so nothing can move again). 
PYTHON · RUNNABLE IN-BROWSER

```python
import numpy as np
rng = np.random.default_rng(5)

# three blobs, then k-means from scratch: Lloyd's two steps, verbatim
centers = np.array([[-2.0, -1.0], [2.0, -1.2], [0.2, 1.8]])
X = np.vstack([rng.normal(c, 0.55, (60, 2)) for c in centers])

k = 3
mu = X[rng.choice(len(X), k, replace=False)]               # init: k random points
for it in range(6):
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # n x k squared distances
    c = d.argmin(1)                                        # ASSIGN (EQ M5.2)
    J = d[np.arange(len(X)), c].sum()
    mu = np.array([X[c == j].mean(0) for j in range(k)])   # UPDATE (EQ M5.2)
    print(f"sweep {it}: inertia J = {J:8.2f}")

print("\nrecovered centroids (true: -2,-1 / 2,-1.2 / 0.2,1.8):")
print(np.round(mu, 2))
# plot_scatter is not defined in this cell; presumably a plotting helper
# supplied by the site's in-browser runtime
plot_scatter(X[:, 0], X[:, 1], c)
```

Edits are live on the site: try k = 5, or rng seed 12 for a different init.

Production reality, in three habits. (1) Never one run: restart from many random inits and keep the lowest \(J\) — or use k-means++ seeding (sample each new centroid with probability proportional to its squared distance from the nearest centroid already chosen), which is the default in scikit-learn and carries an \(O(\log k)\) approximation guarantee in expectation. (2) Scale features first — squared Euclidean distance inherits every pathology Chapter 02 warned about, and an unscaled column silently owns the clustering. (3) At web scale, use minibatch k-means: same two steps, estimated on samples, for the same reason SGD exists.

5.3 What k-means cannot see

k-means is fast, simple, and everywhere — and it is opinionated in ways the output never confesses. The objective is built from squared Euclidean distance to a single center, so the method implicitly assumes every cluster is a compact, roughly spherical, roughly equal-sized ball.
Hand it anything else and it will still return \(k\) tidy clusters, with total confidence, and they will be wrong:

| Baked-in assumption | How real data breaks it | What to reach for |
| --- | --- | --- |
| Clusters are spherical | elongated or curved groups — Chapter 04's two moons get sliced crosswise, not traced | spectral clustering; DBSCAN |
| Similar size & density | one dense core next to a sparse halo: the boundary lands where the variances balance, not where humans see it | Gaussian mixtures (EM) — k-means with ellipses and soft assignments |
| Every point belongs somewhere | outliers drag centroids; a single rogue point can claim a centroid at large k | DBSCAN (has a noise label); trim or robustify first |
| k is known | it almost never is | elbow on J, silhouette score — both heuristics, neither decisive |

On choosing \(k\): inertia alone cannot do it — you proved in Instrument M5.1 that \(J\) falls monotonically as \(k\) grows, all the way to the absurd optimum of one centroid per point (\(J = 0\), the lookup table of Chapter 01 wearing a new disguise). The elbow method plots \(J\) against \(k\) and looks for the bend where added centroids stop paying; the silhouette score compares each point's distance to its own cluster against the nearest foreign one. Both are judgment calls dressed as numbers. When a downstream task exists — clusters feeding a classifier, segments feeding a campaign — let its metric choose \(k\); that converts an unanswerable unsupervised question back into a measurable supervised one, and it is the most honest trick in this chapter.

5.4 PCA: organized variance-hunting

Clustering compresses \(n\) points into \(k\) prototypes. The other great compression runs crosswise: keep every point, but shrink the number of coordinates used to describe it. Real datasets are wildly redundant — square footage and room count move together; pixel 2,001 mostly agrees with pixel 2,002.
Redundancy means the data hugs a lower-dimensional sheet inside its nominal space, and principal component analysis finds the best flat sheet by a beautifully blunt criterion: keep the directions along which the data varies most. Center the data (always — PCA is blind without it), then ask for the unit direction that maximizes the variance of the projections: EQ M5.3 — PCA AS VARIANCE MAXIMIZATION $$ u_1 \;=\; \arg\max_{\lVert u \rVert = 1} \; u^{\top} \Sigma\, u, \qquad \Sigma \;=\; \frac{1}{n} \sum_{i=1}^{n} \big(x_i - \bar{x}\big)\big(x_i - \bar{x}\big)^{\top} $$ \(\Sigma\) is the covariance matrix; \(u^\top \Sigma u\) is exactly the variance of the data projected onto \(u\). The maximizer is the top eigenvector of \(\Sigma\), and the variance it captures is its eigenvalue \(\lambda_1\); the second component is the best direction orthogonal to the first, and so on down the spectrum. The twin reading matters just as much: by Pythagoras, variance kept + variance lost = total, so the direction that captures the most variance is the same direction that minimizes squared perpendicular reconstruction error. Maximal information and minimal distortion are one criterion, not two. WORKED EXAMPLE ▾ 01 A centered 2-D cloud whose covariance has variances \(\Sigma_{11} = \Sigma_{22} = 2\) and covariance \(\Sigma_{12} = 1\). Total variance = trace = \(2 + 2 = 4\). 02 Axis-aligned guess \(u = (1, 0)\): \(u^\top \Sigma u = \Sigma_{11} = 2\) — keeps \(2/4 = 50\%\). 03 Diagonal guess \(u = (1,1)/\sqrt{2}\): \(u^\top \Sigma u = (2 + 1 + 1 + 2)/2 = 3\) — keeps \(3/4 = 75\%\). The cross-term \(\Sigma_{12}\) pays out when the axis follows the correlation. 04 No direction beats it: for \(u = (\cos\theta, \sin\theta)\), \(u^\top \Sigma u = 2 + \sin 2\theta \le 3\), with equality at \(\theta = 45°\) — the top eigenvector, eigenvalue \(\lambda_1 = 3\). Sweep \(\theta\) below. 
RESULT: PC1 AT 45° CAPTURES 3/4 = 75% ANGLE θ 0° uᵀΣu = 2.00 · 50.0% of total 4 A diagonal covariance \(\Sigma = \begin{psmallmatrix}4 & 0\\ 0 & 1\end{psmallmatrix}\) and a unit direction \(u = (0.6,\, 0.8)\) (note \(0.6^2 + 0.8^2 = 1\)). How much variance does \(u\) capture, \(u^\top \Sigma u\)? For a diagonal \(\Sigma\) the cross-terms vanish, so \(u^\top \Sigma u = u_1^2\,\Sigma_{11} + u_2^2\,\Sigma_{22} = 0.6^2 \cdot 4 + 0.8^2 \cdot 1 = 0.36 \cdot 4 + 0.64 \cdot 1 = 1.44 + 0.64 = \) 2.08. (The top eigenvector here is \((1,0)\) with eigenvalue 4; this off-axis \(u\) captures less.) Hunt the direction yourself before the eigensolver does. Below is a centered, correlated cloud; the slider aims a candidate axis, the red stalks are what projection onto that axis throws away, and the lower chart sweeps EQ M5.3's objective across every angle: INSTRUMENT M5.2 — VARIANCE HUNT 200 SEEDED POINTS · u T Σu LIVE · RED = WHAT PROJECTION DISCARDS PROJECTION ANGLE θ 0° VARIANCE CAPTURED vs ANGLE — EQ M5.3 SWEPT OVER ALL θ VARIANCE CAPTURED u T Σu — SHARE OF TOTAL — DISCARDED (RED STALKS) — PC1 — EIGENVECTOR ANSWER — Drag θ and watch the two readouts trade against each other — captured + discarded is constant, the Pythagoras identity of EQ M5.3. At θ = 0° (plain "keep the x-axis") you keep 66% of the variance; the dashed mint axis, at 32.5°, keeps 87.9% — and no angle does better, because that is the top eigenvector of this sample's Σ. One honest wrinkle: the cloud was generated along 34°. The 1.5° gap is sampling noise — PCA recovers the truth of your sample, which is never quite the truth of the world (Chapter 06 is about exactly this gap). How many components to keep? 
The eigenvalue spectrum is the budget sheet — and the dropped eigenvalues are not vaguely "lost information", they are exactly the reconstruction error: EQ M5.4 — THE BUDGET SHEET $$ \text{variance kept by } d \text{ of } D \text{ components} \;=\; \frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{D} \lambda_j}, \qquad \frac{1}{n}\sum_{i=1}^{n} \big\lVert x_i - \hat{x}_i \big\rVert^{2} \;=\; \sum_{j=d+1}^{D} \lambda_j $$ \(\hat{x}_i\) is the reconstruction from the kept \(d\) components. The right-hand identity is the Eckart–Young theorem in its friendliest costume: among all linear projections to \(d\) dimensions, PCA's is the one with the smallest possible reconstruction error, and that error equals the sum of the eigenvalues you dropped. Practitioners keep enough components for 90–99% of variance, or cut where the spectrum visibly cliffs. WORKED EXAMPLE ▾ 01 A 3-D dataset with eigenvalue spectrum \(\lambda = (4.5,\, 0.4,\, 0.1)\). Total variance \(= 4.5 + 0.4 + 0.1 = 5.0\). 02 Keep \(d = 1\): variance kept \(= 4.5 / 5.0 = 90\%\). Reconstruction MSE = sum of dropped eigenvalues \(= 0.4 + 0.1 = 0.5\). 03 Keep \(d = 2\): kept \(= (4.5 + 0.4)/5.0 = 98\%\). MSE \(= \lambda_3 = 0.1\) — exactly, not approximately. 04 Read the cliff: the first coordinate buys 90 points of variance, the second buys 8, the third buys 2. The budget sheet says where to stop — here, \(d = 2\). RESULT: d = 2 → 98% KEPT · MSE = 0.1 An eigenvalue spectrum \(\lambda = (6,\, 3,\, 1)\). You keep the top \(d = 2\) components. What percentage of the total variance is retained? Total variance \(= 6 + 3 + 1 = 10\). Kept by the top 2: \(6 + 3 = 9\). Fraction \(= 9/10 = 0.90\), i.e. 90 %. The discarded \(\lambda_3 = 1\) is exactly the reconstruction error (the Eckart–Young identity of EQ M5.4). 
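The budget sheet is easy to verify numerically. The sketch below first redoes the arithmetic of the worked example for the spectrum (4.5, 0.4, 0.1), then checks the right-hand identity of EQ M5.4 on synthetic data. The data set is hypothetical: 500 points drawn with roughly that spectrum and mixed by a random rotation, so the sample eigenvalues will be close to, but not exactly, the nominal ones. The covariance uses the \(1/n\) convention of EQ M5.3, which is what makes the identity hold exactly rather than approximately.

```python
import numpy as np

# Part 1: the worked spectrum, variance kept for d = 1, 2, 3
lam_worked = np.array([4.5, 0.4, 0.1])
kept_worked = np.cumsum(lam_worked) / lam_worked.sum()
print("variance kept:", np.round(kept_worked, 2))

# Part 2: reconstruction error equals the sum of dropped eigenvalues
rng = np.random.default_rng(0)
n, D = 500, 3
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))             # random orthogonal mixing
X = (rng.normal(size=(n, D)) * np.sqrt(lam_worked)) @ Q.T

Xc = X - X.mean(axis=0)                                  # center first
cov = Xc.T @ Xc / n                                      # 1/n convention (EQ M5.3)
lam, U = np.linalg.eigh(cov)                             # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                           # flip to largest-first

errors, dropped = [], []
for d in range(1, D + 1):
    Ud = U[:, :d]                                        # top-d principal directions
    Xr = Xc @ Ud @ Ud.T                                  # project, then reconstruct
    errors.append(((Xc - Xr) ** 2).sum(axis=1).mean())   # mean squared norm per point
    dropped.append(lam[d:].sum())                        # eigenvalues thrown away
    print(f"d={d}: error={errors[-1]:.4f}  dropped={dropped[-1]:.4f}")
```

Note that the error here is the mean squared norm per point, as in EQ M5.4; averaging over individual matrix entries instead would divide the figure by the number of columns.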
PYTHON · RUNNABLE IN-BROWSER

```python
import numpy as np
rng = np.random.default_rng(11)

# a correlated 2-D cloud: most variance lives along one hidden direction
n = 300
t = rng.normal(0, 1.9, n)        # signal along the hidden axis
s = rng.normal(0, 0.6, n)        # noise across it
ang = np.deg2rad(34)
X = np.column_stack([t*np.cos(ang) - s*np.sin(ang),
                     t*np.sin(ang) + s*np.cos(ang)])

Xc = X - X.mean(0)               # center first, always
C = Xc.T @ Xc / n                # covariance matrix (EQ M5.3)
lam, U = np.linalg.eigh(C)       # eigh: ascending order...
lam, U = lam[::-1], U[:, ::-1]   # ...so flip to largest-first

print("eigenvalues:", np.round(lam, 3))
print("explained variance %:", np.round(100*lam/lam.sum(), 1))
print("PC1 angle (true 34deg):",
      round(np.degrees(np.arctan2(U[1, 0], U[0, 0])) % 180, 1))

Z = Xc @ U[:, :1]                # project to 1-D: the embedding
Xr = Z @ U[:, :1].T              # reconstruct from 1 component
# .mean() averages over all n*2 matrix entries, so this MSE is the dropped
# eigenvalue divided by 2 (EQ M5.4 is stated per point, not per entry)
print("reconstruction MSE:", round(float(((Xc - Xr)**2).mean()), 4))
print("dropped eigenvalue/2:", round(lam[1]/2, 4))
```

Edits are live on the site: shrink the noise to 0.1 and watch PC1 snap to 34° and the MSE collapse.

FINE PRINT Three ways PCA lies to the unwary. (1) It is scale-sensitive: measure one feature in millimeters instead of meters and it manufactures a fake principal direction — standardize first unless the units are genuinely shared. (2) Variance is not importance. PCA never saw your labels; the discriminating signal for a downstream task can sit in a low-variance direction PCA throws away first. (3) It is strictly linear: a sheet, never a curve. Data on Chapter 04's moons or a spiral has structure no single flat projection can keep — the cue for everything in the next section.

5.5 From compression to embeddings

Look again at what the PCA cell printed: it took each point and re-expressed it as coordinates \(z\) in a learned system where the axes are ordered by how much they matter.
That object has a modern name — an embedding: a compact vector representation in which geometry encodes meaning, so that nearby vectors are similar things. PCA is the simplest embedding machine ever built, and the lineage from here to the frontier is unusually direct: Nonlinear map-making: t-SNE and UMAP. For looking at high-dimensional data, these bend the sheet, preserving local neighborhoods at the cost of global honesty. Use them for eyes, never for arithmetic: cluster sizes, inter-cluster distances, and density in such plots are artifacts of the optimizer as much as of the data, and both methods will draw confident-looking islands in pure noise. Autoencoders. Replace PCA's linear projection with Chapter 07's networks: an encoder squeezes \(x\) into a low-dimensional bottleneck, a decoder reconstructs, and the loss is reconstruction error — EQ M5.4's right-hand side, made trainable by gradient descent. A linear autoencoder provably recovers the PCA subspace; nonlinear ones learn curved sheets PCA cannot. Self-supervision: data labels itself. Delete part of the data and train a supervised model to restore it — next-token prediction is precisely this (Vol II · Ch 04). The "labels" are free, the scale is the internet, and the representations that fall out are the embeddings inside every LLM. Embeddings meet Chapter 04. Today's retrieval stack is this chapter's two ideas reassembled: a learned embedding (this section) plus nearest-neighbor search over it (Ch 04's k-NN) is vector search — the machinery under RAG and semantic search. Even the curse of dimensionality resolves on cue: embeddings work because real data concentrates near low-dimensional structure, which is the bet PCA placed first. The through-line deserves saying plainly. k-means compresses a dataset into \(k\) prototypes; PCA compresses it into \(d\) directions; an autoencoder into a bottleneck; a language model into weights that can regenerate the statistics of its corpus. 
In every case the test is the same — how much of the data can the summary give back? — and in every case better compression has meant something uncomfortably like better understanding. That is not a metaphor the field decorates itself with; it is the actual objective the largest training runs in history are minimizing. NEXT Your toolbox is now complete enough to be dangerous — supervised and unsupervised, parametric and memory-based. Chapter 06 supplies the discipline that decides whether any of it survives contact with new data: bias against variance, the capacity U-curve and its modern double-descent twist, regularization, and the validation hygiene that separates a result from an artifact. § Further reading Lloyd, S. (1982). Least Squares Quantization in PCM. — the iterative assign-then-recenter algorithm now universally known as k-means (written 1957, published 1982). Arthur, D. & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. — the smart initialization that fixes Lloyd's sensitivity to starting centroids. Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. — the geometric origin of principal component analysis as best-fit subspaces. Hotelling, H. (1933). Analysis of a Complex of Statistical Variables into Principal Components. — the variance-maximization formulation of PCA used today. van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. — the canonical nonlinear method for embedding high-dimensional data into 2-D maps. McInnes, L., Healy, J. & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. — a faster, structure-preserving alternative to t-SNE for embeddings. 
## VOL I · 06 · Generalization: Bias, Variance & Regularization (https://ai-encyclopedia.com/ml/06-generalization.html)

VOLUME I — FOUNDATIONS OF ML · CHAPTER 06 / 08
Generalization: Bias, Variance & Regularization
Any model with enough knobs can score perfectly on data it has already seen. That is memorization, and it is worth nothing. The only error that counts is measured on data the model never touched, and this chapter covers the three-way budget that governs it: systematic error (bias), sensitivity to the sample (variance), and noise no model can remove. Ridge penalties and early stopping are two expressions of the same trade-off.

LEVEL CORE READING TIME ≈ 24 MIN BUILDS ON CH 01 · 02 INSTRUMENTS DEGREE DIAL · RIDGE PATH
IN THIS CHAPTER
6.1 The bias–variance decomposition
6.2 Capacity & the U-curve
6.3 Regularization
6.4 Validation discipline
6.5 Early stopping & dropout
§ Further reading

6.1 The bias–variance decomposition

Assume the world generates labels as \(y = f(x) + \varepsilon\): a true function \(f\) corrupted by noise with variance \(\sigma^2\). You never see \(f\) — you see one training set \(\mathcal{D}\), a finite sample of that process, and you fit \(\hat{f}_{\mathcal{D}}\) to it. Had the sample come out differently, your model would have too. The honest question is therefore an average over training sets: how wrong is the procedure, not just this one fit?
For squared error the answer splits exactly into three parts: EQ M6.1 — BIAS–VARIANCE DECOMPOSITION $$ \mathbb{E}_{\mathcal{D},\,\varepsilon}\!\left[\big(y - \hat{f}_{\mathcal{D}}(x)\big)^{2}\right] \;=\; \underbrace{\big(f(x) - \bar{f}(x)\big)^{2}}_{\text{bias}^{2}} \;+\; \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x)\big)^{2}\right]}_{\text{variance}} \;+\; \underbrace{\sigma^{2}}_{\text{noise}} \qquad \bar{f}(x) = \mathbb{E}_{\mathcal{D}}\big[\hat{f}_{\mathcal{D}}(x)\big] $$ \(\bar{f}\) is the average model — what your procedure produces averaged over all training sets it might have been dealt. Bias² is how far that average sits from the truth: the error your model family makes systematically, even with infinite resamples. Variance is how much any single fit scatters around that average: the error of trusting one particular sample. Noise \(\sigma^2\) is the floor — no model, however clever, beats it. Capacity buys down bias by paying in variance; the exchange rate is the subject of this chapter. WORKED EXAMPLE ▾ 01 At a probe point, the truth is \(f(x_0) = 2.0\); label noise has \(\sigma = 0.5\), so the floor is \(\sigma^2 = 0.25\). 02 Train the same procedure on three different samples; the fits predict 2.6, 2.9, 2.3. Average model: \(\bar{f}(x_0) = (2.6 + 2.9 + 2.3)/3 = 2.6\). 03 Bias² \(= (2.0 - 2.6)^2 = 0.36\). Variance \(= (0^2 + 0.3^2 + (-0.3)^2)/3 = 0.18/3 = 0.06\). 04 Expected error \(= 0.36 + 0.06 + 0.25 = 0.67\) — and the biggest line item is bias: this family needs more capacity, not more data. 05 The dials below run a stylized family with \(\text{bias}^2 = 1/d^2\) and \(\text{variance} = \sigma^2 d / n\): drag capacity \(d\) and watch the budget rebalance into a U. RESULT: ERROR = 0.36 + 0.06 + 0.25 = 0.67 CAPACITY d 3 SAMPLES n 20 NOISE σ 0.50 — At a probe point a procedure has \(\text{bias}^2 = 0.49\), variance \(= 0.04\), and label-noise variance \(\sigma^2 = 0.09\). 
By EQ M6.1, what is the expected squared error? The decomposition is additive: expected error \(= \text{bias}^2 + \text{variance} + \sigma^2 = 0.49 + 0.04 + 0.09 = \) 0.62. Bias² dominates, so this family is underfitting — add capacity, not data. The archery reading: bias is your sights being misaligned — every arrow lands off-center the same way. Variance is an unsteady hand — arrows scatter, even though they center on the bullseye. A degree-1 polynomial fit to a cubic has misaligned sights: resample the data all you like, the average line is still a line, still wrong. A degree-12 polynomial has a violently unsteady hand: each resample produces a different contortion, and only their unreachable average is close to the truth. The decomposition is exact for squared loss. For classification under 0–1 loss the clean additive split breaks down (bias and variance interact through the decision boundary), but the qualitative trade-off survives and the vocabulary is used everywhere regardless. Honest usage: treat EQ M6.1 as a precise statement about regression and a sharp metaphor for everything else. 6.2 Capacity and the U-curve Capacity is the informal name for how rich a function family you are fitting — polynomial degree, tree depth, parameter count, training time. As capacity rises, training error falls monotonically: a bigger family always contains the smaller one, so the optimizer can only do better on the points it sees. Held-out error does something entirely different — it falls while added capacity is buying down bias, bottoms out, then climbs as the model starts spending its freedom on the noise. 
That is the classical U-curve, and the diagnosis table that goes with it is the most-used decision procedure in applied ML:

| Observation | Diagnosis | The move |
|---|---|---|
| Train error high, held-out error high and close to it | underfit · bias-dominated | More capacity, better features, train longer, weaken regularization |
| Train error near zero, held-out error far above it | overfit · variance-dominated | More data, stronger regularization, less capacity, early stopping |
| Both errors near the noise floor \(\sigma^2\) | converged | Stop. Further gains require better data, not a better model. |

The truth at a probe point is \(f(x_0) = 4.5\). The same procedure trained on three resampled datasets predicts 4, 5, and 6 there. What is the squared bias, \((f - \bar f)^2\)?

First the average model: \(\bar f = (4 + 5 + 6)/3 = 5\). Then bias\(^2 = (f - \bar f)^2 = (4.5 - 5)^2 = (-0.5)^2 = \) 0.25. Bias measures how far the procedure's average prediction sits from the truth — independent of how much any single fit scatters.

INSTRUMENT M6.1 — DEGREE DIAL (interactive; see the site). 18 noisy points · true f is a cubic · normal equations, live. Controls: polynomial degree d, NEW SAMPLE. Lower chart: train vs held-out MSE across all degrees, log scale, sweet spot marked. Readouts: train MSE (18 pts), held-out MSE (160 pts), generalization gap (held-out − train), regime.

The dashed ghost is the true cubic; the model never sees it. At d = 1, click NEW SAMPLE repeatedly: the line barely moves but is always wrong — pure bias. At d = 12, train MSE collapses while the curve thrashes wildly between resamples and held-out MSE explodes — pure variance. The lower chart is EQ M6.1 made empirical: train error only falls, held-out error is a U, and the sweet spot hugs the true degree 3.
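The resampling experiment described above (draw a new sample, refit, watch the prediction move) can also be run numerically. The sketch below is a Monte Carlo estimate of the bias² and variance terms of EQ M6.1 at a single probe point. It borrows the cubic truth, the 18 training points, and the noise level 0.18 from this chapter's Python cells; the probe point 0.5, the 500 repeats, and the helper name `bias_variance_at` are illustrative assumptions, not the instrument's own code.

```python
import numpy as np

def f(x):
    # Same cubic truth as the chapter's Python cells.
    return 1.5 * x**3 - 0.9 * x

def bias_variance_at(x0, degree, n_train=18, sigma=0.18, n_repeats=500, seed=0):
    """Estimate bias^2 and variance of a polynomial fit at the probe point x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set from the same world and refit.
        x = rng.uniform(-1, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        w, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
        # np.vander orders powers high-to-low, which is what np.polyval expects.
        preds[r] = np.polyval(w, x0)
    avg_model = preds.mean()              # Monte Carlo stand-in for f-bar(x0)
    bias2 = (f(x0) - avg_model) ** 2
    variance = preds.var()                # scatter of single fits around the average
    return bias2, variance

sigma = 0.18
results = {d: bias_variance_at(0.5, d, sigma=sigma) for d in (1, 3, 9)}

print(f"{'deg':>4}{'bias^2':>10}{'variance':>10}{'+ noise':>10}")
for d, (b2, var) in results.items():
    print(f"{d:>4}{b2:>10.4f}{var:>10.4f}{b2 + var + sigma**2:>10.4f}")
```

Under these assumptions the degree-1 row should be dominated by bias² and the degree-9 row by variance, with degree 3 sitting close to the noise floor. The estimates are noisy; raising `n_repeats` tightens them.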
PYTHON · RUNNABLE IN-BROWSER

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):  # the truth — unknown to the model
        return 1.5*x**3 - 0.9*x

    x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
    x_te = rng.uniform(-1, 1, 200); y_te = f(x_te) + rng.normal(0, 0.18, 200)

    def fit(x, y, d):  # least squares on the Vandermonde matrix
        w, *_ = np.linalg.lstsq(np.vander(x, d + 1), y, rcond=None)
        return w

    print(f"{'deg':>4}{'train MSE':>12}{'test MSE':>11}")
    for d in (1, 3, 11):
        w = fit(x_tr, y_tr, d)
        tr = np.mean((np.vander(x_tr, d+1) @ w - y_tr)**2)
        te = np.mean((np.vander(x_te, d+1) @ w - y_te)**2)
        print(f"{d:>4}{tr:>12.4f}{te:>11.4f}")
    print(f"\nirreducible noise floor sigma^2 = {0.18**2:.4f}")

Try d = 17, or 50 training points.

MODERN The U-curve is true but incomplete. Push capacity far past the point where the model can interpolate its training data exactly, and held-out error often falls a second time — double descent (Belkin et al. 2019; Nakkiran et al. 2019, who also found it epoch-wise). In the heavily overparameterized regime, gradient descent among the many zero-train-error solutions implicitly prefers low-norm, smooth ones — the optimizer regularizes even when you don't ask it to. This is the regime modern LLMs live in, and part of why "bigger is better" holds there (Vol II · Ch 04 scaling laws). Honest status: the classical U still governs the small-data regime — this page's instruments, most tabular work, most fine-tunes — and a complete theory unifying both regimes remains open.

6.3 Regularization: paying for smoothness

Choosing capacity by deleting parameters (degree 3, not 9) is a blunt dial. Regularization keeps the big model and instead charges it for complexity: add a penalty on the size of the weights to the training loss, and let a continuous knob \(\lambda\) set the price.
The two canonical currencies differ only in which norm they tax — and that one choice changes everything about the solution's character. EQ M6.2 — RIDGE (L2) $$ \hat{w}_{\text{ridge}} \;=\; \arg\min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \lambda \lVert w \rVert_2^2 \;=\; \big(X^{\top}X + \lambda I\big)^{-1} X^{\top} y $$ Still closed-form — the penalty just fattens the diagonal of \(X^\top X\), which is also why it cures the numerical singularity of high-degree fits. In the SVD picture, the component of the solution along a singular direction with singular value \(\sigma_i\) gets multiplied by \(\sigma_i^2 / (\sigma_i^2 + \lambda)\): strong, well-supported directions pass almost untouched while weak, noise-amplifying directions are crushed. Ridge shrinks every weight toward zero but never exactly to zero. WORKED EXAMPLE ▾ 01 One feature, so everything is scalar: \(X^\top X = \sum x_i^2 = 10\) and \(X^\top y = \sum x_i y_i = 8\). OLS: \(\hat{w} = 8/10 = 0.8\). 02 Ridge just fattens the denominator: \(\hat{w} = 8/(10 + \lambda)\). At \(\lambda = 1\): \(8/11 = 0.727\). 03 At \(\lambda = 10\): \(8/20 = 0.40\) — exactly half the OLS weight, because the penalty now equals the evidence \(\sum x_i^2 = 10\). 04 At \(\lambda = 90\): \(8/100 = 0.08\) — crushed but alive. No finite \(\lambda\) reaches zero: shrinkage, never selection. Drag \(\lambda\) and watch. RESULT: ŵ = 0.80 → 0.73 → 0.40 → 0.08 AS λ = 0, 1, 10, 90 PENALTY λ (LOG) 1.0 ŵ = 0.727 · ×0.909 of OLS One-feature ridge regression with \(X^\top X = 6\) and \(X^\top y = 12\). At penalty \(\lambda = 2\), the closed form is \(\hat w = X^\top y / (X^\top X + \lambda)\). Compute \(\hat w\). Ridge fattens the denominator: \(\hat w = 12 / (6 + 2) = 12/8 = \) 1.5. The unpenalized OLS weight would be \(12/6 = 2.0\), so the penalty shrinks it by a factor \(6/8 = 0.75\) — toward zero, but never reaching it. 
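The SVD reading of EQ M6.2 can be checked numerically. The sketch below builds a small random regression problem (the sizes, seed, and \(\lambda = 2\) are arbitrary assumptions chosen for illustration), solves ridge once through the closed form and once by multiplying each OLS component by \(\sigma_i^2 / (\sigma_i^2 + \lambda)\), and then replays the scalar worked example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 2.0                     # illustrative sizes and penalty
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.5, n)

# Closed form from EQ M6.2.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same solution through the thin SVD X = U diag(s) V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
ols_coords = (U.T @ y) / s                 # OLS solution expressed in the V basis
shrink = s**2 / (s**2 + lam)               # per-direction filter factor
w_svd = Vt.T @ (shrink * ols_coords)

print("filter factors      :", np.round(shrink, 3))
print("max |closed - svd|  :", np.abs(w_closed - w_svd).max())

# Scalar worked example: X^T X = 10, X^T y = 8.
scalar_path = {lam_: 8 / (10 + lam_) for lam_ in (0, 1, 10, 90)}
print("scalar ridge weights:", {k: round(v, 3) for k, v in scalar_path.items()})
```

Every filter factor lies strictly between 0 and 1 for \(\lambda > 0\), which is the "shrinkage, never selection" behaviour in numbers: directions with large singular values keep a factor near 1, weak ones are pushed toward 0 but never reach it.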
EQ M6.3 — LASSO (L1) $$ \hat{w}_{\text{lasso}} \;=\; \arg\min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \lambda \lVert w \rVert_1 \qquad \lVert w \rVert_1 = \textstyle\sum_{j} \lvert w_j \rvert $$ No closed form — the kink of \(\lvert \cdot \rvert\) at zero breaks the calculus, so lasso is solved by coordinate descent or proximal methods, whose core operation is the soft threshold \(S(z, \lambda) = \mathrm{sign}(z)\max(\lvert z \rvert - \lambda,\, 0)\). That \(\max\) is the point: weights whose evidence is weaker than \(\lambda\) are set to exactly zero. Lasso doesn't just shrink — it selects features, and the surviving support is often the deliverable. WORKED EXAMPLE ▾ 01 Apply the soft threshold \(S(z, \lambda) = \mathrm{sign}(z)\max(\lvert z \rvert - \lambda, 0)\) with \(\lambda = 0.3\) to three candidate weights \(z = (0.9,\, -0.4,\, 0.05)\). 02 \(S(0.9) = 0.9 - 0.3 = 0.6\). \(S(-0.4) = -(0.4 - 0.3) = -0.1\). Both survive, each pulled 0.3 toward zero. 03 \(S(0.05)\): the evidence \(0.05\) is weaker than the price \(0.3\), so \(\max(0.05 - 0.3,\, 0) = 0\) — exactly zero, not small. 04 Ridge at comparable strength multiplies all three by \(1/(1 + 0.3) \approx 0.77\) — everything survives, smaller. Lasso instead deleted the weak feature outright: a 2-feature model, not three shrunken ones. RESULT: w = (0.6, −0.1, 0) — FEATURE 3 SELECTED OUT Apply the lasso soft threshold \(S(z,\lambda) = \mathrm{sign}(z)\,\max(|z| - \lambda,\, 0)\) to the candidate weight \(z = 0.7\) with penalty \(\lambda = 0.3\). What is \(S(z,\lambda)\)? The magnitude \(|z| = 0.7\) exceeds the price \(\lambda = 0.3\), so \(\max(0.7 - 0.3,\, 0) = 0.4\); the sign is positive, giving \(S = \) 0.4. The weight survives but is pulled 0.3 toward zero. Had \(|z|\) been below 0.3, the result would have been exactly 0 — that is feature selection. 
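The soft threshold is short enough to write out directly. The sketch below implements \(S(z, \lambda)\) elementwise and reruns the worked example's three candidate weights next to the ridge-style multiplicative shrink \(1/(1 + \lambda)\) used for comparison there. The function name `soft_threshold` is an illustrative choice, and this is only the core operation, not a full coordinate-descent lasso solver.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([0.9, -0.4, 0.05])
lam = 0.3

lasso_like = soft_threshold(z, lam)   # subtract lam from each magnitude, clip at zero
ridge_like = z / (1.0 + lam)          # uniform multiplicative shrinkage

print("soft threshold:", lasso_like)
print("ridge-style   :", np.round(ridge_like, 3))
print("exact zeros   :", int(np.sum(lasso_like == 0.0)), "vs", int(np.sum(ridge_like == 0.0)))
```

The third weight comes back as an exact floating-point zero under the soft threshold, while the ridge-style shrink leaves all three weights nonzero.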
FIG M6.A — WHY L1 ZEROS WEIGHTS AND L2 ONLY SHRINKS THEM. [Figure: two panels on axes w₁, w₂, each showing loss contours around the unconstrained least-squares solution ŵ LS. Left, ridge: both coordinates shrunk, neither zero. Right, lasso: the solution lands on a corner, so w₁ = 0 exactly.]

Penalized fitting ≡ minimizing loss subject to a weight-norm budget. Blue ellipses are loss contours around the unconstrained minimum; the fit lands where the smallest reachable contour first touches the constraint set. The L2 ball is round, so first contact is almost never on an axis; the L1 diamond has corners on the axes, and corners win — that geometry is the entire reason lasso produces sparse models.

Weight decay is L2 — with one large caveat. For plain SGD, adding \(\tfrac{\lambda}{2} \lVert w \rVert_2^2\) to the loss and multiplying weights by \((1 - \eta\lambda)\) each step are the same update; with the penalty written as \(\lambda \lVert w \rVert_2^2\), as in EQ M6.2, the factor becomes \((1 - 2\eta\lambda)\). For adaptive optimizers they are not: Adam rescales the penalty's gradient per-coordinate along with everything else, quietly distorting the regularizer. AdamW fixes this by decoupling the decay from the adaptive machinery (Vol II · EQ 4.3) — which is why most modern LLM recipes say "AdamW, weight decay 0.1" rather than "L2 in the loss". Same idea, different plumbing, measurably different result.

INSTRUMENT M6.2 — RIDGE PATH (interactive; see the site). Degree-9 fit · λ from 1e-4 to 1e2 · EQ M6.2 live. Control: penalty λ (log slider). Chart: coefficient magnitudes |w₀| … |w₉| on a log bar scale, grey = unpenalized intercept. Readouts: train MSE, held-out MSE, ‖w‖₂ (excluding w₀).

Same data-generating world as Instrument M6.1 (a different draw of 18 points), but the model keeps all ten degree-9 coefficients and pays λ for their size. Drag right from 1e-4: the wiggle flattens, the coefficient bars collapse, and held-out MSE traces a U — too little λ re-creates overfitting, too much re-creates underfitting (the fit sags toward a flat line). λ is a capacity dial with infinite resolution. The intercept is conventionally left unpenalized; shrinking it would just bias predictions away from the data's mean.
PYTHON · RUNNABLE IN-BROWSER

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        return 1.5*x**3 - 0.9*x

    x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
    x_te = rng.uniform(-1, 1, 300); y_te = f(x_te) + rng.normal(0, 0.18, 300)

    d = 9
    Xtr, Xte = np.vander(x_tr, d + 1), np.vander(x_te, d + 1)
    I = np.eye(d + 1); I[-1, -1] = 0.0  # vander puts the intercept last — leave it unpenalized

    lams, mses = np.logspace(-6, 2, 41), []
    for lam in lams:
        w = np.linalg.solve(Xtr.T @ Xtr + lam * I, Xtr.T @ y_tr)  # EQ M6.2
        mses.append(float(np.mean((Xte @ w - y_te)**2)))

    b = int(np.argmin(mses))
    print(f"near-zero lam = 1e-6: test MSE = {mses[0]:.4f}")
    print(f"best lam = {lams[b]:<8.3g}: test MSE = {mses[b]:.4f}")
    print(f"crushing lam = 1e2: test MSE = {mses[-1]:.4f}")

    # plot_xy is not defined in this cell; it is presumably supplied by the
    # in-browser runtime. Outside the site, remove this line or swap in your own plot call.
    plot_xy(np.log10(lams), np.array(mses))  # the regularization U-curve

The x-axis is log10 λ; the U should bottom out mid-range.

6.4 Validation discipline

Every dial in this chapter — degree, \(\lambda\), stopping epoch — must be tuned against data the fit never saw, which forces the three-way split: train (fit parameters), validation (choose hyperparameters), test (touch once, report, stop). When data is scarce, k-fold cross-validation recycles it: split into \(k\) folds (5 or 10 is standard), train \(k\) times, each time holding out a different fold, and average the held-out scores. The average is a far lower-variance estimate of generalization than any single split — at \(k\times\) the compute. Once hyperparameters are chosen, refit on all of the non-test data. If you also want an honest estimate of the whole selection pipeline, nest a second CV loop around it; people skip this and quietly report optimistic numbers. The dominant failure mode is not bad math — it is leakage: information from the evaluation side contaminating the training side.
Leakage produces beautiful validation scores and production disasters, and it is almost always a pipeline bug, not a modeling bug:

| Leak | Horror story | The fix |
|---|---|---|
| Preprocessing leak | Scaler / imputer / feature-selector fit on the full dataset before splitting — test-set statistics seep into training | fit every transform inside the training fold only |
| Duplicate leak | Near-identical rows land on both sides of the split; the model "generalizes" to data it memorized | dedup before splitting; fuzzy-match, not exact-match |
| Temporal leak | Random split of time-ordered data — the model trains on the future it will be asked to predict | split by time; validate strictly forward |
| Group leak | Same patient's scans in train and test; the model learns the patient, scores brilliantly, transfers to nobody | split by group id, never by row |
| Target leak | A feature is a downstream echo of the label ("account_closed_date" predicting churn) | audit features for post-outcome information |

HYGIENE The test set is an instrument you can use once. Every decision influenced by test numbers — "try one more λ", "rerun with the other seed" — silently moves test data into the training loop; iterate enough and the test score becomes fiction. Kaggle's public-vs-private leaderboard shakeups are this effect measured at scale. The same failure operates on LLMs as eval contamination — benchmarks leaking into web-scale training corpora (Vol II · Ch 04 decontamination, and the fine-tuning pitfall list in Vol II · Ch 06). Different scale, identical sin: testing on something the model has, in any form, already seen.

6.5 Early stopping & dropout as regularizers

Two of the most-used regularizers never touch the loss function. Early stopping exploits the fact that training time is itself a capacity dial: gradient descent fits broad, smooth structure first and noise last, so the validation curve traces the familiar U over epochs.
The recipe is mechanical — evaluate on validation each epoch, checkpoint the best, stop after \(p\) epochs without improvement (patience), restore the best checkpoint. It is not merely a heuristic: for linear least squares, gradient descent stopped at step \(t\) is approximately ridge regression with \(\lambda \propto 1/(\eta t)\) — each direction of the solution gets pulled in at a rate set by its singular value, so stopping early leaves the weak, noise-dominated directions still near zero. Stopping early and penalizing weights are the same medicine through different needles. Dropout attacks variance from a different angle: during training, zero each hidden unit independently with probability \(p\) (and scale survivors by \(1/(1-p)\) — "inverted dropout" — so activation magnitudes match at inference, when nothing is dropped). Two readings coexist. The ensemble view: each step trains a different random subnetwork, and inference approximates averaging exponentially many of them — and averaging is variance reduction by construction. The co-adaptation view: no unit can rely on a specific partner that might vanish, so features are forced to be individually useful. For linear models, dropout works out to an L2-like penalty scaled by each feature's second moment (Wager et al. 2013) — once again, a familiar uniform. Honest modern footnote: dropout has largely vanished from LLM pre-training — one epoch over trillions of tokens means the binding constraint is underfitting, not overfitting — while weight decay and early stopping never left. But shrink the data and the classics return instantly: small-data fine-tunes ship with dropout on the adapters (the LoRA default of 0.05 in Vol II · Ch 06's recipe) and validation-based stopping. Regularization never became obsolete; it just follows the data-to-parameter ratio around. 
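The early-stopping claim above (gradient descent stopped at step \(t\) behaves approximately like ridge with \(\lambda \propto 1/(\eta t)\)) can be made concrete for linear least squares. The sketch below is a hedged illustration on a synthetic problem: the column scales, the seed, the step size \(\eta = 1/\sigma_{\max}^2\), and \(t = 50\) are all assumptions. It runs plain gradient descent from \(w = 0\), checks the iterate against its per-direction closed form, and prints the resulting filter factors beside ridge's at the matched penalty \(\lambda = 1/(\eta t)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 6
# Columns with very different scales give a mix of strong and weak directions.
X = rng.normal(size=(n, p)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
y = X @ rng.normal(size=p) + rng.normal(0, 0.5, n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eta = 1.0 / s.max() ** 2          # step size small enough for stable descent
t = 50                            # stop after t steps

# Plain gradient descent on 0.5 * ||y - Xw||^2, started at w = 0.
w = np.zeros(p)
for _ in range(t):
    w -= eta * X.T @ (X @ w - y)

# Closed form for the same iterate: singular direction i has recovered a
# fraction 1 - (1 - eta * s_i^2)^t of its least-squares value.
gd_factor = 1.0 - (1.0 - eta * s**2) ** t
w_formula = Vt.T @ (gd_factor * (U.T @ y) / s)

# Ridge filter factors at the matched penalty lam = 1 / (eta * t).
lam = 1.0 / (eta * t)
ridge_factor = s**2 / (s**2 + lam)

print("singular values  :", np.round(s, 2))
print("early-stop factor:", np.round(gd_factor, 3))
print("ridge factor     :", np.round(ridge_factor, 3))
```

The two rows of factors are not identical, only similar in shape: both sit near 1 for strong directions and fall toward \(\eta t \sigma_i^2\) for weak ones, which is the sense in which stopping early leaves the noise-dominated directions near zero.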
The full toolbox, ordered by how often it is the right answer: more data (the only regularizer with no downside), weight decay / L2, early stopping, dropout, data augmentation, smaller model, L1 when you need the zeros. All of them buy the same thing — lower variance — and all charge the same currency: a little added bias. NEXT You now own the budget every model must balance. Chapter 07 builds the first machine with enough capacity to need all of it: the multi-layer perceptron — perceptrons, hidden layers, activation functions, and a tiny network you can train on XOR in the page while you watch the decision boundary bend. § Further reading Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. — the paper that introduced the bias–variance decomposition to the learning community. Tikhonov, A. N. (1963). Solution of Incorrectly Formulated Problems and the Regularization Method. — the origin of L2 (ridge) regularization as a cure for ill-posed problems. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. — introduces the L1 penalty and the sparsity it induces. Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. — the canonical reference for dropout as stochastic regularization. Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. — the formal foundation of cross-validation and held-out model selection. Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019). Reconciling Modern Machine-Learning Practice and the Bias–Variance Trade-off. — the "double descent" result that complicates the classic U-curve. 
## VOL I · 07 · Neural Networks: The MLP (https://ai-encyclopedia.com/ml/07-neural-networks.html)

VOLUME I — FOUNDATIONS OF ML · CHAPTER 07 / 08

Neural Networks: The MLP

A linear model can draw exactly one flat boundary, and four points are enough to defeat it. The fix is small. Stack two linear maps with a nonlinear bend between them, and the model starts inventing its own features. This chapter builds the multi-layer perceptron, the unit cell of every network in this encyclopedia, and trains one live, in this page, on the problem the perceptron provably cannot solve.

LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON CH 02–03 · CH 06 INSTRUMENTS XOR PLAYGROUND · ACTIVATION GALLERY

IN THIS CHAPTER 7.1 The perceptron's limit 7.2 The MLP 7.3 Activation functions 7.4 Universal approximation 7.5 Width vs depth 7.6 Shapes discipline § Further reading

7.1 The perceptron and its limit

Rosenblatt's 1958 perceptron is a thresholded weighted sum: \( \hat{y} = \mathbf{1}[\, w^\top x + b > 0 \,] \). Geometrically it is a single hyperplane — everything on one side is class 1, everything on the other is class 0. The logistic regression of Chapter 03 softens the threshold into a sigmoid, but the geometry is identical: one flat cut through input space. For sixty years of statistics that was usually enough, because humans hand-engineered features until the classes became linearly separable. Then consider the smallest dataset in this encyclopedia — exclusive-or:

| x₁ | x₂ | x₁ XOR x₂ | corner |
|---|---|---|---|
| 0 | 0 | 0 | bottom-left |
| 0 | 1 | 1 | top-left |
| 1 | 0 | 1 | bottom-right |
| 1 | 1 | 0 | top-right |

The 1s sit on one diagonal, the 0s on the other. No line separates them, and the proof takes four inequalities. Suppose weights \(w_1, w_2, b\) existed.
The four points demand: \((0,0)\to 0\): \( b \le 0 \) \((1,0)\to 1\): \( w_1 + b > 0 \) \((0,1)\to 1\): \( w_2 + b > 0 \) \((1,1)\to 0\): \( w_1 + w_2 + b \le 0 \) Add the middle two: \( w_1 + w_2 + 2b > 0 \), so \( w_1 + w_2 + b > -b \ge 0 \) — directly contradicting the fourth. No linear model, however trained, can represent XOR. Minsky and Papert published this in 1969, funding for neural networks evaporated, and the result stood as the field's cautionary tale until people accepted the obvious-but-then-untrainable fix: more layers. Note the honest framing — the limitation was about single-layer machines; multi-layer ones were known to be more powerful but nobody could train them until backpropagation spread in 1986 (Chapter 08). 7.2 The MLP: linear, bend, linear The multi-layer perceptron inserts a hidden layer of \(d_h\) units between input and output, with an elementwise nonlinearity \(\varphi\) — the activation function — applied in between: EQ M7.1 — THE TWO-LAYER MLP $$ h = \varphi\!\big( W_1 x + b_1 \big), \qquad \hat{y} = W_2\, h + b_2, \qquad W_1 \in \mathbb{R}^{d_h \times d_{\text{in}}},\; W_2 \in \mathbb{R}^{d_{\text{out}} \times d_h} $$ Each row of \(W_1\) defines one hyperplane; hidden unit \(h_j = \varphi(w_j^\top x + b_j)\) is a squashed signed distance to it — a soft linear feature detector. The output layer is then an ordinary linear model in the feature space the network chose for itself. The bend is load-bearing: without \(\varphi\), the stack collapses — \(W_2(W_1 x + b_1) + b_2\) is just another linear model, no matter how many layers you pile up. We write a network's sizes as d in -d h -d out: the instrument below trains a 2-8-1. WORKED EXAMPLE ▾ 01 A hand-set 2-2-1 ReLU net (the working core of §7.2's Python cell): both rows of \(W_1\) are \((1, 1)\), \(b_1 = (0, -1)\), \(W_2 = (1, -2)\), \(b_2 = 0\). So \(h_1 = \mathrm{ReLU}(x_1 + x_2)\), \(h_2 = \mathrm{ReLU}(x_1 + x_2 - 1)\), \(\hat{y} = h_1 - 2h_2\). 
02 \(x = (1, 0)\): \(h_1 = \mathrm{ReLU}(1) = 1\), \(h_2 = \mathrm{ReLU}(0) = 0\), \(\hat{y} = 1 - 0 = 1\). ✓ 03 \(x = (1, 1)\): \(h_1 = \mathrm{ReLU}(2) = 2\), \(h_2 = \mathrm{ReLU}(1) = 1\), \(\hat{y} = 2 - 2 = 0\). ✓ The hinge in \(h_2\) — silent until \(x_1 + x_2\) exceeds 1 — is what notices "both on". 04 Delete \(\varphi\) and the stack collapses: \(W_2(W_1 x + b_1) + b_2 = 2 - x_1 - x_2\), a plane — and §7.1 proved no plane does XOR. Feed the net yourself below. RESULT: ŷ = 0, 1, 1, 0 ON THE FOUR CORNERS — XOR EXACT INPUT x₁ 1.00 INPUT x₂ 0.00 h = (1.00, 0.00) → ŷ = 1.00 One hidden unit with ReLU activation: weights \(w = (1, 2)\), bias \(b = -1\), input \(x = (2, 1)\). Its output is \(h = \mathrm{ReLU}(w^\top x + b)\). What is \(h\)? Pre-activation: \(w^\top x + b = 1\cdot 2 + 2\cdot 1 + (-1) = 2 + 2 - 1 = 3\). Since \(3 > 0\), \(\mathrm{ReLU}(3) = \max(0, 3) = \) 3. The unit is on its active half, so its gradient passes through undamped (\(\varphi' = 1\)). FIG 7.1 A 2-4-1 MLP — TWO MATRICES AND ONE BEND x₁ x₂ h₁ h₂ h₃ h₄ ŷ W₁ (4×2) + b₁ W₂ (1×4) + b₂ hⱼ = φ(wⱼ·x + bⱼ) INPUT ℝ² HIDDEN — 4 LEARNED FEATURES OUTPUT Every edge is one learned number. The first layer's rows are four hyperplanes; φ turns signed distances into soft features; the second layer is a plain linear model over those features. The network's only new trick is choosing its features itself. What does training do with this freedom? Watch it happen. The instrument below is a complete neural network — forward pass, backpropagation, gradient updates — implemented in this page with no library. It trains a 2-H-1 MLP (tanh hidden units, sigmoid output, cross-entropy loss, full-batch gradient descent) on seeded datasets a linear model cannot touch. 
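The training loop just described (tanh hidden units, sigmoid output, cross-entropy loss, full-batch gradient descent) fits in a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the instrument's source: the seed, the unit-variance initialization, the 2000 epochs, and the use of the four XOR corners as the whole dataset are illustrative choices. Only H = 8 and η = 0.5 are taken from the instrument's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])            # XOR targets

H, eta, epochs = 8, 0.5, 2000
W1 = rng.normal(0, 1.0, (H, 2)); b1 = np.zeros(H)      # (out, in) convention
W2 = rng.normal(0, 1.0, (1, H)); b2 = np.zeros(1)
n_params = W1.size + b1.size + W2.size + b2.size       # 2H + H + H + 1 = 4H + 1

losses = []
for _ in range(epochs):
    # Forward pass.
    A = np.tanh(X @ W1.T + b1)                         # hidden activations, (4, H)
    p = 1.0 / (1.0 + np.exp(-(A @ W2.T + b2)))         # predicted probabilities, (4, 1)
    pc = np.clip(p, 1e-12, 1 - 1e-12)                  # guard the logs
    losses.append(float(-np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))))
    # Backward pass, chain rule written out by hand.
    dz2 = (p - y) / len(X)                             # gradient at the output logit
    dW2, db2 = dz2.T @ A, dz2.sum(axis=0)
    dz1 = (dz2 @ W2) * (1.0 - A**2)                    # back through tanh
    dW1, db1 = dz1.T @ X, dz1.sum(axis=0)
    # Full-batch gradient step.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(f"parameters: {n_params}")
print(f"BCE loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("predictions:", np.round(p.ravel(), 2))
```

Whether a given seed reaches the XOR solution within 2000 epochs is not guaranteed; the loss trace is the thing to watch. Lowering `H` to 2 or raising `eta` reproduces, in miniature, the kinds of failure the instrument lets you explore.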
INSTRUMENT M7.1 — XOR PLAYGROUND REAL 2-H-1 MLP · FULL BACKPROP IN-PAGE · SEEDED DATASET XOR BLOBS TWO CIRCLES CONTROL TRAIN ▶ RESET LEARNING RATE η (LOG) 0.50 HIDDEN UNITS H 8 EPOCH 0 LOSS (BCE) — TRAIN ACCURACY — PARAMETERS (4H+1) 33 Press TRAIN and watch a straight prejudice bend into the right shape — with H = 8 and η = 0.5, XOR falls inside a few hundred epochs. Switch to TWO CIRCLES and the net closes a loop around the inner cluster. Now drop H to 2 on the circles: enclosing a region needs at least three half-planes, so it provably cannot — and H = 2 on XOR can represent the answer yet gradient descent often fails to find it (representable ≠ learnable, the theme of §7.4). Then push η toward 100: the loss curve goes ragged as each step overshoots the valley it is aiming for, and the boundary thrashes — full-batch descent on this bounded loss rarely explodes outright, but it stops descending. The loss curve tells you before the picture does. Before training finds weights, it helps to see that good weights exist. The cell below hard-codes a 2-8-1 ReLU network in which only two of the eight hidden units do any work — and XOR falls exactly. This is the existence proof; the instrument above is the search; Chapter 08 is the algebra of the search. 
PYTHON · RUNNABLE IN-BROWSER

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    # 2-8-1 MLP, weights set by hand: 2 of the 8 hidden units solve XOR, 6 idle
    W1 = np.zeros((8, 2)); b1 = np.zeros(8)
    W1[0] = [1, 1]; b1[0] = 0.0   # h0 = ReLU(x1 + x2)
    W1[1] = [1, 1]; b1[1] = -1.0  # h1 = ReLU(x1 + x2 - 1)
    W2 = np.zeros((1, 8)); W2[0, 0] = 1.0; W2[0, 1] = -2.0
    b2 = np.zeros(1)              # y = h0 - 2*h1

    X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
    H = relu(X @ W1.T + b1)  # (4, 8)
    Y = H @ W2.T + b2        # (4, 1)

    for x, y in zip(X, Y):
        print(f"x = {x.astype(int)} -> yhat = {y[0]:.1f} XOR = {int(x[0]) ^ int(x[1])}")

    print("\nhidden layer H (4 inputs x 8 units):")
    print(H)

Break it on purpose.

7.3 Activation functions

The choice of \(\varphi\) looks cosmetic and decided a decade of history. An activation must be nonlinear (or the stack collapses), nearly free to compute (it runs once per unit per example), and — the part nobody appreciated until networks got deep — it must pass gradients. The learning signal reaching layer 1 is a product of one factor per layer crossed:

EQ M7.2 — WHY GRADIENTS VANISH

$$ \frac{\partial \mathcal{L}}{\partial h^{(1)}} \;=\; \Bigg( \prod_{\ell=2}^{L} W_{\ell}^{\top}\, \mathrm{diag}\!\big( \varphi'(z^{(\ell)}) \big) \Bigg) \frac{\partial \mathcal{L}}{\partial h^{(L)}} $$

Every layer multiplies the backward signal by its weight matrix and by \(\varphi'\) evaluated where each unit currently sits. Sigmoid's derivative peaks at \(1/4\) — so through \(L\) sigmoid layers the signal shrinks like \((1/4)^L\) at best, and far faster once units saturate. Ten layers: \(\sim 10^{-6}\) of the gradient survives. ReLU's derivative is exactly 1 on the entire active half — gradients pass through unshrunk, which is most of why deep networks became trainable in 2012.
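The \(\varphi'\) factors in EQ M7.2 can be tabulated in a few lines. The sketch below deliberately ignores the weight matrices and multiplies only the activation slopes, so it isolates one ingredient of the product rather than simulating a real backward pass. The probe points z = 0.5 and z = 2.5 and the depth of 10 are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Local slope phi'(z) for three activations.
slopes = {
    "sigmoid": lambda z: sigmoid(z) * (1.0 - sigmoid(z)),
    "tanh":    lambda z: 1.0 - np.tanh(z) ** 2,
    "relu":    lambda z: float(z > 0),
}

depth = 10
surviving = {}
for name, dphi in slopes.items():
    for z in (0.5, 2.5):
        # Weight matrices are ignored: only the phi' factors of EQ M7.2 are multiplied.
        surviving[(name, z)] = float(dphi(z)) ** depth
        print(f"{name:8s} z = {z:3.1f}  slope = {dphi(z):.4f}  "
              f"slope^{depth} = {surviving[(name, z)]:.2e}")

# Sigmoid's best case: slope 1/4 at z = 0 in every one of the ten layers.
sigmoid_best = 0.25 ** depth
print(f"sigmoid best case (z = 0): {sigmoid_best:.2e}")
```

In a real network the weight matrices can amplify or further shrink the signal, so these numbers are a statement about the activation's contribution only.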
WORKED EXAMPLE

01 A healthy-looking sigmoid unit sitting at \(z = 2.5\): \(a = \sigma(2.5) \approx 0.924\), so its slope is \(\varphi' = a(1-a) \approx 0.924 \times 0.076 \approx 0.070\).

02 Ten layers multiply ten such factors: \(0.070^{10} \approx 3 \times 10^{-12}\) — about three trillionths of the loss signal reaches layer 1.

03 Even sigmoid's best case, \(\varphi' = 0.25\) at \(z = 0\): \(0.25^{10} = 1/4^{10} = 1/1{,}048{,}576 \approx 9.5 \times 10^{-7}\). Best case loses 99.9999%.

04 ReLU on its active half: \(\varphi' = 1\) exactly, so \(1^{10} = 1\) — the product that kills sigmoid stacks is neutral for ReLU. Drag the slope and depth below.

RESULT: 10 SIGMOID LAYERS ≤ 9.5e−7 OF SIGNAL · RELU = 1

(Interactive sliders on the site: per-layer slope φ′, depth L.)

A sigmoid unit outputs \(a = \sigma(z) = 0.9\). The sigmoid derivative is \(\varphi'(z) = a(1 - a)\). What is this unit's local slope \(\varphi'\)?

\(\varphi' = a(1 - a) = 0.9 \times (1 - 0.9) = 0.9 \times 0.1 = \) 0.09. Far below sigmoid's peak of 0.25 — this near-saturated unit barely passes gradient, and stacking such factors is what makes deep sigmoid networks untrainable.

| Activation | φ(z) | Range | max φ′ | Verdict |
|---|---|---|---|---|
| Sigmoid σ | 1 / (1 + e⁻ᶻ) | (0, 1) | 0.25 | Saturates on both sides; killed deep stacks. Survives as the output for probabilities and inside gates. |
| Tanh | 2σ(2z) − 1 | (−1, 1) | 1.00 | Zero-centered sigmoid; the RNN-era default; still saturates at both ends. |
| ReLU | max(0, z) | [0, ∞) | 1.00 | Derivative is 1 everywhere active; cheap; sparse. Risk: dead units stuck at φ′ = 0 forever. |
| GELU | z · Φ(z) | ≈ (−0.17, ∞) | ≈ 1.13 | Smooth ReLU weighted by the Gaussian CDF; default of the GPT-2/BERT era of transformers. |

INSTRUMENT M7.2 — ACTIVATION GALLERY (interactive; see the site). f and f′ on [−5, 5], saturation shaded. Control: activation (SIGMOID, TANH, RELU, GELU). Readouts: max f′ on [−5, 5]; saturated share (f′ < 0.05); f′(2.5)¹⁰, the 10-layer signal.

The red bands are where the derivative is effectively zero — a unit parked there learns nothing.
Click through the four and watch the last readout: it is the gradient surviving ten stacked layers for a unit sitting at z = 2.5. Sigmoid: ~10⁻¹². Tanh: ~10⁻¹⁶. ReLU: exactly 1. That single number is the argument that ended the sigmoid era. Note ReLU's own red zone — the entire negative half — which is the dead-unit risk in the table above. Where the lineage goes next. Modern LLMs use a gated refinement: SwiGLU (Vol II · EQ 2.3) multiplies a SiLU-squashed gate elementwise against a linear up-projection, letting the MLP modulate its own features. It is the direct descendant of the choices in this table — same constraint set, one more multiplicative trick. 7.4 Universal approximation — and its fine print The classical justification for all of this is the universal approximation theorem, in words: a feed-forward network with a single hidden layer and any non-polynomial activation can approximate any continuous function on a bounded region to any accuracy you name — provided you may make the hidden layer wide enough (Cybenko 1989 for sigmoids; Hornik 1991 in general). One layer of bends, in principle, suffices for everything. FINE PRINT Existence is not learnability. The theorem is non-constructive on every axis that matters: (1) it does not say how many units — worst-case width grows exponentially in the input dimension, the curse of dimensionality again; (2) it does not say the weights are findable — gradient descent on a non-convex loss carries no guarantee of reaching them (you watched H = 2 fail on a representable problem in Instrument M7.1); (3) it says nothing about generalization from finite data. In Chapter 06's language, it bounds approximation error only — estimation and optimization error are untouched. Read it, then, as a license rather than an explanation: MLPs are a hypothesis class with no permanent blind spots, unlike the perceptron. 
Why gradient-trained deep networks work as well as they do on real data remains a partially open research question in 2026 — be suspicious of anyone who cites this theorem as the answer.

7.5 Width, depth, and why depth wins

If one wide layer suffices in principle, why is every serious network deep? Because depth buys composition. A second hidden layer computes features of features: layer 1 finds edges in pixels, layer 2 assembles edges into textures and parts, layer 3 into objects. In text models the same hierarchy runs characters → morphemes → syntax → semantics. A wide-shallow network must build every high-level feature from raw inputs in one step; a deep one builds a vocabulary at each level and reuses it everywhere above — the same economy that makes subroutines beat straight-line code.

| | Wider (same depth) | Deeper (same width) |
|---|---|---|
| What you get | More parallel features at one level of abstraction | A hierarchy — features of features |
| Expressive power | grows polynomially | some functions need exponentially fewer units |
| Trainability | Benign; gradients stay healthy | EQ M7.2's product bites — needs ReLU-family φ, careful init, later residuals & normalization (Vol II · CH 02) |

The middle row has real theorems behind it: deep ReLU networks fold input space repeatedly, producing a number of linear regions that grows exponentially with depth but only polynomially with width (Montúfar et al. 2014), and there exist functions computable by a deep network that no shallow network of sub-exponential width can match (Telgarsky 2016). Honesty requires the caveat: those are worst-case constructions, not descriptions of your dataset. The practical reasons depth wins are that real data is compositional, and that a decade of engineering — initialization, normalization, residual connections — removed depth's optimization penalty. On flat tabular data, Chapter 04's gradient-boosted trees still routinely beat both wide and deep.
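The "folding" picture behind the depth theorems can be seen on a one-dimensional toy. The sketch below is an illustration in the spirit of those sawtooth constructions, not a reproduction of either paper's proof. It builds a tent map from two ReLU units, stacks it repeatedly, and counts the linear pieces of the result on [0, 1]. The tent weights, the dyadic grid, and the sign-change counter are all choices made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One hidden layer of two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(y):
    # For this sawtooth every breakpoint flips the slope's sign, so counting
    # sign changes of the finite differences counts the breakpoints.
    slope_sign = np.sign(np.diff(y))
    return 1 + int(np.sum(slope_sign[1:] != slope_sign[:-1]))

x = np.linspace(0.0, 1.0, 2**12 + 1)    # dyadic grid, so every breakpoint is a grid point
pieces = []
y = x.copy()
for depth in range(1, 6):
    y = tent(y)                          # stack one more two-unit layer
    pieces.append(count_linear_pieces(y))
    print(f"depth {depth}: {2 * depth} hidden units, {pieces[-1]} linear pieces")
```

Each extra layer adds two units and doubles the piece count. For comparison, a single hidden layer of m ReLU units on a scalar input has at most m breakpoints and therefore at most m + 1 linear pieces, so matching the depth-5 sawtooth in one layer would take 31 units rather than 10.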
7.6 The forward pass is matrix multiplication Everything above was written for one input vector. Real computation is batched: stack \(B\) examples as rows of \(X\), and the entire forward pass becomes two matrix multiplications — which is the only operation GPUs are truly built for, and the reason the whole field runs on them: EQ M7.3 — THE BATCHED FORWARD PASS $$ \underset{B \times d_h}{H} \;=\; \varphi\!\Big( \underset{B \times d_{\text{in}}}{X} \;\; \underset{d_{\text{in}} \times d_h}{W_1^{\top}} \;+\; \underset{1 \times d_h}{b_1} \Big), \qquad \underset{B \times d_{\text{out}}}{\hat{Y}} \;=\; H\, W_2^{\top} + b_2 $$ The single rule of shape discipline: inner dimensions must agree, and the batch dimension rides along untouched in the leftmost slot. The bias \(b_1\) is a single row, broadcast down all \(B\) rows. Frameworks store weights as (out, in) — hence the transposes. In Volume II the same ledger gains one axis, (B, T, d model), and otherwise nothing changes. WORKED EXAMPLE ▾ 01 Batch \(B = 32\) through the 2-8-1: \(X\) is \((32 \times 2)\), \(W_1^\top\) is \((2 \times 8)\) — inner dims \(2 = 2\) agree, so \(Z_1 = X W_1^\top\) is \((32 \times 8)\). 02 \(b_1\) is one row, \((1 \times 8)\), broadcast down all 32 rows; \(\varphi\) is elementwise, so \(H\) stays \((32 \times 8)\). 03 \(H\, (32 \times 8)\) times \(W_2^\top\, (8 \times 1)\) — inner dims \(8 = 8\) — gives \(\hat{Y}\) at \((32 \times 1)\). The batch axis rode through untouched, leftmost the whole way. 04 The bill: \(32 \times 8\) outputs at 2 multiply-adds each \(= 512\), plus \(32 \times 1\) at 8 each \(= 256\) — 768 MACs per batch. Parameters: \(16 + 8 + 8 + 1 = 33\). RESULT: (32×2) → (32×8) → (32×1) · 33 PARAMS · 768 MACs A two-layer MLP with sizes 3-5-2 (\(d_{\text{in}}=3\), \(d_h=5\), \(d_{\text{out}}=2\)). Counting every weight and bias, how many trainable parameters does it have? \(W_1\) is \(5 \times 3 = 15\), \(b_1\) is 5, \(W_2\) is \(2 \times 5 = 10\), \(b_2\) is 2. 
Total \(= 15 + 5 + 10 + 2 = \) 32. Every edge in the network diagram is one of these numbers.

Professionals debug networks by reciting shapes, not by reading values. Build the habit now — run the drill, then change d_h or batch and predict every line before re-running. If you can write the shape ledger of a network from memory, you understand its forward pass; there is nothing else in it.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(7)

B, d_in, d_h, d_out = 32, 2, 8, 1          # batch 32 through a 2-8-1
X  = rng.normal(size=(B, d_in))
W1 = rng.normal(size=(d_h, d_in)) * 0.5    # (out, in) convention
b1 = rng.normal(size=d_h) * 0.1
W2 = rng.normal(size=(d_out, d_h)) * 0.5
b2 = np.zeros(d_out)

Z1 = X @ W1.T + b1                         # inner dims: (B,d_in)(d_in,d_h)
H  = np.maximum(0, Z1)                     # ReLU, elementwise: shape unchanged
Y  = H @ W2.T + b2

ledger = [("X", X), ("W1", W1), ("Z1 = X @ W1.T + b1", Z1),
          ("H = ReLU(Z1)", H), ("W2", W2), ("Y = H @ W2.T + b2", Y)]
for name, A in ledger:
    print(f"{name:24s} {A.shape}")
print(f"\nReLU zeroed {(H == 0).mean():.0%} of H — sparsity is the default")
RUN ▶ predict each shape, then run

NEXT The forward pass is two matmuls; learning is the question of how blame flows backward through them. Chapter 08: the chain rule organized on a computational graph — backpropagation — plus momentum and Adam, the optimizers that turn raw gradients into progress. Instrument M7.1 already ran every line of it; next chapter you read its algebra.

§ Further reading
Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. — the founding single-layer model whose limits motivate the MLP.
Minsky, M. & Papert, S. (1969). Perceptrons. — the formal proof that a single perceptron cannot solve XOR, the limit Section 7.1 turns on.
Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. — the universal approximation theorem: one hidden layer can approximate any continuous function.
Hornik, K., Stinchcombe, M. & White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. — generalizes universality beyond sigmoids to broad activation classes.
Glorot, X. & Bengio, Y. (2010). Understanding the Difficulty of Training Deep Feedforward Networks. — the Xavier initialization and activation analysis that make deep MLPs trainable.
Nair, V. & Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. — the case for ReLU, now the default hidden activation.

## VOL I · 08 · Backpropagation & Optimization (https://ai-encyclopedia.com/ml/08-backpropagation.html)

VOLUME I — FOUNDATIONS OF ML · CHAPTER 08 / 08 Backpropagation & Optimization

Chapter 07 left a network with thirty-three knobs and one number telling it how wrong it is. This chapter is the algorithm that turns that one number into thirty-three precise instructions, or a trillion. Backpropagation is the chain rule, organized on a graph so that one backward sweep prices every parameter's share of the blame. The optimizers follow: SGD, momentum, and Adam, the machinery that turns raw gradients into progress.

LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON CH 02 · 03 · 07 INSTRUMENTS GRAPH STEPPER · OPTIMIZER RACE IN THIS CHAPTER 8.1 Credit assignment 8.2 Chain rule on a graph 8.3 Backprop, worked 8.4 Autodiff 8.5 SGD & minibatches 8.6 Momentum & Adam 8.7 Vanishing & exploding § Further reading

8.1 The credit-assignment problem

A network maps inputs through millions of weights to a single scalar loss. When that loss is bad, which weights are at fault, and by how much? That is credit assignment, and it is the whole problem of learning.
The output layer's culpability is easy to see — it touched the answer directly. But a weight three layers deep influenced the loss only through everything stacked above it; its blame arrives diluted, rerouted, and mixed with everyone else's. The brute-force answer exists and is worth respecting: nudge one weight by \(\varepsilon\), re-run the network, and watch the loss move — \(\partial L / \partial \theta_i \approx (L(\theta_i + \varepsilon) - L(\theta_i - \varepsilon)) / 2\varepsilon\). It is exactly correct in the limit and catastrophically expensive: two full forward passes per parameter, per step. For GPT-class models that is trillions of forward passes to compute what backpropagation delivers in roughly the cost of one. The brute-force method survives in one honorable role — as the referee that checks backprop implementations (§8.4) — and nowhere else. Backpropagation was applied to neural networks and popularized by Rumelhart, Hinton and Williams in 1986 (the underlying reverse-mode differentiation is older — Linnainmaa, 1970). It ended the seventeen-year winter that Chapter 07's perceptron proof began. The insight is not deep math; it is deep bookkeeping: the chain rule, applied once per node of a graph, in the right order, sharing every intermediate result. 8.2 The chain rule on a computational graph Stop thinking of a network as a formula and start thinking of it as a computational graph: a directed graph in which every node is a primitive operation (multiply, add, \(\sigma\), square) and every edge carries a value forward. The crucial move is to label each edge with its local derivative — not a symbolic expression but a number, evaluated at the values that just flowed through. The edge from \(z\) into \(a = \sigma(z)\) is labeled \(\partial a / \partial z = a(1-a)\): one number, known the moment the forward pass computes \(a\). 
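A minimal sketch of the two ideas just described, on a one-neuron model chosen here for illustration: \(L = \tfrac12(\sigma(wx + b) - y)^2\) with \(x = 2\), \(w = 0.5\), \(b = -0.5\), \(y = 1\). The function names and the `forward_passes` counter are introduced for this example only. It compares §8.1's brute-force central difference with the product of numeric edge labels, and counts how many forward passes the brute-force route spends.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

forward_passes = 0  # how many times the "network" has been run

def loss(theta, x=2.0, y=1.0):
    """L = 1/2 (sigmoid(w*x + b) - y)^2 for theta = [w, b]."""
    global forward_passes
    forward_passes += 1
    w, b = theta
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def central_difference(theta, eps=1e-5):
    """Brute force: two full forward passes per parameter."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grads.append((loss(up) - loss(down)) / (2 * eps))
    return grads

theta = [0.5, -0.5]                      # [w, b]
numeric = central_difference(theta)

# Chain rule with numeric edge labels, evaluated at the forward values.
x, y = 2.0, 1.0
a = sigmoid(theta[0] * x + theta[1])
dL_da = a - y                            # edge label on a -> L
da_dz = a * (1.0 - a)                    # edge label on z -> a: one number
analytic = [dL_da * da_dz * x,           # dL/dw (du/dw = x, dz/du = 1)
            dL_da * da_dz * 1.0]         # dL/db (dz/db = 1)

print("numeric :", numeric)
print("analytic:", analytic)
print("forward passes used by brute force:", forward_passes)
```

With two parameters the brute-force route already costs four forward passes; the chain-rule route reuses one forward value of \(a\) for both gradients.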
EQ M8.1 — THE CHAIN RULE, GRAPH FORM $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x} \qquad\Longrightarrow\qquad \frac{\partial L}{\partial v} \;=\; \sum_{c \,\in\, \mathrm{children}(v)} \frac{\partial L}{\partial c}\,\frac{\partial c}{\partial v} $$ Left: the one-step rule — blame flowing into \(x\) is blame at \(y\) times the local edge derivative. Right: the same rule on a graph — a node's gradient is the sum over its outgoing edges of (downstream gradient × local derivative). Multiply along paths, add across paths. Process nodes from the loss backward and every downstream gradient is already in hand when you need it: each edge is touched exactly once. That single scheduling decision is the entire difference between exponential and linear cost. WORKED EXAMPLE ▾ 01 Instrument M8.1's preset A: \(x = 2\), \(w = 0.5\), \(b = -0.5\), \(y = 1\). Forward: \(u = wx = 1.0\), \(z = u + b = 0.5\), \(a = \sigma(0.5) = 0.6225\), \(L = \tfrac{1}{2}(0.6225 - 1)^2 = 0.0713\). 02 Backward, starting at \(\partial L/\partial L = 1\): \(\partial L/\partial a = a - y = -0.3775\). 03 Through the sigmoid edge: \(\partial a/\partial z = a(1-a) = 0.6225 \times 0.3775 = 0.2350\), so \(\partial L/\partial z = -0.3775 \times 0.2350 = -0.0887\). 04 Split at the two parents: \(\partial L/\partial w = -0.0887 \times x = -0.1774\); \(\partial L/\partial b = -0.0887 \times 1 = -0.0887\). Every edge touched exactly once. 05 Both gradients are negative, so the update pushes \(w\) and \(b\) up — \(a\) rises toward \(y = 1\). Drag \(\eta\) below and watch one SGD step pay off. RESULT: ∂L/∂w = −0.1774 · ∂L/∂b = −0.0887 LEARNING RATE η 1.0 — On the graph \(L = \tfrac12(a-y)^2\) with \(a = \sigma(z)\): at this step \(a = 0.5\) and target \(y = 1\). The two edge derivatives are \(\partial L/\partial a = a - y\) and \(\partial a/\partial z = a(1-a)\). What is \(\partial L/\partial z\)? 
Chain-rule multiply along the path: \(\partial L/\partial a = 0.5 - 1 = -0.5\); \(\partial a/\partial z = 0.5(1 - 0.5) = 0.25\). So \(\partial L/\partial z = (-0.5)(0.25) = \) −0.125. The negative sign means raising \(z\) lowers the loss — the update pushes \(a\) up toward \(y = 1\). Walk it on the smallest model that exercises every move — one weight, one bias, a sigmoid, a squared loss: \( L = \tfrac{1}{2}(\sigma(w x + b) - y)^2 \). The forward pass fills node values left to right and records each edge's local derivative as it goes. The backward pass then starts from \(\partial L / \partial L = 1\) and multiplies its way left, one edge at a time. Step both directions yourself: INSTRUMENT M8.1 — GRAPH STEPPER L = ½(σ(w·x + b) − y)² · EVERY NUMBER COMPUTED LIVE PRESET A — x=2.0 · y=1 B — x=−1.0 · y=0 PASS FORWARD ▶ ◀ BACKWARD FORWARD: VALUES + LOCAL DERIVATIVES → ← BACKWARD: ∂L/∂(EVERYTHING), ONE SWEEP ∂u/∂w = x ∂u/∂x = w ∂z/∂u = 1 ∂z/∂b = 1 ∂a/∂z = a(1−a) ∂L/∂a = a−y w · weight — x · input — b · bias — u = w·x — z = u + b — a = σ(z) — L = ½(a−y)² — y · target — ∂L/∂w = — ∂L/∂x = — ∂L/∂b = — ∂L/∂u = — ∂L/∂z = — ∂L/∂a = — ∂L/∂L = — LOSS L — ∂L/∂w — ∂L/∂b — LOSS AFTER 1 SGD STEP (η = 1) — Press FORWARD: values fill left to right, and each edge's local derivative (blue) is recorded the moment its node computes — exactly what an autodiff tape stores. Press BACKWARD: gradients flow right to left in mint, each one the product of the downstream gradient and one blue edge label. On preset A you should land on ∂L/∂w = −0.1774; apply the η = 1 update to w and b and the last readout shows the loss genuinely drops. Preset B flips the target to 0 — watch ∂L/∂b change sign and every gradient shrink (the unit is already nearly right). Notice what the instrument makes obvious: the backward pass never recomputes anything. Local derivatives were priced during the forward pass; backward just multiplies and adds them in reverse topological order. 
And the node \(z\) — with one input from \(u\) and one from \(b\) — shows EQ M8.1's sum degenerating to single terms, while the inputs \(w\) and \(x\) each receive their gradient through one path. In a real network a hidden unit feeds many downstream nodes, and the sum over children is doing real work: that is the backward product of §8.3. 8.3 Backprop through a two-layer net, worked Now the classic: Chapter 07's MLP, \( h = \varphi(W_1 x + b_1) \), \( \hat{y} = \sigma(W_2 h + b_2) \), binary cross-entropy loss. Define \(\delta_\ell\) as the gradient of the loss with respect to layer \(\ell\)'s pre-activation — the quantity backprop actually ferries between layers. At the output, sigmoid and cross-entropy collapse into the cleanest result in the field: EQ M8.2 — OUTPUT-LAYER GRADIENT $$ \delta_2 \;\equiv\; \frac{\partial L}{\partial z_2} \;=\; \hat{y} - y, \qquad\quad \frac{\partial L}{\partial W_2} = \delta_2\, h^{\top}, \qquad \frac{\partial L}{\partial b_2} = \delta_2 $$ The \(\sigma'\) from the activation and the \(1/\hat{y}(1-\hat{y})\) from the cross-entropy cancel exactly — the same cancellation that made logistic regression's gradient clean in Chapter 03, and the reason this loss–activation pairing is universal. Prediction minus target: the error itself is the gradient signal. A weight's gradient is then (its layer's δ) × (the activation it multiplied) — a weight that fed on a zero activation gets zero blame, which is exactly fair. WORKED EXAMPLE ▾ 01 Output \(\hat{y} = 0.8\), target \(y = 1\): \(\delta_2 = \hat{y} - y = -0.2\). No \(\sigma'\), no log derivatives — they cancelled. 02 Hidden activations this step: \(h = (0.5,\, 2.0,\, 0.0)\). 03 \(\partial L/\partial W_2 = \delta_2 h^\top = (-0.2 \times 0.5,\; -0.2 \times 2.0,\; -0.2 \times 0) = (-0.1,\, -0.4,\, 0)\); \(\partial L/\partial b_2 = \delta_2 = -0.2\). 04 Read the fairness: the unit that shouted (\(h = 2.0\)) gets 4× the blame of the quiet one (\(0.5\)); the silent unit gets none. 
Credit assignment is literally proportional to participation. RESULT: δ₂ = −0.2 · ∂L/∂W₂ = (−0.1, −0.4, 0) Output \(\hat y = 0.8\), target \(y = 0.5\), so \(\delta_2 = \hat y - y\). One hidden unit fed activation \(h = 2.0\) into the output. By EQ M8.2, what is \(\partial L/\partial W_2\) for that weight (\(= \delta_2 \cdot h\))? First the output-layer signal: \(\delta_2 = \hat y - y = 0.8 - 0.5 = 0.3\) (sigmoid and cross-entropy cancel — the error itself is the gradient). Then \(\partial L/\partial W_2 = \delta_2 \cdot h = 0.3 \times 2.0 = \) 0.6. Blame is proportional to participation: a loud unit earns a large gradient. EQ M8.3 — HIDDEN-LAYER GRADIENT: THE BACKWARD PRODUCT $$ \delta_1 \;=\; \big( W_2^{\top}\, \delta_2 \big) \,\odot\, \varphi'(z_1), \qquad\quad \frac{\partial L}{\partial W_1} = \delta_1\, x^{\top}, \qquad \frac{\partial L}{\partial b_1} = \delta_1 $$ Read \(W_2^{\top} \delta_2\) as EQ M8.1's sum-over-children done for every hidden unit at once: the same weights that carried activations forward carry blame backward, transposed. Then \(\odot\, \varphi'(z_1)\) gates the blame through each unit's local slope — a saturated unit (\(\varphi' \approx 0\)) absorbs no gradient. Deeper nets just iterate this line: \(\delta_{\ell} = (W_{\ell+1}^{\top} \delta_{\ell+1}) \odot \varphi'(z_{\ell})\). One matrix multiply per layer, backward — the mirror image of the forward pass, at roughly twice its FLOPs. That is the entire algorithm. Forward to get values, EQ M8.2 to start the blame, EQ M8.3 once per hidden layer to pass it down, then step every parameter against its gradient. The cell below is the complete loop — a 2-4-1 network learning XOR from eight points, every gradient written by hand. 
Two hundred epochs, loss printed and plotted; the predictions at the end are the proof Chapter 07 promised:

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(3)

# 8 points on the XOR pattern -- the dataset Ch07 proved no linear model can fit
X = np.array([[0,0],[0,1],[1,0],[1,1],[.1,.1],[.1,.9],[.9,.1],[.9,.9]], float)
y = np.array([0,1,1,0,0,1,1,0], float).reshape(-1,1)

W1 = rng.normal(0, 1.0, (4,2)); b1 = np.zeros(4)      # a 2-4-1 net, tanh hidden
W2 = rng.normal(0, 1.0, (1,4)); b2 = np.zeros(1)
lr, losses = 2.0, []

for epoch in range(201):
    H = np.tanh(X @ W1.T + b1)                        # forward
    p = 1/(1 + np.exp(-(H @ W2.T + b2)))
    L = -np.mean(y*np.log(p+1e-9) + (1-y)*np.log(1-p+1e-9))
    losses.append(L)
    dZ2 = (p - y)/len(X)                              # EQ M8.2: error IS the gradient
    dW2 = dZ2.T @ H; db2 = dZ2.sum(0)
    dZ1 = (dZ2 @ W2) * (1 - H**2)                     # EQ M8.3: backward product, tanh gate
    dW1 = dZ1.T @ X; db1 = dZ1.sum(0)
    W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2   # gradient step
    if epoch % 25 == 0: print(f"epoch {epoch:3d} loss {L:.4f}")

print("\npredictions:", np.round(p.ravel(), 2))
print("targets: ", y.ravel())
plot_xy(list(range(len(losses))), losses)             # plot_xy: plotting helper supplied by the in-browser runtime, not NumPy
RUN ▶ change lr to 8.0 or the seed to 1 — watch the curve, not the code

Sixteen lines of algorithm, and they are the same sixteen lines that train a frontier model — more layers in the loop, attention and normalization among the ops, ~10¹² parameters instead of 17, but EQ M8.2 and M8.3 are doing all the work either way.

8.4 Autodiff: you never write gradients

You just hand-derived a network's gradients for the last time. Every framework implements automatic differentiation: as your code runs the forward pass, each primitive operation appends a node to a tape (in PyTorch, the grad_fn chain) recording which tensors fed it and how to compute its local derivative — precisely the blue edge labels of Instrument M8.1. Calling loss.backward() walks that tape in reverse topological order, applying EQ M8.1 at every node.
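To make the tape concrete, here is a minimal scalar sketch. It is an illustration only and not how PyTorch is implemented: the `Value` class, its methods, and the `backward` function are hypothetical names introduced here. Each operation records its parents together with the numeric local derivative, and `backward` visits the recorded nodes in reverse topological order, applying EQ M8.1. The inputs are Instrument M8.1's preset A.

```python
import math

class Value:
    """A scalar node: its value, its gradient, and (parent, local derivative) pairs."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(uv)/du = v and d(uv)/dv = u, stored as numbers
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def sigmoid(self):
        a = 1.0 / (1.0 + math.exp(-self.data))
        return Value(a, ((self, a * (1.0 - a)),))

    def half_squared_error(self, target):
        return Value(0.5 * (self.data - target) ** 2,
                     ((self, self.data - target),))

def backward(loss):
    """Reverse-mode sweep: multiply along edges, add across paths."""
    order, seen = [], set()

    def visit(node):                      # depth-first topological sort
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(loss)
    loss.grad = 1.0                       # dL/dL = 1
    for node in reversed(order):          # loss first, inputs last
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Preset A: x = 2, w = 0.5, b = -0.5, y = 1
w, x, b = Value(0.5), Value(2.0), Value(-0.5)
L = (w * x + b).sigmoid().half_squared_error(1.0)
backward(L)
print(f"dL/dw = {w.grad:.4f}, dL/db = {b.grad:.4f}, dL/dx = {x.grad:.4f}")
```

Running it reproduces the hand-computed \(\partial L/\partial w \approx -0.1774\) and \(\partial L/\partial b \approx -0.0887\) from §8.2, with every edge visited once.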
This is reverse-mode autodiff, and its defining property is the one that built deep learning:

Mode | One pass computes | Cost for n params → 1 scalar loss | Right when
Forward-mode | ∂(everything)/∂(one input) | n passes | Few inputs, many outputs
Reverse-mode | ∂(loss)/∂(everything) | 1 backward pass ≈ 2× forward FLOPs | Many params, one loss — i.e. all of ML
Numerical | one ∂L/∂θᵢ, approximately | 2n forward passes | Testing the other two

Reverse mode's bill arrives in memory, not time: every activation must be kept alive from the forward pass until the backward pass consumes it. That is why training a model needs several times the memory of running it, and why activation checkpointing (recompute instead of store — Vol II · CH 04) exists. Three practical PyTorch facts complete the picture: gradients accumulate into .grad (hence zero_grad() every step — forgetting it is the classic silent bug); the tape is rebuilt every forward pass, so Python control flow is differentiated for free; and anything inside torch.no_grad() records nothing, which is what makes inference cheap.

And the referee from §8.1 gets its one honorable job. Whenever a gradient is written by hand — a custom op, a new layer, a paper reimplementation — it is checked against central differences. The contract: analytic and numerical agree to ~10⁻⁷ in float64, or the backward pass is wrong.
Run the audit on a bias-free variant of §8.3's network, with an MSE loss in place of cross-entropy:

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(8,2)); y = rng.integers(0,2,(8,1)).astype(float)
theta = rng.normal(0, 0.6, size=12)        # 2-4-1 net, MSE loss, params flattened

def unpack(t): return t[:8].reshape(4,2), t[8:].reshape(1,4)

def loss(t):
    W1, W2 = unpack(t)
    p = 1/(1 + np.exp(-(np.tanh(X @ W1.T) @ W2.T)))
    return np.mean((p - y)**2)

def grad_backprop(t):                      # analytic: one forward + one backward
    W1, W2 = unpack(t)
    H = np.tanh(X @ W1.T); p = 1/(1 + np.exp(-(H @ W2.T)))
    dZ2 = 2*(p - y)/y.size * p*(1 - p)     # chain: MSE then sigmoid
    dW1 = ((dZ2 @ W2)*(1 - H**2)).T @ X    # EQ M8.3 again
    return np.concatenate([dW1.ravel(), (dZ2.T @ H).ravel()])

eps = 1e-5                                 # numerical: 2 forwards PER parameter
g_bp = grad_backprop(theta)
g_num = np.array([(loss(theta + eps*np.eye(12)[i]) - loss(theta - eps*np.eye(12)[i]))/(2*eps)
                  for i in range(12)])

print("max |analytic - numerical| =", f"{np.abs(g_bp - g_num).max():.2e}")
print("np.allclose verdict:", np.allclose(g_bp, g_num, rtol=1e-5, atol=1e-7))
print(f"\ncost: backprop = 2 passes total; numerical = {2*12} passes for 12 params")
RUN ▶ sabotage grad_backprop — drop the (1 − H**2) — and watch the verdict flip

8.5 SGD and minibatches

The loss that matters is an average over the whole dataset, so the true gradient is too — and on a trillion tokens, computing it once would cost more than most entire training runs.
Stochastic gradient descent declines to pay: sample a minibatch \(\mathcal{B}\), average its per-example gradients, and step on that estimate: EQ M8.4 — THE NOISY GRADIENT ESTIMATOR $$ \hat{g} \;=\; \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta}\, \ell_i(\theta), \qquad \mathbb{E}\big[\hat{g}\big] = \nabla_{\theta} L(\theta), \qquad \mathrm{Var}\big[\hat{g}\big] \;\propto\; \frac{1}{|\mathcal{B}|} $$ The estimator is unbiased — on average it points exactly downhill — and its noise shrinks only as \(1/|\mathcal{B}|\) in variance (\(1/\sqrt{|\mathcal{B}|}\) in magnitude): quadrupling the batch halves the noise. The decisive property: the cost of a step is independent of the dataset size. That one fact is why training on internet-scale data is possible at all. You grow the minibatch from \(|\mathcal{B}| = 25\) to \(|\mathcal{B}| = 100\). Since the gradient-noise magnitude scales as \(1/\sqrt{|\mathcal{B}|}\), by what factor does the noise magnitude change (new ÷ old)? Magnitude \(\propto 1/\sqrt{|\mathcal{B}|}\), so the ratio is \(\sqrt{25}/\sqrt{100} = 5/10 = \) 0.5. The variance (the square) drops by \(25/100 = 1/4\), but the magnitude only halves — quadrupling the batch buys a 2× cleaner gradient, the diminishing return behind the critical batch size. Batch size is then an engineering trade, not a statistical one. Larger batches use accelerators efficiently and parallelize across devices (Vol II · CH 04), but past a critical batch size the extra averaging buys almost nothing — the noise is no longer the limiting factor, and you are spending more compute per unit of progress. A common heuristic when scaling the batch is to scale the learning rate with it, which works until it abruptly doesn't. And the noise itself is not purely a tax: it helps escape saddle points, and there is evidence — genuinely contested — that it biases training toward flatter minima that generalize better (Chapter 06's themes). 
What is not contested: the learning rate \(\eta\) is the single most important hyperparameter in deep learning. Too low wastes compute; ~3× too high diverges; the usable window is often well under one order of magnitude, which is why Vol II · EQ 4.4's warmup-and-decay schedules exist. 8.6 Momentum and Adam Real loss surfaces are ravines: curvature differs wildly by direction (recall Chapter 02's elongated bowls). Plain SGD must keep \(\eta\) small enough not to explode along the steepest direction — and at that \(\eta\) it crawls along the shallow one, zigzagging across the valley while barely advancing down it. Momentum fixes this with one extra vector — an exponential moving average of gradients: EQ M8.5 — MOMENTUM (HEAVY BALL) $$ v_t \;=\; \beta\, v_{t-1} + \hat{g}_t, \qquad\quad \theta_{t+1} \;=\; \theta_t - \eta\, v_t $$ \(v\) remembers roughly the last \(1/(1-\beta)\) gradients — ten, at the standard \(\beta = 0.9\). Across the ravine, gradients alternate sign and cancel in the average; along the valley floor they agree and accumulate, up to a \(1/(1-\beta)\) ≈ 10× effective speedup. The physics name is honest: \(v\) is velocity, \(\beta\) is friction, and a rolling ball coasts through small bumps and minor noise that stop a memoryless walker cold. WORKED EXAMPLE ▾ 01 \(\beta = 0.9\), valley floor — every gradient is \(+1\): \(v_1 = 1\), \(v_2 = 0.9 + 1 = 1.9\), \(v_3 = 0.9 \times 1.9 + 1 = 2.71\), \(v_4 = 3.44\), … \(\rightarrow v_\infty = 1/(1 - 0.9) = 10\). Agreement compounds 10×. 02 Across the ravine gradients alternate \(+1, -1\): \(v_1 = 1\), \(v_2 = 0.9 - 1 = -0.1\), \(v_3 = 0.91\), \(v_4 = -0.181\), … settling at amplitude \(1/(1 + 0.9) = 0.53\). 03 Same gradient magnitude, a 19× different response: momentum is a frequency filter — steady direction amplified, oscillation damped. 04 \(\beta\) also sets memory: roughly the last \(1/(1-\beta)\) gradients matter — 10 at 0.9, 100 at 0.99. Hundred-step memory is also why high \(\beta\) overshoots. 
Drag it below. RESULT: ALONG VALLEY ×10 · ACROSS RAVINE ×0.53 FRICTION β 0.90 —

Momentum with friction \(\beta = 0.8\), starting from \(v_0 = 0\), on a valley floor where every gradient is \(g = +1\). Using \(v_t = \beta\,v_{t-1} + g\), what is the velocity \(v_3\) after three steps? \(v_1 = 0.8\cdot 0 + 1 = 1\); \(v_2 = 0.8\cdot 1 + 1 = 1.8\); \(v_3 = 0.8\cdot 1.8 + 1 = 1.44 + 1 = \) 2.44. Agreement compounds toward the limit \(1/(1-\beta) = 1/0.2 = 5\) — momentum accelerates along a consistent direction.

Adam keeps momentum's first moment and adds a second: a running average of each coordinate's squared gradient, whose square root divides the step — so every parameter gets an automatically calibrated per-coordinate learning rate, and rarely-updated or small-gradient directions are not starved. The full update, bias corrections and the decoupled weight decay that makes it AdamW, is Vol II · EQ 4.3 — we will not re-derive it here. The standings in practice:

Optimizer | State per param | Character | Where it rules
SGD | none | Honest, noisy, ravine-bound | Theory; small problems
SGD + momentum | +1 (v) | Coasts valleys, rolls over bumps, overshoots | The CNN/vision era; still competitive there
Adam / AdamW | +2 (m, v) | Per-coordinate scaling; robust to bad conditioning | Every LLM you have heard of

The state column is real money at scale: weights + gradients + fp32 master copy + Adam's two moments is where Vol II · CH 06's "≈16 bytes per parameter to train" comes from — a 70B model wants ~1.1 TB of optimizer-laden memory before a single activation is stored.

INSTRUMENT M8.2 — OPTIMIZER RACE ELONGATED VALLEY + BUMP · 3 LIVE UPDATE RULES · SEEDED NOISE CONTROL STEP +1 AUTO ▶ RESET STEP 0 SGD LOSS (η=1.6) — MOMENTUM LOSS (η=.22 β=.9) — ADAM LOSS (η=.30) —

A synthetic two-parameter loss — 10:1 curvature ratio plus a Gaussian bump squarely in the path — but every trajectory is its real update rule run live, all three fed identical seeded gradient noise.
Run AUTO: SGD zigzags across the valley, then parks — the bump carves a shallow local minimum in front of itself, and a memoryless stepper has no way out. Momentum's stored velocity carries it straight over (and then past — watch the red overshoot swing back). Adam's per-coordinate normalization takes the bump as a detour and settles cleanly. Around step 120 the scoreboard reads ≈1.34 / 0.03 / 0.02 — same surface, same noise, three different fates. 8.7 Vanishing, exploding, and the fixes that built deep learning Iterate EQ M8.3 through \(L\) layers and the gradient reaching layer 1 is a product of \(L\) matrices and \(L\) activation slopes. Products compound geometrically: if each factor shrinks the signal by 0.9, a hundred layers leave \(0.9^{100} \approx 3 \times 10^{-5}\) of it; if each grows it by 1.1, the same hundred layers amplify ~14,000×. Vanishing gradients mean the early layers stop learning while the late ones overfit; exploding gradients mean a single step flings the weights to infinity. For two decades this product was the practical wall — "deep networks don't train" — and the modern stack is, to a first approximation, the list of fixes: Initialization that respects the product. Xavier/Glorot and He initialization choose weight variance (≈ \(2/d_{\text{in}}\) for ReLU) so each layer's factor starts with norm ≈ 1 — the product begins neutral instead of doomed. One line of code; it is the difference between training and not. Activations that pass gradient. EQ M7.2's argument: sigmoid contributes at most 0.25 per layer; ReLU contributes exactly 1 wherever active. The 2012 switch to ReLU is most of why "deep" stopped being a euphemism for "broken". Residual connections. Reformulate each layer as \( h_{\ell+1} = h_\ell + F(h_\ell) \). 
The backward Jacobian becomes \( I + \partial F / \partial h_\ell \): the identity term gives gradients a multiplication-free expressway from the loss to every layer, and the product of \((I + \text{small})\) terms stays tame where a product of raw matrices would not. He et al. (2015) used it to train 152 layers the year 20 was hard. Normalization and clipping. LayerNorm/RMSNorm re-standardize activations so the slopes stay in their responsive range; gradient clipping caps the global gradient norm as a circuit breaker against the exploding side — still standard in every LLM pretraining run (Vol II · CH 04). Carry the third fix with you across the volume boundary: a transformer is a residual network through and through, and what Volume II calls the residual stream (Vol II · §2.2) is exactly this gradient expressway, promoted from a training trick to the architecture's central data structure — every attention head and MLP reads from it and adds back into it, and gradients ride it undamped through a hundred layers. NEXT You now know everything GPT knows about learning — Volume II shows what happens at a trillion times the scale. The forward pass of Chapter 07, this chapter's backward pass, AdamW on EQ M8.4's noisy gradients: that is, literally and completely, the training loop of a frontier model. Volume II begins where the loop meets reality — tokens, embeddings, attention, and the engineering of running it across tens of thousands of GPUs. § Further reading Rumelhart, D., Hinton, G. & Williams, R. (1986). Learning Representations by Back-Propagating Errors. — the paper that popularized backpropagation for training multilayer networks. Linnainmaa, S. (1970). The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion. — the earliest description of reverse-mode automatic differentiation, backprop's mathematical core. Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method. 
— the foundation of stochastic gradient descent and its convergence conditions.
Kingma, D. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. — the adaptive optimizer combining momentum and per-parameter scaling that is now the default.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. — the diploma thesis that first diagnosed the vanishing-gradient problem in deep networks.
Baydin, A., Pearlmutter, B., Radul, A. & Siskind, J. (2018). Automatic Differentiation in Machine Learning: A Survey. — the modern reference tying backprop to general-purpose autodiff.

## VOL I · 09 · Naive Bayes & Generative Classifiers (https://ai-encyclopedia.com/ml/09-naive-bayes.html)

MACHINE LEARNING · CHAPTER 09 / 15 Naive Bayes & Generative Classifiers

Most classifiers learn where to draw the line between classes. A generative classifier instead learns to model each class, then asks which class would most plausibly have produced what it sees. Naive Bayes is the simplest such model, named for one shortcut. Assume every feature is independent of the others given the class, an assumption almost no real data obeys, and you get a classifier that trains in a single pass, needs little data, and remains hard to beat.

LEVEL INTRO READING TIME ≈ 22 MIN BUILDS ON CH 03 · STATS CH 01 INSTRUMENTS GAUSSIAN BOUNDARY · SPAM FILTER · INDEPENDENCE TOY IN THIS CHAPTER 9.1 Generative vs discriminative 9.2 Bayes' rule & the naive lie 9.3 Gaussian Naive Bayes 9.4 Multinomial & Bernoulli for text 9.5 Smoothing & why it works § References

9.1 Generative vs discriminative classifiers

Every classifier ultimately wants \(p(y \mid x)\): the probability of class \(y\) given features \(x\).
There are two roads to it, and they split machine learning down the middle.

A discriminative model attacks \(p(y \mid x)\) head-on. Logistic regression (Chapter 03), neural nets (Chapter 07), SVMs (next chapter) — all of them learn the boundary between classes directly and never bother modeling what the inputs themselves look like. A generative model takes the long way around: it learns \(p(x \mid y)\) — a full story of how each class generates its data — plus the class prevalences \(p(y)\), and only then flips them with Bayes' rule to recover \(p(y \mid x)\). You could literally sample from a trained generative classifier to hallucinate new examples of "spam" or "not spam".

Aspect | Generative — models \(p(x,y)\) | Discriminative — models \(p(y \mid x)\)
Learns | how each class produces data | where the boundary sits
Examples | Naive Bayes, LDA/QDA, GMMs, HMMs | Logistic regression, SVM, neural nets
Data hunger | low — strong assumptions fill the gaps | higher — wants enough to trace the boundary
Asymptotic accuracy | often lower (model is wrong) | often higher (fewer assumptions)
Can generate samples? | yes | no

The trade-off is captured in a classic result of Ng & Jordan (2002): a generative classifier and its discriminative twin (e.g. naive Bayes vs logistic regression) approach different error rates as data grows, but the generative one approaches its (sometimes higher) ceiling much faster — and wins outright in the small-data regime. With ten examples, the model that assumes more often beats the model that assumes less. With ten million, the assumptions become a liability. Naive Bayes lives at the assumption-heavy end of this spectrum, which is exactly why it remains a strong baseline whenever labeled data is scarce or latency is brutal.

INTUITION A discriminative model is a border guard who has learned to spot a forged passport without ever picturing a real citizen.
A generative model is a forger who has studied what genuine documents look like and judges a new one by how easily they could have produced it. Both can flag fakes — they just know the world differently.

9.2 Bayes' rule for classification & the naive assumption

The engine is Bayes' rule (Stats · EQ 1.x), read as a classifier. To score class \(y\) for an input \(x = (x_1, \ldots, x_d)\):

EQ M9.1 — BAYES' RULE AS A CLASSIFIER
$$ p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)} \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')} $$
The prior \(p(y)\) is just how common each class is. The likelihood \(p(x \mid y)\) is the class's story of the data. The evidence \(p(x)\) in the denominator is the same for every class, so for picking the winner it is pure normalization: you can ignore it entirely until you need actual posterior probabilities rather than just the winning class. That single observation is why naive Bayes never has to compute the hard part.

The problem hides in \(p(x \mid y)\). For \(d\) binary features there are \(2^d - 1\) free numbers per class — a joint distribution over every combination of feature values. With \(d = 50\) that is roughly \(10^{15}\) numbers per class. No dataset estimates it. So naive Bayes makes its defining leap of faith: given the class, the features are mutually independent.

EQ M9.2 — THE NAIVE CONDITIONAL-INDEPENDENCE ASSUMPTION
$$ p(x_1, x_2, \ldots, x_d \mid y) \;\approx\; \prod_{j=1}^{d} p(x_j \mid y) $$
The full joint — exponential in \(d\) — collapses into a product of \(d\) one-dimensional pieces, each trivially estimated by counting. The cost falls from \(2^d - 1\) parameters to \(O(d)\) per class. This is almost never literally true (in text, "new" and "york" are wildly correlated), yet it is the entire trick. The assumption is the price; linear-time learning and inference is what you buy.

Plug EQ M9.2 into EQ M9.1 and drop the constant evidence.
The decision rule becomes a single argmax, and — because likelihoods are tiny products that underflow to zero in floating point — you always compute it in log space, where the product becomes a sum:

EQ M9.3 — THE NAIVE BAYES DECISION RULE
$$ \hat{y} \;=\; \underset{y}{\arg\max}\;\Big[\, \log p(y) + \sum_{j=1}^{d} \log p(x_j \mid y) \,\Big] $$
A prediction is one weighted vote: start from the log-prior, then add each feature's log-evidence for that class. No iteration, no gradient descent — training is counting, inference is addition. Because the scores are sums of log-probabilities, naive Bayes is linear in feature log-likelihoods; with the right parameterization it shares the exact functional form of logistic regression, just fit differently.

True or false: Naive Bayes assumes that the features are independent of one another given the class label. (Answer true or false.)
This is precisely EQ M9.2 — the conditional-independence assumption that lets the joint likelihood factor into \(\prod_j p(x_j \mid y)\). It is the model's namesake "naive" leap. The answer is true. (Note the subtlety: features need not be marginally independent — only independent conditioned on the class.)

PYTHON · RUNNABLE IN-BROWSER
# EQ M9.3 by hand: log-prior + sum of feature log-likelihoods, then argmax.
# Two classes, two binary features. p(x_j=1 | y) given directly.
import numpy as np

prior = {0: 0.6, 1: 0.4}               # p(y): class 0 is more common
p1 = {0: [0.2, 0.7], 1: [0.8, 0.3]}    # p(x_j = 1 | y) for j = 0, 1
x = [1, 0]                             # observe feature0 = 1, feature1 = 0

def logscore(y):
    s = np.log(prior[y])
    for j, xj in enumerate(x):
        pj = p1[y][j] if xj == 1 else 1 - p1[y][j]   # Bernoulli per feature
        s += np.log(pj)
    return s

scores = {y: logscore(y) for y in (0, 1)}
print("log-scores:", {y: round(s, 4) for y, s in scores.items()})
yhat = max(scores, key=scores.get)

# turn log-scores into a normalized posterior via the log-sum-exp normalizer
m = max(scores.values())
Z = sum(np.exp(s - m) for s in scores.values())
post = {y: float(np.exp(s - m) / Z) for y, s in scores.items()}
print("posterior:", {y: round(p, 4) for y, p in post.items()})
print("prediction:", yhat, " (evidence p(x) cancelled, never computed)")
RUN ▶ edits are live — break it on purpose

9.3 Gaussian Naive Bayes

For continuous features — heights, pixel intensities, sensor readings — each per-feature likelihood \(p(x_j \mid y)\) is modeled as a one-dimensional Gaussian. The training "fit" is nothing but estimating, from each class's data, a mean and a variance for every feature:

EQ M9.4 — GAUSSIAN PER-FEATURE LIKELIHOOD
$$ p(x_j \mid y) \;=\; \frac{1}{\sqrt{2\pi\,\sigma_{y,j}^2}}\,\exp\!\left(-\frac{(x_j - \mu_{y,j})^2}{2\,\sigma_{y,j}^2}\right) $$
\(\mu_{y,j}\) and \(\sigma_{y,j}^2\) are simply the sample mean and variance of feature \(j\) among the training points of class \(y\). Because features are assumed independent given the class, the joint per-class density is an axis-aligned Gaussian — its contours are ellipses whose axes line up with the coordinate axes, never tilted. That single restriction is the visible footprint of the naive assumption in 2-D.

Where do the decision boundaries come from? Take the log-ratio of the two class posteriors. With variances shared across the classes the quadratic terms cancel and you get a straight line: this is the Linear Discriminant Analysis case (here with a diagonal covariance).
With per-class variances the quadratic terms survive and the boundary becomes a conic (a parabola, ellipse, or hyperbola) — Quadratic Discriminant Analysis. Gaussian NB is exactly QDA with the off-diagonal covariances forced to zero.

INSTRUMENT M9.1 — GAUSSIAN-NB BOUNDARY EXPLORER (DRAG CLASS MEANS · TUNE VARIANCE · EQ M9.4)
Controls (defaults): class A spread σ (1.00), class B spread σ (1.00), prior p(A) (0.50). Readouts: boundary shape, A-mean, B-mean.
Drag either coloured dot to move a class mean. Equal spreads give a straight boundary (LDA); make the spreads unequal and it bows into a conic (QDA). Skew the prior and the whole boundary slides toward the rarer class — Bayes-optimally demanding more evidence to call something uncommon. The shaded region is wherever class A wins the argmax of EQ M9.3.

PYTHON · RUNNABLE IN-BROWSER
# Gaussian Naive Bayes from scratch: fit = means + variances, predict = argmax.
import numpy as np
rng = np.random.default_rng(0)

# two 2-D classes, 200 points each, drawn from axis-aligned Gaussians
mu = {0: [0.0, 0.0], 1: [3.0, 2.5]}
Xy = [(rng.normal(mu[y], [1.0, 1.3], (200, 2)), np.full(200, y)) for y in (0, 1)]
X = np.vstack([a for a, _ in Xy])
y = np.concatenate([b for _, b in Xy])

# --- fit: per class, per feature mean and variance (eps guards zero variance)
classes = np.unique(y)
eps = 1e-9
means = np.array([X[y == c].mean(0) for c in classes])
variances = np.array([X[y == c].var(0) for c in classes]) + eps   # renamed from `vars` (a Python builtin)
logpr = np.log(np.array([(y == c).mean() for c in classes]))

# --- predict: log-prior + sum_j log N(x_j; mu, var)   (EQ M9.3 + M9.4)
def log_gauss(Xb, m, v):
    # broadcast (N, 1, d) against (n_classes, d), then sum over features
    return -0.5 * (np.log(2 * np.pi * v) + (Xb[:, None, :] - m) ** 2 / v).sum(2)

scores = log_gauss(X, means, variances) + logpr   # (N, n_classes)
pred = classes[scores.argmax(1)]
acc = (pred == y).mean()
print(f"fitted means:\n{means.round(2)}")
print(f"fitted variances:\n{variances.round(2)}")
print(f"train accuracy: {acc:.3f}")
# plot_scatter is not defined here; it is presumably supplied by the site's in-browser runtime
plot_scatter(X[:, 0], X[:, 1], list(pred.astype(int)))   # colour = predicted class
RUN ▶ edits are live — break it on purpose

One practical wrinkle. If a feature has near-zero variance within a class — a constant column, common in real data — the Gaussian collapses to a spike and the log-likelihood explodes. Production implementations (scikit-learn's GaussianNB) add a tiny smoothing term to every variance, a fixed fraction of the largest feature variance, to keep the densities well-behaved. The same impulse — never let a probability go to zero or infinity — drives the smoothing we meet next for discrete features.

9.4 Multinomial & Bernoulli NB for text

Naive Bayes' first and most enduring job was sorting documents — Maron's 1961 "automatic indexing" experiments are arguably the technique's debut, and spam filters made it famous. Text needs a different per-feature model than the Gaussian, and there are two canonical choices that differ in what they count.

Multinomial NB treats a document as a bag of word counts. Each class \(y\) has a vocabulary-sized probability vector \(\theta_{y}\) — "how likely is each word in a document of this class" — and a document's likelihood is the multinomial probability of its counts. The estimate for one word is the share of that class's total word-tokens it accounts for, with Laplace (add-\(\alpha\)) smoothing so an unseen word never zeroes out the whole product:

EQ M9.5 — MULTINOMIAL NB WITH LAPLACE SMOOTHING
$$ \hat{\theta}_{y,w} \;=\; p(w \mid y) \;=\; \frac{N_{y,w} + \alpha}{N_{y} + \alpha\,V}, \qquad N_{y} = \sum_{w'} N_{y,w'} $$
\(N_{y,w}\) is how many times word \(w\) appears across all class-\(y\) documents; \(N_{y}\) is that class's total token count; \(V\) is the vocabulary size; \(\alpha\) is the smoothing strength (\(\alpha = 1\) is "Laplace" or add-one smoothing; other values \(\alpha > 0\) are usually called "Lidstone" smoothing, with \(\alpha = 0.5\) corresponding to the Jeffreys prior).
Adding \(\alpha\) to the numerator and \(\alpha V\) to the denominator keeps the estimates a valid probability distribution (they still sum to 1 over the vocabulary) and guarantees every word keeps a sliver of probability mass.

Bernoulli NB instead treats each vocabulary word as a binary presence/absence flag and explicitly models the absence of words too: if "free" is missing from a message, that absence is itself evidence for ham. It tends to win on short texts (tweets, subject lines) where a word appearing once vs. thrice carries little extra signal; multinomial wins on longer documents where counts matter. Both are linear classifiers in log space; both are one counting pass to train.

A word appears \(N_{y,w} = 0\) times in class \(y\). The class has \(N_y = 6\) total tokens, the vocabulary size is \(V = 4\), and you use Laplace smoothing with \(\alpha = 1\). What smoothed probability \(p(w \mid y)\) does EQ M9.5 assign this unseen word? (Give a decimal.)
Apply EQ M9.5 directly: \(p(w \mid y) = \dfrac{N_{y,w} + \alpha}{N_y + \alpha V} = \dfrac{0 + 1}{6 + 1\times 4} = \dfrac{1}{10} = \) 0.1. Without smoothing the answer would be \(0/6 = 0\), which would force the entire document's likelihood to zero — the catastrophe smoothing exists to prevent.

PYTHON · RUNNABLE IN-BROWSER
# Multinomial Naive Bayes on a toy bag-of-words, with Laplace smoothing (EQ M9.5).
import numpy as np

vocab = ["win", "free", "money", "meeting", "report", "lunch"]
V = len(vocab)

# rows = documents (word counts), labels: 1 = spam, 0 = ham
X = np.array([[2, 1, 1, 0, 0, 0],    # spam
              [1, 1, 0, 0, 0, 0],    # spam
              [0, 0, 0, 1, 1, 0],    # ham
              [0, 0, 0, 1, 0, 1]])   # ham
y = np.array([1, 1, 0, 0])
alpha = 1.0

# fit: per-class token totals -> smoothed word probabilities (EQ M9.5)
theta, logprior = {}, {}
for c in (0, 1):
    counts = X[y == c].sum(0)                                  # N_{y,w}
    theta[c] = (counts + alpha) / (counts.sum() + alpha * V)
    logprior[c] = np.log((y == c).mean())

# predict a new document: "free money meeting"
doc = np.array([0, 1, 1, 1, 0, 0])
score = {c: logprior[c] + (doc * np.log(theta[c])).sum() for c in (0, 1)}
m = max(score.values())
Z = sum(np.exp(s - m) for s in score.values())
post = {c: float(np.exp(s - m) / Z) for c, s in score.items()}
print("p(win|spam) =", round(float(theta[1][0]), 3), " p(win|ham) =", round(float(theta[0][0]), 3))
print("posterior:", {c: round(p, 3) for c, p in post.items()})
print("prediction:", "SPAM" if score[1] > score[0] else "HAM")
RUN ▶ edits are live — break it on purpose

INSTRUMENT M9.2 — LIVE SPAM FILTER (MULTINOMIAL NB · TYPE WORDS · POSTERIOR UPDATES PER TOKEN)
Controls (defaults): message box (type freely; known words are listed), Laplace α (1.0). Readouts: P(spam | message), log-score spam, log-score ham, verdict.
A tiny corpus is baked in (spammy words: free, money, win, click, offer, now, prize; hammy words: meeting, report, lunch, project, team, schedule). Each token you add casts a log-vote (EQ M9.3); the bar is the softmaxed posterior. Words the model has never seen contribute only the smoothed floor — turn α up and watch unknown words pull every verdict toward 50/50.

9.5 Smoothing, failure modes & why it works anyway

We have already met the most important fix.
The zero-frequency problem is fatal without it: one word the training set never saw in a class drives \(p(w \mid y) = 0\), and that single zero factor sends the log-score of EQ M9.3 to \(-\infty\), vetoing that class no matter how strong the other evidence. Laplace/Lidstone smoothing (EQ M9.5) is the cure. Read as a Bayesian posterior mean, \(\alpha\) is the strength of a symmetric Dirichlet prior that pre-loads \(\alpha\) imaginary observations of every word. It is regularization wearing a probabilistic costume.

The deeper failure mode is the assumption itself. When features are correlated (and in real data they almost always are) naive Bayes double-counts evidence. The bigram "New York" contributes "new" and "york" as if they were two independent witnesses, so the model becomes overconfident: its posteriors pile up near 0 and 1, badly miscalibrated. Here is the surprise that has fascinated the field for decades:

THE PARADOX Naive Bayes' probabilities are often garbage, yet its classifications are excellent. Domingos & Pazzani (1997) showed why: the argmax only needs the correct class to score highest, not to have an accurate probability. The independence assumption can be massively violated — pushing the estimated posterior to a wildly wrong value like 0.9999 — and the decision still lands on the right side of the boundary. The model is reliably wrong about how sure it is, and reliably right about which class. If you need calibrated probabilities, post-hoc calibration (Platt scaling, isotonic regression) is mandatory; if you only need labels, ship it.
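The paradox can be reproduced in a few lines. The sketch below is an illustration with made-up numbers, not one of the chapter's instruments: it hands naive Bayes the same weak cue several times over, which mimics perfectly correlated features. The likelihoods 0.6 and 0.4, the class names, and the helper `nb_posterior` are all assumptions chosen for the demo.

```python
import math

def nb_posterior(log_prior, log_liks):
    """Turn per-class log-scores (EQ M9.3) into a posterior via log-sum-exp."""
    scores = {c: log_prior[c] + sum(log_liks[c]) for c in log_prior}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}

# Hypothetical setup: equal priors and one weak cue with
# p(cue | spam) = 0.6 and p(cue | ham) = 0.4.
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
cue = {"spam": math.log(0.6), "ham": math.log(0.4)}

results = {}
for copies in (1, 5, 20):
    # Repeating the identical cue stands in for `copies` perfectly correlated
    # features; naive Bayes treats each copy as an independent witness.
    liks = {c: [cue[c]] * copies for c in cue}
    post = nb_posterior(log_prior, liks)
    label = max(post, key=post.get)
    results[copies] = (label, post["spam"])
    print(f"copies={copies:2d}  prediction={label}  p(spam)={post['spam']:.4f}")
```

With a single copy the posterior for spam is 0.6, which is the honest answer because there is only one piece of evidence. Each extra copy multiplies the same likelihood ratio in again, so the reported confidence climbs toward 1 while the predicted label never changes.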
Three more practical truths round out the picture: (1) NB is robust to irrelevant features, which simply contribute roughly equal log-votes to every class and cancel; (2) it is sensitive to strongly correlated features, so de-duplicating obvious redundancy (or using TF-IDF weighting for text) helps; (3) its training cost is a single pass and its model is a handful of count tables, making it the natural choice for streaming data, on-device inference, and any setting where a heavier model's accuracy gain does not justify its cost. It is the baseline every other classifier in this volume must beat — and on small, high-dimensional, sparse data, it frequently isn't beaten.

INSTRUMENT M9.3 — INDEPENDENCE-ASSUMPTION TOY (CORRELATE TWO FEATURES · WATCH NB DEGRADE)
Controls (defaults): feature correlation ρ (0.00). Readouts: true-model accuracy, naive-Bayes accuracy, mean |posterior error|.
Two classes drawn from correlated Gaussians; NB still fits them as axis-aligned (ρ forced to 0). At ρ = 0 the assumption holds and NB matches the optimal classifier. Crank ρ up: the calibration error climbs steadily — NB grows overconfident — while its accuracy barely moves. That gap is the paradox of §9.5 made visible: wrong probabilities, right decisions.

NEXT Naive Bayes draws boundaries by modeling each class; the next chapter draws the single best boundary directly. Chapter 10 — Support Vector Machines & Kernels: maximum-margin separation, the kernel trick that buys nonlinear boundaries for free, and what changes when you stop modeling the data and start carving it.

9.R References

Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 — explains why a model with violated independence assumptions still classifies well (the paradox of §9.5).
Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8(3) — arguably the first application of naive-Bayes-style probabilistic classification to text.
McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop — the canonical multinomial vs. Bernoulli comparison (EQ M9.5 and §9.4).
Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NeurIPS 14 — the generative/discriminative trade-off and small-data advantage of §9.1.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — §6.6.3 naive Bayes, §4.3 LDA/QDA (the Gaussian-NB connection of §9.3).
Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification. Stanford — a modern, worked treatment of multinomial NB for text with smoothing.

## VOL I · 10 · Support Vector Machines & the Kernel Trick (https://ai-encyclopedia.com/ml/10-svm-kernels.html)

MACHINE LEARNING · CHAPTER 10 / 15
Support Vector Machines & the Kernel Trick

Of all the lines that separate two classes, the SVM picks the one sitting in the widest empty corridor. Maximize that margin and a handful of support vectors define the entire boundary; a kernel then bends it into high or infinite dimensions at little extra cost. For roughly a decade these were the most accurate classifiers available, and they remain the cleanest place to learn what a margin, a dual problem, and a kernel are.
LEVEL: CORE
READING TIME: ≈ 28 MIN
BUILDS ON: ML · CH 02–03
INSTRUMENTS: MARGIN EXPLORER · KERNEL PLAYGROUND · C–HINGE
IN THIS CHAPTER: 10.1 The maximum-margin idea · 10.2 Hard margin & support vectors · 10.3 Soft margin & hinge loss · 10.4 The kernel trick · 10.5 SVMs in practice · 10.R References

10.1 The maximum-margin idea

Suppose two classes of points are linearly separable. Then there are infinitely many straight lines that split them with zero training error — and the perceptron of Chapter 03 will happily return whichever one it stumbles onto first. The support vector machine asks a sharper question: among all separating hyperplanes, which one is best? Its answer is geometric and almost obvious once stated: pick the boundary that keeps the largest possible empty buffer on either side. A line that skims past a training point is fragile — nudge that point slightly and it flips sides. A line centered in a wide no-man's-land is robust.

A hyperplane is the set of points satisfying \(w \cdot x + b = 0\), where \(w\) is a normal vector pointing across the boundary and \(b\) shifts it from the origin. The signed distance from any point \(x\) to that hyperplane is \((w \cdot x + b)/\lVert w \rVert\). The classifier is the sign of that expression: \(\hat{y} = \operatorname{sign}(w \cdot x + b)\). What we want to maximize is the margin — the distance from the boundary to the closest point of either class.

EQ M10.1 — THE GEOMETRIC MARGIN
$$ \gamma \;=\; \min_{i}\; \frac{y_i\,(w \cdot x_i + b)}{\lVert w \rVert}, \qquad y_i \in \{-1, +1\} $$
Labels are \(\pm 1\) (not \(0/1\)) precisely so that \(y_i(w\cdot x_i + b)\) is positive exactly when the point is on the correct side — the label cancels the sign. Dividing by \(\lVert w \rVert\) converts the raw score into a real distance, so \(\gamma\) is measured in the units of the data, not the units of \(w\). The SVM's whole objective is to find the \((w, b)\) that makes this smallest distance as large as possible.
The corridor's full width is \(2\gamma\).

There is a redundancy to remove first. The pair \((w, b)\) and \((2w, 2b)\) describe the same hyperplane but give different score magnitudes. SVMs fix this by a canonical normalization: scale \((w, b)\) so that the closest points score exactly \(\pm 1\), i.e. \(\min_i y_i(w\cdot x_i + b) = 1\). Under that convention the margin in EQ M10.1 simplifies to the single clean quantity \(\gamma = 1/\lVert w \rVert\) — so maximizing the margin is the same as minimizing \(\lVert w \rVert\). That equivalence is the hinge on which the next section turns.

FIG M10.A — ONE OF MANY SEPARATORS VS. THE MAXIMUM-MARGIN ONE (left panel: ARBITRARY SEPARATOR, skims a point; right panel: MAXIMUM MARGIN, support vectors ringed)
Same data, two boundaries. The left line separates the classes but hugs a blue point; the right one is centered in the widest empty corridor. The three ringed points are support vectors — the only points that touch the margin, and the only ones that matter (§10.2).

The maximum-margin principle is not just aesthetic. Margin width controls the capacity of the classifier in the bias–variance sense of Chapter 06: a wider margin is a simpler, lower-variance hypothesis, and classical margin bounds on generalization error scale with \(1/\gamma^2\) (for data of bounded norm) rather than with the dimension of the space. That is the deep reason SVMs tolerate enormous feature spaces (§10.4) without overfitting in the way intuition warns they should.

10.2 The hard-margin problem and its support vectors

Pin down the redundancy with the canonical scaling and the SVM becomes a tidy convex optimization problem: minimize \(\lVert w \rVert\) — equivalently \(\tfrac{1}{2}\lVert w \rVert^2\), which is differentiable everywhere — subject to every point sitting on the correct side of its margin.
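The canonical scaling of §10.1 is easy to check numerically before it is used in the formal problem. The sketch below is an illustration under stated assumptions: the four points, their labels, and the hyperplane \((w, b)\) are hypothetical, and the hyperplane is just some separator, not the maximum-margin one. It rescales \((w, b)\) so the closest point scores exactly 1 and confirms that \(1/\lVert w \rVert\) then equals the geometric margin of EQ M10.1.

```python
import math

# Hypothetical 2-D points and an arbitrary separating hyperplane (w, b).
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [+1, +1, -1, -1]
w, b = (2.0, 2.0), 0.0

def functional_margins(w, b):
    """y_i * (w . x_i + b) for every point: the un-normalized scores."""
    return [yi * (w[0] * xi[0] + w[1] * xi[1] + b) for xi, yi in zip(X, y)]

def norm(w):
    return math.hypot(w[0], w[1])

# Geometric margin of EQ M10.1 for the original (w, b).
gamma_raw = min(functional_margins(w, b)) / norm(w)

# Canonical scaling: divide (w, b) by the smallest score so that the
# closest point scores exactly 1. The hyperplane itself does not move.
s = min(functional_margins(w, b))
w_c, b_c = (w[0] / s, w[1] / s), b / s
closest_score = min(functional_margins(w_c, b_c))
gamma_canonical = 1.0 / norm(w_c)

print("smallest score after scaling:", closest_score)
print("margin from EQ M10.1:", gamma_raw)
print("margin as 1/||w|| after scaling:", gamma_canonical)
```

The two margin values agree because rescaling changes the scores but not the geometry, which is why the optimization can fix the closest score at 1 and work with \(\lVert w \rVert\) alone.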
EQ M10.2 — THE HARD-MARGIN PRIMAL
$$ \min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 \qquad \text{subject to}\qquad y_i\,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i $$
A quadratic objective under linear inequality constraints — a quadratic program, with a unique global optimum and no local minima to trap you (contrast the loss surfaces of Chapters 07–08). Each constraint says "point \(i\) is correctly classified and at least one margin-unit away from the boundary." Points that hold their constraint with equality (\(y_i(w\cdot x_i + b) = 1\)) sit exactly on the margin; everything else has slack to spare.

Solving the dual of EQ M10.2 — via Lagrange multipliers \(\alpha_i \ge 0\), one per point — reveals the structure that gives the method its name. The optimum has the form \(w = \sum_i \alpha_i y_i x_i\), and the Karush–Kuhn–Tucker conditions force \(\alpha_i = 0\) for every point strictly outside the margin. Only points lying exactly on the margin can carry a nonzero \(\alpha_i\). Those are the support vectors.

EQ M10.3 — THE DUAL: ONLY DOT PRODUCTS SURVIVE
$$ \max_{\alpha \ge 0}\; \sum_i \alpha_i \;-\; \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\,(x_i \cdot x_j) \qquad \text{s.t.}\quad \sum_i \alpha_i y_i = 0 $$
Two facts make this the most important equation in the chapter. First, the solution is sparse in \(\alpha\): typically only a small fraction of points are support vectors, so deleting all of the remaining training points leaves the boundary unchanged. Second — and this is the seed of §10.4 — the data enters only through inner products \(x_i \cdot x_j\). Nowhere does the optimizer need the coordinates themselves, only how points relate. Swap that dot product for a kernel and you have changed feature spaces without touching a single line of the solver.

The consequence is striking. A trained SVM is a list of support vectors, their labels, their weights \(\alpha_i\), and a bias \(b\). On a clean problem that might be a dozen points out of a million.
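To make the sparsity claim concrete, here is a minimal sketch on a toy problem whose hard-margin solution can be read off by symmetry. Everything in it is an illustrative assumption rather than solver output: the four points, their labels, and the dual weights \(\alpha = (0.5, 0.5, 0, 0)\), which were chosen by hand so that \(\sum_i \alpha_i y_i = 0\) and the two nearest points sit exactly on the margin.

```python
# Hypothetical toy problem: the two inner points are the support vectors.
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0), (-4.0, 2.0)]
y = [+1, -1, +1, -1]
alpha = [0.5, 0.5, 0.0, 0.0]   # hand-picked dual weights, not solver output
b = 0.0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def weight_vector(indices):
    """w = sum_i alpha_i * y_i * x_i over the chosen training indices."""
    indices = list(indices)
    return tuple(sum(alpha[i] * y[i] * X[i][k] for i in indices) for k in range(2))

w_all = weight_vector(range(len(X)))                 # use every training point
sv = [i for i, a in enumerate(alpha) if a > 0]       # indices with alpha_i > 0
w_sv = weight_vector(sv)                             # use support vectors only

# y_i * f(x_i): equals 1 on the margin, exceeds 1 strictly outside it.
margins = [y[i] * (dot(w_all, X[i]) + b) for i in range(len(X))]
balance = sum(a * yi for a, yi in zip(alpha, y))     # dual constraint of EQ M10.3

print("w from all points:          ", w_all)
print("w from support vectors only:", w_sv)
print("y_i * f(x_i):               ", margins)
print("sum_i alpha_i y_i:          ", balance)
```

Both weight vectors come out identical, so the two far-away points could be deleted without moving the boundary. Their constraints hold with room to spare, which is exactly the situation in which the KKT conditions set \(\alpha_i = 0\).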
The decision function is

EQ M10.4 — THE DECISION FUNCTION
$$ \hat{y}(x) \;=\; \operatorname{sign}\!\Big( \sum_{i \in \text{SV}} \alpha_i\, y_i\,(x_i \cdot x) \;+\; b \Big) $$
Classification reduces to comparing the new point against the support vectors alone. The model's memory footprint and prediction cost scale with the number of support vectors, not the size of the training set — which is exactly why SVMs were practical on 1990s hardware and why a "too easy" problem (few SVs) and a "too hard" one (almost every point becomes an SV) feel so different at deployment time.

An SVM is canonically scaled so the closest points score \(\pm 1\), and its weight vector has norm \(\lVert w \rVert = 2\). By the simplification of EQ M10.1 (\(\gamma = 1/\lVert w \rVert\)), what is the margin \(\gamma\) — the distance from the boundary to the nearest point?
Under canonical scaling the geometric margin collapses to \(\gamma = 1/\lVert w \rVert = 1/2 = \) 0.5. The full corridor between the two classes is twice this, \(2\gamma = 1\). Doubling \(\lVert w \rVert\) halves the margin — which is exactly why minimizing \(\lVert w \rVert^2\) maximizes the corridor.

INSTRUMENT M10.1 — MARGIN EXPLORER (DRAG POINTS · MAX-MARGIN SOLVED LIVE · EQ M10.2)
Controls: add blue point, add mint point, reset. Readouts: margin γ = 1/‖w‖, ‖w‖, support vectors, status.
Drag any point and the maximum-margin boundary re-solves instantly (a small coordinate-ascent on the dual of EQ M10.2). The solid white line is the boundary; the dashed lines are the margins; ringed points are the support vectors that touch them. Drag a faraway point around — the boundary does not move, because its \(\alpha_i\) is zero. Now drag a support vector and watch everything shift: only the points on the corridor have a vote.

PYTHON · RUNNABLE IN-BROWSER
# Linear soft-margin SVM by sub-gradient descent on the hinge loss
# (EQ M10.5 / M10.6, previewing §10.3). Recovers the margin and the support vectors.
import numpy as np
rng = np.random.default_rng(0)

# two clearly separable clouds, labels in {-1, +1}
n = 40
Xp = rng.normal([ 2.2,  2.0], 0.6, (n, 2))
Xm = rng.normal([-2.0, -1.6], 0.6, (n, 2))
X = np.vstack([Xp, Xm])
y = np.r_[np.ones(n), -np.ones(n)]

w = np.zeros(2); b = 0.0; C = 1.0
for t in range(1, 4001):                 # Pegasos-style schedule
    lr = 1.0 / (0.01 * t)                # 0.01 = regularization strength, roughly 1/(C·N)
    m = y * (X @ w + b)                  # margins y·f(x)
    viol = m < 1                         # points inside the margin
    grad_w = 0.01 * w - C / len(X) * (y[viol] @ X[viol])
    grad_b = -C / len(X) * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

m = y * (X @ w + b)
sv = np.where(m < 1.05)[0]               # points on/inside the margin
print(f"||w|| = {np.linalg.norm(w):.3f}")
print(f"margin 1/||w|| = {1/np.linalg.norm(w):.3f}")
print(f"train accuracy = {(np.sign(X @ w + b) == y).mean():.3f}")
print(f"support vectors= {len(sv)} of {len(X)} points")
# plot_scatter is not defined here; it is presumably supplied by the site's in-browser runtime
plot_scatter(X[:, 0], X[:, 1], (y > 0).astype(int))
RUN ▶ push the clouds together (try means ±1.0) and watch the support-vector count climb

10.3 Soft margin and the hinge loss

Real data is not cleanly separable. One mislabeled point, or two classes that genuinely overlap, and the hard-margin problem of EQ M10.2 has no feasible solution — every hyperplane violates some constraint. Cortes and Vapnik's 1995 fix, the move that turned a beautiful idea into a workhorse, was to let constraints be broken at a price. Introduce a slack variable \(\xi_i \ge 0\) for each point measuring how far it intrudes into (or past) its margin, and pay for the total slack:

EQ M10.5 — THE SOFT-MARGIN PRIMAL
$$ \min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^2 \;+\; C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 $$
The hyperparameter \(C\) sets the exchange rate between a wide margin and few violations. Large \(C\) punishes every violation harshly — the boundary contorts to classify training points, low bias, high variance (toward the hard margin as \(C \to \infty\)).
Small \(C\) tolerates mistakes for the sake of a fat, smooth margin — high bias, low variance. \(C\) is the single most important knob on an SVM, and it plays the role of an inverse regularization strength in the sense of Chapter 06: the larger \(C\) is, the weaker the regularization.

Eliminating the slack variables (each is just \(\xi_i = \max(0, 1 - y_i f(x_i))\) at the optimum) recasts the soft-margin SVM as plain regularized empirical-risk minimization with one specific loss — the hinge loss:

EQ M10.6 — HINGE LOSS = SVM IN DISGUISE
$$ \mathcal{L}(w, b) \;=\; \underbrace{\frac{1}{2}\lVert w \rVert^2}_{\text{margin / L2}} \;+\; C \sum_i \underbrace{\max\!\big(0,\; 1 - y_i\,f(x_i)\big)}_{\text{hinge loss } \ell_{\text{hinge}}}, \qquad f(x) = w \cdot x + b $$
The hinge \(\ell(z) = \max(0, 1 - z)\), with \(z = y\,f(x)\) the signed margin, is the soul of the method. It is zero once a point is correctly classified with margin \(\ge 1\) (so those points exert no force — the sparsity of §10.2), then rises linearly as the point drifts inside the margin or onto the wrong side. Unlike the squared loss, it does not keep penalizing points that are already comfortably right; unlike the 0–1 loss it is convex and has a usable (sub-)gradient. Logistic regression's log-loss is its smooth cousin — same shape, rounded corner.

The hinge's corner at \(z = 1\) is the entire personality of the SVM. Plug in two representative cases: a point classified correctly and well clear of the margin (\(z = y\,f(x) = 1.2\)) costs nothing, while a point on the wrong side (\(z = -0.5\)) costs \(1 - (-0.5) = 1.5\). The two exercises that follow work through exactly these.

A correctly classified point has signed margin \(z = y\cdot f(x) = 1.2\). Using the hinge loss \(\ell = \max(0,\; 1 - z)\) from EQ M10.6, what loss does this point incur?
\(\ell = \max(0,\; 1 - 1.2) = \max(0,\; -0.2) = \) 0. The point is past the margin (\(z > 1\)), so the hinge is flat at zero — it contributes no gradient and is not a support vector.
This zero-region is exactly what makes the SVM solution sparse.

A point lands on the wrong side of the boundary with signed margin \(z = y\cdot f(x) = -0.5\). What hinge loss \(\ell = \max(0,\; 1 - z)\) does it incur (EQ M10.6)?
\(\ell = \max(0,\; 1 - (-0.5)) = \max(0,\; 1.5) = \) 1.5. A misclassified point pays more than 1 (it would pay exactly 1 if it sat on the boundary at \(z = 0\)), and the penalty grows linearly the deeper it strays. This nonzero hinge makes it an active, margin-violating support vector.

INSTRUMENT M10.2 — HINGE LOSS & THE C TRADE-OFF (EQ M10.5 / M10.6 · LIVE)
Controls (defaults): regularization C (1.0), class overlap (0.9). Readouts: margin γ = 1/‖w‖, total hinge Σξ, margin violations, train accuracy.
The curve on the left is the hinge loss \(\max(0, 1-z)\) plotted against the signed margin \(z\) — note the elbow at \(z=1\) and the flat zero beyond it. The panel on the right shows two overlapping clouds with the live SVM boundary and its margins. Slide C from small (fat margin, many tolerated violations) to large (the boundary fights to classify every point, margin shrinks). Raise the overlap until the classes genuinely mix and watch even a large C give up — there is no separating line left to find.

10.4 The kernel trick: RBF and polynomial

So far the boundary is a flat hyperplane — useless for data shaped like concentric rings or two interleaved spirals. The classical escape is to lift the data into a higher-dimensional space where it becomes linearly separable: map each \(x\) through some feature transform \(\phi(x)\), fit a linear SVM there, and the flat boundary upstairs is a curved one back downstairs. The catch is cost: a good \(\phi\) might have thousands or infinitely many components, and computing them all is hopeless.

Here is the trick, and it is genuinely a piece of magic. Recall from EQ M10.3 and M10.4 that the SVM never needs \(\phi(x)\) by itself — it only ever needs inner products \(\phi(x_i)\cdot\phi(x_j)\).
So if some cheap function \(K(x_i, x_j)\) happens to equal that inner product, we can compute in the lifted space while never visiting it:

EQ M10.7 — THE KERNEL TRICK
$$ K(x, x') \;=\; \phi(x) \cdot \phi(x') \qquad\Longrightarrow\qquad \hat{y}(x) \;=\; \operatorname{sign}\!\Big( \sum_{i \in \text{SV}} \alpha_i\, y_i\, K(x_i, x) \;+\; b \Big) $$
Every \(x_i \cdot x_j\) in the dual becomes \(K(x_i, x_j)\); nothing else changes. Mercer's theorem tells you which functions \(K\) are legal: any symmetric \(K\) whose Gram matrix \([K(x_i,x_j)]\) is positive semi-definite for every finite set of points corresponds to some valid feature map \(\phi\) — you never have to construct it. You design the similarity measure, not the coordinates.

This single substitution turns the linear SVM into a universal nonlinear classifier. Two kernels dominate practice:

EQ M10.8 — RBF AND POLYNOMIAL KERNELS
$$ K_{\text{RBF}}(x, x') = \exp\!\big(-\gamma\,\lVert x - x' \rVert^2\big), \qquad K_{\text{poly}}(x, x') = \big(\gamma\, x \cdot x' + r\big)^{d} $$
The RBF (Gaussian) kernel is the default — a smooth bump of similarity that decays with distance. Its feature map \(\phi\) is infinite-dimensional, so the trick is buying you something you could literally never compute by hand. The width parameter \(\gamma\) is decisive: small \(\gamma\) means each support vector's influence reaches far (smooth, almost-linear boundary); large \(\gamma\) means influence is hyper-local (the boundary wraps tightly around individual points — overfitting). Note that this \(\gamma\) is a kernel parameter and is unrelated to the margin \(\gamma\) of EQ M10.1. The polynomial kernel of degree \(d\) implicitly contains all feature interactions up to order \(d\): degree 2 gives you every pairwise product \(x_a x_b\) for free.

The RBF width \(\gamma\) and \(C\) are tuned together, almost always by cross-validated grid search, because they interact: a large \(\gamma\) and a large \(C\) both push toward a wiggly, overfit boundary. The instrument below lets you feel that interaction on data that no straight line can touch.
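The claim that a kernel equals an inner product in a lifted space can be verified directly for the degree-2 polynomial kernel. The sketch below is a small worked check under stated assumptions: it fixes \(\gamma = 1\), \(r = 1\), \(d = 2\) in EQ M10.8, works in two dimensions, and writes out the matching six-component feature map `phi` by hand. The function names and the sample points are hypothetical. A second loop shows how the RBF similarity of two points one unit apart collapses as the width parameter grows.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel of EQ M10.8 with gamma = 1, r = 1."""
    return (dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly_kernel in 2-D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def rbf_kernel(x, z, gamma):
    """RBF kernel of EQ M10.8."""
    d2 = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * d2)

x, z = (1.0, 2.0), (3.0, -1.0)
print("kernel on raw inputs:      ", poly_kernel(x, z))
print("dot product after lifting: ", dot(phi(x), phi(z)))

# Two points exactly one unit apart: similarity versus kernel width.
a, c = (0.0, 0.0), (1.0, 0.0)
rbf_values = [rbf_kernel(a, c, g) for g in (0.1, 1.0, 10.0)]
for g, k in zip((0.1, 1.0, 10.0), rbf_values):
    print(f"gamma={g:5.1f}  K_RBF={k:.5f}")
```

The first two printed numbers match, yet the kernel needed only one 2-D dot product while the lifted version built six coordinates per point. For higher degrees or for the RBF kernel that gap becomes the whole point, since the explicit map is enormous or infinite.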
INSTRUMENT M10.3 — KERNEL PLAYGROUND RBF SVM ON INSEPARABLE RINGS · EQ M10.7 / M10.8 RBF WIDTH γ 2.0 REGULARIZATION C 5 DATASET RINGS XOR TRAIN ACCURACY — SUPPORT VECTORS — REGIME — A linear SVM scores ~50% here — the classes are concentric. The painted regions are a real RBF-kernel SVM (dual coordinate ascent over EQ M10.7) refit live in your browser. Start at a small \(\gamma\): the boundary is smooth but may miss the inner ring. Crank \(\gamma\) up and the boundary tightens into closed loops around clusters — eventually shrink-wrapping individual points, the visual signature of overfitting. Lower \(C\) to forgive stray points and recover a calmer frontier.
PYTHON · RUNNABLE IN-BROWSER
# The RBF kernel matrix on inseparable rings, and why lifting helps.
# Inner ring = class 0, outer ring = class 1 -- no line can split them.
import numpy as np
rng = np.random.default_rng(1)

def ring(r, n):
    t = rng.uniform(0, 2*np.pi, n)
    rad = r + rng.normal(0, 0.12, n)
    return np.c_[rad*np.cos(t), rad*np.sin(t)]

X = np.vstack([ring(0.6, 60), ring(2.2, 60)])
y = np.r_[np.zeros(60, int), np.ones(60, int)]

def rbf(A, B, g):  # K_ij = exp(-g ||a_i - b_j||^2)
    d2 = ((A[:, None,:] - B[None,:,:])**2).sum(-1)
    return np.exp(-g * d2)

K = rbf(X, X, g=1.0)  # full 120x120 Gram matrix
print("Gram matrix shape:", K.shape)
print("diagonal (self-similarity):", np.round(K[0, 0], 3), "(always 1)")

# A point's mean kernel-similarity to each class is already discriminative,
# even though the raw coordinates are not linearly separable at all.
sim0 = K[:, y == 0].mean(1)  # closeness to inner ring
sim1 = K[:, y == 1].mean(1)  # closeness to outer ring
pred = (sim1 > sim0).astype(int)  # the class with the higher mean similarity wins
print("nearest-mean acc in feature space:", f"{(pred == y).mean():.3f}")
print("=> in RBF feature space the rings ARE separable.")
plot_scatter(X[:, 0], X[:, 1], y)  # plotting helper supplied by the in-browser runtime
RUN ▶ drop g to 0.05 (too smooth) or push to 30 (too local) and watch the feature-space accuracy crack
WHY IT WORKS The kernel trick is a similarity measure in disguise. An RBF SVM's prediction (EQ M10.7) is a weighted vote of support vectors, where the weight \(K(x_i, x)\) is how similar the query is to each one — large \(\gamma\) makes "similar" mean "almost identical." Seen this way it is a close cousin of the k-NN and kernel-density ideas you will meet in Chapter 11: distance defines everything. The SVM's contribution is choosing which points to remember (the support vectors) and how much to weight each (the \(\alpha_i\)) by maximizing a margin, rather than keeping the whole training set.
10.5 SVMs in practice — and against the field For roughly a decade — from the late 1990s until deep learning's 2012 breakout — kernel SVMs were among the most accurate general-purpose classifiers available, the default for text categorization, handwritten-digit recognition, and bioinformatics. They have receded, but understanding when they still win, and why they faded, is worth a section. The practical checklist is short and unusually reliable: Scale your features. Both the dot product and the RBF distance are dominated by large-magnitude features, so standardize to zero mean and unit variance first. This is not optional — an unscaled SVM is mostly listening to whichever column happens to have the biggest numbers. Start with the RBF kernel, and grid-search \((C, \gamma)\) on a log scale with cross-validation. Linear kernels are the right call only when the feature space is already huge and sparse (e.g.
bag-of-words text), where they are both faster and just as accurate. Watch the support-vector count. If nearly every training point becomes a support vector, your \(\gamma\) is too large or the problem is too noisy — the model is memorizing, and both accuracy and prediction speed will suffer. SVMs output a score, not a probability. Calibrate (Platt scaling, or isotonic regression) if you need \(P(y\,|\,x)\).

| Model | Decision boundary | Scales to N samples | Probabilities | Best when… |
|---|---|---|---|---|
| Linear SVM | hyperplane (max-margin) | excellent (linear in N) | via calibration | high-dim sparse text; N in the millions |
| RBF SVM | smooth nonlinear | poor (\(O(N^2)\)–\(O(N^3)\) to train) | via calibration | small/medium N, low-dim, clean signal |
| Logistic regression | hyperplane (log-loss) | excellent | native & calibrated | you need probabilities and interpretability |
| Gradient-boosted trees | axis-aligned, piecewise | excellent | native-ish | heterogeneous tabular data (the default to beat) |
| Deep nets | arbitrary, learned features | good (SGD), data-hungry | native (softmax) | images, text, audio; very large N |

The honest verdict, contested only at the margins: on the medium-sized, low-dimensional, fairly clean problems that made their reputation, a well-tuned RBF SVM is still competitive with anything and often the cleanest model to reason about. But its training cost grows roughly quadratically-to-cubically in the number of samples, which rules it out for the large datasets that define modern ML; on messy tabular data, gradient-boosted trees (Chapter 04) usually edge it out with far less tuning; and on perceptual data, deep networks — which learn their feature map instead of fixing it with a kernel — left SVMs behind entirely after 2012. The kernel idea itself never died, though: it reappears in Gaussian processes, in the "neural tangent kernel" theory of why wide networks train the way they do, and in the attention mechanism of Volume II, which is a learned, data-dependent similarity kernel at heart.
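The first two checklist items can be made concrete without any SVM library. This is a minimal sketch, assuming two hypothetical columns (income in dollars, age in years) and an illustrative log-spaced grid; the column names, the grid bounds, and the helper `first_column_share` are assumptions made for this example, and the cross-validation loop itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features on very different scales (illustrative values only).
income = rng.normal(50_000, 15_000, n)   # dollars
age = rng.normal(40, 12, n)              # years
X = np.c_[income, age]

def sq_dists(A):
    """All pairwise squared Euclidean distances between the rows of A."""
    return ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)

def first_column_share(A):
    """Fraction of the total pairwise squared distance contributed by column 0."""
    return sq_dists(A[:, :1]).sum() / sq_dists(A).sum()

# z-score each column: mean 0, variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = first_column_share(X)
scaled_share = first_column_share(Z)
print(f"income's share of squared distance, raw:    {raw_share:.6f}")
print(f"income's share of squared distance, scaled: {scaled_share:.6f}")

# Log-spaced search grid for (C, gamma); the bounds are an assumption.
C_grid = np.logspace(-2, 3, 6)       # 0.01 ... 1000
gamma_grid = np.logspace(-3, 2, 6)   # 0.001 ... 100
pairs = [(C, g) for C in C_grid for g in gamma_grid]
print(f"{len(pairs)} (C, gamma) pairs to score by cross-validation")
```

Before scaling, essentially all of the squared distance that feeds the RBF kernel comes from the income column; after z-scoring the two columns contribute equally. The grid steps by factors of ten because useful values of both parameters span orders of magnitude.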
PITFALLS The four ways an SVM goes wrong: (1) forgetting to scale features — the single most common failure, and it produces a model that looks trained but isn't; (2) leaving \(C\) and \(\gamma\) at library defaults — they must be searched jointly, and the right values span orders of magnitude; (3) running an RBF SVM on hundreds of thousands of rows and waiting forever — reach for a linear SVM or SGD instead; (4) treating the raw decision score as a probability — it is an uncalibrated margin, not \(P(y\,|\,x)\). NEXT Every kernel in this chapter was, underneath, a measure of similarity between two points. Chapter 11 makes that the whole subject: Euclidean, Manhattan, cosine, Mahalanobis, Jaccard, edit distance — what "near" means, why the choice quietly decides everything from k-NN to clustering to the retrieval step inside every modern embedding system. 10.R References Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297. The paper that introduced the soft margin and slack variables (EQ M10.5) and gave the method its modern form — the canonical primary source for this chapter. Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152. Where the maximum-margin hyperplane and the kernel trick (EQ M10.1, M10.7) were first combined — the true origin of the kernel SVM. Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press — the definitive textbook treatment of kernels, Mercer's theorem, and the optimization behind EQ M10.3, M10.8. Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167. The most-cited tutorial — derives the dual (EQ M10.3) and the KKT support-vector conditions step by step. Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27.
The standard implementation behind scikit-learn's SVC, and the reference for practical \((C, \gamma)\) selection (§10.5). Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30. The hinge-loss sub-gradient method used in this chapter's first Python cell — how to train a linear SVM at scale.
## VOL I · Distance & Similarity Metrics (https://ai-encyclopedia.com/ml/11-distances-similarity.html)
MACHINE LEARNING · CHAPTER 11 / 15 Distance & Similarity Metrics k-NN, every clustering algorithm, and every vector search rest on one decision that usually goes unexamined: how you measure "close". That single choice of distance determines every nearest neighbor, cluster, and embedding, and in high dimensions it behaves in counterintuitive ways. This chapter builds the families of distance and similarity from the ground up, Minkowski, Mahalanobis, cosine, and Jaccard, then confronts the curse that hollows them out as dimensions grow. LEVEL INTRO READING TIME ≈ 22 MIN BUILDS ON ML 04 · STATS 06 INSTRUMENTS DISTANCE EXPLORER · MAHALANOBIS · CONCENTRATION IN THIS CHAPTER 11.1 The distance defines the model 11.2 The Minkowski family 11.3 Mahalanobis distance 11.4 Cosine & Jaccard similarity 11.5 The curse of dimensionality 11.R References 11.1 Why the distance defines the model k-NN (Chapter 04) has no parameters, no training step, no loss function. Strip it down and what remains is a stored dataset and one function that decides which points count as neighbors. The same is true of k-means, of DBSCAN, of hierarchical clustering, of the approximate-nearest-neighbor index inside every vector database.
None of them is really an algorithm over data; each is an algorithm over a distance. Change the distance and you have changed the model — usually more than any hyperparameter could. Mathematicians demand four properties before they will call a function \(d(\mathbf{x},\mathbf{y})\) a metric. They are worth stating because the moment you violate one, geometric intuition stops being trustworthy: EQ M11.1 — THE METRIC AXIOMS $$ d(\mathbf{x},\mathbf{y}) \ge 0,\quad d(\mathbf{x},\mathbf{y}) = 0 \iff \mathbf{x}=\mathbf{y},\quad d(\mathbf{x},\mathbf{y}) = d(\mathbf{y},\mathbf{x}),\quad d(\mathbf{x},\mathbf{z}) \le d(\mathbf{x},\mathbf{y}) + d(\mathbf{y},\mathbf{z}) $$ In order: non-negativity, identity of indiscernibles (zero distance means the same point), symmetry, and the triangle inequality — a detour through \(\mathbf{y}\) can never be shorter than going straight. Euclidean, Manhattan, and Mahalanobis distance satisfy all four. Cosine and squared-Euclidean distance do not — both break the triangle inequality, which is exactly why some fast indexes that assume a true metric cannot be used with them unmodified (§11.4). The deepest consequence hides inside the most innocent-looking choice. A distance combines coordinates, so it implicitly weights them. If one feature is measured in millimetres and another in kilometres, the kilometres dominate every Euclidean comparison and the millimetres become invisible — the model decides almost everything on a single axis without anyone choosing that. This is why distance-based methods demand standardized features, why a covariance correction exists at all (§11.3), and why "what distance?" is never a footnote. The distance is the inductive bias. SCALE FIRST An unscaled feature is a silent vote. Before any distance is computed, the standard moves are z-scoring each column to mean 0, variance 1, or min–max scaling to \([0,1]\). Skip it and the column with the largest raw spread quietly becomes the model. 
Trees (Chapter 04) are the exception that proves the rule — they care only about the order of values within a column, so they are invariant to rescaling and never need it. 11.2 The Minkowski family — Euclidean & Manhattan The workhorse distances are a single formula with one dial. The Minkowski distance of order \(p\) raises each coordinate gap to the power \(p\), sums them, and takes the \(p\)-th root: EQ M11.2 — THE MINKOWSKI DISTANCE $$ d_p(\mathbf{x},\mathbf{y}) \;=\; \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{\,p} \right)^{1/p}, \qquad p \ge 1 $$ Turning the single knob \(p\) gives every distance you use daily. \(p=1\) is Manhattan (taxicab) distance, the sum of absolute coordinate gaps — how far you walk on a city grid. \(p=2\) is Euclidean distance, the straight-line length from Pythagoras. As \(p \to \infty\) the largest single coordinate gap swallows the sum and you get the Chebyshev distance \(\max_i \lvert x_i - y_i\rvert\) — the king's move on a chessboard. For \(p \ge 1\) it is a true metric; for \(0 < p < 1\) the triangle inequality fails and it is only a "fractional distance." WORKED EXAMPLE ▾ 01 Take \(\mathbf{x} = (1, 1)\) and \(\mathbf{y} = (4, 5)\). The coordinate gaps are \(\lvert 4-1\rvert = 3\) and \(\lvert 5-1\rvert = 4\). 02 Manhattan (\(p=1\)): \(3 + 4 = \mathbf{7}\) — total blocks walked along the grid. 03 Euclidean (\(p=2\)): \(\sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = \mathbf{5}\) — the 3-4-5 right triangle. 04 Chebyshev (\(p\to\infty\)): \(\max(3, 4) = \mathbf{4}\). Note the ordering \(4 \le 5 \le 7\): higher \(p\) always gives a smaller-or-equal distance. RESULT: Manhattan 7 · Euclidean 5 · Chebyshev 4 The order \(p\) is not a cosmetic choice — it reshapes the geometry of "near." The set of all points at distance exactly 1 from the origin, the unit ball, changes form completely: a diamond for Manhattan, a circle for Euclidean, a square for Chebyshev. 
Two points that are equidistant under one \(p\) need not be under another, so the nearest neighbor itself can change with \(p\). The instrument below lets you drag both points and read all three off at once. Using EQ M11.2 with \(p=1\), what is the Manhattan distance between \((1, 2)\) and \((4, 6)\)? Sum the absolute coordinate gaps: \(\lvert 4-1\rvert + \lvert 6-2\rvert = 3 + 4 = \) 7. Manhattan ignores the diagonal shortcut — it counts only axis-aligned travel, as if walking city blocks. Using EQ M11.2 with \(p=2\), what is the Euclidean distance between \((0, 0)\) and \((3, 4)\)? \(\sqrt{(3-0)^2 + (4-0)^2} = \sqrt{9 + 16} = \sqrt{25} = \) 5 — the hypotenuse of the classic 3-4-5 right triangle. The straight-line route is shorter than the Manhattan route of \(3+4=7\), as the triangle inequality guarantees.
PYTHON · RUNNABLE IN-BROWSER
# EQ M11.2: the Minkowski family, plus cosine and Mahalanobis, between two vectors
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([4.0, 5.0])
diff = x - y

manhattan = np.abs(diff).sum()           # p = 1
euclidean = np.sqrt((diff ** 2).sum())   # p = 2
chebyshev = np.abs(diff).max()           # p -> infinity
print(f"Manhattan (p=1): {manhattan:.4f}")
print(f"Euclidean (p=2): {euclidean:.4f}")
print(f"Chebyshev (inf): {chebyshev:.4f}")
print("ordering p1 >= p2 >= pinf:", manhattan >= euclidean >= chebyshev)

cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"\ncosine similarity: {cos_sim:.4f} (1 = same direction)")

cov = np.array([[1.0, 0.8], [0.8, 1.0]])   # correlated features
m2 = diff @ np.linalg.inv(cov) @ diff      # squared Mahalanobis (EQ M11.3)
print(f"Mahalanobis dist: {np.sqrt(m2):.4f} (vs Euclidean {euclidean:.4f})")
RUN ▶ edits are live — try p between 1 and 2, or set the off-diagonal covariance to 0
INSTRUMENT M11.1 — DISTANCE EXPLORER TWO POINTS · EUCLIDEAN · MANHATTAN · CHEBYSHEV · EQ M11.2 POINT P — x 1.0 POINT P — y 1.0 POINT Q — x 4.0 POINT Q — y 5.0 EUCLIDEAN (p=2) — MANHATTAN (p=1) — CHEBYSHEV (p=∞) — The faint
coloured outlines around P are the three unit balls — the diamond is Manhattan, the circle Euclidean, the square Chebyshev — every shape a locus of "distance 1 from P." The mint line is the straight Euclidean route; the blue staircase is one Manhattan path of equal taxicab length. Default P = (1,1), Q = (4,5) gives Euclidean 5, Manhattan 7, Chebyshev 4. Drag Q onto a diagonal of P (equal x- and y-gaps) and watch Manhattan stretch furthest while Chebyshev stays small — the same two points, three different verdicts on "how far." 11.3 Mahalanobis distance — correcting for covariance Euclidean distance treats every direction as equal and every feature as independent. Real data rarely obliges. Suppose height and weight are strongly correlated: a point that is tall-and-light sits far from the data cloud even if its raw Euclidean distance to the centre is modest, because it violates the correlation the rest of the data obeys. The Mahalanobis distance (Mahalanobis, 1936) fixes this by measuring distance in units of the data's own spread, deflating directions of high variance and inflating directions of low variance: EQ M11.3 — MAHALANOBIS DISTANCE $$ d_M(\mathbf{x},\boldsymbol{\mu}) \;=\; \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\top}\, \Sigma^{-1}\, (\mathbf{x}-\boldsymbol{\mu})} $$ \(\boldsymbol{\mu}\) is the data mean and \(\Sigma\) its covariance matrix; \(\Sigma^{-1}\) is the inverse covariance (the "precision" matrix). The quadratic form rotates into the eigen-axes of \(\Sigma\) and rescales each by \(1/\sqrt{\lambda_i}\) — exactly the principal directions of Stats 06. When \(\Sigma = I\) (uncorrelated, unit-variance features), Mahalanobis collapses back to plain Euclidean distance. Curves of constant \(d_M\) are the ellipses aligned with the data cloud, not circles — which is why a point along the cloud's long axis can be "closer" than a nearer point lying across it. 
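The rotate-and-rescale reading of the quadratic form in EQ M11.3 can be verified numerically. The sketch below uses an assumed mean, covariance, and test point chosen only for illustration: it computes \(d_M\) once from the quadratic form and once by projecting the gap onto the eigenvectors of \(\Sigma\) and dividing each coordinate by \(\sqrt{\lambda_i}\), then confirms that \(\Sigma = I\) gives back the Euclidean distance.

```python
import numpy as np

# Illustrative mean, covariance, and test point (assumed values).
mu = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
x = np.array([2.5, 0.5])
d = x - mu

# Direct quadratic form of EQ M11.3.
d_direct = np.sqrt(d @ np.linalg.inv(cov) @ d)

# Same number via the eigen-decomposition cov = V diag(lam) V^T.
lam, V = np.linalg.eigh(cov)   # eigenvalues ascending, eigenvectors in columns
coords = V.T @ d               # rotate the gap into the eigen-axes
d_eigen = np.sqrt(np.sum((coords / np.sqrt(lam)) ** 2))  # rescale each axis by 1/sqrt(lambda)

# With the identity covariance the distance is plain Euclidean.
d_identity = np.sqrt(d @ np.linalg.inv(np.eye(2)) @ d)

print(f"quadratic form:     {d_direct:.6f}")
print(f"rotate and rescale: {d_eigen:.6f}")
print(f"Sigma = I:          {d_identity:.6f}  (Euclidean {np.linalg.norm(d):.6f})")
```

`np.linalg.eigh` is used because a covariance matrix is symmetric; it returns real eigenvalues and orthonormal eigenvectors, which is what makes `V.T @ d` a pure rotation.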
There is a clean way to see what \(\Sigma^{-1}\) is doing: Mahalanobis distance is just Euclidean distance computed after whitening the data — applying the linear transform \(\Sigma^{-1/2}\) that decorrelates the features and rescales each to unit variance. Whiten first, then measure with an ordinary ruler. This also explains its starring role in outlier and anomaly detection: for multivariate-Gaussian data, \(d_M^2\) follows a \(\chi^2\) distribution with \(n\) degrees of freedom, giving a principled threshold for "too far to be one of us." Features are uncorrelated with variances 9 and 16, so \(\Sigma = \begin{bmatrix} 9 & 0 \\ 0 & 16 \end{bmatrix}\) (\(\sigma_x = 3\), \(\sigma_y = 4\)). A point sits a gap of \((9, 16)\) from the mean. What is its Mahalanobis distance \(d_M\) (EQ M11.3)? For diagonal \(\Sigma\), \(\Sigma^{-1} = \begin{bmatrix} 1/9 & 0 \\ 0 & 1/16 \end{bmatrix}\), so each gap is measured in standard deviations. The squared distance is \(\dfrac{9^2}{9} + \dfrac{16^2}{16} = 9 + 16 = 25\). So \(d_M = \sqrt{25} = \) 5. The point sits \(3\sigma\) out along \(x\) and \(4\sigma\) out along \(y\) — a 3-4-5 right triangle in σ-units, even though its raw Euclidean gap of \(\sqrt{9^2+16^2}\approx 18.4\) is far larger. INSTRUMENT M11.2 — MAHALANOBIS vs EUCLIDEAN CORRELATED CLOUD · COVARIANCE ELLIPSE · EQ M11.3 VARIANCE σ²ₓ 3.0 VARIANCE σ²ᵧ 0.6 CORRELATION ρ 0.80 TEST POINT — x 2.4 TEST POINT — y 2.0 EUCLIDEAN TO MEAN — MAHALANOBIS TO MEAN — VERDICT (χ² ≈ 2.45σ) — The mint dots are a seeded Gaussian cloud with the covariance you dial in; the mint ellipses are the contours of constant Mahalanobis distance (1, 2, 3 σ). The white dot is your test point, the white line its straight Euclidean reach to the mean. Crank \(\rho\) toward 0.8 and slide the test point along the cloud's long axis: Euclidean distance stays large while Mahalanobis shrinks — the point is unremarkable given the correlation. 
Now push it across the short axis and the verdict flips to OUTLIER even at a small Euclidean distance. That divergence is the entire reason Mahalanobis exists.
PYTHON · RUNNABLE IN-BROWSER
# Mahalanobis = Euclidean after whitening: two points, same Euclidean distance,
# but very different Mahalanobis distance on a correlated cloud (EQ M11.3)
import numpy as np

mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9], [0.9, 1.0]])   # strongly correlated features
Sinv = np.linalg.inv(cov)

# two test points the SAME Euclidean distance (sqrt 2) from the mean...
along = np.array([1.0, 1.0])     # lies ALONG the correlation axis
across = np.array([1.0, -1.0])   # lies ACROSS it

def mahal(p):
    d = p - mu
    return np.sqrt(d @ Sinv @ d)

for name, p in [("along ", along), ("across", across)]:
    print(f"{name}: Euclidean = {np.linalg.norm(p - mu):.3f}"
          f" Mahalanobis = {mahal(p):.3f}")

print("\nIdentical Euclidean distance, ~4.4x difference in Mahalanobis:")
print("the across-the-grain point violates the correlation, so it reads as far.")
RUN ▶ edits are live — set the off-diagonal to 0 and watch the two distances become equal
11.4 Cosine & Jaccard similarity Sometimes magnitude is noise and only direction carries meaning. A document whose word counts are all exactly double another's, so that every word appears in the same proportion, is about the same thing, yet its longer count vector sits far away under Euclidean distance. Cosine similarity throws magnitude away by normalizing both vectors to unit length, then taking their dot product — the cosine of the angle between them: EQ M11.4 — COSINE SIMILARITY $$ \cos\theta \;=\; \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert} \;=\; \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \;\in\; [-1, 1] $$ \(1\) means the vectors point the same way, \(0\) means orthogonal (unrelated), \(-1\) means opposite.
The companion cosine distance is \(1 - \cos\theta\); it is not a true metric (it violates the triangle inequality), which is why some metric-tree indexes refuse it. The standard trick: on L2-normalized vectors, squared Euclidean distance is \(2(1 - \cos\theta)\) — a strictly increasing function of cosine distance — so cosine ranking equals Euclidean ranking, and you can use any Euclidean index after normalizing. This is why embedding pipelines normalize before they store. Cosine is the default for sparse text vectors and dense embeddings precisely because it ignores document length and activation scale. When the data is not counts but sets — the tags on a photo, the words in a tweet, the products in a basket — direction is the wrong picture entirely. Jaccard similarity measures overlap: the size of the intersection over the size of the union. EQ M11.5 — JACCARD SIMILARITY $$ J(A,B) \;=\; \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}, \qquad d_J(A,B) \;=\; 1 - J(A,B) $$ Pure set overlap: \(1\) when the sets are identical, \(0\) when they are disjoint. The Jaccard distance \(1 - J\) is a true metric. On binary vectors it counts only shared presences and ignores the vast sea of shared absences — the right instinct for sparse data, where two short documents agreeing that they both lack 49,998 of 50,000 vocabulary words tells you nothing. At web scale Jaccard is estimated, not computed, via MinHash + locality-sensitive hashing — the engine behind near-duplicate detection and the data deduplication that cleans LLM training corpora (Vol II · CH 04). Two sets \(A = \{1, 2, 3\}\) and \(B = \{2, 3, 4\}\). What is their Jaccard similarity \(J(A,B)\) (EQ M11.5)? Intersection \(A \cap B = \{2, 3\}\), size 2. Union \(A \cup B = \{1, 2, 3, 4\}\), size 4. So \(J = \dfrac{2}{4} = \) 0.5 — the sets share half of their combined elements. The Jaccard distance is \(1 - 0.5 = 0.5\).

| Measure | Sees | True metric? | Reach for it when… |
|---|---|---|---|
| Euclidean | straight-line gap | yes | features are scaled & roughly independent; the default for k-means |
| Manhattan | axis-aligned gap | yes | grid-like or high-dimensional data; more robust to outliers than L2 |
| Mahalanobis | gap in σ-units | yes | features are correlated; outlier/anomaly detection |
| Cosine | angle only | no | text TF-IDF, dense embeddings, when length is irrelevant |
| Jaccard | set overlap | yes (1−J) | sparse binary / set data; near-duplicate detection |

PYTHON · RUNNABLE IN-BROWSER
# Cosine ignores magnitude; Jaccard measures set overlap (EQ M11.4-M11.5)
import numpy as np

# a short doc and the SAME doc with every count doubled
a = np.array([3.0, 1.0, 0.0, 2.0])
b = np.array([6.0, 2.0, 0.0, 4.0])   # b = 2*a: same direction
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
euc = np.linalg.norm(a - b)
print(f"cosine(a, b): {cos:.4f} (1.0 = identical direction)")
print(f"euclidean(a, b): {euc:.4f} (large! magnitude differs)")
print("-> cosine calls them identical; Euclidean is fooled by length\n")

# Jaccard on two sets, computed with Python set operations
A = {1, 2, 3}
B = {2, 3, 4}
inter = len(A & B)
union = len(A | B)
print(f"A & B = {A & B}, A | B = {A | B}")
print(f"Jaccard(A, B): {inter/union:.4f}")
print(f"Jaccard distance: {1 - inter/union:.4f}")
RUN ▶ edits are live — add an element to B and watch Jaccard fall
11.5 The curse of dimensionality Everything above is built on a quiet assumption: that "nearest" is meaningfully different from "farthest." In low dimensions it obviously is. In high dimensions it quietly stops being true — and this is the single most important, least intuitive fact in this chapter. As the number of dimensions grows, the distances between random points concentrate: the nearest neighbor and the farthest neighbor of a query end up at almost the same distance, and the very notion of a closest point loses its bite.
EQ M11.6 — DISTANCE CONCENTRATION $$ \lim_{n \to \infty} \;\mathbb{E}\!\left[ \frac{\mathrm{dist}_{\max}(n) - \mathrm{dist}_{\min}(n)}{\mathrm{dist}_{\min}(n)} \right] \;=\; 0 \qquad (\text{i.i.d. features}) $$ For data with independent coordinates, the relative contrast between the farthest and nearest points of a query vanishes as the dimension \(n\) grows (Beyer et al., 1999). Each new coordinate adds roughly equal, independent "noise" to every pairwise distance, so by a law-of-large-numbers effect all distances pile up around the same mean and their spread relative to that mean collapses. The practical reading is brutal: in enough dimensions, every point is almost equidistant from every other, k-NN's votes become coin flips, and the index gains nothing over a linear scan. Aggarwal et al. (2001) add a twist — lower \(p\) (Manhattan over Euclidean, even fractional norms) concentrates more slowly, so Manhattan is often the better high-dimensional choice. A second face of the same curse is geometric. To capture a fixed fraction of uniformly spread points, a neighborhood must grow until it is no longer "local" at all. In an \(n\)-dimensional unit cube, a sub-cube holding just 1% of the volume needs edge length \(0.01^{1/n}\): at \(n=2\) that is 0.10, but at \(n=100\) it is \(\approx 0.955\), so to grab 1% of the data your "neighborhood" must span about 95% of every single axis. Locality evaporates. THE ESCAPE HATCH Why does any of this still work? Because real high-dimensional data is almost never i.i.d. across its coordinates. Images, text embeddings, and audio live near a much lower-dimensional manifold inside the ambient space — the manifold hypothesis. The concentration result describes the worst case of structureless noise; genuine data has structure, so its intrinsic dimension is small even when its nominal dimension is thousands. This is precisely why k-NN on a good learned embedding still works, while k-NN on raw high-dimensional pixels does not.
Dimensionality reduction (PCA, UMAP — next chapters) and learned representations are, at bottom, attempts to recover that low intrinsic dimension before you ever compute a distance. INSTRUMENT M11.3 — DISTANCE CONCENTRATION RANDOM POINTS · CONTRAST vs DIMENSION · EQ M11.6 DIMENSION n 2 NORM EUCLIDEAN MANHATTAN NEAREST / FARTHEST DIST — RELATIVE CONTRAST — REGIME — 200 points are drawn uniformly in the \(n\)-cube; the histogram is their distances to one query, and the contrast \((d_{\max}-d_{\min})/d_{\min}\) is EQ M11.6 made live. At \(n=2\) the distances are spread wide and "nearest" is meaningful. Slide \(n\) toward 512 and watch the histogram tighten into a spike — the contrast plunges from ~10 toward a fraction, and a nearest neighbor stops being distinguishable from the rest. Flip to Manhattan and the collapse is gentler at every dimension: the Aggarwal result, drawn from random data.
PYTHON · RUNNABLE IN-BROWSER
# EQ M11.6: the relative contrast (max - min)/min of query-to-point distances
# collapsing as dimension rises
import numpy as np
rng = np.random.default_rng(0)

print(" dim mean dist min max contrast (max-min)/min")
dims, contrasts = [], []
for n in (2, 4, 8, 16, 64, 256, 1024):
    X = rng.random((200, n))                  # 200 uniform points in the n-cube
    q = rng.random(n)                         # one query point
    d = np.sqrt(((X - q) ** 2).sum(axis=1))   # Euclidean distance to each
    contrast = (d.max() - d.min()) / d.min()
    dims.append(n)
    contrasts.append(contrast)
    print(f"{n:5d} {d.mean():10.3f} {d.min():8.3f} {d.max():8.3f} {contrast:14.3f}")

print("\nIn 2-D the farthest point is many times farther than the nearest; by 1024-D the gap is tiny.")
print("Nearest-neighbor 'distance' loses its meaning -- the curse, quantified.")
plot_xy(dims, contrasts)  # contrast vs dimension (plotting helper supplied by the in-browser runtime)
RUN ▶ edits are live — switch to Manhattan (np.abs(X-q).sum) and watch it decay slower
NEXT You now own the ruler; next you point it at unlabeled data.
Every clustering algorithm is a distance plus a rule for grouping by it — k-means minimizes squared Euclidean distance to centroids, DBSCAN grows clusters by a distance radius, hierarchical methods merge by inter-cluster distance. Chapter 12 tours the clustering zoo and shows how the choice you made here decides the shapes each one can — and cannot — find. 11.R References Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7. The original covariance-corrected distance (EQ M11.3), reprinted with commentary. Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434. Shows lower-order (fractional, Manhattan) norms concentrate more slowly than Euclidean (§11.5). Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235. The foundational distance-concentration result behind EQ M11.6. Cover, T. & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13(1), 21–27. Why the distance choice is the model: the founding analysis of k-NN. Broder, A. Z. (1997/1998). On the resemblance and containment of documents & Min-wise independent permutations. SEQUENCES / STOC. MinHash estimation of Jaccard similarity at scale (§11.4).
## VOL I · The Clustering Zoo (https://ai-encyclopedia.com/ml/12-clustering-zoo.html)
MACHINE LEARNING · CHAPTER 12 / 15 The Clustering Zoo — Hierarchical, DBSCAN & GMM k-means is fast and simple, but its squared-distance-to-a-centre objective can only carve the plane into round, equal, convex blobs.
Hand it two interleaving crescents, a dense core inside a sparse halo, or a stretched ellipse, and it returns a confident but wrong partition. The rest of the clustering zoo exists to see the shapes k-means cannot: linking points into trees, chasing density instead of distance, and replacing hard balls with soft, full-covariance Gaussians. LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON ML 05 · 11 INSTRUMENTS COMPARATOR · DENDROGRAM · EM STEPPER IN THIS CHAPTER 12.1 What k-means cannot see 12.2 Hierarchical & dendrograms 12.3 DBSCAN: density-based 12.4 Gaussian mixtures & EM 12.5 Choosing k & validating 12.R References 12.1 What k-means cannot see (recap) Chapter 05 built k-means from scratch: drop \(k\) centroids, assign each point to its nearest one, recompute each centroid as the mean of its members, repeat. The whole method optimizes a single objective — the within-cluster sum of squared distances, or inertia — and that one choice of objective bakes in three assumptions the output never confesses to. EQ M12.1 — WHAT THE INERTIA OBJECTIVE ASSUMES $$ J \;=\; \sum_{i=1}^{n}\big\lVert x_i - \mu_{c_i}\big\rVert^{2} \;=\; \sum_{j=1}^{k}\sum_{i \in S_j}\big\lVert x_i - \mu_j \big\rVert^{2} $$ Squared Euclidean distance to a single centre \(\mu_j\) is the only shape information k-means has. Three consequences fall straight out of that formula: clusters are implicitly spherical (every direction is penalized equally — a circle, never an ellipse), equal-sized (a point joins whichever centre is nearer in raw distance, so a big loose cluster cannibalizes a small tight one), and exhaustive (every point must join some cluster — there is no term for "this is an outlier"). Real data violates all three routinely; k-means violates them silently. 
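The two forms of EQ M12.1 and the "exhaustive" consequence are easy to confirm on toy data. The following is a minimal sketch on two assumed Gaussian blobs with a fixed assignment (the data, the 0.3 shift, and the outlier location are illustrative choices): it evaluates the inertia both ways, shows that moving a centre off its cluster mean raises \(J\), and shows that a point far from everything is still handed a cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative blobs with a fixed (assumed) assignment of points to clusters.
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (50, 2)),
               rng.normal([4.0, 0.0], 0.5, (50, 2))])
labels = np.r_[np.zeros(50, int), np.ones(50, int)]
k = 2
centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Left-hand form of EQ M12.1: sum over points.
J_points = ((X - centres[labels]) ** 2).sum()
# Right-hand form: sum over clusters, then over each cluster's members.
J_clusters = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))

# Nudging a centre off the cluster mean can only raise J.
shifted = centres.copy()
shifted[0] += 0.3
J_shifted = ((X - shifted[labels]) ** 2).sum()

# "Exhaustive": a far-away point still gets a cluster, never a noise label.
outlier = np.array([100.0, 100.0])
outlier_label = int(np.argmin(((centres - outlier) ** 2).sum(axis=1)))

print(f"J (per point)   = {J_points:.4f}")
print(f"J (per cluster) = {J_clusters:.4f}")
print(f"J after shifting centre 0 = {J_shifted:.4f}")
print(f"outlier assigned to cluster {outlier_label}")
```

The increase after the shift equals the cluster size times the squared length of the shift, which is why the mean is the minimizing centre for a fixed assignment.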
The recap, then, is a list of failure modes — and a map of which animal in the zoo fixes each one:

| k-means assumes… | How data breaks it | Reach for… |
|---|---|---|
| Clusters are convex, round balls | two interleaving crescents (the two-moons set of Ch 04) get sliced crosswise, never traced | DBSCAN · spectral clustering |
| Clusters are equal-sized & isotropic | a stretched ellipse, or a dense core beside a sparse halo: the boundary lands where variances balance, not where you would draw it | Gaussian mixtures (full covariance, soft assignment) |
| Every point belongs somewhere | a handful of outliers each capture, or badly drag, a centroid | DBSCAN (has a literal noise label) |
| \(k\) is known in advance | it almost never is; inertia falls monotonically with \(k\) and cannot choose it for you | dendrogram cuts · DBSCAN (no \(k\)) · silhouette / BIC (§12.5) |

One honest framing before the algorithms arrive. There is no universally "correct" clustering — clustering is the search for structure no label ever defined, so every method optimizes a proxy (compactness, connectivity, density, likelihood) and the right method is the one whose proxy matches your purpose. The zoo is not a ladder from worse to better; it is a set of differently-shaped lenses. The instrument below is the whole chapter in one frame: one dataset, three lenses, three verdicts.

INSTRUMENT M12.1 — ALGORITHM COMPARATOR · ONE DATASET · k-MEANS vs DBSCAN vs GMM (interactive; datasets: TWO MOONS, ELLIPSES, CORE + HALO, ROUND BLOBS; algorithms: k-MEANS, DBSCAN, GMM; readouts: clusters found, noise / unassigned, verdict vs truth)

Each dataset has a "right" answer your eye can see. Start on TWO MOONS and step through the algorithms: k-means slices both crescents straight across (the centres land between the moons, not along them); the Gaussian mixture, still centre-based, does little better; only DBSCAN traces each crescent by following the chain of dense neighbours.
Switch to ELLIPSES and the story inverts — there the soft, full-covariance GMM wins and DBSCAN, with one global density threshold, struggles. CORE + HALO punishes the single threshold for everyone. No animal wins every dataset; matching the lens to the shape is the entire skill. 12.2 Hierarchical clustering & dendrograms k-means demands you commit to \(k\) before you have seen any structure. Agglomerative hierarchical clustering refuses to commit: it builds the entire family of clusterings at once, from \(n\) singletons up to one all-encompassing cluster, and lets you read off whichever level you like afterward. The algorithm is almost embarrassingly simple — start with every point as its own cluster, then repeatedly merge the two closest clusters until only one remains: EQ M12.2 — AGGLOMERATIVE MERGE & LINKAGE $$ (A^\star, B^\star) \;=\; \arg\min_{A \neq B}\; D(A, B), \qquad D_{\text{single}} = \min_{a \in A,\, b \in B}\lVert a - b\rVert, \quad D_{\text{complete}} = \max_{a \in A,\, b \in B}\lVert a - b\rVert, \quad D_{\text{ward}} = \Delta\,\text{SSE} $$ Everything turns on \(D(A,B)\), the linkage — how you measure the distance between two clusters, not two points. Single linkage uses the nearest pair, so it chains along thin filaments and can trace non-convex shapes (but suffers "chaining", merging distinct groups joined by a single bridge of points). Complete linkage uses the farthest pair, forcing compact, ball-like clusters. Ward linkage merges the pair that increases total within-cluster squared error least — it is, in effect, hierarchical k-means and the most common default. Different linkages produce genuinely different trees from identical data; the choice is a modelling decision, not a detail. The output is not a flat partition but a dendrogram: a binary tree whose leaves are the \(n\) points and whose every internal node is a merge, drawn at a height equal to the linkage distance at which that merge happened. 
The tree records the complete order in which structure assembled. To extract clusters you draw a horizontal line at some height \(h\) and cut: every branch the line crosses becomes one cluster. Cut low and you get many tight clusters; cut high and they fuse into a few loose ones — and the big vertical gaps in the tree (long stretches where no merge happens) mark the most natural, most stable places to cut. The arithmetic of the tree itself is fixed and worth internalizing. Each merge reduces the cluster count by exactly one, starting from \(n\) and ending at \(1\), so a dataset of \(n\) points always produces exactly \(n-1\) merges — and therefore \(n-1\) internal nodes in the dendrogram. You run agglomerative hierarchical clustering on a dataset of \( n = 50 \) points, all the way up to a single root cluster. How many merge operations does the algorithm perform in total? Each merge takes two clusters and makes one, so the cluster count drops by exactly 1 per merge. Starting at \( n = 50 \) singletons and ending at 1 cluster requires \( 50 - 1 = \) 49 merges — and the dendrogram therefore has 49 internal nodes, one per merge. A dendrogram's six leaves are joined by five merges at heights \( 0.4,\ 0.9,\ 1.1,\ 1.2,\ 2.8 \). You cut the tree with a horizontal line at height \( h = 1.0 \). How many clusters does the cut produce? A cut undoes every merge that happened above \( h \) and keeps every merge below it intact. The merges at \( 0.4 \) and \( 0.9 \) are below \( 1.0 \) and survive; the three at \( 1.1, 1.2, 2.8 \) are above the line and are undone. Each surviving merge fuses two groups into one, so the count is \( 6 \text{ leaves} - 2 \text{ surviving merges} = \) 4 clusters. Equivalently, clusters \( = 1 + (\text{merges above the cut}) = 1 + 3 = 4 \). Explore the cut directly. 
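The counting rule behind the two exercises above can be sketched in a few lines of plain Python. The function name `clusters_at_cut` is hypothetical, and the sketch assumes a monotone dendrogram (as single, complete and Ward linkage produce) in which a merge exactly at height \(h\) counts as surviving.

```python
def clusters_at_cut(merge_heights, h):
    """Clusters left after cutting a dendrogram at height h (hypothetical helper).

    Assumes a monotone tree: every merge at or below h survives the cut,
    and every merge above h is undone.
    """
    n_leaves = len(merge_heights) + 1                  # n points always give n - 1 merges
    surviving = sum(1 for m in merge_heights if m <= h)
    return n_leaves - surviving                        # each surviving merge removes one cluster

heights = [0.4, 0.9, 1.1, 1.2, 2.8]                    # the six-leaf tree from the exercise
print(clusters_at_cut(heights, 1.0))                   # cut between 0.9 and 1.1
print(clusters_at_cut(heights, 3.0))                   # cut above the root merge
print(clusters_at_cut(heights, 0.1))                   # cut below every merge
```

Sliding `h` across the gap between 1.2 and 2.8 leaves the count unchanged, which is the numerical version of the "long vertical gap marks a stable cut" rule.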
The instrument below builds a Ward dendrogram over a small seeded set; drag the cut height and watch clusters split and fuse, with the resulting partition coloured on the scatter beside it. INSTRUMENT M12.2 — DENDROGRAM EXPLORER WARD LINKAGE · DRAG THE CUT · EQ M12.2 CUT HEIGHT h 2.40 LINKAGE WARD SINGLE COMPLETE CLUSTERS AT THIS CUT — CUT HEIGHT — LARGEST GAP (NATURAL CUT) — The dashed white line is your cut; everything it crosses is a cluster, coloured live on the scatter at right. Slide it down through a long vertical gap and the cluster count is stable across the whole gap — that flatness is exactly why the gap marks a "natural" number of clusters. Now switch linkage: SINGLE chains the points into long straggly groups (watch one bridge merge two visually-separate blobs early), while COMPLETE and WARD insist on compact balls. Same points, three different trees — linkage is a real choice. Cost and when to use it. The naive algorithm is \(O(n^3)\) time and \(O(n^2)\) memory (it holds the full pairwise-distance matrix), so vanilla agglomerative clustering tops out around tens of thousands of points — SLINK/CLINK bring single and complete linkage down to \(O(n^2)\), and for larger \(n\) you sub-sample or switch tools. Its real strengths are the nested structure (taxonomies, gene-expression heatmaps, document hierarchies) and not having to pick \(k\) up front. The honest caveat: agglomerative merges are greedy and irrevocable — a bad early merge can never be undone, which is precisely the weakness density-based methods sidestep next. 12.3 DBSCAN — density-based clustering Centre-based methods ask "which prototype is this point near?". DBSCAN (Density-Based Spatial Clustering of Applications with Noise) asks a different and often better question: "is this point in a crowded neighbourhood, and is that crowd connected to other crowds?". A cluster, in this view, is simply a connected region of high point density, of any shape, surrounded by regions of low density. 
That single reframing buys three things k-means cannot offer: arbitrary cluster shapes, no need to specify \(k\), and a built-in notion of noise. DBSCAN has exactly two parameters and three kinds of point. The parameters are \(\varepsilon\) (the neighbourhood radius) and minPts (how many neighbours make a point "dense"). The point types follow: EQ M12.3 — CORE, BORDER, NOISE $$ N_\varepsilon(p) = \{\, q: \lVert p - q \rVert \le \varepsilon \,\}, \qquad p \text{ is a } \textbf{core point} \iff \lvert N_\varepsilon(p) \rvert \ge \texttt{minPts} $$ \(N_\varepsilon(p)\) is the set of points within radius \(\varepsilon\) of \(p\), including \(p\) itself. A core point has at least minPts neighbours in that ball — it sits in a crowd. A border point is not itself dense but falls within \(\varepsilon\) of some core point — it is on the fringe of a crowd. A noise point is neither: too few neighbours, and not close enough to any core. Clusters grow by density-reachability: start at any unvisited core point and absorb everything reachable through a chain of overlapping \(\varepsilon\)-balls of core points. Because the cluster follows the chain of dense neighbours rather than a distance-to-centre, it can bend into any shape — crescents, spirals, rings — that k-means would shatter. The noise label is the quiet superpower. Every other method in this chapter forces every point into some cluster; DBSCAN explicitly refuses, tagging genuinely isolated points as noise rather than letting one outlier drag a centroid across the plane. This is a feature, not a bug — but it is also the source of the most common exam confusion, so state it plainly: a point in a sparse, low-density region (too few neighbours within \(\varepsilon\), and not within \(\varepsilon\) of any core point) is labelled noise and left out of every cluster. 
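The three point types can be computed directly from EQ M12.3 before any cluster is grown. The sketch below is an illustration under stated assumptions: the function name `point_types`, the six toy points and the parameter values are all made up for the example, and the neighbour count includes the point itself, matching the definition of \(N_\varepsilon(p)\) above.

```python
import numpy as np

def point_types(X, eps, min_pts):
    """Label each row of X as 'core', 'border' or 'noise' (EQ M12.3, hypothetical helper)."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))   # pairwise Euclidean distances
    within = D <= eps                                     # eps-neighbourhoods, self included
    core = within.sum(axis=1) >= min_pts                  # enough neighbours -> core
    near_core = (within & core[None, :]).any(axis=1)      # within eps of at least one core point
    return np.where(core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # a tight crowd of four
              [0.3, 0.1],                                        # on the fringe of the crowd
              [3.0, 3.0]])                                       # isolated
types = point_types(X, eps=0.25, min_pts=4)
print(types)
```

The fringe point has only three points in its ball (itself and two crowd members), so it is not core, but it lies within \(\varepsilon\) of a core point and is therefore border. The isolated point satisfies neither condition and is noise.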
True or false: DBSCAN labels points that lie in low-density regions — too few neighbours within \( \varepsilon \), and not within \( \varepsilon \) of any core point — as noise, leaving them out of every cluster. (Enter true or false.) This is exactly the definition of a noise point in EQ M12.3: neither core (fewer than minPts neighbours) nor border (not within \( \varepsilon \) of a core point). Unlike k-means, hierarchical clustering, or GMM — all of which assign every point to some cluster — DBSCAN has a dedicated noise label for low-density points. The statement is true. Choosing the two knobs is the whole craft of DBSCAN. minPts is usually set to roughly \(2 \times \text{dimensionality}\) (so \(\approx 4\) for 2-D data; larger for noisier or higher-dimensional data), and \(\varepsilon\) is read off a k-distance plot: sort every point's distance to its \(k\)-th nearest neighbour and look for the "elbow" where the curve turns sharply upward — that knee is the density boundary between cluster and noise. The build-it-yourself cell below implements the whole algorithm in numpy and runs it on the two-moons set k-means famously fails on. PYTHON · RUNNABLE IN-BROWSER # DBSCAN from scratch (EQ M12.3) on two interleaving moons. # k-means slices the crescents; density-reachability traces them. 
import numpy as np
rng = np.random.default_rng(7)

def two_moons(n=150, noise=0.07):
    t = np.linspace(0, np.pi, n)
    a = np.c_[np.cos(t), np.sin(t)]                       # upper crescent
    b = np.c_[1 - np.cos(t), 1 - np.sin(t) - 0.5]         # lower, offset crescent
    X = np.vstack([a, b]) + rng.normal(0, noise, (2*n, 2))
    return X

def dbscan(X, eps, min_pts):
    n = len(X); labels = np.full(n, -1); cid = -1         # -1 = noise/unset
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))    # pairwise distances
    neigh = [np.where(D[i] <= eps)[0] for i in range(n)]  # eps-neighbourhoods (self included)
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue                                      # already claimed, or not core -> leave as noise
        cid += 1; labels[i] = cid; seeds = list(neigh[i]) # start a new cluster
        k = 0
        while k < len(seeds):
            j = seeds[k]; k += 1
            if labels[j] == -1:
                labels[j] = cid                           # claim an unassigned core or border point
            if len(neigh[j]) >= min_pts:                  # j is core -> expand
                seeds += [q for q in neigh[j] if q not in seeds]
    return labels

X = two_moons()
labels = dbscan(X, eps=0.22, min_pts=5)
n_clusters = labels.max() + 1
n_noise = int((labels == -1).sum())
print(f"clusters found: {n_clusters} (truth = 2 crescents)")
print(f"noise points: {n_noise} (k-means would force ALL of these into a cluster)")
plot_scatter(X[:, 0], X[:, 1], labels)

RUN ▶ edits are live — drop eps to 0.1 and watch the moons shatter into noise

The honest limitations — and the modern fix. DBSCAN uses a single global \(\varepsilon\), so it struggles when clusters have very different densities (the right \(\varepsilon\) for the dense core is wrong for the sparse halo — you watched this in Instrument M12.1's CORE + HALO set), and like all Euclidean-distance methods it degrades in high dimensions as distances concentrate. Its widely used successor, HDBSCAN (Campello et al., 2015), removes the \(\varepsilon\) knob: it builds a hierarchy across all density levels and extracts the most stable clusters, handling variable density gracefully, and it is a common default density clusterer in practice. Worth knowing the lineage: DBSCAN is the idea, HDBSCAN is what you usually run. 12.4 Gaussian Mixture Models & EM k-means makes two hard commitments per point: a hard assignment (you belong to cluster 3, full stop) and a single isotropic radius (every cluster is a circle).
The Gaussian mixture model softens both. It models the data as having been generated by \(k\) Gaussians blended together: to draw a point, first pick a component \(j\) with probability \(\pi_j\), then sample from that component's Gaussian \(\mathcal{N}(\mu_j, \Sigma_j)\). The density of any point is the weighted sum over all components: EQ M12.4 — THE MIXTURE DENSITY $$ p(x) \;=\; \sum_{j=1}^{k} \pi_j\, \mathcal{N}\!\big(x \mid \mu_j,\, \Sigma_j\big), \qquad \pi_j \ge 0,\quad \sum_{j=1}^{k}\pi_j = 1 $$ \(\pi_j\) is the mixing weight (the prior probability of component \(j\)); \(\mu_j\) its mean; \(\Sigma_j\) its covariance matrix, which is what frees the model from k-means' circles. A diagonal \(\Sigma_j\) gives axis-aligned ellipses; a full \(\Sigma_j\) gives ellipses tilted at any angle and of any aspect ratio — so a GMM can fit the stretched, rotated clusters k-means mangles. The constraint \(\sum_j \pi_j = 1\) makes \(p(x)\) a proper probability density. In fact k-means is the limiting case of a GMM with equal weights, shared spherical covariance \(\Sigma_j = \sigma^2 I\), and \(\sigma \to 0\) forcing assignments hard. Because each component is a full Gaussian, a GMM does not say "point \(i\) is in cluster \(j\)". It computes a responsibility \(\gamma_{ij}\) — the posterior probability that component \(j\) generated point \(i\) — a soft, fractional membership that can be 70% one cluster and 30% another. That softness is the whole point: it honestly represents the ambiguity of points sitting between clusters, which hard k-means simply discards. 
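A small numeric sketch makes the soft membership tangible. Everything specific here is an assumption chosen for illustration: the helper `gauss_pdf`, the two components, their weights and the query point. It evaluates the mixture density of EQ M12.4 at one point and then applies Bayes' rule to get that point's responsibilities, the same ratio the E-step uses.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at a single point x (hypothetical helper, any dimension)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)             # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Assumed two-component mixture: one round blob, one tilted ellipse.
pis    = np.array([0.5, 0.5])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.array([[2.0, 1.5], [1.5, 2.0]])]   # second covariance is full (tilted)

x = np.array([1.5, 0.0])                                    # a point midway between the means
weighted = np.array([p * gauss_pdf(x, m, S) for p, m, S in zip(pis, mus, Sigmas)])
p_x = weighted.sum()                                        # mixture density p(x), EQ M12.4
gamma = weighted / p_x                                      # responsibilities: posterior over components
print("p(x) =", p_x)
print("responsibilities =", gamma)
```

The point is equidistant from both means in raw Euclidean terms, yet the responsibilities are not 50/50: the covariance shapes decide how plausible each component finds it, which is exactly the information k-means throws away.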
EQ M12.5 — EM: THE E-STEP AND M-STEP $$ \textbf{E:}\;\; \gamma_{ij} = \frac{\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k}\pi_l\,\mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \qquad\qquad \textbf{M:}\;\; \pi_j = \frac{N_j}{n},\;\; \mu_j = \frac{1}{N_j}\sum_i \gamma_{ij} x_i,\;\; \Sigma_j = \frac{1}{N_j}\sum_i \gamma_{ij}(x_i - \mu_j)(x_i - \mu_j)^{\top} $$ where \(N_j = \sum_i \gamma_{ij}\) is the effective number of points in component \(j\). The Expectation–Maximization algorithm alternates exactly as Lloyd's loop does, but soft: the E-step fixes the parameters and computes every responsibility (a soft "assign"); the M-step fixes the responsibilities and re-estimates each \(\pi_j, \mu_j, \Sigma_j\) as responsibility-weighted statistics (a soft "update"). Each full iteration is guaranteed to increase the data log-likelihood \(\sum_i \log p(x_i)\) — or leave it unchanged — and it converges to a local maximum that depends on the initialization, so multiple restarts (often k-means++ seeding) are standard. Replace the soft \(\gamma_{ij}\) with a hard 0/1 assignment and EM becomes k-means. The parameter count is where the freedom shows its price. Each component carries a mean (one number per dimension), a covariance, and a weight. For a 2-D, full-covariance model the means alone are \(2\) numbers per component — \(2k\) in total — and that is the figure the exercise below pins down, because it is the cheapest way to feel how parameters scale with \(k\). A Gaussian mixture in 2-D has \( k = 5 \) full-covariance components. Counting only the mean parameters (each component's mean is a point in 2-D), how many mean parameters does the model have in total? Each mean \( \mu_j \) lives in \( \mathbb{R}^2 \), so it contributes 2 numbers; with \( k \) components the means contribute \( 2k \) parameters in total. For \( k = 5 \): \( 2 \times 5 = \) 10 mean parameters. 
(For the full picture each full covariance adds \( d(d+1)/2 = 3 \) more per component, plus \( k-1 \) free weights — but the means alone scale as \( 2k \).) Watch EM converge one step at a time. The stepper seeds two Gaussians badly, then alternates E and M: the E-step recolours every point by its responsibility (a blend, never a hard pick), and the M-step slides and reshapes the two ellipses to fit. The log-likelihood climbs monotonically in the readout — that climb is the convergence guarantee of EQ M12.5. INSTRUMENT M12.3 — GMM / EM STEPPER 2 COMPONENTS · RESPONSIBILITIES + FITTED GAUSSIANS · EQ M12.5 CONTROL E / M STEP AUTO ▶ RE-INIT ↻ NEXT HALF-STEP E-STEP ITERATIONS 0 LOG-LIKELIHOOD — The two ellipses are the fitted Gaussians at \( \pm 2\sigma \); point colour blends mint and blue by responsibility, so a purple point is one EM is genuinely unsure about. Press E / M STEP alternately: the E-step only recolours (parameters frozen), the M-step only moves and reshapes the ellipses (colours frozen). The log-likelihood never falls — that monotone climb is EQ M12.5's guarantee made visible. Now RE-INIT a few times: most starts find the two true clusters, but a bad seed can collapse a component onto a few points (likelihood spikes toward \( +\infty \)) — the singularity pathology that real implementations regularize \( \Sigma_j \) to avoid. And the 1-D version in numpy, stripped to its skeleton: EM for a two-component mixture on a line. Watch the two means separate to the true generating centres and the weights settle near their true mixing proportions as the log-likelihood plateaus. PYTHON · RUNNABLE IN-BROWSER # EM for a 1-D two-component GMM (EQ M12.5). Recover means & weights. 
import numpy as np
rng = np.random.default_rng(0)

# true mixture: 40% N(0,1), 60% N(5, 1.5^2)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)])
n = len(x)

def normal(x, mu, var):                                   # 1-D Gaussian density
    return np.exp(-(x - mu)**2 / (2*var)) / np.sqrt(2*np.pi*var)

mu = np.array([-1.0, 1.0])                                # deliberately bad init
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for it in range(40):
    # E-step: responsibilities gamma[i, j]
    comp = np.stack([pi[j]*normal(x, mu[j], var[j]) for j in range(2)], axis=1)
    ll = np.log(comp.sum(1)).sum()                        # data log-likelihood (climbs each iter)
    g = comp / comp.sum(1, keepdims=True)
    # M-step: weighted means, variances, weights
    Nj = g.sum(0)
    pi = Nj / n
    mu = (g * x[:, None]).sum(0) / Nj
    var = (g * (x[:, None] - mu)**2).sum(0) / Nj
    if it in (0, 4, 39):
        print(f"iter {it:2d}: loglik={ll:8.1f} means={np.round(mu,2)} weights={np.round(pi,2)}")

print("\nconverged means:", np.round(np.sort(mu), 2), " (true: 0.0, 5.0)")
print("converged weights:", np.round(pi[np.argsort(mu)], 2), " (true: 0.4, 0.6)")

RUN ▶ edits are live — set both init means to 2.0 and watch a bad start stall

12.5 Choosing k & validating clusters Every method here either needs \(k\) (k-means, GMM, a dendrogram cut) or trades it for density knobs (DBSCAN's \(\varepsilon\), minPts). Without labels there is no accuracy to optimize, so validation splits into two honest questions: internal validation (is the clustering geometrically good on its own terms?) and the far better external answer (does a downstream task care?). The internal tools each have a failure mode, and knowing the failure mode is the skill. The elbow method plots inertia \(J\) against \(k\) and looks for the bend where adding centres stops paying. It is the weakest tool here, because — as Chapter 05 proved — \(J\) falls monotonically with \(k\) all the way to the absurd \(J = 0\) at one centre per point, so there is often no clean elbow at all.
The silhouette score is sharper: for each point it compares the mean distance to the other members of its own cluster against the mean distance to the members of the nearest other cluster. EQ M12.6 — SILHOUETTE OF A POINT $$ s(i) \;=\; \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)} \;\in\; [-1,\, 1], \qquad a(i) = \text{mean dist to other points in own cluster}, \quad b(i) = \text{mean dist to points of nearest other cluster} $$ \(a(i)\) measures cohesion (how tight your own cluster is), \(b(i)\) measures separation (how far the nearest rival cluster is). \(s(i) \to 1\) means the point is deep inside a well-separated cluster; \(s(i) \approx 0\) means it sits on a boundary; \(s(i) < 0\) means it is closer on average to a neighbouring cluster than to its own — a likely misassignment. Average \(s(i)\) over all points and you get a single score in \([-1,1]\); the \(k\) that maximizes it is a defensible choice. The catch: silhouette measures compactness and separation through average pairwise distances, so it favours convex, k-means-shaped clusters and will under-rate a correct DBSCAN clustering of crescents, where points at opposite ends of the same crescent are far apart. An internal metric can only reward the shape it was designed around. For probabilistic models there is a principled alternative that penalizes complexity directly. A GMM can always raise its likelihood by adding components, so you cannot pick \(k\) by likelihood alone — but the Bayesian Information Criterion subtracts a penalty for every parameter, trading fit against parsimony: EQ M12.7 — BIC FOR MODEL SELECTION $$ \text{BIC} \;=\; m\,\ln(n) \;-\; 2\,\ln \hat{L}, \qquad m = \#\text{free parameters},\quad n = \#\text{points},\quad \hat{L} = \text{maximized likelihood} $$ Lower BIC is better. The \(-2\ln\hat L\) term rewards fit; the \(m\ln(n)\) term punishes every extra parameter, so a component that barely improves the likelihood is rejected for the parameters it costs.
For a \(d\)-dimensional, full-covariance GMM each component carries \(d\) mean parameters \(+\;d(d+1)/2\) covariance parameters \(+\;1\) weight, with one weight constrained away — so \(m = k\,[\,d + d(d{+}1)/2\,] + (k-1)\). Sweep \(k\), keep the minimum-BIC model. BIC is the most principled \(k\)-selector in this chapter — but it is only valid when the data really is a mixture of Gaussians; on crescents it will confidently choose the wrong \(k\) because the model is wrong, not the criterion. The genuinely decisive answer is external. When the clusters feed something downstream — segments feeding a campaign, codes feeding a classifier, groups a domain expert will inspect — let that task's metric choose \(k\) and the algorithm. This converts an unanswerable unsupervised question back into a measurable one, and it is the most honest move in the whole chapter. Where you have at least some ground-truth labels, the Adjusted Rand Index compares a clustering to the truth while correcting for chance agreement (0 = random, 1 = perfect), and is the standard external score. FINE PRINT Four traps that quietly invalidate a clustering. (1) Scale. Every Euclidean method here inherits Chapter 02's pathology — an unscaled feature in different units silently owns the result; standardize first unless the units are genuinely shared. (2) Metric chases shape. Silhouette and inertia reward convex blobs, so they will rank a wrong k-means clustering above a correct DBSCAN one — never validate a density method with a centre-based score. (3) The curse of dimensionality. In high dimensions all pairwise distances concentrate toward equality, so distance-based clustering loses its grip; cluster in a learned low-dimensional embedding (next chapter) instead. (4) Clusters always appear. Every algorithm here returns clusters even on pure noise — before trusting any partition, ask whether structure exists at all (e.g. 
the Hopkins statistic), because a confident clustering of structureless data is the most seductive artifact in the field. NEXT Clustering compresses \(n\) points into a handful of groups; the next chapter compresses a whole matrix into a handful of factors. Chapter 13 — Matrix Factorization — turns the user-item rating tables behind every recommender into low-rank products of latent factors, connects SVD, NMF and ALS, and shows how the same algebra that powers collaborative filtering reappears as the embedding machinery throughout modern AI. 12.R References Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. KDD-96 — the original DBSCAN: core/border/noise points and density-reachability (§12.3, EQ M12.3). Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) — the Expectation–Maximization algorithm behind GMM fitting (§12.4, EQ M12.5). Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 — the silhouette score for choosing and validating k (§12.5, EQ M12.6). Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301) — Ward's minimum-variance linkage for agglomerative clustering (§12.2, EQ M12.2). Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2) — the Bayesian Information Criterion for model (and component-count) selection (§12.5, EQ M12.7). Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM TKDD 10(1) — HDBSCAN, the variable-density successor that removes DBSCAN's ε knob (§12.3 note). Hubert, L. & Arabie, P. (1985). Comparing Partitions. 
Journal of Classification 2(1) — the Adjusted Rand Index for chance-corrected external validation (§12.5). ## VOL I · 13 · Matrix Factorization & SVD (https://ai-encyclopedia.com/ml/13-matrix-factorization.html) MACHINE LEARNING · CHAPTER 13 / 15 Matrix Factorization & SVD in Practice A ratings table, a term-document count, a pixel grid, an adjacency matrix: most large matrices that show up in practice are not full-rank. They are low-rank. A few hidden factors explain almost everything, and writing the matrix as a product of two thin ones recovers those factors. Factorization is the shared engine under recommender systems, word and item embeddings, image compression, and the PCA from Chapter 05. This chapter builds it from the singular value decomposition outward, then covers the three variants you will deploy. LEVEL CORE READING TIME ≈ 24 MIN BUILDS ON ML 02 · 05 · STATS 06 INSTRUMENTS RATINGS · SCREE · NMF IN THIS CHAPTER 13.1 Low-rank structure 13.2 SVD & truncation 13.3 Recommenders 13.4 Non-negative MF 13.5 PCA as SVD 13.R References 13.1 Low-rank structure in real data An \(m \times n\) matrix has up to \(mn\) free numbers, but its rank — the number of linearly independent rows (equivalently, columns) — is often far smaller than \(\min(m,n)\). A matrix of rank \(r\) can be written exactly as a product of an \(m \times r\) and an \(r \times n\) matrix, so it really only carries \(r(m+n)\) degrees of freedom. When \(r \ll \min(m,n)\), that is a colossal saving. Why should real matrices be low-rank? Because they are generated by a small number of latent causes. A ratings table is driven by a handful of taste dimensions, not by a million independent whims.
A term–document matrix is driven by a few dozen topics. A grayscale photo's columns are nearly redundant because neighboring columns look almost identical. The rank measures how many independent patterns are actually present; everything else is a combination of them. EQ M13.1 — RANK-r FACTORIZATION $$ \underbrace{A}_{m\times n} \;=\; \underbrace{U}_{m\times r}\, \underbrace{V^{\top}}_{r\times n}, \qquad \operatorname{rank}(A) = r \;\le\; \min(m,n) $$ Row \(i\) of \(A\) is \(U_{i,:}V^{\top}\): a weighted mix of the \(r\) rows of \(V^{\top}\). Those \(r\) rows are the shared patterns; the row of \(U\) is the recipe that reconstructs entity \(i\) from them. Parameter count drops from \(mn\) to \(r(m+n)\). For a \(10{,}000 \times 10{,}000\) matrix of rank \(20\) that is \(10^8\) numbers collapsing to \(4\times10^5\) — a 250× compression with zero error if the matrix truly has that rank. Real data is rarely exactly low-rank — there is noise — but it is very often approximately low-rank: a sharp drop in the singular value spectrum (§13.2) followed by a long tail of small values. The art is choosing where to cut the tail: keep enough rank to capture the signal, drop enough to discard the noise. The rest of the chapter is variations on that single decision. You factor a \(10 \times 10\) matrix as \(A = UV^{\top}\) with rank \(r = 2\) (so \(U\) is \(10\times 2\) and \(V\) is \(10\times 2\)). How many free parameters does this factorization use in total? A rank-\(r\) factorization of an \(m\times n\) matrix uses \(r(m+n)\) parameters. Here \(r=2,\ m=n=10\): \(2\,(10+10) = 2 \times 20 = \) 40. The dense matrix has \(100\) entries, so even a rank-2 factorization more than halves the storage — and the gap widens fast as the matrix grows. INSTRUMENT M13.1 — LOW-RANK RATINGS FACTORIZE · PREDICT MISSING · EQ M13.1 LATENT FACTORS k 2 TRAINING STEPS 300 OBSERVED CELLS — TRAIN RMSE — PARAMS k(m+n) — Left grid: the observed ratings (blank = unrated). 
Right grid: the model's reconstruction \(UV^{\top}\) — the blank cells are now predictions, the whole point of a recommender. Drag training steps up to watch gradient descent on the observed-entry objective (EQ M13.5, §13.3) fit the observed cells and, in doing so, fill the gaps. Bump \(k\) past the true structure and the train RMSE keeps dropping while the predictions get noisier — the overfitting you will fight in §13.3. 13.2 SVD recap & truncation The singular value decomposition is the factorization that always exists, for every real matrix, and is in a precise sense the best one. It writes \(A\) as a rotation, a non-negative scaling, and another rotation: EQ M13.2 — THE SVD $$ A \;=\; U \Sigma V^{\top} \;=\; \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0 $$ \(U\) (\(m\times m\)) and \(V\) (\(n\times n\)) are orthogonal (\(U^{\top}U = I\), \(V^{\top}V = I\)); their columns \(u_i, v_i\) are the left/right singular vectors. \(\Sigma\) is diagonal with the singular values \(\sigma_i\), which are always non-negative — they are the lengths the unit sphere is stretched to, and a length cannot be negative. The right form sums \(r\) rank-1 layers, each a singular value times an outer product, ordered most-important-first. The squared singular values \(\sigma_i^2\) are the energy (variance) each layer carries. Now truncate. Keep only the top \(k\) layers and you get the rank-\(k\) matrix \(A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^{\top}\). The Eckart–Young theorem — one of the load-bearing results of applied linear algebra — says this is not just a good rank-\(k\) approximation, it is the best possible one: EQ M13.3 — ECKART–YOUNG (BEST RANK-k FIT) $$ \min_{\operatorname{rank}(B)\le k}\; \lVert A - B\rVert_F \;=\; \lVert A - A_k\rVert_F \;=\; \sqrt{\sum_{i=k+1}^{r}\sigma_i^{2}} $$ Among all matrices of rank \(\le k\), the truncated SVD \(A_k\) minimizes the Frobenius error (it also minimizes the spectral-norm error, where the residual is \(\sigma_{k+1}\)), and the leftover squared Frobenius error is exactly the energy of the singular values you threw away.
This is why "keep the top \(k\) singular values" is the right thing to do, not a heuristic. The same theorem justifies LoRA (Vol II · EQ 6.1): if a weight update is low-rank, its best compression is its truncated SVD. How big should \(k\) be? Plot the singular values (a scree plot) or, better, the cumulative energy retained, \(\sum_{i\le k}\sigma_i^2 \big/ \sum_i \sigma_i^2\). A common rule is to keep enough components to retain 90–99% of the energy — the elbow where the curve flattens marks the boundary between signal and noise. True or false: the singular values \(\sigma_i\) produced by the SVD of a real matrix are always non-negative. (Answer true or false.) The singular values are the square roots of the eigenvalues of \(A^{\top}A\), a positive-semidefinite matrix, so each \(\sigma_i = \sqrt{\lambda_i} \ge 0\). Geometrically they are the factors by which the unit sphere is stretched along the principal axes, and a stretch length is never negative. The answer is true. (Their signs are absorbed into the singular vectors instead.) A matrix has singular values \(\sigma = (6,\, 3,\, 2,\, 1)\). What percent of the total energy (sum of squares) is retained by the best rank-2 approximation? Enter the percent, e.g. 90 for 90%. Energy is \(\sigma_i^2\). Total \(= 36 + 9 + 4 + 1 = 50\). The top two layers carry \(36 + 9 = 45\). Retained fraction \(= 45/50 = 0.90\), so \(\times 100 = \) 90 %. Rank-2 already captures nine-tenths of this matrix; the last two layers — energy \(4 + 1 = 5\) — are nearly noise. 
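The energy arithmetic and the Eckart–Young identity can both be checked numerically. The sketch below is illustrative only: the \(6 \times 5\) test matrix is an assumption, built from random orthonormal factors so that its singular values are exactly \((6, 3, 2, 1)\) as in the exercise.

```python
import numpy as np

sigma = np.array([6.0, 3.0, 2.0, 1.0])
energy = sigma ** 2
retained = energy.cumsum() / energy.sum()        # cumulative energy kept by the top-k layers
print("energy retained for k = 1..4:", retained)

# Assumed test matrix with exactly these singular values (random orthonormal factors).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(6, 4)))     # 6x4 with orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(5, 4)))     # 5x4 with orthonormal columns
A = (U * sigma) @ V.T

k = 2
Uk, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (Uk[:, :k] * s[:k]) @ Vt[:k]               # truncated SVD: keep the top-k layers
err = np.linalg.norm(A - A_k, "fro")             # left side of EQ M13.3
tail = np.sqrt((s[k:] ** 2).sum())               # right side: energy of the dropped values
print("Frobenius error:", err, " sqrt of dropped energy:", tail)
```

The two printed error figures agree up to floating-point rounding, and both equal \(\sqrt{4 + 1} = \sqrt{5}\), the square root of the energy discarded by the rank-2 cut.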
PYTHON · RUNNABLE IN-BROWSER

```python
# Truncated-SVD recommender: held-out RMSE as a function of rank (EQ M13.2-3)
import numpy as np
rng = np.random.default_rng(1)

m, n, true_r = 40, 25, 2                          # data is REALLY rank 2
A = rng.normal(0, 1, (m, true_r)) @ rng.normal(0, 1, (true_r, n))
A = 1 + 4 * (A - A.min()) / (A.max() - A.min())   # squash into 1..5 stars
A += rng.normal(0, 0.15, A.shape)                 # + observation noise

test = rng.random(A.shape) < 0.2                  # hide 20% of cells as test
mu = A[~test].mean()                              # global mean fills the holes
F = np.where(test, mu, A)                         # mean-imputed train matrix

# The SVD does not depend on the rank, so compute it once outside the loop.
Uu, s, Vt = np.linalg.svd(F - mu, full_matrices=False)

ranks, rmses = [1, 2, 3, 5, 10], []
for r in ranks:
    Ahat = mu + (Uu[:, :r] * s[:r]) @ Vt[:r]      # best rank-r fit (EQ M13.3)
    rmse = np.sqrt(np.mean((A[test] - Ahat[test]) ** 2))
    rmses.append(rmse)
    print(f"rank {r:2d}: held-out RMSE {rmse:.3f}")

print("\nRMSE bottoms out near the TRUE rank (2); higher rank just refits noise.")
plot_xy(ranks, rmses)   # plot_xy is the site's in-browser plotting helper, not a NumPy function
```

INSTRUMENT M13.2 — SCREE & ENERGY (PICK k · VARIANCE RETAINED · EQ M13.3). Interactive on the site. Controls: keep top-k components, spectrum decay (fast/slow). Readouts: energy retained, relative error ‖A−A_k‖_F / ‖A‖_F, compression.

Bars are squared singular values (energy); the mint line is cumulative energy retained as you keep more components. Slide \(k\) to the elbow, where the bars go flat, and you keep nearly all the energy for a fraction of the rank. Switch the decay from fast (sharp spectrum, a few components suffice) to slow (flat spectrum, no good low-rank fit exists) to feel when factorization helps and when it does not.

13.3 Recommender systems — latent factors

The canonical application, made famous by the 2006–2009 Netflix Prize, is collaborative filtering. You have a sparse \(m\times n\) ratings matrix \(R\): users by items, with the vast majority of entries missing (a typical user has rated only a tiny fraction of the catalog, often around 1% or less). The task is to fill in the blanks.
Matrix factorization's answer: assign every user a latent vector \(p_u \in \mathbb{R}^{k}\) and every item a latent vector \(q_i \in \mathbb{R}^{k}\), and predict the rating as their dot product. EQ M13.4 — LATENT-FACTOR MODEL (WITH BIASES) $$ \hat r_{ui} \;=\; \mu + b_u + b_i + p_u^{\top} q_i, \qquad p_u, q_i \in \mathbb{R}^{k} $$ \(\mu\) is the global mean, \(b_u\) and \(b_i\) the user/item biases (a generous rater, a beloved film), and \(p_u^{\top}q_i\) is the interaction: how much user \(u\)'s tastes align with item \(i\)'s attributes along \(k\) hidden axes the model discovers for itself (one axis might turn out to be "arthouse ↔ blockbuster", another "comedy ↔ drama"). The latent vectors are exactly the rows of \(U\) and \(V\) in EQ M13.1 — collaborative filtering is matrix factorization, only on the observed entries. You cannot run a plain SVD here, because SVD needs a complete matrix and \(R\) is mostly holes — imputing the holes with a constant first then running SVD biases the result toward that constant. The fix is to fit only the observed entries, minimizing regularized squared error by gradient descent: EQ M13.5 — OBSERVED-ENTRY OBJECTIVE $$ \min_{P,Q,b}\; \sum_{(u,i)\in\mathcal{K}} \Big(r_{ui} - \hat r_{ui}\Big)^2 \;+\; \lambda\Big(\lVert p_u\rVert^2 + \lVert q_i\rVert^2 + b_u^2 + b_i^2\Big) $$ \(\mathcal{K}\) is the set of known ratings — the sum skips every missing cell, which is the whole trick. \(\lambda\) is the regularization strength that stops the model from memorizing the sparse observations; without it, a high \(k\) overfits instantly. The gradient for \(p_u\) is \(-2\,e_{ui}\,q_i + 2\lambda p_u\) with error \(e_{ui} = r_{ui} - \hat r_{ui}\) — the update each known rating sends to its user and item vectors. This is "Funk SVD", Simon Funk's Netflix-Prize method, and it is still the textbook recommender. The cold-start caveat, honestly. 
A new user or item with no ratings has no factors to estimate — the model defaults to biases alone, which is to say it guesses the average. Pure collaborative filtering is also blind to content (genre, text, image) and prone to popularity bias. Production systems since the mid-2010s blend factorization with content features and, increasingly, deep retrieval models — but a two-tower neural recommender is still computing a dot product of learned user and item embeddings. The latent-factor idea did not go away; it got an encoder in front of it. With \(\mu = 3.5\), user bias \(b_u = -0.2\), item bias \(b_i = 0.1\), and latent vectors \(p_u = (1,\, 0.5)\), \(q_i = (0.2,\, 0.6)\), what rating does EQ M13.4 predict for \(\hat r_{ui}\)? Dot product \(p_u^{\top}q_i = (1)(0.2) + (0.5)(0.6) = 0.2 + 0.3 = 0.5\). Then \(\hat r_{ui} = \mu + b_u + b_i + p_u^{\top}q_i = 3.5 - 0.2 + 0.1 + 0.5\): step by step \(3.5 - 0.2 = 3.3\), \(+0.1 = 3.4\), \(+0.5 = \) 3.9. The biases nudge a baseline 3.5 down for a harsh rater and up for a well-liked film; the interaction term adds the personalized lift. 
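The worked prediction above, and the per-rating gradient step described for EQ M13.5, can be sketched in a few lines. This is an illustrative sketch only: the function names `predict_rating` and `sgd_step`, the observed rating of 5, and the step size are assumptions, not taken from the chapter.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i):
    """EQ M13.4: global mean + user bias + item bias + latent interaction."""
    return mu + b_u + b_i + float(np.dot(p_u, q_i))

def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.01, lam=0.1):
    """One gradient step on a single known rating r (EQ M13.5).

    The gradient of the regularized squared error w.r.t. p_u is
    -2 e q_i + 2 lam p_u, so the descent step adds 2 lr (e q_i - lam p_u).
    """
    p_u, q_i = np.asarray(p_u, float), np.asarray(q_i, float)
    e = r - predict_rating(mu, b_u, b_i, p_u, q_i)
    new_p = p_u + 2 * lr * (e * q_i - lam * p_u)
    new_q = q_i + 2 * lr * (e * p_u - lam * q_i)   # uses the OLD p_u
    new_bu = b_u + 2 * lr * (e - lam * b_u)
    new_bi = b_i + 2 * lr * (e - lam * b_i)
    return new_bu, new_bi, new_p, new_q

mu, b_u, b_i = 3.5, -0.2, 0.1
p_u, q_i = [1.0, 0.5], [0.2, 0.6]

before = predict_rating(mu, b_u, b_i, p_u, q_i)
print("prediction:", round(before, 2))

# Suppose (hypothetically) the user actually rated this item 5 stars.
b_u2, b_i2, p_u2, q_i2 = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = predict_rating(mu, b_u2, b_i2, p_u2, q_i2)
print("prediction after one step toward r = 5:", round(after, 3))

# Cold start: a user with no history has zero bias and a zero latent vector,
# so the prediction falls back to mu + b_i.
print("cold-start prediction:", predict_rating(mu, 0.0, b_i, [0.0, 0.0], q_i))
```

A single step moves the prediction toward the observed rating while the \(\lambda\) terms pull the parameters slightly back toward zero, which is the tension the regularizer is there to create.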
PYTHON · RUNNABLE IN-BROWSER

```python
# Funk SVD: matrix factorization by gradient descent on OBSERVED entries (EQ M13.5)
# Simplified: no mu / b_u / b_i bias terms, and the factor of 2 in the gradient is folded into lr.
import numpy as np
rng = np.random.default_rng(0)

R = np.array([                            # 6 users x 5 movies; NaN = unrated
    [5, 3, np.nan, 1, np.nan],
    [4, np.nan, np.nan, 1, 2],
    [1, 1, np.nan, 5, np.nan],
    [1, np.nan, np.nan, 4, 5],
    [np.nan, 1, 5, 4, np.nan],
    [2, 1, 4, np.nan, 5]], float)
mask = ~np.isnan(R)                       # True where a rating is known

k, lr, lam = 2, 0.02, 0.1
U = rng.normal(0, .1, (R.shape[0], k))    # user factors (rows are the p_u of EQ M13.4)
V = rng.normal(0, .1, (R.shape[1], k))    # item factors (rows are the q_i of EQ M13.4)

for step in range(4000):
    E = np.where(mask, R - U @ V.T, 0.0)  # error on observed cells only
    U += lr * (E @ V - lam * U)           # gradient step (EQ M13.5)
    V += lr * (E.T @ U - lam * V)

Rhat = U @ V.T                            # full reconstruction, holes included
print(f"train RMSE on observed entries: {np.sqrt(np.mean((R - Rhat)[mask] ** 2)):.3f}")
print("reconstructed matrix (blanks are now PREDICTIONS):")
print(np.round(Rhat, 1))
print("\nuser 0's two unrated movies (cols 2, 4) predicted:", np.round(Rhat[0, [2, 4]], 2))
```

13.4 Non-negative Matrix Factorization

SVD's singular vectors mix positive and negative entries freely, so its factors add and subtract. That makes them mathematically optimal but often uninterpretable: a "component" of a face might be a ghostly blend that only makes sense once you cancel it against another. Non-negative Matrix Factorization (NMF) imposes one extra constraint, that every entry of both factors must be \(\ge 0\), and that constraint changes everything about what the parts look like.

EQ M13.6 — NMF
$$ A \approx W H, \qquad A \in \mathbb{R}_{\ge 0}^{m\times n},\; W \in \mathbb{R}_{\ge 0}^{m\times k},\; H \in \mathbb{R}_{\ge 0}^{k\times n} $$
With no subtraction allowed, the only way to build the data is to add up parts. Lee & Seung's 1999 result: applied to a set of face images, NMF discovers localized features (a nose, an eyebrow, a mouth) because a face is literally a sum of its parts, never a part minus another.
On text, the \(k\) columns of \(W\) become interpretable topics (clusters of co-occurring words); NMF is one of the classic topic models. The price: no closed form, and the factorization is not unique.

NMF is fit by multiplicative updates, a pair of element-wise rules that preserve non-negativity automatically (no projection step, no learning rate to tune) and never increase the reconstruction error:

EQ M13.7 — MULTIPLICATIVE UPDATE RULES
$$ H \leftarrow H \odot \frac{W^{\top}A}{W^{\top}WH}, \qquad W \leftarrow W \odot \frac{A H^{\top}}{W H H^{\top}} \qquad (\odot,\, / \text{ element-wise}) $$
Each entry is multiplied by a ratio of "what the data wants" over "what the current model gives". Because every term is non-negative, a non-negative factor stays non-negative: the constraint is enforced by the algebra itself, not bolted on. A tiny \(\varepsilon\) in the denominator avoids division by zero. The updates settle at best into a local (not global) minimum, so initialization matters; NNDSVD initialization is a common choice.

SVD vs NMF, the honest trade. SVD gives the provably best low-rank fit (Eckart–Young) with orthogonal, ordered factors that are unique up to sign when the singular values are distinct, which is ideal when you want compression or principal directions. NMF gives a usually-worse fit with non-orthogonal, unordered, non-unique factors, but ones a human can read as additive parts. Choose SVD/PCA when you want the optimal subspace; choose NMF when you want interpretable, parts-based components and the data is naturally non-negative (counts, intensities, spectra).

INSTRUMENT M13.3 — NMF PARTS DECOMPOSITION (ADD-ONLY PARTS · EQ M13.6-7). Interactive on the site. Controls: parts k, update iterations. Readouts: reconstruction error, all entries ≥ 0, parts discovered.

The data (left) is a set of glyphs built from a few overlapping strokes. NMF with \(k\) parts finds those strokes (middle, the columns of \(W\)) and reconstructs each glyph as a non-negative sum of them (right).
Set \(k\) to the true number of strokes and the parts snap into clean, localized pieces; set it too low and parts get smeared together. Drag iterations from 0 to watch the multiplicative updates (EQ M13.7) drive the error down while every entry stays non-negative. 13.5 PCA as SVD; the connections You met Principal Component Analysis as variance-hunting in Chapter 05. Here is the secret it shares with everything above: PCA is just the SVD of the centered data matrix. Center each column to mean zero (call the result \(X_c\)); then the principal components are the right singular vectors, and the variance along each is the squared singular value, scaled by the sample count. EQ M13.8 — PCA = SVD OF CENTERED DATA $$ X_c = U\Sigma V^{\top} \;\Longrightarrow\; \underbrace{\tfrac{1}{m-1}X_c^{\top}X_c}_{\text{covariance } C} = V\,\frac{\Sigma^2}{m-1}\,V^{\top}, \qquad \lambda_i = \frac{\sigma_i^2}{m-1} $$ The eigenvectors of the covariance \(C\) are the right singular vectors \(V\); the eigenvalues are \(\lambda_i = \sigma_i^2/(m-1)\) — the variance captured by component \(i\). So "directions of maximum variance" (PCA) and "best low-rank approximation" (truncated SVD) are the same computation. The energy-retained curve from §13.2 is identical to PCA's explained-variance ratio. In practice you run the SVD directly on \(X_c\): it is more numerically stable than forming \(C\) and eigendecomposing it. That equivalence is the unifying thread of this chapter. 
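The equivalence in EQ M13.8 is easy to confirm on synthetic data. The sketch below is an illustration under assumed inputs (a random correlated data matrix; all variable names are hypothetical): it computes the principal components once via the covariance eigendecomposition and once via the SVD of the centered data, then compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 4
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))   # correlated columns
Xc = X - X.mean(axis=0)                                  # center each column

# Route 1: eigendecomposition of the covariance matrix C.
C = Xc.T @ Xc / (m - 1)
evals, evecs = np.linalg.eigh(C)                 # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]       # reorder to descending

# Route 2: SVD of the centered data, no covariance matrix formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = s ** 2 / (m - 1)                           # EQ M13.8: lambda_i = sigma_i^2 / (m - 1)

# Eigenvectors are only defined up to sign, so compare |cosine| per component.
cosines = np.abs(np.sum(evecs * Vt.T, axis=0))

print("variances agree:", np.allclose(evals, lam))
print("directions agree up to sign:", np.allclose(cosines, 1.0))
print("explained-variance ratio:", np.round(lam / lam.sum(), 3))
```

The explained-variance ratio printed last is the same quantity as the cumulative-energy curve of §13.2, taken one component at a time.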
The same low-rank idea, read four ways, becomes four tools:

| You want… | The factorization | What the factors mean |
|---|---|---|
| Best low-rank fit / compression | truncated SVD \(A_k\) | orthogonal, ordered, optimal (Eckart–Young) |
| Principal directions / decorrelation | SVD of centered \(X_c\) | PCA components = right singular vectors |
| Fill missing entries (recommend) | Funk SVD on observed \(\mathcal{K}\) | user/item latent factors \(p_u, q_i\) |
| Interpretable additive parts | NMF \(WH\), \(W,H\ge 0\) | topics, strokes, spectra: parts you can name |

Where this reaches. Latent Semantic Analysis is truncated SVD of a term–document matrix. Classical word embeddings (GloVe, and the implicit factorization behind word2vec) are matrix factorizations of co-occurrence statistics. Spectral clustering factorizes a graph Laplacian. Image and video codecs lean on related transforms. The low-rank prior ("a few hidden factors explain most of the data") is one of the most reusable assumptions in all of machine learning, and factorization is how you cash it in.

NEXT Factorization compresses one matrix into a few smart parts; ensembles compress many weak models into one strong vote. Chapter 14: bagging, boosting, and stacking. Why a forest of mediocre trees beats one clever one, and how the bias–variance trade-off plays out when you combine predictors instead of features.

13.R References

Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8). The canonical write-up of the latent-factor model and biases (EQ M13.4–M13.5), from the Netflix-Prize winners.
Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401. NMF and the parts-based decomposition of §13.4 (EQ M13.6).
Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika 1(3). The theorem that makes truncated SVD the best low-rank fit (EQ M13.3).
Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. NIPS 13. The multiplicative update rules of EQ M13.7 and their convergence guarantee.
Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2). Randomized SVD, how truncated factorizations are actually computed at scale.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIS 41(6). LSA: truncated SVD of a term–document matrix, the §13.5 connection to embeddings.
Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NIPS 27. Shows that word2vec's skip-gram with negative sampling implicitly factorizes a shifted PMI matrix.

## VOL I · 14 · Ensemble Methods (https://ai-encyclopedia.com/ml/14-ensembles.html)

MACHINE LEARNING · CHAPTER 14 / 15

Ensemble Methods

A single tree overfits, a single shallow model underfits, and any single model is a single point of failure. Combine many weak models the right way and their errors largely cancel, which makes ensembling the most dependable improvement in applied machine learning. This chapter derives why ensembling works from the bias-variance-covariance decomposition, then walks the three families that exploit it: bagging (reduce variance), boosting (reduce bias, stagewise), and stacking (let a meta-model learn the blend).

LEVEL CORE · READING TIME ≈ 24 MIN · BUILDS ON CH 04 · 06 · INSTRUMENTS VARIANCE DROP · STAGEWISE FIT · STACKER

IN THIS CHAPTER 14.1 Why ensembles win 14.2 Bagging 14.3 Boosting 14.4 Stacking & blending 14.5 When ensembles fail § References

14.1 Why ensembles win — the error decomposition

Start from the only identity that matters here.
For a squared-error regression target \(y = f(x) + \varepsilon\) with irreducible noise \(\mathrm{Var}(\varepsilon) = \sigma^2\), the expected test error of an estimator \(\hat f\) — averaged over the randomness in the training set — splits into three terms that cannot interfere with one another: EQ M14.1 — BIAS-VARIANCE DECOMPOSITION $$ \mathbb{E}\big[(y - \hat f(x))^2\big] \;=\; \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}} $$ Bias is how far the average model is from the truth (systematic error — too rigid). Variance is how much the model jitters as the training set is resampled (instability — too flexible). Noise is the floor no model can beat. Each ensemble family attacks a different term: bagging shrinks variance, boosting shrinks bias, and a good stack can chip at both. The decomposition is exact for squared loss; for 0–1 classification it only holds in spirit, which is why we reason in regression first. Now average \(k\) models. Let each \(\hat f_j\) have the same variance \(\sigma_m^2\), and let any two of them have correlation \(\rho\). The variance of their average is the whole story of ensembling in one line: EQ M14.2 — VARIANCE OF A CORRELATED AVERAGE $$ \mathrm{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k}\hat f_j\right) \;=\; \rho\,\sigma_m^2 \;+\; \frac{1-\rho}{k}\,\sigma_m^2 $$ Two regimes live inside this equation. As \(k \to \infty\) the second term vanishes and you are left with \(\rho\,\sigma_m^2\) — the correlation floor. If the members are independent (\(\rho = 0\)) the average has variance \(\sigma_m^2/k\): error falls like \(1/k\), a true free lunch. If they are identical (\(\rho = 1\)) you gain nothing — averaging a model with copies of itself changes nothing. Every variance-reduction trick in this chapter is an attempt to push \(\rho\) down without letting \(\sigma_m^2\) blow up. 
Averaging does not touch bias: the mean of \(k\) equally-biased models keeps that bias intact. Two consequences follow immediately. First, ensembling helps most when the members are good but different — accurate (small \(\sigma_m^2\)) yet decorrelated (small \(\rho\)). Second, there is a hard limit: no amount of averaging beats the correlation floor \(\rho\,\sigma_m^2\), so the engineering game is decorrelation, not just adding more models. This is exactly why random forests randomize the features at each split (it lowers \(\rho\)) and why diverse model classes stack better than ten retrained copies of the same gradient-boosted tree. You average \(k = 4\) independent models, each with variance \(v = 8\). By EQ M14.2 with \(\rho = 0\), what is the variance of their average \( \tfrac{1}{k}\sum_j \hat f_j \)? With \(\rho = 0\), EQ M14.2 collapses to \(\mathrm{Var} = v/k = 8/4 = \) 2. Independence is what makes the \(1/k\) free lunch real; correlation is what spoils it. True or false: bagging (averaging many bootstrap-trained models) primarily reduces the variance term of EQ M14.1, leaving bias roughly unchanged. Answer true or false. EQ M14.2 shows the averaged variance shrinks toward the correlation floor, while the bias of the average equals the (shared) bias of each member — averaging cannot move it. So bagging buys variance reduction at fixed bias: true. 14.2 Bagging & variance reduction Bootstrap aggregating — Breiman's bagging — is the most direct cash-out of EQ M14.2. Draw \(k\) bootstrap samples (each \(n\) points sampled with replacement from the original \(n\)), fit one high-variance, low-bias learner on each, and average their predictions (or majority-vote for classification). Deep decision trees are the canonical base learner because they are exactly what the math wants: nearly unbiased and wildly unstable, so there is a mountain of variance to drain and almost no bias to protect. 
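Because bagging leans entirely on EQ M14.2, it is worth checking that formula by simulation before trusting it. The sketch below is an assumption-laden illustration rather than the chapter's own code: it fakes \(k\) equally correlated "model outputs" as Gaussians built from one shared component and one private component each, and the function names are hypothetical.

```python
import numpy as np

def avg_variance_formula(rho, sigma2, k):
    """EQ M14.2: variance of the average of k members with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / k * sigma2

def avg_variance_simulated(rho, sigma2, k, trials=200_000, seed=0):
    """Monte Carlo estimate using Gaussian stand-ins for the member predictions."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(trials, 1))         # component common to all members
    private = rng.normal(size=(trials, k))        # component unique to each member
    # Each member has variance sigma2 and pairwise correlation rho by construction.
    members = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return members.mean(axis=1).var()

for rho in (0.0, 0.3, 1.0):
    for k in (1, 4, 16):
        f = avg_variance_formula(rho, 8.0, k)
        s = avg_variance_simulated(rho, 8.0, k)
        print(f"rho={rho:.1f}  k={k:2d}  formula={f:6.3f}  simulated={s:6.3f}")
```

With \(\rho = 0\) the variance falls like \(1/k\); with \(\rho = 0.3\) it stalls near the floor \(\rho\,\sigma_m^2 = 2.4\); with \(\rho = 1\) adding members changes nothing.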
EQ M14.3 — THE BAGGED PREDICTOR & ITS OOB FREE LUNCH $$ \hat f_{\text{bag}}(x) = \frac{1}{k}\sum_{j=1}^{k}\hat f_j^{*}(x), \qquad \Pr(\text{point } i \notin \text{bootstrap } j) = \left(1 - \tfrac{1}{n}\right)^{\!n} \xrightarrow[n\to\infty]{} e^{-1} \approx 0.368 $$ \(\hat f_j^{*}\) is the model trained on bootstrap sample \(j\). Each bootstrap omits about 36.8% of the data purely by chance — those are the out-of-bag (OOB) points for that tree. Average each point's predictions over only the trees that did not see it and you get a validation estimate for free, no held-out set required. OOB error is bagging's built-in cross-validation, and it is why random forests are so cheap to tune. Random forests add the second decorrelation lever EQ M14.2 begged for. Trees grown on bootstrap samples of the same data are still correlated — they all latch onto the few strongest features. So at every split, a random forest considers only a random subset of features (\(\sqrt{p}\) for classification, \(p/3\) for regression are the classic defaults). Forcing weaker features into the splits makes the trees genuinely different, drives \(\rho\) down, and pushes the correlation floor lower than plain bagging can reach. Extremely randomized trees (ExtraTrees) go further still, randomizing the split thresholds too — trading a touch of bias for even less correlation. In a large bootstrap sample (\(n \to \infty\)), what fraction of the original points are left out of any single bootstrap draw — i.e. become out-of-bag? Use the limit in EQ M14.3. \(\left(1 - \tfrac{1}{n}\right)^n \to e^{-1} = 0.3679\). Rounded, about 0.368 — roughly 36.8% of points sit out of each tree and form its OOB validation set. 
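The out-of-bag fraction is simple to measure directly. The following sketch (the helper name `oob_fraction` and the chosen sample sizes are illustrative assumptions) draws one bootstrap sample of size \(n\) and counts how many original points never appear in it, alongside the closed form from EQ M14.3.

```python
import numpy as np

def oob_fraction(n, seed=0):
    """Fraction of the n original points missing from one bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    drawn = rng.integers(0, n, size=n)        # sample n indices with replacement
    in_bag = np.zeros(n, dtype=bool)
    in_bag[drawn] = True
    return 1.0 - in_bag.mean()

for n in (10, 100, 10_000, 1_000_000):
    exact = (1 - 1 / n) ** n                  # probability a given point is left out
    print(f"n={n:>9,d}  simulated OOB={oob_fraction(n):.4f}  (1-1/n)^n={exact:.4f}")

print("limit e^-1 =", round(float(np.exp(-1)), 4))
```

For small \(n\) a single draw is noisy, but by a few thousand points both columns sit close to \(e^{-1}\).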
PYTHON · RUNNABLE IN-BROWSER

```python
# Bagging from scratch: average bootstrapped stumps, watch variance collapse
import numpy as np
rng = np.random.default_rng(0)

def stump(x, y):                     # 1-split regression tree (high variance)
    s = np.argsort(x); xs, ys = x[s], y[s]
    best, thr, lo, hi = 1e18, x[0], y.mean(), y.mean()
    for i in range(1, len(x)):       # try every midpoint as a split
        t = 0.5 * (xs[i-1] + xs[i])
        l, r = ys[:i], ys[i:]
        e = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if e < best:                 # keep the split with the lowest squared error
            best, thr, lo, hi = e, t, l.mean(), r.mean()
    return thr, lo, hi               # ASSUMED return value: the export cuts off at the "<" above

# [The rest of this cell (the bootstrap loop and the variance readout) was lost
#  in the text export; see the site for the full runnable version.]
```

INSTRUMENT M14.1 — VARIANCE COLLAPSE (AVERAGE N NOISY STUMPS · EQ M14.2). Interactive on the site. Controls: trees in ensemble k, noise σ. Readouts: single-tree variance, bagged variance (k trees), variance reduction.

Faint lines are individual bootstrap stumps; the bright mint line is their average; the dashed line is the true curve. Push k up and watch the ragged members fuse into a smooth, accurate fit while the variance readout drops toward the correlation floor. It does not reach zero, because bootstrap stumps stay correlated. Raise the noise to see how much more there is to gain.

14.3 Boosting — sequential error correction

Bagging builds its members in parallel and independently. Boosting does the opposite: it builds them in sequence, each new member trained to fix the mistakes of the running ensemble so far. Where bagging attacks variance with strong learners, boosting attacks bias by composing many weak learners (shallow trees, often stumps) into one strong one. The cost is that members are now dependent by construction (\(\rho\) is high), so the variance lever of EQ M14.2 is gone, and boosting trades it for a march down the bias term.

The cleanest modern view is gradient boosting (Friedman): treat the ensemble as a function being optimized by gradient descent in function space.
At each round you fit a new weak learner to the negative gradient of the loss — for squared error, that gradient is simply the residual — and take a small, shrunk step: EQ M14.4 — GRADIENT BOOSTING (FUNCTIONAL GRADIENT DESCENT) $$ r_i^{(m)} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x),\quad h_m \approx \arg\min_h \sum_i (r_i^{(m)} - h(x_i))^2 $$ \(L\) is the loss, \(F_{m-1}\) the ensemble after \(m-1\) rounds, \(h_m\) the weak learner fit to the pseudo-residuals \(r^{(m)}\), and \(\nu \in (0,1]\) the learning rate (shrinkage). For squared loss \(L = \tfrac12(y-F)^2\), the residual is literally \(r_i = y_i - F_{m-1}(x_i)\): each round models what the last one missed. Small \(\nu\) with many rounds nearly always beats large \(\nu\) with few — the same regularize-by-small-steps logic as in SGD (Chapter 08). XGBoost and LightGBM (Chapter 15) are this recipe plus second-order Newton steps and brutal engineering. The older, equivalent-in-spirit ancestor is AdaBoost, which reweights the data instead of fitting residuals: misclassified points get heavier, so the next weak learner concentrates where the ensemble is failing. Its multiclass form, SAMME, assigns each weak learner a vote weighted by how much better than chance it does: EQ M14.5 — ADABOOST/SAMME WEIGHTS $$ \alpha_m = \log\!\frac{1 - \mathrm{err}_m}{\mathrm{err}_m} + \log(K - 1), \qquad \mathrm{err}_m = \frac{\sum_i w_i\,\mathbb{1}[y_i \neq h_m(x_i)]}{\sum_i w_i} $$ \(\mathrm{err}_m\) is the weighted error of weak learner \(m\); \(K\) is the number of classes (\(K = 2\) recovers classic AdaBoost, where the \(\log(K-1)\) term is zero). A learner barely better than random (\(\mathrm{err}_m\) just under \(1 - 1/K\)) gets \(\alpha_m \approx 0\); a strong one gets a large vote. After each round, weights on the still-wrong points are multiplied by \(e^{\alpha_m}\) and renormalized. 
SAMME merely requires each learner to beat random guessing (weighted error below \(1 - 1/K\)), which for \(K > 2\) is a far weaker bar than the 50% error threshold binary AdaBoost demands.

A binary (\(K = 2\)) AdaBoost weak learner has weighted error \(\mathrm{err}_m = 0.1\). By EQ M14.5 (the \(\log(K-1)\) term vanishes for \(K=2\)), what is its vote weight \(\alpha_m\)? Use natural log.

\(\alpha_m = \ln\!\dfrac{1 - 0.1}{0.1} = \ln\dfrac{0.9}{0.1} = \ln 9 = \) 2.197. A 10%-error learner earns a large, confident vote; a 50%-error (chance) learner would earn \(\ln 1 = 0\).

PYTHON · RUNNABLE IN-BROWSER

```python
# AdaBoost (SAMME, K=2) on toy data: print weighted error + alpha each round
import numpy as np
rng = np.random.default_rng(1)

n = 200
X = rng.normal(0, 1, (n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # linear boundary, labels +/-1

def stump(X, y, w):                           # best 1-feature threshold stump
    best = (1e9, 0, 0.0, 1)                   # (weighted error, feature, threshold, sign)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(.1, .9, 9)):
            for s in (1, -1):
                pred = np.where(s * (X[:, f] - t) > 0, 1, -1)
                e = w[pred != y].sum()
                # The export dropped the text between "<" and ">" here; the next
                # three statements are reconstructed from the surrounding code.
                if e < best[0]:
                    best = (e, f, t, s)
    _, f, t, s = best
    return best, np.where(s * (X[:, f] - t) > 0, 1, -1)

w = np.full(n, 1 / n)                         # uniform sample weights
print("round weighted_err alpha")
for m in range(5):
    _, pred = stump(X, y, w)
    err = max(w[pred != y].sum(), 1e-12)      # guard against log(1/0)
    alpha = np.log((1 - err) / err)           # EQ M14.5, K=2
    w *= np.exp(alpha * (pred != y))          # up-weight the still-wrong
    w /= w.sum()
    print(f" {m:2d} {err:8.4f} {alpha:6.3f}")

print("\nerror drifts toward 0.5 as easy points are solved and hard ones dominate.")
```

INSTRUMENT M14.2 — STAGEWISE RESIDUAL FIT (GRADIENT BOOSTING · EQ M14.4). Interactive on the site. Controls: boosting rounds m, learning rate ν. Readouts: rounds used, train MSE, residual norm.

The bright line is the boosted ensemble \(F_m\) chasing the dashed target; the short red sticks are the current residuals \(r^{(m)}\), which are what the next stump will be fit to. Step the rounds up and watch the residuals shrink stagewise. Drop the learning rate and you need more rounds for the same fit, but the staircase is smoother and overfits later.
At round 0 the model is just the mean. 14.4 Stacking & blending Bagging and boosting both combine members of one kind with a fixed rule (average, weighted vote). Stacked generalization (Wolpert) asks the obvious next question: why hand-pick the combiner when you can learn it? Train several diverse base models — say a random forest, a gradient-boosted tree, a linear model, and a k-NN — then train a second-level meta-learner whose inputs are the base models' predictions and whose target is the true label. The meta-model discovers, per region of input space, which base learner to trust. EQ M14.6 — THE STACKED PREDICTOR $$ \hat y = g\big(\hat f_1(x),\, \hat f_2(x),\, \ldots,\, \hat f_M(x)\big), \qquad g = \arg\min_{g}\ \sum_{i}\, L\big(y_i,\, g(z_i)\big),\ \ z_i = \big(\hat f_1^{(-i)}(x_i),\ldots\big) $$ \(g\) is the meta-learner; \(z_i\) is the vector of base predictions for point \(i\). The crucial detail is the \((-i)\) superscript: the base predictions feeding the meta-model must be out-of-fold. Train the bases with k-fold cross-validation and predict each held-out fold, so no base model ever predicts a point it was trained on. Skip this and the meta-learner sees in-sample predictions that are unrealistically good, learns to trust an overfit base, and collapses on real data. A simple, well-regularized \(g\) (ridge, or non-negative-weighted logistic) almost always beats a fancy one. Blending is the lazy cousin: a single fixed holdout split instead of full CV — simpler, leak-resistant, but it wastes data. Stacking is the engine behind most Kaggle-winning solutions and many production ranking systems, precisely because EQ M14.2 rewards diversity: a forest and a boosted tree make different mistakes, so their stacked combination has lower \(\rho\) than either family alone. The returns are real but modest — typically a few percent over the best single model — and they come with real operational cost (more models to train, serve, monitor, and debug). 
That trade is the subject of the next section.

PYTHON · RUNNABLE IN-BROWSER

```python
# Stacking toy: two biased base models; a ridge meta-learner finds the blend
import numpy as np
rng = np.random.default_rng(3)

n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)      # ground truth + noise

# two deliberately complementary, biased base predictions
# (fixed functions, not fitted models, so no out-of-fold step is needed here)
f1 = 0.9 * np.sin(x) - 0.2                    # good shape, wrong offset
f2 = x / 3.0                                  # a linear approximation
for name, f in (("base f1", f1), ("base f2", f2)):
    print(f"{name:8s} MSE = {((y - f)**2).mean():.4f}")

Z = np.column_stack([np.ones(n), f1, f2])     # meta features (+ intercept)
lam = 1e-3
beta = np.linalg.solve(Z.T @ Z + lam*np.eye(3), Z.T @ y)   # ridge meta-learner
stack = Z @ beta
print(f"\nmeta weights [bias, f1, f2] = {beta.round(3)}")
# The export truncated the end of the next line; only the closing quote and
# parenthesis are restored, any trailing message text is not recoverable.
print(f"stacked MSE = {((y - stack)**2).mean():.4f}")
```

INSTRUMENT M14.3 — STACKING META-LEARNER (TWO BASES + LEARNED BLEND · EQ M14.6). Interactive on the site. Controls: base 1 weight w₁, meta-learner mode (MANUAL / RIDGE FIT). Readouts: base 1 MSE, base 2 MSE, stacked MSE.

Two biased base learners (mint, blue) bracket the dashed truth. In MANUAL mode, drag w₁ to blend them by hand and hunt for the lowest stacked MSE. Switch to RIDGE FIT and the meta-learner solves for the optimal weights in closed form, landing at or below the best blend you can find by hand, and below either base alone. That is EQ M14.6 doing its job.

14.5 When ensembles fail

The "free lunch" framing is a useful exaggeration. Ensembles are remarkably robust, but EQ M14.2 also tells you exactly where they stop helping, and several failure modes have nothing to do with the math at all.
| Failure mode | Why it happens | What it looks like |
|---|---|---|
| Correlated members | High \(\rho\) pins variance at the floor \(\rho\sigma_m^2\) | The 200th tree adds nothing; OOB error flatlined at tree ~50 |
| Bagging a stable learner | Low-variance bases (linear/SVM) have little variance to drain | Bagged linear model ≈ the single linear model, at \(k\)× the cost |
| Boosting on noisy labels | AdaBoost up-weights hard points, which include mislabels | Train error → 0, test error climbs; the ensemble memorizes noise |
| Stacking with leakage | In-sample (not out-of-fold) base predictions feed the meta-model | Spectacular CV scores, collapse in production |
| Distribution shift | All members agree confidently on the wrong (shifted) inputs | Confident, unanimous, and wrong; diversity gives no safety here |

Two of these deserve emphasis because they are genuinely contested in practice. First, boosting's noise sensitivity: AdaBoost's exponential loss punishes outliers viciously, which is why robust variants (LogitBoost, gradient boosting with Huber loss, and early stopping) exist. On clean tabular data, though, gradient-boosted trees remain the strongest single tool and frequently beat deep nets, a result that surprises newcomers and is still actively debated. Second, the cost-benefit ledger: an ensemble multiplies training, serving, latency, memory, and debugging cost, often for a 1–3% metric gain. In a leaderboard that wins; in a latency-bound product it may not be worth it. The honest default is a single well-tuned gradient-boosted model for tabular problems, reaching for stacking only when the last percent genuinely pays.

The deeper caveat. Ensembling reduces variance and bias, never the irreducible noise \(\sigma^2\) of EQ M14.1, and it cannot manufacture signal that the base learners never saw. Diversity is a property of errors, not of confidence: ten models can be diverse, confident, and unanimously wrong under distribution shift.
Ensembles buy you stability and a few points of accuracy — not robustness to a world that has moved. NEXT The theory is settled; the speed is not. Chapter 15 takes gradient boosting from EQ M14.4 to the libraries that dominate tabular ML — XGBoost, LightGBM, and CatBoost — covering second-order Newton splits, histogram binning, leaf-wise growth, and the regularization knobs that decide whether boosting generalizes or memorizes. 14.R References Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2) — bootstrap aggregating and its variance-reduction argument (EQ M14.2, M14.3). Breiman, L. (2001). Random Forests. Machine Learning 45(1) — feature subsampling as the second decorrelation lever; OOB error (§14.2). Wolpert, D. H. (1992). Stacked Generalization. Neural Networks 5(2) — learning the combiner with out-of-fold base predictions (EQ M14.6). Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS 1857 — the canonical survey of why and when ensembles help (§14.1). Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1) — AdaBoost and its training-error bound (EQ M14.5). Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — boosting as functional gradient descent (EQ M14.4). Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; Ch. 8, 10, 15–16 cover bagging, boosting, and random forests with the bias-variance lens used here. 
## VOL I · Gradient Boosting in Practice (https://ai-encyclopedia.com/ml/15-boosting-libraries.html)
MACHINE LEARNING · CHAPTER 15 / 15 Gradient Boosting in Practice — XGBoost, LightGBM, CatBoost Open any tabular-data leaderboard and the top is usually the same three names. All three implement one idea, gradient boosting, and differ mainly in their engineering choices. This chapter builds the algorithm from first principles, traces it back to AdaBoost, then shows what XGBoost, LightGBM, and CatBoost each changed and why it mattered. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON ML 04 · ML 14 INSTRUMENTS STAGEWISE · LR×TREES · LIBRARY MATRIX IN THIS CHAPTER 15.1 Gradient boosting 15.2 AdaBoost & exponential loss 15.3 XGBoost 15.4 LightGBM 15.5 CatBoost 15.R References 15.1 Gradient boosting — the general algorithm A single decision tree (Chapter 04) is a weak learner: it carves the input space into boxes and predicts a constant in each. Boosting is the idea that a sequence of such weak learners, each trained to fix the mistakes of the running total, can compose into a strong one. Where bagging (Chapter 14) builds many independent trees and averages them to cut variance, boosting builds trees sequentially, and each new tree is added to reduce the bias that remains. Friedman's gradient boosting machine (2001) gives this a clean, general formulation: treat the additive model \(F(x)\) as a point in function space, and run gradient descent on the loss, with respect to the function itself.
We grow the model one stage at a time: EQ M15.1 — ADDITIVE STAGEWISE MODEL $$ F_0(x) = \arg\min_{c}\sum_{i=1}^{n} L\bigl(y_i, c\bigr), \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x) $$ \(F_m\) is the ensemble after \(m\) trees; \(F_0\) is a constant (for squared loss, the mean of \(y\); for log-loss, the log-odds). \(h_m\) is the \(m\)-th tree and \(\nu \in (0,1]\) is the learning rate (shrinkage). We never re-touch earlier trees — the model is built stagewise, not stepwise. The whole game is choosing each \(h_m\) so that adding it most reduces the loss. How do we pick \(h_m\)? Gradient descent says: move in the direction of steepest descent. In function space, the steepest-descent direction at example \(i\) is the negative gradient of the loss with respect to the current prediction — the pseudo-residual: EQ M15.2 — PSEUDO-RESIDUALS (NEGATIVE GRADIENT) $$ r_{im} \;=\; -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}} $$ Each tree is fit not to the labels \(y\) but to these negative gradients — the direction in which the prediction at each point should move to lower the loss. The tree then approximates that direction with a piecewise-constant function, and EQ M15.1 takes a small step \(\nu\) along it. This is the single defining act of gradient boosting: every new tree regresses on the negative gradient of the loss. Changing \(L\) changes only the residual formula — the machinery is identical for regression, classification, and ranking. The payoff is cleanest for the squared-error loss \(L(y,F)=\tfrac12(y-F)^2\). Its gradient is \(\partial L/\partial F = -(y-F)\), so the negative gradient is simply \(r_i = y_i - F(x_i)\): the ordinary residual. For squared loss, "fit the negative gradient" reduces to the intuitive "fit the leftover error" — which is why the from-scratch demo below works with plain residuals. 
EQ M15.3 — SQUARED LOSS: GRADIENT IS THE RESIDUAL $$ L(y,F)=\tfrac12\,(y-F)^2 \;\Longrightarrow\; -\frac{\partial L}{\partial F} = y - F = r $$ For squared loss the pseudo-residual is exactly the residual. For other losses it is not: log-loss gives \(r = y - p\) (label minus predicted probability), and the absolute-error loss gives \(r = \operatorname{sign}(y-F)\), which is why \(L_1\)-boosting chases the median rather than the mean. The framework is loss-agnostic; only EQ M15.2 changes. WORKED EXAMPLE ▾ 01 Three points with targets \(y = (3, 5, 9)\). The optimal constant under squared loss is the mean: \(F_0 = (3+5+9)/3 = 5.667\). 02 Residuals \(r = y - F_0 = (-2.667,\ -0.667,\ 3.333)\). A stump splits them into two leaves; say it isolates the third point, predicting the leaf means \((-1.667)\) for the first two and \((3.333)\) for the last. 03 With learning rate \(\nu = 0.5\): \(F_1 = F_0 + 0.5\,h_1\). Point 3 becomes \(5.667 + 0.5(3.333) = 7.333\) — moved halfway from 5.667 toward its target of 9. 04 New residuals are smaller; the next stump fits those. Repeated, the ensemble inches every prediction toward its label. The squared-error loss \(\sum r_i^2\) falls monotonically — the loss curve the stepper below draws. RESULT: tree m fits y − F₍ₘ₋₁₎; the step ν shrinks each correction True or false: in gradient boosting, each new tree is fit to approximate the negative gradient of the loss with respect to the current model's predictions. (Answer true or false.) This is the definition of gradient boosting (EQ M15.2). The negative gradient is the steepest-descent direction in function space; the tree regresses on it, and EQ M15.1 takes a shrunk step \(\nu\) along it. For squared loss this gradient equals the plain residual \(y-F\) (EQ M15.3), which is the special case people usually picture. Answer: true. A gradient-boosting regressor under squared-error loss starts from a constant \(F_0\). For targets \(y = (3, 5, 9)\), what value does it initialize \(F_0\) to? 
(Give a decimal.) EQ M15.1 sets \(F_0 = \arg\min_c \sum (y_i - c)^2/2\), which is minimized at the mean: \(F_0 = (3+5+9)/3 = 17/3 = \) 5.667. The first tree then fits the residuals \(y - F_0\). A gradient-boosting model uses learning rate \(\nu = 0.1\). By EQ M15.1, each new tree's contribution to the ensemble is scaled by what factor (the shrinkage applied per tree)? (Give a decimal.) The stagewise update is \(F_m = F_{m-1} + \nu\, h_m\), so each tree \(h_m\) enters the model multiplied by exactly \(\nu\). With \(\nu = 0.1\) the per-tree shrinkage is 0.1 — each tree contributes only a tenth of its raw correction, which is why low \(\nu\) demands more trees but generalizes better (INSTRUMENT M15.2). INSTRUMENT M15.1 — STAGEWISE BOOSTING STEPPER DEPTH-1 STUMPS · SQUARED LOSS · EQ M15.1–M15.3 LEARNING RATE ν 0.30 TREES ADD TREE ▶ +10 ▶▶ RESET TREES ADDED 0 TRAINING MSE — SHRINKAGE / TREE — A wavy target (mint dots) is fit by depth-1 stumps added one at a time; the white line is the running ensemble \(F_m\), the small inset traces the training MSE. With \(\nu = 1\) the fit lurches and overshoots; drop \(\nu\) to 0.1 and each step is a gentle nudge — smoother, slower, and it generalizes better. Click ADD TREE to watch a single stump attack the current residuals, or +10 to fast-forward. This is EQ M15.1 in motion. 
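The claim under EQ M15.3 that only the pseudo-residual formula changes with the loss can be checked numerically. The sketch below is a minimal illustration, not library code: it compares the closed-form pseudo-residuals for squared, absolute, and log-loss against a central finite difference of each loss. It assumes that for log-loss \(F\) is a log-odds score with \(p = \sigma(F)\) and labels in \(\{0, 1\}\); the helper names and toy values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Loss functions L(y, F); for log-loss, F is assumed to be a log-odds score.
LOSSES = {
    "squared": lambda y, F: 0.5 * (y - F) ** 2,
    "absolute": lambda y, F: np.abs(y - F),
    "logloss": lambda y, F: -(y * np.log(sigmoid(F))
                              + (1 - y) * np.log(1 - sigmoid(F))),
}

# Closed-form pseudo-residuals (negative gradients) quoted under EQ M15.3.
PSEUDO_RESIDUALS = {
    "squared": lambda y, F: y - F,
    "absolute": lambda y, F: np.sign(y - F),
    "logloss": lambda y, F: y - sigmoid(F),
}

def numeric_negative_gradient(loss, y, F, eps=1e-6):
    # Central finite difference of the loss with respect to the prediction F.
    return -(loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)

# Toy data: regression targets vs a constant model, and binary labels vs log-odds.
y_reg, F_reg = np.array([3.0, 5.0, 9.0]), np.array([4.0, 4.0, 4.0])
y_cls, F_cls = np.array([0.0, 1.0, 1.0]), np.array([-1.0, 0.5, 2.0])

max_gap = {}
for name, loss in LOSSES.items():
    y, F = (y_cls, F_cls) if name == "logloss" else (y_reg, F_reg)
    closed_form = PSEUDO_RESIDUALS[name](y, F)
    numeric = numeric_negative_gradient(loss, y, F)
    max_gap[name] = float(np.max(np.abs(closed_form - numeric)))
    print(f"{name:9s} pseudo-residuals {np.round(closed_form, 3)} "
          f"max gap vs finite difference {max_gap[name]:.1e}")
```

If the three gaps are all tiny, the same tree-fitting loop can serve all three losses by swapping only the entry in `PSEUDO_RESIDUALS`, which is the design point EQ M15.2 makes.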
PYTHON · RUNNABLE IN-BROWSER # Gradient boosting for regression, from scratch: stumps fit to residuals import numpy as np rng = np.random.default_rng(0) x = np.linspace(-3, 3, 120) y = np.sin(x) + 0.15 * rng.standard_normal(x.size) # noisy target def best_stump(x, r): # depth-1 tree: one threshold, two leaf means order = np.argsort(x); xs, rs = x[order], r[order] best = (np.inf, x.mean(), r.mean(), r.mean()) for i in range(5, len(xs) - 5): # candidate split points t = 0.5 * (xs[i - 1] + xs[i]) lo, hi = rs[:i].mean(), rs[i:].mean() sse = ((rs[:i] - lo) ** 2).sum() + ((rs[i:] - hi) ** 2).sum() if sse < best[0]: best = (sse, t, lo, hi) return best[1:] # threshold, left value, right value nu, M = 0.3, 40 F = np.full_like(y, y.mean()) # F0 = mean (EQ M15.1, squared loss) loss = [] for m in range(M): r = y - F # negative gradient = residual (EQ M15.3) t, lo, hi = best_stump(x, r) h = np.where(x < t, lo, hi) F = F + nu * h # stagewise step (EQ M15.1) loss.append(np.mean((y - F) ** 2)) print(f"start MSE {np.mean((y - y.mean())**2):.4f} -> after {M} trees {loss[-1]:.4f}") print("loss is monotonically non-increasing:", all(np.diff(loss) <= 1e-9)) plot_xy(list(range(1, M + 1)), loss) RUN ▶ edits are live — try nu=1.0 (overshoots) or nu=0.05 (needs more trees) 15.2 AdaBoost & exponential loss — where it started Gradient boosting did not arrive first. AdaBoost (Freund & Schapire, 1997) predates the gradient view and looks, at a glance, like a different algorithm: it maintains a weight on every training example, up-weights the ones the current ensemble gets wrong, and trains the next weak learner on that re-weighted distribution. Misclassified points shout louder; the next learner is forced to attend to them. 
For binary labels \(y \in \{-1,+1\}\), each round fits a classifier \(h_m\), measures its weighted error \(\varepsilon_m\), and assigns it a vote \(\alpha_m\): EQ M15.4 — ADABOOST: VOTE WEIGHT & REWEIGHTING $$ \alpha_m = \tfrac12 \ln\!\frac{1-\varepsilon_m}{\varepsilon_m}, \qquad w_i \leftarrow w_i\,\exp\!\bigl(-\alpha_m\, y_i\, h_m(x_i)\bigr),\ \text{then renormalize} $$ \(\varepsilon_m\) is the weighted error rate of learner \(m\). A learner barely better than chance (\(\varepsilon \to 0.5\)) gets vote \(\alpha \to 0\); a near-perfect one (\(\varepsilon \to 0\)) gets a large vote. The reweighting term \(\exp(-\alpha y_i h_m)\) is \(< 1\) for a correct prediction (\(y_i h_m = +1\)) and \(> 1\) for a wrong one (\(y_i h_m = -1\)), so errors grow heavier and successes fade. The final classifier is the sign of the weighted vote \(\sum_m \alpha_m h_m(x)\). The bridge to Friedman is one of the cleaner results in machine learning. Friedman, Hastie and Tibshirani (2000) showed that AdaBoost is exactly forward stagewise additive modeling under the exponential loss: EQ M15.5 — ADABOOST = STAGEWISE MINIMIZATION OF EXPONENTIAL LOSS $$ L\bigl(y, F(x)\bigr) = \exp\!\bigl(-y\,F(x)\bigr), \qquad F(x) = \sum_m \alpha_m h_m(x) $$ Minimizing the exponential loss one term at a time reproduces the AdaBoost weight update and the \(\alpha_m\) formula exactly — the example weights \(w_i\) are, up to normalization, nothing but \(\exp(-y_i F_{m-1}(x_i))\). So AdaBoost is gradient boosting with a specific loss, and the negative-gradient view of §15.1 subsumes it. The practical consequence: exponential loss punishes confident mistakes ferociously (it grows without bound as \(yF \to -\infty\)), which makes vanilla AdaBoost sensitive to mislabeled data and outliers — a known weakness that log-loss boosting (LogitBoost) softens. WORKED EXAMPLE ▾ 01 Round \(m\): the current weak learner has weighted error \(\varepsilon_m = 0.30\). Its vote is \(\alpha_m = \tfrac12 \ln(0.70/0.30) = \tfrac12 \ln(2.333) = \tfrac12 (0.8473) = 0.4236\).
02 A correctly classified point (\(y_i h_m = +1\)) is reweighted by \(\exp(-0.4236) = 0.655\) — its influence shrinks by about a third. 03 A misclassified point (\(y_i h_m = -1\)) is reweighted by \(\exp(+0.4236) = 1.528\) — it grows by half, so the next learner cannot ignore it. 04 The ratio of the two factors is \(1.528 / 0.655 = 2.333 = (1-\varepsilon)/\varepsilon\): exactly the odds the better-than-chance learner beat. After renormalizing, the total weight on errors rises to \(0.5\) — the next round faces a perfectly balanced fight. RESULT: ε = 0.30 → α = 0.424; errors ×1.53, correct ×0.66 An AdaBoost weak learner achieves weighted error \(\varepsilon_m = 0.3\). By EQ M15.4, what vote weight \(\alpha_m\) does it receive? (Give a decimal.) \(\alpha_m = \tfrac12 \ln\!\dfrac{1-\varepsilon_m}{\varepsilon_m} = \tfrac12 \ln\!\dfrac{0.7}{0.3} = \tfrac12 \ln(2.3333) = \tfrac12 (0.84730) = \) 0.4236. A learner exactly at chance (\(\varepsilon = 0.5\)) would get \(\alpha = \tfrac12\ln 1 = 0\) — no vote at all. 
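The arithmetic of the worked example above, including the claim that renormalizing puts exactly half of the weight on the errors, can be reproduced in a few lines. This is a sketch under a stated assumption: the ten uniformly weighted points with three mistakes are a hypothetical setup chosen so that the weighted error is 0.30.

```python
import numpy as np

eps = 0.30                                # weighted error of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)     # EQ M15.4 vote weight
factor_correct = np.exp(-alpha)           # multiplier when y_i * h_m(x_i) = +1
factor_wrong = np.exp(+alpha)             # multiplier when y_i * h_m(x_i) = -1

# Hypothetical round: 10 points with uniform weights, 3 of them misclassified.
w = np.full(10, 0.1)
wrong = np.zeros(10, dtype=bool)
wrong[:3] = True

w_new = w * np.where(wrong, factor_wrong, factor_correct)
w_new = w_new / w_new.sum()               # renormalize to a distribution
error_mass = w_new[wrong].sum()

print(f"alpha = {alpha:.4f}")
print(f"correct x{factor_correct:.3f}, wrong x{factor_wrong:.3f}, "
      f"ratio {factor_wrong / factor_correct:.3f}")
print(f"weight on misclassified points after renormalizing: {error_mass:.3f}")
```

The balance at one half is not special to \(\varepsilon = 0.30\): the two unnormalized masses are \(\varepsilon\sqrt{(1-\varepsilon)/\varepsilon}\) and \((1-\varepsilon)\sqrt{\varepsilon/(1-\varepsilon)}\), and both equal \(\sqrt{\varepsilon(1-\varepsilon)}\).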
PYTHON · RUNNABLE IN-BROWSER # AdaBoost weight-update demo on toy 1-D data (decision stumps) import numpy as np x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10.0]) y = np.array([1, 1, 1,-1, 1,-1,-1, 1,-1,-1.0]) # not linearly separable by a stump w = np.full(y.size, 1.0 / y.size) # uniform start def best_stump(x, y, w): # weighted-error optimal stump best = (1.0, 0.0, 1) # (err, threshold, polarity) for t in (x[:-1] + x[1:]) / 2: for s in (+1, -1): pred = np.where(x < t, s, -s) err = w[pred != y].sum() if err < best[0]: best = (err, t, s) return best print(" round eps alpha max-weight") for m in range(3): eps, t, s = best_stump(x, y, w) eps = min(max(eps, 1e-9), 1 - 1e-9) alpha = 0.5 * np.log((1 - eps) / eps) # EQ M15.4 vote weight pred = np.where(x < t, s, -s) w = w * np.exp(-alpha * y * pred) # reweight: errors up, correct down w = w / w.sum() # renormalize to a distribution print(f" {m+1} {eps:.3f} {alpha:+.3f} {w.max():.3f}") print("\nweights after 3 rounds:", np.round(w, 3)) print("hard examples now carry the most weight (largest entries).") RUN ▶ edits are live — flip a label in y and watch its weight balloon 15.3 XGBoost — regularization & second-order gradients XGBoost (Chen & Guestrin, 2016) took Friedman's algorithm and made it production-grade. Two ideas matter most. First, it adds an explicit regularization term to the objective, penalizing trees that grow too many leaves or assign too-large leaf values. Second, it uses a second-order Taylor expansion of the loss — gradients and Hessians — so each tree solves a closer approximation of the true objective than first-order gradient boosting does. 
EQ M15.6 — REGULARIZED SECOND-ORDER OBJECTIVE $$ \mathcal{L}^{(m)} \approx \sum_{i=1}^{n}\Bigl[g_i\,h_m(x_i) + \tfrac12 h_i\,h_m(x_i)^2\Bigr] + \gamma T + \tfrac12\lambda\sum_{j=1}^{T} w_j^2 $$ \(g_i = \partial_F L\) and \(h_i = \partial_F^2 L\) are the first and second derivatives of the loss at the current prediction; \(T\) is the number of leaves, \(w_j\) their values. \(\gamma\) charges a fixed cost per leaf (so a split must earn its keep), and \(\lambda\) is an \(L_2\) penalty on leaf values (shrinking them toward zero). The Hessian \(h_i\) lets XGBoost weight each example by how curved the loss is there — confident-but-wrong points get more pull — which is why it converges in fewer trees than plain first-order GBM. Because the objective is now a sum of independent per-leaf quadratics, the optimum has a closed form. For a leaf holding instance set \(I_j\) with gradient sum \(G_j=\sum_{i\in I_j} g_i\) and Hessian sum \(H_j=\sum_{i\in I_j} h_i\), the best leaf value and the resulting loss reduction are: EQ M15.7 — OPTIMAL LEAF WEIGHT & SPLIT GAIN $$ w_j^\star = -\frac{G_j}{H_j + \lambda}, \qquad \text{Gain} = \tfrac12\!\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma $$ The leaf value is just the negative gradient sum, damped by the Hessian and \(\lambda\). The Gain scores a candidate split: the structure-score of the two children minus that of the parent, less the per-leaf cost \(\gamma\). XGBoost grows trees by greedily taking the highest-Gain split and prunes any split whose Gain is negative — \(\gamma\) is a built-in pre-pruning knob. This single formula is the engine behind every XGBoost split. WORKED EXAMPLE ▾ 01 A leaf collects gradient sum \(G = -10\) and Hessian sum \(H = 5\); take regularization \(\lambda = 1\). 02 Optimal leaf value: \(w^\star = -G/(H+\lambda) = -(-10)/(5+1) = 10/6 = 1.667\). 03 Raise \(\lambda\) to 9: \(w^\star = 10/(5+9) = 0.714\). 
Larger \(\lambda\) pulls the leaf value toward zero — that is the \(L_2\) penalty doing its job. 04 For a split with \(G_L=-6, H_L=3, G_R=-4, H_R=2\) (and \(\lambda=1, \gamma=0\)): Gain \(= \tfrac12[\,36/4 + 16/3 - 100/6\,] = \tfrac12[9 + 5.333 - 16.667] = \tfrac12(-2.333) = -1.167\). Negative — so XGBoost would reject this split. RESULT: G=−10, H=5, λ=1 → w* = 1.667 An XGBoost leaf has gradient sum \(G = -10\) and Hessian sum \(H = 5\), with \(L_2\) penalty \(\lambda = 1\). By EQ M15.7, what is its optimal leaf value \(w^\star\)? (Give a decimal.) \(w^\star = -\dfrac{G}{H+\lambda} = -\dfrac{-10}{5+1} = \dfrac{10}{6} = \) 1.667. With no regularization (\(\lambda = 0\)) it would be \(10/5 = 2.0\); the penalty shrinks the leaf toward zero, trading a little fit for stability. Beyond the formula. XGBoost also ships a sparsity-aware split finder (it learns a default direction for missing values rather than imputing), an approximate histogram-based split mode for large data, column and row subsampling à la random forests, and shrinkage on top — so EQ M15.1's \(\nu\) and EQ M15.6's \(\lambda,\gamma\) all coexist as separate regularizers. It was the algorithm that, for several years, won the majority of Kaggle's tabular competitions, and it remains the field's reference implementation. 
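EQ M15.6 leaves \(g_i\) and \(h_i\) abstract. As one concrete case, the sketch below assumes binary log-loss on a log-odds score, for which the standard derivatives are \(g = p - y\) and \(h = p(1-p)\), and feeds them into the leaf-value and Gain formulas of EQ M15.7. The function names and the four-row toy leaf are hypothetical, and this is plain NumPy rather than a call into the XGBoost library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss_grad_hess(y, F):
    # First and second derivatives of binary log-loss w.r.t. the log-odds F.
    p = sigmoid(F)
    return p - y, p * (1.0 - p)

def leaf_value(G, H, lam):
    # EQ M15.7: optimal leaf weight.
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma=0.0):
    # EQ M15.7: structure score of the children minus the parent, less gamma.
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy leaf: four rows, current prediction is log-odds 0 (p = 0.5) everywhere.
y = np.array([1.0, 1.0, 0.0, 1.0])
F = np.zeros(4)
g, h = logloss_grad_hess(y, F)
w_leaf = leaf_value(g.sum(), h.sum(), lam=1.0)
print("g =", g, " h =", h, f" leaf value = {w_leaf:.3f}")

# The numbers from the worked example above.
print(f"w*(G=-10, H=5, lam=1) = {leaf_value(-10, 5, 1.0):.3f}")
print(f"w*(G=-10, H=5, lam=9) = {leaf_value(-10, 5, 9.0):.3f}")
print(f"gain of the example split = {split_gain(-6, 3, -4, 2, lam=1.0):.3f}")
```

In the toy leaf three of four labels are positive, so the gradient sum is negative and the leaf value is positive: the leaf pushes the log-odds up, damped by the Hessian sum and \(\lambda\).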
PYTHON · RUNNABLE IN-BROWSER # XGBoost leaf math (EQ M15.7): leaf value and split gain, in pure numpy import numpy as np # per-instance first/second-order grads for a node (toy values) g = np.array([-2.0, -3.0, -5.0, 1.0, 2.0, 3.0]) h = np.array([ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]) lam, gamma = 1.0, 0.0 def leaf_value(G, H): return -G / (H + lam) def score(G, H): return G * G / (H + lam) # structure score G, H = g.sum(), h.sum() print(f"whole node: G={G:.1f} H={H:.1f} w*={leaf_value(G,H):.4f}") best = (-np.inf, None) for s in range(1, len(g)): # try each ordered split GL, HL = g[:s].sum(), h[:s].sum() GR, HR = g[s:].sum(), h[s:].sum() gain = 0.5 * (score(GL,HL) + score(GR,HR) - score(G,H)) - gamma print(f"split after idx {s}: gain={gain:+.4f} " f"wL={leaf_value(GL,HL):+.3f} wR={leaf_value(GR,HR):+.3f}") if gain > best[0]: best = (gain, s) print(f"\nbest split is after index {best[1]} with gain {best[0]:.4f}") RUN ▶ edits are live — raise gamma until the best gain goes negative (no split) INSTRUMENT M15.2 — LEARNING-RATE × N_ESTIMATORS TRAIN VS HELD-OUT · THE SHRINKAGE TRADE-OFF LEARNING RATE ν 0.10 N_ESTIMATORS 200 TRAIN LOSS — HELD-OUT LOSS — EFFECTIVE WORK ν·M — Mint = training loss, blue = held-out loss, the white line marks your current tree budget \(M\). Small \(\nu\) with many trees rides the held-out minimum down and sits there; large \(\nu\) drops training loss fast but the held-out curve turns up — overfitting. The classic recipe falls out visually: lower the learning rate, add trees, and stop early at the held-out minimum. Notice \(\nu\) and \(M\) trade off — halving \(\nu\) needs roughly twice the trees to reach the same fit. 15.4 LightGBM — histograms & leaf-wise growth LightGBM (Ke et al., 2017) keeps XGBoost's objective but rebuilds the machinery for speed and memory at scale. Three engineering bets define it. Histogram binning. Finding the best split by scanning every distinct feature value is \(O(n\,d)\) per level and dominated by sorting. 
LightGBM instead buckets each feature into a fixed number of bins (default 255) once, up front, then builds histograms of gradient and Hessian sums per bin. Split-finding becomes a cheap scan over bins, not rows: EQ M15.8 — HISTOGRAM SPLIT FINDING $$ \text{cost: } O(n\,d) \;\longrightarrow\; O\bigl(\#\text{bins}\cdot d\bigr), \qquad G_{\text{bin } b} = \!\!\sum_{i:\, x_i \in b}\!\! g_i,\quad H_{\text{bin } b} = \!\!\sum_{i:\, x_i \in b}\!\! h_i $$ Per node, accumulate each example's \((g_i,h_i)\) into its bin, then evaluate the EQ M15.7 Gain at the \(\#\text{bins}-1\) candidate cut points. With \(\#\text{bins}=255 \ll n\), this is dramatically faster and uses far less memory, at the cost of a slightly coarser split grid. A further trick — histogram subtraction — computes a child's histogram as parent minus sibling, halving the work again. This binning is what made boosting practical on tens of millions of rows. Leaf-wise growth. XGBoost grows trees level-wise (split every node at a depth before going deeper). LightGBM grows leaf-wise: at each step it splits the single leaf with the largest Gain, anywhere in the tree. For a fixed number of leaves this lowers loss faster — but it produces deep, unbalanced trees that overfit small data, so LightGBM caps growth with num_leaves and max_depth rather than depth alone. EQ M15.9 — LEAF-WISE VS LEVEL-WISE $$ \text{leaf-wise: split } \arg\max_{\ell \in \text{leaves}} \text{Gain}(\ell), \qquad \text{controlled by } \texttt{num\_leaves}\ (\le 2^{\texttt{max\_depth}}) $$ Level-wise keeps trees balanced and shallow; leaf-wise chases the steepest available descent, so it reaches a lower training loss with the same leaf budget but is more prone to overfit. The key tuning rule: set num_leaves meaningfully below \(2^{\texttt{max\_depth}}\). 
LightGBM also adds GOSS (keep large-gradient rows, subsample small-gradient ones) and EFB (bundle mutually-exclusive sparse features): Gradient-based One-Side Sampling and Exclusive Feature Bundling, the two techniques the LightGBM paper introduced. LightGBM bins a continuous feature into 256 histogram bins. How many distinct interior split (cut) points does the histogram offer for that feature? (Give an integer.) A feature divided into \(b\) bins has \(b - 1\) boundaries between adjacent bins, and each boundary is a candidate threshold. With \(b = 256\): \(256 - 1 = \) 255 candidate cut points — instead of one per distinct value, which is what makes EQ M15.8 cheap. A LightGBM model uses num_leaves = 8 with max_depth = 10. What fraction of the depth-10 capacity \(2^{\texttt{max\_depth}}\) does it actually use? (Give a decimal.) Capacity is \(2^{10} = 1024\) leaves; the model uses 8. Fraction \(= 8 / 1024 = \) 0.0078125. Keeping num_leaves far below \(2^{\texttt{max\_depth}}\) (EQ M15.9) is the standard guard against leaf-wise overfitting — a deep but narrow tree.
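The histogram-subtraction trick mentioned with EQ M15.8 is easy to verify in isolation. The following is a minimal NumPy sketch, not LightGBM's internal code: the pre-binned feature, the gradients, and the rule sending roughly 30% of rows to the left child are all made-up assumptions. It builds the parent histogram and the smaller child's histogram by scanning rows, then obtains the sibling by subtraction and checks it against a direct scan.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nbins = 1000, 16
bins = rng.integers(0, nbins, n)      # hypothetical pre-binned feature values
g = rng.standard_normal(n)            # hypothetical per-row gradients
in_left = rng.random(n) < 0.3         # hypothetical split: ~30% of rows go left

def grad_hist(mask):
    # Sum the gradients of the selected rows into their bins.
    return np.bincount(bins[mask], weights=g[mask], minlength=nbins)

parent = grad_hist(np.ones(n, dtype=bool))   # already known from the level above
left = grad_hist(in_left)                    # scan only the smaller child
right_by_subtraction = parent - left         # sibling costs O(#bins), not O(rows)
right_direct = grad_hist(~in_left)           # the expensive way, for comparison

print("max difference:", np.abs(right_by_subtraction - right_direct).max())
```

The same subtraction applies to the Hessian histogram, so only the smaller child ever needs a pass over rows.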
PYTHON · RUNNABLE IN-BROWSER # LightGBM histogram split (EQ M15.8): bin once, then scan bins not rows import numpy as np rng = np.random.default_rng(2) n = 20000 x = rng.uniform(0, 1, n) # one feature g = (x - 0.6) + 0.1 * rng.standard_normal(n) # toy gradients (sign flips near 0.6) h = np.ones(n) lam = 1.0 nbins = 256 edges = np.linspace(0, 1, nbins + 1) b = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1) # bin index per row Gb = np.bincount(b, weights=g, minlength=nbins) # gradient histogram Hb = np.bincount(b, weights=h, minlength=nbins) # hessian histogram G, H = Gb.sum(), Hb.sum() GL = np.cumsum(Gb)[:-1]; HL = np.cumsum(Hb)[:-1] # left side at each cut GR, HR = G - GL, H - HL gain = 0.5 * (GL**2/(HL+lam) + GR**2/(HR+lam) - G**2/(H+lam)) cut = np.argmax(gain) print(f"rows={n}, but only {nbins} bins scanned to find the split") print(f"best cut at bin {cut} ~ x = {edges[cut+1]:.3f} (true sign flip at 0.600)") print(f"best gain = {gain[cut]:.2f}") plot_xy(list(range(len(gain))), gain.tolist()) RUN ▶ edits are live — drop nbins to 16 and watch the cut get coarser 15.5 CatBoost — ordered boosting & native categoricals CatBoost (Prokhorenkova et al., 2018) targets a subtle bug that the others share: target leakage. It shows up in two places — when you encode a categorical feature using the target, and when you compute residuals — and CatBoost's signature move, ordered boosting, is one mechanism that fixes both. Ordered target statistics. A natural way to turn a categorical value (say, a city) into a number is to replace it with the average target for that category — target or mean encoding. Done naïvely, this leaks: the row's own label is in the average used to encode it, so the model peeks at the answer. 
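The leak described here can be made visible on data where the category carries no signal at all. The sketch below is illustrative only and makes several assumptions: the labels are pure noise, the rows are treated as already randomly ordered, and the smoothing constants `prior` and `prior_weight` are hypothetical names rather than CatBoost parameters. It compares the naive category mean, which includes each row's own label, with a variant that uses only rows seen earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_cat = 2000, 500
cat = rng.integers(0, n_cat, n)              # hypothetical category id per row
y = rng.integers(0, 2, n).astype(float)      # labels are pure noise: no real signal

# Naive target encoding: category mean over ALL rows, own label included.
sums = np.bincount(cat, weights=y, minlength=n_cat)
counts = np.bincount(cat, minlength=n_cat)
naive = sums[cat] / counts[cat]

# Ordered variant: only earlier rows contribute to a row's encoding.
prior, prior_weight = 0.5, 1.0               # illustrative smoothing constants
run_sum = np.zeros(n_cat)
run_cnt = np.zeros(n_cat)
ordered = np.empty(n)
for i in range(n):
    c = cat[i]
    ordered[i] = (run_sum[c] + prior_weight * prior) / (run_cnt[c] + prior_weight)
    run_sum[c] += y[i]                       # the row's label is added only afterwards
    run_cnt[c] += 1

corr_naive = np.corrcoef(naive, y)[0, 1]
corr_ordered = np.corrcoef(ordered, y)[0, 1]
print(f"correlation with label, naive encoding:   {corr_naive:+.3f}")
print(f"correlation with label, ordered encoding: {corr_ordered:+.3f}")
```

Since the labels are independent of the category, any correlation the naive encoding shows with the label comes from the row's own label sitting inside its average.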
CatBoost computes the statistic using only the rows that came before in a random permutation: EQ M15.10 — ORDERED TARGET STATISTIC $$ \hat{x}_i \;=\; \frac{\displaystyle\sum_{j 0).astype(int) def fit_predict(Xtr, ytr, Xte): # ridge-ish least-squares classifier w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1)) return (Xte @ w > 0).astype(int) accs = [] for _ in range(200): # vary ONLY the split, nothing else perm = rng.permutation(N) te, tr = perm[:80], perm[80:] # 80-row test set each time pred = fit_predict(X[tr], y[tr], X[te]) accs.append((pred == y[te]).mean()) accs = np.array(accs) print(f"holdout accuracy ranges over {accs.min():.3f}.. {accs.max():.3f}") print(f"mean = {accs.mean():.3f} std across splits = {accs.std():.3f}") print(f"so two single splits can disagree by ~{accs.max()-accs.min():.2f} on luck alone.") plot_xy(np.arange(len(accs)), np.sort(accs)) # sorted: the spread you'd never see once RUN ▶ edits are live — break it on purpose 1.2 k-fold cross-validation k-fold cross-validation partitions the data into \(k\) equal, disjoint folds. It then runs \(k\) experiments: in round \(i\), fold \(i\) is the validation set and the other \(k-1\) folds are the training set. Every row is validated exactly once. The cross-validation estimate is the average of the \(k\) fold scores: EQ V1.3 — THE k-FOLD CV ESTIMATE $$ \widehat{\mathrm{CV}} = \frac{1}{k}\sum_{i=1}^{k} \frac{1}{|F_i|}\sum_{(x,y)\in F_i} L\big(y,\, \hat{f}^{\,(-i)}(x)\big), \qquad \widehat{\mathrm{SE}} = \frac{s}{\sqrt{k}} $$ \(F_i\) is the \(i\)-th fold and \(\hat{f}^{\,(-i)}\) is the model trained on everything except \(F_i\). The estimate \(\widehat{\mathrm{CV}}\) is the mean of the \(k\) fold scores; \(s\) is their sample standard deviation, and \(s/\sqrt{k}\) is the usual standard error of that mean. Averaging \(k\) estimates is what buys the error bars the single split could not give you. 
A caveat experts insist on: the \(k\) fold scores are not independent (their training sets overlap heavily), so \(s/\sqrt{k}\) understates the true uncertainty — treat it as a useful indicator, not a calibrated interval. The choice of \(k\) is a bias–variance dial. Small \(k\) (e.g. 2) trains each model on much less data, so each fold model is weaker and \(\widehat{\mathrm{CV}}\) is pessimistically biased. Large \(k\) trains on almost all the data — at \(k = N\) you get leave-one-out CV (LOOCV), nearly unbiased but with \(N\) tightly correlated, high-variance fold scores and \(N\) model fits. The empirical sweet spot, established by Kohavi's classic study and unchanged in 2026, is \(k = 5\) or \(k = 10\): low enough bias, manageable variance, affordable compute.

| k | Train size per fold | Bias of estimate | Variance / cost |
| --- | --- | --- | --- |
| 2 | N / 2 | high (pessimistic) | low cost, low variance |
| 5 | 0.8 N | small | the common default |
| 10 | 0.9 N | smaller | 2× the cost of k = 5 |
| N (LOOCV) | N − 1 | ~unbiased | N fits; high variance |

The total compute is exactly \(k\) model fits, each on a fraction \((k-1)/k\) of the data. That \(k\)-fold multiplier is the price of the error bars, and it is why §1.5's nested scheme — CV inside CV — is the expensive-but-honest end of the spectrum. You run 5-fold cross-validation on a dataset of \( N = 100 \) rows. With equal folds, how many rows are in the validation set of each fold (\( N/k \))? k-fold splits the data into \(k\) equal disjoint folds, and each fold is the validation set exactly once. So each fold holds \( N/k = 100/5 = \) 20 rows for validation, leaving the other 80 for training that round. PYTHON · RUNNABLE IN-BROWSER # k-fold CV from scratch in numpy: report mean +/- std of the metric.
import numpy as np rng = np.random.default_rng(1) N, d, k = 300, 6, 5 X = rng.normal(0, 1, (N, d)) w_true = rng.normal(0, 1, d) y = ((X @ w_true + rng.normal(0, 1.0, N)) > 0).astype(int) def fit_predict(Xtr, ytr, Xte): w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1)) return (Xte @ w > 0).astype(int) idx = rng.permutation(N) # shuffle once, then cut into k folds folds = np.array_split(idx, k) # k disjoint, near-equal index blocks scores = [] for i in range(k): val = folds[i] tr = np.concatenate([folds[j] for j in range(k) if j != i]) pred = fit_predict(X[tr], y[tr], X[val]) acc = (pred == y[val]).mean() scores.append(acc) print(f"fold {i+1}: train {tr.size:3d} val {val.size:3d} acc {acc:.3f}") scores = np.array(scores) se = scores.std(ddof=1) / np.sqrt(k) # EQ V1.3 (optimistic: folds correlate) print(f"\nCV accuracy = {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} (std)") print(f" = {scores.mean():.3f} +/- {se:.3f} (std error of the mean)") print("One number with a band -- not a point estimate pretending to be the truth.") RUN ▶ edits are live — break it on purpose INSTRUMENT V1.1 — FOLD VISUALIZER & VARIANCE SINGLE SPLIT vs k-FOLD · EQ V1.3 NUMBER OF FOLDS k 5 ESTIMATOR SINGLE SPLIT k-FOLD RESHUFFLE ▶ CV / HOLDOUT ESTIMATE — SPREAD ACROSS RESHUFFLES (STD) — MODELS FIT — The bar of 30 cells is your dataset; mint cells are validation, grey are training, one row per fold. Press RESHUFFLE a dozen times and watch the right-hand readout. In SINGLE SPLIT the estimate jumps around wildly between reshuffles — the coin flip of §1.1. Switch to k-FOLD: the same reshuffles now barely move the averaged estimate, because the \(k\) folds cancel each other's luck. Raise \(k\) to shrink the spread further, at the cost of more model fits. 1.3 Stratified & grouped k-fold Plain k-fold shuffles rows and cuts blindly. That fails in two common situations, and both have a fix that costs nothing but a smarter partition. 
Stratified k-fold: preserve the class balance On a 1% positive fraud dataset, a random fold can easily land with zero positives — making its score meaningless and inflating the variance across folds. Stratified k-fold partitions within each class so every fold mirrors the overall label distribution: EQ V1.4 — STRATIFICATION CONSTRAINT $$ \frac{|\{(x,y)\in F_i: y = c\}|}{|F_i|} \;\approx\; \frac{|\{(x,y)\in \mathcal{D}: y = c\}|}{N} \quad\text{for every fold } F_i \text{ and class } c $$ Each fold's class proportions match the dataset's, up to rounding. For classification, stratification is the default, not an option — it removes a needless source of fold-to-fold variance and is essential under class imbalance, where a non-stratified fold may contain no minority examples at all. The same idea extends to regression by stratifying on binned targets. Grouped k-fold: respect dependence between rows If multiple rows share a hidden identity — several visits from one patient, many frames of one video, repeated measurements of one sensor — then a row in training and its sibling in validation creates leakage: the model effectively sees the answer. Grouped k-fold keeps every group entirely on one side of each split, so no group straddles the train/validation boundary. LEAKAGE The most expensive bug in applied ML is a leak you cannot see. If patient #42 has rows in both the training fold and the validation fold, your reported accuracy measures memorization of patient #42, not generalization to new patients — and it will collapse in production. The same trap appears with near-duplicate images, augmented copies, and any preprocessing (scaling, imputation, target encoding) fit on the full dataset before splitting. Rule: every transform must be fit inside the training fold only, and grouped splits are mandatory whenever rows are not independent. 
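The grouping rule can be sketched in plain NumPy: partition the group ids, not the rows, and derive each fold's rows from its groups. This is a minimal illustration under assumptions, with a made-up `groups` array standing in for something like a patient id; library splitters additionally try to balance fold sizes, which this sketch does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_groups, k = 200, 30, 5
groups = rng.integers(0, n_groups, n_rows)   # hypothetical group id (e.g. patient) per row

# Shuffle the distinct group ids and cut THEM into k folds.
unique_groups = rng.permutation(np.unique(groups))
group_folds = np.array_split(unique_groups, k)

splits = []
for held_out in group_folds:
    val_mask = np.isin(groups, held_out)     # every row of a held-out group goes to validation
    splits.append((np.where(~val_mask)[0], np.where(val_mask)[0]))

# Leakage check: number of groups appearing on both sides of each split.
overlaps = [len(set(groups[tr]) & set(groups[va])) for tr, va in splits]
for i, ((tr, va), o) in enumerate(zip(splits, overlaps), start=1):
    print(f"fold {i}: train rows {tr.size:3d}  val rows {va.size:3d}  shared groups {o}")
```

Fold sizes in rows come out uneven because groups differ in size; that is the price of keeping each group intact.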
These choices compose: stratified group k-fold keeps groups intact and balances classes across folds, the standard recipe for imbalanced, clustered data. The honest caveat: when groups are few and uneven, perfect stratification and perfect grouping can conflict, and you accept an approximate balance. A dataset has \( N = 20{,}000 \) rows with a \( 1\% \) positive rate. How many positive rows are there in total (\( 0.01 \times N \))? \( 0.01 \times 20{,}000 = \) 200 positives. With only 200 positives spread across folds, a blind random split can hand one fold noticeably fewer than its share (and, on smaller datasets, none at all), which is exactly the failure mode stratification (EQ V1.4) is built to prevent. For that same dataset (200 positives total), under 5-fold stratified CV, how many positives sit in each validation fold (\( 200/5 \))? Stratification forces each fold to carry the dataset's class proportions, so the positives are divided evenly: \( 200 / 5 = \) 40 per fold. Every fold therefore has enough minority examples to produce a meaningful score — the whole point of EQ V1.4. PYTHON · RUNNABLE IN-BROWSER # Stratified vs blind folds on a 5%-positive set: blind folds vary wildly. import numpy as np rng = np.random.default_rng(3) N, k = 1000, 5 y = (rng.random(N) < 0.05).astype(int) # 5% positives -> ~50 of them print(f"dataset positives: {y.sum()} / {N} ({100*y.mean():.1f}%)\n") def blind_folds(idx): return np.array_split(rng.permutation(idx), k) def stratified_folds(y): folds = [[] for _ in range(k)] for c in (0, 1): # deal each class round-robin into folds members = rng.permutation(np.where(y == c)[0]) for j, row in enumerate(members): folds[j % k].append(row) return [np.array(f) for f in folds] print("blind fold positive-rates:", end=" ") for f in blind_folds(np.arange(N)): print(f"{y[f].mean():.3f}", end=" ") print("\nstrat. 
fold positive-rates:", end=" ") for f in stratified_folds(y): print(f"{y[f].mean():.3f}", end=" ") print("\n\nBlind folds scatter (one may even hit 0.00 -> a useless fold);") print("stratified folds all sit near the 0.05 base rate, by construction.") RUN ▶ edits are live — break it on purpose 1.4 Time-series cross-validation Everything above assumes the rows are exchangeable — that shuffling is harmless. For temporally ordered data it is not. Shuffling lets the model train on the future and validate on the past, which is impossible at deployment and produces gloriously optimistic, completely fake scores. The cardinal rule of temporal validation is brutal and simple: EQ V1.5 — THE FORWARD-CHAINING CONSTRAINT $$ \max_{t \in \text{train}_i} t \; Every timestamp used for training must precede every timestamp used for validation, in every fold. This is forward chaining (also "walk-forward" or "rolling-origin" validation): the validation window always lives strictly in the future relative to its training window. Standard k-fold violates this on roughly half of its train/validation pairs and is therefore invalid for any series with temporal structure. A further refinement inserts an embargo / purge gap between train and validation to kill leakage from overlapping feature windows or label horizons (the López de Prado correction for financial data). Two schemes both satisfy EQ V1.5; they differ in what they do with old data: Expanding window. The training set grows each fold — every split keeps all history up to the cut and validates on the next block. Uses all data; assumes the past stays relevant; training cost grows over folds. Rolling (sliding) window. The training set is a fixed-length window that slides forward, dropping the oldest data as it adds new. Constant training cost, and — more importantly — it adapts to non-stationarity and concept drift, where ancient history actively misleads. 
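Both schemes, plus the embargo gap, fit in one small generator. The sketch below is an illustration under stated assumptions: the function name `forward_splits` and its parameters `gap` and `window` are hypothetical, and the validation block size follows the same `n // (k + 1)` convention as this chapter's own demo code.

```python
def forward_splits(n, k, gap=0, window=None):
    """Yield (train, val) index ranges over n time-ordered rows.

    window=None gives an expanding training set; window=w keeps only the
    last w rows (rolling). The `gap` rows just before each validation
    block are discarded as an embargo.
    """
    fold = n // (k + 1)                      # size of each validation block
    for i in range(1, k + 1):
        val_start = i * fold
        train_end = val_start - gap          # embargo: drop `gap` rows before validation
        train_start = 0 if window is None else max(0, train_end - window)
        if train_end <= train_start:
            continue                         # the gap swallowed the whole training window
        yield range(train_start, train_end), range(val_start, val_start + fold)

# Rolling window of at most 8 rows with a 1-row embargo, on 24 ordered periods.
splits = list(forward_splits(n=24, k=5, gap=1, window=8))
for train, val in splits:
    assert max(train) < min(val)             # EQ V1.5: training never reaches validation
    print(f"train {train.start:2d}..{train.stop - 1:2d}   "
          f"val {val.start:2d}..{val.stop - 1:2d}")
```

Calling `forward_splits(24, 5)` with the defaults gives the expanding scheme with no embargo. The early rolling folds are shorter than `window` simply because less history exists yet.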
Which to prefer is genuinely contested and data-dependent: expanding windows win when the process is stable and data is scarce; rolling windows win when the world is drifting. Either way you typically report the average score across the forward-chained folds, exactly as in EQ V1.3, just with splits that never look ahead.

INSTRUMENT V1.2 — TIME-SERIES SPLIT VISUALIZER (EXPANDING vs ROLLING · FORWARD-CHAINED · EQ V1.5)
Controls: SPLITS, EMBARGO GAP, WINDOW (EXPANDING / ROLLING). Readouts: scheme, forward-chained?, train blocks from fold 1 to last.
Time runs left → right across 24 ordered periods, one row per fold. Grey is training, mint is validation, and any blue cell is the embargo gap that is thrown away to prevent leakage. Notice that validation is always to the right of training: the future is never used to predict the past. Switch to ROLLING and the grey training block becomes a fixed-width window that slides forward, forgetting the oldest data; EXPANDING keeps accumulating it. Raise the embargo to punch a blue moat between the two.

In time-series cross-validation, the training data must always come before the validation data in time (no future rows in training). True or false? (Answer true or false.)

This is the forward-chaining constraint of EQ V1.5: \(\max_{t\in\text{train}} t < \min_{t\in\text{val}} t\), so the statement is true.

PYTHON · RUNNABLE IN-BROWSER

    # Forward-chained splits: expanding vs rolling. Verify NO split looks ahead.
    import numpy as np
    N, k = 24, 5
    order = np.arange(N)                  # already time-ordered: 0 = oldest
    fold = N // (k + 1)                   # size of each validation block
    roll_train = 2 * fold                 # fixed window width for the rolling scheme

    print("EXPANDING window (training set grows):")
    ok = True
    for i in range(1, k + 1):
        tr = order[: i * fold]
        va = order[i * fold: (i + 1) * fold]
        leak = tr.max() >= va.min()
        ok &= not leak
        print(f"  fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d})  "
              f"val {va.min():2d}..{va.max():2d}   leak? {leak}")

    print("\nROLLING window (fixed width, slides forward):")
    for i in range(1, k + 1):
        end = i * fold
        tr = order[max(0, end - roll_train): end]
        va = order[end: end + fold]
        if va.size == 0:
            break
        leak = tr.max() >= va.min()
        ok &= not leak
        print(f"  fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d})  "
              f"val {va.min():2d}..{va.max():2d}   leak? {leak}")

    print(f"\nany split that trained on the future? {not ok}  "
          "(EQ V1.5 holds <=> this is False)")

1.5 Nested CV for honest tuning

Here is the subtle, costly mistake that even careful practitioners make. You run k-fold CV, try a hundred hyperparameter settings, pick the one with the best CV score, and report that score as the model's performance. That number is biased upward, sometimes badly. You used the validation folds twice: once to tune and once to report. Selecting the maximum over many noisy estimates is selecting partly for noise, so the winner's CV score is an optimistic estimate of its true error. This is the cross-validation cousin of the multiple-comparisons problem (STATS · §4.6).

EQ V1.6 — THE OPTIMISM OF SELECTION
$$ \mathbb{E}\Big[\min_{\theta\in\Theta}\widehat{\mathrm{CV}}(\theta)\Big] \;\le\; \min_{\theta\in\Theta}\,\mathbb{E}\big[\widehat{\mathrm{CV}}(\theta)\big] \qquad\text{(Jensen / max-of-noisy-estimates)} $$
The expected score of the selected configuration is better (lower error) than the true error of the best configuration. The gap is pure selection bias, and it grows with the number of configurations tried \(|\Theta|\) and with the noise in each estimate. The fold you select on can no longer give an unbiased estimate of performance.

The fix is to wall off an estimation set the selection never touches. Nested cross-validation does exactly that with two loops. The outer loop's folds are used only to estimate performance. Inside each outer training set, a separate inner CV loop performs the entire hyperparameter search and refits the chosen model.
The outer fold, never seen by the inner search, then scores it. Because selection and evaluation use disjoint data, the outer score is an honest estimate of the whole pipeline, tuning included.

INSTRUMENT V1.3 — NESTED CV STRUCTURE (OUTER = SCORE · INNER = SELECT · EQ V1.6)
Controls: OUTER FOLDS, INNER FOLDS, HIGHLIGHT OUTER FOLD. Readouts: total model fits (outer × inner × grid), whether the outer score is unbiased.
The top band is one highlighted outer split: grey = outer-train, mint = the outer-test fold that is sealed away. The lower bands show the inner CV that runs inside the outer-train portion to pick hyperparameters and never touches the mint band. Drag HIGHLIGHT OUTER FOLD to step through outer rounds. The fit-count readout makes the cost concrete: nested CV runs (outer × inner × grid-size) fits, which is why people reach for it only when an honest number actually matters.

PYTHON · RUNNABLE IN-BROWSER

    # Optimistic bias of tuning on the test fold vs nested CV (pure noise data).
    import numpy as np
    rng = np.random.default_rng(7)
    N, k, G = 120, 5, 40                  # G = number of hyperparameter settings tried
    y = rng.integers(0, 2, N)             # labels are PURE NOISE: true acc = 0.50

    # Each "config" is a random predictor independent of y -> all truly ~50% accurate.
    def config_preds(seed, idx):          # deterministic per (config, number of rows)
        r = np.random.default_rng(seed)
        return r.integers(0, 2, len(idx))

    def cv_acc(g, idx):                   # k-fold accuracy of config g on rows idx
        folds = np.array_split(rng.permutation(idx), k)
        accs = [(config_preds(g, f) == y[f]).mean() for f in folds]
        return np.mean(accs)

    # WRONG: tune AND report on the same CV -> pick the max over G noisy 0.5s.
    flat = [cv_acc(g, np.arange(N)) for g in range(G)]
    naive = max(flat)

    # NESTED: inner CV selects the best config; the held-out outer fold scores it.
    outer = np.array_split(rng.permutation(np.arange(N)), k)
    nested = []
    for i in range(k):
        test = outer[i]
        train = np.concatenate([outer[j] for j in range(k) if j != i])
        best = max(range(G), key=lambda g: cv_acc(g, train))         # select on inner data
        nested.append((config_preds(best, test) == y[test]).mean())  # score on sealed fold

    print(f"truth (labels are noise):        0.500")
    print(f"naive 'best CV' (tune==report):  {naive:.3f}   <- optimistic: > 0.5 on noise")
    print(f"nested CV outer mean:            {np.mean(nested):.3f}")

When is the full nested machinery worth it? When you must report a trustworthy performance number after tuning: a benchmark, a paper, a go/no-go decision. For the cheaper everyday workflow, a fixed three-way split (train / validation / test) approximates one outer fold: tune on validation, report once on the untouched test set. Nested CV is simply that idea applied \(k\) times so the honest estimate itself gets error bars. The cost (outer × inner × grid model fits) is the reason it is reserved for when honesty is non-negotiable.

NEXT
Cross-validation tells you how to score a configuration honestly; it does not tell you which configurations to try. The inner loop of nested CV was a hand-wave: "search the hyperparameters." Chapter 02 opens that loop: grid and random search, Bayesian optimization, Hyperband and successive halving, and the budget arithmetic that decides how many of those expensive inner fits you can actually afford.

1.R References
Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. B 36(2). The foundational formalization of cross-validation for model assessment.
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995. The empirical case for stratified 10-fold CV (§1.2).
Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection.
BMC Bioinformatics 7:91 — the selection-bias result behind nested CV (EQ V1.6). Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences 191 — forward-chaining validation for temporal data (§1.4). Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys 4 — the comprehensive modern reference on CV variants and their bias/variance. Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning and reporting on the same folds inflates scores. ← PREVIOUS 15 Boosting Libraries NEXT CHAPTER 02 Hyperparameter Tuning AI // ENCYCLOPEDIA — MODEL VALIDATION & RISK · CH 01 FULL CONTENTS ↗ ## MLOPS · Hyperparameter Tuning (https://ai-encyclopedia.com/mlops/02-hyperparameter-tuning.html) Hyperparameter Tuning — AI Encyclopedia AI // ENCYCLOPEDIA / MODEL RISK / 02 / TUNING INDEX NEXT: 03 METRICS → MODEL VALIDATION & RISK · CHAPTER 02 / 07 Hyperparameter Tuning Training fits the model parameters. The hyperparameters, such as learning rate, tree depth, and regularization strength, are set by a search over a validation objective that you define. The shipped model is selected by that search, and random search often outperforms grid search at equal budget. LEVEL CORE READING TIME ≈ 27 MIN BUILDS ON MLOPS 01 · ML 06 INSTRUMENTS GRID vs RANDOM · BAYES OPT · HALVING IN THIS CHAPTER 2.1 The search space & objective 2.2 Grid search 2.3 Random search 2.4 Bayesian optimization 2.5 Hyperband & halving 2.R References 2.1 The search space & the objective A hyperparameter is any setting fixed before training that the optimizer is not allowed to touch: the learning rate of gradient descent, the number of trees and their depth in a forest, the \(C\) of an SVM, the dropout rate of a network. Cross-validation (the previous chapter) tells you how to score one configuration honestly. 
Tuning is the outer question it left open: which configurations should you even try? Frame it as optimization. Let \(\theta \in \Theta\) be a configuration drawn from a search space \(\Theta\), and let \(f(\theta)\) be the validation objective — typically the cross-validated loss, which we want to minimize: EQ V2.1 — THE HYPERPARAMETER OPTIMIZATION PROBLEM $$ \theta^{\star} \;=\; \arg\min_{\theta \in \Theta}\; f(\theta), \qquad f(\theta) \;=\; \widehat{\mathrm{CV}}\big(\theta\big) \;=\; \frac{1}{k}\sum_{i=1}^{k} L\big(y,\, \hat{f}_{\theta}^{\,(-i)}(x)\big) $$ \(\Theta\) is the Cartesian product of every hyperparameter's allowed range; \(f(\theta)\) is the k-fold CV score (MLOPS · EQ V1.3) of training with configuration \(\theta\). Two properties make this hard and define the whole chapter: \(f\) is a black box — no gradient, no formula, you can only sample it — and each sample is expensive, since one evaluation means training the model \(k\) times. Every method below is a different policy for spending a fixed budget of these expensive black-box queries. Three properties of the space drive every design choice. First, scale: a learning rate ranges over orders of magnitude, so you search it on a log scale (\(10^{-5}\) to \(10^{-1}\)), not a linear one — uniform-in-log, where each decade gets equal attention. Second, type: some knobs are continuous (learning rate), some integer (tree depth), some categorical (optimizer ∈ {Adam, SGD}). Third, and most consequential, effective dimensionality: of the dozen hyperparameters you nominally tune, usually only two or three actually move the score. That last fact is the hinge on which random search beats grid search. 
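The first two properties, scale and type, translate directly into how each knob is sampled. The sketch below is illustrative only: the three-knob space, the `max_depth` range of 2 to 10, and the helper name `sample_config` are assumptions made for the example. Only the learning-rate range \(10^{-5}\) to \(10^{-1}\) and the Adam/SGD choice come from the text.

```python
import math
import random

rng = random.Random(0)

def sample_config():
    """Draw one configuration from a hypothetical three-knob search space."""
    return {
        # Continuous knob spanning decades: sample uniformly in log10, then exponentiate.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        # Integer knob: uniform over the allowed depths (inclusive bounds).
        "max_depth": rng.randint(2, 10),
        # Categorical knob: uniform over the choices.
        "optimizer": rng.choice(["adam", "sgd"]),
    }

trials = [sample_config() for _ in range(4000)]

# Count draws per decade of learning rate; log-uniform sampling should give
# each of the four decades roughly a quarter of the trials.
per_decade = [0, 0, 0, 0]
for t in trials:
    decade = min(3, int(math.log10(t["learning_rate"]) + 5))
    per_decade[decade] += 1
print("draws per decade, 1e-5 up to 1e-1:", per_decade)
```

Sampling the learning rate uniformly on a linear scale instead would put about 90% of the draws in the top decade alone, which is why the log transform matters.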
EQ V2.2 — SIZE OF A GRID $$ |\Theta_{\text{grid}}| \;=\; \prod_{j=1}^{D} n_j \qquad\Longrightarrow\qquad \text{evaluations} \;=\; k \cdot \prod_{j=1}^{D} n_j $$ A grid that tries \(n_j\) values of hyperparameter \(j\) across \(D\) hyperparameters has \(\prod_j n_j\) configurations, and each costs \(k\) model fits under k-fold CV. The product is the curse of dimensionality in one line: ten values across six hyperparameters is \(10^6\) configurations — a million fits before you score anything. The objective \(f\) is cheap to state and ruinous to sweep. You define a grid over three hyperparameters with \(4\), \(5\), and \(3\) candidate values respectively. By EQ V2.2, how many distinct configurations does the full grid contain (\(\prod_j n_j\))? The grid is the Cartesian product, so its size is the product of the per-axis counts: \(4 \times 5 \times 3 = \) 60 configurations. Under 5-fold CV that is already \(60 \times 5 = 300\) model fits — and adding a single fourth hyperparameter with just 4 values would quadruple it to 240 configurations. You search a learning rate log-uniformly between \(10^{-5}\) and \(10^{-1}\). The geometric center of that range — the value sitting exactly halfway in log space — is \(\sqrt{10^{-5}\cdot 10^{-1}}\). What is it? In log space the midpoint of \([-5, -1]\) is \(-3\), so the value is \(10^{-3}\). Equivalently the geometric mean: \(\sqrt{10^{-5}\cdot 10^{-1}} = \sqrt{10^{-6}} = 10^{-3} = \) 0.001. Searching log-uniformly is why each decade of learning rate gets equal sampling attention. 2.2 Grid search: exhaustive and quietly wasteful Grid search is the reflex: pick a finite set of values per hyperparameter, take the Cartesian product, evaluate every point, keep the best. It is trivial to implement, embarrassingly parallel, and reproducible to the last digit. For one or two hyperparameters it is perfectly reasonable. Its problems begin the moment the space has more dimensions or more resolution than your budget can sweep. 
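The procedure and the count from EQ V2.2 can be sketched with the standard library alone. The hyperparameter names, the candidate values, and `toy_objective` below are hypothetical stand-ins chosen to reproduce the 4 × 5 × 3 worked example; a real objective would train \(k\) models per configuration.

```python
import itertools
import math

# Hypothetical candidate values: 4 x 5 x 3, matching the worked example above.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "max_depth": [2, 4, 6, 8, 10],
    "subsample": [0.6, 0.8, 1.0],
}
k = 5  # CV folds per configuration

# The grid is the Cartesian product of the per-axis candidate lists.
names = list(space)
grid = [dict(zip(names, values)) for values in itertools.product(*space.values())]
print(len(grid), "configurations ->", k * len(grid), "model fits")

def toy_objective(cfg):
    """Stand-in for the cross-validated loss f(theta); cheap on purpose."""
    return (math.log10(cfg["learning_rate"]) + 2) ** 2 + 0.01 * (cfg["max_depth"] - 6) ** 2

# Evaluate every point and keep the best.
best = min(grid, key=toy_objective)
print("best configuration:", best)
```

In this toy objective `subsample` has no effect at all, yet the grid still spends two thirds of its 60 evaluations varying it.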
The first problem is the exponential of EQ V2.2: refining a grid from 5 to 10 values per axis multiplies the cost by \(2^D\), so the same compute buys you exponentially coarser resolution as \(D\) grows. The second problem is subtler and more damaging. A grid spends its budget on the Cartesian product even when most axes do not matter. Project a 2D grid onto the one hyperparameter that actually drives the score, and the grid's points collapse onto each other: a \(g \times g\) grid tests only \(g\) distinct values of the hyperparameter that matters, no matter how large \(g\) is.

EQ V2.3 — DISTINCT VALUES TESTED ON THE IMPORTANT AXIS
$$ \underbrace{g^{D}}_{\text{grid evaluations}} \quad\text{but only}\quad \underbrace{g}_{\substack{\text{distinct values of} \\ \text{the one axis that matters}}} \;\;\ll\;\; \underbrace{g^{D}}_{\substack{\text{distinct values random} \\ \text{search would test}}} $$
A \(g\)-per-axis grid in \(D\) dimensions makes \(g^{D}\) evaluations but, by construction, repeats each value of every single axis \(g^{D-1}\) times. If the objective depends on only one axis, all that repetition is wasted: the grid resolves the important axis at resolution \(g\), while spending \(g^{D}\) queries doing it. Random search spends the identical budget on \(g^{D}\) distinct values of every axis, which is the observation that motivates §2.3.

Grid search is not obsolete. When you have exactly one or two hyperparameters, when reproducibility and an auditable, even sweep matter (a regulated model-risk setting), or when each evaluation is cheap, a small grid is clear and defensible. The lesson is narrower than "never grid": do not let a grid's tidiness lull you into spending an exponential budget resolving axes that do not move the metric.

PYTHON · RUNNABLE IN-BROWSER

    # Grid vs random search on a 2D objective; best-found vs number of evaluations.
    import numpy as np
    rng = np.random.default_rng(0)

    # A black-box objective on [0,1]^2 with one DOMINANT axis (x) and a near-flat one (y).
    # Minimum sits near x*=0.30; y barely matters -> the classic random-beats-grid setup.
    def f(x, y):
        return (x - 0.30)**2 + 0.03*(y - 0.7)**2 + 0.01*np.sin(12*x)

    # GRID: g x g points -> g^2 evals, but only g DISTINCT x values are ever tested.
    for g in (4, 6):
        xs = np.linspace(0, 1, g)
        grid = [f(x, y) for x in xs for y in xs]
        print(f"grid   {g}x{g} = {g*g:2d} evals | best f = {min(grid):.4f}  "
              f"| distinct x tested = {g}")

    # RANDOM: same budgets, but every draw is a fresh x AND y.
    print()
    for n in (16, 36):
        X = rng.random((n, 2))
        vals = f(X[:, 0], X[:, 1])
        print(f"random {n:2d} evals     | best f = {vals.min():.4f}  "
              f"| distinct x tested = {n}")

    print("\nEqual budgets: random resolves the x that matters far more finely than grid.")

2.3 Random search: the surprising default

Random search replaces the grid with independent draws: sample each hyperparameter from a distribution (uniform, log-uniform, categorical) for some budget \(n\) of trials, evaluate, keep the best. Bergstra and Bengio's 2012 result, one of the most quietly influential papers in applied ML, showed that for the same budget, random search matches or beats grid search on neural-network tuning, and the reason is exactly EQ V2.3: when only a few of the many hyperparameters matter, random search devotes its full budget to many distinct values of the important ones, while the grid wastes most of its points on combinations of the unimportant ones.

There is also a clean probabilistic guarantee that needs no assumption about which axes matter.
If you call a configuration "good" when it lands in the top \(p\) fraction of the search space, then \(n\) independent random draws miss the good region entirely with probability \((1-p)^n\):

EQ V2.4 — RANDOM SEARCH COVERAGE GUARANTEE
$$ \Pr[\text{at least one trial in the top } p] \;=\; 1 - (1-p)^{n} \;\ge\; c \qquad\Longleftrightarrow\qquad n \;\geq\; \frac{\ln(1-c)}{\ln(1-p)} $$
To hit the top \(p = 5\%\) of configurations with confidence \(c = 0.95\), you need \(n \ge \ln(0.05)/\ln(0.95) \approx 58.4\), so 59 random trials, independent of how many hyperparameters you have. This is the dimension-free property that makes random search scale where grids cannot: the bound depends only on how good "good enough" is, not on \(D\). It does assume the top-\(p\) region has \(p\) probability mass under your sampling distribution, which is why choosing sensible ranges and log-scales still matters.

Random search tends to beat grid search precisely when only a few of the many hyperparameters actually matter, because it spends its budget on more distinct values of those important axes. True or false? (Answer true or false.)

This is the central finding of Bergstra & Bengio (2012) and the meaning of EQ V2.3. A \(g\times g\) grid tests only \(g\) distinct values of the axis that matters and wastes the rest of its \(g^2\) budget on the flat axis; random search at the same budget tests \(g^2\) distinct values of every axis. When the effective dimensionality is low, that is a decisive advantage, so the statement is true.

Using the coverage bound of EQ V2.4, how many random trials \(n\) do you need to land at least one configuration in the top \(p = 5\%\) of the space with confidence \(c = 95\%\)? (Compute \(\lceil \ln(1-c)/\ln(1-p)\rceil\).)

\(\ln(1-c) = \ln(0.05) = -2.996\) and \(\ln(1-p) = \ln(0.95) = -0.0513\). Their ratio is \(-2.996 / -0.0513 = 58.4\), and you round up to a whole trial: \(\lceil 58.4 \rceil = \) 59 trials.
Notably, this number does not depend on the number of hyperparameters at all. INSTRUMENT V2.1 — GRID vs RANDOM ON A RESPONSE SURFACE EQUAL BUDGET · ONE DOMINANT AXIS · EQ V2.3 BUDGET (EVALUATIONS) 36 AXIS IMPORTANCE SKEW high RESEED ▶ GRID — DISTINCT x TESTED — GRID BEST f — RANDOM BEST f — Both panels spend the same budget on the same 2D response surface; brighter background is lower (better) loss, and the minimum is the mint ring. The grid (left) lays points on a lattice, so its projection onto the important x axis (the strip under each panel) bunches into a few repeated values. Random search (right) scatters, so its x-projection is dense. Crank AXIS IMPORTANCE SKEW up — make the vertical axis nearly irrelevant — and the grid's wasted budget becomes obvious: it keeps re-testing the same handful of x values, while random keeps finding new ones. Press RESEED to redraw the random trials. 2.4 Bayesian optimization: spend queries where they pay Grid and random search are both uninformed — neither looks at the results of past trials when choosing the next one. When evaluations are genuinely expensive (training a large model for hours), that is wasteful: you have data about the objective and you are ignoring it. Bayesian optimization closes the loop. It fits a cheap probabilistic surrogate model of \(f\) from the trials seen so far, then uses the surrogate to choose the most promising next query — sequential and sample-efficient by design. The classic surrogate is a Gaussian process, which returns, at any candidate \(\theta\), a posterior mean \(\mu(\theta)\) (its best guess of the objective) and a standard deviation \(\sigma(\theta)\) (its uncertainty). The genius is the second number: \(\sigma\) is large where you have not looked. An acquisition function turns \((\mu, \sigma)\) into a single score that balances exploitation (go where \(\mu\) is good) against exploration (go where \(\sigma\) is high). 
A common, transparent choice is the upper/lower confidence bound: EQ V2.5 — CONFIDENCE-BOUND ACQUISITION (minimization) $$ \theta_{\text{next}} \;=\; \arg\min_{\theta}\; \alpha(\theta), \qquad \alpha(\theta) \;=\; \mu(\theta) \;-\; \kappa\,\sigma(\theta), \qquad \kappa \geq 0 $$ For minimization we pick the point with the lowest lower confidence bound: \(\mu - \kappa\sigma\) is optimistic in the direction we care about, rewarding both small predicted loss (\(\mu\) low) and high uncertainty (\(\sigma\) large). \(\kappa\) is the explore–exploit dial: \(\kappa = 0\) is pure greedy exploitation; large \(\kappa\) chases the most uncertain region. Expected Improvement (EI) is the other standard acquisition and needs no \(\kappa\), self-balancing via the improvement over the best seen so far. After each real evaluation the surrogate is refit and the loop repeats — typically converging in tens of trials where random search needs hundreds. Bayesian optimization is the right tool when evaluations dominate cost and trials are necessarily sequential. Its honest caveats: a Gaussian process surrogate scales poorly past a few thousand trials and a couple dozen dimensions, and it is harder to parallelize than random search (each query depends on the last). For high-dimensional or massively parallel settings, tree-based surrogates (the TPE algorithm behind Optuna and Hyperopt) and the bandit-style methods of §2.5 often win — and in practice modern tuners (Optuna, Ray Tune) combine a Bayesian sampler with an early-stopping scheduler rather than choosing one. A Gaussian-process surrogate predicts, at a candidate point, mean \(\mu = 2.0\) and standard deviation \(\sigma = 0.2\). Using the confidence-bound acquisition of EQ V2.5 with \(\kappa = 2\), what is its acquisition value \(\mu - \kappa\sigma\) for minimization? \(\mu - \kappa\sigma = 2.0 - 2 \times 0.2 = 2.0 - 0.4 = \) 1.6. 
A second candidate with the same \(\mu = 2.0\) but larger \(\sigma = 0.5\) would score \(2.0 - 2\times0.5 = 1.0\): lower, hence preferred, because the optimizer is rewarded for probing where it is uncertain.

PYTHON · RUNNABLE IN-BROWSER

    # Toy Bayesian optimization: maximize a 1D function with a simple RBF surrogate.
    import numpy as np
    rng = np.random.default_rng(0)

    def objective(x):                     # the expensive black box (we "don't know" it)
        return np.sin(3*x) + 0.5*np.sin(7*x) - 0.05*(x-2)**2

    grid = np.linspace(0, 5, 400)         # candidate pool for the acquisition step
    X = np.array([0.5, 4.5]); Y = objective(X)    # two initial probes
    ls, kappa = 0.4, 2.0                  # RBF length-scale; explore-exploit dial

    def kern(a, b):                       # RBF / squared-exponential kernel
        return np.exp(-0.5*((a[:, None]-b[None,:])/ls)**2)

    for step in range(8):                 # 8 sequential, sample-efficient queries
        K = kern(X, X) + 1e-6*np.eye(len(X))
        Kinv = np.linalg.inv(K)
        ks = kern(grid, X)
        mu = ks @ Kinv @ Y                               # posterior mean over the grid
        var = 1.0 - np.sum((ks @ Kinv) * ks, axis=1)
        sd = np.sqrt(np.clip(var, 1e-9, None))           # posterior std (uncertainty)
        acq = mu + kappa*sd                              # UCB: we are MAXIMIZING here
        nxt = grid[np.argmax(acq)]                       # next query = argmax acquisition
        X = np.append(X, nxt); Y = np.append(Y, objective(nxt))

    print(f"true max over grid: {objective(grid).max():.4f} at x = {grid[objective(grid).argmax()]:.3f}")
    print(f"BO best after {len(X)} evals: {Y.max():.4f} at x = {X[np.argmax(Y)]:.3f}")
    plot_xy(grid, mu)   # final surrogate mean (plot_xy is not defined in this snippet)

INSTRUMENT V2.2 — BAYESIAN-OPTIMIZATION STEPPER (GP SURROGATE · μ ± κσ · NEXT QUERY = argmin · EQ V2.5)
Controls: EXPLORE–EXPLOIT κ, STEP, RESET. Readouts: evaluations so far, best f found (min), next query x.
The hidden objective (faint grey line) is a black box; the optimizer only knows the mint probes it has spent. The blue band is the surrogate's mean \(\mu\) bracketed by its uncertainty \(\pm\sigma\), wide where nothing has been sampled and pinched to nothing at each probe.
The dashed marker is the next query: the argmin of the lower confidence bound \(\mu - \kappa\sigma\). Press STEP to spend one evaluation and watch the band collapse there; the optimizer converges on the true minimum in a handful of steps. Slide \(\kappa\) to 0 for pure greedy (it can get stuck in a local dip) or up to 4 for restless exploration. 2.5 Hyperband & successive halving Every method so far runs each configuration to completion before judging it. But for iterative learners — neural nets, gradient boosting — a configuration's fate is often visible early: a learning rate that will diverge usually starts diverging in the first few epochs. Successive halving exploits this. It is a tournament on budget: start many configurations on a tiny budget (a few epochs, a small data subset), throw away the worst fraction, give the survivors more budget, and repeat. Most of the compute lands on the few configurations that have already proven themselves. EQ V2.6 — SUCCESSIVE HALVING SCHEDULE $$ n_r \;=\; \Big\lfloor \frac{n_0}{\eta^{\,r}} \Big\rfloor, \qquad b_r \;=\; b_0\,\eta^{\,r}, \qquad r = 0, 1, \ldots, \big\lfloor \log_{\eta} n_0 \big\rfloor $$ Round \(r\) keeps \(n_r\) survivors and gives each a per-config budget \(b_r\); \(\eta\) is the cull factor (keep the top \(1/\eta\) each round, usually \(\eta = 3\)). Configurations fall geometrically while budget per survivor rises geometrically, so the total budget spent per round stays roughly constant \((n_r\, b_r \approx n_0\, b_0)\) — the scheme reallocates a fixed pot toward the promising, never enlarging it. With \(\eta = 3\) and \(n_0 = 27\): \(27 \to 9 \to 3 \to 1\) over \(\log_3 27 = 3\) elimination rounds. Successive halving has one free parameter that hides a real dilemma: how many configurations \(n_0\) to start with, given a fixed total budget \(B\)? Start with many cheap configs (large \(n_0\), small \(b_0\)) and you explore widely but might cut a slow-starting winner before it blooms. 
Start with few well-funded configs and you risk never sampling a good one. There is no universally right answer because it depends on how fast the early signal correlates with final performance. Hyperband (Li et al., 2017) resolves this by refusing to choose. It runs successive halving as a subroutine across a spectrum of brackets, each with a different \((n_0, b_0)\) trade-off — from "many configs, tiny budget each" (aggressive early stopping) to "few configs, full budget each" (essentially random search, which never wrongly culls). By hedging across brackets it is robust to how early-stopping-friendly the problem turns out to be, and it provably loses only a small logarithmic factor to the best fixed bracket chosen in hindsight. In 2026 the bracket idea, usually under the ASHA variant (asynchronous successive halving), is the default scheduler in Ray Tune and Optuna — and is routinely paired with a Bayesian sampler choosing which configurations enter the tournament. Successive halving starts with \(n_0 = 27\) configurations and culls with factor \(\eta = 3\) (keep the top third each round). How many elimination rounds does it take to reach a single survivor (\(\log_{\eta} n_0\))? Each round divides the survivors by \(\eta = 3\): \(27 \to 9 \to 3 \to 1\). The number of rounds is \(\log_3 27 = \log_3 3^3 = \) 3. Because budget per survivor triples each round while their count thirds, the compute spent per round is roughly constant. INSTRUMENT V2.3 — SUCCESSIVE-HALVING BRACKET CULL THE WORST 1−1/η EACH ROUND · EQ V2.6 STARTING CONFIGS n₀ 27 CULL FACTOR η 3 RESEED ▶ ELIMINATION ROUNDS — SURVIVORS PER ROUND — TOTAL BUDGET vs RUN-ALL — Each column is one configuration; each row is a round, time flowing downward. A mint cell is a survivor still being trained; a faded grey cell was culled. Bar height within a survivor cell encodes the budget it now receives — short at the top (cheap early peeks), tall at the bottom (the finalists get full training). 
Notice the shape: a wide cheap top narrowing to a single well-funded survivor. The right-hand readout compares the total budget spent against naively running every configuration to completion — the savings that make early stopping worth its one risk, culling a late bloomer. Raise \(\eta\) to cull more aggressively (fewer rounds, bigger savings, higher risk). NEXT Tuning optimizes a number; the next chapter asks whether you chose the right number. Every method here minimized a validation metric — but accuracy, AUC, F1, RMSE, and log loss can rank the same models in opposite orders, and the "best" configuration is only as honest as the metric it was scored on. Chapter 03: the regression and classification metrics, what each one rewards and quietly punishes, and how to pick the objective your search should have been optimizing all along. 2.R References Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. JMLR 13 — the result that random search matches or beats grid search under low effective dimensionality (§2.3). Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012 — Gaussian-process surrogates and acquisition functions for tuning (§2.4). Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 18 — successive halving across brackets, the bandit view of early stopping (§2.5). Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS 2016 — the successive-halving subroutine behind EQ V2.6. Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning. MLSys 2020 — ASHA, the asynchronous successive halving used in production tuners. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. 
KDD 2019 — the TPE sampler plus pruning combination that is the practical default in 2026.

## MLOPS · Metrics (https://ai-encyclopedia.com/mlops/03-regression-classification-metrics.html)

MODEL VALIDATION & RISK · CHAPTER 03 / 07
Metrics — Regression & Classification
A metric is not just a final report; it is the objective the pipeline optimizes toward, and it determines which errors the model is willing to make. The metric you optimize is the behavior you get, and accuracy is the one most often misread on imbalanced data. This chapter covers the working vocabulary: regression error measures, the confusion matrix and the rates derived from it, and probabilistic scores that grade the predicted confidence as well as the answer.
LEVEL INTRO
READING TIME ≈ 24 MIN
BUILDS ON MLOPS 01 · STATS 04
INSTRUMENTS CONFUSION · MAE vs RMSE · MAPE TRAP
IN THIS CHAPTER
3.1 Regression metrics
3.2 When each one misleads
3.3 The confusion matrix
3.4 Precision, recall, F1, accuracy
3.5 Log loss & probabilistic scoring
3.R References

3.1 Regression metrics: MSE, RMSE, MAE, MAPE, R²
A regression model predicts a number \(\hat{y}_i\) for each row whose truth is \(y_i\). The single object every regression metric chews on is the vector of residuals \(e_i = y_i - \hat{y}_i\). The metrics differ only in how they punish a residual — and that choice of punishment is the choice of what the model will try hardest to avoid.

EQ V3.1 — MEAN SQUARED ERROR & ITS ROOT
$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}} $$
Squaring makes large residuals dominate: a single error of 10 contributes as much as one hundred errors of 1.
So MSE/RMSE is the metric of choice when big misses are disproportionately bad (a forecast that is occasionally catastrophic). RMSE takes the square root to return to the original units — predict dollars, read dollars — and is the value almost always reported. The estimator that minimizes MSE is the conditional mean \(\mathbb{E}[y\mid x]\). EQ V3.2 — MEAN ABSOLUTE ERROR $$ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert\, y_i - \hat{y}_i \,\rvert $$ MAE punishes every dollar of error equally, with no squaring. It is in the original units already and is far more robust to outliers than RMSE — one wild residual moves it linearly, not quadratically. The estimator that minimizes MAE is the conditional median, which is why MAE-trained models lean toward the typical case and ignore rare extremes. The gap between RMSE and MAE is itself a diagnostic: RMSE \(\ge\) MAE always, and a large ratio signals a heavy tail of big errors. EQ V3.3 — MEAN ABSOLUTE PERCENTAGE ERROR $$ \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert $$ MAPE rescales every residual by the truth, giving a unit-free percentage that lets you compare error across series of wildly different magnitude. That convenience hides two traps it is famous for: it explodes when any \(y_i\) is near zero (the demo in §3.2), and it is asymmetric — it penalizes over-prediction more harshly than under-prediction, quietly biasing a MAPE-tuned forecast low. Use it for reporting across scales; never as your sole training objective. EQ V3.4 — COEFFICIENT OF DETERMINATION (R²) $$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\mathrm{SS}_{\text{res}}}{\mathrm{SS}_{\text{tot}}} $$ \(R^2\) is the fraction of the target's variance the model explains, measured against the dumbest honest baseline: always predicting the mean \(\bar{y}\). 
\(R^2 = 1\) is perfect; \(R^2 = 0\) means you matched the mean and learned nothing; \(R^2\) can go negative when the model is worse than that constant — a fact that surprises people who assume it lives in \([0,1]\). Because it is normalized by the data's own spread, \(R^2\) is unit-free and easier to compare across datasets than RMSE or MAE, although its value still depends on how much variance each target has.

WORKED EXAMPLE ▾
01 Four rows, truth \(y = (10, 12, 14, 16)\), predictions \(\hat{y} = (11, 11, 15, 15)\). Residuals \(e = (-1, 1, -1, 1)\).
02 MSE \(= \frac{1}{4}(1+1+1+1) = 1\); RMSE \(= \sqrt{1} = 1\); MAE \(= \frac{1}{4}(1+1+1+1) = 1\). Here RMSE = MAE because every error has the same size.
03 \(\mathrm{SS}_{\text{res}} = 4\). Mean \(\bar{y} = 13\), deviations \((-3,-1,1,3)\), \(\mathrm{SS}_{\text{tot}} = 9+1+1+9 = 20\).
04 \(R^2 = 1 - 4/20 = 1 - 0.2 = 0.80\) — the model explains 80% of the variance the mean leaves on the table.
RESULT: RMSE = MAE = 1 · R² = 0.80

A model makes two predictions with errors \( e = (3,\ 4) \). Using EQ V3.1, what is the RMSE of these errors, \( \sqrt{\tfrac{1}{2}(3^2 + 4^2)} \)?
Square the errors: \( 3^2 = 9 \) and \( 4^2 = 16 \). The mean squared error is \( (9 + 16)/2 = 25/2 = 12.5 \). Taking the root: \( \sqrt{12.5} \approx \) 3.54. (Note the MAE of the same errors is \( (3+4)/2 = 3.5 \) — RMSE sits above it because squaring inflates the larger residual.)

PYTHON · RUNNABLE IN-BROWSER
# Every regression metric from scratch on a toy fit (EQ V3.1-V3.4).
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = np.linspace(0, 10, n)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, n)   # truth: a noisy line

# Fit y = a*x + b by least squares (this is what MSE training would find).
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
yhat = a * x + b
e = y - yhat                                 # residuals: the raw material

mse = np.mean(e**2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(e))
mape = 100 * np.mean(np.abs(e / y))          # y is safely far from 0 here
r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

print(f"fitted line: y = {a:.2f}*x + {b:.2f}")
print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f} (same units as y)")
print(f"MAE  = {mae:.3f} (never larger than RMSE)")
print(f"MAPE = {mape:.2f} %")
print(f"R^2  = {r2:.3f}")

3.2 When each regression metric misleads
None of these numbers is neutral. Each encodes an opinion about which errors matter, and each has a regime where it quietly tells you the wrong thing. The professional habit is to report at least two — usually RMSE and MAE — and to read the gap between them.

Metric | What it rewards | Where it misleads
RMSE / MSE | getting the big cases right | one outlier can dominate the score; over-sensitive to a single bad row
MAE | getting the typical case right | indifferent to whether a miss is huge or merely large; ignores the tail
MAPE | comparable error across scales | undefined / explodes near \(y = 0\); asymmetric, biases forecasts low
R² | beating the mean baseline | inflates with more features; can go negative; meaningless on tiny test sets

The cleanest demonstration is the outlier sensitivity of RMSE versus MAE. Take ten residuals of size 1. MAE = 1 and RMSE = 1. Now turn one of those into a residual of 10 — a single bad day. MAE crawls up to \((9\cdot 1 + 10)/10 = 1.9\). RMSE leaps to \(\sqrt{(9\cdot 1 + 100)/10} = \sqrt{10.9} \approx 3.30\). The same data; MAE did not even double, while RMSE more than tripled. Which reaction you want is a domain decision — but you must know the metric is making it for you.

THE MAPE TRAP
MAPE divides by the truth, so a single true value near zero detonates it.
If one row has \(y_i = 0.01\) and you predict \(0.5\), that term alone is \(|{-0.49}/0.01| = 49 = 4900\%\) — and the average is now hostage to one near-zero label, regardless of how good the other thousand predictions are. The standard escapes are sMAPE (symmetric MAPE), WAPE / weighted MAPE (divide by the sum of actuals, not row by row), or simply MAE when the targets can be small. Never compute MAPE on data with zeros in it. A model has residual sum of squares \( \mathrm{SS}_{\text{res}} = 4 \) against a total sum of squares \( \mathrm{SS}_{\text{tot}} = 20 \). Using EQ V3.4, what is \( R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} \)? \( \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} = 4/20 = 0.20 \), so \( R^2 = 1 - 0.20 = \) 0.80. The model explains 80% of the variance that always-predict-the-mean would leave unexplained. INSTRUMENT V3.1 — MAE vs RMSE DIVERGENCE ADD AN OUTLIER · EQ V3.1 / V3.2 TYPICAL ERRORS (count of size 1) 9 OUTLIER RESIDUAL SIZE 10 MAE — RMSE — RMSE / MAE RATIO — Each bar is one residual: grey are the typical errors of size 1, the red bar is the single outlier you control. Drag OUTLIER RESIDUAL SIZE up and watch MAE rise gently while RMSE — and the RMSE/MAE ratio — climbs far faster, because squaring lets one bad row dominate. With no outlier (size 1) the two metrics agree exactly; the ratio is the tell-tale of a heavy error tail. PYTHON · RUNNABLE IN-BROWSER # One outlier: MAE shrugs, RMSE leaps. And MAPE near zero detonates. 
import numpy as np

e = np.ones(10)                      # ten residuals of size 1
print("clean residuals:", e.astype(int))
print(f" MAE = {np.mean(np.abs(e)):.3f} RMSE = {np.sqrt(np.mean(e**2)):.3f}")

e_out = e.copy()
e_out[0] = 10                        # turn ONE into a size-10 miss
print("\nwith one size-10 outlier:")
print(f" MAE = {np.mean(np.abs(e_out)):.3f} RMSE = {np.sqrt(np.mean(e_out**2)):.3f}")
print(f" RMSE rose {np.sqrt(np.mean(e_out**2))/np.sqrt(np.mean(e**2)):.2f}x; "
      f"MAE rose only {np.mean(np.abs(e_out))/np.mean(np.abs(e)):.2f}x")

# The MAPE trap: similar-sized absolute errors, but one true value sits near zero.
y = np.array([100., 100., 100., 0.01])
yhat = np.array([101., 101., 101., 0.50])
print("\nMAPE per row (%):", np.round(100*np.abs((y-yhat)/y), 1))
print(f"overall MAPE = {100*np.mean(np.abs((y-yhat)/y)):.0f} % "
      "(one near-zero row dominates)")

INSTRUMENT V3.2 — THE MAPE NEAR-ZERO PITFALL · ONE SMALL TRUTH BREAKS THE AVERAGE · EQ V3.3
CONTROLS: ONE TRUE VALUE y₄ 0.50 · ITS PREDICTION ŷ₄ 1.00. READOUTS: MAE (units, stable) · MAPE (%, fragile) · y₄'s SHARE OF MAPE.
Three well-behaved rows sit near \(y = 100\) with tiny errors; a fourth row's truth \(y_4\) is the slider, plotted on a log-distance scale. As you drag \(y_4\) toward zero, MAE barely twitches — the absolute error is unchanged — but MAPE blows up and that one row's share of the total MAPE races toward 100%. Slide \(y_4\) past zero and the percentage error becomes nonsense entirely: this is why MAPE is banned on data containing zeros.

3.3 The confusion matrix
Classification swaps a continuous truth for a discrete one, and the entire grammar of classification metrics is built from a single 2×2 table.
A binary classifier converts a score into a label by comparing it to a threshold; every prediction then lands in one of four cells: EQ V3.5 — THE CONFUSION MATRIX $$ \begin{array}{c|cc} & \hat{y}=1 & \hat{y}=0 \\ \hline y=1 & \mathrm{TP} & \mathrm{FN} \\ y=0 & \mathrm{FP} & \mathrm{TN} \end{array} $$ TP (true positive): predicted positive, was positive. FP (false positive, "false alarm"): predicted positive, was negative. FN (false negative, "miss"): predicted negative, was positive. TN (true negative). Every classification metric in §3.4 is just a ratio of these four counts. The deep point: FP and FN have different costs — a missed tumor is not the same as a false alarm — so no single number can serve every problem, and the threshold that balances them is a business decision, not a statistical one. The threshold is the dial that moves counts between cells. Lower it and you call more things positive: TP and FP both rise, FN falls. Raise it and you become conservative: FP and TP both fall, FN rises. You cannot lower false alarms and misses at the same time by moving the threshold — you can only trade one for the other. That trade-off is the single most important intuition in classification, and Instrument V3.3 below exists to make you feel it in your hands. A confusion matrix has \( \mathrm{TP}=60 \), \( \mathrm{FP}=40 \), \( \mathrm{FN}=40 \), \( \mathrm{TN}=60 \). What is the accuracy, \( (\mathrm{TP}+\mathrm{TN}) / (\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}) \)? Correct predictions are \( \mathrm{TP}+\mathrm{TN} = 60 + 60 = 120 \); the total is \( 60+40+40+60 = 200 \). Accuracy \( = 120/200 = \) 0.6. Note that this "balanced-looking" matrix still gets two in five wrong — and on an imbalanced set the same accuracy could come from a model that never finds a single positive. 
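The threshold trade-off described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the eight labels and scores are made-up toy data, and `confusion_counts` is a hypothetical helper introduced here, not part of the chapter's own cells. It follows the same convention as the chapter's code (a row is called positive when its score is strictly above the threshold).

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, FN, TN) when every score above `threshold` is called positive."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(scores) > threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

# Toy data (assumed for illustration): four positives, four negatives, overlapping scores.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

results = {}
for thr in (0.75, 0.50, 0.25):       # lower the threshold step by step
    results[thr] = confusion_counts(y_true, scores, thr)
    tp, fp, fn, tn = results[thr]
    print(f"threshold {thr:.2f}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Reading the printed rows top to bottom, TP and FP can only grow as the threshold falls while FN can only shrink, which is the trade described in the text: on a fixed set of scores, moving the threshold never reduces false alarms and misses together.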
INSTRUMENT V3.3 — CONFUSION-MATRIX EXPLORER · MOVE THE THRESHOLD · PRECISION ↔ RECALL · EQ V3.5
CONTROLS: DECISION THRESHOLD 0.50 · CLASS SEPARATION 2.0. READOUTS: PRECISION · RECALL · F1 · ACCURACY.
Two overlapping bell curves are the score distributions of the negative and positive classes; the vertical line is your threshold. Everything to its right is called positive. Slide the threshold left and recall climbs while precision falls (you catch more positives but raise false alarms); slide it right and the trade reverses. The four counts and all four metrics update live. Then drag CLASS SEPARATION up: a genuinely better model is the only thing that improves precision and recall together.

PYTHON · RUNNABLE IN-BROWSER
# Confusion matrix -> precision, recall, F1, accuracy, all from scratch.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.integers(0, 2, n)                    # true labels
# scores: positives score higher on average, but the classes overlap
score = rng.normal(0, 1, n) + 1.3 * y
thr = 0.5
pred = (score > thr).astype(int)

TP = int(np.sum((pred == 1) & (y == 1)))
FP = int(np.sum((pred == 1) & (y == 0)))
FN = int(np.sum((pred == 0) & (y == 1)))
TN = int(np.sum((pred == 0) & (y == 0)))
print(f"confusion: TP={TP} FP={FP} FN={FN} TN={TN}")

precision = TP / (TP + FP)                   # of predicted positives, how many real
recall = TP / (TP + FN)                      # of real positives, how many caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / n
print(f"precision = {precision:.3f}")
print(f"recall = {recall:.3f}")
print(f"F1 = {f1:.3f} (harmonic mean of the two)")
print(f"accuracy = {accuracy:.3f}")

3.4 Precision, recall, F1, accuracy
Four ratios of the four counts. They look interchangeable; they are not, and choosing the wrong one is the most common way a model ships looking great and fails in production.
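A quick numerical check of that claim, as a sketch only. The first matrix is the one from the §3.3 quiz; the second is a hypothetical imbalanced matrix invented here for contrast. `rates` is an illustrative helper, and it uses the same ratio definitions as the chapter's code.

```python
def rates(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

balanced = rates(tp=60, fp=40, fn=40, tn=60)   # the §3.3 quiz matrix
skewed = rates(tp=5, fp=5, fn=45, tn=945)      # hypothetical: 5% positives, most of them missed

for name, r in (("balanced", balanced), ("skewed", skewed)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

On the balanced matrix all four ratios coincide at 0.6, so they really do look interchangeable. On the skewed one, accuracy is 0.95 while recall is 0.1 and F1 is about 0.17: the same model, four very different verdicts.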
EQ V3.6 — PRECISION & RECALL $$ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$ Precision answers: of everything I flagged positive, what fraction really was? It is the metric you care about when a false alarm is expensive — a spam filter that quarantines a real invoice, a fraud system that freezes an honest card. Recall (sensitivity, true-positive rate) answers: of everything that really was positive, what fraction did I catch? It is what you care about when a miss is expensive — a cancer screen, a security threat. The two pull in opposite directions along the threshold (§3.3): you buy recall with precision and vice versa. EQ V3.7 — F1: THE HARMONIC MEAN $$ F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad F_\beta = (1+\beta^2)\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision}+\mathrm{Recall}} $$ \(F_1\) is the harmonic mean of precision and recall, not the arithmetic one — and the choice is deliberate. The harmonic mean is dragged toward the smaller of the two, so \(F_1\) is high only when precision and recall are both high; a model with precision 1.0 and recall 0.0 scores \(F_1 = 0\), not 0.5. \(F_\beta\) generalizes it: \(\beta > 1\) weights recall more (use when misses hurt), \(\beta < 1\) weights precision more. \(F_1\) is the right summary on imbalanced data where accuracy is useless. EQ V3.8 — ACCURACY (AND WHY IT LIES) $$ \mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}} $$ Accuracy is the fraction of predictions that are correct — intuitive, and the default everyone reaches for first. It is also the metric that lies most often, because it collapses the confusion matrix into one number and so is blind to class imbalance. 
On a dataset that is 99% negative, the model that predicts "negative" for everything scores 99% accuracy while catching exactly zero positives — useless, yet by accuracy alone it looks excellent. This is the accuracy paradox, and it is why the lede of this chapter singles accuracy out. Under imbalance, report precision, recall, F1, or balanced accuracy instead. A COMMON ERROR "The model is 97% accurate, ship it." Always ask the base rate first. If 97% of the rows are negative, a constant "no" predictor already scores 97% — your model may have learned nothing. The diagnostic reflex: compute the accuracy of the majority-class baseline, and never report accuracy on imbalanced data without precision and recall beside it. Accuracy is a fine metric only when the classes are roughly balanced and false positives and false negatives cost about the same. A classifier produces \( \mathrm{TP} = 40 \) true positives and \( \mathrm{FP} = 10 \) false positives. Using EQ V3.6, what is its precision, \( \mathrm{TP}/(\mathrm{TP}+\mathrm{FP}) \)? Precision \( = \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \dfrac{40}{40+10} = \dfrac{40}{50} = \) 0.8. Four out of every five items the model flagged as positive truly were — but precision alone says nothing about how many positives it missed; that is recall's job. That same classifier also has \( \mathrm{FN} = 20 \) false negatives (positives it missed). Using EQ V3.6, what is its recall, \( \mathrm{TP}/(\mathrm{TP}+\mathrm{FN}) \)? Recall \( = \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = \dfrac{40}{40+20} = \dfrac{40}{60} = \) 0.667. So the model is precise (0.8) but leaky on recall (0.67): its \( F_1 = \frac{2\cdot 0.8\cdot 0.667}{0.8+0.667} = 0.727 \), pulled below the average toward the weaker of the two. PYTHON · RUNNABLE IN-BROWSER # The accuracy paradox: 99% accurate and completely useless. 
import numpy as np

rng = np.random.default_rng(5)
n = 10000
# Assumed ~1% positive rate, matching the 99%-negative example in the text.
y = (rng.random(n) < 0.01).astype(int)

# "Model" A: always predict negative.
predA = np.zeros(n, dtype=int)
accA = np.mean(predA == y)
recA = np.sum((predA == 1) & (y == 1)) / max(int(y.sum()), 1)
print(f"ALWAYS-NEGATIVE: accuracy = {accA:.3f} recall = {recA:.3f}")
print("-> the metric that 'works' is a mirage.\n")

# "Model" B: a real but imperfect detector.
score = rng.normal(0, 1, n) + 2.5 * y
predB = (score > 1.5).astype(int)
TP = int(np.sum((predB == 1) & (y == 1)))
FP = int(np.sum((predB == 1) & (y == 0)))
FN = int(np.sum((predB == 0) & (y == 1)))
TN = int(np.sum((predB == 0) & (y == 0)))
prec = TP/(TP+FP) if TP+FP else 0
rec = TP/(TP+FN) if TP+FN else 0
f1 = 2*prec*rec/(prec+rec) if prec+rec else 0
print(f"REAL DETECTOR: accuracy = {(TP+TN)/n:.3f}")
print(f" precision = {prec:.3f} recall = {rec:.3f} F1 = {f1:.3f}")
print("Accuracy alone favors the constant model; only precision/recall/F1 reveal which one finds positives.")

3.5 Log loss & probabilistic scoring
Everything so far grades a decision — the label after thresholding. But most classifiers output a probability, and throwing it away to compute accuracy discards information: a model that says "90% sure" and is right is better than one that says "51% sure" and is right. Log loss (binary cross-entropy) grades the probability itself, rewarding confidence only when it is earned.

EQ V3.9 — BINARY CROSS-ENTROPY (LOG LOSS)
$$ \mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\ln \hat{p}_i + (1-y_i)\ln(1-\hat{p}_i) \,\Big] $$
For each row, only one term survives: if \(y_i = 1\) the penalty is \(-\ln\hat{p}_i\), if \(y_i = 0\) it is \(-\ln(1-\hat{p}_i)\). Predict the truth with probability 1 and the penalty is \(-\ln 1 = 0\); predict it with probability \(0.5\) and you pay \(\ln 2 \approx 0.693\) — the cost of a coin flip, and the score of a model that hedges at 0.5 on every row. The penalty is unbounded: a confident wrong answer (\(\hat{p}\to 0\) when \(y=1\)) costs \(-\ln(0)\to\infty\). Log loss is the loss most classifiers are actually trained on, and it is the proper scoring rule that calibration (Chapter 04) exists to keep honest.

WORKED EXAMPLE ▾
01 True label \(y = 1\).
A confident-correct model says \(\hat{p} = 0.9\): penalty \(= -\ln 0.9 = 0.105\). Cheap. 02 A hedging model says \(\hat{p} = 0.5\): penalty \(= -\ln 0.5 = 0.693\). The price of saying "I don't know." 03 A confident-wrong model says \(\hat{p} = 0.1\): penalty \(= -\ln 0.1 = 2.303\). More than 20× the confident-correct cost. 04 Push it to \(\hat{p} = 0.01\): penalty \(= -\ln 0.01 = 4.605\). As \(\hat{p}\to 0\) the loss \(\to\infty\) — log loss punishes arrogance without mercy. RESULT: 0.9 → 0.105 · 0.5 → 0.693 · 0.1 → 2.303 Two sibling scores are worth knowing. The Brier score is the mean squared error of the probabilities, \(\frac{1}{n}\sum(\hat{p}_i - y_i)^2\) — also a proper scoring rule, but bounded (a confident wrong answer maxes out at 1 rather than infinity), so it is gentler on outliers and easier to read. And cross-entropy generalizes immediately to \(K\) classes as \(-\frac{1}{n}\sum_i\sum_{c} y_{ic}\ln\hat{p}_{ic}\), the multiclass loss behind virtually every neural classifier. The honest caveat: log loss assumes the probabilities are calibrated; a model can have great ranking (AUC) yet terrible log loss if its probabilities are systematically over- or under-confident — exactly the gap the next chapter closes. A model predicts probability \( \hat{p} = 0.9 \) for a row whose true label is \( y = 1 \). Using EQ V3.9, what is the log-loss penalty for this single row, \( -\ln(\hat{p}) \)? (Use \( \ln 0.9 = -0.105 \).) With \( y = 1 \) only the first term survives: penalty \( = -\ln(\hat{p}) = -\ln(0.9) = -(-0.105) = \) 0.105. Compare a hedge at \( \hat{p}=0.5 \) (cost \(0.693\)) and a confident error at \( \hat{p}=0.1 \) (cost \(2.303\)): log loss rewards confidence only when it is right. PYTHON · RUNNABLE IN-BROWSER # Log loss vs Brier: confidence is rewarded only when it's right (EQ V3.9). 
import numpy as np

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)         # guard ln(0) = -inf
    return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))

def brier(y, p):
    return np.mean((p - y)**2)

y = np.array([1, 1, 0, 0])                   # two positives, two negatives
confident_right = np.array([0.95, 0.90, 0.05, 0.10])
hedging = np.array([0.55, 0.55, 0.45, 0.45])
confident_wrong = np.array([0.05, 0.10, 0.95, 0.90])

for name, p in [("confident & right", confident_right),
                ("hedging (~0.5)   ", hedging),
                ("confident & WRONG", confident_wrong)]:
    print(f"{name}: log loss = {log_loss(y,p):.3f} Brier = {brier(y,p):.3f}")

print("\nlog loss of a single confident-correct 0.9:", round(-np.log(0.9), 3))
print("log loss of a single coin-flip 0.5:", round(-np.log(0.5), 3))
print("log loss explodes for confident errors; Brier stays bounded by 1.")

NEXT
Every metric here graded a fixed threshold or assumed the probabilities were trustworthy — two assumptions the next chapter refuses to make. Chapter 04 sweeps the threshold to draw the ROC and precision–recall curves (and the AUC that summarizes them), then asks the harder question log loss only hinted at: when the model says 70%, does it happen 70% of the time? — calibration, reliability diagrams, and the fixes that make probabilities mean what they say.

3.R References
Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Machine Learning Technologies 2(1) — the definitive survey of confusion-matrix metrics, their biases, and what each one really measures (§3.3–3.4).
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — the standard reference for loss functions, R², and the bias/variance view of regression error (§3.1–3.2).
Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability.
Monthly Weather Review 78(1) — the original proper scoring rule for probabilistic forecasts (§3.5).
Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. J. American Statistical Association 102(477) — the theory of why log loss and Brier reward honest probabilities (§3.5).
Hyndman, R. J. & Koehler, A. B. (2006). Another Look at Measures of Forecast Accuracy. International J. Forecasting 22(4) — the canonical critique of MAPE and the case for scaled error measures (§3.2).
Chicco, D. & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6 — a modern argument for why accuracy and F1 mislead on imbalanced data (§3.4).

## MLOPS · Ranking, Calibration, ROC, KS & PSI (https://ai-encyclopedia.com/mlops/04-ranking-calibration.html)

MODEL VALIDATION & RISK · CHAPTER 04 / 07
Ranking, Calibration, ROC, KS & PSI
A scoring model makes two separate promises, and most teams check only the first. One is correct ordering, placing risky cases above safe ones. The other is correct magnitude: a score of 0.30 should default about 30% of the time. ROC/AUC and KS measure the ranking; calibration measures whether the scores match observed rates. The two properties are independent, and a model can satisfy one while failing the other.
LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON MLOPS 03 · STATS 04 INSTRUMENTS ROC/PR · CALIBRATION · COST CUTOFF IN THIS CHAPTER 4.1 ROC curves & AUC 4.2 Precision–recall curves 4.3 The KS statistic & Gini 4.4 Calibration & Brier score 4.5 Cutoff selection by cost 4.R References 4.1 ROC curves & AUC A binary classifier that emits a score (a probability, a logit, a credit grade) does not commit to a decision until you pick a threshold. Sweep the threshold from high to low and you trace out the full menu of operating points the model can offer. The Receiver Operating Characteristic curve plots two of them against each other: the true positive rate (recall, sensitivity) on the vertical axis and the false positive rate (1 − specificity) on the horizontal. EQ V4.1 — THE TWO RATES OF THE ROC AXES $$ \mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)}, \qquad \mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t) + \mathrm{TN}(t)} $$ At threshold \(t\), everything scoring \(\ge t\) is called positive. TPR is the fraction of true positives the model catches; FPR is the fraction of true negatives it falsely raises. As \(t \to \infty\) you predict nothing positive and sit at \((0,0)\); as \(t \to -\infty\) you predict everything positive and sit at \((1,1)\). Crucially, both rates condition on the true class — so the ROC curve is invariant to class prevalence. A 1%-positive fraud set and a balanced one produce the same ROC for the same ranking, which is exactly why it is the standard summary of a model's discrimination. The single-number summary is the Area Under the ROC Curve (AUC, or AUROC). 
Its value is not a coincidence of geometry — it equals a probability: EQ V4.2 — AUC AS A RANKING PROBABILITY $$ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\,\big(\mathrm{FPR}^{-1}(u)\big)\,du \;=\; \Pr\big(\,s(X^{+}) > s(X^{-})\,\big) + \tfrac{1}{2}\Pr\big(\,s(X^{+}) = s(X^{-})\,\big) $$ Draw one random positive and one random negative; AUC is the probability the model scores the positive higher (ties split evenly). This is the Wilcoxon–Mann–Whitney statistic. AUC = 1.0 is a perfect ranker, 0.5 is a coin flip, and below 0.5 means your score is backwards (flip its sign and you are above 0.5 again). Because it asks only "is the positive ranked above the negative?", AUC measures ordering and is completely blind to whether the scores are calibrated probabilities — the gap §4.4 exists to fill. WORKED EXAMPLE ▾ 01 Two positives score \((0.9,\ 0.6)\); three negatives score \((0.7,\ 0.4,\ 0.2)\). Form all \(2\times3 = 6\) positive–negative pairs. 02 Count pairs where the positive outranks the negative: \(0.9\) beats all three (3); \(0.6\) beats \(0.4\) and \(0.2\) but loses to \(0.7\) (2). Total concordant \(= 5\), with no ties. 03 \(\mathrm{AUC} = \dfrac{\text{concordant} + \tfrac12\,\text{ties}}{\text{all pairs}} = \dfrac{5 + 0}{6} = 0.8\overline{3}\). RESULT: AUC = 5/6 ≈ 0.833 — five of six pairs ranked correctly Computing AUC by sweeping thresholds is the slow way; the pair-counting identity is the fast and exact way. Sort by score, walk the list, and accumulate how many negatives each positive outranks — \(O(n \log n)\), no integration error. A perfect classifier assigns every positive a higher score than every negative. Using EQ V4.2 (AUC = probability a random positive outranks a random negative), what is its AUC? If every positive outranks every negative, then for every positive–negative pair the positive wins: the concordant fraction is \(1\) and there are no ties, so \(\mathrm{AUC} = \Pr(s(X^+) > s(X^-)) = \) 1.0. 
The ROC curve hugs the top-left corner, passing through \((0,1)\).

PYTHON · RUNNABLE IN-BROWSER
# ROC points and AUC from scores, two ways: threshold sweep vs pair-counting.
import numpy as np

rng = np.random.default_rng(0)
# 600 negatives ~ N(0,1), 400 positives ~ N(1.1,1): overlapping but separable.
neg = rng.normal(0.0, 1.0, 600)
pos = rng.normal(1.1, 1.0, 400)
scores = np.concatenate([neg, pos])
y = np.concatenate([np.zeros(600), np.ones(400)]).astype(int)

# --- ROC points by sweeping every distinct score as a threshold (EQ V4.1) ---
order = np.argsort(-scores)                  # high score first
ys = y[order]
P, Nn = ys.sum(), (1 - ys).sum()
tpr = np.cumsum(ys) / P                      # caught positives so far
fpr = np.cumsum(1 - ys) / Nn                 # false alarms so far
auc_curve = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid area

# --- AUC by the Mann-Whitney pair-counting identity (EQ V4.2) ---
ranks = scores.argsort().argsort() + 1       # 1-based ranks, no tie averaging (scores are continuous)
auc_rank = (ranks[y == 1].sum() - P*(P+1)/2) / (P*Nn)

print(f"positives: {int(P)} negatives: {int(Nn)}")
print(f"AUC (threshold sweep / trapezoid): {auc_curve:.4f}")
print(f"AUC (Mann-Whitney pair counting): {auc_rank:.4f}")
print(f"the two agree to rounding: {abs(auc_curve - auc_rank) < 1e-9}")

INSTRUMENT V4.1 — ROC / PR / KS EXPLORER · DRAG THE TWO CLASS DISTRIBUTIONS · EQ V4.1–V4.2
CONTROLS: CLASS SEPARATION (Δμ) 1.40 · POSITIVE SPREAD (σ⁺) 1.00 · PREVALENCE (% POS) 40% · VIEW ROC / PRECISION–RECALL. READOUTS: AUC (AUROC) · KS STATISTIC · AVG PRECISION (PR-AUC).
The two bell curves are the score distributions of negatives and positives. Slide SEPARATION to zero and the two distributions coincide, so the ROC collapses onto the diagonal (AUC → 0.5, a useless ranker). Pull them apart and the ROC bows toward the top-left corner. The KS gap marked on the ROC view is the largest vertical distance between TPR and FPR (§4.3).
Switch to PRECISION–RECALL and drop PREVALENCE to 2% to watch the lesson of §4.2: the ROC barely moves, but the PR curve collapses — because precision pays the rent on rarity. 4.2 Precision–recall curves ROC's prevalence-invariance is a feature when you want to judge a ranker in the abstract — and a trap when you deploy it. On a 1%-positive fraud problem, a model can post a gorgeous 0.95 AUC and still flag fifty false alarms for every real fraud, because the false-positive rate is measured against the vast negative pool. The precision–recall curve tells the story ROC hides: it plots precision (of the cases I flagged, what fraction were right?) against recall (of the real positives, what fraction did I catch?). EQ V4.3 — PRECISION, RECALL, AND THE PR BASELINE $$ \mathrm{Precision}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)}, \qquad \mathrm{Recall}(t) = \mathrm{TPR}(t), \qquad \text{baseline} = \frac{P}{P + N} = \pi $$ Precision has \(\mathrm{FP}\) in its denominator, and \(\mathrm{FP}\) scales with the size of the negative pool — so precision is acutely sensitive to prevalence in a way TPR and FPR are not. The no-skill baseline of a PR curve is the positive rate \(\pi\) (a random classifier holds precision \(\pi\) at every recall), versus the fixed diagonal at AUC = 0.5 for ROC. The area under the PR curve is summarized by Average Precision (AP), the precision averaged over the recall levels at which a new positive is retrieved. The practical rule, widely repeated since Saito & Rehmsmeier's 2015 study and still the consensus in 2026: use ROC/AUC to compare rankers and report discrimination; use PR/AP when positives are rare and the cost of false alarms is concrete. A change that is invisible on ROC can be dramatic on PR precisely because the rare class is where the action is. The two are not rivals — they answer different questions about the same ranking. THE PREVALENCE TRAP "0.97 AUC" is not a deployment guarantee. 
AUC conditions on the true class, so it cannot see that your negatives outnumber positives 100-to-1. Two models with identical AUC can have wildly different false-alarm volumes at any usable operating point. Before you ship a rare-event detector, look at the PR curve and the absolute counts at your chosen threshold — precision, not AUC, is what your reviewers and on-call team will actually feel. At your chosen threshold the model flags 50 cases as positive; 30 of them are truly positive (\(\mathrm{TP} = 30\), \(\mathrm{FP} = 20\)). What is the precision, \(\dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}\)? Of the 50 flagged, 30 are correct and 20 are false alarms: \(\mathrm{Precision} = \dfrac{30}{30+20} = \dfrac{30}{50} = \) 0.6. Sixty percent of your alerts are real — a number ROC's two rates never put in front of you. PYTHON · RUNNABLE IN-BROWSER # Same ranking, two prevalences: AUC barely moves, PR-AUC collapses. import numpy as np rng = np.random.default_rng(2) def auc_ap(pos, neg): s = np.concatenate([pos, neg]) y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]).astype(int) order = np.argsort(-s); ys = y[order] P, N = ys.sum(), (1 - ys).sum() tpr = np.cumsum(ys) / P fpr = np.cumsum(1 - ys) / N auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2) # trapezoid area prec = np.cumsum(ys) / np.arange(1, len(ys) + 1) # precision at each cutoff rec = tpr ap = np.sum(np.diff(np.concatenate([[0], rec])) * prec) # area under PR return auc, ap, P / (P + N) # Identical separability; only the negative pool grows. 
mu = 1.3 for n_neg in (500, 5000, 50000): pos = rng.normal(mu, 1.0, 500) neg = rng.normal(0.0, 1.0, n_neg) auc, ap, pi = auc_ap(pos, neg) print(f"prevalence {100*pi:5.1f}% -> AUC {auc:.3f} PR-AUC {ap:.3f} baseline {pi:.3f}") print("\nAUC is nearly constant (it conditions on the true class);") print("PR-AUC sinks toward the shrinking baseline as positives get rarer.") RUN ▶ edits are live — break it on purpose PR-AUC is summarized two ways and they differ: Average Precision (a step-wise sum, the scikit-learn default) and the trapezoidal area (which can be optimistic because linear interpolation between PR points is not achievable). Report which one you mean — and never compare an AP from one library to a trapezoidal PR-AUC from another. 4.3 The KS statistic & Gini (credit scoring) Credit risk has its own ranking dialect, inherited from decades of scorecard practice. Two numbers dominate model documentation in banking: the Kolmogorov–Smirnov statistic and the Gini coefficient. Both measure the same thing AUC does — how well the score separates good from bad — but in coordinates a risk committee reads fluently. The KS statistic is the largest gap between the two cumulative distributions of the score: the cumulative share of positives (bads) versus the cumulative share of negatives (goods), as you walk the score from one end to the other. EQ V4.4 — THE KS STATISTIC $$ \mathrm{KS} = \max_{t}\;\big|\, F_{+}(t) - F_{-}(t)\,\big| \;=\; \max_{t}\;\big|\,\mathrm{TPR}(t) - \mathrm{FPR}(t)\,\big| $$ \(F_{+}\) and \(F_{-}\) are the cumulative distribution functions of the score among positives and negatives. Because \(\mathrm{TPR} = 1 - F_{+}\) and \(\mathrm{FPR} = 1 - F_{-}\) up to orientation, KS is exactly the maximum vertical distance between the ROC curve and the diagonal — the most-separated operating point. KS ranges 0 (curves identical, no separation) to 1 (perfectly disjoint). 
In retail credit, where KS is conventionally quoted on a 0–100 scale, a KS in the 30s–40s marks a healthy application scorecard; above ~75 usually means a leak, not a triumph. The threshold at which the gap is maximized is a natural — though rarely cost-optimal (§4.5) — cutoff. The Gini coefficient is just AUC rescaled to put a random model at zero and a perfect model at one: EQ V4.5 — GINI FROM AUC $$ \mathrm{Gini} = 2\,\mathrm{AUC} - 1 \qquad\Longleftrightarrow\qquad \mathrm{AUC} = \frac{\mathrm{Gini} + 1}{2} $$ Gini is the ratio of the area between the ROC curve and the diagonal to the area between the perfect curve and the diagonal — twice the area AUC adds above 0.5. A model with AUC 0.80 has Gini 0.60; AUC 0.5 → Gini 0; AUC 1.0 → Gini 1. KS, Gini, and AUC all rank a model's discrimination, but only two of them are interchangeable: Gini is a fixed increasing function of AUC, so those two always order models identically, whereas KS depends on the shape of the separation and can order two models the opposite way from AUC. Banks report all three because regulators expect them, and because a model strong on KS but weak on Gini (or vice versa) signals an unusual score distribution worth a second look. The KS statistic is the maximum gap between the two classes' cumulative distribution functions of the score (equivalently, the largest vertical distance between the ROC curve and the diagonal). True or false? (Answer true or false.) By definition (EQ V4.4), \(\mathrm{KS} = \max_t |F_{+}(t) - F_{-}(t)| = \max_t |\mathrm{TPR}(t) - \mathrm{FPR}(t)|\) — precisely the maximum separation between the cumulative distributions of positives and negatives, which is the largest vertical gap between the ROC curve and the chance diagonal. The statement is true. A scorecard reports \(\mathrm{AUC} = 0.80\). Using EQ V4.5, what is its Gini coefficient (\(2\,\mathrm{AUC} - 1\))? \(\mathrm{Gini} = 2 \times 0.80 - 1 = 1.60 - 1 = \) 0.6. Equivalently, the model captures 60% of the way from a coin flip (Gini 0) to a perfect ranker (Gini 1).
PYTHON · RUNNABLE IN-BROWSER
# KS statistic and Gini from two score distributions (goods vs bads).
import numpy as np

rng = np.random.default_rng(5)
bads = rng.normal(0.65, 0.18, 800).clip(0, 1)      # higher score = riskier
goods = rng.normal(0.40, 0.18, 4000).clip(0, 1)

# KS: walk a common grid of thresholds, compare cumulative shares (EQ V4.4).
grid = np.linspace(0, 1, 501)
F_bad = np.searchsorted(np.sort(bads), grid, side="right") / len(bads)
F_good = np.searchsorted(np.sort(goods), grid, side="right") / len(goods)
gap = np.abs(F_bad - F_good)
ks = gap.max()
ks_at = grid[gap.argmax()]

# AUC by pair-counting -> Gini = 2*AUC - 1 (EQ V4.2, V4.5).
s = np.concatenate([bads, goods])
y = np.concatenate([np.ones(len(bads)), np.zeros(len(goods))])
ranks = s.argsort().argsort() + 1
P, N = len(bads), len(goods)
auc = (ranks[y == 1].sum() - P*(P+1)/2) / (P*N)
gini = 2*auc - 1

print(f"AUC = {auc:.4f}")
print(f"Gini = 2*AUC - 1 = {gini:.4f}")
print(f"KS = {ks:.4f} (max gap at score ~ {ks_at:.2f})")
print("\nKS is the widest separation of the two cumulative curves;")
print("Gini is AUC stretched so chance=0 and perfect=1.")
plot_xy(grid, gap)   # the KS gap as a function of cutoff (plot_xy is supplied by the in-browser runtime)

A note on PSI. The Population Stability Index — the workhorse for detecting that today's score distribution has drifted from the development sample — lives in the same credit-scoring toolbox and shares KS's distribution-comparison spirit, but it answers a different question: not "how well does the score separate good from bad?" but "has the input or score population shifted since the model was built?" PSI is therefore a stability and drift diagnostic, and Chapter 05 develops it in full alongside characteristic-stability and drift detection. Here it is enough to know that KS/Gini measure discrimination, PSI measures population shift, and a healthy KS today says nothing about whether PSI has quietly crept past its alarm threshold.
4.4 Calibration — reliability curves & Brier score Everything so far judged ordering. None of it cares about the actual value of the score, because you can apply any strictly increasing transform — square it, pass it through a sigmoid, raise it to the tenth power — and AUC, KS, and Gini are all unchanged. But a score that drives a decision usually has to mean something: an expected-loss calculation needs a real probability of default, a triage tool needs to say "this patient has a 12% chance," not merely "this patient ranks 47th." Calibration is the property that closes the gap between the number and the world. EQ V4.6 — PERFECT CALIBRATION $$ \Pr\big(\,Y = 1 \,\mid\, \hat{p}(X) = p\,\big) = p \qquad \text{for all } p \in [0, 1] $$ Among all cases the model assigns probability \(p\), a fraction \(p\) should actually be positive. Calibration and discrimination are orthogonal. A model can be perfectly calibrated yet useless at ranking (predict the base rate \(\pi\) for everyone — calibrated, AUC = 0.5), or a flawless ranker yet badly miscalibrated (AUC = 1.0 with every probability squashed toward 0.5). You need both, and you must measure them separately because no single ranking metric will ever catch a calibration failure. You inspect calibration with a reliability curve: bin the predicted probabilities, and for each bin plot the mean prediction against the observed positive frequency. Perfect calibration is the 45° diagonal. A curve that sags below it means the model is over-confident (it says 0.9 but only 0.7 actually happen); a curve that bows above means it is under-confident. The classic shapes have classic causes: modern neural nets and boosted trees tend to over-confidence, naive Bayes pushes probabilities toward the extremes, and a well-regularized logistic regression is often calibrated almost for free. 
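The binning procedure behind a reliability curve can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the site's own cell: the helper name `reliability_table`, the equal-width bins, the synthetic Beta(2, 5) probabilities, and the over-confidence warp (doubling the log-odds) are all hypothetical choices made for the example. It also reports the bin-weighted average gap from the diagonal, which is the usual definition of the Expected Calibration Error.

```python
import numpy as np

def reliability_table(p_hat, y, n_bins=10):
    """Return per-bin (mean prediction, observed frequency, count) and the ECE.

    Hypothetical helper: equal-width bins on [0, 1]; empty bins are skipped.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins-1.
    idx = np.digitize(p_hat, edges[1:-1])
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        mean_pred = p_hat[mask].mean()      # x-coordinate of the reliability point
        obs_freq = y[mask].mean()           # y-coordinate: observed positive rate
        rows.append((mean_pred, obs_freq, int(mask.sum())))
        ece += mask.mean() * abs(mean_pred - obs_freq)   # bin weight times gap
    return rows, ece

rng = np.random.default_rng(0)
p_true = rng.beta(2, 5, 20000)                       # assumed true probabilities
y = (rng.random(p_true.size) < p_true).astype(int)   # labels drawn from them

# Assumed over-confident model: double the log-odds (pushes away from 0.5).
logit = np.log(p_true / (1 - p_true))
p_over = 1 / (1 + np.exp(-2.0 * logit))

rows_cal, ece_cal = reliability_table(p_true, y)
rows_over, ece_over = reliability_table(p_over, y)

print("calibrated model: mean prediction vs observed frequency")
for mean_pred, obs_freq, count in rows_cal:
    print(f"  {mean_pred:.3f}  {obs_freq:.3f}  (n={count})")
print(f"ECE calibrated     = {ece_cal:.4f}")
print(f"ECE over-confident = {ece_over:.4f}")
```

For the calibrated scores the two printed columns should track each other closely; for the warped scores the gap opens up, even though both models rank the cases identically.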
The standard scalar summary is the Brier score — the mean squared error of the probabilities themselves: EQ V4.7 — THE BRIER SCORE $$ \mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{p}_i - y_i\big)^2, \qquad y_i \in \{0, 1\} $$ Lower is better; \(0\) is perfect, and predicting the base rate \(\pi\) for everyone gives \(\pi(1-\pi)\). The Brier score is a strictly proper scoring rule: it is uniquely minimized in expectation by the true probabilities, so you cannot game it by shading your forecasts. Its great virtue is also its limit — it bundles two things together. The Murphy decomposition splits it into reliability (the calibration term) minus resolution plus uncertainty, so a low Brier score can come from sharp-and-calibrated forecasts or, when positives are rare, from a timid model hugging the base rate. Read it alongside the reliability curve, never alone; for a pure calibration number, the Expected Calibration Error (the average bin-wise gap from the diagonal, weighted by bin size) is the common companion. When a model ranks well but is miscalibrated, you do not retrain — you recalibrate the output with a cheap monotone post-processor fit on held-out data: Platt scaling (a two-parameter logistic, slope and intercept, fit to the scores) or isotonic regression (a free-form non-decreasing step function). Platt scaling preserves the ranking exactly, so AUC, KS, and Gini are untouched; isotonic regression never reverses an ordering but can merge neighbouring scores into ties, so those metrics may move slightly. Both bend the reliability curve back onto the diagonal. Isotonic is more flexible but needs more data and can overfit; Platt is robust on small validation sets. This is the standard fix studied by Niculescu-Mizil & Caruana and unchanged in practice today. Two predictions, both for true positives (\(y = 1\)): \(\hat{p}_1 = 0.8\) and \(\hat{p}_2 = 0.9\). Using EQ V4.7, what is the Brier score \(\tfrac{1}{2}\big[(\hat{p}_1 - 1)^2 + (\hat{p}_2 - 1)^2\big]\)? \((0.8 - 1)^2 = (-0.2)^2 = 0.04\) and \((0.9 - 1)^2 = (-0.1)^2 = 0.01\).
The mean is \(\tfrac{1}{2}(0.04 + 0.01) = \tfrac{0.05}{2} = \) 0.025 — the squared-error penalty grows fast as a probability drifts from the truth.

PYTHON · RUNNABLE IN-BROWSER
# Discrimination vs calibration are orthogonal: same ranking, three Brier scores.
import numpy as np

rng = np.random.default_rng(11)
n = 4000
p_true = rng.beta(2, 5, n)                      # the genuine probabilities
y = (rng.random(n) < p_true).astype(int)        # labels drawn from those probabilities

def warp(p, gamma):
    # Scale the log-odds by gamma: a monotone re-map of the probabilities.
    logit = np.log(p / (1 - p)) * gamma         # gamma>1 sharpens, gamma<1 flattens
    return 1 / (1 + np.exp(-logit))

def auc(p):
    # Mann-Whitney pair counting (EQ V4.2).
    ranks = p.argsort().argsort() + 1
    P = y.sum(); N = n - P
    return (ranks[y == 1].sum() - P*(P+1)/2) / (P*N)

def brier(p):
    return np.mean((p - y) ** 2)                # EQ V4.7

# The gamma values 2.0 and 0.5 are illustrative settings.
calibrated = p_true
overconf = warp(p_true, 2.0)
underconf = warp(p_true, 0.5)

print(f"{'model':<16}{'AUC':>8}{'Brier':>9}")
for name, p in [("calibrated", calibrated), ("over-confident", overconf), ("under-confident", underconf)]:
    print(f"{name:<16}{auc(p):>8.3f}{brier(p):>9.4f}")
print("\nAUC is IDENTICAL for all three -- warp() is monotone, so the ranking")
print("never changes. Brier separates them: only the calibrated model is honest")
print("about its probabilities. Discrimination cannot see what calibration measures.")

INSTRUMENT V4.2 — RELIABILITY CURVE & BRIER (interactive; over- vs under-confident models, EQ V4.6–V4.7; controls: confidence γ, bin count; readouts: regime, Brier score, expected calibration error)
The model's probabilities are warped by an exponent \(\gamma\): the dots are binned predictions, the dashed line is perfect calibration. At \(\gamma = 1\) the model sits on the diagonal — honest. Push \(\gamma\) above 1 to make it over-confident (the curve sags below the line; it claims more certainty than it has) and below 1 to make it under-confident (the curve bows above). Watch the Brier score and ECE bottom out exactly at \(\gamma = 1\) — and note the ranking never changes, because \(\gamma\) is a monotone transform: this is calibration moving while discrimination stands still.

4.5 Cutoff selection by cost The ROC curve hands you every operating point the model can reach; it does not tell you which one to stand on. The default of 0.5 is almost always wrong — it is cost-optimal only when the score is a calibrated probability and the two error types cost the same, which is to say almost never.
The right threshold is the one that minimizes expected cost, and that depends on numbers the model never sees: the price of a false positive, the price of a false negative, and the prevalence. EQ V4.8 — EXPECTED COST OF A THRESHOLD $$ \mathbb{E}[\text{cost}](t) = c_{\mathrm{FP}}\cdot\mathrm{FP}(t) + c_{\mathrm{FN}}\cdot\mathrm{FN}(t) \;\;\Big(- \, b_{\mathrm{TP}}\cdot\mathrm{TP}(t) - b_{\mathrm{TN}}\cdot\mathrm{TN}(t)\Big) $$ Each cell of the confusion matrix carries a cost (or benefit); the total is their weighted sum, and you choose the \(t\) that minimizes it. The benefit terms in parentheses are optional: set them to zero when only errors are penalized. The optimal threshold is governed by the cost ratio, not by 0.5. If a missed fraud costs ten times a false alarm, you should accept far more false alarms to catch it — the cutoff slides down accordingly. For a model that emits a true probability \(\hat{p}\), the cost-minimizing rule has a clean closed form. Flagging a case costs \((1-\hat{p})\,c_{\mathrm{FP}}\) in expectation, and leaving it unflagged costs \(\hat{p}\,c_{\mathrm{FN}}\); flagging is worth it when the first is no larger than the second, which rearranges to a single threshold on the probability: EQ V4.9 — THE COST-OPTIMAL PROBABILITY THRESHOLD $$ t^{\star} = \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}} \qquad\Longleftrightarrow\qquad \text{predict positive when } \hat{p} \;\ge\; t^{\star} $$ The optimal cutoff depends only on the ratio of error costs. Equal costs (\(c_{\mathrm{FP}} = c_{\mathrm{FN}}\)) give \(t^{\star} = 0.5\) — the only case the default is right. If a false negative costs \(9\times\) a false positive, \(t^{\star} = \frac{1}{1+9} = 0.1\): flag anything above a 10% probability. This formula is only valid if \(\hat{p}\) is calibrated — which is precisely why §4.4 comes before §4.5. Feed it the over-confident scores of an uncalibrated model and the "optimal" threshold is optimal for a world that does not exist.
Calibrate first, then optimize the cutoff; otherwise you are tuning a decision on a lie. Two things follow. First, the whole pipeline composes: rank well (§4.1–4.3), calibrate the probabilities (§4.4), then place the cutoff by cost (§4.5). Skip the middle step and the last one is meaningless. Second, when costs are uncertain — as they usually are — do not pick a single \(t^{\star}\); sweep the cost ratio and present the operating frontier, so the business owner can see the trade and choose with open eyes rather than inherit a hidden 0.5. A false positive costs \(c_{\mathrm{FP}} = 1\) and a false negative costs \(c_{\mathrm{FN}} = 9\). Using EQ V4.9, at what calibrated probability \(t^{\star}\) should you start predicting positive (\(\tfrac{c_{\mathrm{FP}}}{c_{\mathrm{FP}}+c_{\mathrm{FN}}}\))? \(t^{\star} = \dfrac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}} = \dfrac{1}{1 + 9} = \dfrac{1}{10} = \) 0.1. Because a miss is nine times as expensive as a false alarm, you flag any case with at least a 10% probability — far below the naive 0.5. PYTHON · RUNNABLE IN-BROWSER # Cost-based cutoff: sweep thresholds, find the minimum-cost operating point. 
import numpy as np

rng = np.random.default_rng(9)
n = 6000
y = (rng.random(n) < 0.15).astype(int)          # 15% prevalence
# Illustrative assumption: class-conditional Gaussian scores (positives N(1.5,1),
# negatives N(0,1)) turned into an exactly calibrated probability by Bayes' rule.
x = rng.normal(1.5 * y, 1.0)
log_odds = np.log(0.15 / 0.85) + 1.5 * x - 1.5**2 / 2
p_hat = 1 / (1 + np.exp(-log_odds))

c_fp, c_fn = 1.0, 9.0                           # a miss costs 9x a false alarm
ts = np.linspace(0.01, 0.99, 99)                # candidate thresholds
costs = []
for t in ts:
    pred = (p_hat >= t).astype(int)
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    costs.append(c_fp*fp + c_fn*fn)             # EQ V4.8, errors only
costs = np.array(costs)

t_star_grid = ts[costs.argmin()]                # empirically optimal cutoff
t_star_formula = c_fp / (c_fp + c_fn)           # EQ V4.9 closed form

print(f"closed-form t* = c_FP/(c_FP+c_FN) = {t_star_formula:.3f}")
print(f"grid-search t* (min cost) = {t_star_grid:.3f}")
print(f"cost at t=0.50 (naive default) = {costs[np.argmin(np.abs(ts-0.5))]:.0f}")
print(f"cost at t* (cost-optimal) = {costs.min():.0f}")
print("\nThe default 0.5 leaves money on the table whenever costs are asymmetric.")
plot_xy(ts, costs)   # the cost-vs-threshold curve (U-shaped); plot_xy is supplied by the in-browser runtime

INSTRUMENT V4.3 — COST-BASED CUTOFF OPTIMIZER (interactive; sweep the threshold, EQ V4.8–V4.9; controls: false-positive cost, false-negative cost, prevalence; readouts: cost-optimal t*, cost at t* vs at 0.50, savings vs default)
The U-shaped curve is total expected cost (EQ V4.8) as the threshold sweeps left to right; the mint marker is the cost-minimizing \(t^{\star}\), the grey line is the naive 0.5. Raise FALSE-NEGATIVE COST and watch \(t^{\star}\) slide left: you accept more false alarms in order to avoid expensive misses, landing near the closed form \(c_{\mathrm{FP}}/(c_{\mathrm{FP}}+c_{\mathrm{FN}})\) of EQ V4.9. The "savings vs default" readout is the money the standard 0.5 quietly throws away whenever your costs are asymmetric.

NEXT These metrics all assume the world stays still — the population you scored yesterday is the population you score today. It never does. Chapter 05 turns to stability and drift: the Population Stability Index (PSI) and characteristic stability that catch a shifting input distribution, covariate and concept drift, and the monitoring that tells you when a once-excellent AUC has quietly stopped describing reality.

4.R References Fawcett, T. (2006).
An Introduction to ROC Analysis. Pattern Recognition Letters 27(8) — the canonical tutorial on ROC curves, AUC, and the pair-counting identity (§4.1). Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Machine Learning 77(1) — the influential critique of AUC and the proposed H-measure. Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. ICML 2005 — calibration behavior across model families and the Platt / isotonic fixes (§4.4). Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10(3) — the empirical case for PR over ROC under class imbalance (§4.2). Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original mean-squared-error scoring rule for probabilities (§4.4). Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143(1) — the AUC = Wilcoxon–Mann–Whitney equivalence (EQ V4.2). Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 — modern deep networks are systematically over-confident; temperature scaling as a fix (§4.4).

## MLOPS · Stability & Drift (https://ai-encyclopedia.com/mlops/05-stability-drift.html) Stability & Drift — PSI, CSI & Concept Drift — AI Encyclopedia MODEL VALIDATION & RISK · CHAPTER 05 / 07 Stability & Drift A model is trained once on a fixed snapshot, then deployed into an environment that keeps changing.
As the input distribution and the input-output relationships shift, an unchanged model gradually loses accuracy. Every deployed model decays; the open question is whether you detect the drift before users do. LEVEL CORE READING TIME ≈ 27 MIN BUILDS ON MLOPS 01 · STATS 04 INSTRUMENTS PSI · STREAM DETECTOR · DECAY IN THIS CHAPTER 5.1 Distribution shift 5.2 PSI & CSI 5.3 Detecting drift in production 5.4 Monitoring & retraining triggers 5.5 A model decaying in the wild 5.R References 5.1 Distribution shift — covariate, label & concept drift Supervised learning rests on one quiet assumption: the data you serve is drawn from the same distribution as the data you trained on. Write the joint distribution of inputs \(x\) and labels \(y\) as \(P(x, y) = P(y \mid x)\,P(x)\). Training estimates \(\hat{f}\) against a fixed \(P_{\text{train}}\); production feeds it some \(P_{\text{prod}}\). When the two diverge, the model is being asked a question it was never taught to answer. There are three textbook ways for them to diverge, and they are not interchangeable. EQ V5.1 — THE THREE SHIFTS $$ \underbrace{P_{\text{prod}}(x)\neq P_{\text{train}}(x)}_{\text{covariate shift}},\qquad \underbrace{P_{\text{prod}}(y)\neq P_{\text{train}}(y)}_{\text{label / prior shift}},\qquad \underbrace{P_{\text{prod}}(y\mid x)\neq P_{\text{train}}(y\mid x)}_{\text{concept drift}} $$ Covariate shift moves the inputs (\(P(x)\) changes) while the rule \(P(y\mid x)\) holds — your traffic now skews toward regions of feature space the model rarely saw. Label / prior shift moves the class balance \(P(y)\) — fraud spikes, the base rate moves. Concept drift is the dangerous one: the relationship itself, \(P(y\mid x)\), changes, so the function you learned is now wrong, not merely under-sampled. Critically, only the first two are visible from inputs alone; concept drift can be invisible in the features and surface only as a collapse in accuracy — which is why you monitor both. 
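A toy simulation makes the contrast in EQ V5.1 concrete. This is a minimal sketch under stated assumptions: a one-dimensional feature, a frozen classifier that predicts positive when x > 0, and deterministic labelling rules, all of which are hypothetical choices for illustration. Under covariate shift the input mean moves while the frozen rule stays correct; under concept drift the inputs look unchanged while accuracy falls.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

def model(x):
    # Frozen classifier "learned" at training time: predict positive when x > 0.
    return (x > 0).astype(int)

# Training world: x ~ N(0, 1), true rule y = 1 when x > 0.
x_train = rng.normal(0.0, 1.0, n)
y_train = (x_train > 0).astype(int)

# Covariate shift: inputs move to N(1, 1); the rule P(y | x) is unchanged.
x_cov = rng.normal(1.0, 1.0, n)
y_cov = (x_cov > 0).astype(int)

# Concept drift: inputs still N(0, 1); the rule itself moves to x > 0.8.
x_con = rng.normal(0.0, 1.0, n)
y_con = (x_con > 0.8).astype(int)

acc = {}
for name, x, y in [("training", x_train, y_train),
                   ("covariate shift", x_cov, y_cov),
                   ("concept drift", x_con, y_con)]:
    acc[name] = float((model(x) == y).mean())
    print(f"{name:<16} mean(x) = {x.mean():+.3f}   accuracy = {acc[name]:.3f}")
```

In this idealized setup the frozen model loses nothing under covariate shift because it already matches the true rule everywhere; a real, imperfect model can still degrade when traffic moves into regions it fit poorly. The concept-drift row is the one an input monitor cannot see.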
The distinction is operational, not academic, because it dictates the fix. Covariate shift can sometimes be corrected by importance weighting — reweight training examples by \(w(x) = P_{\text{prod}}(x)/P_{\text{train}}(x)\) so the old data resembles the new — without any fresh labels. Label shift is corrected by re-estimating the priors. Concept drift admits no such trick: the mapping moved, so the model must relearn it from freshly labelled data. Worse, concept drift can be real (the world genuinely changed — a new fraud tactic) or virtual (only \(P(x)\) moved, \(P(y\mid x)\) is intact); Gama et al. carefully separate the two, because virtual drift may need nothing more than a wider training set. Drift also has a shape in time, and the shape decides how you watch for it. The canonical taxonomy (Gama et al., 2014): Pattern What happens Example Sudden An abrupt jump to a new concept a sensor is replaced; a regulation flips overnight Gradual The new concept slowly overtakes the old, the two coexisting for a while a product preference migrating between cohorts Incremental A slow, continuous slide through intermediate concepts inflation eroding a price model month by month Recurring Old concepts return on a cycle (seasonality) holiday shopping, weekday/weekend traffic Seasonality is the great impostor. A recurring pattern looks like drift to a naïve detector but needs no retraining — only a model that already encodes the cycle, or a baseline that compares like-for-like (this December against last December, not against November). Treating seasonality as drift is the most common false alarm in production monitoring, and the reason §5.4 insists on a sensible reference window. A spam filter's inputs look unchanged, but spammers adopt a brand-new phrasing so the same words now mean something different and accuracy collapses. Which of the three shifts is this — covariate, label, or concept ? 
(one word) The feature distribution \(P(x)\) is stable, but the mapping \(P(y\mid x)\) — which words imply spam — has moved. That is concept drift, the one invisible in the inputs and the one that genuinely requires relearning the function (EQ V5.1).

5.2 Population Stability Index (PSI) & CSI Before you can react to drift you have to measure it, and the industry's workhorse — born in credit-risk scorecards and now ubiquitous — is the Population Stability Index. Take a feature (or the model's output score), bin it once on a reference period to get expected proportions \(E_i\), then count the same bins on the live period to get actual proportions \(A_i\). PSI is the symmetric relative-entropy-style sum over bins: EQ V5.2 — POPULATION STABILITY INDEX $$ \mathrm{PSI} \;=\; \sum_{i=1}^{B}\big(A_i - E_i\big)\,\ln\!\frac{A_i}{E_i} $$ \(B\) bins; \(E_i\) is the expected (reference) fraction of mass in bin \(i\), \(A_i\) the actual (current) fraction; both sets sum to 1. Each term is \(\ge 0\) — a bin that gained or lost mass contributes a positive amount, and the larger the relative move, the larger the term. PSI is exactly the symmetrized KL divergence (the Jeffreys divergence) between the two binned distributions: \(\mathrm{KL}(A\Vert E) + \mathrm{KL}(E\Vert A)\). It is zero only when every bin matches and grows without bound as mass migrates. The number is a single scalar you can alarm on. PSI earns its keep because, empirically, its magnitude maps onto a stable rule of thumb that has survived decades of scorecard practice:

PSI | Interpretation | Action
< 0.10 | No significant population change | continue monitoring
0.10 – 0.25 | Moderate shift — worth investigating | investigate, watch closely
> 0.25 | Significant shift | act — retrain or recalibrate

Those thresholds (0.1 and 0.25) are heuristic, not theorems — they predate any distributional theory and assume roughly 10 bins of reasonable size.
Treat them as alarm levels, not laws: the cut-offs do not adapt to sample size. With very large samples a shift that is statistically unmistakable can still sit far below 0.10, and with tiny samples sampling noise alone inflates PSI. Always pair the number with a look at which bins moved. Apply EQ V5.2 to the model's output score and people call it PSI; apply the identical formula to a single input feature and the same community calls it the Characteristic Stability Index (CSI). The math is the same; only the target differs — and the pairing is diagnostic. A stable PSI with a drifting CSI says an input moved but the model's score has so far absorbed it; a drifting PSI tells you the score distribution itself has shifted, which is what actually feeds downstream decisions and cutoffs. EQ V5.3 — ONE PSI BUCKET'S CONTRIBUTION $$ \mathrm{psi}_i \;=\; (A_i - E_i)\,\ln\!\frac{A_i}{E_i}, \qquad \mathrm{PSI} = \sum_i \mathrm{psi}_i $$ The per-bucket term is the unit you actually reason about. A bucket whose expected mass was \(E_i = 0.20\) and whose actual mass rose to \(A_i = 0.30\) contributes \((0.30-0.20)\ln(0.30/0.20) = 0.10 \times \ln 1.5 = 0.10 \times 0.405 = \mathbf{0.0405}\). Sum these across bins and one or two large terms usually dominate — read the bucket breakdown, not just the total, because it points straight at the feature region that moved. Using EQ V5.3, a PSI bucket has expected proportion \( E_i = 0.20 \) and actual proportion \( A_i = 0.30 \). What is this single bucket's contribution to PSI, \( (A_i - E_i)\ln(A_i/E_i) \)? (Use \( \ln 1.5 = 0.405 \).) \( A_i - E_i = 0.30 - 0.20 = 0.10 \); the ratio \( A_i/E_i = 0.30/0.20 = 1.5 \), so \( \ln 1.5 = 0.405 \). The contribution is \( 0.10 \times 0.405 = \) 0.0405. Three contributions of that size (\(3 \times 0.0405 \approx 0.12\)) already clear the 0.10 line, and seven (\(7 \times 0.0405 \approx 0.28\)) push the total past the 0.25 alarm. A PSI above 0.25 usually signals a significant population shift that warrants action (retrain or recalibrate). True or false? (Answer true or false.)
By the standard scorecard rule of thumb, PSI < 0.1 is stable, 0.1–0.25 is a moderate shift worth investigating, and PSI > 0.25 is a significant shift that calls for action. So the statement is true — with the honest caveat that the 0.25 line is a heuristic, not a proof, and must be read alongside sample size and the per-bucket breakdown.

PYTHON · RUNNABLE IN-BROWSER
# PSI between an expected (reference) and actual (live) binned distribution.
import numpy as np

# Fixed reference scores; fit 10 equal-width bins ONCE on the reference.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 5000)       # training-time score distribution
live = rng.normal(0.5, 1.1, 5000)      # production: shifted right + wider
edges = np.linspace(-4, 4, 11)         # 10 bins, frozen on the reference
E = np.histogram(ref, edges)[0] / len(ref)
A = np.histogram(live, edges)[0] / len(live)

eps = 1e-6                             # guard empty bins (ln 0 is undefined)
E = np.clip(E, eps, None); A = np.clip(A, eps, None)
terms = (A - E) * np.log(A / E)        # EQ V5.3, one per bin
psi = terms.sum()                      # EQ V5.2

print("bin E A contribution")
for i in range(len(E)):
    print(f"{i:2d} {E[i]:.3f} {A[i]:.3f} {terms[i]:+.4f}")

# Bands follow the rule-of-thumb table; the two non-STABLE labels are illustrative names.
band = "STABLE" if psi < 0.10 else ("MODERATE" if psi < 0.25 else "SIGNIFICANT")
print(f"\nPSI = {psi:.4f} -> {band}")
print(f"biggest single bucket: bin {int(np.argmax(terms))} "
      f"({terms.max():.4f}) -- read the breakdown, not just the total.")

INSTRUMENT V5.1 — PSI CALCULATOR (interactive; shift a distribution across the 0.10 / 0.25 lines, EQ V5.2; controls: mean shift, spread change, bins B; readouts: PSI, verdict, top bucket term)
The grey outline is the reference (expected) distribution; the mint bars are the live (actual) one. Bins are frozen on the reference. Push MEAN SHIFT from 0 and watch PSI climb through the dashed 0.10 line into the 0.25 danger zone; widening the spread alone moves both tails and also raises PSI even with zero mean shift. Add bins to see the total wobble — PSI is bin-count sensitive, which is why a fixed binning matters.
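The identity quoted with EQ V5.2, that PSI equals the sum of the two KL divergences, can be checked numerically. The sketch below uses made-up bin proportions, and the helper names `kl` and `psi` are hypothetical; it also recomputes the single-bucket example from EQ V5.3.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for two discrete distributions with strictly positive entries.
    return float(np.sum(p * np.log(p / q)))

def psi(a, e):
    # EQ V5.2: sum over bins of (A_i - E_i) * ln(A_i / E_i).
    return float(np.sum((a - e) * np.log(a / e)))

# Made-up reference (E) and live (A) bin proportions; each sums to 1.
E = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
A = np.array([0.05, 0.15, 0.35, 0.30, 0.15])

terms = (A - E) * np.log(A / E)            # per-bucket contributions (EQ V5.3)
jeffreys = kl(A, E) + kl(E, A)             # symmetrized KL divergence

print(f"PSI                   = {psi(A, E):.6f}")
print(f"KL(A||E) + KL(E||A)   = {jeffreys:.6f}")
print(f"per-bucket terms      = {np.round(terms, 4)}")

# The worked bucket from EQ V5.3: E_i = 0.20, A_i = 0.30.
bucket = (0.30 - 0.20) * np.log(0.30 / 0.20)
print(f"single-bucket example = {bucket:.4f}")
```

Every per-bucket term is non-negative because \(A_i - E_i\) and \(\ln(A_i/E_i)\) always share a sign, which is why the total can only grow as bins move.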
5.3 Detecting drift in production PSI is a batch statistic: you compute it over a window. Production also wants streaming detectors that raise a flag the moment a process changes, and they split cleanly by what they watch. Watch the inputs (label-free) Labels usually arrive late — a loan defaults months after approval, a churn label resolves a quarter later — so the first line of defence watches the feature distribution, which is available instantly. The tools are statistical two-sample tests between a reference window and a recent window: Kolmogorov–Smirnov for a continuous feature: the maximum gap between the two empirical CDFs. Chi-squared for a categorical feature: observed-vs-expected counts per category. PSI / CSI (§5.2) as a thresholded scalar, the operations-friendly summary. Maximum Mean Discrepancy (MMD) for the joint multivariate input, when per-feature tests miss a shift in the correlations. The hard truth of label-free detection: it can only ever see covariate shift. A pure concept drift — \(P(y\mid x)\) moves while \(P(x)\) stays put — leaves every input test silent while accuracy quietly rots. Input monitoring is necessary and cheap, but it is not sufficient. Watch the errors (label-dependent) The only thing that directly sees concept drift is the model's own error stream. The classic online detector is DDM (Drift Detection Method): treat the per-example error as a Bernoulli sequence whose error rate \(p_t\) should fall or hold as a stable model sees more data. Track the running rate and its standard deviation \(s_t = \sqrt{p_t(1-p_t)/t}\), remember the minimum point \((p_{\min}, s_{\min})\) reached, and alarm when the current point drifts a few standard deviations above that best: EQ V5.4 — DDM WARNING & DRIFT LEVELS $$ \text{warning: } p_t + s_t \ge p_{\min} + 2\,s_{\min}, \qquad \text{drift: } p_t + s_t \ge p_{\min} + 3\,s_{\min} $$ As long as the model is stable, \(p_t\) drifts down and \(p_{\min}+2s_{\min}\) tracks the best-so-far error. 
When the error climbs two standard deviations above that floor, DDM enters a warning zone (start buffering recent data); at three it declares drift (the buffered window becomes the retraining set). The \(2\sigma/3\sigma\) bands are the Gaussian-tail logic of a control chart applied to a learning curve. Variants — EDDM (watches the distance between errors, better for gradual drift), ADWIN (an adaptive window with a formal false-positive bound), Page-Hinkley (a CUSUM on the error) — trade sensitivity against false alarms. The honest framing is a detection-theory trade-off, not a free lunch: a sensitive detector catches drift early but cries wolf on noise and seasonality; a conservative one is quiet but lets the model rot longer before it fires. There is no setting that is both early and silent — you tune the operating point to the cost of a missed drift versus the cost of a needless retrain.

PYTHON · RUNNABLE IN-BROWSER
# Concept-drift detection with a rolling error monitor (DDM-style, EQ V5.4).
import numpy as np

rng = np.random.default_rng(1)
# A stream of 0/1 errors: stable ~8% for 600 steps, then concept drift -> ~32%.
n1, n2 = 600, 400
errors = np.concatenate([rng.random(n1) < 0.08, rng.random(n2) < 0.32]).astype(int)

# Assumed details: running error rate over the whole stream, with the floor
# tracked only after a 30-step warm-up and once at least one error has been seen.
p_min, s_min = np.inf, np.inf          # best (lowest) p + s so far
warn_at = drift_at = None
n_err = 0
for t, e in enumerate(errors, start=1):
    n_err += e
    p = n_err / t                      # running error rate p_t
    s = np.sqrt(p * (1 - p) / t)       # its standard deviation s_t
    if t > 30 and p > 0 and p + s < p_min + s_min:   # new best -> reset the floor
        p_min, s_min = p, s
    if drift_at is None and p + s >= p_min + 3 * s_min and t > 30:
        drift_at = t
    elif warn_at is None and p + s >= p_min + 2 * s_min and t > 30:
        warn_at = t

print(f"true change point: {n1}")
print(f"DDM warning raised at step: {warn_at}")
print(f"DDM drift declared at step: {drift_at}")
if drift_at is not None:               # guard: the detector may never fire
    print(f"detection delay: {drift_at - n1} steps after the real shift")
plot_xy(np.arange(len(errors)), np.cumsum(errors) / np.arange(1, len(errors) + 1))   # plot_xy is supplied by the in-browser runtime

INSTRUMENT V5.2 — STREAMING DRIFT DETECTOR (interactive; rolling z-test on a feature, warning → drift; controls: drift magnitude σ, window W, sensitivity z; readouts: detected at step, detection delay, false alarms pre-drift)
A feature streams across 240 steps.
It is stationary until the dashed change line at step 120, then its mean jumps by DRIFT MAGNITUDE. The detector keeps a reference window and a recent window of width \(W\) and fires when their means differ by more than \(z\) standard errors; the mint marker is the first detection. Crank sensitivity down (low \(z\)) to catch tiny drifts at the cost of false alarms before the change; raise it for silence-but-late. There is no setting that is both early and quiet — that is the detection trade-off made visible. 5.4 Monitoring & retraining triggers Detection is only half the loop. A monitoring system has to turn a signal into a decision: do nothing, alert a human, or retrain. Three trigger philosophies, roughly in order of maturity: Scheduled retraining. Refit on a fixed cadence — nightly, weekly, monthly. Dead simple and predictable, but it is both wasteful (you retrain when nothing changed) and dangerous (you wait until the next cycle while the model rots). It is a default, not an answer. Performance-triggered. Retrain when a live metric — accuracy, AUC, calibration, a business KPI — crosses a threshold. The gold standard, because it reacts to what you actually care about, but it needs ground-truth labels, and those often arrive with a long, costly delay. Drift-triggered. Retrain when an input statistic (PSI/CSI, KS, a streaming detector) crosses a threshold. Available immediately and label-free — the proxy you reach for while labels are in flight — but it can fire on harmless covariate shift and stay silent on pure concept drift. In practice you run drift triggers as an early warning and performance triggers as the authoritative one. Every trigger needs a reference window to compare against, and the choice is consequential. A fixed reference (the training set) detects drift relative to the world the model actually learned — the correct baseline for "is my model still valid?" 
A sliding reference (last month) detects change but normalizes away slow incremental drift, so the model can boil like the proverbial frog while every week looks like the last. Most mature stacks keep the training distribution as the anchor and add seasonality-aware comparisons on top. The cost side has its own arithmetic. Suppose drift erodes value at a roughly linear rate after each retrain, so the average performance gap you carry scales with the time between retrains. Retrain too often and you pay compute and review for nothing; too rarely and you eat accumulating decay. The optimum balances the two — a classic inventory-style trade-off: EQ V5.5 — RETRAIN-CADENCE COST $$ \text{Cost}(T) \;=\; \underbrace{\frac{c_{\text{retrain}}}{T}}_{\text{amortized retrain}} \;+\; \underbrace{\tfrac{1}{2}\,d\,T}_{\text{average decay carried}} \qquad\Longrightarrow\qquad T^\star = \sqrt{\frac{2\,c_{\text{retrain}}}{d}} $$ \(T\) is the interval between retrains, \(c_{\text{retrain}}\) the cost (compute + validation + risk) of one retrain, and \(d\) the per-unit-time rate at which value decays after a fresh fit. The first term falls with \(T\) (retrain less, amortize more); the second rises with \(T\) (carry more accumulated decay on average). Setting the derivative to zero gives the square-root cadence \(T^\star=\sqrt{2c_{\text{retrain}}/d}\) — the same shape as the economic-order-quantity rule. Faster-drifting models (large \(d\)) should retrain more often; expensive retrains (large \(c\)) push the cadence out. It is a back-of-envelope model, not gospel — real decay is rarely linear and seasonality breaks the smoothness — but it gives the right instinct for the dial. Using EQ V5.5, one retrain costs \( c_{\text{retrain}} = 200 \) units and value decays at \( d = 1 \) unit per day. What is the cost-optimal interval between retrains, \( T^\star = \sqrt{2c_{\text{retrain}}/d} \), in days? \( 2 c_{\text{retrain}}/d = 2 \times 200 / 1 = 400 \), and \( \sqrt{400} = \) 20 days. 
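The square-root rule of EQ V5.5 can be checked numerically. The sketch below assumes the linear-decay cost model stated above; the helper names `cadence_cost` and `optimal_interval` are illustrative and not part of any library. It scans a grid of candidate intervals and compares the brute-force minimum with the closed form.

```python
import numpy as np

def cadence_cost(T, c_retrain, d):
    """EQ V5.5: amortized retrain cost plus average decay carried."""
    return c_retrain / T + 0.5 * d * T

def optimal_interval(c_retrain, d):
    """Closed-form minimizer T* = sqrt(2 c / d)."""
    return np.sqrt(2.0 * c_retrain / d)

c_retrain, d = 200.0, 1.0
grid = np.arange(1.0, 101.0)               # candidate intervals: 1..100 days
costs = cadence_cost(grid, c_retrain, d)
t_grid = grid[np.argmin(costs)]            # brute-force minimizer on the grid
t_star = optimal_interval(c_retrain, d)    # closed form

print(f"grid minimum:  T = {t_grid:.0f} days, cost = {costs.min():.1f}")
print(f"closed form:   T* = {t_star:.1f} days")
print(f"halved cost:   T* = {optimal_interval(c_retrain / 2, d):.1f} days")
print(f"doubled decay: T* = {optimal_interval(c_retrain, 2 * d):.1f} days")
```

At the optimum the two terms of the cost are equal, which is a quick sanity check on any cadence you pick: if amortized retrain cost and carried decay are far apart, the interval is probably on the wrong side of the valley.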
Halve the retrain cost and the cadence tightens to \(\sqrt{200}\approx14\) days; double the drift rate and it tightens to \(\sqrt{200}\approx14\) days too — the square root makes the dial gentle. PITFALLS Four ways drift monitoring goes wrong: (1) alarm fatigue — a detector tuned so hot it fires on every Monday; teams learn to ignore it and miss the real one. (2) seasonality mistaken for drift — comparing December to November instead of to last December. (3) retraining on contaminated data — the freshly buffered window includes the very anomaly that triggered the alarm, so you retrain the model to expect the disaster. (4) silent label delay — your performance trigger cannot fire because the labels for the drifted period have not arrived yet, and your input triggers cannot see concept drift; the gap between them is where models die quietly. 5.5 A model decaying in the wild Put the pieces together and a deployed model's life has a characteristic arc: a fresh fit performs near its validation score, holds for a while, then bends downward as the world drifts away from the snapshot it learned. The slope of that bend is the decay rate \(d\); a retrain snaps performance back toward the top and the clock restarts. The whole job of this chapter is to see the bend early enough — through PSI on the inputs and error monitors on the outputs — to retrain on the way down rather than at the bottom. EQ V5.6 — PERFORMANCE DECAY & SAWTOOTH RECOVERY $$ \mathrm{Acc}(t) \;=\; \mathrm{Acc}_0 \;-\; d\,(t - t_{\text{last}}) \;+\; \varepsilon_t, \qquad \text{retrain at } t \;\Rightarrow\; t_{\text{last}}\leftarrow t,\;\; \mathrm{Acc}\leftarrow \mathrm{Acc}_0 $$ Between retrains, accuracy falls roughly linearly from its post-fit ceiling \(\mathrm{Acc}_0\) at rate \(d\), buried in measurement noise \(\varepsilon_t\); a retrain resets the elapsed-time clock \(t-t_{\text{last}}\) and lifts performance back toward the ceiling. 
Trace this over many cycles and you get the familiar sawtooth: decay, snap, decay, snap. The area between the ceiling and the sawtooth is the value lost to drift, and retraining more often trades compute to shrink it, exactly the EQ V5.5 balance. Real curves are noisier, sometimes step rather than slope, and a retrain on bad data can fail to recover at all.

PYTHON · RUNNABLE IN-BROWSER
# Sawtooth decay (EQ V5.6): no-retrain vs periodic retrain -> value recovered.
import numpy as np
rng = np.random.default_rng(2)

T = 180                                   # days in service
acc0, d, noise = 0.90, 0.0015, 0.004      # ceiling, decay/day, measurement noise

# Scenario A: never retrain -> monotone decay from the ceiling.
never = acc0 - d * np.arange(T) + rng.normal(0, noise, T)

# Scenario B: retrain every `period` days -> reset the clock each cycle.
period = 30
retr = acc0 - d * (np.arange(T) % period) + rng.normal(0, noise, T)

print(f"day 0: never {never[0]:.3f} retrained {retr[0]:.3f}")
print(f"day 90: never {never[90]:.3f} retrained {retr[90]:.3f}")
print(f"day 179: never {never[179]:.3f} retrained {retr[179]:.3f}")
print(f"\nmean accuracy, never-retrain: {never.mean():.3f}")
print(f"mean accuracy, retrain @{period}d: {retr.mean():.3f}")
print(f"value recovered by retraining: {retr.mean() - never.mean():+.3f} acc")
plot_xy(np.arange(T), retr)               # the sawtooth: decay, snap, decay, snap

INSTRUMENT V5.3 — PERFORMANCE-DECAY SIMULATOR
SAWTOOTH RECOVERY · EQ V5.5 / V5.6
Controls: DECAY RATE d (acc/period) 0.0015 · RETRAIN EVERY 30 · RETRAIN COST c 200. Readouts: MEAN ACCURACY HELD · TOTAL COST (RETRAIN + DECAY) · COST-OPTIMAL T★.
The grey ceiling is the post-fit accuracy \(\mathrm{Acc}_0\); the mint sawtooth is live accuracy decaying at rate \(d\) and snapping back at every retrain. The shaded gap between them is value lost to drift. Slide RETRAIN EVERY down to chase the ceiling, but watch TOTAL COST, which adds the price of all those retrains via EQ V5.5.
The readout marks \(T^\star=\sqrt{2c/d}\): set the interval near it and the total cost sits in its valley. Raise \(d\) (faster-drifting world) and the optimal cadence tightens; raise \(c\) and it loosens. NEXT Drift monitoring tells you that the model changed; it never tells you why. When PSI spikes and accuracy bends, the next question is always "which feature, which interaction, which case?" — and answering it is the job of the explainability toolkit. Chapter 06: SHAP and its game-theoretic guarantees, LIME's local surrogates, partial dependence and ICE, and the honest limits of post-hoc explanation. 5.R References Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys 46(4) — the canonical taxonomy of drift types and adaptation strategies (§5.1). Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift. Data Mining and Knowledge Discovery 30 — a quantitative framework for describing how concepts drift over time. Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM). SBIA 2004, LNCS 3171 — the error-rate drift detector behind EQ V5.4. Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). SIAM SDM 2007 — an adaptive-window detector with a formal false-positive bound. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD). JMLR 13 — the maximum-mean-discrepancy test for multivariate covariate-shift detection (§5.3). Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning. MIT Press — the reference volume formalizing covariate, prior, and concept shift (EQ V5.1). 
## MLOPS · Explainability (https://ai-encyclopedia.com/mlops/06-explainability.html)

MODEL VALIDATION & RISK · CHAPTER 06 / 07
Explainability — SHAP, LIME & Partial Dependence

A model that predicts well is not the same as a model you can account for. When a loan is denied, a tumour flagged, or a transaction blocked, "the gradient-boosted ensemble said so" will not satisfy a customer, an engineer, or a regulator. Shapley values attribute each prediction to its input features, and the attributions, added to a base value, sum exactly to the score the model produced.

LEVEL: CORE | READING TIME: ≈ 28 MIN | BUILDS ON: ML 13 · STATS 04 | INSTRUMENTS: FORCE PLOT · PDP/ICE · LIME

IN THIS CHAPTER: 6.1 Why explainability · 6.2 Global vs local · 6.3 Permutation & PDP/ICE · 6.4 LIME · 6.5 SHAP · 6.R References

6.1 Why explainability — trust, debugging, regulation

A high cross-validated score (Chapter 01) tells you a model is accurate on data that looks like your test set. It tells you nothing about why a particular prediction came out the way it did, whether the model leans on a feature it should never have seen, or whether it will hold up when the world shifts under it. Explainability (also called interpretability) is the discipline of answering "why this output?" in terms a human can check. It serves three distinct masters.

| Driver | The question it asks | What an explanation buys |
| --- | --- | --- |
| Trust | Should a clinician, underwriter, or operator act on this? | A reason the human can sanity-check against domain knowledge before deferring to the model. |
| Debugging | Why is this prediction wrong / surprising? | Exposes leakage, spurious correlations, and shortcut features (the snow-in-the-background-means-husky failures). |
| Regulation | Can you justify an adverse decision to the subject and an auditor? | A per-decision record that satisfies a legal right to an explanation. |

The regulatory pressure is no longer hypothetical. In the United States, the Equal Credit Opportunity Act and its Regulation B have for decades required lenders to give applicants the specific principal reasons for an adverse credit action; the CFPB confirmed in 2023 that this duty applies to opaque machine-learning models too: "the algorithm did it" is not a lawful reason. In the EU, the GDPR grants a right to meaningful information about the logic of automated decisions, and the AI Act (in force from 2024, with high-risk obligations phasing in through 2026–2027) mandates transparency and human oversight for high-risk systems such as credit scoring and medical devices. Explanations are now a compliance artifact, not a research nicety.

A LOAD-BEARING CAVEAT
An explanation is a model of a model, and models can lie. Every method in this chapter is a post-hoc approximation of an opaque function: it tells you what the model appears to do near a point, not the ground truth of the world. Post-hoc explanations can be unstable (small input changes flip them), unfaithful (they describe a surrogate, not the model), and even adversarially manipulable. The honest position, argued forcefully by Rudin (2019), is that for genuinely high-stakes decisions an inherently interpretable model (a sparse linear model, a short rule list, a small tree) is often preferable to a black box with an explanation bolted on. Use post-hoc tools, but never confuse them with understanding.

6.2 Global vs local explanations

Explanations split along one axis above all others: scope. A global explanation describes the model's behaviour over the whole input distribution, for example "income is the most important feature on average." A local explanation describes one prediction, for example "this applicant was denied chiefly because of three recent late payments."
The two answer different questions and must not be substituted for one another.

| Scope | Answers | Methods in this chapter | Typical consumer |
| --- | --- | --- | --- |
| Global | What does the model do overall? | permutation importance, PDP | model owner, validator |
| Local | Why this single prediction? | ICE, LIME, SHAP | end user, regulator, debugger |

A feature can be globally unimportant yet decisive for one row, and globally important yet irrelevant for another. Aggregating local explanations (for example, averaging absolute SHAP values over many rows) recovers a global one, which is exactly how SHAP unifies the two scopes (§6.5), but you cannot run the inference backwards: a single global importance bar does not tell any individual applicant why they were refused. The right-to-explanation laws of §6.1 are fundamentally demands for local explanations.

A second, orthogonal axis is model access. Model-agnostic methods (LIME, permutation importance, PDP, KernelSHAP) treat the model as a black box and only call its predict function, so one implementation works for any model. Model-specific methods exploit internal structure for speed or fidelity: TreeSHAP reads the splits of a tree ensemble to compute exact Shapley values in polynomial time; integrated gradients use a neural network's backward pass. Agnostic methods are universal but typically slower; specific methods are fast but tied to an architecture.

A useful sanity rule: choose the explanation scope to match the decision being made. A board reviewing whether to deploy a fraud model wants a global picture; a customer disputing a blocked card wants a local one. Reporting the wrong scope is a more common error than computing either one incorrectly.

6.3 Permutation importance & PDP/ICE

The cheapest global tool needs nothing but the trained model and a held-out set. Permutation importance asks a blunt question: if I destroy a feature's information by shuffling its column, how much worse does the model get? A feature the model relies on will see its score collapse when scrambled; a feature it ignores will not move the needle.
EQ V6.1 — PERMUTATION IMPORTANCE $$ \mathrm{Imp}_j \;=\; s\big(\hat{f},\, X,\, y\big) \;-\; \frac{1}{K}\sum_{k=1}^{K} s\big(\hat{f},\, X^{(\pi_k, j)},\, y\big) $$ \(s\) is any score where higher is better (\(R^2\), accuracy, AUC); \(X^{(\pi_k, j)}\) is the data with column \(j\) randomly permuted under permutation \(\pi_k\), leaving every other feature and the labels untouched. Importance is the drop in score caused by breaking the link between feature \(j\) and the target, averaged over \(K\) shuffles to tame the randomness. Because it only calls \(\hat{f}\), it is fully model-agnostic and uses the same predict-and-score loop for any estimator. Two warnings come with it. First, importance is measured on data the model was scored against, so prefer a held-out set: permutation importance on the training set rewards overfitting. Second — the one experts always raise — correlated features split and hide each other's importance. If two columns carry nearly the same information, shuffling one leaves the model propped up by the other, so both look unimportant even though the pair is decisive. With strong collinearity, permutation importance under-reports; cluster correlated features and permute the cluster, or reach for Shapley values, which share credit more fairly. Permutation importance measures the drop in model score when a single feature's column is randomly shuffled (breaking its link to the target) while all other features and the labels are left intact. True or false? (Answer true or false.) That is exactly EQ V6.1: \(\mathrm{Imp}_j = s(\hat f, X, y) - \tfrac1K\sum_k s(\hat f, X^{(\pi_k,j)}, y)\). The first term is the score on intact data; the second is the score after column \(j\) is permuted. A feature the model leans on causes a large score drop when scrambled; an ignored feature causes none. The statement is true. PYTHON · RUNNABLE IN-BROWSER # Permutation importance from scratch: rank features by the R^2 drop on shuffle. 
import numpy as np
rng = np.random.default_rng(0)

# A model that truly uses x0 strongly, x1 mildly, and ignores x2, x3.
N, d = 400, 4
X = rng.normal(0, 1, (N, d))
w_true = np.array([3.0, 1.0, 0.0, 0.0])
y = X @ w_true + rng.normal(0, 0.5, N)
beta = np.linalg.lstsq(X, y, rcond=None)[0]        # the fitted "black box"

def r2(Xp):
    pred = Xp @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

base = r2(X)                                       # score on intact data (EQ V6.1, term 1)
print(f"baseline R^2 = {base:.4f}\n")

names, imp = ["x0", "x1", "x2", "x3"], []
for j in range(d):
    drops = []
    for _ in range(10):                            # K = 10 shuffles, average them
        Xs = X.copy()
        Xs[:, j] = rng.permutation(Xs[:, j])       # break the feature j <-> target link only
        drops.append(base - r2(Xs))                # the score drop = importance
    imp.append(np.mean(drops))

for j in np.argsort(imp)[::-1]:                    # rank: most important first
    print(f"{names[j]}: importance {imp[j]:+.4f}")
print("\nx0 dominates, x1 is mild, x2/x3 ~ 0 -- the model's true reliance, recovered.")
plot_xy(range(d), sorted(imp, reverse=True))

Permutation importance ranks features but says nothing about shape: is the effect of income linear, threshold-like, or U-shaped? The partial dependence plot (PDP), introduced by Friedman with gradient boosting, answers that. Pick a value \(v\) for feature \(j\), set the feature to \(v\) in every row of the data while leaving the other features as they are, average the predictions, and sweep \(v\) across its range:

EQ V6.2 — PARTIAL DEPENDENCE
$$ \mathrm{PD}_j(v) \;=\; \mathbb{E}_{X_{-j}}\!\big[\,\hat{f}(v,\, X_{-j})\,\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \hat{f}\big(v,\, x^{(i)}_{-j}\big) $$
\(X_{-j}\) is every feature except \(j\); the expectation marginalizes them out, leaving the average effect of feature \(j\) alone as a curve. The Monte-Carlo estimate just averages the model over the actual dataset with column \(j\) overwritten by \(v\).
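The Monte-Carlo estimate in EQ V6.2 can be sketched in a few lines. This is an illustrative sketch, not a library routine: the toy model `f_hat`, the helper `partial_dependence`, and the synthetic data are all assumptions made for the example. The helper returns the per-row curves as well as their mean, so the averaging step is explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3))              # synthetic data (assumption)

def f_hat(X):
    """Hypothetical black box: the effect of x0 flips sign with x1."""
    return X[:, 0] * np.sign(X[:, 1]) + 0.5 * X[:, 2]

def partial_dependence(model, X, j, grid):
    """For each v in grid, overwrite column j with v and predict on every row."""
    curves = np.empty((len(grid), X.shape[0]))
    for k, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, j] = v                        # column j := v for all rows
        curves[k] = model(Xv)               # one prediction per row
    return curves.mean(axis=1), curves      # PD_j(v) and the per-row curves

grid = np.linspace(-2, 2, 9)
pd0, rows0 = partial_dependence(f_hat, X, 0, grid)
pd2, rows2 = partial_dependence(f_hat, X, 2, grid)
print("PD of x0:", np.round(pd0, 2))
print("PD of x2:", np.round(pd2, 2))
```

In this toy setup the curve for x2 rises with slope 0.5, as the model dictates. Every individual row responds to x0 with slope +1 or -1, yet those responses largely cancel in the mean, so the PDP for x0 comes out far flatter than any single row's curve.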
Its blind spot is the same as permutation importance: by overwriting \(j\) for all rows it can create off-manifold inputs (a pregnant 80-year-old) when \(j\) is correlated with the others, and by averaging it hides heterogeneity — opposite effects on two subgroups cancel to a flat line. Individual conditional expectation (ICE) curves fix that second flaw by not averaging: plot one line per row, each showing how that single prediction would move as \(j\) sweeps. The PDP is exactly the average of all the ICE lines. When the ICE lines are parallel, the PDP tells the whole story; when they fan out or cross, the feature interacts with others and the average is a lie of omission. PDP for the headline, ICE to check it is honest. INSTRUMENT V6.1 — PDP / ICE EXPLORER AVERAGE EFFECT vs PER-ROW LINES · EQ V6.2 INTERACTION STRENGTH 0.0 ICE LINES SHOWN 14 FEATURE SHAPE THRESHOLD LINEAR PDP RANGE (MAX − MIN) — ICE SPREAD AT MID — PDP TRUSTWORTHY? — The bold mint curve is the PDP — the model's average response as the feature sweeps left → right. The faint grey lines are ICE curves, one per row, and the PDP is literally their average. Set INTERACTION STRENGTH to 0 and the ICE lines stay parallel: the average tells the whole story. Crank it up and the lines fan out and cross — now the flat-looking average is hiding subgroups that move in opposite directions, and the readout flips to "MISLEADING". This is precisely why you never trust a PDP without its ICE. 6.4 LIME — local surrogate models Global tools blur the individual case. LIME — Local Interpretable Model-agnostic Explanations, Ribeiro et al. (2016) — takes the opposite stance: forget the global function, just explain one prediction by approximating the black box with a simple, interpretable model in a small neighbourhood around that point. The intuition is that any wiggly decision surface looks roughly linear if you zoom in far enough. 
The recipe for explaining a single instance \(x\): (1) generate a cloud of perturbed samples around \(x\); (2) ask the black box \(\hat{f}\) for its prediction on each; (3) weight each sample by how close it is to \(x\) with a kernel \(\pi_x\); (4) fit a sparse linear model \(g\) to that weighted, labelled cloud. The coefficients of \(g\) are the explanation: a signed weight per feature, valid only near \(x\). EQ V6.3 — THE LIME OBJECTIVE $$ \xi(x) \;=\; \underset{g \in G}{\arg\min}\; \underbrace{\mathcal{L}\big(\hat{f},\, g,\, \pi_x\big)}_{\text{local fidelity}} \;+\; \underbrace{\Omega(g)}_{\text{simplicity}}, \qquad \pi_x(z) = \exp\!\left(\frac{-D(x,z)^2}{\sigma^2}\right) $$ \(G\) is a family of interpretable models (sparse linear, short trees). \(\mathcal{L}\) penalizes \(g\) for disagreeing with \(\hat{f}\) on samples \(z\), each weighted by proximity \(\pi_x(z)\); \(\Omega\) penalizes complexity (e.g. number of nonzero weights). The result is the simplest surrogate that is faithful to the black box right around \(x\) — explicitly trading global accuracy for local interpretability. The neighbourhood width \(\sigma\) is a free knob, and that is exactly LIME's weak spot: the explanation can swing with the kernel width and with the random sample, so two runs can disagree. LIME's appeal is that it is genuinely model-agnostic and produces a human-readable handful of "because feature X was high and feature Y was low" reasons. Its documented failure modes are equally real: the explanations can be unstable (re-running with a new random seed or a different bandwidth perturbs the weights), the linear surrogate can be a poor fit where the surface is sharply curved, and the choice of neighbourhood is more art than science. SHAP can be seen as the principled answer to "how should I have weighted those samples?" — which is the bridge to §6.5. 
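The four-step recipe above can be sketched from scratch. This is a minimal illustration under stated assumptions, not the `lime` package's implementation: the black box `f_hat`, the Gaussian perturbations, the Euclidean distance used for \(D\), and the plain weighted least-squares surrogate (with no sparsity penalty \(\Omega\)) are all choices made for the example.

```python
import numpy as np

def f_hat(Z):
    """Hypothetical black box: smooth but nonlinear in two features."""
    return np.sin(Z[:, 0]) + Z[:, 1] ** 2

def lime_explain(model, x, sigma=0.3, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0, sigma, (n_samples, x.size))    # (1) perturb around x
    y = model(Z)                                         # (2) query the black box
    dist2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-dist2 / sigma ** 2)                      # (3) kernel pi_x from EQ V6.3
    A = np.column_stack([np.ones(n_samples), Z - x])     # intercept + centred features
    sw = np.sqrt(w)                                      # sqrt-weights turn WLS into OLS
    coef = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]   # (4) weighted fit
    return coef[1:]                                      # local slopes = the explanation

x = np.array([0.0, 1.0])
print("sigma=0.3:", np.round(lime_explain(f_hat, x, sigma=0.3), 2))
print("sigma=2.0:", np.round(lime_explain(f_hat, x, sigma=2.0), 2))
print("reseeded :", np.round(lime_explain(f_hat, x, sigma=2.0, seed=1), 2))
```

With a narrow kernel the slopes should land near the true local gradient of this toy function at \(x\), namely \((\cos 0,\ 2\cdot 1) = (1, 2)\). Widening \(\sigma\) lets curvature leak into the fit, and reseeding at the wide setting shifts the weights, which is the instability described above.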
INSTRUMENT V6.2 — LIME LOCAL SURROGATE BLACK-BOX BOUNDARY → LOCAL LINEAR FIT · EQ V6.3 NEIGHBOURHOOD WIDTH σ 0.30 PERTURBATION SAMPLES 120 RESEED ▶ LOCAL SURROGATE — LOCAL FIT (WEIGHTED R²) — STABILITY OVER RESEEDS — The curved blue line is the black box's true decision boundary; the white dot is the instance we want to explain. Each RESEED draws a fresh cloud of perturbations (sized by σ), weighted by how close they sit to the dot, and fits a straight mint surrogate — LIME's local linear explanation. Shrink σ and the surrogate hugs the curve tightly (high local fidelity); widen it and the line tries to span a curved region and fits badly. Press RESEED a few times at a wide σ and watch the surrogate slope wander: that wobble is LIME's notorious instability, made visible. 6.5 SHAP — Shapley values for features SHAP — SHapley Additive exPlanations, Lundberg & Lee (2017) — is the most-used method in the field because it rests on the one result everything else lacks: a uniqueness theorem. Borrow the Shapley value from cooperative game theory, where it is the provably unique fair way to split a coalition's payout among its players. Cast the prediction as the payout and the features as the players, and you get the only feature-attribution method satisfying a set of common-sense axioms simultaneously. The Shapley value of feature \(j\) is its average marginal contribution across every possible order in which features could be added to the prediction. "Marginal contribution" means: how much does the model's output change when \(j\) joins a coalition \(S\) of features that are already "present" (set to their instance value) versus "absent" (marginalized to the background)? 
EQ V6.4 — THE SHAPLEY VALUE $$ \phi_j \;=\; \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!}\;\Big[\, v\big(S \cup \{j\}\big) - v(S) \,\Big] $$ \(F\) is the full feature set, \(S\) any coalition not containing \(j\), and \(v(S)\) the model's expected output when only the features in \(S\) are known. The bracket is \(j\)'s marginal contribution when it joins \(S\); the combinatorial weight is the fraction of orderings in which exactly that coalition precedes \(j\), so \(\phi_j\) is the average marginal contribution over all orderings. It is the unique attribution satisfying efficiency, symmetry, dummy (a feature that never changes \(v\) gets 0), and additivity. The axiom that matters most for an audit is efficiency (also called local accuracy): the attributions and the base value must add up to exactly the prediction. Nothing is invented, nothing is lost — every unit of "why this number and not the average" is assigned to some feature. EQ V6.5 — EFFICIENCY: THE EXPLANATION ADDS UP $$ \hat{f}(x) \;=\; \underbrace{\phi_0}_{\text{base value } \mathbb{E}[\hat f]} \;+\; \sum_{j=1}^{|F|} \phi_j \qquad\Longleftrightarrow\qquad \sum_{j=1}^{|F|} \phi_j \;=\; \hat{f}(x) - \mathbb{E}[\hat{f}(X)] $$ \(\phi_0\) is the base value — the average prediction over the background, what you would guess knowing nothing about this row. The SHAP values are the signed pushes from that baseline to the actual prediction, and they must sum to the gap \(\hat f(x) - \mathbb{E}[\hat f]\) exactly. This is what turns a SHAP explanation into a literal audit trail: a regulator can check that the reasons sum to the decision. The force plot in Instrument V6.3 is this equation drawn as arrows. Computing EQ V6.4 exactly costs \(2^{|F|}\) coalition evaluations — fine for a handful of features, hopeless for hundreds. 
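A direct transcription of EQ V6.4 makes that cost visible. The sketch below is illustrative only: the toy model `f`, the all-zeros baseline standing in for "feature absent", and the helper names are assumptions for the example. It enumerates every coalition, applies the combinatorial weight, and counts how many distinct coalitions had to be evaluated.

```python
import numpy as np
from itertools import combinations
from math import factorial

def f(x):
    """Toy model (assumption): linear terms plus one x0*x1 interaction."""
    return 3 * x[0] + 2 * x[1] - 1 * x[2] + 4 * x[0] * x[1]

x = np.array([1.0, 1.0, 1.0])           # instance to explain
baseline = np.zeros(3)                  # "absent" features take this value
n = len(x)
cache = {}                              # coalition -> v(S), each evaluated once

def v(S):
    """Coalition value: features in S use x, the rest use the baseline."""
    key = frozenset(S)
    if key not in cache:
        z = baseline.copy()
        for i in key:
            z[i] = x[i]
        cache[key] = f(z)
    return cache[key]

phi = np.zeros(n)
for j in range(n):
    others = [i for i in range(n) if i != j]
    for size in range(n):                               # |S| = 0 .. n-1
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for S in combinations(others, size):
            phi[j] += weight * (v(S + (j,)) - v(S))     # weighted marginal contribution

print("phi:", phi)
print("coalitions evaluated:", len(cache), "= 2 **", n)
print("efficiency holds:", np.isclose(phi.sum(), f(x) - f(baseline)))
```

Even with caching, the number of distinct coalitions doubles with every added feature, so this exact form is only a reference implementation for small \(|F|\).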
SHAP's practical contribution is fast estimators: KernelSHAP recovers the Shapley values as the solution of a specially weighted linear regression (the principled cousin of LIME), and TreeSHAP computes them exactly for tree ensembles in time polynomial in the tree size — which is why SHAP and gradient boosting (Chapter on boosting) are the default explainability pairing in production. A persistent subtlety experts flag: how you define "feature absent" — marginalizing with the marginal distribution (interventional) versus the conditional (observational) — changes the values when features are correlated, and the two are answering subtly different causal questions. A model's base (mean) value is \( \mathbb{E}[\hat f] = 0.30 \) and its prediction for one row is \( \hat f(x) = 0.82 \). By the efficiency axiom (EQ V6.5), what must the sum of that row's SHAP values equal, \( \hat f(x) - \mathbb{E}[\hat f] \)? Efficiency forces the attributions plus the base value to reconstruct the prediction, so the SHAP values sum to \( \hat f(x) - \mathbb{E}[\hat f] = 0.82 - 0.30 = \) 0.52. Whatever the individual feature pushes are, positive and negative, they must total exactly +0.52 — that is the property that makes the explanation an audit trail. PYTHON · RUNNABLE IN-BROWSER # Exact Shapley values for a tiny 3-feature model -- and the efficiency check. import numpy as np from itertools import permutations # Model: linear part + one pairwise interaction between x0 and x1. def f(x): return 3*x[0] + 2*x[1] - 1*x[2] + 4*x[0]*x[1] x = np.array([1.0, 1.0, 1.0]) # the instance we explain baseline = np.array([0.0, 0.0, 0.0]) # "feature absent" = baseline value def v(S): # coalition value: S use x, rest use baseline z = baseline.copy() for i in S: z[i] = x[i] return f(z) base_value = v([]) # phi_0 = f(baseline) pred = v([0, 1, 2]) # f(instance) # Shapley = average marginal contribution over ALL feature orderings (EQ V6.4). 
phi = np.zeros(3) orders = list(permutations(range(3))) for order in orders: seen = [] for i in order: before = v(seen); seen = seen + [i] phi[i] += v(seen) - before # marginal contribution of i in this order phi /= len(orders) print(f"base value phi_0: {base_value:.1f}") print(f"shapley values phi: {phi}") # -> [5. 4. -1.] print(f"sum of shapley values: {phi.sum():.1f}") print(f"prediction - base: {pred - base_value:.1f}") print(f"efficiency holds?: {np.isclose(phi.sum(), pred - base_value)}") print("the 4*x0*x1 interaction is split evenly: +2 to x0, +2 to x1 (symmetry).") RUN ▶ edits are live — break it on purpose INSTRUMENT V6.3 — SHAP FORCE PLOT FEATURE PUSHES FROM BASE → PREDICTION · EQ V6.5 RECENT LATE PAYMENTS 0 INCOME (k/yr) 60 CREDIT UTILISATION % 30 ACCOUNT AGE (yrs) 6 LOAN / INCOME RATIO 3.0 BASE VALUE E[f] — PREDICTED APPROVAL — Σ φ = PRED − BASE? — A toy loan-approval score (additive log-odds, so contributions are exact Shapley values). The plot is EQ V6.5 drawn as forces: every prediction starts at the base value — the average approval probability — and each feature pushes it right (mint, toward approval) or left (red, toward denial) by its SHAP value. The arrows always land exactly on the prediction, and the bottom readout confirms Σφ = pred − base to the decimal. Drag RECENT LATE PAYMENTS up and watch a single red arrow grow until it alone flips the decision — that red bar is the principal adverse-action reason §6.1's regulations demand. NEXT Explanations make a model legible; governance makes it accountable. Knowing why a prediction happened is one pillar of model risk — but a deployed model also needs versioning, reproducible pipelines, monitoring against the drift of Chapter 05, audit logs, and a human chain of responsibility. Chapter 07 assembles those pieces into MLOps and governance: how to ship, watch, and answer for a model in production once the math is done. 6.R References Lundberg, S. M. & Lee, S.-I. (2017). 
A Unified Approach to Interpreting Model Predictions. NeurIPS 2017 — the SHAP framework and its uniqueness theorem (§6.5, EQ V6.4–V6.5). Ribeiro, M. T., Singh, S. & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD 2016 — LIME, local surrogate explanations (§6.4, EQ V6.3). Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — partial dependence plots (§6.3, EQ V6.2). Lundberg, S. M., Erion, G. G. & Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv — TreeSHAP, exact polynomial-time Shapley values for trees (§6.5). Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Computational and Graphical Statistics 24(1) — ICE curves (§6.3). Shapley, L. S. (1953). A Value for n-Person Games. Contributions to the Theory of Games II — the original Shapley value from cooperative game theory. Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1 — the case for inherently interpretable models (§6.1 caveat). Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Open textbook — the standard practical reference covering every method in this chapter.

## MLOPS · MLOps & Model Governance (https://ai-encyclopedia.com/mlops/07-mlops-governance.html)

MODEL VALIDATION & RISK · CHAPTER 07 / 07
MLOps & Model Governance

Training a model is the easy part.
Keeping it trustworthy after the notebook closes requires a reproducible pipeline, a registry that records which artifact is live, monitoring that catches drift, and an audit trail an examiner will accept. MLOps is the set of practices that turns a one-off model into a maintained production asset with monitoring, lineage, and sign-off. LEVEL ADVANCED READING TIME ≈ 28 MIN BUILDS ON MLOPS 01–06 · ML 06 INSTRUMENTS MATURITY · PIPELINE DAG · RETRAIN TRIGGER IN THIS CHAPTER 7.1 Notebook → production 7.2 Tracking & registries 7.3 CI/CD & retraining 7.4 Monitoring & lineage 7.5 Model risk & governance 7.R References 7.1 From notebook to production pipeline Almost every real ML failure happens outside the model. The famous diagram from Sculley et al. makes the point: the box labelled "ML code" is a small square surrounded by configuration, data collection, feature extraction, serving infrastructure, monitoring, and process management — the model is a few percent of the system. A notebook captures only that small square, and it captures it badly: hidden cell-execution order, an un-pinned environment, a CSV that was edited by hand, a random seed nobody set. None of that survives a redeploy. The discipline that fixes this is to treat the path from raw data to served prediction as a single, versioned, re-runnable pipeline — a directed acyclic graph (DAG) of typed stages. Every edge is an artifact (a dataset, a feature table, a model file, an eval report); every node is a deterministic transform pinned to a code commit and a config. The asset you ship is not the weights file — it is the recipe that regenerates the weights file. EQ V7.1 — REPRODUCIBILITY AS A FUNCTION OF INPUTS $$ \text{artifact} \;=\; f\big(\,\text{data}_{\,v},\ \text{code}_{\,c},\ \text{config}_{\,h},\ \text{env}_{\,e},\ \text{seed}_{\,s}\,\big) $$ A run is reproducible iff fixing all five inputs fixes the output. 
\(v\) is a content hash of the data snapshot, \(c\) a git commit, \(h\) the hyperparameter config, \(e\) the pinned environment (container digest + library versions), \(s\) the RNG seed. Drop any one and you have a story, not a result. The single most common reproducibility failure is an un-versioned \(\text{data}_v\): the same code on "today's table" silently trains a different model tomorrow. Pipelines exist to make all five explicit and to cache stages whose inputs have not changed. The payoff is concrete. If stage inputs are content-addressed, a pipeline can skip any stage whose inputs are unchanged and rerun only what is downstream of an edit — the same idea as a build system, applied to data and models. Change one feature definition and the framework knows exactly which models must be retrained and which evals must be rerun; change nothing and the whole pipeline is a cache hit. NOTEBOOK 1 machine Hidden state, manual order, un-pinned env. Reproducible by luck. SCRIPT + CONFIG N runs Deterministic given inputs, but no lineage and no caching. VERSIONED PIPELINE DAG Typed stages, content-addressed artifacts, partial reruns, full lineage. There is an honest tension here. Notebooks are unmatched for exploration — the friction of a full pipeline would kill the iteration speed that finds the model in the first place. The mature workflow is therefore not "no notebooks" but a clear promotion boundary: explore freely in a notebook, then graduate the winning recipe into pipeline stages before anything touches production. The maturity instrument below is exactly a tour of that boundary. INSTRUMENT V7.1 — MLOPS MATURITY SELF-ASSESSMENT DECISION-TREE WALKTHROUGH · LEVELS 0–4 QUESTION 1 / 4 — ↺ RESTART MATURITY LEVEL — STAGE — NEXT MOVE — Answer four yes/no questions about your own team. 
The path walks the standard MLOps maturity ladder — Level 0 (manual notebook) → 1 (automated pipeline) → 2 (CI/CD for the pipeline) → 3 (automated retraining) → 4 (full governance with continuous monitoring and sign-off). The "next move" is the single highest-leverage thing to build next. The lesson: maturity is a ladder, and you do not get to skip rungs — automated retraining (Level 3) is dangerous without the monitoring and registry of the levels below it. 7.2 Experiment tracking & model registries Two systems sit at the heart of any serious ML platform, and they answer two different questions. An experiment tracker answers "what did we try, and what happened?" Every run logs its parameters, its metrics, the data snapshot hash, the git commit, and the produced artifacts. Months later you can ask "which run produced this checkpoint, on what data, with what learning rate, and what was its held-out AUC?" and get an exact answer instead of an archaeology project. The tracker is the lab notebook the literal notebook never was — searchable, comparable, immutable. A model registry answers a sharper, scarier question: "which artifact is live right now, who approved it, and what do I roll back to?" The registry is not storage — it is a state machine over model versions, with explicit stages and gated transitions: EQ V7.2 — THE REGISTRY STATE MACHINE $$ \texttt{None} \;\xrightarrow{\text{register}}\; \texttt{Staging} \;\xrightarrow{\;\text{eval + sign-off}\;}\; \texttt{Production} \;\xrightarrow{\;\text{superseded}\;}\; \texttt{Archived} $$ Each arrow is a guarded transition: a model may only enter Production when it passes the gate (offline evals clear thresholds, a human with the right role approves, the deployment config is pinned). The registry records who pulled the lever and when. The one invariant that matters: at most one version is Production per deployment slot, and you can name it in one query. A team that cannot answer "what is live?" 
in seconds does not have a registry — it has a folder. The registry is what makes a rollback a one-line operation instead of a 2 a.m. incident. Because every version's full lineage (EQ V7.1) is attached, reverting to the previous Production model is just re-pointing the serving slot at an immutable, already-validated artifact — no rebuild, no retrain, no guessing. The same machinery powers champion/challenger rollouts (§7.3) and multi-tenant serving where many model versions coexist behind one gateway. System Answers Keyed on Failure if absent Experiment tracker What did we try & what happened? run id Can't reproduce or compare past results Model registry What is live, who approved, roll back to what? model version No fast rollback; "what's in prod?" is unanswerable Artifact / data store Where are the bytes, by content hash? content digest Lineage breaks; artifacts mutate under you A pragmatic caveat: in 2026 the tracker and registry are often the same platform (MLflow, Weights & Biases, Vertex, SageMaker, and others bundle both), and for LLM/agent systems a "model version" increasingly means a tuple of base-model id, adapter or system-prompt version, and tool schema. The abstractions are unchanged; only the artifact got more interesting. By the registry invariant in EQ V7.2, how many model versions may be in the Production stage for a single deployment slot at one time? The registry is a state machine whose key invariant is that each deployment slot has at most one live version — that is precisely what lets you answer "what is in prod?" in one query and roll back deterministically. So the answer is 1. (Several versions may sit in Staging or Archived; only one is Production per slot.) 7.3 CI/CD & automated retraining Software CI/CD tests code. ML CI/CD must also test data and models — three things change independently, and any one can break production. 
The mature pipeline therefore runs three layers of gates, often summarized as the ML Test Score (Breck et al.): tests for the data (schema, distributions, expected-value constraints), tests for the model (does training converge, does it beat a baseline, is it robust to perturbations), and tests for the infrastructure (can it be served, rolled back, reproduced). A model never goes live just because it trained. It goes live only if it clears an offline gate against the current Production model on a frozen holdout, and — for high-stakes systems — survives an online gate (a canary or A/B test on real traffic). The offline decision is the champion/challenger rule: the newly trained challenger replaces the live champion only if it is decisively better. EQ V7.3 — CHAMPION / CHALLENGER PROMOTION RULE $$ \text{promote} \;\iff\; \big(M_{\text{chal}} - M_{\text{champ}} \;>\; \delta\big)\ \ \wedge\ \ \big(G_{\text{chal}} \;\ge\; G_{\min}\big) $$ \(M\) is the primary holdout metric (AUC, F1, revenue-per-session…), measured for both models on the same frozen evaluation set. \(\delta > 0\) is a margin that must exceed the metric's noise (recall the holdout standard error of MLOPS · EQ V1.2) so you are not promoting on a coin flip. \(G\) are guardrail metrics — latency, fairness gaps, calibration, a forbidden-behavior rate — that must each clear a floor \(G_{\min}\). The challenger is presumed guilty: it must beat the champion by a real margin and break no guardrail, or the champion stays. A challenger that wins on the headline metric while quietly regressing latency or a subgroup's error rate must not ship. The same logic, applied to a stream of automatically retrained models, gives continuous training (CT): on a schedule or a trigger (§7.4), the pipeline retrains on fresh data, runs the full test suite, and proposes a challenger to the gate. 
Crucially, automated retraining does not mean automated deployment — the gate (and, for regulated models, a human sign-off) stays in the loop. Fully closed-loop retraining without a gate is how a feedback bug or a poisoned data window silently degrades a model over weeks. In a champion/challenger setup, the challenger is promoted to production only if it beats the current champion on the holdout metric (by a margin, and without breaking guardrails). True or false? (Answer true or false.) This is exactly the promotion rule of EQ V7.3: \(M_{\text{chal}} - M_{\text{champ}} > \delta\) and the guardrails hold. The incumbent is the default; a challenger must earn its place by a real margin. So the statement is true. A challenger is scored on a frozen holdout of \( m = 2000 \) rows where the champion's accuracy is \( p = 0.90 \). To promote only on real signal, set the margin to the 95% half-width of the holdout estimate, \( \delta = 1.96\sqrt{p(1-p)/m} \). What is \( \delta \), to three decimals? \( p(1-p) = 0.90 \times 0.10 = 0.09 \); divide by \( m = 2000 \) → \( 4.5\times10^{-5} \); square root → \( 0.006708 \) (the standard error). Multiply by \( 1.96 \): \( 1.96 \times 0.006708 = 0.01315 \approx \) 0.013. A challenger must beat the champion by at least ~1.3 accuracy points here, or the gap is indistinguishable from sampling noise — the same \(1/\sqrt{m}\) law from EQ V1.2. PYTHON · RUNNABLE IN-BROWSER # Champion/challenger promotion from holdout metrics (EQ V7.3). 
import numpy as np

def promote(M_champ, M_chal, delta, guardrails):
    # guardrails: list of (name, value, floor, higher_is_better)
    # when higher_is_better is False, "floor" acts as a ceiling the value must not exceed
    metric_ok = (M_chal - M_champ) > delta
    breaches = []
    for name, val, floor, higher in guardrails:
        ok = (val >= floor) if higher else (val <= floor)
        if not ok:
            breaches.append(name)
    return (metric_ok and not breaches), metric_ok, breaches

# Illustrative holdout metrics and first guardrail reading (assumed values).
M_champ, M_chal, delta = 0.880, 0.895, 0.010
guardrails = [
    ("p99_latency_ms",  180.0, 200.0, False),   # must be <= 200   -> OK
    ("fairness_gap",    0.030, 0.050, False),   # must be <= 0.050 -> OK
    ("calibration_ece", 0.021, 0.040, False),   # must be <= 0.040 -> OK
]

dec, mok, breaches = promote(M_champ, M_chal, delta, guardrails)
print(f"champion AUC: {M_champ:.3f}")
print(f"challenger AUC: {M_chal:.3f} ({M_chal-M_champ:+.3f}, margin needed {delta})")
print(f"beats margin?: {mok}")
print(f"guardrail breaches: {breaches if breaches else 'none'}")
print(f"\nDECISION: {'PROMOTE challenger' if dec else 'KEEP champion'}")

# Counterfactual: same AUC win, but latency now blows the guardrail.
g2 = guardrails[:]
g2[0] = ("p99_latency_ms", 240.0, 200.0, False)
print("if p99 latency were 240ms ->",
      "PROMOTE" if promote(M_champ, M_chal, delta, g2)[0] else "KEEP champion (guardrail)")

INSTRUMENT V7.2 — PIPELINE-DAG ANATOMY
TYPED STAGES · ARTIFACTS · GATES
Controls: EDIT A STAGE (DIRTIES DOWNSTREAM). Readouts: STAGES TO RERUN · CACHE HITS (SKIPPED) · SELECTED STAGE.

Click any stage to mark it edited. The DAG is a typical ML pipeline: ingest → validate → features → train → evaluate → register → serve, with evaluate as the champion/challenger gate before register. Editing a stage dirties it and everything downstream (mint) while upstream stages stay cached (grey). Click features and watch train/evaluate/register/serve all light up; click serve and nothing upstream reruns. This is why content-addressed pipelines are cheap to iterate: you only pay for what actually changed.

7.4 Monitoring, lineage & reproducibility
A deployed model decays even though its weights never change, because the world the weights describe keeps moving. Two distinct decays matter, and confusing them is a classic mistake:
Data drift (covariate shift).
The input distribution \(P(x)\) moves — a new traffic source, a seasonal effect, an upstream feature that started arriving null. The model is still "correct," but it is now answering questions about a population it was not trained on. Concept drift. The relationship \(P(y \mid x)\) itself changes — fraud tactics evolve, user tastes shift, a competitor changes the market. Even on identical inputs, the right answer is now different. Only concept drift necessarily degrades accuracy; data drift may or may not. Labels arrive late or never, so you cannot always watch accuracy directly. The first line of defence is therefore an unsupervised drift signal on the inputs and the predictions. The workhorse is the Population Stability Index (PSI), which compares a baseline (training) distribution against a recent production window, bucketed: EQ V7.4 — POPULATION STABILITY INDEX $$ \mathrm{PSI} \;=\; \sum_{i=1}^{B} \big(a_i - e_i\big)\,\ln\!\frac{a_i}{e_i} $$ For each of \(B\) buckets, \(e_i\) is the expected (baseline) fraction of mass and \(a_i\) the actual (recent) fraction; the sum is a symmetrized relative-entropy distance. Industry rule of thumb: PSI < 0.1 = stable, 0.1–0.25 = moderate shift (investigate), > 0.25 = significant shift (act). PSI is a symmetrized cousin of the KL divergence (INFO THEORY · EQ S2.3): each term is \((a_i-e_i)\ln(a_i/e_i)\) rather than \(a_i\ln(a_i/e_i)\), so it is always non-negative and order-insensitive. Its blind spot, which experts insist on: PSI detects marginal drift only — a change in the joint distribution that leaves every marginal unchanged is invisible to it. Drift on its own is only a warning. The decisive signal, when labels eventually land, is a service-level objective (SLO) on the live metric, with an alert that fires on a sustained breach rather than a single bad point — one noisy day is not an incident, a week below the floor is. 
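EQ V7.4 and its rule-of-thumb bands can be sketched in a few lines. This is a minimal illustration, assuming the bucket fractions have already been computed; the function names `psi` and `psi_band`, the `eps` floor that guards against empty buckets, and the four-bucket example data are all hypothetical choices, not part of any monitoring library.

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over pre-bucketed fractions (EQ V7.4).

    eps is an assumed floor that keeps empty buckets from producing log(0).
    """
    e = np.clip(np.asarray(expected, dtype=float), eps, None)
    a = np.clip(np.asarray(actual, dtype=float), eps, None)
    e, a = e / e.sum(), a / a.sum()          # renormalise after clipping
    return float(np.sum((a - e) * np.log(a / e)))

def psi_band(value):
    """Map a PSI value onto the rule-of-thumb bands quoted in the text."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "moderate shift (investigate)"
    return "significant shift (act)"

# Hypothetical bucket fractions: training baseline vs. a recent production window.
baseline = [0.20, 0.30, 0.30, 0.20]
recent   = [0.30, 0.30, 0.25, 0.15]

score = psi(baseline, recent)
print(f"PSI = {score:.4f} -> {psi_band(score)}")
print(f"swapping the arguments gives the same value: {psi(recent, baseline):.4f}")
```

With these made-up fractions the total comes to roughly 0.064, inside the "stable" band, and swapping baseline and recent leaves the value unchanged, which is the symmetry the caption describes. Because the function only ever sees one feature's bucket fractions, it also makes the blind spot concrete: nothing in it can notice a change in how features move together.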
In a PSI computation (EQ V7.4), one bucket had expected mass \( e = 0.20 \) at baseline but actual mass \( a = 0.30 \) in the recent window. What is that single bucket's contribution \( (a-e)\ln(a/e) \)? (Use \( \ln 1.5 = 0.405 \).)

\( a - e = 0.30 - 0.20 = 0.10 \); \( a/e = 1.5 \), so \( \ln(a/e) = 0.405 \). The contribution is \( 0.10 \times 0.405 = 0.0405 \approx \) 0.04. A handful of buckets shifting like this can push total PSI past the 0.1 "investigate" line, which is the trigger the retraining-policy instrument below explores.

PYTHON · RUNNABLE IN-BROWSER
# Model-monitoring SLO-breach flag from a daily metric stream.
import numpy as np
rng = np.random.default_rng(11)

# 30 days of live accuracy: stable, then a drift-driven slide after day 18.
# The level, slope, noise and thresholds below are illustrative (assumed values).
days = np.arange(30)
base = np.where(days < 18, 0.91, 0.91 - 0.006 * (days - 18))
acc = base + rng.normal(0.0, 0.008, size=30)

SLO, WINDOW, N_BREACH = 0.88, 5, 3            # floor, rolling-mean width, sustained-breach length
roll = np.convolve(acc, np.ones(WINDOW) / WINDOW, mode="valid")   # rolling mean

run = fire = 0
fire_day = None
for i, r in enumerate(roll):
    run = run + 1 if r < SLO else 0           # consecutive windows below the floor
    if run >= N_BREACH and fire_day is None:
        fire_day = i + WINDOW - 1             # map rolling index back to a calendar day
    fire = max(fire, run)

print(f"SLO floor: {SLO:.2f} rolling window: {WINDOW}d")
print(f"min rolling acc: {roll.min():.3f} (raw min {acc.min():.3f})")
print(f"longest breach run: {fire} day(s) threshold: {N_BREACH}")
print(f"BREACH ALERT: {'FIRE on day '+str(fire_day) if fire_day is not None else 'none'}")
# plot_xy: plotting helper assumed to be provided by the in-browser runtime
plot_xy(np.arange(WINDOW-1, 30), roll)        # the smoothed curve crossing the SLO floor

Behind every alert sits lineage: the graph that connects a live prediction back through the model version, the training run, the data snapshot, and the feature code that produced it (EQ V7.1). When an incident hits, lineage answers the only questions that matter at 2 a.m.: which model is responsible, what was it trained on, what changed since it was clean, and what do we roll back to? A monitor without lineage tells you the patient has a fever; lineage tells you why.

7.5 Model risk management & governance
Everything so far is engineering.
Governance is the layer that makes those engineering controls accountable — who is allowed to deploy, who signed off, what evidence exists, and what happens when the model causes harm. In regulated industries this is not optional. The canonical reference is the US Federal Reserve / OCC supervisory letter SR 11-7 (2011), "Guidance on Model Risk Management", which defines model risk as the potential for adverse consequences from decisions based on incorrect or misused models, and prescribes three controls that map almost one-to-one onto good MLOps. EQ V7.5 — MODEL RISK (SR 11-7 FRAMING) $$ \text{Model risk} \;=\; \underbrace{P(\text{model is wrong})}_{\text{fundamental error}} \;+\; \underbrace{P(\text{model is misused})}_{\text{wrong context / inputs}} $$ SR 11-7's central insight is that risk has two sources, not one: a model can be wrong (bad data, bad assumptions, overfitting), and a perfectly good model can be misused (applied outside its validated domain, fed inputs it never saw, trusted beyond its accuracy). Both must be managed. The guidance's three pillars are: (1) robust development & documentation — the pipeline, lineage, and reproducibility of §§7.1–7.4; (2) independent validation — a second team, not the builders, challenges the model before and after deployment; (3) governance, policies & controls — an inventory of every model, defined ownership, sign-off, and ongoing monitoring. "Effective challenge" — critical review by people with the authority and incentive to push back — is the phrase the document hangs everything on. This regulatory framing has since been generalized far beyond banking. The EU AI Act (in force from 2024, with high-risk obligations phasing in through 2026–2027) imposes risk-tiered duties — risk management systems, data governance, logging, human oversight, and post-market monitoring — that are recognisably the same controls. 
The NIST AI Risk Management Framework (2023) and ISO/IEC 42001 (2023, the first AI management-system standard) give voluntary but increasingly expected scaffolding. The through-line across all of them is a small set of governance artifacts every mature ML organisation now maintains: Artifact Question it answers Lineage to MLOps Model inventory Which models exist, who owns each, what is their risk tier? registry (§7.2) Model card / documentation Intended use, training data, metrics, limitations, fairness tracker + lineage Validation report Independent challenge: does it work, where does it fail? eval gate (§7.3) Sign-off / approval record Who authorized production, on what evidence, when? registry transition Monitoring & incident log How is it behaving live; what went wrong and when? monitors (§7.4) CONTESTED Governance can calcify into theatre. The honest tension in 2026: heavyweight model-risk processes designed for slow-moving credit models fit awkwardly onto fast-iterating ML and especially onto LLM/agent systems, where the "model" is a prompt-plus-tools assembly that changes weekly and whose failure modes (hallucination, prompt injection, jailbreaks) are not what SR 11-7 imagined. Two failure modes bracket the debate: too little governance ships unvalidated models into high-stakes decisions; too much produces a compliance pantomime where teams generate documents nobody reads to satisfy a checklist, while real risk goes unmonitored. The defensible middle is risk-tiered governance: match the weight of the controls to the stakes of the decision, automate the evidence-gathering so documentation is a by-product of the pipeline rather than a separate chore, and keep "effective challenge" genuinely effective. US SR 11-7 is regulatory supervisory guidance on model risk management (development & documentation, independent validation, and governance/controls). True or false? (Answer true or false.) 
SR 11-7 is the 2011 supervisory letter issued by the US Federal Reserve and the OCC, "Guidance on Model Risk Management." It defines model risk and lays out the three pillars in EQ V7.5. So the statement is true — and it is the document most ML governance programs still trace their lineage to. INSTRUMENT V7.3 — RETRAINING-TRIGGER POLICY EXPLORER PSI · METRIC SLO · SCHEDULE · EQ V7.4 INPUT DRIFT — PSI 0.12 LIVE ACCURACY 0.90 DAYS SINCE LAST RETRAIN 14 POLICY DECISION — FIRED TRIGGER(S) — ACTION — Three independent triggers can each demand a retrain: input drift (PSI past the 0.25 act-line, or 0.1 watch-line), a performance SLO breach (live accuracy below the 0.88 floor), or a staleness deadline (a max-age schedule). Slide each control and watch which bars cross their threshold. The lesson is policy design: a good retraining policy is the OR of a few cheap, observable signals — and even when a trigger fires, the action is "retrain & propose a challenger to the gate," never "auto-deploy." Drift alone never ships a model; the gate of §7.3 still has to say yes. NEXT You now have the operational backbone — pipelines, registries, monitoring, and the governance that makes a model an accountable asset. That closes the Model Validation & Risk track. From here the manual turns to the model itself: the LLM Field Manual opens with foundations — tokens, embeddings, and the next-token objective that everything in production is ultimately serving. 7.R References Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015 — the "ML code is a small box" argument behind §7.1. Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management. Supervisory letter — the model-risk framework and three pillars of §7.5 (EQ V7.5). Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. 
IEEE Big Data 2017 — the data/model/infra test layers of §7.3.
Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv:2205.02302 — a current reference architecture for pipelines, CI/CD, and CT.
National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1 — the Govern/Map/Measure/Manage scaffolding generalizing §7.5.
European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act. Official Journal — risk-tiered obligations (risk management, data governance, logging, human oversight) phasing in through 2026–2027.

========================================================================
DEEP LEARNING
========================================================================

## DL · Deep Learning Foundations (https://ai-encyclopedia.com/dl/01-foundations.html)
Deep Learning Foundations — Init, Norm & Residuals — AI Encyclopedia
DEEP LEARNING · CHAPTER 01 / 07
Deep Learning Foundations
A network with enough layers can in principle represent almost any function, yet for years deep stacks could not be trained. Activations and gradients are multiplied repeatedly as they pass through depth, so they explode or vanish geometrically. Stacking layers only works once you control how the signal flows through depth, which careful initialization, normalization, and residual connections together achieve.
LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON NEURAL NETS · ML 07–08 INSTRUMENTS INIT · BATCHNORM · RESIDUAL IN THIS CHAPTER 1.1 From MLP to deep networks 1.2 Initialization — Xavier & He 1.3 Batch normalization 1.4 Residual connections 1.5 Regularization 1.R References 1.1 From MLP to deep networks A multilayer perceptron (MLP) is an alternating stack of affine maps and pointwise nonlinearities. Each layer takes the previous activation \(h^{(\ell-1)}\), applies a learned weight matrix and bias, then a nonlinearity \(\phi\): EQ N1.1 — A FORWARD LAYER $$ z^{(\ell)} = W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}, \qquad h^{(\ell)} = \phi\!\big(z^{(\ell)}\big), \qquad \ell = 1, \ldots, L $$ \(z^{(\ell)}\) is the pre-activation, \(h^{(\ell)}\) the activation. Stack \(L\) of these and the network composes \(L\) nonlinear maps. The universal approximation theorem says even a single sufficiently wide hidden layer can approximate any continuous function on a compact set — but it is silent on how wide and gives no recipe for finding the weights. Depth is the practical answer: deep networks build features hierarchically and represent many functions exponentially more compactly than a shallow one of equal parameter count. The promise of depth is compositional structure: early layers learn edges, later layers learn objects; early layers learn phonemes, later layers learn meaning. The obstacle is that the same composition that builds rich features also compounds the scale of whatever flows through it. Consider the backward pass. 
Backpropagation (ML 08) sends the loss gradient through the chain rule, so the gradient at layer \(\ell\) is a product of Jacobians from the output back to \(\ell\): EQ N1.2 — WHY DEPTH IS HARD: THE JACOBIAN PRODUCT $$ \frac{\partial \mathcal{L}}{\partial h^{(\ell)}} = \left(\prod_{k=\ell+1}^{L} J^{(k)}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial h^{(L)}}, \qquad J^{(k)} = \frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \mathrm{diag}\!\big(\phi'(z^{(k)})\big)\, W^{(k)} $$ The gradient is multiplied by one Jacobian per layer. If the typical singular value of these Jacobians is below 1, the product shrinks geometrically toward zero — the vanishing-gradient problem, which leaves early layers learning nothing. If it is above 1, the product blows up — the exploding-gradient problem, which makes training diverge. A network with sigmoid/tanh units is doubly cursed: \(\phi'\) saturates to near zero in the tails, so the diagonal factor alone kills the signal. The whole chapter is a campaign to keep that product near 1. The same compounding hits the forward pass: an activation passing through many layers is repeatedly scaled, so its variance can balloon or collapse before it ever reaches the output. The first historical fix, switching from saturating sigmoids to the non-saturating ReLU \(\phi(z) = \max(0, z)\), removed the worst of the diagonal saturation. But ReLU alone does not control the weight factor \(W^{(k)}\), and that is where the next section begins. INTUITION Think of a deep network as a chain of amplifiers. If each amplifier has gain 0.9, then 50 of them in series have gain \(0.9^{50}\approx 0.005\) — the signal is gone. Gain 1.1 gives \(1.1^{50}\approx 117\) — it saturates. Only a chain tuned to gain \(\approx 1\) passes signal cleanly through depth. Init, normalization, and residuals are three ways to lock that gain near one. 
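The amplifier picture can be checked numerically. The sketch below is an idealization, not a real network: each layer's Jacobian is taken to be a random orthogonal matrix scaled by a fixed gain, so every singular value equals that gain and there is no nonlinearity. The function name `gradient_norm_after` and the width of 64 are assumptions made for the illustration.

```python
import numpy as np

def gradient_norm_after(depth, gain, width=64, seed=0):
    """Push a unit-norm gradient back through `depth` idealized layers.

    Each Jacobian is gain * Q with Q orthogonal, so all of its singular
    values equal `gain` (an assumed, linear stand-in for EQ N1.2).
    """
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)                      # start at norm 1
    for _layer in range(depth):
        Q, _ = np.linalg.qr(rng.standard_normal((width, width)))
        J = gain * Q
        grad = J.T @ grad                             # one factor of the Jacobian product
    return float(np.linalg.norm(grad))

for gain in (0.9, 1.0, 1.1):
    norm = gradient_norm_after(50, gain)
    print(f"gain {gain:.1f}: gradient norm after 50 layers = {norm:.4g}")
```

Because orthogonal matrices preserve length, the norm after 50 layers is the gain raised to the 50th power, which reproduces the 0.005 and 117 figures from the amplifier chain. Real Jacobians have a spread of singular values, so the actual behaviour is noisier, but the geometric trend is the same.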
1.2 Weight initialization — Xavier & He Before a single gradient step, the random weights you start from already decide whether signal survives the forward pass. The goal is a variance-preserving initialization: each layer should pass activations forward without systematically growing or shrinking their variance. Treat the weights as independent zero-mean random variables and propagate variance through EQ N1.1. For a layer with \(n_{\text{in}}\) inputs, the pre-activation variance is the sum of \(n_{\text{in}}\) independent terms: EQ N1.3 — VARIANCE PROPAGATION (LINEAR REGIME) $$ \mathrm{Var}\big(z^{(\ell)}\big) = n_{\text{in}}\,\mathrm{Var}\big(W^{(\ell)}\big)\,\mathrm{Var}\big(h^{(\ell-1)}\big) $$ If \(n_{\text{in}}\,\mathrm{Var}(W) > 1\), variance grows layer by layer and activations explode; if it is below 1, they vanish. The fix is to choose \(\mathrm{Var}(W)\) so the factor is exactly 1. The naive default — \(W \sim \mathcal{N}(0, 1)\), variance 1 — multiplies variance by \(n_{\text{in}}\) at every layer, which for a width-256 network is a factor of 256 per layer. That single bad constant is enough to make a deep net untrainable. Setting the forward factor to 1 gives \(\mathrm{Var}(W) = 1/n_{\text{in}}\). The backward pass wants \(\mathrm{Var}(W) = 1/n_{\text{out}}\) for the same reason (gradients propagate through \(W^\top\)). You cannot satisfy both unless the layer is square, so Glorot (Xavier) initialization takes the harmonic compromise — the average of the two fan counts: EQ N1.4 — XAVIER / GLOROT INITIALIZATION $$ \mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \qquad\Longrightarrow\qquad W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_{\text{in}}+n_{\text{out}}}},\; \sqrt{\tfrac{6}{n_{\text{in}}+n_{\text{out}}}}\right] $$ The uniform bound comes from the fact that a uniform distribution on \([-a, a]\) has variance \(a^2/3\); setting \(a^2/3 = 2/(n_{\text{in}}+n_{\text{out}})\) gives \(a = \sqrt{6/(n_{\text{in}}+n_{\text{out}})}\). 
Glorot & Bengio derived this assuming a roughly linear activation around zero — true for \(\tanh\), whose slope at the origin is 1. It is the right default for symmetric, zero-centered nonlinearities. ReLU breaks the linear assumption: it zeros out the negative half of its inputs, so on average it halves the variance of what passes through. He (Kaiming) initialization compensates by doubling the weight variance, keying off \(n_{\text{in}}\) alone since the rectifier is the dominant correction: EQ N1.5 — HE / KAIMING INITIALIZATION (FOR ReLU) $$ \mathrm{Var}(W) = \frac{2}{n_{\text{in}}} \qquad\Longrightarrow\qquad \mathrm{std}(W) = \sqrt{\frac{2}{n_{\text{in}}}} $$ The extra factor of 2 over the naive \(1/n_{\text{in}}\) exactly cancels ReLU's variance-halving. This is the default in essentially every modern framework for ReLU-family networks ( kaiming_normal_ in PyTorch). The lesson is general: the right init depends on the nonlinearity, because what you must preserve is the variance after the activation, not before it. A ReLU layer has \(n_{\text{in}} = 128\) inputs. Using He initialization (EQ N1.5), what standard deviation should you draw its weights from? (\(\sqrt{2/n_{\text{in}}}\).) \(\sqrt{2/n_{\text{in}}} = \sqrt{2/128} = \sqrt{0.015625} = \) 0.125. A width-128 ReLU layer should start with weights of standard deviation 0.125 — far below the naive \(1.0\) that would explode the forward pass. 
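EQ N1.4 and EQ N1.5 are easy to sanity-check by sampling. The sketch below draws one weight matrix under each scheme and compares the empirical spread with the target; the helper names `xavier_uniform` and `he_normal`, the layer sizes, and the `(n_in, n_out)` shape convention (chosen so that `h @ W` works) are assumptions for this example rather than any framework's API.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Uniform draw on [-a, a] with a = sqrt(6 / (n_in + n_out)) (EQ N1.4)."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """Zero-mean normal draw with std = sqrt(2 / n_in) (EQ N1.5)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
n_in, n_out = 128, 512                      # hypothetical layer shape

W_xavier = xavier_uniform(n_in, n_out, rng)
W_he = he_normal(n_in, n_out, rng)

xavier_target_var = 2.0 / (n_in + n_out)    # a^2 / 3 for the uniform bound above
he_target_std = np.sqrt(2.0 / n_in)         # 0.125 for n_in = 128

print(f"Xavier: empirical var {W_xavier.var():.6f}  target {xavier_target_var:.6f}")
print(f"He:     empirical std {W_he.std():.4f}  target {he_target_std:.4f}")
```

The empirical values should land within a percent or so of the targets, since each matrix holds tens of thousands of samples. The He target for a 128-input layer is the 0.125 worked out in the exercise above.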
PYTHON · RUNNABLE IN-BROWSER
# Activation variance across depth: naive vs Xavier vs He init (ReLU net)
import numpy as np
rng = np.random.default_rng(0)

n, depth, batch = 256, 25, 1024
h0 = rng.standard_normal((batch, n))          # unit-variance input

def run(std_fn, relu=True):
    h, var = h0.copy(), [h0.var()]
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std_fn(n)
        h = h @ W                             # EQ N1.1 (no bias)
        if relu:
            h = np.maximum(h, 0.0)            # ReLU halves variance
        var.append(h.var())
    return var

naive  = run(lambda n: 1.0)                   # std = 1: explodes
xavier = run(lambda n: np.sqrt(1.0/n))        # tuned for linear/tanh
he     = run(lambda n: np.sqrt(2.0/n))        # tuned for ReLU

print(" layer naive xavier he")
for L in (0, 5, 12, 25):
    print(f" {L:5d} {naive[L]:11.2e} {xavier[L]:12.4f} {he[L]:9.4f}")
print("\nnaive blows up; xavier (1/n) decays under ReLU; he (2/n) holds near 1.")
plot_xy(list(range(depth + 1)), [min(v, 1e3) for v in he])   # He stays flat

INSTRUMENT N1.1 — INIT EXPLORER
ACTIVATION VARIANCE ACROSS DEPTH · EQ N1.3–N1.5
Controls (defaults): WIDTH n 256 · DEPTH L 30 · INIT SCHEME NAIVE (1) / XAVIER / HE. Readouts: VAR AT LAYER 1 · VAR AT FINAL LAYER · VERDICT.

A ReLU network of the chosen width and depth is run forward on unit-variance input; the curve is \(\log_{10}\) of the activation variance at each layer (the dashed line is variance = 1, the target). NAIVE shoots off the top of the chart within a few layers: the \(256\times\) blow-up of EQ N1.3. XAVIER decays toward zero because under ReLU its \(1/n\) is a factor of 2 too small. HE tracks the dashed line: variance preserved through arbitrary depth. Drop to NAIVE and watch the verdict flip to EXPLODES.

1.3 Batch normalization
A good initialization keeps variance under control at step zero, but weights move during training, and the distribution of each layer's inputs drifts as the layers below it update.
Ioffe & Szegedy named this drift internal covariate shift and proposed fixing it directly: standardize each layer's pre-activations to zero mean and unit variance, using statistics computed over the current mini-batch: EQ N1.6 — BATCH NORMALIZATION $$ \hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{z}_i + \beta, \qquad \mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} z_i,\;\; \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu_{\mathcal{B}})^2 $$ For each feature channel, subtract the batch mean and divide by the batch standard deviation (with \(\epsilon \approx 10^{-5}\) for numerical safety), then re-scale and re-shift with learned parameters \(\gamma, \beta\). Those two learnable parameters are crucial: normalization alone would force every layer into the same fixed distribution, but \(\gamma, \beta\) let the network recover any mean and variance it actually needs — including, if \(\gamma=\sigma_{\mathcal B}\) and \(\beta=\mu_{\mathcal B}\), the identity. Normalization is a default the network can override, not a straitjacket. The payoff is large and somewhat over-determined. BatchNorm lets you use higher learning rates without divergence, makes training far less sensitive to the choice of initialization, and acts as a mild regularizer because each example's normalization depends on the random composition of its mini-batch. The original paper credited the reduction of internal covariate shift; later work (Santurkar et al., 2018) argued the real mechanism is a smoother loss landscape — BatchNorm bounds how fast the loss and its gradients can change, so optimization steps behave more predictably. The mechanism is still debated; the empirical win is not. Train vs. inference — the classic footgun. At training time BatchNorm uses the live mini-batch statistics. 
At inference you have no batch (or want determinism), so it switches to a running average of mean and variance accumulated during training. Forgetting to put the model in eval mode — so it normalizes a single test example by its own degenerate statistics — produces the most common BatchNorm bug. BatchNorm also couples examples within a batch and degrades at very small batch sizes; that weakness is exactly why LayerNorm (normalize across features of one example, batch-independent) won in Transformers, where it sits inside every block (Vol II · Ch 02). A BatchNorm layer sees the mini-batch of pre-activations \(\{2, 2, 6, 6\}\) for one channel. Take \(\epsilon = 0\). What is the normalized value \(\hat{z}\) (EQ N1.6, before the \(\gamma,\beta\) re-scale) of an element with \(z = 5\)? Mean \(\mu_{\mathcal{B}} = (2+2+6+6)/4 = 4\). Variance \(\sigma_{\mathcal{B}}^2 = \tfrac14[(2{-}4)^2+(2{-}4)^2+(6{-}4)^2+(6{-}4)^2] = \tfrac14(4+4+4+4) = 4\), so \(\sigma_{\mathcal{B}} = 2\). Then \(\hat{z} = (5-4)/2 = \) 0.5 — the element sits half a standard deviation above the batch mean. 
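The train-versus-inference switch described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any framework's implementation: the class name `BatchNorm1D`, the `momentum` value, and the exponential-moving-average update rule are assumptions made for the example. The data are drawn with mean 4 and standard deviation 2 so that the result can be compared with the worked example.

```python
import numpy as np

class BatchNorm1D:
    """Minimal BatchNorm over (batch, features) arrays; a sketch, not a library API."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)         # learned scale (left at 1 here)
        self.beta = np.zeros(num_features)         # learned shift (left at 0 here)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, z):
        if self.training:
            # Training: live mini-batch statistics (EQ N1.6), and update the running averages.
            mu, var = z.mean(axis=0), z.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: fixed statistics accumulated during training.
            mu, var = self.running_mean, self.running_var
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
for _ in range(200):                               # simulated training batches
    bn(rng.normal(loc=4.0, scale=2.0, size=(64, 3)))

x = np.full((1, 3), 5.0)                           # a single test example with z = 5

bn.training = False
eval_out = bn(x)                                   # roughly (5 - 4) / 2 = 0.5 per feature

bn.training = True                                 # the bug: normalizing one example by itself
train_out = bn(x)                                  # mean equals the example, variance is 0

print("eval mode :", eval_out.round(3))
print("train mode:", train_out.round(3))
```

In training mode the lone example is its own batch mean, so it normalizes to zero regardless of its value; in eval mode the running statistics give an answer close to the 0.5 of the worked example.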
PYTHON · RUNNABLE IN-BROWSER # Forward pass with & without BatchNorm; print per-layer activation stats import numpy as np rng = np.random.default_rng(0) n, depth, batch = 128, 12, 512 x = rng.standard_normal((batch, n)) def bn(z, eps=1e-5): # EQ N1.6, gamma=1, beta=0 mu = z.mean(0); var = z.var(0) return (z - mu) / np.sqrt(var + eps) def forward(use_bn): h = x.copy() stats = [] for _ in range(depth): W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n) # He init z = h @ W if use_bn: z = bn(z) # re-center & re-scale each layer h = np.maximum(z, 0.0) # ReLU stats.append((h.mean(), h.std())) return stats print(" layer no-BN mean / std with-BN mean / std") plain, normed = forward(False), forward(True) for L in (0, 5, 11): p, q = plain[L], normed[L] print(f" {L:5d} {p[0]:+.3f} / {p[1]:6.3f} {q[0]:+.3f} / {q[1]:6.3f}") print("\nBatchNorm pins each layer's distribution; without it the std drifts.") RUN ▶ edits are live — break it on purpose INSTRUMENT N1.2 — BATCHNORM & TRAINING STABILITY LOSS CURVES · ON vs OFF · EQ N1.6 LEARNING RATE η 0.30 DEPTH L 12 FINAL LOSS · NO BN — FINAL LOSS · WITH BN — STABLE η CEILING (NO BN) — A toy deep net is trained for 60 steps at the chosen learning rate; the mint curve normalizes activations each layer, the muted grey curve does not. Push η up: the no-BN curve diverges (spikes off the top, loss explodes), while the BatchNorm curve keeps descending — the higher-learning-rate tolerance that made BN famous. Increase depth and the gap widens, since the un-normalized net compounds instability over more layers. 1.4 Residual connections Init and normalization keep variance in line, but they do not remove the fundamental fragility of EQ N1.2: a gradient still has to survive a product of \(L\) Jacobians. By 2015, even well-initialized, batch-normalized networks showed a degradation problem — adding more layers made training accuracy worse, not just test accuracy. 
The deeper net could in principle copy the shallower one by setting extra layers to identity, yet optimization could not find that solution. He et al.'s answer was to make the identity the default, by adding a skip connection around each block: EQ N1.7 — THE RESIDUAL BLOCK $$ h_{\ell+1} = h_\ell + F\big(h_\ell; \theta_\ell\big) $$ The block learns a residual \(F\) — the correction to add to its input — rather than a fresh representation. If the optimal map is close to identity, the network just drives \(F \to 0\), which is far easier than learning identity from scratch. \(F\) is typically two or three weight layers with normalization and a nonlinearity. The skip is the load-bearing idea: it is what lets networks go from tens of layers to hundreds (and ResNets to over a thousand) and is structurally identical to the residual stream that runs through every Transformer block (Vol II · Ch 02). Why does the skip rescue the gradient? Differentiate EQ N1.7. The Jacobian of a residual block is the identity plus the block's own Jacobian, so the backward product gains an additive shortcut at every layer: EQ N1.8 — GRADIENT FLOW THROUGH A SKIP $$ \frac{\partial h_{\ell+1}}{\partial h_\ell} = I + \frac{\partial F}{\partial h_\ell} \qquad\Longrightarrow\qquad \frac{\partial \mathcal{L}}{\partial h_\ell} = \frac{\partial \mathcal{L}}{\partial h_L}\prod_{k=\ell}^{L-1}\!\Big(I + \tfrac{\partial F}{\partial h_k}\Big) $$ Expand the product and one term is the bare identity \(I\): the gradient at the output reaches layer \(\ell\) undiminished, no matter how many layers lie between, plus higher-order corrections through the \(F\) paths. Where a plain net multiplies the gradient by something \(<1\) at every layer, so that the signal shrinks geometrically, a residual net always keeps the identity path open. Depth stops being a multiplicative tax on the gradient and becomes additive. The standard practice puts BatchNorm (or LayerNorm) inside \(F\), so the two fixes compose.
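EQ N1.8 can be illustrated with random matrices standing in for the block Jacobians. The sketch below is not the chapter's own cell: the width, the depth, and the block gain of 0.2 are hypothetical values chosen so that the contrast is easy to see, and real Jacobians are neither random nor independent across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50
gain = 0.2   # hypothetical block gain: each Jacobian shrinks a vector's norm to about 0.2x

# Random stand-ins for the block Jacobians dF/dh_k, with entries ~ N(0, gain^2 / n).
jacobians = rng.standard_normal((depth, n, n)) * gain / np.sqrt(n)

g_out = rng.standard_normal(n)
g_out /= np.linalg.norm(g_out)           # unit-norm gradient at the output

g_plain, g_res = g_out.copy(), g_out.copy()
for J in jacobians[::-1]:                # walk from the output back to layer 1
    g_plain = g_plain @ J                # plain stack: multiply by J (EQ N1.2)
    g_res = g_res + g_res @ J            # residual stack: multiply by (I + J) (EQ N1.8)

print(f"plain    ||dL/dh_1|| = {np.linalg.norm(g_plain):.3e}")
print(f"residual ||dL/dh_1|| = {np.linalg.norm(g_res):.3e}")
```

The plain gradient collapses toward zero while the residual gradient stays of order one. With a much larger block gain the residual product can grow instead, which is one reason normalization is usually placed inside \(F\).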
A residual block computes its output as the block input \(h\) plus the transformation \(F\) applied to that input — that is, \(h + F(h)\), the skip connection of EQ N1.7. True or false? (Answer true or false.) The defining equation of a residual block is exactly \(h_{\ell+1} = h_\ell + F(h_\ell)\): the input is carried forward unchanged and the block only learns the correction \(F\) to add. The statement is true. INSTRUMENT N1.3 — RESIDUAL vs PLAIN: GRADIENT FLOW ‖∂L/∂h‖ BY LAYER · EQ N1.8 DEPTH L 40 BLOCK GAIN ‖∂F/∂h‖ 0.80 GRAD AT LAYER 1 · PLAIN — GRAD AT LAYER 1 · RESIDUAL — RATIO (RESIDUAL / PLAIN) — The gradient norm starts at 1 at the output and is propagated back to layer 1. The grey plain net multiplies by the block gain at every layer (EQ N1.2), so for a gain below 1 it collapses geometrically — by layer 1 the early weights see almost no signal. The mint residual net follows EQ N1.8: the \(+I\) shortcut keeps a path of magnitude 1 alive all the way down, so the gradient barely decays. Set the gain above 1 and the plain net explodes instead — the residual net still stays bounded. PYTHON · RUNNABLE IN-BROWSER # Gradient flow: plain stack vanishes, residual stack survives (EQ N1.8) import numpy as np rng = np.random.default_rng(0) n, depth = 64, 50 g = rng.standard_normal((depth, n, n)) * np.sqrt(0.7 / n) # block Jacobians, gain RUN ▶ edits are live — break it on purpose 1.5 Regularization — dropout & weight decay The first three fixes make a deep net trainable; the last makes it generalize. A network with millions of parameters can memorize its training set outright, so we add pressure toward simpler solutions. Two techniques dominate, and they attack overfitting from opposite directions. 
Dropout randomly zeros each activation with probability \(p\) on every training step, then rescales the survivors so the expected activation is unchanged: EQ N1.9 — DROPOUT (INVERTED, TRAINING TIME) $$ \tilde{h}_i = \frac{m_i}{1-p}\,h_i, \qquad m_i \sim \mathrm{Bernoulli}(1-p) $$ Each forward pass trains a different random sub-network; at test time dropout is off and the full network acts as an implicit ensemble of all those sub-networks. The \(1/(1-p)\) factor ( inverted dropout) keeps the expected activation constant, so no scaling is needed at inference. By preventing units from co-adapting — relying on a specific partner always being present — dropout forces redundant, robust features. Typical \(p\): 0.1–0.5 for dense layers. It is largely absent from large Transformers, where data scale and other regularizers do the work. Weight decay instead penalizes large weights, adding an \(L_2\) term to the loss that pulls every weight toward zero: EQ N1.10 — L2 / WEIGHT DECAY $$ \mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\sum_j w_j^2 \qquad\Longrightarrow\qquad w_j \leftarrow w_j - \eta\Big(\frac{\partial \mathcal{L}}{\partial w_j} + \lambda\,w_j\Big) $$ The penalty's gradient is just \(\lambda w_j\), so each step shrinks every weight by a constant fraction before the data-driven update — hence "decay". Smaller weights mean a smoother, lower-variance function that is harder to overfit. A subtlety that matters in practice: with adaptive optimizers like Adam, classical \(L_2\) and true weight decay are not the same, because Adam rescales the gradient; AdamW (Loshchilov & Hutter, 2019) decouples the decay from the gradient step and is the modern default. \(\lambda\) typically sits in \(10^{-4}\) to \(10^{-1}\). The honest picture. The four fixes overlap and partly substitute for one another. BatchNorm already regularizes, which is why dropout and BN are often redundant together. 
Good initialization reduces — but does not eliminate — the need for normalization. Residual connections plus normalization are now so reliable that very deep training is routine, and the field's frontier has moved from can we train it to can we afford it. None of these is a law of nature; each is an engineering fix to the same underlying disease — signal that compounds geometrically through depth — and each will be revisited, sharpened, or replaced as architectures evolve. NEXT Now that signal can flow through depth, the question is what structure to give the layers. Chapter 02 specializes the dense layer for images: convolutions share weights across space, pooling builds translation tolerance, and the same init/norm/residual toolkit you just met powers the ResNets that dominated computer vision. 1.R References Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS — the variance-preserving (Xavier/Glorot) initialization of §1.2. He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV — He/Kaiming initialization for ReLU networks (EQ N1.5). Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML — batch normalization (§1.3, EQ N1.6). He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR — residual connections / ResNet (§1.4, EQ N1.7–N1.8). Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — dropout regularization (EQ N1.9). Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR — AdamW, decoupling weight decay from the adaptive step (EQ N1.10). Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization? 
NeurIPS — the loss-smoothing reinterpretation of BatchNorm cited in §1.3. ← PREVIOUS § INDEX NEXT CHAPTER 02 CNNs AI // ENCYCLOPEDIA — DEEP LEARNING · CH 01 FULL CONTENTS ↗ ## DL · Convolutional Neural Networks (https://ai-encyclopedia.com/dl/02-cnn.html) Convolutional Neural Networks — AI Encyclopedia AI // ENCYCLOPEDIA / DEEP LEARNING / 02 / CNNs INDEX NEXT: SEQUENCE MODELS → DEEP LEARNING · CHAPTER 02 / 07 Convolutional Neural Networks A photograph carries structure that a dense layer discards: nearby pixels belong together, and an object keeps its identity wherever it sits in the frame. Sharing a small filter across an image bakes in translation invariance, the inductive bias that made computer vision practical. This chapter builds the convolution from its arithmetic, then traces the lineage from LeNet to ResNet and the transfer-learning recipe that now ships most production vision. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DEEP LEARNING 01 INSTRUMENTS KERNEL EXPLORER · FEATURE MAPS · RECEPTIVE FIELD IN THIS CHAPTER 2.1 The convolution operation 2.2 Pooling & invariance 2.3 Channels, stride & padding 2.4 LeNet to ResNet 2.5 Transfer learning 2.R References 2.1 The convolution operation A fully-connected layer treats an image as a flat vector: a \(224\times 224\) RGB picture becomes \(150{,}528\) inputs, and a single hidden unit reading all of them owns that many weights. That is wasteful on two counts. It ignores locality — the pixels that matter for detecting an edge are right next to each other, not scattered across the frame — and it ignores repetition — an edge in the top-left corner is the same visual pattern as an edge in the bottom-right, yet a dense layer must relearn it for every position. A convolutional layer fixes both by sliding one small set of weights, a kernel (or filter), across every location and reusing it everywhere. The arithmetic is a sum of element-wise products between the kernel and the patch of input it currently overlaps. 
(Deep-learning libraries implement cross-correlation — no kernel flip — and call it convolution; since the kernel is learned, the flip is irrelevant.) For a 2D input \(I\) and a \(k\times k\) kernel \(W\), the output at position \((i,j)\) is: EQ N2.1 — DISCRETE 2D CONVOLUTION (CROSS-CORRELATION) $$ S(i,j) \;=\; (I * W)(i,j) \;=\; \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I(i+m,\, j+n)\, W(m,n) \;+\; b $$ The kernel \(W\) is a tiny learnable stencil — \(3\times 3\) is the modern default — and \(b\) a scalar bias. Sliding it across the whole image produces a feature map: a 2D record of where the kernel's pattern occurs. The two structural commitments are local connectivity (each output sees only a \(k\times k\) window, not the whole image) and weight sharing (the same \(k^2{+}1\) parameters are reused at every position). Weight sharing is what gives convolution its defining property — translation equivariance: shift the input and the feature map shifts identically. The parameter savings are dramatic. A dense layer mapping a \(32\times 32\) single-channel image to a same-sized output needs \(1024 \times 1024 \approx 10^6\) weights; a \(3\times 3\) convolution producing the same map needs nine (plus a bias), regardless of image size. That economy is not just cheaper — it is a prior. By forcing every location to share weights, convolution declares in advance that visual patterns are position-independent, and that prior is close enough to true that CNNs generalize from far less data than an unconstrained network would need. The boundary deserves a word. With no padding, a \(k\times k\) kernel cannot center on the outermost pixels, so the feature map shrinks: an \(N\times N\) input yields an \((N-k+1)\times(N-k+1)\) output. Section 2.3 makes this precise and shows how padding and stride control it. A \(32\times 32\) image is convolved with a \(5\times 5\) kernel using stride \(1\) and no padding. What is the side length of the output feature map? 
The output side is \(\left\lfloor \dfrac{W - K + 2P}{S}\right\rfloor + 1 = \dfrac{32 - 5 + 0}{1} + 1 = 27 + 1 = \) 28. Each \(5\times 5\) window must fit entirely inside the image, so the kernel's top-left corner can sit in only \(28\) positions per axis — giving a \(28\times 28\) map. PYTHON · RUNNABLE IN-BROWSER # EQ N2.1: 2D convolution (cross-correlation) from scratch in numpy import numpy as np # a tiny 6x6 "image": a bright vertical bar down the middle img = np.zeros((6, 6)) img[:, 2:4] = 1.0 # a vertical-edge detector (Sobel-x): fires where left/right brightness differ K = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float) kh, kw = K.shape H, W = img.shape out = np.zeros((H - kh + 1, W - kw + 1)) # valid padding -> shrinks for i in range(out.shape[0]): for j in range(out.shape[1]): patch = img[i:i+kh, j:j+kw] # the window under the kernel out[i, j] = np.sum(patch * K) # element-wise product, summed np.set_printoptions(precision=1, suppress=True) print("input image (the bar):\n", img) print("\nfeature map (vertical edges):\n", out) print("\nleft edge of bar -> +, right edge -> -. flat regions stay 0.") RUN ▶ edits are live — break it on purpose INSTRUMENT N2.1 — CONVOLUTION KERNEL EXPLORER 9×9 INPUT · 3×3 KERNEL · VALID PADDING · EQ N2.1 KERNEL EDGE SHARPEN BLUR IDENTITY KERNEL (3×3) — OUTPUT SIZE 7 × 7 PARAMS REUSED 49× Left grid is the input (a hand-drawn "7"); right grid is the feature map after applying the chosen kernel everywhere. EDGE highlights boundaries, BLUR averages neighbors, SHARPEN exaggerates contrast, IDENTITY copies the center pixel. The same nine weights produce every output cell — that single fact is weight sharing, and the source of the 49× reuse count. 2.2 Pooling & translation invariance Convolution is equivariant: move the cat one pixel right and its feature map moves one pixel right too. For classification we usually want something stronger — invariance: the answer "cat" should not change at all when the cat shifts. 
Pooling is the classic mechanism that converts a little equivariance into a little invariance. It slides a window over the feature map and summarizes each window with a single number — usually the maximum (max-pooling) or the average: EQ N2.2 — MAX-POOLING $$ P(i,j) \;=\; \max_{\substack{0\le m<p \\ 0\le n<p}} S(i\,s+m,\; j\,s+n) $$ Here \(S\) is the feature map of EQ N2.1, and we write \(p\) for the side of the pooling window and \(s\) for its stride; the common choice \(p=s=2\) halves each spatial dimension. [The rest of this caption and the body of the accompanying Python cell are missing from this text export. The surviving final lines of the cell print "shift absorbed." and "output went from 4x4 to 2x2: a quarter of the area, for free."]
2.3 Channels, stride & padding Real images are not flat grids — a color photo is \(H\times W\times 3\), three channels (red, green, blue) stacked in depth. Convolution generalizes by giving each kernel the same depth as its input: a kernel applied to a 3-channel image is itself \(k\times k\times 3\), it dots over all input channels at once, and it produces a single output map. To get a richer representation you simply run \(C_{\text{out}}\) such kernels in parallel — one per output channel — so a conv layer is parameterized by a 4D weight tensor: EQ N2.3 — A CONV LAYER'S PARAMETERS & OUTPUT GEOMETRY $$ \#\text{params} = (k \cdot k \cdot C_{\text{in}} + 1)\, C_{\text{out}}, \qquad H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} - k + 2P}{S} \right\rfloor + 1 $$ The weight tensor has shape \(C_{\text{out}}\times C_{\text{in}}\times k\times k\), plus one bias per output channel. Padding \(P\) adds a border of zeros so the kernel can center on edge pixels; "same" padding (\(P=\lfloor k/2\rfloor\) for odd \(k\), stride 1) keeps the spatial size unchanged. Stride \(S\) is how far the kernel hops between applications; \(S=2\) downsamples like a pool but with learned weights. The same formula applies independently to the width. The depth dimension is where representational capacity lives: early layers carry tens of channels of low-level features (edges, blobs), deep layers carry hundreds or thousands of channels of abstract parts (eyes, wheels, text). Notice what the layer trades away: a conv with \(C_{\text{in}}=C_{\text{out}}=256\) and a \(3\times 3\) kernel has \((9\cdot 256 + 1)\cdot 256 \approx 590{,}000\) parameters — independent of image size, because the same kernels run at every location.
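Both halves of EQ N2.3 are simple enough to check by hand or in code. The helpers below are a small sketch; the function names `conv_params` and `conv_out_size` are hypothetical, chosen for this example, and the inputs are the cases already worked in the text.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel (EQ N2.3, left)."""
    return (k * k * c_in + 1) * c_out

def conv_out_size(h_in, k, padding=0, stride=1):
    """Output side length along one spatial axis (EQ N2.3, right)."""
    return (h_in - k + 2 * padding) // stride + 1

print(conv_params(3, 256, 256))                     # 590080, the "about 590,000" above
print(conv_out_size(32, 5))                         # 28: the 5x5 valid-conv quiz from section 2.1
print(conv_out_size(224, 3, padding=1))             # 224: "same" padding keeps the size
print(conv_out_size(224, 3, padding=1, stride=2))   # 112: stride 2 halves it
```

Neither function takes the image area as an argument for the parameter count, which is the point: the cost of a conv layer in weights depends only on kernel size and channel counts.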
The familiar CNN rhythm follows directly: as pooling and strided convs shrink the spatial grid, channel counts rise to compensate, trading "where" for "what" as you go deeper. A typical backbone might run \(224^2\times 3 \to 56^2\times 64 \to 28^2\times 128 \to 14^2\times 256 \to 7^2\times 512\): the map gets small and deep until a global pool and a linear layer read off the answer. A practical refinement worth naming: the \(1\times 1\) convolution. With \(k=1\) it does no spatial mixing at all — it is a per-pixel linear layer across channels, used to cheaply change channel depth (a "bottleneck") before an expensive \(3\times 3\). Depthwise-separable convolutions (each channel convolved on its own, then \(1\times 1\) mixed) push the same idea to mobile-scale efficiency, and are the engine of the MobileNet/EfficientNet family. INSTRUMENT N2.2 — FEATURE-MAP VISUALIZER SPATIAL SIZE ↓ · CHANNELS ↑ · EQ N2.3 INPUT SIZE 224 STAGES (conv+pool) 4 BASE CHANNELS 64 FINAL MAP 14 × 14 FINAL CHANNELS 512 ACTIVATIONS / STAGE — Each stage is one "same" conv (size-preserving) followed by a stride-2 pool (size-halving) that also doubles the channel count. Watch the volumes flip from wide-and-shallow to small-and-deep — the canonical CNN shape. The activation count per stage (height × width × channels) shows that early layers, despite few channels, hold the most numbers; this is why feature-map memory, not parameters, often dominates training. 2.4 Classic architectures — LeNet to ResNet The CNN's history is a tight sequence of ideas, each fixing the previous generation's ceiling. Reading it in order is the fastest way to understand what every modern backbone is made of.

| Architecture | Year | Depth | The idea it introduced |
|---|---|---|---|
| LeNet-5 | 1998 | 7 | The template itself: conv → pool → conv → pool → dense, trained by backprop to read handwritten digits (MNIST/checks). |
| AlexNet | 2012 | 8 | The same idea at GPU scale on ImageNet: ReLU, dropout, data augmentation. Nearly halved the top-5 error of the runner-up and started the deep-learning era. |
| VGG | 2014 | 16–19 | Depth from uniformity: stacks of small \(3\times 3\) convs only. Two \(3\times 3\)s see a \(5\times 5\) region with fewer params and more nonlinearity. |
| GoogLeNet / Inception | 2014 | 22 | Multi-scale "Inception" blocks (\(1\times 1\), \(3\times 3\), \(5\times 5\) in parallel) and \(1\times 1\) bottlenecks to stay cheap. |
| ResNet | 2015 | 50–152 | The residual / skip connection — the breakthrough that let networks go past ~20 layers without degrading. |

The decisive jump is ResNet. Through 2014 the field believed deeper was better, yet past about twenty layers accuracy got worse — not from overfitting (training error rose too) but from an optimization failure: gradients had to thread through too many transformations to reach early layers, and very deep plain stacks could not even learn the identity function reliably. He et al. solved it with a one-line change to the building block — add the input back to the output: EQ N2.4 — THE RESIDUAL BLOCK $$ \mathbf{y} \;=\; \mathcal{F}(\mathbf{x}; \{W_i\}) \;+\; \mathbf{x} $$ Instead of asking a block to learn the desired mapping \(H(\mathbf{x})\) outright, ask it to learn only the residual \(\mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}\); the original input is carried forward by the skip connection and added at the end. Two consequences follow. If a layer is unneeded, driving \(\mathcal{F}\to 0\) recovers the identity for free, so in principle extra depth need not hurt. And the additive shortcut gives the gradient a direct path back to early layers (the \(+\mathbf{x}\) contributes a clean \(+1\) to the derivative), which greatly weakens the vanishing-gradient barrier. This single trick enabled 152-layer networks, won ImageNet 2015, and the residual stream it created is now the backbone of essentially every deep architecture — Transformers included (Vol II · EQ 2.x). It is worth noting where this story stands in 2026, honestly.
CNNs no longer hold the absolute accuracy crown on large-scale benchmarks — Vision Transformers (ViT) match or exceed them given enough data, by replacing the convolutional prior with attention over image patches. But the contest is closer than headlines suggest: convnets modernized with the same training recipes (the "ConvNeXt" line) remain competitive, and CNNs still dominate where data is limited or latency and edge deployment matter, precisely because their built-in inductive bias substitutes for data the way a ViT cannot. Convolution did not lose; it became one well-understood tool among several. WHY 3×3 VGG's quiet lesson: two stacked \(3\times 3\) convolutions have the same \(5\times 5\) receptive field as one \(5\times 5\) conv, but use \(2\cdot(3^2)=18\) weights per channel instead of \(25\), and insert an extra nonlinearity between them. Three \(3\times 3\)s match a \(7\times 7\) (27 vs 49 weights). Deeper-but-thinner won, and \(3\times 3\) became the field's default kernel — the receptive-field calculator below shows exactly how that field grows with depth. 2.5 Transfer learning The single most practically important fact about CNNs is that the features they learn transfer. A network trained on ImageNet's 1.2 million labelled photos learns, in its early layers, a near-universal visual vocabulary: oriented edges, color contrasts, textures, then corners, then object parts. Those low-level detectors are not specific to "is this a Labrador" — they are what any natural-image task needs. So rather than train a CNN from random weights on your few thousand images (which would overfit badly), you start from a pre-trained backbone and adapt it. Two regimes: Feature extraction (frozen backbone). Freeze all convolutional weights, discard the original 1000-class head, and train only a fresh classifier on top of the final feature vector. The CNN becomes a fixed feature function; you are fitting a small linear model on excellent features. 
This is the right move when your dataset is small and/or similar to ImageNet — it cannot overfit the backbone because the backbone does not move. Fine-tuning. Unfreeze some or all of the backbone and continue training on your data at a small learning rate (often 10–100× lower than from-scratch), so the pre-trained weights are nudged, not erased. Best when you have more data or a domain that drifts from natural photos (medical scans, satellite imagery). A common recipe trains the new head first, then unfreezes the top blocks; the lowest layers — those universal edge detectors — are usually left frozen or barely touched. The empirical pattern that justifies the whole approach: feature transferability decreases with depth. Early layers transfer almost perfectly across tasks; the last layers are the most task-specific and benefit most from adaptation. That is why "freeze the bottom, retrain the top" is the default, and why transfer learning routinely reaches strong accuracy with hundreds of examples instead of millions — the expensive representation learning was already paid for, once, by whoever trained the backbone. INSTRUMENT N2.3 — RECEPTIVE-FIELD CALCULATOR 3×3 CONVS + 2×2 POOLS · CUMULATIVE RF 3×3 CONV LAYERS 3 2×2 POOLS (after every Nth conv) 2 RECEPTIVE FIELD 18 px FEATURE STRIDE (jump) 4 TOTAL LAYERS 5 Each unit in a deep feature map "sees" a window of the original image — its receptive field. With only \(3\times 3\) convs the field grows by 2 px per layer (linearly); insert a stride-2 pool and the jump doubles, so every later conv now reaches twice as far — the field grows much faster. Slide the pool count up and watch a stack of tiny \(3\times 3\) kernels come to cover the whole image. The RF recursion: \(r_\ell = r_{\ell-1} + (k-1)\,j_{\ell-1}\), with jump \(j_\ell = j_{\ell-1}\cdot s\). 
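The receptive-field recursion is short enough to run by hand or in code. The sketch below assumes each layer is described by a (kernel size, stride) pair and that the instrument's default stack interleaves the layers as conv, pool, conv, pool, conv; that ordering and the function name `receptive_field` are assumptions made for this example.

```python
def receptive_field(layers):
    """Apply r <- r + (k - 1) * j and j <- j * s for each (k, s) layer in order.

    Starts from a single input pixel: receptive field 1, jump 1.
    """
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j   # each layer widens the field by (k - 1) input-space jumps
        j *= s             # a strided layer stretches every later step
    return r, j

conv, pool = (3, 1), (2, 2)   # 3x3 conv with stride 1; 2x2 pool with stride 2

print(receptive_field([conv, pool, conv, pool, conv]))   # (18, 4)
print(receptive_field([conv, conv]))                     # (5, 1): two 3x3s cover 5x5
print(receptive_field([conv, conv, conv]))               # (7, 1): three 3x3s cover 7x7
```

Under that assumed ordering the result matches the instrument's readout of an 18 px receptive field with feature stride 4, and the conv-only stacks reproduce the \(5\times 5\) and \(7\times 7\) equivalences from the VGG note.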
A subtle gotcha transfer learning shares with the receptive field: a backbone's effective receptive field is often smaller than its theoretical one (activations near the window's center dominate), and the input statistics it expects — resolution, normalization, channel order — must match what it was trained on. Feed a medical grayscale scan to an RGB-ImageNet backbone without reconciling these and the transfer quietly underperforms. Match the preprocessing first; it is the most common silent failure. NEXT Convolution shares weights across space; the next idea shares them across time. Chapter 03 turns to sequence models — RNNs, LSTMs, and the gating that lets a network carry information across hundreds of steps — the line of work that the Transformer would eventually overtake. 2.R References LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11) — LeNet-5; the conv → pool → dense template trained end-to-end by backprop. Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; ReLU, dropout, and GPU training that ignited the deep-learning era. Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 — VGG; depth via stacks of \(3\times 3\) convolutions. Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015 — GoogLeNet / Inception; multi-scale blocks and \(1\times 1\) bottlenecks. He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016 — ResNet (EQ N2.4); the skip connection that unlocked very deep networks. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks?. NeurIPS 27 — the empirical basis for transfer learning; transferability falls with depth. Dosovitskiy, A. et al. (2021). 
An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer, the chief modern challenger to the convolutional prior. ← PREVIOUS 01 Foundations NEXT CHAPTER 03 Sequence Models AI // ENCYCLOPEDIA — DEEP LEARNING · CH 02 FULL CONTENTS ↗
## DL · Sequence Models (https://ai-encyclopedia.com/dl/03-sequence-models.html)
Sequence Models — RNN, LSTM & GRU — AI Encyclopedia AI // ENCYCLOPEDIA / DEEP LEARNING / 03 / SEQUENCE MODELS INDEX NEXT: SEQ2SEQ & ATTENTION → DEEP LEARNING · CHAPTER 03 / 07 Sequence Models — RNN, LSTM & GRU A feed-forward network takes one fixed-size input and retains nothing between examples. A recurrent network reads a sequence one step at a time and carries a hidden state forward, so the present can depend on the past. Recurrence lets a network carry memory across time, but vanishing gradients made long-range training fail until gating cells restored it. This chapter builds the vanilla RNN, shows why training over long sequences breaks down, then derives the LSTM and GRU cells that addressed it. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DEEP LEARNING 01–02 INSTRUMENTS RNN UNROLL · LSTM GATES · GRADIENT DECAY IN THIS CHAPTER 3.1 Recurrent networks 3.2 Vanishing/exploding gradients 3.3 LSTM — gates & cell state 3.4 GRU — a lighter gate 3.5 Backprop through time 3.R References 3.1 Recurrent networks — weights shared over time Many of the things we want a model to read have no fixed length and no fixed structure except order: a sentence, an audio clip, a stock-price tape, a stream of sensor readings. A recurrent neural network (RNN) processes such a sequence \(x_1, x_2, \ldots, x_T\) one step at a time, maintaining a hidden state \(h_t\) — a running summary of everything seen so far — and updating it with the same weights at every step. EQ N3.1 — THE RECURRENT CELL $$ h_t = \tanh\!\big( W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h \big), \qquad \hat{y}_t = W_{hy}\,h_t + b_y $$ \(W_{hh}\) feeds the previous state back into the present — the loop that makes the network recurrent. Crucially, \(W_{xh}, W_{hh}, W_{hy}\) are shared across all \(T\) time steps: the cell that reads token 1 is the identical cell that reads token 1,000. That weight tying is what lets one fixed-size model consume sequences of any length, and what makes the gradient a long product (§3.2). 
\(h_0\) is usually the zero vector. The \(\tanh\) keeps each state bounded in \((-1,1)\).

It is easier to reason about an RNN once you unroll it: copy the cell once per time step and lay the copies in a row, threading \(h_t\) from each copy into the next. The unrolled graph is just a very deep feed-forward network — depth \(T\) — whose layers happen to share parameters. Everything we know about training deep nets (Deep Learning 02) applies, including the failure mode that dominates §3.2.

[FIGURE: UNROLLED RNN — ONE SHARED CELL, COPIED PER STEP. Four copies of the cell (t = 1 to 4); copy t reads input x_t, emits ŷ_t, and passes h_t on to the next copy.]

The output head is flexible. A many-to-one RNN reads the whole sequence and emits a single prediction from \(h_T\) (sentiment of a review). A many-to-many RNN emits a label at every step (part-of-speech tags). A one-to-many RNN runs a single input forward into a generated sequence (image captioning). The recurrence is the same; only where you read the head changes.

A scalar RNN has \(W_{xh}=2\), \(W_{hh}=0.5\), \(b_h=0\), and starts from \(h_0=0\). The first input is \(x_1=0.5\). What is \(h_1=\tanh(W_{xh}x_1 + W_{hh}h_0)\)? (Use \(\tanh(1)=0.7616\).)

With \(h_0=0\) the recurrent term vanishes: \(W_{xh}x_1 + W_{hh}h_0 = 2(0.5) + 0.5(0) = 1\). So \(h_1=\tanh(1)=0.7616\).
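The read-out patterns described above differ only in which hidden states the output head touches. The following is a minimal NumPy sketch under assumed toy sizes and random weights; the helper name `run_rnn`, the matrix `Why`, and all dimensions are illustrative choices, not taken from the chapter, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, K, T = 4, 3, 2, 5              # hidden, input, output sizes and length (all assumed)
Wxh = rng.normal(0, 0.5, (H, D))     # input -> hidden
Whh = rng.normal(0, 0.5, (H, H))     # hidden -> hidden
Why = rng.normal(0, 0.5, (K, H))     # output head W_hy of EQ N3.1 (biases omitted)

def run_rnn(X):
    """Return every hidden state h_1..h_T stacked as a (T, H) array."""
    h, states = np.zeros(H), []
    for x in X:
        h = np.tanh(Wxh @ x + Whh @ h)   # same weights at every step
        states.append(h)
    return np.stack(states)

X = rng.normal(0, 1, (T, D))
states = run_rnn(X)

many_to_many = states @ Why.T        # head applied at every step: shape (T, K)
many_to_one = Why @ states[-1]       # head applied to h_T only: shape (K,)

print(many_to_many.shape, many_to_one.shape)
```

The recurrence runs once in both cases; the many-to-one output is simply the last row of the many-to-many output. A one-to-many head would additionally feed each output back in as the next input, which this sketch leaves out.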
PYTHON · RUNNABLE IN-BROWSER
# EQ N3.1: a vanilla RNN cell, forward pass over a toy sequence
import numpy as np
rng = np.random.default_rng(0)

H, D, T = 4, 3, 6                    # hidden size, input size, seq length
Wxh = rng.normal(0, 0.5, (H, D))     # input -> hidden
Whh = rng.normal(0, 0.5, (H, H))     # hidden -> hidden (the recurrent loop)
bh = np.zeros(H)
X = rng.normal(0, 1, (T, D))         # a length-6 input sequence

h = np.zeros(H)                      # h_0 = 0
print("step  ||h_t||  (running summary grows then stabilises)")
for t in range(T):
    h = np.tanh(Wxh @ X[t] + Whh @ h + bh)   # the recurrence, same weights each step
    print(f"  t={t}  {np.linalg.norm(h):.4f}")
print("\nfinal hidden state h_T:", h.round(3))
print("every step reused the SAME Wxh, Whh -- that is what 'recurrent' means.")

INSTRUMENT N3.1 — RNN UNROLL VISUALIZER (interactive on the site; scalar cell, EQ N3.1). Controls: input weight W_xh (default 1.00), recurrent weight W_hh (default 0.60), sequence length (default 10). Readouts: final state h₁₀, regime.

The cell reads a fixed input pulse (1 at \(t=1\), then 0) so you can watch memory decay. Push \(W_{hh}\) toward 0 and the state forgets the pulse within a step or two; push it toward \(\pm1\) and the memory lingers across the whole sequence — but go past 1 and \(\tanh\) saturates and the state pins to its extreme. This single knob previews the stability problem of §3.2.

3.2 The vanishing / exploding gradient problem

The promise of recurrence is long-range memory: a model that, after reading "I grew up in France … so I speak fluent ___", can reach back hundreds of tokens to fill the blank with French. In practice a vanilla RNN cannot. The reason is in the gradient. To train, we backpropagate the loss at step \(T\) through every earlier step.
By the chain rule, the gradient of \(h_T\) with respect to a distant \(h_k\) is a product of Jacobians: EQ N3.2 — GRADIENT IS A LONG PRODUCT $$ \frac{\partial h_T}{\partial h_k} \;=\; \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \;=\; \prod_{t=k+1}^{T} \operatorname{diag}\!\big(\tanh'(a_t)\big)\, W_{hh}^{\top}, \qquad a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h $$ Each factor multiplies by \(W_{hh}\) (through its transpose) and by the diagonal of \(\tanh'\), which never exceeds 1 and is usually well below it. Multiplying \(T-k\) such factors makes the whole product behave like a power. If the factors are typically smaller than 1, the gradient vanishes geometrically with distance; if larger, it explodes. Either way the model cannot learn dependencies that span many steps. The controlling quantity is the largest singular value (spectral norm) of the recurrent Jacobian. Bound it loosely: \(\tanh'\le 1\), so each factor has norm at most \(\|W_{hh}\|\). A sufficient condition for vanishing is therefore \(\|W_{hh}\| < 1\); a necessary condition for exploding is \(\|W_{hh}\| > 1/\max\tanh'\). In the clean scalar case the whole product collapses to \((w\,\tanh'(a))^{\,T-k}\), which is exactly geometric in the distance \(T-k\). EQ N3.3 — SCALAR DECAY RATE $$ \left| \frac{\partial h_T}{\partial h_k} \right| \;\approx\; \big| w_{hh} \cdot \overline{\tanh'} \big|^{\,T-k} \;=\; \lambda^{\,T-k}, \qquad \lambda < 1 \Rightarrow \text{vanish}, \quad \lambda > 1 \Rightarrow \text{explode} $$ \(\lambda\) is the effective per-step gain. With \(\lambda=0.8\), the gradient from 50 steps away is scaled by \(0.8^{50}\approx 1.4\times10^{-5}\) — for all practical purposes zero. This is why a plain RNN's effective memory is only a handful of steps, no matter how long the sequence. Bengio, Simard & Frasconi proved in 1994 that this trade-off is fundamental: the same contraction that gives a vanilla RNN stable dynamics is what starves the long-range gradient. 
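The product in EQ N3.2 can also be measured directly in the matrix case. The sketch below assumes a small random tanh RNN with zero input whose \(W_{hh}\) is rescaled to a chosen spectral norm; the helper name `jacobian_product_norm`, the hidden size, and the seed are illustrative assumptions. It uses the column-vector form of the per-step Jacobian, \(\operatorname{diag}(\tanh'(a_t))\,W_{hh}\), and compares the norm of the accumulated product with the loose bound \(\|W_{hh}\|^{T-k}\).

```python
import numpy as np

def jacobian_product_norm(spectral_norm, steps, H=8, seed=0):
    """Spectral norm of d h_T / d h_k after `steps` steps of a zero-input tanh RNN."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 1, (H, H))
    W *= spectral_norm / np.linalg.norm(W, 2)   # rescale so ||W_hh||_2 = spectral_norm
    h = rng.normal(0, 1, H)                     # arbitrary starting state h_k
    J = np.eye(H)
    for _ in range(steps):
        h = np.tanh(W @ h)
        J = np.diag(1 - h**2) @ W @ J           # chain on diag(tanh'(a_t)) W_hh
    return np.linalg.norm(J, 2)

for s in (0.5, 0.9, 2.0):
    n = jacobian_product_norm(s, steps=30)
    print(f"||W_hh|| = {s}: ||dh_T/dh_k|| = {n:.3e}   bound {s**30:.3e}")
```

Because \(\tanh' \le 1\), the measured norm can never exceed the bound. A norm above 1 permits growth but does not guarantee it, since saturation of \(\tanh\) can still shrink each factor.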
Exploding gradients have a cheap fix; vanishing ones do not. When \(\lambda > 1\) the gradient blows up to NaN, but you can simply clip its norm to a ceiling before the optimizer step — a standard, robust trick. Vanishing gradients are insidious because nothing crashes: training proceeds, the loss even falls, but the model is silently blind to anything more than a few steps back. No clipping can manufacture a signal that has decayed to numerical zero. The real cure is architectural — change the cell so the gradient has a path that does not get multiplied down. That is §3.3. An RNN has effective per-step gain \(\lambda = 0.9\) (from EQ N3.3). By what factor is the gradient scaled when it travels from step \(k\) to step \(T\) that are \(T-k = 44\) steps apart, i.e. \(0.9^{44}\)? \(0.9^{44} = e^{44\ln 0.9} = e^{44(-0.10536)} = e^{-4.636} \approx\) 0.0097 (≈ 0.01). A signal one percent of its original size is, for learning purposes, gone — even though only 44 steps separate the two positions. 
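The norm-clipping fix mentioned above is short enough to write out. This is a minimal sketch, assuming gradients arrive as a list of NumPy arrays; the function name `clip_by_global_norm` and the ceiling of 1.0 are illustrative choices, not values from the chapter.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # 1.0 means "already small enough"
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # joint norm = sqrt(9 + 16 + 144) = 13
clipped, before = clip_by_global_norm(grads, max_norm=1.0)
after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(f"norm before: {before:.3f}   norm after: {after:.3f}")
```

Clipping rescales every array by the same factor, so the direction of the update is preserved and only its length is capped. It does nothing for a vanished gradient: a norm already below the ceiling passes through unchanged.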
PYTHON · RUNNABLE IN-BROWSER
# Vanishing vs exploding: measure backprop gradient norm vs distance
import numpy as np

def grad_norm(w_hh, length):
    # scalar RNN, fixed input pulse; product of Jacobian factors (EQ N3.2)
    h, a = 0.0, []
    for t in range(length):
        x = 1.0 if t == 0 else 0.0
        pre = w_hh * h + 1.0 * x                 # W_xh = 1
        h = np.tanh(pre); a.append(pre)
    g = 1.0                                      # d h_T / d h_0
    for t in range(length - 1, -1, -1):
        g *= (1 - np.tanh(a[t])**2) * w_hh       # tanh'(a) * W_hh
    return abs(g)

for w in (0.5, 0.9, 1.1):
    norms = [grad_norm(w, L) for L in range(1, 61)]
    tag = "VANISH" if norms[-1] < 1e-3 else ("EXPLODE" if norms[-1] > 1e3 else "ok")
    print(f"W_hh={w}: grad@1={norms[0]:.3f} grad@60={norms[-1]:.3e} [{tag}]")

print("\n|d h_T / d h_0| over distance for W_hh = 0.9 (geometric decay):")
# plot_xy is not defined in this cell or in NumPy; it is presumably a plotting helper supplied by the in-browser runtime
plot_xy(list(range(1, 61)), [grad_norm(0.9, L) for L in range(1, 61)])

INSTRUMENT N3.2 — VANISHING-GRADIENT DECAY (interactive on the site; |∂hₜ/∂h₀| vs distance, EQ N3.3). Controls: recurrent gain W_hh (default 0.90), sequence length (default 60). Readouts: effective gain λ, gradient at full distance, regime.

The curve is the magnitude of the gradient flowing back from the final step to step 0, plotted on a log axis against distance. Below \(W_{hh}=1\) it plunges to the floor (white "VANISH" line at \(10^{-3}\)) within a few dozen steps; above 1 it climbs off the top (EXPLODE). Only a hairline near \(W_{hh}\approx 1\) keeps the gradient alive across the whole sequence — and a vanilla RNN cannot stay on that knife-edge while also fitting the data. That dilemma is the entire motivation for gates.

3.3 LSTM — gates & the cell state

Hochreiter & Schmidhuber's 1997 Long Short-Term Memory attacks EQ N3.2 head-on. It adds a second, parallel memory track — the cell state \(c_t\) — whose update is (mostly) additive rather than a repeated matrix multiply. Information can ride that track across many steps almost untouched, so the gradient flowing back along it is multiplied by numbers near 1, not by a contracting Jacobian.
Three learned gates — sigmoids in \((0,1)\) that act as soft, differentiable valves — decide what to keep, what to add, and what to read out. EQ N3.4 — THE THREE GATES & CANDIDATE $$ \begin{aligned} f_t &= \sigma\!\big(W_f[h_{t-1},x_t]+b_f\big) &\text{(forget)} \\ i_t &= \sigma\!\big(W_i[h_{t-1},x_t]+b_i\big) &\text{(input)} \\ o_t &= \sigma\!\big(W_o[h_{t-1},x_t]+b_o\big) &\text{(output)} \\ \tilde{c}_t &= \tanh\!\big(W_c[h_{t-1},x_t]+b_c\big) &\text{(candidate)} \end{aligned} $$ Each gate is a sigmoid, so its entries live in \((0,1)\): 0 means "block this channel completely", 1 means "let it through untouched". \([h_{t-1},x_t]\) is the previous state concatenated with the current input. \(\tilde c_t\) is the candidate new content (a \(\tanh\), so in \((-1,1)\)) that the input gate may write. An LSTM has exactly three gates — forget, input, output — plus this candidate, which is not itself a gate. EQ N3.5 — CELL & HIDDEN UPDATE (THE HIGHWAY) $$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t) $$ \(\odot\) is element-wise product. The cell update is the heart of the design: the old memory \(c_{t-1}\) is scaled by the forget gate and the candidate is scaled by the input gate, then they are added. When \(f_t\approx 1\) and \(i_t\approx 0\), \(c_t\approx c_{t-1}\) — memory persists and \(\partial c_t/\partial c_{t-1}\approx 1\), so the gradient flows back with no geometric decay. This near-identity path is the "constant error carousel" that defeats EQ N3.2. The hidden state \(h_t\) is a gated, squashed view of the cell — what the rest of the network gets to see. Read the gates as operations on memory. The forget gate \(f_t\) erases: an entry near 0 zeroes that slot of \(c_{t-1}\). The input gate \(i_t\) writes: it decides how much of the fresh candidate \(\tilde c_t\) to commit. The output gate \(o_t\) reads: it exposes a filtered copy of the cell as the hidden state. 
A practical detail that matters in real training: the forget-gate bias \(b_f\) is usually initialized to \(+1\) or higher, so the network defaults to remembering and only learns to forget when the data demands it. Counting the sigmoid valves in EQ N3.4 — forget, input, and output — how many gates does a standard LSTM cell have? (The \(\tanh\) candidate \(\tilde c_t\) is content, not a gate.) The three sigmoid gates are the forget gate \(f_t\), the input gate \(i_t\), and the output gate \(o_t\). The candidate \(\tilde c_t\) is a \(\tanh\), not a gate. So an LSTM has 3 gates. True or false: in EQ N3.5 the previous cell state enters as \(f_t \odot c_{t-1}\), so a forget gate whose entries are near 0 erases (nearly zeroes) the cell's stored memory. (Answer true or false.) Multiplying \(c_{t-1}\) element-wise by a gate near 0 drives those entries toward 0, discarding the corresponding memory before the new candidate is added. The statement is true — that is precisely the forget gate's job, and why a stuck-closed forget gate is a known cause of memory loss. 
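Along the direct cell path of EQ N3.5, the gradient \(\partial c_T/\partial c_k\) is the product of the forget gates in between, which is why the forget-gate bias matters so much at initialization. The sketch below is a simplification under stated assumptions: it treats the gate as a constant \(\sigma(b_f)\) (weighted input taken to be zero) and ignores the indirect paths through \(h_{t-1}\). The helper name `cell_path_gain` and the bias value 5.0 are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cell_path_gain(forget_values):
    """Product of forget gates: d c_T / d c_k along the direct cell path of EQ N3.5."""
    return float(np.prod(forget_values))

steps = 20
for b_f in (0.0, 1.0, 5.0):
    f = sigmoid(b_f)                 # gate value when the weighted input is zero (assumed)
    gain = cell_path_gain([f] * steps)
    print(f"b_f = {b_f}: f = {f:.3f}   gain over {steps} steps = {gain:.3e}")
```

A gate held at exactly 1 gives a gain of exactly 1 at any distance, which is the constant error carousel. A bias of 0 starts the gate at 0.5 and halves the signal every step, so a positive initial bias keeps far more of the long-range gradient alive while the network is still learning what to forget.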
PYTHON · RUNNABLE IN-BROWSER
# EQ N3.4-N3.5: one LSTM cell forward step; print gates and cell state
import numpy as np
rng = np.random.default_rng(1)
H, D = 3, 2

def sig(z):
    return 1 / (1 + np.exp(-z))

# stacked weights for [forget, input, output, candidate]; bias_f starts at +1
Wx = rng.normal(0, 0.6, (4 * H, D))
Wh = rng.normal(0, 0.6, (4 * H, H))
b = np.zeros(4 * H); b[:H] = 1.0                   # forget-gate bias = +1 (default remember)

x = np.array([1.0, -0.5])                          # one input vector
h = np.zeros(H); c = np.array([0.4, -0.2, 0.9])    # carried-in state

z = Wx @ x + Wh @ h + b
f, i, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
g = np.tanh(z[3*H:])                               # candidate c~
c_new = f * c + i * g                              # the additive highway
h_new = o * np.tanh(c_new)

np.set_printoptions(precision=3, suppress=True)
print("forget gate f:", f)
print("input gate i:", i)
print("output gate o:", o)
print("candidate g~:", g)
print("old cell c:", c)
print("new cell c' = f*c + i*g~:", c_new)
print("hidden h' = o*tanh(c'):", h_new)

INSTRUMENT N3.3 — LSTM GATE EXPLORER (interactive on the site; one scalar cell, EQ N3.5). Controls: forget gate f (default 0.90), input gate i (default 0.50), output gate o (default 0.80). Readouts: cell half-life (steps), final cell c, final hidden h.

A single value is written into the cell at \(t=1\) (with gate \(i\)), then the cell runs free under the forget gate \(f\) while the output gate \(o\) controls what leaks into \(h\). Set \(f=1,\ i=0\): the memory is held flat forever — the constant error carousel, half-life \(\infty\). Drop \(f\) to 0.5 and the memory halves every step. Close \(o\) and the cell still remembers internally while \(h\) shows nothing — memory and exposure are separate, which a vanilla RNN cannot do.

A common worry: doesn't the additive cell state grow without bound? It can — which is why \(h_t=o_t\odot\tanh(c_t)\) squashes the readout, and why a well-trained forget gate occasionally dips below 1 to bleed off stale magnitude.
Modern variants add a forget on the candidate too, and peephole connections let the gates see \(c_{t-1}\) directly; both are refinements, not changes to the core highway.

3.4 GRU — a lighter gate

The LSTM works, but it carries two state vectors and four weight matrices per cell. Cho et al. (2014) asked how much of that machinery is essential and arrived at the Gated Recurrent Unit: a single state vector, two gates, and a clever trick that ties "forget" and "input" into one decision.

EQ N3.6 — THE GRU CELL
$$ \begin{aligned} z_t &= \sigma\!\big(W_z[h_{t-1},x_t]\big) &\text{(update gate)} \\ r_t &= \sigma\!\big(W_r[h_{t-1},x_t]\big) &\text{(reset gate)} \\ \tilde{h}_t &= \tanh\!\big(W_h[\,r_t\odot h_{t-1},\,x_t\,]\big) &\text{(candidate)} \\ h_t &= (1-z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t &\text{(blend)} \end{aligned} $$
The update gate \(z_t\) interpolates between keeping the old state and overwriting it with the candidate — one knob does the work of the LSTM's separate forget and input gates, which is why their weights sum to 1 by construction \((1-z_t)+z_t\). The reset gate \(r_t\) decides how much past state feeds the candidate, letting the cell drop irrelevant history when composing new content. There is no separate cell state and no output gate: \(h_t\) is the memory. When \(z_t\approx 0\), \(h_t\approx h_{t-1}\) — the same near-identity skip that protects the gradient.

Property | Vanilla RNN | LSTM | GRU
Gates | 0 | 3 (f, i, o) | 2 (z, r)
State vectors | 1 (h) | 2 (h, c) | 1 (h)
Params / cell | ~1× | ~4× | ~3×
Long-range gradient | vanishes | protected | protected
Separate read-out gate | — | yes (o) | no

Which to use? On many tasks GRU and LSTM are statistically indistinguishable, and GRU's smaller parameter count trains a little faster and needs less data — so it is often the better first choice. LSTM tends to edge ahead when very long-range memory or precise readout control matters, partly because the output gate and dedicated cell state give it one more degree of freedom.
The honest answer, repeated across the literature since the 2014–2017 comparisons: there is no universal winner; the gap is task-dependent and usually small. What both share — and what actually mattered — is the additive, gated state path that keeps the gradient alive. Historical footnote, important for honesty: this entire family has been largely displaced for language by the Transformer (Deep Learning 04 onward), whose attention removes recurrence and parallelizes across the sequence. RNNs persist where streaming or strict left-to-right causality with small state is an advantage — on-device speech, low-latency control, some time-series — and the gating idea itself resurfaced in 2023–2025 in linear-recurrent and state-space models (S4, Mamba) that reclaim \(O(T)\) inference while approaching Transformer quality. 3.5 Backpropagation through time How is any of this trained? By unrolling the network into its depth-\(T\) feed-forward equivalent (§3.1) and running ordinary backpropagation through it — a procedure named backpropagation through time (BPTT). Because the weights are shared across steps, the gradient with respect to a weight is the sum of its contributions at every step: EQ N3.7 — THE BPTT GRADIENT $$ \frac{\partial \mathcal{L}}{\partial W} \;=\; \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W}, \qquad \frac{\partial \mathcal{L}_T}{\partial W} \;=\; \sum_{k=1}^{T} \frac{\partial \mathcal{L}_T}{\partial h_T}\, \frac{\partial h_T}{\partial h_k}\, \frac{\partial h_k}{\partial W} $$ The outer sum collects the loss from every output step; the inner sum routes each loss back through every earlier state via the Jacobian product \(\partial h_T/\partial h_k\) — the same product as EQ N3.2. So BPTT is exactly where vanishing and exploding gradients are born: the cure of §3.3 (an additive cell path) makes the \(k\)-distant term survive instead of decaying to zero. Truncated BPTT is the practical version. 
Backpropagating through a 100,000-token sequence would cost prohibitive memory (every intermediate state must be stored for the backward pass) and re-incur the gradient pathologies. So we cut the sequence into chunks of length \(k\) (say 64–256), backpropagate only within each chunk, but carry the hidden state forward between chunks so the forward pass still sees unlimited context. The model can therefore remember arbitrarily far while only being trained on gradients that span \(k\) steps — a deliberate bias toward shorter-range credit assignment that keeps training tractable. KEY Three ideas, one thread. (1) Recurrence shares weights over time, turning a sequence into a deep net. (2) That depth makes the gradient a long product, which vanishes or explodes (EQ N3.2–N3.3). (3) Gates with an additive state path (LSTM/GRU) give the gradient a near-identity highway, and truncated BPTT makes training that highway affordable. Everything else in sequence modeling is variation on this. NEXT Gates let a single state carry the past; the next leap lets every step look back at every other step directly. Chapter 04 builds the encoder–decoder (seq2seq) framework and the attention mechanism that frees a model from squeezing a whole sequence through one fixed-size bottleneck — the idea that, taken to its limit, becomes the Transformer. 3.R References Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8) — introduces the LSTM cell, the constant error carousel, and the gating scheme of EQ N3.4–N3.5. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 — introduces the GRU (EQ N3.6) and the encoder–decoder framing carried into Chapter 04. Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. 
IEEE Transactions on Neural Networks 5(2) — the formal analysis of vanishing/exploding gradients behind EQ N3.2–N3.3. Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 — the spectral-norm view of the gradient product and the gradient-clipping remedy for explosion (§3.2). Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10) — systematic ablation of LSTM components, including the value of the forget gate and forget-bias initialization (§3.3). Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop — the LSTM-vs-GRU comparison underpinning the "no universal winner" claim (§3.4). Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2312.00752 — the modern selective state-space model reviving gated linear recurrence at scale (§3.4 footnote).
## DL · Seq2Seq & the Birth of Attention (https://ai-encyclopedia.com/dl/04-seq2seq-attention.html)
DEEP LEARNING · CHAPTER 04 / 07
Seq2Seq & the Birth of Attention

An encoder reads a sentence and a decoder writes its translation. The 2014 design made both recurrent networks and passed a single state vector between them. Compressing a whole sentence into one fixed vector was the bottleneck, and letting the decoder look back at every input word, weighting them on demand, removed it. That mechanism is attention, and it leads directly to the Transformer.

LEVEL: CORE
READING TIME: ≈ 24 MIN
BUILDS ON: DEEP LEARNING 03
INSTRUMENTS: HEATMAP · BOTTLENECK · ALIGNMENT

IN THIS CHAPTER
4.1 The encoder-decoder framework
4.2 The fixed-vector bottleneck
4.3 Bahdanau (additive) attention
4.4 Luong (multiplicative) attention
4.5 The bridge to the Transformer
4.R References

4.1 The encoder-decoder framework

Machine translation poses a hard problem for a plain recurrent net: the input and output are both sequences, but of different lengths, in different languages, with no word-by-word alignment. Sutskever, Vinyals and Le (2014) cut the knot with a deceptively simple architecture, now called sequence-to-sequence (seq2seq): one RNN to read, a second RNN to write.

The encoder consumes the source tokens \(x_1, \ldots, x_{T_x}\) one at a time, updating a hidden state. Its final hidden state \(h_{T_x}\) is taken as a summary of the whole sentence — the context vector \(c\). The decoder is a language model conditioned on \(c\): it starts from \(c\), emits a token, feeds that token back in, and repeats until it produces an end-of-sequence symbol.

EQ N4.1 — THE SEQ2SEQ OBJECTIVE
$$ c = h_{T_x}, \qquad p(y_1, \ldots, y_{T_y} \mid x) \;=\; \prod_{i=1}^{T_y} p\!\left( y_i \,\middle|\, y_{<i},\, c \right) $$

PYTHON · RUNNABLE IN-BROWSER
# Encode sources of different lengths into one fixed-width context vector c (EQ N4.1): the bottleneck.
import numpy as np
rng = np.random.default_rng(0)
d = 6                                    # hidden width

def encode(x_embeds):
    # a stand-in RNN: c = tanh(W h + U x)
    Wh = rng.normal(0, 0.4, (d, d)); Ux = rng.normal(0, 0.4, (d, d))
    h = np.zeros(d)
    for x in x_embeds:                   # read left to right, keep ONLY the last state
        h = np.tanh(Wh @ h + Ux @ x)
    return h                             # c = h_{T_x}

for T in (3, 9, 27):                     # short, medium, long source sentences
    src = rng.normal(0, 1, (T, d))       # T token embeddings
    c = encode(src)
    print(f"source length {T:2d} tokens -> context vector c has width {c.size} "
          f"(norm {np.linalg.norm(c):.2f})")
print("\nThe vector NEVER grows. 27 words must fit in the same 6 numbers as 3.")

4.2 The fixed-vector bottleneck

The architecture's elegance is also its flaw. Every nuance of a 40-word source sentence — who did what to whom, every clause, every named entity — must be squeezed into one fixed-dimensional vector \(c\) and held there, unchanged, while the decoder unspools a translation that may itself be 40 words long. The encoder's last state is a lossy, length-blind summary.

The symptom is unmistakable: seq2seq BLEU is fine on short sentences and falls off a cliff as length grows. Cho et al. (2014) documented the decay directly; the longer the input, the more the single vector saturates and the earlier source words it must remember fade. This is an information-theoretic ceiling, not a tuning problem — you cannot store an arbitrarily long sentence in a constant number of bits without loss.

INTUITION Imagine reading a paragraph, then writing its translation from memory without looking back at the page. That is the fixed-vector decoder. Attention is being allowed to glance back at the source — at whichever word you need, exactly when you need it.
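The "translate from memory" decoder of §4.1 can be sketched as a loop that sees the source only through its starting state. This is a toy under stated assumptions: a tanh decoder with random, untrained weights, greedy token choice, and hypothetical names (`greedy_decode`, `E`, `Ws`, `Wy`, `Wo`, the vocabulary size, and the end-token id are all illustrative). It accumulates the log of the product in EQ N4.1 as it goes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, EOS, MAX_LEN = 6, 5, 0, 8        # state width, vocab size, end-token id, length cap (all assumed)
E = rng.normal(0, 1, (V, d))           # token embeddings
Ws = rng.normal(0, 0.5, (d, d))        # decoder recurrence
Wy = rng.normal(0, 0.5, (d, d))        # feeds the previous token back in
Wo = rng.normal(0, 1, (V, d))          # state -> vocabulary logits

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def greedy_decode(c):
    """Greedy decoding from a fixed context c; returns tokens and their total log-probability."""
    s, prev, tokens, logp = c.copy(), np.zeros(d), [], 0.0
    for _ in range(MAX_LEN):
        s = np.tanh(Ws @ s + Wy @ prev)
        lp = log_softmax(Wo @ s)       # log p(y_i | y_<i, c)
        y = int(lp.argmax())
        tokens.append(y)
        logp += float(lp[y])           # sum of logs = log of the product in EQ N4.1
        if y == EOS:
            break
        prev = E[y]                    # feed the emitted token back in
    return tokens, logp

c = np.tanh(rng.normal(0, 1, d))       # stand-in for the encoder's final state h_{T_x}
tokens, logp = greedy_decode(c)
print("tokens:", tokens, " log p(y | x):", round(logp, 3))
```

Nothing in the loop ever revisits the source: once `c` has seeded the decoder state, every later step works from what that state still retains.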
INSTRUMENT N4.1 — BOTTLENECK vs ATTENTION (interactive on the site; translation quality vs source length). Controls: context width d (default 512), decoder reads (fixed c or attention). Readouts: quality @ 10 words, quality @ 50 words, regime.

A stylized model of the empirical curve from Bahdanau et al. (Fig. 2). With a fixed context vector, quality decays past the length the width can hold — widen d and the cliff moves right but never disappears. Switch to attention and the curve goes flat: the decoder reads the source afresh at every step, so length stops mattering.

4.3 Bahdanau (additive) attention

Bahdanau, Cho and Bengio (2014) made the decisive move. Keep all the encoder hidden states — one per source word, \(h_1, \ldots, h_{T_x}\), now called annotations (and produced by a bidirectional RNN so each \(h_j\) summarizes the whole sentence centered on word \(j\)). At every decoding step \(i\), build a different context vector \(c_i\) by taking a weighted average of those annotations — with weights the decoder chooses on the fly.

The weights come from an alignment model: a tiny feedforward net that scores how well decoder state \(s_{i-1}\) matches each annotation \(h_j\). Because the score is computed with a sum inside a \(\tanh\), this is called additive attention.

EQ N4.2 — ADDITIVE ALIGNMENT SCORE
$$ e_{ij} \;=\; v_a^{\top} \tanh\!\left( W_a\, s_{i-1} + U_a\, h_j \right) $$
\(s_{i-1}\) is the decoder's previous state (the "query"); \(h_j\) is the \(j\)-th source annotation (a "key"). \(W_a, U_a\) project both into a shared space; \(\tanh\) mixes them; \(v_a\) collapses the result to one scalar relevance score. Crucially, \(W_a, U_a, v_a\) are learned jointly with the whole translator — alignment is never supervised, it emerges from the translation loss. This is the original attention mechanism, three years before "Attention Is All You Need."
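The annotations \(h_j\) scored by EQ N4.2 come from a bidirectional encoder: one RNN reads left to right, another right to left, and the two states at each position are concatenated. A minimal sketch, assuming plain tanh cells with random weights in place of the gated units a real translator would use; the helper `rnn_states` and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, Tx = 3, 4, 5                           # toy sizes (assumed)

def rnn_states(X, W, U):
    """Run a tanh RNN over the rows of X and return every hidden state."""
    h, out = np.zeros(W.shape[0]), []
    for x in X:
        h = np.tanh(W @ h + U @ x)
        out.append(h)
    return np.stack(out)

Wf, Uf = rng.normal(0, 0.5, (d_h, d_h)), rng.normal(0, 0.5, (d_h, d_in))
Wb, Ub = rng.normal(0, 0.5, (d_h, d_h)), rng.normal(0, 0.5, (d_h, d_in))

X = rng.normal(0, 1, (Tx, d_in))                  # source embeddings x_1..x_Tx
fwd = rnn_states(X, Wf, Uf)                       # reads left to right
bwd = rnn_states(X[::-1], Wb, Ub)[::-1]           # reads right to left, re-aligned to positions
annotations = np.concatenate([fwd, bwd], axis=1)  # h_j = [forward_j ; backward_j]

print(annotations.shape)
```

The forward half of \(h_j\) has seen words \(1..j\) and the backward half has seen words \(j..T_x\), so together each annotation covers the whole sentence while staying focused on position \(j\).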
Softmax over the source positions turns scores into a probability distribution — the attention weights \(\alpha_{ij}\) — and the context vector is their weighted sum of annotations:

EQ N4.3 — WEIGHTS & CONTEXT VECTOR
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j, \qquad \sum_{j=1}^{T_x} \alpha_{ij} = 1 $$
For each output position \(i\), the weights \(\alpha_{i\cdot}\) form a convex combination over the source — they sum to exactly 1, so \(c_i\) is a soft, differentiable lookup into the encoder. The context now varies per output step: translating the verb pulls weight onto the source verb; translating the object pulls weight onto the object. There is no longer one frozen \(c\). The fixed-length bottleneck is gone — the decoder's "memory" is the whole source, re-addressed every step.

WORKED EXAMPLE
01 Suppose for output step \(i\) the alignment net produces raw scores over four source words: \(e_{i\cdot} = (1.0,\ 0.0,\ 0.5,\ -0.5)\).
02 Exponentiate: \(e^{1.0}=2.718\), \(e^{0.0}=1.000\), \(e^{0.5}=1.649\), \(e^{-0.5}=0.607\). Sum \(= 5.974\).
03 Softmax (EQ N4.3): \(\alpha_{i\cdot} = (0.455,\ 0.167,\ 0.276,\ 0.102)\). They sum to \(1.000\) — a valid distribution over the source.
04 Context \(c_i = 0.455\,h_1 + 0.167\,h_2 + 0.276\,h_3 + 0.102\,h_4\): mostly the first source word, but a genuine blend — never a hard pick. That softness is what makes the whole thing differentiable.
RESULT: attention weights = (0.455, 0.167, 0.276, 0.102), sum = 1

A decoder attends over \(T_x = 4\) encoder annotations with weights \( \alpha_{i1}, \alpha_{i2}, \alpha_{i3}, \alpha_{i4} \) produced by softmax (EQ N4.3). What does \( \sum_{j=1}^{4} \alpha_{ij} \) equal?

Softmax normalizes its outputs by their own sum, so they always form a probability distribution: \( \sum_{j} \alpha_{ij} = 1 \). This is why \(c_i\) is a convex combination — a true weighted average — of the annotations.
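The arithmetic of the worked example can be reproduced in a few lines. The `temperature` argument below is an added illustration and not part of EQ N4.3: dividing the scores by a temperature before the softmax sharpens the weights when it is below 1 and flattens them when it is above 1.

```python
import numpy as np

def softmax(e, temperature=1.0):
    z = np.asarray(e, dtype=float) / temperature
    z = z - z.max()                     # stabilise before exponentiating
    w = np.exp(z)
    return w / w.sum()

e = [1.0, 0.0, 0.5, -0.5]               # raw scores from the worked example
alpha = softmax(e)
print(alpha.round(3))                   # [0.455 0.167 0.276 0.102]

sharp = softmax(e, temperature=0.25)    # closer to a hard pick of the top score
flat = softmax(e, temperature=4.0)      # closer to a uniform average
print("peak weight:", sharp.max().round(3), alpha.max().round(3), flat.max().round(3))
```

Whatever the temperature, the weights still sum to 1, so the context vector remains a convex combination of the annotations.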
PYTHON · RUNNABLE IN-BROWSER
# EQ N4.2: additive attention scores from scratch, then softmax to weights.
import numpy as np
rng = np.random.default_rng(1)
d, a = 5, 4                          # hidden width d, alignment width a
Tx = 4                               # four source words
H = rng.normal(0, 1, (Tx, d))        # encoder annotations h_1..h_Tx
s_prev = rng.normal(0, 1, d)         # decoder state s_{i-1}
Wa = rng.normal(0, 0.5, (a, d))      # project the query
Ua = rng.normal(0, 0.5, (a, d))      # project each key
va = rng.normal(0, 0.5, a)           # collapse to a scalar

e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])   # EQ N4.2
alpha = np.exp(e - e.max()); alpha /= alpha.sum()               # softmax, EQ N4.3

np.set_printoptions(precision=3, suppress=True)
print("raw alignment scores e_ij:", e)
print("attention weights alpha:", alpha)
print("weights sum to:", round(float(alpha.sum()), 6))

INSTRUMENT N4.2 — ATTENTION-WEIGHT HEATMAP (interactive on the site; EN → FR, rows = output, cols = source, EQ N4.3). Controls: softmax temperature (default 1.00), output token (hover a row). Readouts: aligned source word, peak weight.

A toy alignment for "the agreement on the economic area" → "l'accord sur la zone économique". Each row is one output word's distribution over the source (each row sums to 1). The bright near-diagonal band is monotonic translation; the off-diagonal cells are real reordering — French zone économique flips the adjective order of English economic area, exactly the case where a fixed vector fails. Hover a row; drop the temperature to sharpen each lookup toward a hard pick, raise it to blur toward a uniform average.

4.4 Luong (multiplicative) attention

A year later, Luong, Pham and Manning (2015) simplified and systematized the idea. Their headline observation: the \(\tanh\) feedforward scorer is more machinery than you need. If query and key live in the same space, a plain dot product already measures their alignment — and a dot product is a single, GPU-friendly matrix multiply rather than a small MLP. Hence multiplicative (a.k.a.
dot-product) attention.

EQ N4.4 — LUONG SCORING FUNCTIONS
$$ \mathrm{score}(s_i, h_j) = \begin{cases} s_i^{\top} h_j & \textbf{dot} \\[4pt] s_i^{\top} W_a\, h_j & \textbf{general} \\[4pt] v_a^{\top}\tanh\!\left(W_a [\,s_i;\,h_j\,]\right) & \textbf{concat} \end{cases} $$
Three variants, increasing in flexibility. dot assumes encoder and decoder share a space — zero new parameters. general inserts one learned matrix \(W_a\) to bridge mismatched spaces — the usual default. concat (≈ Bahdanau) recovers the additive form. Luong also used the current decoder state \(s_i\) (not \(s_{i-1}\) as in Bahdanau), and reframed it as "global vs local" attention — local restricting the window to a few source positions for very long inputs.

Two architectures, one essential idea. The differences are practical: additive attention is marginally more robust when query and key dimensions differ; multiplicative attention is faster and more memory-efficient, and at large dimension it needs the now-famous \(1/\sqrt{d_k}\) rescaling to keep softmax out of saturation. That scaled dot product is exactly the score function the Transformer would adopt — Luong's general form, with the projections renamed \(W_Q\) and \(W_K\), is scaled dot-product attention.

Property | Bahdanau (2014) | Luong (2015)
Score | additive (tanh MLP) | dot / general / concat
Decoder state used | \(s_{i-1}\) (previous) | \(s_i\) (current)
Encoder | bidirectional RNN | top LSTM layer
Cost / extra params | MLP per pair | one matmul (dot: none)
Descendant | — | scaled dot-product attn

True or false: attention removes the fixed-length context bottleneck of plain seq2seq, because the decoder rebuilds a fresh context vector \(c_i\) from all encoder states at every output step. (Answer true or false.)

The whole point of EQ N4.3 is that \(c_i = \sum_j \alpha_{ij} h_j\) is recomputed for each \(i\) over the entire source. Nothing is forced through a single constant-width vector, so the length-blind bottleneck disappears. The answer is true.
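The three scoring functions of EQ N4.4, and the reason large dimensions call for the \(1/\sqrt{d_k}\) rescaling, can be sketched side by side. The function names, the random unit-variance vectors, and the sizes below are illustrative assumptions; the weights are untrained, so only the shapes and the saturation effect are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def score_dot(s, H):
    return H @ s                                    # s^T h_j for every j

def score_general(s, H, Wa):
    return H @ (Wa.T @ s)                           # s^T W_a h_j

def score_concat(s, H, Wa, va):
    pairs = np.concatenate([np.tile(s, (len(H), 1)), H], axis=1)   # rows [s ; h_j]
    return np.tanh(pairs @ Wa.T) @ va               # v_a^T tanh(W_a [s ; h_j])

# Small example: all three variants return one score per source position.
d, a, Tx = 4, 3, 5
s, H = rng.normal(0, 1, d), rng.normal(0, 1, (Tx, d))
print(score_dot(s, H).shape,
      score_general(s, H, rng.normal(0, 0.5, (d, d))).shape,
      score_concat(s, H, rng.normal(0, 0.5, (a, 2 * d)), rng.normal(0, 0.5, a)).shape)

# Large dimension: raw dot products grow like sqrt(d) and push softmax toward a hard pick.
d_big = 512
s_big, H_big = rng.normal(0, 1, d_big), rng.normal(0, 1, (Tx, d_big))
raw = softmax(score_dot(s_big, H_big))
scaled = softmax(score_dot(s_big, H_big) / np.sqrt(d_big))
print("max weight, raw dot:   ", round(float(raw.max()), 3))
print("max weight, scaled dot:", round(float(scaled.max()), 3))
```

With unit-variance entries the raw dot product has a standard deviation near \(\sqrt{d}\), so at \(d = 512\) the unscaled softmax tends to put nearly all its weight on one position; dividing by \(\sqrt{d}\) brings the scores back to order 1 and keeps the weights soft.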
PYTHON · RUNNABLE IN-BROWSER
# EQ N4.3/N4.4: the context vector as the attention-weighted sum of encoder states,
# scored with Luong dot-product attention. Verify it is a convex combination.
import numpy as np
rng = np.random.default_rng(2)
d, Tx = 5, 4
H = rng.normal(0, 1, (Tx, d))        # encoder states (rows = source words)
s_i = rng.normal(0, 1, d)            # current decoder state

scores = H @ s_i                     # EQ N4.4 "dot": one matmul, no params
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()   # softmax
c_i = alpha @ H                      # EQ N4.3: weighted sum of states

np.set_printoptions(precision=3, suppress=True)
print("attention weights alpha:", alpha, " (sum", round(float(alpha.sum()), 3), ")")
print("context vector c_i:", c_i)
# A convex combo must lie inside the per-dim min/max of the states it blends:
lo, hi = H.min(0), H.max(0)
print("c_i within state hull?:", bool(np.all(c_i >= lo - 1e-9) and np.all(c_i <= hi + 1e-9)))

INSTRUMENT N4.3 — ALIGNMENT VISUALIZER (interactive on the site; soft word-to-word links, dot-product score). Controls: output step i (1 of 5), sharpness 1/τ (default 1.0×). Readouts: emitting token, strongest link, entropy (bits).

Step through the output one token at a time and watch the soft links re-aim at the source words that matter. The line opacity is \(\alpha_{ij}\); raising sharpness collapses the fan toward a single hard link (low entropy ≈ a dictionary lookup), lowering it spreads attention across the sentence (high entropy ≈ averaging). Notice the crossing lines at zone économique: attention reorders without being told the alignment.

4.5 The bridge to the Transformer

By 2016 attention was bolted onto every competitive RNN translator. But it still rode on top of recurrence: the encoder and decoder remained sequential RNNs, and that sequentiality — each step waiting on the last — capped how much you could parallelize on a GPU and how far gradients reached across long sentences. Vaswani et al.
(2017) asked the obvious next question: if attention is doing the real work of moving information, do we need the RNN at all? "Attention Is All You Need" answered no. Three moves complete the bridge from this chapter:
Self-attention. Bahdanau and Luong attention is cross-attention — the decoder attending to the encoder. Point the same mechanism at a sequence's own positions and you get self-attention, which replaces recurrence entirely. Every token can mix with every other in one parallel step.
Scaled dot-product, multi-head. Luong's dot/general score, divided by \(\sqrt{d_k}\) (EQ N4.4 plus the variance fix), becomes the core operation; running \(h\) of them in parallel subspaces gives multi-head attention. The query/key/value vocabulary is just the alignment-model query and the annotation keys/values, renamed and made symmetric.
Positional encodings. Drop recurrence and the model loses all sense of order, so position is injected directly into the embeddings — the one piece the RNN used to supply for free.
EQ N4.5 — FROM ALIGNMENT SCORE TO SCALED DOT-PRODUCT $$ \underbrace{e_{ij} = v_a^{\top}\tanh(W_a s_i + U_a h_j)}_{\text{Bahdanau, EQ N4.2}} \;\longrightarrow\; \underbrace{e_{ij} = \frac{(W_Q s_i)^{\top}(W_K h_j)}{\sqrt{d_k}}}_{\text{Transformer (Vol II · EQ 3.1)}} $$ Same skeleton — score every key against the query, softmax, take a weighted sum of values — with the learned MLP scorer swapped for a cheap scaled dot product and the recurrent backbone deleted. Everything that followed (BERT, GPT, and modern LLMs) is this idea scaled up. The 2014 bottleneck and the 2017 Transformer are two ends of one short, straight line; the full mechanism, multi-head, KV cache and all, is the subject of Vol II · Chapter 03.
NEXT Attention gave the decoder a memory; the next chapter asks what a network learns when it has no labels at all.
Chapter 05: autoencoders — the encoder-decoder shape turned inward to compress, denoise, and discover latent structure, and the variational twist that makes those latents generate.
4.R References
Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014 — the encoder-decoder LSTM framework (EQ N4.1) and the source-reversal trick.
Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 — additive attention (EQ N4.2/N4.3); the birth of the mechanism and the length-decay figure.
Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 — multiplicative (dot/general/concat) and global-vs-local attention (EQ N4.4).
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014 — the RNN encoder-decoder and GRU; documents the fixed-vector length decay (§4.2).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 — drops recurrence for pure self-attention; the destination of EQ N4.5.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 — soft/hard attention beyond translation; shows the mechanism generalizes (§4.1).
## DL · Autoencoders & VAEs (https://ai-encyclopedia.com/dl/05-autoencoders.html)
DEEP LEARNING · CHAPTER 05 / 07 Autoencoders & VAEs
Force a network to reconstruct its input through a narrow bottleneck and it learns the data's hidden coordinates, the few axes along which the data actually varies. The variational form replaces that single code with a probability distribution and regularizes it toward a known prior, turning the bottleneck into something you can sample from: a generative model.
LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DEEP LEARNING 01–04 INSTRUMENTS BOTTLENECK · VAE SAMPLER · DENOISER
IN THIS CHAPTER 5.1 Learning to compress 5.2 Denoising & overcomplete 5.3 The latent space 5.4 Variational autoencoders 5.5 Representation & uses 5.R References
5.1 Autoencoders — learning to compress
An autoencoder is a network trained to copy its input to its output — a task that sounds trivial until you choke the path between them. An encoder \(f_\theta\) maps the input \(x \in \mathbb{R}^d\) down to a code \(z \in \mathbb{R}^k\) with \(k \ll d\); a decoder \(g_\phi\) maps the code back up to a reconstruction \(\hat{x}\). Both are trained jointly to make \(\hat{x}\) look like \(x\):
EQ N5.1 — RECONSTRUCTION OBJECTIVE $$ z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad \mathcal{L}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N} \big\lVert x^{(i)} - g_\phi\!\big(f_\theta(x^{(i)})\big) \big\rVert_2^2 $$ Mean squared error is the default for continuous inputs (pixels, embeddings); binary cross-entropy per pixel is standard for \([0,1]\) images. The whole trick is the bottleneck: the code \(z\) is narrower than \(x\), so a perfect copy is impossible and the network must spend its few code dimensions on whatever explains the most variance. Nothing in the loss says "find structure" — structure is the only way to win at copying through a constriction.
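To make EQ N5.1 concrete, here is a minimal sketch of the forward pass and the loss for a one-layer autoencoder. The weights `W_enc` and `W_dec`, the `tanh` encoder, and the sizes `N`, `d`, `k` are illustrative assumptions with random, untrained values; the sketch shows only how the code and the reconstruction error are computed, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 256, 16, 3                    # batch size, input dim, code dim (k << d)
X = rng.normal(size=(N, d))             # stand-in data

# Hypothetical one-layer encoder and decoder with random, untrained weights
W_enc, b_enc = 0.1 * rng.normal(size=(d, k)), np.zeros(k)
W_dec, b_dec = 0.1 * rng.normal(size=(k, d)), np.zeros(d)

def encode(x):
    return np.tanh(x @ W_enc + b_enc)   # z = f_theta(x), shape (batch, k)

def decode(z):
    return z @ W_dec + b_dec            # x_hat = g_phi(z), shape (batch, d)

def reconstruction_loss(x):
    x_hat = decode(encode(x))
    # EQ N5.1: mean over examples of the squared L2 norm of the residual
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

Z = encode(X)
loss = reconstruction_loss(X)
print("code shape:", Z.shape)
print("reconstruction loss (untrained):", round(float(loss), 3))
```

Training would adjust `W_enc` and `W_dec` by gradient descent to push this number down. With `k` much smaller than `d`, it cannot reach zero on data that genuinely varies in more than `k` directions.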
The label is the input itself, so autoencoders are self-supervised: they need no annotations, just data. What they learn is a coordinate system — a chart of the low-dimensional manifold that the high-dimensional data lives near. A 28×28 image has 784 pixels, but the set of handwritten digits occupies a far thinner sheet inside that 784-dimensional cube; the bottleneck is the network's estimate of how thin. The cleanest case is fully linear. Let the encoder be a matrix \(W \in \mathbb{R}^{k\times d}\) and the decoder \(W^\top\), with mean-centered data and squared-error loss. The optimum is not unique, but the subspace it spans is: it is exactly the span of the top \(k\) principal components of the data (Baldi & Hornik, 1989). A linear autoencoder rediscovers PCA from scratch — gradient descent on reconstruction error walks straight to the eigenvectors of the covariance matrix. WHY IT MATTERS The linear case is the Rosetta stone. It tells you an autoencoder's job is dimensionality reduction, and that the bottleneck width \(k\) is choosing how many directions of variance to keep. Nonlinear encoders simply bend PCA's flat hyperplane into a curved manifold — same goal, more expressive chart. See Vol I · EQ 4.x for PCA via the SVD. A single-hidden-layer linear autoencoder with code width \(k\), trained to minimize squared reconstruction error on mean-centered data, recovers the same subspace as the top \(k\) principal components. True or false? (Enter true or false.) With linear \(f,g\) and MSE loss, the global optimum projects onto the span of the top-\(k\) eigenvectors of the data covariance — exactly PCA's subspace. Individual weights differ (any invertible mixing of the \(k\) directions reconstructs equally well), but the subspace is identical. Answer: true. 
PYTHON · RUNNABLE IN-BROWSER
# Linear autoencoder == PCA: train by gradient descent, compare to eigenvectors
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 8, 2, 600
# data on a 2D plane (k=2) embedded in 8D, plus small noise -> intrinsic rank 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # true 2D subspace
X = rng.normal(size=(N, k)) @ basis.T + 0.02 * rng.normal(size=(N, d))
X -= X.mean(0)                                     # center (PCA assumes this)
# PCA: top-k right-singular vectors = the "answer" subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k]                                         # k x d principal axes
# Linear AE: encoder We (d->k), decoder Wd (k->d). Minimize ||X - (X We) Wd||^2
We = 0.1 * rng.normal(size=(d, k))
Wd = 0.1 * rng.normal(size=(k, d))
for step in range(6000):
    Z = X @ We                                     # codes: N x k
    R = Z @ Wd - X                                 # residual: N x d
    We -= 0.08 * (X.T @ (R @ Wd.T) / N)            # dL/dWe
    Wd -= 0.08 * (Z.T @ R / N)                     # dL/dWd
err = np.linalg.norm(X - (X @ We) @ Wd) / np.linalg.norm(X)
Wq = np.linalg.qr(We)[0]                           # orthonormal basis of code space
overlap = np.linalg.svd(P @ Wq, compute_uv=False)  # cos(principal angles); 1 == aligned
print(f"AE relative reconstruction error: {err:.4f}")
print(f"cos(principal angles) AE vs PCA: {np.round(overlap, 4)}")
print("error tiny, cosines ~1 -> the AE found the PCA plane, just rotated in it")

INSTRUMENT N5.1 — BOTTLENECK EXPLORER · LATENT WIDTH k vs RECONSTRUCTION · PCA SURROGATE (interactive; latent width k = 8, input dim d = 64; readouts: variance kept, compression d / k)
A synthetic 64-dim dataset whose variance decays across components (the usual heavy-headed spectrum). The bars show how much of each component a width-\(k\) code can keep; the mint curve is cumulative variance retained — the best possible reconstruction at that bottleneck. Slide \(k\) low to feel the squeeze: the first handful of axes carry most of the signal, and every dimension past the manifold's intrinsic rank buys almost nothing.
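The trade-off the bottleneck explorer describes can be reproduced numerically. The sketch below assumes a synthetic 64-dimensional dataset whose per-axis standard deviation decays as \(1/j\) (the decay law and the sample count are assumptions, not the instrument's actual settings) and uses PCA as the linear stand-in for an autoencoder. It reports the cumulative variance retained at each width \(k\) and checks it against the error of an explicit rank-\(k\) reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 64
scales = 1.0 / np.arange(1, d + 1)          # assumed 1/j decay of per-axis std
X = rng.normal(size=(N, d)) * scales
X -= X.mean(0)                              # center before PCA

_, S, Vt = np.linalg.svd(X, full_matrices=False)
var = S ** 2                                # variance captured by each principal axis
kept = np.cumsum(var) / var.sum()           # cumulative variance retained at width k

def rel_sq_error(k):
    P = Vt[:k]                              # top-k principal axes, k x d
    X_hat = (X @ P.T) @ P                   # encode to k dims, then decode back
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

for k in (1, 2, 4, 8, 16, 32, 64):
    print(f"k={k:2d}  variance kept={kept[k - 1]:.3f}  rel. sq. error={rel_sq_error(k):.3f}")
```

The two printed columns always sum to one: the fraction of variance a width-\(k\) linear code drops is exactly its relative squared reconstruction error, so the cumulative-variance curve is also the best achievable reconstruction curve.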
5.2 Denoising & overcomplete variants
A bottleneck is one way to stop an autoencoder from learning the useless identity map. It is not the only way, and not always the best. If you let \(k \ge d\) — an overcomplete code — a vanilla autoencoder can cheat by copying the input straight through, learning nothing. Three families of regularizer break that shortcut while keeping a wide, expressive code.
Denoising autoencoders (DAE) corrupt the input, then demand a clean reconstruction. The network sees \(\tilde{x} = x + \varepsilon\) (added Gaussian noise, or random pixel masking) and must produce the original \(x\):
EQ N5.2 — DENOISING OBJECTIVE $$ \tilde{x} \sim q(\tilde{x}\mid x), \qquad \mathcal{L}_{\text{DAE}} = \mathbb{E}_{x}\,\mathbb{E}_{\tilde{x}\sim q(\cdot\mid x)} \big\lVert x - g_\phi\!\big(f_\theta(\tilde{x})\big) \big\rVert_2^2 $$ Copying is now impossible — the noisy input is not the target. To undo corruption the network must learn the shape of the data: it pushes corrupted points back onto the clean manifold. Vincent et al. (2008) introduced the denoising autoencoder with this manifold-projection view; later analysis (Vincent, 2011) showed that such a denoiser implicitly learns the score — the gradient of the log-density, \(\nabla_x \log p(x)\) — pointing toward where real data lives. That same insight is the seed of modern diffusion models (Chapter 07): a diffusion model is, in essence, a denoising autoencoder trained at every noise level at once.
Two cousins regularize differently. Sparse autoencoders allow a wide code but penalize how many units fire at once — an \(L_1\) penalty or a KL term that pins each unit's average activation to a small target. The code stays overcomplete, but any single input lights up only a few dimensions, so each one specializes into an interpretable feature. (This is exactly the mechanism behind today's sparse-autoencoder interpretability work, which decomposes an LLM's dense activations into thousands of monosemantic features.)
Contractive autoencoders add \(\lVert J_f(x)\rVert_F^2\), the squared Frobenius norm of the encoder's Jacobian, forcing the code to be insensitive to small input perturbations — flat along directions that don't matter, responsive only along the manifold.
Variant | What stops the identity map | What the code becomes
Undercomplete | narrow bottleneck \(k < d\) | Top directions of variance (PCA-like).
Denoising | corrupt input, clean target | A projection back onto the data manifold; learns the score.
Sparse | \(L_1\) / KL activation penalty | Overcomplete but few-active; specialized, often interpretable features.
Contractive | Jacobian-norm penalty | Locally invariant code, flat off the manifold.
All four share one moral: an autoencoder is only as good as the pressure you put on its code. Remove every constraint and it learns the identity; impose the right one and it learns the data's geometry.
INSTRUMENT N5.2 — DENOISING AUTOENCODER · CORRUPT → ENCODE → RECONSTRUCT · 1D SIGNAL (interactive; noise σ = 0.30, code width k = 4; readouts: corrupted MSE, denoised MSE, noise removed)
The clean signal is a smooth manifold spanned by a few low-frequency basis functions (the "data"). We add noise σ, then project the corrupted signal onto the top-\(k\) basis — the linear denoiser an autoencoder converges to. Watch the reconstruction snap back toward the clean curve: a \(k\)-dimensional code can't represent the high-frequency noise, so the noise is discarded. Raise σ and the denoised MSE stays far below the corrupted MSE — that gap is the autoencoder doing its job.
5.3 The latent space
The code \(z\) is not just a compressed file — it is a place. The set of all codes the encoder can produce is the latent space, and its geometry is where autoencoders earn their keep. Distances in latent space correspond to perceptual or semantic distances in data space far better than raw pixel distance does: two photos of the same face under different lighting are far apart in pixels but close in a good latent.
This is what makes the latent space useful for more than compression. You can interpolate: decode \(g_\phi\big((1-t)\,z_a + t\,z_b\big)\) and sweep \(t\) from 0 to 1 to morph smoothly from one example to another. You can cluster in latent space, where classes separate cleanly. You can do nearest-neighbour retrieval on codes instead of inputs. And you can detect anomalies: a point the autoencoder reconstructs poorly is, by construction, off the manifold it learned — high reconstruction error is an unsupervised novelty score.
THE CATCH A plain autoencoder's latent space has holes. Training only constrains the codes the encoder actually emits; the space between and around them is unconstrained. Decode a random point — or a midpoint between two clusters — and you often get garbage, because the decoder was never asked to make that region meaningful. The latent is a scatter of trained islands in an empty sea. You cannot reliably sample new data from it. Fixing this hole is the entire motivation for the variational autoencoder.
[Figure, two panels. PLAIN AE — ISLANDS & HOLES: sample here = garbage. VAE — FILLED GAUSSIAN CLOUD: sample anywhere = valid.]
5.4 Variational autoencoders
The variational autoencoder (VAE) of Kingma & Welling (2013) closes the holes by making two changes. First, the encoder no longer outputs a single point — it outputs a distribution: a mean \(\mu(x)\) and a (log-)variance \(\log\sigma^2(x)\) defining \(q_\phi(z\mid x) = \mathcal{N}(\mu, \sigma^2 I)\). Second, the loss regularizes that distribution toward a standard normal prior \(p(z) = \mathcal{N}(0, I)\). Encode a point and you get a fuzzy ball, not a dot; train the whole dataset and the balls overlap to tile a smooth, gap-free Gaussian cloud you can sample from at will.
The objective is a lower bound on the data log-likelihood, the evidence lower bound (ELBO): EQ N5.3 — THE ELBO (VAE LOSS) $$ \log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big)}_{\text{regularizer}} $$ The VAE maximizes the ELBO, equivalently minimizes \(-\text{ELBO}\). Two terms in tension: the first says "encode enough of \(x\) that the decoder can rebuild it"; the second says "keep \(q(z\mid x)\) close to the prior so the latent stays a tidy, sampleable Gaussian." The VAE loss is reconstruction error plus the KL divergence to the prior — that single sentence is the whole model. Crank a weight \(\beta\) on the KL term and you get the \(\beta\)-VAE (Higgins et al., 2017), which trades reconstruction sharpness for more disentangled, axis-aligned latents. For diagonal Gaussians the KL term has a clean closed form — no sampling needed to compute it: EQ N5.4 — KL OF DIAGONAL GAUSSIAN TO N(0, I) $$ D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu,\sigma^2 I)\,\Vert\,\mathcal{N}(0,I)\big) \;=\; \frac{1}{2}\sum_{j=1}^{k}\Big(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\Big) $$ Each latent dimension contributes independently. The term is zero exactly when \(\mu_j = 0\) and \(\sigma_j = 1\) — i.e. when that dimension is the prior. It penalizes a code for drifting from the origin (\(\mu_j^2\)) or collapsing to a spike (\(-\log\sigma_j^2\) blows up as \(\sigma_j \to 0\)). This pressure is what fills the gaps: every encoded ball is pushed to overlap the others around the origin. One obstacle remains. The ELBO contains an expectation over \(z \sim q_\phi(z\mid x)\), and sampling \(z\) is not differentiable — you can't backpropagate through a random draw. The fix is the reparameterization trick: move the randomness outside the network. 
Instead of sampling \(z\) directly, sample a fixed-noise \(\varepsilon \sim \mathcal{N}(0, I)\) and build \(z\) as a deterministic, differentiable function of \(\mu\), \(\sigma\), and \(\varepsilon\): EQ N5.5 — REPARAMETERIZATION TRICK $$ z = \mu(x) + \sigma(x) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I) $$ Now \(z\) is a smooth function of the parameters \((\mu,\sigma)\) with the stochasticity quarantined in \(\varepsilon\), so gradients flow through \(\mu\) and \(\sigma\) cleanly. \(\odot\) is elementwise product. This one line is what makes the VAE trainable end-to-end by ordinary backprop — arguably the paper's most reused idea, now standard far beyond VAEs (it powers the policy-gradient reparameterizations in Vol III and the noise schedules of diffusion). The VAE training objective (the negative ELBO it minimizes) is the reconstruction loss plus the KL divergence from the approximate posterior \(q_\phi(z\mid x)\) to the prior \(p(z)\). True or false? (Enter true or false.) EQ N5.3 maximizes \(\mathbb{E}_q[\log p(x\mid z)] - D_{\mathrm{KL}}(q\Vert p)\). Flipping sign to a loss: minimize \((-\text{reconstruction}) + D_{\mathrm{KL}}(q\Vert p)\) — i.e. reconstruction loss plus KL to the prior. Answer: true. A one-dimensional VAE latent has \(\mu = 1\) and \(\sigma = 1\). Using EQ N5.4, what is the KL divergence \(D_{\mathrm{KL}}\big(\mathcal{N}(1,1)\,\Vert\,\mathcal{N}(0,1)\big)\)? (Recall \(\log 1 = 0\).) \(\tfrac12\big(\sigma^2 + \mu^2 - 1 - \log\sigma^2\big) = \tfrac12\big(1 + 1 - 1 - \log 1\big) = \tfrac12(1 + 1 - 1 - 0) = \tfrac12 \cdot 1 = \) 0.5 nats. The mean is one standard deviation off the prior; the variance already matches, so all the cost comes from \(\mu^2\). 
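EQ N5.3 through EQ N5.5 combine into a single number per input, and a small sketch may help show how. The code below is an illustration under stated assumptions rather than a full VAE: the encoder outputs `mu` and `log_var` are hard-coded instead of produced by a network, the decoder `W_dec` is a random untrained linear map, and the reconstruction term assumes a unit-variance Gaussian decoder, so that \(-\log p(x\mid z)\) equals half the squared error up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
x = rng.normal(size=d)                       # one input (stand-in data)

# Hypothetical encoder outputs for x; a real VAE computes these with a network
mu = np.array([1.0, -0.5])
log_var = np.array([0.0, -1.0])

W_dec = rng.normal(size=(k, d))              # hypothetical linear decoder, untrained

def kl_to_standard_normal(mu, log_var):
    # EQ N5.4, written in terms of the log-variance
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def negative_elbo(x, mu, log_var, beta=1.0, n_samples=64):
    eps = rng.normal(size=(n_samples, mu.size))
    z = mu + np.exp(0.5 * log_var) * eps     # EQ N5.5: reparameterized samples
    x_hat = z @ W_dec
    # Assumed unit-variance Gaussian decoder: 0.5 * ||x - x_hat||^2, constant dropped
    recon = np.mean(0.5 * np.sum((x - x_hat) ** 2, axis=1))
    return recon + beta * kl_to_standard_normal(mu, log_var), recon

loss, recon = negative_elbo(x, mu, log_var)
kl = kl_to_standard_normal(mu, log_var)
print(f"reconstruction term: {recon:.3f}")
print(f"KL term:             {kl:.3f}")
print(f"negative ELBO:       {loss:.3f}")
```

The `beta` argument is the KL weight of the \(\beta\)-VAE: `beta=1.0` gives the plain negative ELBO, and larger values press the posterior harder toward the prior. The expectation over \(z\) is estimated by Monte Carlo, while the KL term needs no sampling at all.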
PYTHON · RUNNABLE IN-BROWSER
# VAE reparameterization trick: z = mu + sigma*eps, plus the closed-form KL
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.0])        # encoder mean for one input
log_var = np.array([0.0, 0.0, 2.0])    # encoder log-variance (sigma^2)
sigma = np.exp(0.5 * log_var)          # -> sigma = [1, 1, e]
# draw many z via the trick; randomness lives only in eps ~ N(0, I)
eps = rng.normal(size=(20000, 3))
z = mu + sigma * eps                   # broadcast: deterministic in (mu, sigma)
print("sample mean ~ mu:", np.round(z.mean(0), 3), " target", mu)
print("sample std ~ sigma:", np.round(z.std(0), 3), " target", np.round(sigma, 3))
# closed-form KL( N(mu,sigma^2) || N(0,I)) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - log_var)
print(f"\nKL to prior (nats): {kl:.4f}")
print("check dim 0 (mu=1, sig=1): 0.5*(1+1-1-0) =", 0.5*(1+1-1-0), "-> 0.5")
print("dim 2 (mu=0, sig=e) pays for an over-wide variance, log_var=2")

INSTRUMENT N5.3 — VAE LATENT SAMPLER · 2D LATENT GRID → DECODED OUTPUTS · EQ N5.5 (interactive; latent range ± 2.5 std, KL weight β = 1.0; readouts: prior mass covered, mean grid KL, disentanglement)
We walk a grid across the 2D latent and decode each cell — the classic VAE "latent atlas." Because the prior is \(\mathcal{N}(0,I)\), the centre is dense (common samples) and the corners are rare. Each tile shows a synthetic decoded shape whose two factors of variation (curvature, orientation) are driven by the two latent axes. Raise β and the axes become more independent — disentangled — at the cost of blurrier, lower-contrast outputs. The dashed contours are the prior's 1σ and 2σ rings: anything inside them is a plausible sample.
5.5 Representation learning & uses
Autoencoders matter today less as standalone generators and more as representation learners — machinery for turning raw data into compact, structured codes that everything downstream consumes.
Pretraining & transfer.
Train an encoder unsupervised on a mountain of unlabelled data, then attach a small classifier head and fine-tune on a little labelled data. The masked-autoencoding idea (mask patches, reconstruct them) is the visual analogue of masked language modelling — MAE (He et al., 2022) made it the dominant self-supervised recipe for vision transformers. Anomaly detection. Train on normal data only; flag inputs the model reconstructs poorly. High reconstruction error means "off the learned manifold" — fraud, defects, intrusions, equipment faults. The latent backbone of generative AI. The single most consequential use: a VAE compresses images into a small latent grid, and a diffusion model (Chapter 07) does its expensive denoising in that latent space instead of in pixels. This is the "VAE" inside Stable Diffusion and its descendants — latent diffusion is why a consumer GPU can generate megapixel images. The autoencoder isn't the generator; it's the compression layer that makes the generator affordable. Discrete codes for sequence models. The VQ-VAE (van den Oord et al., 2017) replaces the Gaussian latent with a learned codebook, turning images, audio, or video into sequences of discrete tokens that an autoregressive transformer can then model exactly like text — the basis of many modern image and audio generators. It is worth being honest about the trade-offs experts actually argue over. VAE samples are notoriously blurry compared to GANs (Chapter 06) and diffusion: the Gaussian decoder and the averaging implied by the ELBO smear high-frequency detail. Posterior collapse is the classic failure — when the decoder is powerful enough to ignore \(z\), the KL term drives \(q(z\mid x)\) all the way to the prior and the latent carries no information; KL-annealing, free-bits, and weaker decoders are the usual countermeasures. 
And the ELBO is a bound, not the likelihood itself: a higher ELBO does not guarantee better samples, and "good representation" and "good generation" are not the same objective. The VAE's enduring win is not photorealism — it is a well-organized, sampleable latent space, which is exactly what the rest of the generative stack needed.
NEXT The VAE buys a smooth latent at the price of blur. The next chapter takes the opposite bet: drop the explicit likelihood entirely and learn to generate by competition. Chapter 06 — GANs: a generator and a discriminator locked in a minimax game, the sharpest samples in deep learning and the hardest training dynamics to tame.
5.R References
Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313(5786) — deep autoencoders, trained layer-wise, beat PCA at nonlinear dimensionality reduction (§5.1).
Kingma, D. P. & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR 2014 — the VAE, the ELBO (EQ N5.3), and the reparameterization trick (EQ N5.5).
Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008 — the denoising autoencoder (EQ N5.2) and the manifold-projection view that seeds diffusion.
Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks 2(1) — proves the linear autoencoder optimum spans the top-k PCA subspace (§5.1).
Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017 — weighting the KL term to encourage disentangled latent factors (§5.4).
van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017 — discrete codebook latents that let autoregressive models generate over autoencoder tokens (§5.5).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — the VAE-compressed latent space that makes Stable Diffusion affordable (§5.5).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022 — masked autoencoding as a strong self-supervised pretext for vision transformers (§5.5).
## DL · Generative Adversarial Networks (https://ai-encyclopedia.com/dl/06-gans.html)
DEEP LEARNING · CHAPTER 06 / 07 Generative Adversarial Networks
Most generative models estimate how likely the data is and climb that gradient. GANs discard the likelihood entirely. Adversarial training pits a generator against a discriminator, and the two improve together; it produced the first photorealistic generators. The generator never sees a real image directly; it learns from the verdicts of an opponent that is itself learning to catch it. A learned, moving loss function is what brought faces, fonts, and textures into focus where fixed objectives had blurred.
LEVEL ADVANCED READING TIME ≈ 26 MIN BUILDS ON DL 03 · 05 INSTRUMENTS TRAINING SIM · MODE COLLAPSE · LATENT WALK
IN THIS CHAPTER 6.1 The adversarial game 6.2 The minimax objective 6.3 Instability & mode collapse 6.4 DCGAN, WGAN & Wasserstein 6.5 StyleGAN & after 6.R References
6.1 The adversarial game
A generative model wants to turn cheap noise into samples that look like real data. The autoencoders of the previous chapter did this by reconstructing inputs through a bottleneck and minimizing a pixel-wise reconstruction loss; the trouble is that pixel-wise losses reward blur — averaging two plausible faces gives a low error and a smeared ghost. GANs replace that hand-chosen loss with a second network whose only job is to tell real from fake, and they train the generator to defeat it.
The setup is two players. A generator \(G\) maps a latent vector \(z\), drawn from a fixed simple prior \(p_z\) (usually a unit Gaussian), to a sample \(G(z)\) in data space. A discriminator \(D\) takes any sample \(x\) and outputs \(D(x) \in (0,1)\), its estimated probability that \(x\) is real rather than generated.
Goodfellow's 2014 metaphor has stuck because it is exact: \(G\) is a counterfeiter printing banknotes, \(D\) is the police learning to spot forgeries, and the two improve in lockstep until the fakes are indistinguishable from currency.
[Diagram: z ~ p(z) → GENERATOR → G(z); G(z) and REAL DATA x → DISCRIMINATOR → D(x) ∈ (0,1): REAL? FAKE?]
The asymmetry of information is the whole trick. \(D\) is trained on labelled examples — it sees real data and generated data and knows which is which. \(G\) is never shown a single real example directly; its only learning signal is the gradient that flows back through \(D\) telling it which direction would have made its sample look more real. The loss function is therefore not fixed: it is \(D\) itself, and \(D\) is moving. This is the conceptual leap that separates GANs from everything before — the objective the generator climbs is learned and adversarial, sharpening exactly where the generator is currently weak.
Because there is no explicit density, a vanilla GAN cannot tell you the likelihood of a held-out image — it is an implicit generative model. You can sample from it freely but cannot score samples. That is a feature for image realism (no blur-inducing likelihood term) and a liability for evaluation, which is why the field leans on proxy metrics like the Fréchet Inception Distance instead of held-out log-likelihood.
6.2 The minimax objective
Write down what each player wants and you get a single two-player value function. The discriminator wants \(D(x)\) near 1 on real data and near 0 on fakes; the generator wants the opposite. Goodfellow et al. packaged both into one minimax game:
EQ N6.1 — THE MINIMAX GAME $$ \min_{G}\,\max_{D}\; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D(G(z))\big)\big] $$ The first term rewards \(D\) for scoring real data high; the second rewards \(D\) for scoring fakes low and rewards \(G\) (which minimizes) for pushing \(D(G(z))\) back toward 1.
It is one objective, optimized in opposite directions by the two networks. In practice you alternate: a (few) gradient ascent step(s) on \(D\), then one gradient descent step on \(G\), each on a fresh minibatch of noise and data. For a fixed generator, the inner maximization has a closed-form optimum. Treating \(V\) pointwise in \(x\), the optimal discriminator is the posterior probability that a sample is real under the two densities: EQ N6.2 — THE OPTIMAL DISCRIMINATOR $$ D^{*}_{G}(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)} $$ \(p_g\) is the (implicit) density the generator induces over data space. Where real and fake densities are equal, \(D^{*}(x) = \tfrac{1}{2}\): the detective is reduced to a coin flip. This is the fixed point of the whole game — when the generator's distribution matches the data, no discriminator can do better than chance, and the most informed possible verdict on every input is exactly 0.5. Substitute \(D^{*}_G\) back into \(V\) and the generator's objective collapses to a recognizable distance between distributions. Up to constants it becomes the Jensen–Shannon divergence: EQ N6.3 — WHAT G ACTUALLY MINIMIZES $$ \max_{D} V(D,G) \;=\; 2\,\mathrm{JSD}\!\big(p_{\text{data}} \,\|\, p_g\big) - \log 4, \qquad \mathrm{JSD}(p\|q) = \tfrac{1}{2}\mathrm{KL}\!\big(p \,\|\, m\big) + \tfrac{1}{2}\mathrm{KL}\!\big(q \,\|\, m\big),\;\; m = \tfrac{p+q}{2} $$ With the optimal \(D\) plugged in, the generator is minimizing the Jensen–Shannon divergence between the data and its own samples. The global minimum is \(\mathrm{JSD} = 0\), reached only when \(p_g = p_{\text{data}}\), giving value \(-\log 4 \approx -1.386\). The objective is principled — but JSD is also where the trouble starts (§6.3): when the two distributions barely overlap, JSD saturates to the constant \(\log 2\) and its gradient vanishes. One practical wrinkle ships in every real implementation. 
The generator term \(\log(1 - D(G(z)))\) has almost no gradient early in training, exactly when \(D\) easily rejects the generator's garbage (\(D(G(z)) \approx 0\)). So Goodfellow recommended the non-saturating reformulation: instead of minimizing \(\log(1 - D(G(z)))\), the generator maximizes \(\log D(G(z))\). Same fixed point, far stronger gradients when the generator is losing — the version everyone actually trains. At the global optimum of the GAN game the generator has matched the data distribution, \(p_g = p_{\text{data}}\). Using EQ N6.2, what value does the optimal discriminator \(D^{*}(x)\) output for every input \(x\)? Substituting \(p_g = p_{\text{data}}\) into \(D^{*}_G(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \dfrac{p_{\text{data}}(x)}{2\,p_{\text{data}}(x)} = \dfrac{1}{2} = \) 0.5. The discriminator is reduced to a coin flip on every input — it can no longer tell real from fake, which is the definition of the generator having won. When the data and generator distributions have disjoint support, the Jensen–Shannon divergence \(\mathrm{JSD}(p_{\text{data}}\|p_g)\) hits its maximum value. What is that maximum, in nats? (It is \(\ln 2\).) For disjoint \(p\) and \(q\), the mixture \(m = (p+q)/2\) equals \(p/2\) wherever \(p\) lives and \(q/2\) wherever \(q\) lives, so each KL term is \(\int p \log\frac{p}{p/2} = \log 2\), giving \(\mathrm{JSD} = \tfrac12\log 2 + \tfrac12\log 2 = \log 2 = \ln 2 \approx \) 0.693 nats. Because this is a flat ceiling, its gradient is zero — the vanishing-gradient failure of §6.3. 
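The identities in EQ N6.2 and EQ N6.3 are easy to check numerically on small discrete distributions. The sketch below is a verification aid under assumed toy inputs (the four-bin histograms are made up for illustration): it plugs the optimal discriminator into \(V\), compares the result with \(2\,\mathrm{JSD} - \log 4\), confirms the \(\log 2\) ceiling for disjoint supports, and contrasts the gradient of the saturating and non-saturating generator losses with respect to the discriminator's logit.

```python
import numpy as np

def kl(p, q):
    mask = p > 0                              # 0 * log 0 is taken as 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    # Assumes every bin has mass under at least one of the two distributions
    with np.errstate(divide="ignore", invalid="ignore"):
        d_star = p_data / (p_data + p_g)      # EQ N6.2
        t1 = np.where(p_data > 0, p_data * np.log(d_star), 0.0)
        t2 = np.where(p_g > 0, p_g * np.log(1.0 - d_star), 0.0)
    return np.sum(t1 + t2)

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

def grad_saturating(l):
    return -sigmoid(l)                        # d/dl log(1 - D), with D = sigmoid(l)

def grad_non_saturating(l):
    return 1.0 - sigmoid(l)                   # d/dl log D

p_data = np.array([0.1, 0.4, 0.4, 0.1])       # toy histograms (illustrative only)
p_g = np.array([0.4, 0.1, 0.1, 0.4])
p_dis = np.array([0.5, 0.5, 0.0, 0.0])        # disjoint pair
q_dis = np.array([0.0, 0.0, 0.5, 0.5])

print("V(D*, G)        :", value_at_optimal_d(p_data, p_g))
print("2 JSD - log 4   :", 2 * jsd(p_data, p_g) - np.log(4))
print("matched, V      :", value_at_optimal_d(p_data, p_data), " (-log 4 =", -np.log(4), ")")
print("disjoint, JSD   :", jsd(p_dis, q_dis), " (log 2 =", np.log(2), ")")
print("grad at logit -5: saturating", grad_saturating(-5.0),
      " non-saturating", grad_non_saturating(-5.0))
```

The last line is the practical wrinkle in numbers: when the discriminator confidently rejects a fake (a strongly negative logit), the saturating loss passes back a gradient close to zero, while the non-saturating loss passes back one close to one.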
PYTHON · RUNNABLE IN-BROWSER
# 1D GAN toy: a 2-parameter generator (mu, s) fits a target distribution; print D accuracy
import numpy as np
rng = np.random.default_rng(0)
def sig(x): return 1 / (1 + np.exp(-np.clip(x, -30, 30)))
real = rng.normal(2.0, 0.5, 2000)              # target: mean 2.0, std 0.5
a, b, mu, s = 1.0, 0.0, 0.0, 1.0               # D(x)=sig(a x + b); G(z)=mu + s z
lr = 0.05
for it in range(400):
    z = rng.normal(0, 1, 2000); fake = mu + s * z
    pr, pf = sig(a*real + b), sig(a*fake + b)  # D step: ascend V
    a += lr * (np.mean((1-pr)*real) - np.mean(pf*fake))
    b += lr * (np.mean(1-pr) - np.mean(pf))
    z = rng.normal(0, 1, 2000); fake = mu + s * z   # G step: non-saturating
    pf = sig(a*fake + b)
    mu += lr * np.mean((1-pf) * a)
    s += lr * np.mean((1-pf) * a * z)
print("target:  mean 2.00 std 0.50")
print(f"learned: mean {mu:5.2f} std {abs(s):4.2f}")
zf = rng.normal(0, 1, 2000); fake = mu + s*zf
# D accuracy: fraction of reals scored above 0.5 and fakes scored below 0.5
acc = 0.5*(np.mean(sig(a*real+b) > 0.5) + np.mean(sig(a*fake+b) < 0.5))
print(f"D accuracy: {acc:.2f}  (0.50 = coin flip)")   # closing print reconstructed; the export cut this line off

INSTRUMENT N6.1 — ADVERSARIAL TRAINING SIMULATOR · G & D LOSSES · EQ N6.1–N6.3 · DETERMINISTIC
A deterministic simulation of the two losses as the game runs (fixed seed, so it renders identically with zero interaction). The mint curve is the generator loss, the blue curve the discriminator loss; both oscillate around an equilibrium rather than monotonically falling — that is healthy adversarial training, not divergence. Crank the D learning rate or D-steps-per-G-step and watch the discriminator overpower the generator: D loss crashes toward 0, D(G(z)) toward 0, and G's gradient starves. Switch to SATURATING to see the early-training flat spot the non-saturating loss was invented to fix.

6.3 Instability & mode collapse
The theory of §6.2 assumes the inner maximization is solved exactly and the two distributions overlap.
Reality grants neither, and the gap is where GANs earned their reputation for being temperamental. Three failure modes dominate. Vanishing gradients. EQ N6.3 says \(G\) minimizes a JSD. But early on, \(p_g\) and \(p_{\text{data}}\) live on nearly disjoint low-dimensional manifolds inside a high-dimensional space — natural images occupy a vanishingly thin sliver of pixel space, and a fresh generator's outputs occupy a different one. Where the supports do not overlap, JSD is pinned at its maximum \(\log 2\) and is locally flat, so a near-optimal discriminator hands the generator a gradient of essentially zero. The detective becomes too good, and the forger stops learning. This is the precise sense in which a perfectly trained discriminator is bad for training. Mode collapse. The minimax objective rewards the generator for fooling \(D\) on the current batch, not for covering the whole data distribution. A generator can win by mapping every \(z\) to a single hyper-realistic output — one perfect "7" for an MNIST GAN, one face. \(D\) eventually learns to reject that point, the generator hops to another single mode, \(D\) chases, and the two play whack-a-mole forever. The pathology is that the generator's loss has no term demanding diversity; covering one mode flawlessly scores as well as covering all of them. WHY COLLAPSE IS STRUCTURAL Mode collapse is not a bug in the optimizer — it is in the objective. Compare to maximum likelihood, whose KL term \(\mathrm{KL}(p_{\text{data}}\|p_g)\) is mode-covering: it explodes if \(p_g\) assigns near-zero probability anywhere the data has mass, forcing the model to spread out (and blur). The adversarial game has no such penalty for dropping a mode entirely, so it is free to be mode-seeking — crisp where it commits, blind to what it abandons. Sharper samples and dropped modes are two faces of the same coin. Non-convergence and oscillation. 
Even with overlapping supports, simultaneous gradient descent-ascent on a minimax game is not guaranteed to converge — the dynamics can orbit the equilibrium indefinitely, like two players in rock-paper-scissors each best-responding to the other's last move. The losses you watch during GAN training oscillate by design; a generator loss that falls smoothly to zero usually means the discriminator has collapsed, not that you have won.

A fresh generator's samples and the real data occupy near-disjoint manifolds, so \(\mathrm{JSD}(p_{\text{data}}\|p_g)\) sits at its ceiling and the generator's effective loss \(2\,\mathrm{JSD} - \log 4\) is flat. What constant value (in nats) is \(\mathrm{JSD}\) stuck at, killing the gradient? Disjoint support pins \(\mathrm{JSD}\) at its maximum \(\log 2 = \ln 2 \approx \) 0.693 nats — a flat plateau whose derivative is zero. No matter how the generator nudges its output, the loss does not move, so no useful gradient flows back. This is the mathematical core of the vanishing-gradient failure, and the precise problem the Wasserstein distance (§6.4) was designed to remove.

PYTHON · RUNNABLE IN-BROWSER
# Mode collapse: a single-Gaussian generator covers only ONE mode of a mixture
import numpy as np
rng = np.random.default_rng(1)
# target: 70% mass near +3, 30% near -3 (two well-separated modes)
heavy = rng.random(3000)
[the remainder of this cell is truncated in the text export]

INSTRUMENT N6.2 — MODE-COLLAPSE DEMO · 2D MIXTURE OF 8 GAUSSIANS · GENERATOR COVERAGE
Eight real modes sit on a ring (grey rings); the generator's samples are the mint cloud. With diversity pressure off, scrub training forward and watch the generator hop from mode to mode — at any moment it parks on one or two and abandons the rest, the signature of collapse. Raise diversity pressure (a stand-in for minibatch discrimination / unrolled-GAN style fixes) and the cloud spreads to cover the full ring.
The lesson: nothing in the bare objective rewards coverage; you have to add it. 6.4 DCGAN, WGAN & the Wasserstein fix The original GAN paper used multilayer perceptrons on small images and trained precariously. Two papers turned the idea into something that worked reliably — one architectural, one about the loss. DCGAN (Radford, Metz & Chintala, 2015) is the architecture that made image GANs reproducible. Its recipe became boilerplate: replace pooling with strided convolutions (let the network learn its own up/down-sampling); use transposed convolutions in \(G\) to grow spatial resolution; apply batch normalization in both networks to stabilize activations; drop fully-connected hidden layers; use ReLU in \(G\) and LeakyReLU in \(D\). Beyond crisp 64×64 samples, DCGAN demonstrated that the learned latent space was structured — vector arithmetic on \(z\) (the famous "man with glasses − man + woman = woman with glasses") moved meaningfully in image space, the first hint that GANs learn a disentangled representation, not just a lookup table. WGAN (Arjovsky, Chintala & Bottou, 2017) attacked the loss. The diagnosis was exactly §6.3: JSD gives no usable gradient when supports are disjoint. The fix is to measure the distance between distributions with the Wasserstein (earth-mover's) distance instead, which stays smooth and informative even when the distributions do not overlap. WGAN replaces the JS divergence with the Wasserstein distance. EQ N6.4 — WASSERSTEIN-1 (EARTH MOVER'S) DISTANCE $$ W_1(p_{\text{data}}, p_g) \;=\; \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \;\mathbb{E}_{(x,y)\sim\gamma}\big[\,\lVert x - y \rVert\,\big] $$ \(\Pi\) is the set of all transport plans \(\gamma\) with the right marginals; \(W_1\) is the minimum average "dirt × distance" to reshape one pile of probability into the other. 
Unlike JSD, \(W_1\) varies continuously with the generator's parameters even when supports are disjoint — move a far-away blob closer and \(W_1\) drops smoothly, giving a gradient where JSD gave a flat plateau. The infimum over transport plans is intractable, so WGAN uses the Kantorovich–Rubinstein duality, which turns it into a maximization over 1-Lipschitz functions \(f\). The discriminator is repurposed as this \(f\) — now called a critic, because it outputs an unbounded real score, not a probability: EQ N6.5 — KANTOROVICH–RUBINSTEIN DUAL (THE CRITIC OBJECTIVE) $$ W_1(p_{\text{data}}, p_g) \;=\; \sup_{\lVert f \rVert_{L} \le 1}\; \mathbb{E}_{x \sim p_{\text{data}}}[\,f(x)\,] \;-\; \mathbb{E}_{z \sim p_z}\big[\,f(G(z))\,\big] $$ The critic \(f\) maximizes the gap between its average score on real and on fake; the generator minimizes it. The constraint \(\lVert f\rVert_L \le 1\) (1-Lipschitz: \(f\) cannot change faster than its input) is what makes the dual equal \(W_1\). The original WGAN enforced it crudely by weight clipping; WGAN-GP replaced that with a gradient penalty pushing \(\lVert \nabla f \rVert\) toward 1 — far more stable, and the version in wide use. The payoff is practical. Because EQ N6.5 estimates a genuine distance, the critic's value correlates with sample quality — for the first time a GAN's loss curve meant something you could read. WGAN tolerates a strong critic (train it to near-optimality between generator steps, the opposite of the vanilla advice), is far less prone to mode collapse, and removed much of the black-magic hyperparameter fiddling. It did not make GANs trivial, but it made them debuggable. 
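To make the contrast with JSD concrete, the sketch below estimates \(W_1\) between two 1D sample sets as one is slid away from the other. It is an illustration under stated assumptions, not the WGAN training procedure: it uses the standard fact that for two equal-sized 1D samples the optimal transport plan pairs the sorted values, and the helper name `w1_1d` and the Gaussian toy data are invented for this example. It also evaluates the dual of EQ N6.5 with one fixed 1-Lipschitz critic, \(f(x) = -x\), which by construction can only lower-bound \(W_1\).

```python
import numpy as np

def w1_1d(x, y):
    """Empirical Wasserstein-1 between equal-sized 1D samples (sorted pairing)."""
    assert x.shape == y.shape
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def critic(x):
    """A fixed 1-Lipschitz critic, f(x) = -x (a hand-picked assumption, not a trained network)."""
    return -x

rng = np.random.default_rng(0)
n = 5000
real = rng.normal(0.0, 0.5, n)        # toy "data" samples
base_fake = rng.normal(0.0, 0.5, n)   # toy "generator" samples before shifting

shifts = [0.0, 1.0, 2.0, 4.0, 8.0]
w1_by_shift, dual_by_shift = {}, {}
for shift in shifts:
    fake = base_fake + shift
    w1_by_shift[shift] = w1_1d(real, fake)
    # EQ N6.5 with the fixed critic: E[f(real)] - E[f(fake)]
    dual_by_shift[shift] = float(np.mean(critic(real)) - np.mean(critic(fake)))
    print(f"shift {shift:3.1f}   W1 = {w1_by_shift[shift]:.3f}   critic gap = {dual_by_shift[shift]:.3f}")
```

Unlike JSD, the estimate keeps growing roughly in proportion to the shift, so a generator that moves its samples closer is always rewarded. For a pure translation the hand-picked critic already attains the distance once the two clouds are separated; in general the critic has to be learned.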
| Variant | Distribution distance | Output network | Headline contribution |
| --- | --- | --- | --- |
| Vanilla GAN | Jensen–Shannon | discriminator → (0,1) | The adversarial game itself (2014) |
| DCGAN | Jensen–Shannon | conv discriminator | Stable conv architecture; structured latent space |
| WGAN | Wasserstein-1 | critic → ℝ (clipped) | Meaningful loss; far less collapse |
| WGAN-GP | Wasserstein-1 | critic + gradient penalty | Lipschitz via penalty, not clipping |

True or false: WGAN replaces the Jensen–Shannon divergence of the original GAN objective with the Wasserstein (earth-mover's) distance, precisely because the latter gives useful gradients even when the real and generated distributions do not overlap. (Answer true or false.) This is exactly WGAN's thesis. Vanilla GANs minimize JSD (EQ N6.3), which is flat at \(\log 2\) for disjoint supports and hands the generator no gradient. \(W_1\) (EQ N6.4) instead varies continuously with how far apart the distributions are, so moving generated mass toward real mass always lowers the loss. The discriminator becomes a 1-Lipschitz critic (EQ N6.5). The statement is true.

INSTRUMENT N6.3 — LATENT-INTERPOLATION VISUALIZER · WALK z FROM A → B IN LATENT SPACE · DCGAN-STYLE INTERPOLATION
Two latent codes \(z_A, z_B\) decode (via a small fixed toy generator) to two distinct procedural "textures"; slide \(t\) to walk between them. A smooth, gradual morph with no jumps is the signature of a well-trained generator — the latent space is continuous, so nearby codes give nearby images. Switch LINEAR → SLERP: straight-line interpolation in a high-dimensional Gaussian latent cuts inside the typical-radius shell toward the origin, a region the generator rarely saw during training (the \(|z|\) readout sags at \(t=0.5\)), giving washed-out midpoints, while spherical interpolation keeps \(|z|\) on the typical-radius shell and the morph stays crisp, which is why practitioners slerp.

6.5 StyleGAN & where GANs went
By 2018 GANs could generate small images reliably; the open question was control and resolution.
Progressive growing (Karras et al., 2017) trained GANs by adding resolution layers one at a time, reaching 1024×1024. StyleGAN (Karras, Laine & Aila, 2019) then redesigned the generator itself and produced the photorealistic faces — thispersondoesnotexist.com — that put GANs in the popular imagination. StyleGAN's central move was to stop feeding the latent code in at the bottom. Instead a learned mapping network turns \(z\) into an intermediate latent \(w\), and \(w\) controls the image by modulating the statistics of feature maps at every resolution via adaptive instance normalization. Coarse layers set pose and face shape; middle layers set features; fine layers set color and micro-texture. Injecting a different \(w\) at different layers ("style mixing") cleanly transplants, say, hair color without touching pose — disentanglement by construction. Per-pixel noise inputs supply the stochastic detail (freckles, stray hairs) that the structured \(w\) need not encode. EQ N6.6 — STYLE MODULATION (AdaIN) $$ \mathrm{AdaIN}(x_i, w) \;=\; y_{s,i}(w)\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} \;+\; y_{b,i}(w) $$ Each feature map \(x_i\) is normalized to zero mean and unit variance, then re-scaled and re-shifted by a per-channel style \((y_{s,i}, y_{b,i})\) computed from \(w\). The style controls the image purely through these scale/shift statistics — applied independently at each resolution, which is what separates coarse structure from fine texture. StyleGAN2 later replaced AdaIN with weight demodulation to remove the characteristic "droplet" artifacts, and StyleGAN3 fixed aliasing so features stick to surfaces under motion. Where GANs stand in 2026 is an honest mixed picture. For unconditional and class-conditional image synthesis, diffusion models largely displaced GANs after 2021: they are far easier to train (a stable denoising regression, no adversary), cover modes better, and scale to text-to-image systems where GANs never caught up. 
The 2021 "diffusion beats GANs on image synthesis" result marked the turn. GANs did not vanish, though. Their one decisive advantage is speed: a GAN generates in a single forward pass, while diffusion needs many denoising steps — so GANs and GAN-style adversarial losses survive wherever latency matters: real-time super-resolution, image-to-image translation, neural vocoders for speech, and as the distillation target that compresses slow diffusion models into one-step generators. Adversarial training is now a component in a larger toolbox rather than the whole story. CONTESTED "GANs are obsolete" is too strong. The claim holds for large-scale text-to-image, where diffusion (and autoregressive token models) clearly won on quality and trainability. It does not hold for latency-bound generation, and the line is blurring: state-of-the-art few-step diffusion distillation often adds an adversarial loss to keep one-step samples sharp. The adversarial idea outlived the pure-GAN architecture. Treat anyone who says GANs are simply dead, or simply fine, with equal suspicion. NEXT Every model in this volume — autoencoder, GAN, the deep classifier — is only as good as the optimization that fits it. Chapter 07 leaves architectures behind for the craft of training deep nets: initialization, normalization, the vanishing/exploding-gradient problem these adversarial games quietly battle, learning-rate schedules, and the regularization that decides whether a network that can fit the data actually generalizes. 6.R References Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS 2014 — the original adversarial game, the optimal discriminator (EQ N6.2), and the JSD reduction (EQ N6.3). Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 
ICLR 2016 — DCGAN: the stable convolutional architecture and latent-space vector arithmetic. Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML 2017 — replacing JSD with the Wasserstein distance (EQ N6.4–N6.5) and the critic. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. (2017). Improved Training of Wasserstein GANs. NeurIPS 2017 — WGAN-GP: the gradient penalty that replaced weight clipping. Karras, T., Laine, S. & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR 2019 — StyleGAN: the mapping network, AdaIN style modulation (EQ N6.6), and style mixing. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. & Chen, X. (2016). Improved Techniques for Training GANs. NeurIPS 2016 — minibatch discrimination and feature matching, the classic anti-collapse fixes behind Instrument N6.2. Dhariwal, P. & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. NeurIPS 2021 — the result marking diffusion's displacement of GANs for large-scale image generation (§6.5).
## DL · Training Deep Networks in Practice (https://ai-encyclopedia.com/dl/07-training-deep-nets.html)
DEEP LEARNING · CHAPTER 07 / 07 Training Deep Networks in Practice A network's architecture decides what it can represent; training decides whether it gets there. The optimizer, the learning-rate schedule, and the numerics determine whether a model actually converges. This chapter covers that side of the work: from plain SGD to AdamW, from warmup-and-cosine schedules to mixed precision and loss scaling, ending in a recipe and a diagnostic loop for reading a loss curve. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON DL 01–02 · backprop & SGD INSTRUMENTS OPTIMIZER RACE · LR DESIGNER · LOSS DIAGNOSER IN THIS CHAPTER 7.1 Optimizers 7.2 Learning-rate schedules 7.3 Regularization & early stopping 7.4 Mixed precision & numerics 7.5 A recipe & debugging 7.R References 7.1 Optimizers — SGD, momentum, Adam, AdamW Every optimizer answers one question: given the gradient \(g_t = \nabla_\theta \mathcal{L}\) at the current parameters, how far and in what direction do we step? The answers form a short, important lineage. Stochastic gradient descent is the bare minimum — step downhill by a fixed multiple of the gradient on a mini-batch: EQ N7.1 — SGD UPDATE $$ \theta_{t+1} \;=\; \theta_t - \eta\, g_t, \qquad g_t = \nabla_\theta\, \mathcal{L}\!\left(\theta_t;\, \mathcal{B}_t\right) $$ \(\eta\) is the learning rate; \(\mathcal{B}_t\) a random mini-batch. The mini-batch makes \(g_t\) a noisy estimate of the true gradient — the "stochastic" in SGD — and that noise is not purely a nuisance: it helps the optimizer escape sharp, brittle minima. SGD's flaw is that one scalar \(\eta\) must serve every parameter and every direction of curvature, so it crawls along flat directions and oscillates across steep ones.
The first fix is momentum: accumulate an exponentially-decaying running average of past gradients (a velocity \(v_t\)) and step along that instead. Consistent directions reinforce; oscillating ones cancel. EQ N7.2 — SGD WITH MOMENTUM $$ v_{t} = \mu\, v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t}, \qquad 0 \le \mu < 1 $$ \(\mu\) (typically \(0.9\)) is the momentum coefficient. For a steady gradient \(g\), the velocity converges to a geometric series, \(v_\infty = g/(1-\mu)\), so the effective step grows by \(1/(1-\mu)\). At \(\mu = 0.9\) that is a 10× amplification along persistent directions — the source of momentum's speed, and the reason it can overshoot. Nesterov's variant evaluates the gradient at the look-ahead point \(\theta_t - \eta\mu v_{t-1}\) for a slightly better-anticipated correction. SGD with momentum \(\mu = 0.9\) is fed the same gradient \(g\) every step. At steady state, by what factor is the effective step size larger than plain SGD's \(\eta g\)? (Use \(v_\infty = g/(1-\mu)\).) The steady-state velocity is \(v_\infty = \dfrac{g}{1-\mu} = \dfrac{g}{1-0.9} = \dfrac{g}{0.1} = 10\,g\). The effective step is \(\eta v_\infty = 10\,\eta g\), so the amplification factor is 10. This is exactly why a momentum run often needs a smaller \(\eta\) than a plain-SGD run at the same stability. The second fix is adaptivity: give each parameter its own effective learning rate, scaled down where gradients have been large. Adam combines this with momentum. 
It maintains a first moment \(m_t\) (the momentum-like mean of gradients) and a second moment \(v_t\) (a mean of squared gradients), bias-corrects both, and divides the step by the root of the second moment: EQ N7.3 — ADAM $$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 $$ $$ \hat m_t = \frac{m_t}{1-\beta_1^{\,t}}, \quad \hat v_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} $$ Defaults: \(\beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}\). The bias correction matters most at the start: with \(m_0 = v_0 = 0\), the raw \(m_t\) and \(v_t\) are both biased toward zero, and dividing by \(1 - \beta_1^{\,t}\) and \(1 - \beta_2^{\,t}\) undoes it. The \(\hat m_t / \sqrt{\hat v_t}\) ratio makes each coordinate's step roughly scale-invariant — large, noisy gradients are damped, tiny consistent ones are amplified — which is why Adam "just works" across wildly different layers and is the default for transformers and most modern deep nets.

Adam with \(\beta_1 = 0.9\), starting from \(m_0 = 0\), takes one step with gradient \(g = 1\). What is the bias-corrected first moment \(\hat m_1\)? (Compute \(m_1 = (1-\beta_1)g\), then \(\hat m_1 = m_1 / (1 - \beta_1^{\,1})\).) \(m_1 = (1 - 0.9)\times 1 = 0.1\). Bias correction: \(\hat m_1 = \dfrac{m_1}{1 - 0.9^{1}} = \dfrac{0.1}{0.1} = \) 1. The correction exactly cancels the cold-start shrinkage, so the very first effective gradient estimate equals \(g\). The second moment needs the same treatment even more: with \(\beta_2 = 0.999\), the raw \(v_1 = 0.001\,g^2\) is shrunk far more strongly than \(m_1\), so the uncorrected ratio \(m_1/\sqrt{v_1} = 0.1/\sqrt{0.001} \approx 3.16\) would make the first step about three times larger than the intended \(\eta\).

AdamW is the variant you should actually reach for. The issue it fixes is subtle: classical weight decay was implemented as an L2 penalty added to the loss, so its gradient \(\lambda\theta\) flows through Adam's adaptive denominator and gets rescaled per-parameter — coupling the regularization strength to each coordinate's gradient history.
Loshchilov & Hutter showed that decoupling the decay — applying it directly to the weights, outside the adaptive step — restores the intended behavior and typically generalizes better: EQ N7.4 — ADAMW: DECOUPLED WEIGHT DECAY $$ \theta_{t+1} = \theta_t - \eta\left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \;+\; \lambda\, \theta_t \right) $$ The decay term \(\lambda\theta_t\) is added after the adaptive rescaling, so it shrinks every weight by the same relative amount \(\eta\lambda\) each step — true weight decay, not an adaptive-gradient L2 term. This decoupling is now the default in essentially every transformer training recipe (typical \(\lambda \approx 0.01\!-\!0.1\)). Bias and normalization-scale parameters are conventionally excluded from decay.

| Optimizer | State per parameter | Strength | Weakness |
| --- | --- | --- | --- |
| SGD | none | Cheapest; flat minima; strong final accuracy on vision with a good schedule | Slow on ill-conditioned loss surfaces; very LR-sensitive |
| SGD + momentum | 1 (velocity) | Accelerates persistent directions, damps oscillation; the CNN workhorse | Can overshoot; still one global \(\eta\) |
| Adam | 2 (\(m\), \(v\)) | Per-parameter adaptive; robust across layer types; fast early progress | 2× optimizer memory; L2 decay misbehaves |
| AdamW | 2 (\(m\), \(v\)) | Adam with correct weight decay; default for transformers | Same memory cost; still needs a schedule |

Adam's two extra moments cost real memory: at fp32 they add 8 bytes per parameter on top of the 4-byte weight and 4-byte gradient — the "16 bytes/param" rule that sizes training clusters (and the reason 8-bit optimizers and ZeRO sharding exist). The contested point worth flagging: on some vision benchmarks well-tuned SGD+momentum still generalizes slightly better than Adam, so "Adam always wins" is folklore, not law — it wins on convenience and on transformers, where SGD struggles.

PYTHON · RUNNABLE IN-BROWSER
# SGD vs momentum vs Adam on an ill-conditioned 2D quadratic
# Loss = 0.5*(a*x^2 + b*y^2); steep in x (a=20), flat in y (b=1).
import numpy as np
a, b = 20.0, 1.0
def grad(p): return np.array([a*p[0], b*p[1]])   # gradient of the quadratic
def loss(p): return 0.5*(a*p[0]**2 + b*p[1]**2)
def run(kind, lr, steps=300):
    p = np.array([1.0, 1.0]); m = np.zeros(2); v = np.zeros(2)
    for t in range(1, steps+1):
        g = grad(p)
        if kind == "sgd":
            p = p - lr*g
        elif kind == "mom":
            m = 0.9*m + g; p = p - lr*m
        else:   # adam
            m = 0.9*m + 0.1*g; v = 0.999*v + 0.001*g*g
            mh = m/(1-0.9**t); vh = v/(1-0.999**t)
            p = p - lr*mh/(np.sqrt(vh)+1e-8)
    return loss(p)
# Each optimizer gets its own near-best stable lr (the fair way to compare them)
for kind, lr in [("sgd", 0.04), ("mom", 0.02), ("adam", 0.20)]:
    print(f"{kind:5s} (lr={lr:.2f}) final loss after 300 steps: {run(kind, lr):.2e}")
print("\nAdam reaches the lowest loss: it scales x and y independently, so the")
print("steep x-direction and flat y-direction converge at the same rate -- the")
print("single global step size that hobbles SGD on this surface is gone.")

INSTRUMENT N7.1 — OPTIMIZER RACE · SGD vs MOMENTUM vs ADAM ON A LOSS SURFACE · EQ N7.1–N7.3
Elliptical contours of an ill-conditioned quadratic — the canonical hard case. Three trajectories race from the same start: SGD zig-zags across the steep axis, momentum rolls through it faster, and Adam rescales each axis and heads almost straight for the minimum. Crank the condition number up and watch SGD stall while Adam barely notices; push the learning rate too high and momentum overshoots into divergence first.

7.2 Learning-rate schedules — warmup, cosine, cyclical
The single learning rate \(\eta\) is the most consequential hyperparameter in deep learning, and the best value is not constant over a run.
Two facts shape the schedule: early on, weights are random and gradients are large and chaotic, so a big step can blow up; late on, you want small steps to settle into a minimum. The modern default answers both with a warmup followed by a cosine decay. EQ N7.5 — WARMUP + COSINE SCHEDULE $$ \eta(t) = \begin{cases} \eta_{\max}\,\dfrac{t}{T_w} & t \le T_w \quad\text{(linear warmup)} \\[1.2em] \eta_{\min} + \tfrac{1}{2}\!\left(\eta_{\max} - \eta_{\min}\right)\!\left(1 + \cos\!\dfrac{\pi\,(t - T_w)}{T - T_w}\right) & t > T_w \quad\text{(cosine decay)} \end{cases} $$ \(T_w\) is the warmup length (commonly 1–5% of total steps \(T\)); \(\eta_{\min}\) is often \(0\) or a small floor. Warmup ramps the rate linearly from \(0\) to \(\eta_{\max}\), giving the optimizer's adaptive statistics (and a transformer's fragile early layers) time to stabilize before full-size steps. Cosine decay then eases the rate down a half-cosine: gentle at first, steepest in the middle, flattening to \(\eta_{\min}\) at the end. At \(t = T_w\), \(\cos 0 = 1 \Rightarrow \eta = \eta_{\max}\); at \(t = T\), \(\cos \pi = -1 \Rightarrow \eta = \eta_{\min}\) — the curve joins the two phases continuously. Why a cosine rather than a straight line or exponential? Empirically the cosine's slow start (it lingers near \(\eta_{\max}\)) buys more exploration before annealing, and its slow finish lets the model fine-settle — and it consistently beats step decay on large language and vision models. The cyclical / warm-restart family (SGDR) takes the idea further, resetting the schedule periodically so the rate jumps back up; each restart can knock the model out of a mediocre basin into a better one, and the snapshots make a cheap ensemble. The contested part: with a good cosine, restarts rarely help large single-run pretraining, so they have fallen out of fashion for frontier models while remaining useful for smaller budgets. 
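The warm-restart idea can be sketched in a few lines. This is a simplified illustration, not the exact published SGDR schedule: it assumes fixed-length cycles with no warmup (the published schedule also lets each cycle be longer than the last), and the function name `sgdr_lr` and its parameters are made up for this example.

```python
import math

def sgdr_lr(step, period, eta_max, eta_min=0.0):
    """Cosine decay that restarts to eta_max every `period` steps (fixed-length cycles)."""
    t_cur = step % period                     # position inside the current cycle
    cosine = 1.0 + math.cos(math.pi * t_cur / period)
    return eta_min + 0.5 * (eta_max - eta_min) * cosine

period, eta_max = 100, 1e-3
for step in [0, 50, 99, 100, 150, 199, 200]:
    print(f"step {step:3d}: lr = {sgdr_lr(step, period, eta_max):.2e}")
```

The rate falls along a half-cosine within each cycle and jumps back to `eta_max` at steps 100 and 200; saving a checkpoint just before each jump yields the snapshots mentioned above.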
True or false: after warmup, a cosine schedule decays the learning rate along a cosine curve — from \(\eta_{\max}\) down to \(\eta_{\min}\) as \(t\) goes from \(T_w\) to \(T\). (Answer true or false.) By EQ N7.5, the decay phase is \(\eta_{\min} + \tfrac12(\eta_{\max}-\eta_{\min})(1+\cos\frac{\pi(t-T_w)}{T-T_w})\) — exactly a half-period of a cosine, starting at \(\eta_{\max}\) (where \(\cos 0 = 1\)) and ending at \(\eta_{\min}\) (where \(\cos\pi = -1\)). The statement is true.

You train for \(T = 10{,}000\) steps and set warmup to 3% of the run. How many steps \(T_w\) does the linear warmup phase last? \(T_w = 0.03 \times 10{,}000 = \) 300 steps. Over those 300 steps the rate climbs linearly from \(0\) to \(\eta_{\max}\); the remaining 9,700 steps follow the cosine decay down toward \(\eta_{\min}\).

PYTHON · RUNNABLE IN-BROWSER
# Warmup + cosine learning-rate schedule (EQ N7.5): build and inspect it
import numpy as np
T, Tw = 1000, 50                 # total steps, warmup steps (5%)
eta_max, eta_min = 1e-3, 0.0
def lr_at(t):
    if t < Tw:                   # linear warmup
        return eta_max * t / Tw
    prog = (t - Tw) / (T - Tw)   # 0..1 through the decay
    return eta_min + 0.5*(eta_max - eta_min)*(1 + np.cos(np.pi*prog))
ts = np.arange(T)
eta = np.array([lr_at(t) for t in ts])
print("step 0:", f"{lr_at(0):.2e} (warmup starts at 0)")
print("step 50:", f"{lr_at(50):.2e} (peak = eta_max at end of warmup)")
print("step 525:", f"{lr_at(525):.2e} (~midpoint of decay, steepest part)")
print("step 999:", f"{lr_at(999):.2e} (decayed to eta_min)")
print(f"\npeak step is {ts[eta.argmax()]} -> rate peaks exactly at warmup's end")
plot_xy(ts, eta)   # the classic ramp-then-cosine shape (plot_xy is not defined in this cell; it comes from the in-browser runner)

INSTRUMENT N7.2 — LR-SCHEDULE DESIGNER · WARMUP + COSINE · EQ N7.5
The full schedule over a 10,000-step run: a linear warmup ramp into a cosine descent.
Drag warmup to 0% and the curve starts at full rate — fine for a fine-tune, often unstable for from-scratch transformer pretraining. Raise the min-rate floor and the decay flattens above zero, which keeps the model learning if you plan to train longer than \(T\). The peak always lands exactly at the end of warmup.

7.3 Regularization & early stopping
A network with millions of parameters can memorize its training set outright. Regularization is the set of pressures that push it to generalize instead — to fit the signal, not the noise. The deep-learning toolkit is small and well-understood.

Weight decay (the \(\lambda\theta\) term of EQ N7.4). Shrinks weights toward zero each step, favoring simpler, smaller-norm solutions. Use the decoupled form via AdamW; exclude biases and norm scales.

Dropout. During training, zero each activation independently with probability \(p\) and rescale the survivors by \(1/(1-p)\) (so the expected activation is unchanged). This prevents co-adaptation — no neuron can rely on any specific other — and approximates training an ensemble of subnetworks. At inference, dropout is off. Transformers use light dropout (\(p \approx 0.0\!-\!0.1\)); large-data pretraining often sets it to zero.

Data augmentation. The cheapest regularizer: expand the effective dataset with label-preserving transforms (crops and flips for vision, token masking for text) or with label-mixing schemes such as mixup/cutmix. More data usually beats every other trick.

Label smoothing. Replace one-hot targets with \(1-\varepsilon+\varepsilon/K\) on the true class and \(\varepsilon/K\) on each of the other \(K-1\) classes (so the target still sums to 1), discouraging the model from becoming over-confident and improving calibration.

Early stopping. Track a held-out validation loss; keep the checkpoint at its minimum and stop once it has stopped improving for a patience window. It is regularization by when you quit.
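The early-stopping rule in the last item is simple enough to sketch directly. This is a minimal illustration with made-up names (`early_stop`, `patience`, `min_delta`) and a made-up validation curve, not any particular framework's callback: it remembers the epoch with the lowest validation loss and stops once `patience` epochs pass without an improvement.

```python
def early_stop(val_losses, patience, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: this is the checkpoint to keep
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1   # ran out of epochs without triggering

# Hypothetical validation curve: bottoms out at epoch 3, then rises (overfitting)
val_curve = [1.00, 0.80, 0.70, 0.65, 0.66, 0.68, 0.71, 0.75, 0.80]
best_epoch, stop_epoch = early_stop(val_curve, patience=3)
print(f"keep checkpoint from epoch {best_epoch}, stop training at epoch {stop_epoch}")
```

A larger `patience` tolerates noisier validation curves at the cost of extra epochs; `min_delta` ignores improvements too small to matter.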
EQ N7.6 — DROPOUT (TRAIN-TIME, INVERTED) $$ \tilde a_i = \frac{r_i}{1-p}\, a_i, \qquad r_i \sim \mathrm{Bernoulli}(1-p), \qquad \mathbb{E}[\tilde a_i] = a_i $$ Each activation \(a_i\) survives with probability \(1-p\) and is scaled up by \(1/(1-p)\). The expectation \(\mathbb{E}[\tilde a_i] = (1-p)\cdot \frac{a_i}{1-p} = a_i\) is preserved, so inference needs no rescaling — you simply disable dropout. The randomness forces redundant, robust representations; the rescaling keeps the forward pass's scale honest between train and test.

The signature of overfitting is a validation loss that bottoms out and then rises while the training loss keeps falling — the model is now learning the training set's idiosyncrasies. Underfitting is the opposite: both losses sit high and flat because the model lacks the capacity, the right features, or enough training. Early stopping catches the first; more capacity, better features, or longer training fixes the second.

A dropout layer scales the surviving activations by \(1/(1-p) = 1.25\) at train time (EQ N7.6). What is the keep probability \(1-p\)? The scale factor is \(1/(1-p) = 1.25\), so the keep probability is \(1-p = 1/1.25 = \) 0.8. That means \(p = 0.2\): on average one activation in five is dropped each step, and the rest are boosted by 25% to keep the expected signal unchanged.

INSTRUMENT N7.3 — LOSS-CURVE DIAGNOSER · TRAIN vs VALIDATION · HEALTHY / UNDERFIT / OVERFIT / LR-TOO-HIGH
Each button paints the canonical shape of a real training pathology — train loss and validation loss over epochs, with an early-stopping marker where validation bottoms out. OVERFIT: train keeps dropping, val turns back up (the classic divergence). UNDERFIT: both stay high and flat. LR TOO HIGH: loss spikes and oscillates, often blowing up. Learn the silhouettes here and you will diagnose a run from across the room.
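The expectation claim in EQ N7.6 can be checked with a quick Monte Carlo run. The sketch below is illustrative only: the helper name `inverted_dropout`, the sample size, and the synthetic activations are assumptions for this example, not a framework API.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    """Zero each entry with probability p and scale the survivors by 1/(1-p) (EQ N7.6)."""
    keep = rng.random(a.shape) >= p          # True with probability 1 - p
    return a * keep / (1.0 - p), keep

rng = np.random.default_rng(0)
a = rng.normal(1.0, 0.3, size=200_000)       # synthetic activations
p = 0.2                                      # drop probability from the worked question
dropped, keep = inverted_dropout(a, p, rng)

print(f"keep fraction:        {keep.mean():.4f}  (target {1 - p:.2f})")
print(f"mean before dropout:  {a.mean():.4f}")
print(f"mean after dropout:   {dropped.mean():.4f}")
```

Individual activations change a great deal (a fifth become zero, the rest grow by 25%), but the average is preserved up to sampling noise, which is why the layer can simply be switched off at inference.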
7.4 Mixed precision & numerical stability

Modern GPUs run dramatically faster in 16-bit than in 32-bit, and 16-bit tensors halve memory. Mixed-precision training captures both wins while keeping fp32 where precision is non-negotiable. The catch is dynamic range: the older float16 format has only 5 exponent bits, so its largest representable value is \(65{,}504\) and gradients smaller than about \(6\times 10^{-8}\) underflow to zero. The fix is loss scaling.

EQ N7.7 — LOSS SCALING
$$ \mathcal{L}_{\text{scaled}} = S \cdot \mathcal{L} \;\Rightarrow\; g_{\text{scaled}} = S \cdot g, \qquad g \;=\; \frac{1}{S}\, g_{\text{scaled}} \;\text{(unscale before the optimizer step)} $$
Multiply the loss by a large factor \(S\) (e.g. \(2^{15}\)) before backprop. By the chain rule every gradient is multiplied by the same \(S\), lifting tiny values out of fp16's underflow region. The gradients are then divided by \(S\) before the weight update, so the math is unchanged — only the representable range was borrowed. Dynamic loss scaling automates \(S\): raise it while gradients stay finite, and halve it (skipping that step) whenever an inf / NaN appears.

Three practices keep mixed precision numerically safe:

Keep an fp32 master copy of the weights. Updates are tiny relative to the weights; adding a small fp16 step to an fp16 weight rounds to nothing. The optimizer updates the fp32 master, then casts to fp16 for the next forward pass.
Run reductions in fp32. Softmax, layer-norm statistics, and loss accumulation sum many terms; do them in fp32 to limit accumulated rounding error and overflow, even when the matmuls run in 16-bit.
Prefer bfloat16 when the hardware has it. bf16 keeps fp32's 8 exponent bits (same ~\(10^{38}\) range) at the cost of mantissa precision, so it almost never overflows and usually needs no loss scaling — the reason it is the default for large-model training on recent accelerators. fp8 pushes further still and is now used for the heaviest matmuls, with per-tensor scaling.
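The first practice, the fp32 master copy, can be seen with nothing but the standard library, since Python's `struct` module can round a float to half precision (format `"e"`) or single precision (format `"f"`). The weight value, step size, and step count below are assumptions picked to make the effect visible; this is a sketch of the rounding behaviour, not an optimizer.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


step = 1e-4       # hypothetical per-step update, tiny relative to the weight
n_steps = 1000

w_fp16 = to_fp16(1.0)     # weight stored and updated directly in fp16
w_master = to_fp32(1.0)   # fp32 master copy of the same weight

for _ in range(n_steps):
    # fp16 values near 1.0 are spaced about 1e-3 apart, so adding 1e-4 rounds back to 1.0
    w_fp16 = to_fp16(w_fp16 + to_fp16(step))
    # the fp32 master has enough mantissa bits to accumulate the same update
    w_master = to_fp32(w_master + step)

print("fp16-only weight after updates:", w_fp16)
print("fp32 master after updates:     ", w_master)
print("master cast to fp16 for forward:", to_fp16(w_master))
```

The fp16-only weight never moves, while the master copy drifts by roughly `n_steps * step` and only then is cast down for the forward pass.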
THE NUMERICS THAT BITE
Most "my loss went to NaN" failures are numeric, not algorithmic. The usual suspects: fp16 gradient overflow (use dynamic loss scaling, which backs the scale off when an inf appears, or switch to bf16); a learning rate high enough to send weights to inf in a few steps; \(\log(0)\) or \(0/0\) in a hand-written loss (add an \(\epsilon\), use the log-sum-exp trick); and un-clipped gradients on a spiky batch. Gradient clipping — rescale the gradient so \(\lVert g\rVert \le c\) (typically \(c = 1.0\)) — is cheap insurance against the last one and is standard in transformer recipes.

The float16 format's largest representable finite value, the overflow ceiling that limits how large the loss scale \(S\) can be, is which number? (It is \((2 - 2^{-10})\times 2^{15}\).) \((2 - 2^{-10})\times 2^{15} = (2 - 0.0009765625)\times 32768 = 1.9990234375 \times 32768 = \) 65504. Any gradient (or activation) above this overflows to inf in fp16, which is why dynamic loss scaling backs off \(S\) when it sees an inf, and why bf16's fp32-sized exponent sidesteps the problem.

PYTHON

    # Why loss scaling exists: fp16 underflow, and how scaling rescues gradients
    import numpy as np

    FP16_MAX = 65504.0  # largest finite fp16; above this -> inf (overflow)

    # A batch of tiny gradients, the kind deep nets produce late in training.
    # fp16's smallest positive value is ~6e-8, so anything well below that vanishes.
    g = np.array([1e-3, 2e-5, 5e-7, 4e-8, 9e-9])

    # Cast to fp16 with NO scaling -> the smallest entries flush to zero (underflow)
    g_fp16 = g.astype(np.float16)
    lost = int(np.sum((g != 0) & (g_fp16 == 0)))
    print("raw gradients:", g)
    print("naive fp16:", g_fp16.astype(np.float32))
    print(f"-> {lost} of {g.size} gradients underflowed to exactly 0\n")

    # Loss scaling: multiply by S before fp16, divide back after (EQ N7.7)
    S = 2**15
    scaled = g * S
    g_scaled = scaled.astype(np.float16).astype(np.float32) / S
    recovered = int(np.sum((g_fp16 == 0) & (g_scaled != 0)))
    overflow = bool(np.any(np.abs(scaled) > FP16_MAX))
    print(f"with loss scale S={S}:", g_scaled)
    print(f"-> {recovered} previously-lost gradient(s) recovered; overflow? {overflow}")
    print("\nScaling lifts tiny gradients above fp16's underflow floor, then")
    print("unscales them after backprop -- same math, full dynamic range recovered.")

7.5 A practical recipe & debugging

Theory converges; in practice the failures are mundane and repetitive. Here is a default that survives contact with reality for most supervised deep-learning tasks, followed by the debugging loop that finds the bug when it does not.
    # Defaults that work for most from-scratch deep-net training
    optimizer:     AdamW · β1=0.9 · β2=0.999 (0.95 for big transformers) · ε=1e-8
    weight_decay:  0.1 on weights · 0.0 on biases & norm/scale params
    lr:            tune η_max first (it dominates); 3e-4 is a sane transformer start
    schedule:      linear warmup 1–5% of steps → cosine decay to ~0
    batch:         as large as memory allows; raise lr with batch (lin/sqrt rule)
    precision:     bf16 if available (no loss scaling); else fp16 + dynamic scaling
    grad_clip:     global-norm clip at 1.0 — cheap insurance against spikes
    regularize:    dropout 0.0–0.1 · augmentation · early-stop on val loss
    init:          scaled init (He/Xavier or per-arch); verify activations don't explode

When a run misbehaves, work the ladder from cheapest check to most expensive — most bugs are caught in the first three rungs:

Overfit one batch. Before anything else, train on a single mini-batch until the loss hits (near) zero. If it cannot, the bug is in the model, the loss, or the data pipeline — not the hyperparameters. This one test catches a remarkable fraction of failures.
Sanity-check the initial loss. For \(K\)-class classification with random weights, cross-entropy should start near \(\ln K\). If it starts far off, your labels, logits, or loss are wired wrong.
Read the loss curve (Instrument N7.3). NaN/spike → lower LR, clip gradients, check for fp16 overflow. Flat-and-high → underfit: more capacity/LR/steps. Val turns up → overfit: regularize or early-stop.
Do an LR sweep. The learning rate dominates every other knob. Sweep it over a few orders of magnitude (or use an LR-range test) before touching architecture.
Watch gradient and activation norms. Exploding norms → clip, lower LR, check init/normalization. Vanishing norms → check residual connections, normalization placement, and activation functions.

A 10-class classifier with random initial weights predicts roughly uniform probabilities.
What initial cross-entropy loss should you expect — the sanity-check value \(\ln K\) for \(K = 10\)? A uniform prediction assigns probability \(1/K\) to the true class, so the loss is \(-\ln(1/K) = \ln K = \ln 10 \approx \) 2.303. If your run starts at, say, 6.0 instead, something is wrong with the labels, the logit scale, or the loss reduction — fix that before tuning anything else.

NEXT
You can now train a network that fits a fixed dataset; the next volume removes the dataset. Reinforcement learning replaces "minimize a loss on labeled examples" with "maximize a reward signal an agent must discover by acting" — a setting where the data is generated by the very policy you are optimizing. RL · 01 opens with the formalism that makes that tractable: the Markov decision process, states, actions, rewards, and the discounting that ties a future payoff to a present choice.

7.R References

Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015 — the first/second-moment adaptive optimizer with bias correction (EQ N7.3).
Loshchilov, I. & Hutter, F. (2017). Decoupled Weight Decay Regularization. ICLR 2019 — AdamW; why decoupling weight decay from the adaptive step generalizes better (EQ N7.4).
Micikevicius, P. et al. (2017). Mixed Precision Training. ICLR 2018 — fp16 training, the fp32 master copy, and loss scaling (EQ N7.7).
Loshchilov, I. & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017 — cosine annealing and cyclical warm restarts (EQ N7.5).
Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — the dropout regularizer and inverted-dropout scaling (EQ N7.6).
Keskar, N. S. et al. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017 — batch size, flat vs. sharp minima, and the generalization debate.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press — Ch. 8 (optimization) and Ch. 7 (regularization), the standard textbook treatment.
========================================================================
REINFORCEMENT LEARNING
========================================================================
## RL · The Reinforcement Learning Problem (https://ai-encyclopedia.com/rl/01-the-rl-problem.html)
REINFORCEMENT LEARNING · CHAPTER 01 / 06
The Reinforcement Learning Problem

Supervised learning hands the model a fixed set of labeled examples to imitate. Reinforcement learning gives an agent only a scalar reward and a world to act in, which raises a difficulty labels never do: the agent's own actions decide what data it sees next. There is no fixed dataset to fit, because the dataset follows from the policy, and improving the policy changes the dataset. This chapter states that loop precisely: the Markov decision process, the return it optimizes, the policies and value functions that summarize it, and the exploration and exploitation trade-off that has no counterpart in supervised learning.

LEVEL INTRO · READING TIME ≈ 24 MIN · BUILDS ON STATS · MARKOV CHAINS · INSTRUMENTS GRIDWORLD · MDP ANATOMY · γ EXPLORER
IN THIS CHAPTER 1.1 Agents, environments & rewards 1.2 The Markov Decision Process 1.3 Returns & discounting 1.4 Policies & value functions 1.5 Exploration vs exploitation 1.R References

1.1 Agents, environments & rewards

Reinforcement learning is the study of learning by interaction. There is an agent — the thing that learns and decides — and an environment — everything else, the world the agent acts in and cannot directly control. They meet in a loop that runs forever, or until the episode ends. At each step the agent observes a state, chooses an action, and the environment responds with a reward and a new state. That single sentence is the whole game; everything in this volume is built on it.

[Figure: the agent, with policy π(a | s), sends action aₜ to the environment, with dynamics P(s′, r | s, a), which returns reward rₜ₊₁ and state sₜ₊₁.]
The interaction loop. The agent emits an action; the environment returns a reward and the next state. Nothing else passes between them — no labels, no gradient, no ground-truth "correct action".
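The interaction loop can be written out directly. The sketch below is an illustration under assumptions: `CorridorEnv`, its `reset`/`step` methods, and the reward values are hypothetical names and numbers invented for this example, not an API from any RL library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, start at cell 0, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the move
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.04   # +1 on reaching the goal, small step cost otherwise
        return self.state, reward, done


def run_episode(env, policy, max_steps=200):
    """Run the observe -> act -> (reward, next state) loop for one episode."""
    state = env.reset()
    total = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # the agent chooses
        next_state, reward, done = env.step(action)  # the environment responds
        trajectory.append((state, action, reward, next_state))
        total += reward
        state = next_state
        if done:
            break
    return trajectory, total


rng = random.Random(0)
random_policy = lambda s: rng.choice((-1, 1))
always_right = lambda s: 1

traj, ret = run_episode(CorridorEnv(), always_right)
rand_traj, rand_ret = run_episode(CorridorEnv(), random_policy)
print(f"always-right: {len(traj)} steps, undiscounted return {ret:.2f}")
print(f"random walk:  {len(rand_traj)} steps, undiscounted return {rand_ret:.2f}")
```

Note that the only things crossing the boundary are the action in one direction and the reward and next state in the other; the two policies see different data simply because they act differently.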
The contrast with supervised learning is sharper than it first appears, and it is the reason RL is its own field rather than a corner of classification. Three differences matter:

The signal is evaluative, not instructive. A label says "the answer was 7." A reward says only "that was worth +1" — it never reveals what the best action would have been. The agent must infer the better action by trying alternatives, which it can only do by acting differently.
Feedback is delayed. The move that loses a chess game may have been made twenty turns earlier. Reward arrives long after the action that earned it, so the agent must solve a credit-assignment problem: which of my past decisions deserves the blame or the praise?
The data is not i.i.d. — it is generated by the agent. A supervised dataset sits still while you fit it. In RL the distribution of states the agent encounters is produced by its own policy; change the policy and you change the data. An agent that never drives into the city never learns city driving. This feedback between policy and data is the defining difficulty, and it is the central idea this chapter is built around.

The entire framework rests on one deceptively strong assumption, the reward hypothesis: that any goal we care about can be expressed as the maximization of expected cumulative scalar reward. Win the game; minimize fuel; keep the robot upright; finish the task a human would approve of. It is a remarkably general claim, and most of the field's hardest practical failures — reward hacking, specification gaming, agents that maximize the literal number while violating its intent — are failures of writing down the reward, not of optimizing it. We will return to this point repeatedly.

A word on what "reward" is not. It is not a label and not a loss. It is part of the environment's definition, chosen by the problem designer, and the agent treats it as given and immutable.
The agent's job is never to question the reward — only to collect as much of it, over time, as it can. 1.2 The Markov Decision Process To do anything rigorous we need to formalize that loop, and the standard formalization is the Markov Decision Process (MDP). An MDP is a Markov chain (see Stats · Markov Chains) with two additions: the transitions now depend on an action the agent chooses, and every transition emits a reward. It is the bridge between "an agent acting in a world" and "a problem we can solve with mathematics". EQ R1.1 — THE MDP TUPLE $$ \mathcal{M} = \langle\, \mathcal{S},\ \mathcal{A},\ P,\ R,\ \gamma \,\rangle $$ \(\mathcal{S}\) is the set of states; \(\mathcal{A}\) the set of actions; \(P(s' \mid s, a)\) the transition probability of landing in \(s'\) after taking \(a\) in \(s\); \(R(s, a)\) the expected immediate reward; and \(\gamma \in [0, 1]\) the discount factor (§1.3). Five objects fully specify the world. Everything an RL algorithm computes — values, policies, plans — is a function of this tuple alone. The word Markov carries the load. The state is assumed to be a sufficient statistic of the history: given the current state, the future is conditionally independent of the past. Where you go next depends only on where you are and what you do — not on how you got here. EQ R1.2 — THE MARKOV PROPERTY $$ P\big(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0\big) \;=\; P\big(s_{t+1} \mid s_t, a_t\big) $$ The full history collapses into the current state. This is not a law of nature — it is a property of how you choose to define the state. If a single video frame is not Markov (you cannot tell which way the ball is moving), stack four frames and it becomes Markov. Most of the art of applying RL is engineering a state representation that makes this assumption true enough to be useful. When the agent cannot observe the full state — only a partial observation — the problem becomes a POMDP, strictly harder, and is the subject of later chapters. 
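The tuple of EQ R1.1 and the Markov property of EQ R1.2 can be sketched as a small data structure. Everything named here (`MDP`, `validate`, `sample_next`, the two-state example and its numbers) is an assumption for illustration. The point to notice is the signature of `sample_next`: it receives only the current state and action, never the history.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {next_state: probability}
    R: dict       # R[(s, a)] -> expected immediate reward
    gamma: float  # discount factor in [0, 1]

    def validate(self):
        """Check that every transition row is a probability distribution."""
        rows_ok = all(abs(sum(d.values()) - 1.0) < 1e-9 for d in self.P.values())
        return rows_ok and 0.0 <= self.gamma <= 1.0

    def sample_next(self, s, a, rng):
        """Draw s' ~ P(. | s, a); depends on (s, a) only, which is the Markov property."""
        dist = self.P[(s, a)]
        return rng.choices(list(dist), weights=list(dist.values()))[0]


# Hypothetical two-state world: "go" reaches the goal 90% of the time, slips 10%.
mdp = MDP(
    states=["start", "goal"],
    actions=["go", "stay"],
    P={
        ("start", "go"): {"goal": 0.9, "start": 0.1},
        ("start", "stay"): {"start": 1.0},
        ("goal", "go"): {"goal": 1.0},
        ("goal", "stay"): {"goal": 1.0},
    },
    R={("start", "go"): 0.9, ("start", "stay"): 0.0,
       ("goal", "go"): 0.0, ("goal", "stay"): 0.0},
    gamma=0.9,
)

rng = random.Random(0)
n = 10_000
hits = sum(mdp.sample_next("start", "go", rng) == "goal" for _ in range(n))
freq = hits / n
print("valid MDP:", mdp.validate())
print(f"empirical P(goal | start, go) ~ {freq:.3f}")
```

If a problem's next state cannot be sampled from such a function without extra history, the state representation is not Markov and needs enriching.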
The transition function \(P\) and reward function \(R\) together are called the model of the environment. A crucial fork in the field follows from whether the agent knows them. If \(P\) and \(R\) are known, the agent can plan — compute the best policy by pure thought, no interaction required, which is the dynamic programming of Chapter 02. If they are unknown, the agent must learn from samples, which is the model-free reinforcement learning of the chapters after. The MDP is the common language for both.

| Symbol | Name | What it is | Gridworld example |
|---|---|---|---|
| \(\mathcal{S}\) | states | every distinguishable situation | each cell of the grid |
| \(\mathcal{A}\) | actions | the agent's choices in a state | up, down, left, right |
| \(P(s' \mid s, a)\) | transitions | where actions lead (maybe stochastic) | "move up" → cell above (90%), slip sideways (10%) |
| \(R(s, a)\) | reward | scalar feedback per step | +1 at the goal, −1 in a trap, −0.04 per step |
| \(\gamma\) | discount | how much the future counts | 0.9 (distant rewards matter, but less) |

PYTHON

    # Compute the discounted return of one short trajectory (EQ R1.3)
    import numpy as np

    # reward is given on arrival; one episode of four transitions
    rewards = np.array([0.0, 1.0, 1.0, 1.0])  # r_1, r_2, r_3, r_4 received over time
    gamma = 0.9

    # discounted return G = sum_t gamma^t * r_{t+1} (EQ R1.3)
    discounts = gamma ** np.arange(len(rewards))
    G = float(np.sum(discounts * rewards))
    print("reward sequence:", rewards.tolist())
    print("discount weights:", discounts.round(4).tolist())
    print(f"discount factor: gamma = {gamma}")
    print(f"discounted return G = {G:.4f}")

    # the worked case: rewards [1,1,1] at t=0,1,2 with gamma=0.9
    g3 = sum(gamma**t * 1.0 for t in range(3))
    print(f"\nreturn of [1,1,1] at gamma=0.9: {g3:.2f} (= 1 + 0.9 + 0.81)")

INSTRUMENT R1.1 — MDP ANATOMY (interactive; states START, MID, EDGE, GOAL; adjustable slip probability, default 0.10)
A four-state chain laid bare. Pick a state and watch its action edges fan out: the mint arrow is where the agent intends to go, the faint grey arrows are where it might slip. Raise the slip probability and the world grows stochastic — the same action no longer guarantees the same outcome, which is exactly what makes the agent need a policy over states rather than a fixed plan of moves. The GOAL state is terminal; its only edge loops back to itself with the terminal reward.

1.3 Returns & discounting

The agent does not maximize the immediate reward — that would be greedy and shortsighted. It maximizes the return: the total reward accumulated from now to the end of time. For an episode that terminates, the return is simply the sum of rewards. But many problems never terminate, and an undiscounted infinite sum of rewards can diverge, leaving nothing meaningful to compare. The fix is to discount: weight a reward received \(k\) steps in the future by \(\gamma^k\).

EQ R1.3 — THE DISCOUNTED RETURN
$$ G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^k\, R_{t+k+1} $$
\(G_t\) is the return from time \(t\) onward. The discount \(\gamma \in [0, 1]\) sets the agent's horizon: a reward \(k\) steps away is worth \(\gamma^k\) of an immediate one. This single number encodes how far-sighted the agent is. If every reward is bounded by \(R_{\max}\) and \(\gamma < 1\), the geometric series guarantees the return is finite, \(|G_t| \le R_{\max}/(1-\gamma)\), which is the whole reason discounting exists for continuing tasks.

Discounting earns its place for three independent reasons, and it helps to keep them distinct. Mathematically, \(\gamma < 1\) makes the infinite sum converge, so returns are comparable numbers rather than diverging infinities. Economically, it mirrors the time value of reward — a unit now is worth more than a unit later, exactly as in interest and inflation.
Behaviorally, it expresses uncertainty about the future: at each step there is effectively a \((1-\gamma)\) chance the world ends, so \(\gamma\) is the per-step survival probability. The return is the expected total reward under that geometric lifetime. EQ R1.4 — THE RECURSIVE FORM (WHY EVERYTHING IS BELLMAN) $$ G_t \;=\; R_{t+1} + \gamma\, G_{t+1} $$ Factor one \(\gamma\) out of EQ R1.3 and the return splits cleanly: this step's reward, plus the discounted return of everything after. This one-line recursion is the seed of every value-function method in the rest of the volume — the Bellman equations of Chapter 02 are nothing but EQ R1.4 with an expectation wrapped around it. Recognizing returns as self-referential is the conceptual leap from "sum up rewards" to "reinforcement learning". The endpoints of \(\gamma\) sharpen the intuition. At \(\gamma = 0\) the return collapses to \(R_{t+1}\): the agent cares only about the immediate reward and is purely myopic, blind to consequences. As \(\gamma \to 1\) every future reward counts almost fully and the agent becomes far-sighted, willing to suffer many small penalties now for a large payoff later — the patience a maze or a game of chess demands. Most problems live in between, and the choice of \(\gamma\) genuinely changes the optimal policy, not just the numbers: a maze-solver at \(\gamma = 0.99\) will take a longer, safer route that a \(\gamma = 0.8\) agent rejects as too distant to be worth it. An agent receives a reward of \(1\) at each of three consecutive steps and then the episode ends. With discount \(\gamma = 0.9\), what is the discounted return \(G\) from the first step? (Use EQ R1.3.) \(G = \gamma^0\cdot 1 + \gamma^1\cdot 1 + \gamma^2\cdot 1 = 1 + 0.9 + 0.81 = \) 2.71. Compare the undiscounted sum, which would be exactly 3 — discounting shaves off the value of the two later rewards. 
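The recursion of EQ R1.4 also gives the standard way to compute returns for every step of an episode in a single backward pass. The sketch below checks it against the direct sum of EQ R1.3; the function names are assumptions for this example, and `rewards[t]` stands for \(R_{t+1}\).

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t via G_t = R_{t+1} + gamma * G_{t+1} (EQ R1.4)."""
    G = 0.0                      # the return after the final step is zero
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out


def return_direct(rewards, gamma):
    """Compute G_0 as the explicit discounted sum (EQ R1.3)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0]        # the three-step episode from the exercise above
gamma = 0.9
G_all = returns_backward(rewards, gamma)
G0 = return_direct(rewards, gamma)
print("returns from each step:", [round(g, 4) for g in G_all])
print("direct sum from step 0:", round(G0, 4))
```

The backward pass costs one multiply-add per step, whereas recomputing the direct sum from each step would be quadratic in the episode length.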
True or false: setting \(\gamma = 0\) makes the agent purely myopic — it optimizes only the immediate reward \(R_{t+1}\) and ignores all future consequences. (Answer true or false.) In EQ R1.3 every term except the first carries a factor of \(\gamma^k\) with \(k \ge 1\); at \(\gamma = 0\) all of those vanish (\(0^k = 0\)), leaving \(G_t = R_{t+1}\) alone. The agent values only the next reward and is blind to everything after it. The statement is true.

PYTHON

    # How the discount factor reshapes the same reward stream (EQ R1.3)
    import numpy as np

    rewards = np.ones(40)  # a steady +1 every step, 40 steps
    horizons = []
    for gamma in (0.0, 0.5, 0.9, 0.99):
        w = gamma ** np.arange(len(rewards))  # discount weights
        G = float((w * rewards).sum())
        # "effective horizon" 1/(1-gamma): steps that meaningfully count
        eff = np.inf if gamma >= 1 else 1.0 / (1.0 - gamma)
        horizons.append((gamma, G, eff))
        print(f"gamma={gamma:4} -> return {G:7.3f} effective horizon ~ {eff:6.1f} steps")

    print("\ngamma=0 is myopic: return is exactly the first reward (1.0).")
    print("near gamma=1 the agent counts most of the 40 rewards and plans far ahead.")
    # closed form for an infinite stream of +1: 1/(1-gamma)
    print("infinite-stream limits 1/(1-g):", [round(1/(1-g),2) for g in (0.5,0.9,0.99)])

    # plot_xy appears to be a plotting helper supplied by the site's in-browser
    # runtime; the plot is skipped when the cell runs anywhere else.
    if "plot_xy" in globals():
        plot_xy([h[0] for h in horizons], [h[1] for h in horizons])

INSTRUMENT R1.2 — DISCOUNT-FACTOR EXPLORER (interactive; γ slider, default 0.90; readouts: effective horizon 1/(1−γ), return of a steady +1 stream, weight on step 10)
Each bar is the weight \(\gamma^k\) the agent places on the reward \(k\) steps ahead. At \(\gamma = 0\) only the first bar survives — pure myopia. Slide toward 1 and the bars flatten into a long, slowly-decaying tail: the agent's gaze stretches further into the future. The effective horizon \(1/(1-\gamma)\) is a useful rule of thumb for how many steps actually matter — γ = 0.9 sees about 10 steps, γ = 0.99 about 100. Notice how violently the horizon stretches as γ approaches 1; that sensitivity is why γ is one of the most consequential knobs in all of RL.

1.4 Policies & value functions

A policy is the agent's behavior: a rule that maps states to actions. It is the object RL ultimately searches for — the solution to the MDP. A policy can be deterministic, \(a = \pi(s)\), always taking the same action in a state; or stochastic, \(\pi(a \mid s)\), a probability distribution over actions. Stochastic policies matter more than they look: they are how an agent explores, and in some problems (partially observed ones, and simultaneous-move adversarial games such as rock-paper-scissors) the optimal policy is irreducibly random.

EQ R1.5 — THE POLICY
$$ \pi(a \mid s) \;=\; \Pr\big(A_t = a \mid S_t = s\big), \qquad \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1 $$
A policy is a conditional distribution over actions given the state. The entire goal of reinforcement learning is to find a policy that maximizes expected return. Crucially the policy depends on the state only — not on the time step or the history — which is exactly what the Markov property (EQ R1.2) buys us: in a Markov world, an optimal policy that depends only on the current state always exists.

To improve a policy we need to know how good it is, and "good" means expected return. This gives the two central quantities of the field. The state-value function \(V^\pi(s)\) is the expected return from starting in state \(s\) and following \(\pi\) thereafter. The action-value function \(Q^\pi(s, a)\) is the expected return from taking action \(a\) in state \(s\) and then following \(\pi\).

EQ R1.6 — STATE-VALUE AND ACTION-VALUE
$$ V^\pi(s) \;=\; \mathbb{E}_\pi\!\big[\, G_t \mid S_t = s \,\big], \qquad Q^\pi(s, a) \;=\; \mathbb{E}_\pi\!\big[\, G_t \mid S_t = s,\, A_t = a \,\big] $$
\(V^\pi(s)\) answers "how much reward can I expect from here, behaving like this?" and \(Q^\pi(s,a)\) answers "and how much if I commit to action \(a\) first?".
They are linked by \(V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)\). The difference \(Q^\pi(s,a) - V^\pi(s)\) — the advantage — is the single most useful quantity in modern policy-gradient RL, because it says whether an action beats the policy's own average without you needing to know the absolute scale. Why two functions? Because of what each one lets you do without a model. If you know \(V^\pi\) but not the transitions \(P\), you cannot act greedily — you would need to know where each action leads to compare states. But if you know \(Q^\pi\), choosing the best action is trivial: take \(\arg\max_a Q(s,a)\), no model required. This is precisely why model-free control methods (Q-learning, SARSA) learn \(Q\) rather than \(V\). The value functions are the connective tissue of the whole field: estimate them, and a good policy falls out. A subtlety experts will insist on: values are always relative to a policy. There is no such thing as "the value of a state" in the abstract — only its value under some way of behaving. The exception is the optimal value function \(V^{*}(s) = \max_\pi V^\pi(s)\), the best achievable from each state, which Chapter 02 shows how to compute. Solving an MDP means finding \(V^{*}\) (or \(Q^{*}\)) and reading the optimal policy off it. In state \(s\) the policy is \(\pi(\text{up}\mid s) = 0.7\) and \(\pi(\text{down}\mid s) = 0.3\). The action-values are \(Q(s,\text{up}) = 10\) and \(Q(s,\text{down}) = 7\). What is the state-value \(V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)\)? \(V^\pi(s) = 0.7 \times 10 + 0.3 \times 7 = 7 + 2.1 = \) 9.1. The state-value is just the policy-weighted average of the action-values — which is why a policy that puts more weight on the better action raises \(V\). 
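The "why two functions" argument can be made concrete. Turning state values into a greedy choice needs a one-step lookahead through the model, \(Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,V^\pi(s')\), whereas a stored \(Q\) needs only an argmax. The successor states, probabilities, and values below are hypothetical numbers for a single decision state.

```python
gamma = 0.9

# Model of the environment for one decision state s (needed only on the V route).
P = {"left": {"A": 1.0}, "right": {"A": 0.2, "B": 0.8}}   # P(s' | s, a)
R = {"left": 0.0, "right": 0.0}                           # R(s, a)
V = {"A": 1.0, "B": 5.0}                                  # V(s') of the successor states


def q_from_v(action):
    """One-step lookahead: requires the transition and reward model."""
    return R[action] + gamma * sum(p * V[s2] for s2, p in P[action].items())


# Route 1: the agent knows V and the model, and must look ahead to compare actions.
Q_lookahead = {a: q_from_v(a) for a in P}
greedy_via_model = max(Q_lookahead, key=Q_lookahead.get)

# Route 2: the agent has learned Q directly, so no P or R is consulted.
Q_learned = dict(Q_lookahead)   # stand-in for values a model-free method would estimate
greedy_model_free = max(Q_learned, key=Q_learned.get)

print("lookahead Q:", {a: round(q, 3) for a, q in Q_lookahead.items()})
print("greedy via model:", greedy_via_model, "| greedy from Q alone:", greedy_model_free)
```

Delete `P` and `R` and the first route stops working while the second is untouched, which is the practical reason model-free control methods store \(Q\).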
PYTHON

    # V from Q, the advantage, and which action a greedy agent would pick (EQ R1.6)
    import numpy as np

    actions = ["up", "down", "left", "right"]
    Q = np.array([10.0, 7.0, 4.0, 9.0])      # Q(s, a) for one state s
    pi = np.array([0.70, 0.10, 0.05, 0.15])  # current stochastic policy pi(a|s)
    assert np.isclose(pi.sum(), 1.0)

    V = float(pi @ Q)   # V(s) = sum_a pi(a|s) Q(s,a)
    advantage = Q - V   # A(s,a) = Q(s,a) - V(s)
    print(f"state value V(s): {V:.3f}")
    for a, q, adv, p in zip(actions, Q, advantage, pi):
        flag = " <- above average" if adv > 0 else ""
        print(f" {a:5s} Q={q:5.1f} A={adv:+5.2f} pi={p:.2f}{flag}")

    greedy = actions[int(np.argmax(Q))]
    print(f"\ngreedy action (argmax Q): {greedy}")
    print("a policy gradient step pushes pi toward actions with positive advantage.")

1.5 Exploration vs exploitation

Here is the dilemma that has no counterpart in supervised learning. To collect reward, the agent should exploit — take the action it currently believes is best. But its beliefs are estimates built from limited experience, and the only way to improve them is to explore — try actions it is unsure about, which may be worse. Every step forces a choice between cashing in what you know and gathering information that might let you do better later. Lean too far toward exploitation and you lock onto a mediocre habit, never discovering the better path you never tried. Lean too far toward exploration and you squander reward forever testing options you already know are bad.

This tension is fundamental, not an artifact of any algorithm, and it is sharpened by the feedback we opened the chapter with: because the agent's actions decide what data it sees, an action never taken is an action never learned from. A supervised learner sees every labeled example in the dataset whether it likes it or not.
An RL agent sees only the consequences of what it chose to do — so under-exploration is self-reinforcing, a blind spot the agent cannot detect from inside.

The simplest workable strategy is ε-greedy: with probability \(1-\varepsilon\) take the current best (greedy) action, and with probability \(\varepsilon\) take a uniformly random action instead. It is crude but remarkably effective, and it is the exploration rule baked into the first generation of deep-RL agents.

EQ R1.7 — ε-GREEDY ACTION SELECTION
$$ \pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[1.2em] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases} $$
With probability \(1-\varepsilon\) the agent exploits the greedy action; with probability \(\varepsilon\) it picks uniformly at random (so even the greedy action keeps a small \(\varepsilon/|\mathcal{A}|\) slice of the random mass). \(\varepsilon\) is the exploration rate, and it is almost always annealed — started high so the agent samples widely, then decayed toward zero as its value estimates sharpen and exploitation becomes safe. At \(\varepsilon = 0\) the policy is purely greedy; at \(\varepsilon = 1\) it is a uniformly random walk.

ε-greedy is not the only tool, and it has a real weakness worth stating: it explores blindly, treating a clearly-terrible action and a plausibly-good-but-untested one as equally worth a random visit. Smarter schemes — optimism in the face of uncertainty (UCB, which adds a bonus for actions tried less often), Boltzmann / softmax exploration (sample each action with probability proportional to \(\exp(Q(s,a)/\tau)\) for a temperature \(\tau\), so higher-valued actions are tried more often), and posterior sampling (Thompson sampling) — direct exploration toward what is genuinely uncertain or promising rather than uniformly random.
The multi-armed bandit, the simplest possible RL problem (one state, no transitions, just the explore–exploit trade-off in isolation), is where these are studied cleanly and where the regret bounds that quantify "how much you lose by not knowing the best arm" are proved. We treat bandits in their own chapter. An ε-greedy agent uses \(\varepsilon = 0.1\) over \(|\mathcal{A}| = 4\) actions. Using EQ R1.7, what total probability does it place on the single greedy action \(\arg\max_a Q(s,a)\)? \(1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} = 1 - 0.1 + \dfrac{0.1}{4} = 0.9 + 0.025 = \) 0.925. The remaining \(0.075\) is split evenly across the other three actions (\(0.025\) each), so the four probabilities sum to 1. PYTHON · RUNNABLE IN-BROWSER # Epsilon-greedy action selection, with an annealed exploration rate (EQ R1.7) import numpy as np rng = np.random.default_rng(0) Q = np.array([1.0, 5.0, 2.0, 4.0]) # estimated action-values, 4 actions nA = len(Q) greedy = int(np.argmax(Q)) # action index 1 here (value 5.0) def eps_greedy(Q, eps): if rng.random() < eps: # explore: uniform random return rng.integers(len(Q)) return int(np.argmax(Q)) # exploit: the greedy action # anneal epsilon from 1.0 down to a 0.05 floor, and report the rate print(" step epsilon P(greedy)=1-eps+eps/|A| empirical greedy share") for step in (0, 50, 200, 1000): eps = max(0.05, 1.0 * (0.995 ** step)) # exponential decay with a floor p_greedy = 1 - eps + eps / nA picks = [eps_greedy(Q, eps) for _ in range(4000)] share = np.mean(np.array(picks) == greedy) print(f" {step:5d} {eps:6.3f} {p_greedy:6.3f} {share:6.3f}") print(f"\ngreedy action = index {greedy} (Q = {Q[greedy]}).") print("as epsilon decays, the agent shifts from exploring to exploiting it.") RUN ▶ edits are live — break it on purpose INSTRUMENT R1.3 — GRIDWORLD EXPLORER SET REWARDS · WATCH A POLICY EMERGE · VALUE ITERATION DISCOUNT γ 0.90 STEP COST −0.04 PAINT GOAL +1 TRAP −1 WALL CLEAR VIEW VALUES + POLICY SWEEPS TO CONVERGE — V AT 
START CELL — Click cells to paint a goal (+1), a trap (−1), or a wall, then watch value iteration flood the grid: the mint shading is the state-value \(V(s)\), and the arrows are the greedy policy it implies — a plan the agent never wrote, only computed. Make the step cost more negative and the policy grows impatient, hugging the shortest path and accepting risk near the trap; soften it and the agent takes longer, safer detours. Raise γ and distant goals pull harder, reshaping arrows two and three cells away. This is the planning of Chapter 02, run live on whatever world you paint. (This solves a known MDP by dynamic programming — the agent here is given \(P\) and \(R\); the model-free chapters drop that luxury.) NEXT We can now state the problem exactly — but not yet solve it. Chapter 02 supplies the first machinery: the Bellman equations, which turn the recursive return (EQ R1.4) into a fixed-point system, and the dynamic-programming algorithms — policy evaluation, policy iteration, and value iteration (the engine behind the Gridworld instrument above) — that compute the optimal value function and policy when the model \(P\) and \(R\) is known. Everything after that is the harder, real-world case: learning the same answers from experience alone. 1.R References Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical text; the agent–environment loop, MDPs, returns, value functions, and exploration as framed in this chapter. Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the formal origin of the MDP (EQ R1.1) and the recursive value relation behind EQ R1.4. Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the action-value function \(Q^\pi\) (EQ R1.6) and the convergence result that grounds model-free control. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. 
Nature 518 — deep Q-networks; the modern demonstration that ε-greedy exploration (EQ R1.7) scales to high-dimensional state spaces. Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 — UCB and the regret framework for principled exploration beyond ε-greedy (§1.5). Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 — an early, lucid survey of the problem formulation used throughout this chapter.
## RL · Dynamic Programming (https://ai-encyclopedia.com/rl/02-dynamic-programming.html)
Dynamic Programming — Value & Policy Iteration — AI Encyclopedia AI // ENCYCLOPEDIA / REINFORCEMENT LEARNING / 02 / DYNAMIC PROGRAMMING INDEX NEXT: MODEL-FREE VALUE → REINFORCEMENT LEARNING · CHAPTER 02 / 06 Dynamic Programming — Value & Policy Iteration The previous chapter posed the control problem and the value functions that summarize it. This chapter solves it exactly, under one strong assumption: that you hold the environment's dynamics in hand. When you know the model, the Bellman equation turns optimal control into a fixed point you can iterate to. Policy evaluation, value iteration, and policy iteration are three ways of running that iteration. Understanding why they converge is the foundation every model-free method later imitates. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON RL · CH 01 · MARKOV CHAINS INSTRUMENTS VALUE ITERATION · POLICY ITERATION · BELLMAN BACKUP IN THIS CHAPTER 2.1 The Bellman equations 2.2 Policy evaluation 2.3 Value iteration 2.4 Policy iteration 2.5 Why DP needs the model 2.R References 2.1 The Bellman equations A value function answers one question: starting here, and acting in some way forever after, how much reward do I expect to collect? The trick that makes that infinite sum tractable is recursion. The value of a state is the immediate reward plus the discounted value of wherever you land — the future is just another instance of the same problem, one step smaller. Writing that observation down for a fixed policy \(\pi\) gives the Bellman expectation equation. EQ R2.1 — BELLMAN EXPECTATION EQUATION $$ V^\pi(s) \;=\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^\pi(s') \,\Big] $$ Read it right-to-left as an average over what could happen: pick an action from the policy \(\pi(a\mid s)\), let the environment roll the transition \(p(s', r \mid s, a)\), collect reward \(r\), and add the discounted value of the next state. 
Because \(V^\pi\) appears on both sides, this is not a definition you evaluate once — it is a system of \(|\mathcal{S}|\) linear equations whose unique solution is the value function. Everything in this chapter is a way of solving it. For control we want the best policy, not a fixed one. Replace the average over the policy with a maximum over actions and you get the Bellman optimality equation — the equation this whole chapter exists to solve. EQ R2.2 — BELLMAN OPTIMALITY EQUATION $$ V^{*}(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^{*}(s') \,\Big] \;=\; \max_a Q^{*}(s, a) $$ The optimal value of a state is achieved by the single best action, assuming you continue optimally thereafter. This is no longer linear — the \(\max\) makes it a nonlinear fixed-point equation — but it still has a unique solution \(V^{*}\), and once you have it the optimal policy falls out for free: act greedily, \(\pi^{*}(s) = \arg\max_a Q^{*}(s, a)\). The deterministic-environment special case is the familiar \(V^{*}(s) = \max_a\,[\,r + \gamma\, V^{*}(s')\,]\). It is worth being precise about why a solution even exists. Define the Bellman optimality operator \(\mathcal{T}\) acting on any value table \(V\): EQ R2.3 — THE BELLMAN OPTIMALITY OPERATOR $$ (\mathcal{T}V)(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V(s') \,\Big] $$ \(\mathcal{T}\) takes a guess at the value function and returns a better guess. \(V^{*}\) is exactly its fixed point: \(\mathcal{T}V^{*} = V^{*}\). The decisive fact — proven in §2.3 — is that for \(\gamma < 1\) the operator is a \(\gamma\)- contraction in the max-norm: it shrinks the distance between any two value tables by at least a factor \(\gamma\). The Banach fixed-point theorem then guarantees a unique fixed point that you reach by iterating from anywhere. That single property is why every algorithm below works. 
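The fixed-point claim can be checked numerically. The sketch below is only an illustration under stated assumptions: the two-state, two-action MDP in `P` and `R` is invented for the purpose (it is not an example from this chapter), and rewards are written in the expected-reward form \(R(s,a)\) rather than the joint \(p(s', r \mid s, a)\) of EQ R2.3. It applies the operator repeatedly from two very different starting tables and confirms that both settle on the same \(V^{*}\).

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP (an assumption for illustration only).
# P[a, s, s2] = probability of landing in s2 after action a in state s.
# R[a, s]     = expected immediate reward for action a in state s.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman_T(V):
    """One application of the Bellman optimality operator (EQ R2.3)."""
    Q = R + gamma * (P @ V)   # Q[a, s] = R[a, s] + gamma * sum_s2 P[a, s, s2] * V[s2]
    return Q.max(axis=0)      # best action in each state

def iterate(V, n=500):
    """Apply the operator n times starting from the table V."""
    for _ in range(n):
        V = bellman_T(V)
    return V

V_a = iterate(np.zeros(2))                  # start from zeros
V_b = iterate(np.array([100.0, -100.0]))    # start from a deliberately bad guess
residual = np.max(np.abs(bellman_T(V_a) - V_a))

print("V* from zeros:       ", V_a.round(4))
print("V* from a bad guess: ", V_b.round(4))
print("fixed-point residual:", residual)
```

Both runs should agree to numerical precision, and the residual \(\lVert \mathcal{T}V - V \rVert_\infty\) should be essentially zero, which is what "fixed point reached from anywhere" means in practice.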
True or false: in a deterministic environment the optimal value satisfies \( V^{*}(s) = \max_a\,[\,r + \gamma\, V^{*}(s')\,] \), where \(s'\) and \(r\) are the state and reward that action \(a\) produces. (Answer true or false.) This is exactly EQ R2.2 specialized to a deterministic transition, where the sum \(\sum_{s',r} p(s',r\mid s,a)[\cdot]\) collapses to a single term because each action leads to one \((s', r)\) with probability 1. The answer is true. INSTRUMENT R2.1 — BELLMAN-BACKUP EXPLORER ONE STATE · TWO ACTIONS · EQ R2.2 A single state with two actions. Each action gives an immediate reward and lands in a successor whose value you already estimate. The backup computes \(Q(a) = r_a + \gamma\,V(s'_a)\) for each action and takes the max — one cell of the table that value iteration sweeps over. DISCOUNT γ 0.90 ACTION A · reward r 5 ACTION A · next-state value V(s′) 10 ACTION B · reward r 1 ACTION B · next-state value V(s′) 20 Q(A) = r + γ·V(s′) — Q(B) = r + γ·V(s′) — V(s) = max Q · GREEDY ACTION — Push γ to 0 and the backup becomes myopic — only the immediate reward matters, so action A (higher r) wins. Push γ toward 1 and the future dominates: action B's richer successor takes over. The crossover is the whole story of discounting. The greedy arrow lights up the winning action; this single cell, evaluated for every state, is one sweep of value iteration. 2.2 Policy evaluation Before improving a policy you must score it. Given a fixed \(\pi\), policy evaluation computes \(V^\pi\) — the answer to EQ R2.1. You could solve the linear system directly (invert an \(|\mathcal{S}| \times |\mathcal{S}|\) matrix), but for anything beyond toy sizes the iterative method is cheaper and is the template for everything that follows. Turn the Bellman expectation equation into an update rule: take your current estimate, plug it into the right-hand side, and read off a new estimate. 
EQ R2.4 — ITERATIVE POLICY EVALUATION $$ V_{k+1}(s) \;\leftarrow\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V_k(s') \,\Big] $$ Start from any \(V_0\) (zeros are fine), sweep this backup over every state, repeat. This is the Bellman expectation operator \(\mathcal{T}^\pi\), and for \(\gamma < 1\) it too is a \(\gamma\)-contraction, so \(V_k \to V^\pi\) geometrically. A sweep is one full pass over the state space; convergence is declared when the largest change \(\max_s |V_{k+1}(s) - V_k(s)|\) drops below a tolerance \(\theta\). In-place updates, which overwrite \(V(s)\) as you go rather than buffering a fresh copy, usually converge faster and need one value array instead of two; this is Gauss–Seidel style and is the common choice in practice. The canonical example is Sutton & Barto's \(4 \times 4\) gridworld: an agent moves up/down/left/right, the two opposite corners are terminal, and every move costs \(-1\) until the agent escapes (\(\gamma = 1\)). Under the equiprobable random policy, where each direction is chosen with probability \(0.25\), iterative policy evaluation converges to a value surface that grows more negative the farther a cell sits from a terminal. The cells adjacent to a terminal settle at \(-14\); the four central cells reach \(-18\) or \(-20\); and the two non-terminal corners, the cells farthest from any exit, reach \(-22\). Those numbers are not arbitrary; they are the expected number of steps a random walker takes to stumble out, negated.
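The direct route mentioned above, solving the linear system rather than sweeping, is small enough to write out for this gridworld. What follows is a sketch, not the chapter's own method: it restricts the system to the 14 non-terminal states (terminals are pinned at 0, which is what keeps the matrix invertible at \(\gamma = 1\)) and hands it to `numpy.linalg.solve`. The names `states`, `idx`, `A`, and `b` are introduced here for illustration.

```python
import numpy as np

TERM = {0, 15}   # two terminal corners

def step(s, a):  # 0=up 1=down 2=left 3=right; moving off the edge leaves s unchanged
    r, c = divmod(s, 4)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 3)
    return r * 4 + c

states = [s for s in range(16) if s not in TERM]   # the 14 non-terminal states
idx = {s: i for i, s in enumerate(states)}         # state -> row/column in the system

# Build (I - P_pi) V = r for the equiprobable random policy with gamma = 1.
A = np.eye(len(states))
b = np.full(len(states), -1.0)                     # every move costs -1
for s in states:
    for a in range(4):
        s2 = step(s, a)
        if s2 not in TERM:                         # terminal successors contribute V = 0
            A[idx[s], idx[s2]] -= 0.25

V = np.zeros(16)
V[states] = np.linalg.solve(A, b)
print(np.round(V, 1).reshape(4, 4))
```

The solution should reproduce the surface described above: \(-14\) next to each terminal, \(-18\) and \(-20\) in the middle, \(-22\) in the two far corners. For 14 unknowns the solve is instant; its cost grows roughly cubically with the number of states, which is the reason the iterative sweep is preferred at scale.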
PYTHON · RUNNABLE IN-BROWSER

# Iterative policy evaluation (EQ R2.4): 4x4 gridworld, random policy
import numpy as np

TERM = {0, 15}   # two terminal corners

def step(s, a):  # 0=up 1=down 2=left 3=right; off-edge = stay
    r, c = divmod(s, 4)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 3)
    return r * 4 + c

V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
    delta = 0.0
    for s in range(16):
        if s in TERM:   # value of a terminal state is 0
            continue
        v = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in range(4))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-4:
        break

print(f"converged in {sweep + 1} sweeps")
print(np.round(V, 1).reshape(4, 4))   # matches Sutton & Barto Fig 4.1

What policy evaluation cannot do. It scores a policy; it does not improve one. By itself it would just confirm that wandering randomly is expensive. The leverage comes from pairing evaluation with a greedy step — the subject of §2.4 — or from folding the two together, which is §2.3.

2.3 Value iteration

Why fully evaluate a policy you are about to throw away? Value iteration short-circuits the loop: do a single backup, but use the max over actions instead of the policy average. That is the Bellman optimality operator \(\mathcal{T}\) (EQ R2.3) applied as an update rule, and iterating it drives \(V\) straight to \(V^{*}\) without ever naming an intermediate policy.

EQ R2.5 — VALUE ITERATION $$ V_{k+1}(s) \;\leftarrow\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V_k(s') \,\Big] $$ One greedy backup per state per sweep; no inner evaluation loop. You can read it as policy evaluation truncated to a single step before re-improving. When the max change across a sweep falls below \(\theta\), stop and read off the greedy policy \(\pi(s) = \arg\max_a Q(s, a)\) once.
On the gridworld with \(\gamma = 1\) this converges in just four sweeps to \(V^{*}(s) = -(\text{shortest-path distance to the nearest terminal})\) — information propagates outward from the goals one ring per sweep, exactly like a flood fill. The convergence guarantee rests on the contraction property promised in §2.1. For any two value tables \(U, V\), the max-norm distance after a backup shrinks: EQ R2.6 — \(\mathcal{T}\) IS A γ-CONTRACTION $$ \lVert \mathcal{T}U - \mathcal{T}V \rVert_\infty \;\le\; \gamma\, \lVert U - V \rVert_\infty $$ Because the operator differs between \(U\) and \(V\) only through the discounted future term \(\gamma V(s')\), and \(|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|\), every backup multiplies the worst-case error by at most \(\gamma\). So after \(k\) sweeps the error is bounded by \(\gamma^{k}\) times the initial error — geometric convergence, with the rate set entirely by \(\gamma\). At \(\gamma = 0.9\) you lose roughly one digit of error every ~22 sweeps; the closer \(\gamma\) creeps to 1, the slower the crawl, which is why long-horizon problems are genuinely hard. (At \(\gamma = 1\), as in the episodic gridworld, contraction is not strict — convergence instead relies on every state reaching a terminal, i.e. a proper policy.) True or false: value iteration converges because the Bellman optimality operator \(\mathcal{T}\) is a contraction (for \(\gamma < 1\)), so the Banach fixed-point theorem guarantees a unique fixed point reached from any starting \(V_0\). (Answer true or false.) EQ R2.6 establishes \(\lVert \mathcal{T}U - \mathcal{T}V \rVert_\infty \le \gamma \lVert U - V \rVert_\infty\) with \(\gamma < 1\), which is exactly the definition of a contraction mapping. The Banach fixed-point theorem then guarantees a unique fixed point \(V^{*}\) and convergence to it from any initialization. The answer is true. 
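The inequality in EQ R2.6 and the "one digit every ~22 sweeps" figure can both be checked empirically. The sketch below assumes a small randomly generated MDP (the arrays `P` and `R`, the sizes `nS` and `nA`, and the expected-reward form \(R(s,a)\) are illustrative assumptions, not taken from the chapter). It draws many random pairs of value tables, applies one backup to each, and records the largest observed ratio of distances.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
gamma, nS, nA = 0.9, 6, 3

# Hypothetical random MDP: each P[a, s, :] is a probability vector over next states.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(nA, nS))

def bellman_T(V):
    """Bellman optimality backup (EQ R2.3) in expected-reward form."""
    return (R + gamma * (P @ V)).max(axis=0)

worst_ratio = 0.0
for _ in range(1000):
    U = rng.normal(size=nS) * 10
    V = rng.normal(size=nS) * 10
    before = np.max(np.abs(U - V))                          # ||U - V||_inf
    after = np.max(np.abs(bellman_T(U) - bellman_T(V)))     # ||TU - TV||_inf
    worst_ratio = max(worst_ratio, after / before)

# Number of sweeps k with gamma**k = 0.1, i.e. one decimal digit of the error bound.
sweeps_per_digit = math.log(0.1) / math.log(gamma)

print(f"largest observed ratio: {worst_ratio:.4f}  (bound: gamma = {gamma})")
print(f"sweeps per digit at gamma = {gamma}: {sweeps_per_digit:.1f}")
```

No sampled pair should exceed the bound \(\gamma\), and the last line evaluates \(\ln 0.1 / \ln 0.9 \approx 21.9\), which is where the "~22 sweeps" comes from. A random search like this can only fail to find a counterexample; the proof is the argument given under EQ R2.6.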
PYTHON · RUNNABLE IN-BROWSER

# Value iteration (EQ R2.5): converged optimal value function
import numpy as np

TERM = {0, 15}

def step(s, a):
    r, c = divmod(s, 4)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 3)
    return r * 4 + c

V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
    delta = 0.0
    for s in range(16):
        if s in TERM:
            continue
        best = max(-1 + gamma * V[step(s, a)] for a in range(4))
        delta = max(delta, abs(best - V[s]))
        V[s] = best   # in-place (Gauss-Seidel) backup
    if delta < 1e-9:
        break

print(f"value iteration converged in {sweep + 1} sweeps")
print("V* = -(steps to nearest terminal):")
print(V.reshape(4, 4))

INSTRUMENT R2.2 — VALUE-ITERATION STEPPER · 4×4 GRIDWORLD · γ=1 · EQ R2.5
Two terminal corners (mint). Every move costs \(-1\). Step the backup one sweep at a time and watch value flood outward from the goals — exactly one ring per sweep — then settle into \(V^{*}(s) = -(\text{distance to nearest terminal})\).
CONTROL: STEP SWEEP ▶ · RUN TO CONVERGE ⏩ · RESET ↺ · SWEEP k 0 · MAX CHANGE Δ THIS SWEEP — · STATUS INITIAL

At sweep 0 every non-terminal cell reads 0. After sweep 1, every non-terminal cell reads \(-1\), which is already correct for the cells one step from a goal; after sweep 2, the next ring learns \(-2\); the deepest cells, three steps from either exit, lock in at \(-3\) on sweep 3, and sweep 4 changes nothing, so Δ hits 0 and the run has converged. The arrows show the greedy policy implied by the current values: even mid-iteration they already point roughly toward the exits.

2.4 Policy iteration

Policy iteration takes the opposite tack to value iteration. Rather than interleaving one tiny backup with one tiny improvement, it alternates two full-strength phases: evaluate the current policy completely (§2.2), then make it greedy with respect to the values you just computed. Repeat until the greedy step changes nothing.
EQ R2.7 — POLICY IMPROVEMENT (GREEDY STEP) $$ \pi'(s) \;=\; \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^\pi(s') \,\Big] $$ Given \(V^\pi\), act greedily one step then follow \(\pi\) — this can only help. The policy improvement theorem guarantees \(V^{\pi'}(s) \ge V^\pi(s)\) for every state, with strict improvement somewhere unless \(\pi\) is already optimal. So the sequence of policies is monotonically non-decreasing in value, and since a finite MDP has only finitely many deterministic policies, the loop must terminate at \(\pi^{*}\) in a finite number of steps — no tolerance, no approximation. On the \(4\times 4\) gridworld it converges in four improvement rounds. The two algorithms are endpoints of a spectrum that the generalized view unifies. Policy iteration runs evaluation to convergence before improving; value iteration improves after a single evaluation backup; modified policy iteration sits in between, running \(m\) evaluation sweeps per improvement. All three are instances of generalized policy iteration (GPI): evaluation and improvement chasing each other until they agree — and where they agree, the policy is greedy with respect to its own value, which is precisely the Bellman optimality condition. GPI Two processes, one fixed point. Evaluation makes the value consistent with the policy; improvement makes the policy greedy with respect to the value. Each step undoes a little of the other's work — until the only place they both rest is the optimal pair \((V^{*}, \pi^{*})\). Almost every RL algorithm in the rest of this volume, model-free included, is a flavor of GPI with the exact backup of EQ R2.5 replaced by a sampled estimate. 
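The policy improvement theorem is easy to test on the same gridworld. The sketch below departs from the chapter's setup in one labeled way: it uses \(\gamma = 0.9\) instead of \(\gamma = 1\), an assumption made so that every policy, including ones that walk into a wall forever, has a finite value and can be evaluated exactly by a linear solve. The helper names `evaluate` and `greedy` are introduced here for illustration.

```python
import numpy as np

TERM = {0, 15}
gamma = 0.9   # assumption: discounting keeps the value of every policy finite

def step(s, a):  # 0=up 1=down 2=left 3=right; off-edge = stay
    r, c = divmod(s, 4)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 3)
    return r * 4 + c

def evaluate(pi):
    """Exact V^pi for a deterministic policy: solve (I - gamma * P_pi) V = r."""
    A, b = np.eye(16), np.zeros(16)
    for s in range(16):
        if s in TERM:
            continue                       # terminal row stays V[s] = 0
        A[s, step(s, pi[s])] -= gamma      # V[s] - gamma * V[next] = -1
        b[s] = -1.0
    return np.linalg.solve(A, b)

def greedy(V):
    """Greedy step of EQ R2.7 with respect to the value table V."""
    pi = np.zeros(16, dtype=int)
    for s in range(16):
        if s not in TERM:
            pi[s] = int(np.argmax([-1 + gamma * V[step(s, a)] for a in range(4)]))
    return pi

rng = np.random.default_rng(0)
pi = rng.integers(4, size=16)              # an arbitrary deterministic starting policy
V_old = evaluate(pi)
pi_new = greedy(V_old)
V_new = evaluate(pi_new)

print("smallest change in value over all states:", (V_new - V_old).min())
print("states strictly improved:", int(np.sum(V_new > V_old + 1e-12)))
```

The smallest change should never be negative: one greedy step cannot make any state worse. Repeating the `evaluate` and `greedy` pair until `pi_new` equals `pi` is policy iteration itself.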
PYTHON · RUNNABLE IN-BROWSER

# Policy iteration (EQ R2.4 + EQ R2.7): print the optimal policy
import numpy as np

TERM = {0, 15}

def step(s, a):
    r, c = divmod(s, 4)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 3)
    return r * 4 + c

gamma = 1.0
pi = np.zeros(16, dtype=int)   # start: everyone goes "up"

def evaluate(pi):   # iterative policy evaluation
    # With gamma = 1, a policy that never reaches a terminal has no finite value;
    # for such policies the 2000-sweep cap below is what ends the loop.
    V = np.zeros(16)
    for _ in range(2000):
        d = 0.0
        for s in range(16):
            if s in TERM:
                continue
            v = -1 + gamma * V[step(s, pi[s])]
            d = max(d, abs(v - V[s]))
            V[s] = v
        if d < 1e-10:
            break
    return V

for rounds in range(50):
    V = evaluate(pi)
    stable = True
    for s in range(16):
        if s in TERM:
            continue
        old = pi[s]
        pi[s] = int(np.argmax([-1 + gamma * V[step(s, a)] for a in range(4)]))
        if pi[s] != old:
            stable = False
    if stable:
        break

arrows = np.array(list("^v<>"))   # up down left right
g = arrows[pi]
g[0] = "*"; g[15] = "*"           # mark terminals
print(f"policy iteration converged in {rounds + 1} improvement rounds")
print(g.reshape(4, 4))

INSTRUMENT R2.3 — POLICY-ITERATION VISUALIZER · EVALUATE ⇄ IMPROVE · EQ R2.7
The same gridworld. Each round runs full policy evaluation, then a greedy improvement. Watch the arrows snap toward the exits and the values lock in. The two phases alternate; convergence is when an improvement step changes no arrow.
CONTROL: EVALUATE π · IMPROVE → π′ · RESET ↺ · ROUND 0 · PHASE INIT · ARROWS CHANGED LAST IMPROVE —

Start: every cell points up — a deliberately bad policy whose values are deeply negative. Hit EVALUATE to score it, then IMPROVE to make it greedy; alternate. The policy is optimal the first time an improvement changes zero arrows. Notice the values are already exact for the current policy after each EVALUATE — that is what separates this from value iteration's single-backup steps.

2.5 Why DP needs the model

Every backup in this chapter contains the same fingerprint: \(\sum_{s', r} p(s', r \mid s, a)[\cdot]\).
That sum is an expectation over the environment's dynamics, and to compute it you must know \(p(s', r \mid s, a)\) — the full transition and reward model. This is the defining assumption of dynamic programming, and it is exactly what the next chapter abandons. You rarely have the model. A robot does not ship with a transition table; a game agent is not handed the opponent's policy; a recommender does not know how a user will react. The whole field of model-free RL exists because \(p\) is usually unavailable, and learning it well enough to plan against can be harder than learning to act directly. Even with the model, the sweep is expensive. Each sweep touches every state and, per state, every action and every possible successor: \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) per sweep for a dense model. This is the curse of dimensionality — a robot arm with ten joints discretized into a hundred positions each has \(100^{10}\) states. Exact DP is a guarantee that does not scale; its value is conceptual and as a subroutine inside approximate methods. The escape route is sampling. Replace the exact expectation \(\sum_{s'} p(s'\mid s,a)[\cdot]\) with a sample drawn by actually taking action \(a\) and observing where you land. That single substitution — backup over a sampled transition instead of the known distribution — turns value iteration into Q-learning and policy evaluation into temporal-difference learning. Everything keeps the GPI skeleton of this chapter; only the backup target changes from a model expectation to a sampled estimate. So DP is not a dead end — it is the ground truth the rest of RL approximates. When you see TD learning's update \(V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]\) in the next chapter, recognize it as EQ R2.4 with the expectation replaced by one sample and a learning rate \(\alpha\) to smooth the noise. The Bellman equation is the destination; sampling is how you get there without a map. 
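The substitution of a sample for the expectation can be seen in isolation on a single state–action pair. In the sketch below every number is made up for illustration: `p`, `reward`, and `V_next` describe a hypothetical action with three possible outcomes, and `alpha` is an arbitrary small step size. The exact backup sums against the model; the sampled version only ever sees one outcome at a time and nudges a running estimate toward it.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.002

# Hypothetical single (s, a) pair with three possible outcomes.
p      = np.array([0.7, 0.2, 0.1])     # p(s' | s, a), known only to the planner
reward = np.array([0.0, 1.0, 5.0])     # reward received on each outcome
V_next = np.array([2.0, 4.0, 10.0])    # current value estimates of the successors

# Dynamic programming: the expectation is computed by summing against the model.
exact = float(np.sum(p * (reward + gamma * V_next)))

# Model-free: act, observe one outcome, move the estimate a little toward its target.
outcomes = rng.choice(3, size=20000, p=p)
estimate = 0.0
for i in outcomes:
    target = reward[i] + gamma * V_next[i]     # backup over one sampled transition
    estimate += alpha * (target - estimate)    # same shape as the TD update

print(f"exact expected backup: {exact:.3f}")
print(f"sampled estimate:      {estimate:.3f}")
```

The exact value here is \(0.7(1.8) + 0.2(4.6) + 0.1(14) = 3.58\). The sampled estimate should hover near it without matching exactly; a constant `alpha` never fully averages out the noise, which is why convergence results for these methods ask for a decaying step size.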
| Method | Backup | Improvement | Converges in | Needs model? |
|---|---|---|---|---|
| Policy evaluation | expected, π-average | none | → \(V^\pi\), geometric | yes |
| Value iteration | expected, max | implicit, every sweep | → \(V^{*}\), geometric (γ) | yes |
| Policy iteration | expected, max | full greedy step | → \(\pi^{*}\), finite rounds | yes |
| TD / Q-learning | sampled | greedy / ε-greedy | → approx, asymptotic | no |

NEXT Drop the model, keep the Bellman equation. Chapter 03 takes the exact expectations you just iterated and replaces them with samples from experience — Monte-Carlo returns and temporal-difference learning, the first algorithms that learn a value function from interaction alone, with no transition table in sight.

2.R References
Bellman, R. (1957). Dynamic Programming. Princeton University Press — the founding text; the principle of optimality and the recursive value relation behind EQ R2.1–R2.3.
Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the MDP formalization and the optimality equation in its original form.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press — introduced policy iteration (EQ R2.7) and the policy improvement theorem.
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press, Ch. 4 — iterative policy evaluation, value iteration, policy iteration, and the 4×4 gridworld (Fig. 4.1) used throughout this chapter.
Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24(11) — the m-sweep interpolation between value and policy iteration cited in §2.4.
Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific — the contraction-mapping convergence analysis (EQ R2.6) in full rigor.
## RL · Model-Free Value Methods (https://ai-encyclopedia.com/rl/03-model-free-value.html)
Model-Free Value Methods — TD & Q-Learning — AI Encyclopedia AI // ENCYCLOPEDIA / REINFORCEMENT LEARNING / 03 / MODEL-FREE INDEX NEXT: POLICY GRADIENTS → REINFORCEMENT LEARNING · CHAPTER 03 / 06 Model-Free Value Methods — TD & Q-Learning Dynamic programming could solve any MDP, provided you handed it the transition probabilities and rewards. Real agents are not handed the model. They are dropped into a world and must learn from what happens to them. This chapter covers learning the value of actions from experience alone: temporal-difference learning bootstraps a guess from a guess, and it works. From that one idea follow Q-learning and SARSA, the algorithms that put the learning in reinforcement learning. LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON RL 01–02 · MARKOV CHAINS INSTRUMENTS Q-GRIDWORLD · TD vs MC · ε-DECAY IN THIS CHAPTER 3.1 Learning without a model 3.2 Monte-Carlo prediction 3.3 Temporal-difference — TD(0) 3.4 Q-learning (off-policy) 3.5 SARSA & exploration 3.R References 3.1 Learning without a model Chapter 02 was a luxury we will now give up. There, the agent knew the environment — the transition function \(P(s' \mid s, a)\) and the reward \(R(s, a)\) — and could compute the optimal policy by pure thought, sweeping the Bellman equations until they converged. That is planning, and it is exactly the engine behind the Gridworld instrument of Chapter 01. But knowing \(P\) and \(R\) is a strong assumption that almost never holds. A robot does not have a probability table for how its motors slip on a given floor. A game-playing agent is not handed the rules as equations. The model is missing, and the agent must work without it. This is the regime of model-free reinforcement learning: the agent never builds or is given a model of the world. Instead it learns directly from samples — actual transitions \((s, a, r, s')\) it experiences by acting. 
Where dynamic programming computes an expectation over all possible next states by summing against \(P\), model-free methods replace that expectation with sampled experience. They do not ask "what is the average outcome of this action?"; they take the action, see one outcome, and nudge their estimate toward it. Average enough samples and you recover the expectation the model would have given you — the law of large numbers doing the work the model used to do. [Figure, two panels linked by the label "drop the model". MODEL-BASED (CH 02): knows P(s′|s,a), R(s,a); sum over all next states; PLAN, no acting needed. MODEL-FREE (THIS CH): knows nothing, only samples; one (s,a,r,s′) at a time; LEARN, must act to learn.] Same goal — the optimal value function and policy — reached two ways. Planning sums against a known model; learning averages sampled experience. This chapter lives entirely on the right. Two questions organize everything that follows, and they map onto two classical problems. Prediction (also called policy evaluation): given a fixed policy \(\pi\), estimate its value function \(V^\pi\) — how good is this way of behaving? Control: find a good policy in the first place, typically by estimating action-values \(Q^\pi\) and improving the policy toward them. Sections 3.2 and 3.3 solve prediction two different ways, by Monte-Carlo averaging and by temporal-difference learning; Sections 3.4 and 3.5 carry the temporal-difference idea over to control. One design choice cuts across all of it and deserves naming now. A method is on-policy if it learns about the very policy it is using to act, and off-policy if it can learn about one policy (say, the greedy optimal one) while behaving according to another (say, an exploratory one). It sounds like a technicality. It is the single most consequential distinction in this chapter — it is exactly what separates Q-learning from SARSA — and we will see it decide how an agent behaves on the edge of a cliff. A useful sanity check on the whole enterprise: a model-free agent can become superhuman at a game without ever being able to describe the game.
It learns which moves are worth what, not why. That is a strength — no modeling effort — and a weakness — no transfer, no planning ahead, every new world relearned from scratch. The model-based methods of later chapters trade sample efficiency back for exactly that missing structure. 3.2 Monte-Carlo prediction The most direct way to estimate a value from experience is to take the definition literally. The value \(V^\pi(s)\) is the expected return from \(s\) (Chapter 01, EQ R1.6). So play out complete episodes under \(\pi\), and for each state record the actual return that followed it. Average those returns and you have a sample estimate of the expectation. This is the Monte-Carlo (MC) method: estimate an expectation by averaging samples of it. EQ R3.1 — MONTE-CARLO VALUE ESTIMATE $$ V(s) \;\leftarrow\; V(s) + \alpha\,\big[\, G_t - V(s) \,\big], \qquad G_t = \sum_{k=0}^{\infty} \gamma^k\, R_{t+k+1} $$ After an episode ends, compute the actual return \(G_t\) that followed each visit to \(s\), and move the estimate a fraction \(\alpha\) of the way toward it. With \(\alpha = 1/N(s)\) — one over the number of visits — this is exactly the running sample mean. The target \(G_t\) is the true, observed return: no model, no bootstrap, nothing estimated stands in for it. That purity is MC's defining virtue and its defining limitation. The error term \(G_t - V(s)\) is worth dwelling on, because the same shape recurs in every update rule in this chapter. It is a prediction error: the gap between what actually happened (\(G_t\)) and what we predicted would happen (\(V(s)\)). Learning is nothing but repeatedly shrinking that gap, with \(\alpha\) setting how aggressively. A large \(\alpha\) chases the latest sample; a small \(\alpha\) averages patiently over many. Set \(\alpha\) too high and the estimate jitters with noise; too low and it crawls. MC's strengths are real. 
It is unbiased — \(G_t\) is an honest sample of the return, so the estimate converges to the true \(V^\pi\) with no systematic error. It makes no Markov assumption — it never reasons about next-state values, only whole returns, so it works even where the state representation is imperfect. And it is simple to state. But it has a hard limitation that motivates the rest of the chapter: you must wait until the episode ends to compute \(G_t\). In a long episode that is slow; in a continuing task that never terminates, it is impossible. MC also tends to have high variance, because \(G_t\) is the sum of a long chain of random rewards and random transitions — one unlucky tail can swing it wildly.

An episode from state \(s\) yields rewards \(1, 1, 1\) on three successive steps and then terminates. With \(\gamma = 0.9\), what is the Monte-Carlo target \(G_t\) used to update \(V(s)\) in EQ R3.1?

The MC target is the full observed return: \(G_t = 1 + 0.9\cdot 1 + 0.9^2\cdot 1 = 1 + 0.9 + 0.81 = \) 2.71. MC waits for the whole episode and uses this actual number — never an estimate of it.

PYTHON · RUNNABLE IN-BROWSER

# Monte-Carlo prediction on the 5-state random walk (Sutton & Barto, ex. 6.2)
# True values are linear: V(s) = s / 6. Start every estimate at 0.5.
import numpy as np

rng = np.random.default_rng(1)

def episode():   # random walk; left of 1 -> 0 (r=0), right of 5 -> 6 (r=1)
    s, traj = 3, []
    while 1 <= s <= 5:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        traj.append((s, 1.0 if s2 == 6 else 0.0))
        s = s2
    return traj

V = np.full(7, 0.5)
V[0] = V[6] = 0.0   # terminals have value 0
alpha = 0.05
for _ in range(200):
    traj = episode()
    G = traj[-1][1]   # gamma = 1, so every state's return = final reward
    for (s, _r) in traj:
        V[s] += alpha * (G - V[s])   # EQ R3.1: nudge toward the actual return

true = np.array([s / 6 for s in range(1, 6)])
print("MC estimate V[1..5]:", V[1:6].round(3))
print("true value V[1..5]:", true.round(3))
print("mean abs error:", round(float(np.abs(V[1:6] - true).mean()), 4))

3.3 Temporal-difference learning — TD(0)

Here is the idea this chapter is built around, and it is one of the most beautiful in machine learning. MC waits for the full return \(G_t\). But recall the recursive form of the return (Chapter 01, EQ R1.4): \(G_t = R_{t+1} + \gamma\, G_{t+1}\). We do not have \(G_{t+1}\) — the episode has not finished — but we do have an estimate of it: \(V(s_{t+1})\), our current guess for the value of the next state. So substitute the guess in for the unknown return. The update becomes:

EQ R3.2 — TD(0) UPDATE $$ V(s_t) \;\leftarrow\; V(s_t) + \alpha\,\big[\, \underbrace{R_{t+1} + \gamma\, V(s_{t+1})}_{\text{TD target}} - V(s_t) \,\big] $$ The target is no longer the full return but one real reward plus the discounted estimate of the rest. This is bootstrapping: updating an estimate from another estimate. It means you can learn from a single step, online, before the episode is anywhere near over — even in a task that never ends. The bracket is the TD error \(\delta_t\), the surprise between what you expected and what one step of reality plus your own forecast now suggests.
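As a single step, EQ R3.2 amounts to three lines of arithmetic. The sketch below is one possible way to package it, not the chapter's implementation: the function name `td0_update`, its `terminal` flag, and the numbers in the example are all assumptions chosen for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update (EQ R3.2) to V in place and return the TD error."""
    bootstrap = 0.0 if terminal else V[s_next]   # a terminal successor has value 0
    delta = r + gamma * bootstrap - V[s]         # TD target minus the old estimate
    V[s] += alpha * delta                        # move a fraction alpha toward the target
    return delta

# Made-up estimates for two states, and one observed transition s -> s_next.
V = {"s": 2.0, "s_next": 3.0}
delta = td0_update(V, "s", r=1.0, s_next="s_next", alpha=0.1, gamma=0.9)
print(f"TD error = {delta:.2f}, new V(s) = {V['s']:.2f}")   # TD error = 1.70, new V(s) = 2.17
```

Here the target is \(1 + 0.9 \cdot 3 = 3.7\), the error is \(3.7 - 2 = 1.7\), and the estimate moves a tenth of that distance. Nothing about the rest of the episode is needed, which is the point of bootstrapping.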
The bracketed quantity is named: the temporal-difference error, \(\delta_t = R_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)\). It is the difference between two successive predictions of the same return — one made before seeing \(R_{t+1}\), one after — hence "temporal difference". When \(\delta_t = 0\) everywhere, predictions are self-consistent and learning stops; this is the sampled, online cousin of the Bellman fixed point from Chapter 02. The TD error is not merely an algorithmic device: dopamine neurons in the brain encode a signal strikingly close to \(\delta_t\), which is part of why TD learning is one of the rare ideas that crossed from machine learning into neuroscience. EQ R3.3 — THE TD ERROR $$ \delta_t \;=\; R_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) $$ \(\delta_t > 0\): the step went better than predicted — raise \(V(s_t)\). \(\delta_t < 0\): the step went worse than predicted, so lower \(V(s_t)\). Every value method in modern RL, up to and including the critics inside today's policy-gradient and RLHF stacks, is ultimately driving a TD error to zero. MC's error \(G_t - V(s_t)\) is the special case where you wait for the whole return instead of bootstrapping after one step. The trade-off between MC and TD is a genuine one, not a free lunch, and the honest framing is bias versus variance. The TD target \(R_{t+1} + \gamma V(s_{t+1})\) is biased: it leans on \(V(s_{t+1})\), which is only a guess and is wrong early in training, so TD is "learning from a guess". But it has much lower variance than MC, because it depends on only one random reward and one transition rather than an entire random trajectory. In practice TD's lower variance usually wins — it converges faster on most problems — but MC's lack of bias and its independence from the Markov assumption keep it relevant, and the two are endpoints of a spectrum (TD(\(\lambda\)), \(n\)-step returns) that interpolates between them. There is no universal winner; which is better is genuinely problem-dependent, a point Sutton & Barto are careful to make and which remains true today.
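The spectrum between the two endpoints can be made concrete with the \(n\)-step return. The text names it without defining it, so the definition used below is an assumption: the standard one, \(n\) real rewards followed by a bootstrap from \(V(s_{t+n})\), or the plain return if the episode ends first. The function name `n_step_target` and the sample numbers are illustrative only.

```python
def n_step_target(rewards, values, t, n, gamma):
    """n-step return from time t (standard definition, assumed here).

    rewards[k] is R_{k+1}, the reward for the step taken at time k.
    values[k]  is the current estimate V(s_k).
    The episode terminates after len(rewards) steps.
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                                   # episode not over: bootstrap
        G += gamma ** (end - t) * values[end]
    return G

rewards = [1.0, 1.0, 1.0]     # three steps, then termination
values = [0.5, 0.5, 0.5]      # made-up estimates for s_0, s_1, s_2
gamma = 0.9

td_target = n_step_target(rewards, values, t=0, n=1, gamma=gamma)   # R_1 + gamma * V(s_1)
two_step = n_step_target(rewards, values, t=0, n=2, gamma=gamma)
mc_target = n_step_target(rewards, values, t=0, n=3, gamma=gamma)   # full return, no bootstrap

print(f"n=1 (TD target): {td_target:.3f}")
print(f"n=2:             {two_step:.3f}")
print(f"n=3 (MC return): {mc_target:.3f}")
```

With \(n = 1\) this is the TD target, \(1 + 0.9 \cdot 0.5 = 1.45\); with \(n\) at least the episode length it is the Monte-Carlo return, \(1 + 0.9 + 0.81 = 2.71\). Intermediate \(n\) uses more real reward and less guess, trading bias for variance one step at a time.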
| Property | Monte-Carlo | TD(0) |
|---|---|---|
| Target | G_t (full return) | R + γ·V(s′) (one step + bootstrap) |
| Updates | at episode end only | every step, online |
| Continuing tasks | cannot (needs termination) | yes |
| Bias | unbiased | biased (bootstraps) |
| Variance | high | low |
| Needs Markov state | no | yes (relies on V(s′)) |

A TD(0) agent in state \(s\) takes a step, gets reward \(R_{t+1} = 0\), and lands in \(s'\) with current estimate \(V(s') = 1\). The old estimate is \(V(s) = 0.5\) and \(\gamma = 1\). What is the TD error \(\delta_t\) (EQ R3.3)?

\(\delta_t = R_{t+1} + \gamma\,V(s') - V(s) = 0 + 1\cdot 1 - 0.5 = \) 0.5. Positive, so the step beat expectations and \(V(s)\) gets raised by \(\alpha\cdot 0.5\) — even though the episode has not ended and no real return was ever observed.

PYTHON · RUNNABLE IN-BROWSER

    # TD(0) value estimation vs Monte-Carlo on the same random walk
    import numpy as np
    rng = np.random.default_rng(1)

    def episode():
        s, traj = 3, []
        while 1 <= s <= 5:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            traj.append((s, 1.0 if s2 == 6 else 0.0, s2))
            s = s2
        return traj

    true = np.array([s / 6 for s in range(1, 6)])
    Vtd = np.full(7, 0.5); Vtd[0] = Vtd[6] = 0.0  # TD(0)
    Vmc = np.full(7, 0.5); Vmc[0] = Vmc[6] = 0.0  # Monte-Carlo
    a = 0.05
    for _ in range(150):
        traj = episode()
        for (s, r, s2) in traj:  # TD: bootstrap every step (EQ R3.2)
            Vtd[s] += a * (r + 1.0 * Vtd[s2] - Vtd[s])
        G = traj[-1][1]
        for (s, r, s2) in traj:  # MC: wait for the return (EQ R3.1)
            Vmc[s] += a * (G - Vmc[s])

    print("state:", list(range(1, 6)))
    print("TD(0) estimate:", Vtd[1:6].round(3).tolist())
    print("MC estimate:", Vmc[1:6].round(3).tolist())
    print("true value:", true.round(3).tolist())
    plot_xy(list(range(1, 6)), Vtd[1:6].tolist())  # TD curve vs the linear truth

INSTRUMENT R3.1 — TD vs MC UPDATE
ONE STEP · TD TARGET = R + γV(s′) · MC TARGET = G
OLD ESTIMATE V(s) 0.50 · REWARD R 0.00 · NEXT-STATE V(s′) 1.00 · FULL RETURN G 1.50 · LEARNING RATE α 0.50 · DISCOUNT γ 0.90
READOUTS: TD TARGET R+γV(s′) · TD ERROR δ · NEW V(s) (TD) · NEW V(s) (MC)

The two markers are the targets each method aims at from the same old estimate (the dashed line): mint is the bootstrapped TD target \(R+\gamma V(s')\), blue is the full Monte-Carlo return \(G\). The arrow shows the step \(\alpha\,[\text{target}-V(s)]\) each takes. Slide \(V(s')\) and only the TD target moves — that dependence on a guess is the bias TD pays for learning online. Slide \(G\) and only MC moves — that swing is the variance MC pays for being unbiased. Set \(\alpha = 1\) and each estimate jumps straight onto its target in a single step.

3.4 Q-learning (off-policy)

Prediction estimates the value of a given policy. Control finds a good policy. To do control without a model we need action-values \(Q(s, a)\), not state-values \(V(s)\) — because, as Chapter 01 argued, you cannot act greedily from \(V\) without knowing where actions lead, but greedy action selection from \(Q\) is trivial: take \(\arg\max_a Q(s, a)\), no model required. So we run TD on \(Q\) instead of \(V\). The most famous instance is Q-learning (Watkins, 1989), and it has a remarkable property hidden in one symbol.

EQ R3.4 — Q-LEARNING UPDATE
$$ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha\,\Big[\, R_{t+1} + \gamma\, \underbrace{\max_{a'} Q(s_{t+1}, a')}_{\text{greedy bootstrap}} - Q(s_t, a_t) \,\Big] $$
The bootstrap uses \(\max_{a'} Q(s_{t+1}, a')\) — the value of the best next action, regardless of what the agent actually does next. This is what makes Q-learning off-policy: it learns the optimal action-value function \(Q^*\) while behaving with any sufficiently exploratory policy. The agent can blunder around ε-greedily, even act randomly, and still converge to \(Q^*\) — provided every state–action pair keeps getting visited and \(\alpha\) decays appropriately. This decoupling of behavior from learning is the property that made deep Q-networks possible.

The convergence guarantee is one of the cornerstone results of the field.
Watkins & Dayan (1992) proved that for a finite MDP, tabular Q-learning converges to the optimal \(Q^*\) with probability 1, under two conditions: every state–action pair is visited infinitely often, and the learning rate satisfies the Robbins–Monro conditions \(\sum_t \alpha_t = \infty,\ \sum_t \alpha_t^2 < \infty\). Note the scope: the proof covers only the tabular case. The moment \(Q\) is approximated by a neural network — as in deep Q-learning — the guarantee evaporates, and the combination of bootstrapping, off-policy learning, and function approximation (the "deadly triad") can diverge. Much of deep RL engineering is heuristics to tame that triad; it is contested territory, not settled.

One more honest caveat that motivated a whole follow-up algorithm: the \(\max\) operator introduces maximization bias. Because \(\max\) over noisy estimates systematically picks the ones that happen to be overestimated, Q-learning tends to overestimate action-values. Double Q-learning (van Hasselt, 2010) fixes this by decoupling the action selected by the max from the value used to evaluate it — the idea behind Double DQN. Q-learning is foundational, not flawless.

Apply a single Q-learning update (EQ R3.4) with \(\alpha = 0.5\), reward \(r = 1\), discount \(\gamma = 0\), current \(Q(s,a) = 0\), and \(\max_{a'} Q(s', a') = 0\). What is the new \(Q(s,a)\)?

\(Q \leftarrow 0 + 0.5\big[\,1 + 0\cdot 0 - 0\,\big] = 0.5 \times 1 = \) 0.5. With \(\gamma = 0\) the bootstrap term drops out entirely, so the update is purely a half-step toward the immediate reward — exactly the myopic, one-step learning a zero discount implies.

True or false: Q-learning is off-policy — it learns about the greedy optimal policy via the \(\max_{a'}\) bootstrap while behaving with a different, exploratory policy — whereas SARSA is on-policy, bootstrapping from the action it actually takes next. (Answer true or false.)
Q-learning's target uses \(\max_{a'} Q(s', a')\), the value of the best next action irrespective of what the agent does — so it learns \(Q^*\) regardless of the behavior policy: off-policy. SARSA's target uses \(Q(s', a')\) for the action \(a'\) the agent will actually take under its current (e.g. ε-greedy) policy — so it learns the value of that very policy: on-policy. The statement is true.

PYTHON · RUNNABLE IN-BROWSER

    # Tabular Q-learning on a 1x4 corridor: states 0..3, state 3 is the goal (+1)
    import numpy as np
    rng = np.random.default_rng(0)
    n, gamma, alpha, eps = 4, 0.9, 0.5, 0.2
    Q = np.zeros((n, 2))  # actions: 0 = left, 1 = right

    def step(s, a):
        s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
        return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

    for ep in range(2000):  # learn by acting eps-greedily
        s = 0
        for _ in range(50):
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            target = r + (0.0 if done else gamma * Q[s2].max())  # EQ R3.4: greedy bootstrap
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
            if done:
                break

    print("learned Q-values (rows = states 0..3, cols = [left, right]):")
    print(Q.round(3))
    print("greedy policy:", ["L R"[int(np.argmax(Q[s])) * 2] for s in range(n)])
    # The +1 arrives on the third step, so it is discounted twice: V*(0) = gamma^2.
    print("optimal V*(0):", round(gamma ** 2, 3), "(= gamma^2: 3 steps to the goal, reward on the 3rd)")

INSTRUMENT R3.2 — Q-LEARNING GRIDWORLD
THE Q-TABLE & GREEDY POLICY FORM LIVE · EQ R3.4
LEARNING RATE α 0.50 · EXPLORATION ε 0.20 · DISCOUNT γ 0.95
READOUTS: EPISODES TRAINED · max_a Q AT START · GREEDY PATH LENGTH

A 4×4 gridworld with a goal (+1, top-right), a trap (−1), a step cost, and deterministic moves. Each cell shows its greedy action and \(\max_a Q(s,a)\); the shading is that value. Press TRAIN and watch the Q-table fill in from the goal outward — value propagates one cell per episode-batch, exactly as the bootstrap in EQ R3.4 carries reward backward through the chain.
Drop ε toward 0 and the agent stops exploring: it may lock onto a decent-but-suboptimal path because it never tried the alternatives. Raise it and the table fills more completely but the agent wanders. Unlike Chapter 01's instrument, this learns purely from sampled steps — it is never told where the goal is. 3.5 SARSA (on-policy) & exploration Change one symbol in the Q-learning update and you get a different algorithm with a different personality. Instead of bootstrapping from the best next action, bootstrap from the action the agent actually takes next under its current policy. The update now uses the quintuple \((s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\) — state, action, reward, state, action — which is exactly where the name SARSA comes from. EQ R3.5 — SARSA UPDATE $$ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha\,\Big[\, R_{t+1} + \gamma\, \underbrace{Q(s_{t+1}, a_{t+1})}_{\text{action actually taken}} - Q(s_t, a_t) \,\Big] $$ The only change from EQ R3.4 is \(\max_{a'} Q(s_{t+1}, a') \to Q(s_{t+1}, a_{t+1})\), where \(a_{t+1}\) is drawn from the agent's own (e.g. ε-greedy) policy. SARSA therefore learns the value of the policy it is actually following — including the cost of its own exploration — which is exactly what "on-policy" means. Q-learning evaluates a greedy policy it does not follow; SARSA evaluates the noisy policy it does. That difference is not academic, and the cleanest illustration is the cliff-walking example from Sutton & Barto. An agent must walk from start to goal along the edge of a cliff; stepping off costs a large penalty. Both algorithms use ε-greedy exploration. Q-learning learns the optimal path — right along the cliff edge — because its \(\max\) bootstrap evaluates the greedy policy, which never falls. But because it still explores while following that knife-edge route, its random ε-steps occasionally pitch it off the cliff, so its online reward during training is worse. 
SARSA learns a safer path one row back from the edge, because its on-policy target accounts for the fact that it sometimes takes random actions — so it learns the value of behaving exploratorily and routes around the risk. Q-learning finds the better policy; SARSA earns more reward while learning. Which you want depends on whether failures during training are cheap or catastrophic — a real engineering decision, not a theoretical curiosity. INTUITION Q-learning is an optimist; SARSA is a realist. Q-learning assumes it will act greedily from the next state on, so it learns the value of the best-case continuation. SARSA assumes it will keep exploring like it currently does, so it learns the value of its actual, imperfect behavior. As \(\varepsilon \to 0\) the two converge — with no exploration there is no difference between "best next action" and "action actually taken". Both algorithms lean on the exploration we met in Chapter 01: ε-greedy with an annealed \(\varepsilon\). The annealing is not optional polish — it is what reconciles two requirements that pull in opposite directions. Convergence needs every state–action pair visited infinitely often (so \(\varepsilon\) must stay positive), but a good final policy needs the agent to eventually stop throwing away reward on random moves (so \(\varepsilon\) must vanish). The resolution is GLIE — Greedy in the Limit with Infinite Exploration — schedules that keep \(\varepsilon > 0\) forever but send it to 0, the textbook example being \(\varepsilon_t = 1/t\). A practical schedule decays \(\varepsilon\) from near 1 toward a small floor; the shape of that decay is one of the most-tuned knobs in applied RL. EQ R3.6 — EXPONENTIAL ε-DECAY $$ \varepsilon_t \;=\; \varepsilon_{\min} + (\varepsilon_0 - \varepsilon_{\min})\, e^{-t/\tau} $$ Start at \(\varepsilon_0\) (often 1.0), decay with time constant \(\tau\) toward a floor \(\varepsilon_{\min}\) (often 0.01–0.05). 
Early on the agent explores widely and fills in its Q-table; late on it exploits what it has learned. The floor matters: a small permanent \(\varepsilon\) hedges against a non-stationary world where the best action can change. Too fast a decay and the agent commits before it has seen enough — premature exploitation, the most common silent failure in applied value-based RL.

An exponential ε-decay schedule (EQ R3.6) uses \(\varepsilon_0 = 1.0\), \(\varepsilon_{\min} = 0.05\), and time constant \(\tau = 500\). As \(t \to \infty\), what value does \(\varepsilon_t\) approach?

As \(t \to \infty\), \(e^{-t/\tau} \to 0\), so the decaying term \((\varepsilon_0 - \varepsilon_{\min})\,e^{-t/\tau} \to 0\) and only the floor survives: \(\varepsilon_t \to \varepsilon_{\min} = \) 0.05. The agent never stops exploring entirely — it keeps a 5% random-action rate forever, a hedge against a changing world.

PYTHON · RUNNABLE IN-BROWSER

    # Exponential epsilon-decay (EQ R3.6) and what fraction of moves stay random
    import numpy as np
    eps0, eps_min, tau = 1.0, 0.05, 500.0
    t = np.arange(0, 3000)
    eps = eps_min + (eps0 - eps_min) * np.exp(-t / tau)  # EQ R3.6
    for step in (0, 250, 500, 1000, 3000 - 1):
        print(f"t = {step:5d} epsilon = {eps[step]:.3f} "
              f"P(random move) = {eps[step]:.1%}")
    print(f"\nfloor approached as t -> inf: {eps_min}")
    print(f"epsilon at one time-constant: {eps[int(tau)]:.3f} "
          f"(~ eps_min + 0.368*(eps0-eps_min))")
    plot_xy(t.tolist(), eps.tolist())  # the classic decay curve

INSTRUMENT R3.3 — ε-DECAY EXPLORER
EXPLORE-THEN-EXPLOIT · EQ R3.6
START ε₀ 1.00 · FLOOR ε_min 0.05 · TIME CONSTANT τ 500
READOUTS: ε AT STEP 0 · ε AT τ (1 e-fold) · STEP TO REACH ε = 0.1

The curve is \(\varepsilon_t\) over training steps: a high-exploration mint region early, decaying to a thin blue floor. The shaded area under the curve is roughly the total "exploration budget" the agent spends.
Shrink \(\tau\) and the agent commits fast — efficient if the world is simple, premature if it is not. Raise the floor and it never fully exploits — wasteful in a fixed world, prudent in a changing one. There is no universally right curve; this is a budget you allocate against how hard the problem is and how costly mistakes are. NEXT Value methods learn what every action is worth, then act greedily — but they choke when actions are continuous or the state space is too vast to tabulate. Chapter 04 takes the other road: policy gradients, which parameterize the policy directly and push its parameters up the gradient of expected return. We will meet REINFORCE, the variance problem that nearly sinks it, and the actor–critic methods that put a TD-learned value function (everything you just built) back to work as the critic. 3.R References Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the off-policy control algorithm of EQ R3.4 and its convergence proof to \(Q^*\). Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 — the original TD(0) and TD(λ) prediction methods (EQ R3.2, EQ R3.3) and the bootstrapping idea at the heart of this chapter. Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical treatment of MC, TD, Q-learning, SARSA, the cliff-walking comparison, and GLIE exploration. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; off-policy Q-learning (EQ R3.4) with a neural Q-function, experience replay, and a target network. van Hasselt, H. (2010). Double Q-learning. NeurIPS 23 — diagnoses and corrects the maximization bias of the \(\max\) operator in EQ R3.4. Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. 
Machine Learning 38 — convergence of SARSA (EQ R3.5) and the GLIE exploration conditions of §3.5.
## RL · Policy Gradients & Actor-Critic (https://ai-encyclopedia.com/rl/04-policy-gradients.html)
REINFORCEMENT LEARNING · CHAPTER 04 / 06 Policy Gradients & Actor-Critic Every method so far has been indirect: estimate how good each action is, then act greedily with respect to those estimates. Policy-gradient methods skip that step. Instead of valuing actions and then acting greedily, optimize the policy itself by gradient ascent on expected reward. Parameterize the policy, differentiate the quantity you care about, expected return, and push the parameters uphill. The result is a family that handles continuous actions, learns genuinely stochastic behavior, and forms the backbone of modern deep RL and RLHF. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON CH 01 · 03 · STATS INSTRUMENTS BANDIT ASCENT · BASELINE · ACTOR-CRITIC IN THIS CHAPTER 4.1 Optimizing the policy directly 4.2 The policy gradient theorem 4.3 REINFORCE & the baseline 4.4 Actor-critic methods 4.5 A2C / A3C 4.R References 4.1 Optimizing the policy directly The value-based methods of the previous chapters — Q-learning, SARSA — all share a shape: learn a value function, then read a policy off it by taking \(\arg\max_a Q(s,a)\). The value function is the object you fit; the policy is a side effect. Policy-gradient methods invert this. They treat the policy as the primary object, give it its own parameters \(\theta\), and optimize those parameters to maximize expected return directly. There is no \(\arg\max\) at the end — the policy is the answer. Write the policy as a differentiable function \(\pi_\theta(a \mid s)\): a neural network whose output is a probability distribution over actions, with parameters \(\theta\) you can move.
The quantity we want to maximize is the expected return under that policy — the same return from Chapter 01 (EQ R1.3), now viewed as a function of \(\theta\): EQ R4.1 — THE OBJECTIVE $$ J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[\, R(\tau) \,\big] \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r_{t+1} \,\right] $$ \(\tau = (s_0, a_0, s_1, a_1, \ldots)\) is a trajectory the policy rolls out; \(R(\tau)\) is its total discounted return. The expectation is over every source of randomness — the policy's action choices and the environment's transitions. We are no longer fitting a value; we are doing gradient ascent on the thing we actually care about. The only obstacle is that the distribution we average over, \(\pi_\theta\), is itself what we are differentiating — and that is exactly what the next section resolves. Why bother, when value methods already work? Three reasons make policy gradients indispensable, not merely an alternative. Continuous and high-dimensional action spaces. Taking \(\arg\max_a Q(s,a)\) over a continuous \(a\) — a torque, a steering angle, a 50-joint robot pose — is itself an optimization problem at every step. A policy network simply outputs the action (or its distribution), no inner search required. This is why robotics and control are policy-gradient territory. Stochastic optimal policies. The greedy policy of a value method is deterministic. But in partially-observed environments and in every game with a bluff, the optimal policy is irreducibly random — rock-paper-scissors has no good deterministic strategy. Policy gradients can represent and learn such policies natively. Smooth improvement. A small change to \(\theta\) is a small change to the policy. Value methods can flip the entire greedy policy from one \(\arg\max\) to another over an infinitesimal change in \(Q\), which makes their learning brittle. Gradient ascent on \(\pi_\theta\) moves the behavior continuously. 
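The "smooth improvement" point can be checked numerically. The sketch below compares a greedy policy with a softmax policy when one score moves by a tiny amount; using the same two numbers as both Q-values and policy logits is purely an illustrative assumption, and the values are made up.

```python
import numpy as np

def greedy(q):
    """Deterministic greedy policy: index of the largest value."""
    return int(np.argmax(q))

def softmax(logits):
    """Stochastic softmax policy over the same scores."""
    z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return z / z.sum()

q_before = np.array([1.000, 0.999])  # action 0 is marginally ahead
q_after = np.array([1.000, 1.001])   # action 1 is marginally ahead

print("greedy: ", greedy(q_before), "->", greedy(q_after))
print("softmax:", softmax(q_before).round(4), "->", softmax(q_after).round(4))
```

A change of 0.002 in one score flips the greedy action outright, while the softmax probabilities move by well under one percentage point.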
The cost of this directness is the dominant theme of the chapter: policy-gradient estimates are unbiased but high-variance. You are estimating a gradient from noisy rollouts of a stochastic policy in a stochastic world. Taming that variance — first with baselines (§4.3), then with a learned critic (§4.4) — is most of what separates a toy from a working algorithm. 4.2 The policy gradient theorem To ascend \(J(\theta)\) we need its gradient. The difficulty is that \(\theta\) appears inside the distribution we are taking the expectation over, so we cannot just differentiate the integrand. The fix is the log-derivative trick (also called the score-function or likelihood-ratio estimator), an identity that turns the gradient of an expectation into an expectation of a gradient: EQ R4.2 — THE LOG-DERIVATIVE TRICK $$ \nabla_\theta\, \mathbb{E}_{x \sim p_\theta}[\,f(x)\,] \;=\; \mathbb{E}_{x \sim p_\theta}\!\big[\, f(x)\, \nabla_\theta \log p_\theta(x) \,\big] $$ The single identity behind every policy gradient. It follows from \(\nabla_\theta p_\theta = p_\theta\, \nabla_\theta \log p_\theta\) (because \(\nabla \log p = \nabla p / p\)). Its magic is that the right-hand side is itself an expectation under \(p_\theta\) — so it can be estimated by sampling, with no knowledge of how the distribution was generated. The environment's transition probabilities \(P\) drop out entirely, because they do not depend on \(\theta\): we never need a model of the world. Apply this to the objective. A trajectory's probability factorizes into the environment's transitions (which do not depend on \(\theta\)) and the policy's action choices (which do). When we take \(\nabla_\theta \log p_\theta(\tau)\), every transition term differentiates to zero and only the policy terms survive. 
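Spelled out, the factorization and its log-gradient look as follows. Here \(\rho_0\) denotes the initial-state distribution, a symbol assumed for this sketch rather than defined in the text above; the environment term is written as in the rest of the chapter.

$$ p_\theta(\tau) \;=\; \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1}, r_{t+1} \mid s_t, a_t) $$

$$ \nabla_\theta \log p_\theta(\tau) \;=\; \underbrace{\nabla_\theta \log \rho_0(s_0)}_{=\,0} \;+\; \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \;+\; \sum_{t=0}^{T} \underbrace{\nabla_\theta \log P(s_{t+1}, r_{t+1} \mid s_t, a_t)}_{=\,0} \;=\; \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) $$

Substituting this into EQ R4.2 with \(f(\tau) = R(\tau)\) leaves an expectation of \(R(\tau)\) times a sum of policy scores, with no environment term left to differentiate.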
The result is the policy gradient theorem: EQ R4.3 — THE POLICY GRADIENT THEOREM $$ \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \Psi_t \,\right] $$ \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is the score — the direction in parameter space that makes the action just taken more likely. \(\Psi_t\) is a scalar weight that says how much we should reinforce that action. The whole zoo of policy-gradient algorithms is one choice: what to plug in for \(\Psi_t\). The full return \(R(\tau)\), the return-to-go \(G_t\), the advantage \(A^\pi(s_t,a_t)\), the TD error — each is a valid \(\Psi_t\), and they trade bias against variance differently (§4.3–4.4). The intuition is worth stating in plain language, because it is the entire algorithm. Each gradient step nudges the parameters to increase the log-probability of actions that led to high reward, and decrease it for actions that led to low reward, weighted by how good the outcome was. The policy is not told the right action — it is only told whether what it did was, on balance, worth doing more often. That is trial-and-error learning written as calculus. A softmax policy over two actions currently assigns \(\pi_\theta(a \mid s) = 0.6\) to the action the agent actually sampled. For a softmax parameterization the score with respect to that action's logit is \(1 - \pi_\theta(a \mid s)\). What is the score \(\nabla_{\theta_a} \log \pi_\theta(a \mid s)\)? For a softmax (the standard discrete policy), \(\nabla_{\theta_i} \log \pi_\theta(a \mid s) = \mathbb{1}[i = a] - \pi_\theta(i \mid s)\). For the sampled action itself \((i = a)\) the indicator is \(1\), so the score is \(1 - \pi_\theta(a \mid s) = 1 - 0.6 = \) 0.4. It is positive because raising this action's logit raises its (sub-one) probability — the update will push exactly that way if the reward weight \(\Psi_t\) is positive. Two technical notes experts will insist on. 
First, the theorem is exact for the discounted objective only with a subtle discounting of the state distribution that practical implementations almost universally ignore; the resulting estimator is a slightly biased but well-behaved approximation that everyone uses. Second, the score-function estimator is unbiased but, as warned, high-variance — the same trajectory return \(R(\tau)\) multiplies every action's score, so a single lucky or unlucky rollout swings the whole gradient. Fixing that is §4.3. 4.3 REINFORCE & the baseline The oldest and simplest realization of EQ R4.3 is REINFORCE (Williams, 1992): a pure Monte-Carlo policy gradient. Run an episode to completion, compute the return-to-go \(G_t\) from each step, and take one gradient step with \(\Psi_t = G_t\). No value function, no bootstrapping — just rollouts and the log-derivative trick. EQ R4.4 — REINFORCE UPDATE $$ \theta \;\leftarrow\; \theta \;+\; \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; G_t, \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k+1} $$ \(\alpha\) is the learning rate, \(G_t\) the return-to-go from step \(t\). Note that only rewards after \(a_t\) appear in \(G_t\) — an action cannot be credited for reward that preceded it, the causality refinement that already cuts variance versus weighting by the whole-episode return. REINFORCE is unbiased and dead simple, but it learns slowly: it must wait for an entire episode, and the raw magnitude of \(G_t\) makes its gradient estimates extremely noisy. That noise has a specific and fixable cause. Suppose every reward in your environment is large and positive — say returns hover around \(+100\). Then every action gets reinforced (its log-probability pushed up), just by different amounts. The gradient is dominated by the shared offset of \(100\) rather than by the differences that actually distinguish good actions from bad. The estimator is still unbiased, but its variance is enormous and learning crawls. 
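The return-to-go \(G_t\) of EQ R4.4 does not need a fresh sum at every step; one backward pass using \(G_t = r_{t+1} + \gamma\, G_{t+1}\) yields all of them. A minimal sketch follows, where the helper name `returns_to_go` and the convention that `rewards[t]` holds \(r_{t+1}\) are assumptions of this example.

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, ..., G_T] for one finished episode (EQ R4.4)."""
    out = [0.0] * len(rewards)
    G = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # G_t = r_{t+1} + gamma * G_{t+1}
        out[t] = G
    return out

print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Each \(G_t\) contains only rewards from step \(t\) onward, which is the causality refinement described for EQ R4.4.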
The cure is a baseline: subtract a reference value \(b(s)\) from the return before weighting the score. The remarkable fact — the one that makes baselines free — is that any baseline that does not depend on the action leaves the gradient unbiased, because the expected score is zero: EQ R4.5 — BASELINE LEAVES THE GRADIENT UNBIASED $$ \mathbb{E}_{a \sim \pi_\theta}\!\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\; b(s) \,\big] \;=\; b(s)\, \nabla_\theta \!\sum_{a} \pi_\theta(a \mid s) \;=\; b(s)\, \nabla_\theta\, 1 \;=\; 0 $$ Because probabilities sum to one, \(\sum_a \pi_\theta(a\mid s) = 1\) is constant, so its gradient is exactly zero. Subtracting \(b(s)\) therefore adds zero in expectation — the gradient stays pointed the same way — while it can dramatically reduce variance by re-centering the returns around their typical value. The near-optimal choice for \(b(s)\) is the state-value \(V^\pi(s)\): then the weight becomes the advantage \(G_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)\), which asks the only question that matters — did this action beat the policy's own average from here? So REINFORCE-with-baseline weights each score by \(G_t - b(s_t)\). With \(b(s) = V^\pi(s)\), an action that did better than expected gets reinforced and one that did worse gets suppressed — even if both produced positive raw return. This is the conceptual hinge of the chapter, and it points straight at actor-critic: if a learned \(V^\pi(s)\) is the best baseline, learn one. True or false: subtracting a baseline \(b(s)\) that depends only on the state (not the action) reduces the variance of the policy-gradient estimate without introducing any bias. (Answer true or false.) By EQ R4.5, \(\mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a\mid s)\, b(s)] = b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s) = b(s)\,\nabla_\theta 1 = 0\). 
The subtracted term contributes nothing in expectation, so the gradient is unchanged (no bias), while re-centering the returns can sharply cut variance. The statement is true — this is the single most important variance-reduction tool in policy-gradient RL.

PYTHON · RUNNABLE IN-BROWSER

    # REINFORCE on a 2-armed bandit: a softmax policy ascends toward the better arm
    import numpy as np
    rng = np.random.default_rng(0)
    true_mean = np.array([1.0, 2.0])  # arm 1 is genuinely better
    theta = np.zeros(2)               # policy logits (one state, no transitions)
    alpha = 0.1
    for t in range(400):
        p = np.exp(theta - theta.max()); p /= p.sum()  # softmax policy pi(a)
        a = rng.choice(2, p=p)                         # sample an action
        reward = true_mean[a] + rng.normal(0, 1)       # noisy reward
        score = -p.copy(); score[a] += 1.0             # d log pi / d theta = 1[i=a] - p
        theta += alpha * reward * score                # EQ R4.4, one-step bandit
        if t in (0, 50, 200, 399):
            print(f"step {t:3d}: pi = [{p[0]:.3f}, {p[1]:.3f}]")
    p = np.exp(theta - theta.max()); p /= p.sum()
    print(f"\nconverged policy pi(better arm) = {p[1]:.3f} (started at 0.500)")
    print("the policy climbed -- no value function, no argmax, just gradient ascent.")

INSTRUMENT R4.1 — POLICY-GRADIENT ON A BANDIT
SOFTMAX POLICY · ONLINE ASCENT · EQ R4.4
LEARNING RATE α 0.10 · REWARD GAP (ARM B − ARM A) 1.0
READOUTS: π(BETTER ARM) · STEPS RUN · AVG REWARD

Two arms; the green one pays more on average. The mint curve is the policy's probability of pulling the better arm — it starts at exactly 0.5 (no preference) and climbs as ascent reinforces the actions that earned reward. Press STEP ×20 to advance 20 rollouts at a time. Raise the learning rate and it climbs faster but jitters more; shrink the reward gap to zero and the two arms become indistinguishable, so the policy has nothing to learn and the curve wanders near 0.5. This is EQ R4.4 with one state — policy gradients stripped to their skeleton.
The instrument above also exposes the variance problem viscerally: with a small reward gap the curve thrashes, because the gradient signal is buried in noise. The next demonstration isolates exactly that effect — and the baseline's cure.

PYTHON · RUNNABLE IN-BROWSER

    # Baseline = variance reduction. A large constant reward offset wrecks the
    # naive gradient; subtracting a running baseline restores it. (EQ R4.5)
    import numpy as np

    def train(use_baseline, seed=1, steps=300, offset=10.0):
        r = np.random.default_rng(seed)
        theta = np.zeros(2); b = 0.0; sq = []
        mean = np.array([0.0, 1.0]) + offset  # arm 1 better, but huge offset
        for t in range(steps):
            p = np.exp(theta - theta.max()); p /= p.sum()
            a = r.choice(2, p=p)
            reward = mean[a] + r.normal(0, 1)
            adv = reward - (b if use_baseline else 0.0)  # baseline-subtracted weight
            score = -p.copy(); score[a] += 1.0
            g = adv * score
            sq.append(g[1] ** 2)     # squared gradient (one coord)
            theta += 0.1 * g
            b += 0.1 * (reward - b)  # running estimate of E[return]
        p = np.exp(theta - theta.max()); p /= p.sum()
        return p[1], float(np.mean(sq))

    p_no, v_no = train(False)
    p_yes, v_yes = train(True)
    print(f"no baseline: pi(best) = {p_no:.3f} mean grad^2 = {v_no:.3f}")
    print(f"w/ baseline: pi(best) = {p_yes:.3f} mean grad^2 = {v_yes:.3f}")
    print(f"\nbaseline cut gradient variance ~{v_no/v_yes:.1f}x --")
    print("and only the baselined run actually found the better arm.")

INSTRUMENT R4.2 — BASELINE VARIANCE REDUCTION
SAME GRADIENT, RE-CENTERED RETURNS · EQ R4.5
REWARD OFFSET (CONSTANT ADDED TO ALL ARMS) 10
DISTRIBUTION OF GRADIENT-WEIGHT (Ψ) ACROSS ROLLOUTS
READOUTS: E[Ψ²] NO BASELINE · E[Ψ²] WITH V(s) BASELINE · SECOND-MOMENT REDUCTION

The histogram shows the scalar weight \(\Psi\) that multiplies the score, over many sampled returns. Grey is the raw return \(G\); mint is the advantage \(G - V(s)\) after subtracting the baseline. The two clouds have the same spread — but the mint one is re-centered on zero.
Since the score has zero mean, the gradient estimator's variance is governed by \(\mathbb{E}[\Psi^2]\), the second moment shown in the readouts, and re-centering \(\Psi\) on zero collapses it. Crank the reward offset up: the grey weights march off to the right (every action looks "good"), inflating \(\mathbb{E}[\Psi^2]\), while the baselined weights stay parked around zero. Both give the same expected gradient — EQ R4.5 — but the mint one is far easier to estimate from a handful of samples.

4.4 Actor-critic methods

REINFORCE-with-baseline still has a Monte-Carlo heart: it waits for a full episode and uses the actual return \(G_t\). That keeps it unbiased but slow and noisy. Actor-critic methods take the natural next step suggested by §4.3 — learn the baseline as its own function — and then go further, using that learned value function to bootstrap, replacing the full return with a one-step estimate. Two networks, two jobs:

The actor is the policy \(\pi_\theta(a \mid s)\). It chooses actions and is updated by the policy gradient — pushed toward actions the critic judges better than average.

The critic is a value function \(V_w(s)\) (or \(Q_w(s,a)\)) with its own parameters \(w\). The critic estimates the value function — it learns how much return to expect from a state, and supplies that estimate as both the baseline and the bootstrap target for the actor.

[Diagram: the ACTOR (policy π_θ(a | s)) sends action a to the ENVIRONMENT (P(s′, r | s, a)); the environment returns reward r and state s′; the CRITIC (value V_w(s)) sends advantage δ to the actor.]

The actor acts; the environment returns reward and the next state; the critic scores how that step compared to its own prediction and feeds the advantage back to the actor. The actor learns what to do; the critic learns how good it is. They co-evolve.

The glue between them is the TD error \(\delta\), the one-step temporal-difference signal (Chapter 03).
It is the difference between a slightly-better-informed estimate of value — this step's reward plus the discounted value of where we landed — and the critic's current prediction:

EQ R4.6 — THE TD ERROR AS ADVANTAGE ESTIMATE
$$ \delta_t \;=\; r_{t+1} + \gamma\, V_w(s_{t+1}) - V_w(s_t) \;\approx\; A^\pi(s_t, a_t) $$
\(\delta_t\) is a low-variance, one-sample estimate of the advantage: if the step turned out better than the critic expected, \(\delta_t > 0\) and the actor reinforces the action; if worse, \(\delta_t < 0\) and it suppresses it. Bootstrapping from \(V_w(s_{t+1})\) trades a little bias for a large variance cut — the actor no longer waits for the full return, and the noisy \(G_t\) is replaced by reward plus one value lookup. This is the bias–variance dial at the heart of actor-critic.

The two updates, applied every step (online, no episode boundary required), are:

EQ R4.7 — ACTOR AND CRITIC UPDATES
$$ \underbrace{\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\textbf{actor: policy gradient, weighted by }\delta_t} \qquad \underbrace{w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)}_{\textbf{critic: TD(0) regression}} $$
The same \(\delta_t\) drives both: it tells the actor which way to push the policy and tells the critic how wrong its value estimate was. The critic update is ordinary semi-gradient TD(0) — fit \(V_w\) toward \(r_{t+1} + \gamma V_w(s_{t+1})\). The danger is that the two are learning simultaneously from each other: a biased critic biases the actor, which shifts the data the critic sees. Stability tricks — a slower-moving actor (so the critic can track the policy it is evaluating), target networks, careful step sizes — exist precisely to keep this coupled system from spiraling.

Where this sits on the spectrum is the clean way to remember it. REINFORCE uses the full Monte-Carlo return \(G_t\): zero bias, maximum variance, must wait for the episode to end.
One-step actor-critic uses \(\delta_t\): some bias from bootstrapping, much lower variance, learns online. In between sits a continuum — \(n\)-step returns and, most commonly today, Generalized Advantage Estimation (GAE), which exponentially blends advantage estimates across all horizons with a single knob \(\lambda\) to tune the bias–variance trade-off explicitly. True or false: in an actor-critic method, the critic is the component that estimates the value function (such as \(V_w(s)\)), while the actor is the policy that selects actions. (Answer true or false.) Yes. The actor is the parameterized policy \(\pi_\theta(a\mid s)\) that chooses actions; the critic is the value estimator \(V_w(s)\) (or \(Q_w(s,a)\)) that judges them. The critic's value estimate supplies the baseline and the bootstrap target — via the TD error \(\delta_t\) of EQ R4.6 — that the actor's policy gradient is weighted by. The statement is true. A critic estimates \(V_w(s) = 5.0\) for the current state and \(V_w(s') = 6.0\) for the next. The agent takes an action, receives reward \(r = 0.5\), and \(\gamma = 0.9\). What is the TD error \(\delta = r + \gamma V_w(s') - V_w(s)\) that drives both updates? \(\delta = 0.5 + 0.9 \times 6.0 - 5.0 = 0.5 + 5.4 - 5.0 = \) 0.9. Because \(\delta > 0\), the step beat the critic's expectation: the actor will make this action more likely and the critic will revise \(V_w(s)\) upward. INSTRUMENT R4.3 — ACTOR-CRITIC ARCHITECTURE TD ERROR FLOWS TO BOTH HEADS · EQ R4.6–R4.7 REWARD r 0.50 V(s) 5.0 V(s′) 6.0 DISCOUNT γ 0.90 TD ERROR δ = r + γV(s′) − V(s) — ACTOR SIGNAL — CRITIC SIGNAL — The single scalar \(\delta\) (EQ R4.6) is computed from one transition and routed to both heads (EQ R4.7). Set \(V(s')\) above \(V(s)\) and add reward and \(\delta\) goes positive — the bootstrap target \(r + \gamma V(s')\) exceeds the critic's current guess, so the actor reinforces the action and the critic raises \(V(s)\). 
Drag the reward negative and \(\delta\) flips: the action is suppressed and \(V(s)\) is pulled down. Watch \(\gamma\) scale how much the next state's value counts — at \(\gamma = 0\) the critic is purely myopic and \(\delta\) reduces to \(r - V(s)\). One number, two learners. 4.5 A2C / A3C Naive online actor-critic has a quiet flaw inherited from all on-policy gradient methods: consecutive samples from a single rollout are highly correlated, and that correlation inflates gradient variance and destabilizes training. Value-based deep RL (DQN) broke the correlation with a replay buffer, but a policy gradient must be estimated on-policy — from data the current policy generated — so a buffer of stale experience is off-limits. The answer DeepMind shipped in 2016 was to break correlation a different way: run many actors in parallel. A3C — Asynchronous Advantage Actor-Critic (Mnih et al., 2016) — launches many actor-learners, each with its own copy of the policy, exploring different parts of the environment simultaneously and asynchronously pushing gradients to a shared parameter server. Because the workers are in different states at any instant, the gradients they contribute are decorrelated — the parallelism itself plays the role the replay buffer played for DQN, and it does so while keeping the updates strictly on-policy. EQ R4.8 — THE ADVANTAGE ACTOR-CRITIC OBJECTIVE $$ \nabla_\theta J \;=\; \mathbb{E}\!\big[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \hat{A}_t \,\big] \;+\; \beta\, \nabla_\theta\, \mathcal{H}\!\big[\pi_\theta(\cdot \mid s_t)\big], \qquad \hat{A}_t = \sum_{i=0}^{n-1}\gamma^{\,i} r_{t+i+1} + \gamma^{\,n} V_w(s_{t+n}) - V_w(s_t) $$ \(\hat{A}_t\) is the \(n\)-step advantage — the bias–variance compromise between REINFORCE \((n = \infty)\) and one-step actor-critic \((n = 1)\). 
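The \(n\)-step advantage \(\hat{A}_t\) of EQ R4.8 can be computed directly from a stored rollout segment. The sketch below is a minimal NumPy illustration rather than any particular library's implementation; the function name `n_step_advantage` and its argument names are hypothetical, and the critic values are passed in as plain numbers instead of coming from a network.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_end, gamma):
    """n-step advantage of EQ R4.8 for a single start time t.

    rewards : r_{t+1}, ..., r_{t+n}  (length n)
    v_start : critic estimate V_w(s_t)
    v_end   : critic estimate V_w(s_{t+n}); pass 0.0 if s_{t+n} is terminal
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)            # gamma^0, ..., gamma^(n-1)
    n_step_return = np.sum(discounts * rewards) + gamma ** n * v_end
    return n_step_return - v_start

# n = 1 reduces to the TD error of EQ R4.6
one_step = n_step_advantage([0.5], v_start=5.0, v_end=6.0, gamma=0.9)
# n = 3 uses three real rewards before bootstrapping from the critic
three_step = n_step_advantage([0.5, 0.0, 1.0], v_start=5.0, v_end=6.0, gamma=0.9)
print(one_step, three_step)
```

Larger \(n\) leans more on sampled rewards and less on the critic, which is the bias–variance dial the equation describes.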
The second term is an entropy bonus: \(\mathcal{H}[\pi]\) rewards the policy for staying uncertain, which discourages premature collapse onto a single action and keeps the agent exploring. \(\beta\) tunes its strength. This objective — \(n\)-step advantage plus entropy regularization — is the template virtually every modern policy-gradient algorithm (A2C, PPO, IMPALA) builds on.

A2C — Advantage Actor-Critic — is the synchronous sibling and, in practice, the one most people reach for. Its motivating observation was that the asynchrony in A3C was not the source of the benefit; the parallelism was. So A2C runs the same many environments in lockstep, batches their transitions into one large synchronized update, and gets equal or better results with simpler, more GPU-friendly code. The lesson stuck: gather diverse on-policy experience in parallel, batch it, update once.

| Algorithm | Ψ weight | Bias / variance | Data collection |
|---|---|---|---|
| REINFORCE | G_t (full return) | unbiased · high variance | one episode at a time |
| REINFORCE + baseline | G_t − V(s) | unbiased · lower variance | one episode at a time |
| One-step actor-critic | δ_t (TD error) | biased · low variance | fully online |
| A3C | n-step Â_t + entropy | tunable via n | parallel · asynchronous |
| A2C | n-step Â_t + entropy | tunable via n | parallel · synchronous |

An honest caveat. Vanilla policy gradients — even with advantages and entropy — are notoriously step-size sensitive: too large a step can collapse the policy in a way it cannot recover from, because the update changes the very distribution the next batch is drawn from. The line of work that fixed this — trust regions (TRPO) and the clipped surrogate objective of PPO — is what made policy gradients robust enough to dominate, and is the natural sequel to this chapter. PPO is also, not coincidentally, the workhorse of RLHF: aligning a language model is a policy-gradient problem in disguise, with the reward model as the environment.

NEXT We have the policy-gradient skeleton; now scale it with deep networks and make it stable.
Chapter 05 takes these ideas into deep reinforcement learning proper — function approximation with neural networks, the deadly triad of bootstrapping, off-policy learning and approximation, DQN on the value side, and the trust-region and clipped objectives (TRPO, PPO) that turned the brittle gradient ascent of this chapter into the reliable engine behind game-playing agents, robotics, and RLHF. 4.R References Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 — the original REINFORCE estimator (EQ R4.4) and the log-derivative / score-function trick behind every policy gradient. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 — the policy gradient theorem (EQ R4.3) and its compatibility with a learned value function, the formal basis of actor-critic. Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. ICML 2016 — A3C: parallel actor-learners, the n-step advantage and entropy bonus of EQ R4.8 (and the synchronous A2C that followed). Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — Chapter 13 develops policy-gradient methods, REINFORCE with baselines, and actor-critic exactly as framed here. Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR 2016 — GAE, the λ-blended advantage estimator that sets the bias–variance dial between TD and Monte-Carlo (§4.4). Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv — PPO's clipped surrogate objective, the step-size fix that made policy gradients robust and the workhorse of RLHF (the §4.5 sequel). 
## RL · Deep Reinforcement Learning (https://ai-encyclopedia.com/rl/05-deep-rl.html)
REINFORCEMENT LEARNING · CHAPTER 05 / 06 Deep Reinforcement Learning — DQN & PPO

The tabular methods of the earlier chapters store one number per state, which is impractical the instant the state is a screen of pixels or a robot's joint angles. The fix is to replace the table with a neural network, which scales RL to Atari and robotics at the cost of an instability that replay buffers and clipped objectives exist to tame. This chapter covers two algorithms that made deep RL work: DQN, which stabilized value learning with a replay buffer and a frozen target network, and PPO, whose clipped surrogate objective made policy gradients robust enough to become the field's default.

LEVEL ADVANCED READING TIME ≈ 28 MIN BUILDS ON CH 04 · POLICY GRADIENTS INSTRUMENTS DQN STABILIZERS · PPO CLIP · SEED VARIANCE

IN THIS CHAPTER 5.1 Function approximation & instability 5.2 Deep Q-Networks 5.3 Proximal Policy Optimization 5.4 Continuous control — DDPG & SAC 5.5 Stability & reproducibility 5.R References

5.1 Function approximation & the deadly triad

Tabular RL — a separate entry in a lookup table for every state-action pair — is exact and has clean convergence guarantees. It is also useless the moment the world is large. A 210 × 160 RGB Atari frame has more configurations than there are atoms in the universe; a tabular agent would never visit the same state twice, let alone learn from it. The escape is function approximation: parameterize the value function or policy with a model \(f_\theta\) — a neural network — that generalizes across states, so that what it learns in one state transfers to similar states it has never seen. This single substitution is what "deep" reinforcement learning means. It is also where the guarantees fall apart.
Tabular Q-learning converges; the same algorithm with a neural network in the loop can diverge spectacularly — values exploding to infinity, the policy collapsing to a single useless action. Sutton and Barto named the cause the deadly triad: instability is provoked when three ingredients are present at once. EQ R5.1 — THE DEADLY TRIAD $$ \underbrace{\text{function approximation}}_{\text{generalize across states}} \;+\; \underbrace{\text{bootstrapping}}_{\text{target uses your own estimate}} \;+\; \underbrace{\text{off-policy learning}}_{\text{train on data from another policy}} \;\Longrightarrow\; \text{risk of divergence} $$ Each ingredient is individually benign — and individually almost indispensable. Function approximation is forced on us by large state spaces. Bootstrapping (a TD target \(r + \gamma \max_{a'} Q(s', a')\) that depends on the network's own output) is what makes learning sample-efficient. Off-policy learning lets us reuse old data instead of throwing it away after one gradient step. Present all three and the value estimates can chase their own moving target into divergence. Every algorithm in this chapter is, at heart, a recipe for keeping the triad's three forces in balance rather than letting them resonate. Why does the combination misbehave? In Q-learning the regression target \(y = r + \gamma \max_{a'} Q_\theta(s', a')\) is computed using the same network \(Q_\theta\) we are updating. A gradient step that raises \(Q_\theta(s,a)\) also raises \(Q_\theta(s',a')\) for similar \((s',a')\) — function approximation guarantees the change leaks to neighbors — which raises the target, which raises the next estimate. The network is chasing a target it moves every time it takes a step toward it. With on-policy data and a fresh table this loop is damped; with off-policy data, generalization, and bootstrapping together, it can amplify without bound. 
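The feedback loop described above can be reproduced in a few lines. The sketch below is a hypothetical two-state illustration in the spirit of the textbook off-policy divergence examples, not code from this chapter: one weight `w` is shared by two states whose values are `w` and `2*w` (function approximation), the target is built from the agent's own estimate (bootstrapping), and only the single transition from the first state to the second is ever updated (an off-policy data distribution). All rewards are zero, so the correct answer is `w = 0`.

```python
def repeated_update(w, gamma, alpha=0.1, steps=50):
    """Semi-gradient TD(0) applied over and over to ONE transition s -> s'.

    Linear values with one shared weight: V(s) = 1*w, V(s') = 2*w, reward 0.
    """
    for _ in range(steps):
        target = 0.0 + gamma * (2.0 * w)   # bootstrapped from our own estimate
        td_error = target - 1.0 * w
        w += alpha * td_error * 1.0        # gradient of V(s) w.r.t. w is 1
    return w

w_diverge = repeated_update(w=1.0, gamma=0.9)   # 2*gamma > 1: w grows each step
w_shrink = repeated_update(w=1.0, gamma=0.4)    # 2*gamma < 1: w decays toward 0
print(w_diverge, w_shrink)
```

Each step multiplies `w` by `1 + alpha * (2*gamma - 1)`, so raising the estimate of the first state raises its own target faster than the estimate can catch up whenever `gamma > 0.5`.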
The triad is a diagnosis, not a theorem: it identifies the conditions under which divergence is possible, not a guarantee that it happens. In practice well-tuned deep agents are stable far more often than the worst case suggests — but the failure mode is real, it is hard to predict in advance, and the engineering of §5.2 and §5.3 is the field's accumulated wisdom for staying out of its way. According to the deadly triad (EQ R5.1), how many ingredients must be present together for off-policy value learning with neural networks to risk divergence? The triad names exactly three: function approximation, bootstrapping, and off-policy learning. The answer is 3. Remove any one — e.g. switch to on-policy Monte-Carlo targets (no bootstrapping) or a tabular value (no approximation) — and the convergence story is far safer. 5.2 Deep Q-Networks — replay & target nets The 2015 DQN paper is the landmark: a single architecture, learning straight from raw pixels and a score, reached human-level play on most of 49 Atari games. The network itself is unremarkable — a small convnet mapping a stack of four frames to one Q-value per action. The two ideas that made it stable are the lesson, and both are direct countermeasures to the deadly triad. Experience replay Instead of learning from each transition the instant it occurs and then discarding it, DQN writes every transition \((s, a, r, s')\) into a large circular replay buffer and trains on random minibatches sampled from it. This buys two things. First, it breaks the temporal correlation between consecutive samples: successive frames of one episode are near-identical and violate the i.i.d. assumption every SGD convergence proof leans on; shuffling from a buffer of a million transitions restores approximate independence. Second, it reuses each experience many times, turning a precious environment interaction into many gradient updates — a large gain in sample efficiency. 
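A replay buffer needs very little machinery. The following is a minimal sketch using only the Python standard library; the class name `ReplayBuffer`, its method names, and the toy integer states are illustrative assumptions, not the data structure used in the DQN paper's code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity circular buffer with uniform minibatch sampling."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest transitions fall off

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Copying to a list keeps the sketch simple; real buffers index arrays.
        batch = rng.sample(list(self.storage), batch_size)  # without replacement
        return list(zip(*batch))   # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.storage)

buf = ReplayBuffer(capacity=1000)
for t in range(1500):                     # one long, time-ordered stream
    buf.add(s=t, a=0, r=0.0, s_next=t + 1, done=False)

states, actions, rewards, next_states, dones = buf.sample(32, rng=random.Random(0))
print(len(buf), sorted(states)[:5])
```

The sampled states are scattered across the last 1000 time steps instead of being 32 consecutive frames, which is the decorrelation the text describes; the first 500 transitions have already been overwritten.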
The target network The second stabilizer attacks the moving-target problem head on. DQN keeps a separate copy of the network, the target network \(Q_{\theta^-}\), whose weights are frozen and only periodically copied from the online network \(Q_\theta\) (every \(C\) steps in the original; modern code often uses a slow Polyak average instead). The regression target is computed with the frozen copy, so it does not move while the online network chases it. EQ R5.2 — THE DQN LOSS (WITH A FROZEN TARGET) $$ \mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\Big(\, \underbrace{r + \gamma \max_{a'} Q_{\theta^-}(s', a')}_{\text{target } y,\ \text{frozen}} \;-\; Q_\theta(s, a) \,\Big)^{2}\right] $$ \(\mathcal{D}\) is the replay buffer; \((s,a,r,s')\) a minibatch sampled uniformly from it. The target \(y\) uses the frozen parameters \(\theta^-\); the gradient flows only through \(Q_\theta(s,a)\) — never through the target. Stop-gradient on the bootstrap target plus a buffer that decorrelates samples is the whole stabilization recipe. For a terminal transition the \(\gamma \max\) term is dropped: \(y = r\). Double-DQN refines this by choosing the next action with the online net but evaluating it with the target net, which curbs the systematic over-estimation that a single \(\max\) introduces. The target network's weights are refreshed by a hard copy every \(C\) steps, or by a soft Polyak (exponential) update applied every step — the form most continuous-control code now uses: EQ R5.3 — POLYAK (SOFT) TARGET UPDATE $$ \theta^- \;\leftarrow\; \tau\, \theta \;+\; (1 - \tau)\, \theta^-, \qquad 0 < \tau \ll 1 $$ With \(\tau\) small (say \(0.005\)) the target net is a slowly-trailing exponential moving average of the online net: it moves, but far too slowly to resonate with the online updates. \(\tau = 1\) recovers a hard copy every step (no smoothing at all); \(\tau \to 0\) freezes the target forever. 
\(\tau\) trades stability against the speed at which the target tracks genuine improvement — too small and learning crawls, too large and the moving-target instability creeps back. True or false: DQN's experience-replay buffer breaks the correlation between consecutive training samples by storing transitions and drawing random minibatches from the whole buffer rather than learning from each transition in order. (Answer true or false.) Consecutive frames within an episode are highly correlated and badly violate the i.i.d. assumption SGD relies on. Sampling uniformly from a buffer of up to a million past transitions mixes experiences from many different times and episodes, restoring approximate independence — that decorrelation, together with sample reuse, is the buffer's whole purpose. The statement is true. A target-network weight is updated by a Polyak step (EQ R5.3) with \(\tau = 0.005\). The online weight is \(\theta = 10\) and the current target weight is \(\theta^- = 2\). What is the new target weight \(\theta^-\)? \(\theta^- \leftarrow \tau\,\theta + (1-\tau)\,\theta^- = 0.005 \times 10 + 0.995 \times 2 = 0.05 + 1.99 = \) 2.04. The target inches only \(0.04\) toward the online value — exactly the slow trailing average that keeps the bootstrap target from chasing itself. 
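The target of EQ R5.2 and the Double-DQN refinement differ by a single indexing step, which is easiest to see side by side. This is a minimal NumPy sketch on made-up Q-values; the function names and the tiny two-transition batch are hypothetical, and the arrays stand in for the outputs of the online and target networks at \(s'\).

```python
import numpy as np

def dqn_target(r, done, q_target_next, gamma):
    """y = r + gamma * max_a' Q_target(s', a'); y = r on terminal transitions."""
    return r + gamma * (1.0 - done) * q_target_next.max(axis=1)

def double_dqn_target(r, done, q_online_next, q_target_next, gamma):
    """Online net picks the next action, target net evaluates it."""
    best = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(best)), best]
    return r + gamma * (1.0 - done) * evaluated

r = np.array([0.5, 1.0])
done = np.array([0.0, 1.0])                          # second transition is terminal
q_online_next = np.array([[1.0, 3.0], [0.0, 0.0]])   # online net at s'
q_target_next = np.array([[2.0, 1.5], [4.0, 5.0]])   # target net at s'

y_dqn = dqn_target(r, done, q_target_next, gamma=0.9)
y_ddqn = double_dqn_target(r, done, q_online_next, q_target_next, gamma=0.9)
print(y_dqn, y_ddqn)
```

On the first transition the plain target takes the target net's own maximum (2.0), while Double-DQN evaluates the action the online net prefers (1.5), giving a smaller target. On the terminal transition both reduce to \(y = r\).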
PYTHON · RUNNABLE IN-BROWSER
# DQN target + replay on a toy 4-state chain MDP (EQ R5.2, EQ R5.3)
import numpy as np
rng = np.random.default_rng(0)

# states 0..3, action "go" advances one state; reward +1 only on reaching s3
nS, gamma = 4, 0.9

def step(s):  # deterministic toy dynamics
    ns = min(s + 1, 3); r = 1.0 if ns == 3 else 0.0; done = (ns == 3)
    return ns, r, done

# fill a replay buffer with (s, ns, r, done) transitions from random starts
buffer = [(s, *step(s)) for s in rng.integers(0, 3, size=400)]

Q = np.zeros(nS)  # "online" tabular value (one per state, greedy action)
Qt = Q.copy()     # target copy; trails Q slowly via the Polyak update below
lr, tau = 0.5, 0.1
for it in range(60):
    s, ns, r, done = buffer[rng.integers(len(buffer))]  # sample from replay
    y = r if done else r + gamma * Qt[ns]  # target uses the slow copy Qt
    Q[s] += lr * (y - Q[s])                # gradient step on online Q only
    Qt = tau * Q + (1 - tau) * Qt          # Polyak soft update (EQ R5.3)

true = np.array([gamma**2, gamma**1, gamma**0, 0.0])  # exact V from each state
print("learned Q:", Q.round(3).tolist())
print("true V:", true.round(3).tolist())
print("max error:", round(float(np.abs(Q - true).max()), 4))
print("\nfreezing the target (Qt) is what stops Q from chasing its own moving estimate.")

INSTRUMENT R5.1 — DQN STABILIZERS (replay buffer · target net · EQ R5.2). Controls: replay buffer ON / OFF (online); target network FROZEN / OFF (self). Readouts: regime, final value error, outcome.

Each curve is the learned value of the start state over training on the toy chain, plotted against its true value (the dashed mint line). With both stabilizers ON the estimate climbs smoothly to the truth. Turn OFF the target network and the bootstrap target chases itself — the curve overshoots and oscillates. Turn OFF replay and learning from a single correlated stream becomes jagged and slow. Switch both off to watch the deadly triad's instability in miniature.
5.3 Proximal Policy Optimization DQN learns a value and acts greedily; it is confined to discrete actions and is famously fiddly to tune. The other half of deep RL learns the policy directly (Chapter 04). Vanilla policy gradients are unbiased but high-variance and brittle: a single overlarge step can push the policy into a region where it collects no reward, and with no good data it never recovers. The fix that won the field is Proximal Policy Optimization (PPO) — robust, simple to implement, and the workhorse behind everything from robotics to the RLHF that aligns language models (Chapter 06). PPO descends from Trust Region Policy Optimization (TRPO), whose principle is: improve the policy, but never step so far that the new policy is unrecognizably different from the old one, because the advantage estimates were collected under the old policy and stop being valid far from it. TRPO enforces this with a hard KL-divergence constraint and a second-order optimization — correct but heavy. PPO achieves nearly the same effect with a first-order trick that fits in a few lines: clip the probability ratio. Let \(r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) be the ratio of the new policy's probability of the taken action to the old policy's. \(r_t = 1\) means no change; \(r_t > 1\) means the new policy is more likely to take that action. PPO maximizes the clipped surrogate objective: EQ R5.4 — PPO CLIPPED SURROGATE OBJECTIVE $$ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\Big[\, \min\big(\, r_t(\theta)\, \hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat{A}_t \,\big) \Big] $$ \(\hat{A}_t\) is the estimated advantage (typically from GAE, Chapter 04) — how much better the action was than the policy's average. The \(\mathrm{clip}\) confines the ratio to \([1-\varepsilon,\, 1+\varepsilon]\) (the default \(\varepsilon = 0.2\) gives \([0.8,\, 1.2]\)). 
The outer \(\min\) takes the pessimistic of the clipped and unclipped terms, so the objective is a lower bound on the true improvement. The effect: once an update would move the policy too far in the rewarding direction, the gradient simply switches off — there is no incentive to step past the trust region. No KL constraint, no second-order solve: a one-line guardrail that made policy gradients dependable. The asymmetry is the clever part. Read it case by case. When the advantage is positive (a good action, push its probability up) the objective stops rewarding any increase in \(r_t\) past \(1+\varepsilon\): the upside is capped, so the update cannot over-commit. When the advantage is negative (a bad action, push its probability down) the clipping floors the term at \(1-\varepsilon\), again removing the incentive to over-correct. Crucially, the \(\min\) only ever removes incentive when the policy has already moved far enough in the favorable direction — it never clips in a way that prevents undoing a too-large step, so the policy can always claw back from a mistake. That single property is why PPO is forgiving where vanilla policy gradients are not. PPO clips the probability ratio \(r_t(\theta)\) to the interval \([1-\varepsilon,\, 1+\varepsilon]\). For the standard \(\varepsilon = 0.2\), what is the upper end of the clipping interval, \(1 + \varepsilon\)? \(1 + \varepsilon = 1 + 0.2 = \) 1.2. (The lower end is \(1 - 0.2 = 0.8\).) Once the new policy is more than 20% more likely to take an advantageous action than the old policy was, the clipped objective stops rewarding any further increase — the trust region, made of arithmetic. For one sample with ratio \(r_t = 1.5\), advantage \(\hat{A}_t = +2\), and \(\varepsilon = 0.2\), evaluate the per-sample PPO objective \(\min\!\big(r_t\hat{A}_t,\ \mathrm{clip}(r_t,\,0.8,\,1.2)\,\hat{A}_t\big)\) from EQ R5.4. Unclipped term: \(r_t\hat{A}_t = 1.5 \times 2 = 3\). 
Clipped ratio: \(\mathrm{clip}(1.5, 0.8, 1.2) = 1.2\), so the clipped term is \(1.2 \times 2 = 2.4\). The objective is the minimum: \(\min(3,\ 2.4) = \) 2.4. The advantage is positive and the ratio has already exceeded \(1+\varepsilon\), so PPO caps the reward at the clipped value — pushing \(r_t\) higher would buy nothing.

PYTHON · RUNNABLE IN-BROWSER
# PPO clipped objective on toy advantages, swept over the ratio r (EQ R5.4)
import numpy as np
eps = 0.2
r = np.linspace(0.0, 2.0, 21)  # candidate probability ratios

def ppo_obj(r, A, eps=0.2):
    unclipped = r * A
    clipped = np.clip(r, 1 - eps, 1 + eps) * A
    return np.minimum(unclipped, clipped)  # pessimistic lower bound

A_pos, A_neg = +1.0, -1.0
L_pos = ppo_obj(r, A_pos, eps)
L_neg = ppo_obj(r, A_neg, eps)
print(" r L(A=+1) L(A=-1)")
for ri, lp, ln in zip(r, L_pos, L_neg):
    print(f"{ri:4.2f} {lp:+6.3f} {ln:+6.3f}")
print(f"\nclip interval at eps={eps}: [{1-eps:.2f}, {1+eps:.2f}]")
print("A>0: objective FLATTENS once r exceeds 1.20 (no reward for over-stepping).")
print("A<0: objective FLATTENS once r drops below 0.80 (no reward for over-correcting).")
plot_xy(r.tolist(), L_pos.tolist())  # plot_xy is supplied by the site's in-browser runtime

INSTRUMENT R5.2 — PPO CLIP VISUALIZER (L^CLIP vs ratio · EQ R5.4). Controls: clip ε (default 0.20); advantage Â, POSITIVE (+1) / NEGATIVE (−1). Readouts: clip interval [0.80, 1.20], objective flattens at r = 1.20, L^CLIP at r = 1.5.

The mint curve is the clipped objective \(L^{\text{CLIP}}\) as a function of the ratio \(r_t\); the faint grey line is the unclipped \(r_t\hat{A}_t\) that vanilla policy gradients would chase off to infinity. For a positive advantage the mint curve rises, then goes flat past \(1+\varepsilon\) — the gradient dies, so no update can over-step. Flip to a negative advantage and the flat shoulder appears below \(1-\varepsilon\) instead. Widen \(\varepsilon\) to loosen the trust region (bigger, riskier steps); narrow it for timid, stable ones.
The default \(\varepsilon = 0.2\) is what most PPO code ships with — and what renders on load. 5.4 Continuous control — DDPG & SAC DQN's \(\max_{a'}\) over actions is fine when there are four buttons; it is intractable when the action is a vector of continuous torques, because the maximization is itself an optimization problem at every step. Continuous control — robot arms, locomotion, autonomous driving — needs a different shape of algorithm. The dominant family is actor–critic, which keeps a learned policy (the actor) and a learned value (the critic) and lets them improve each other. DDPG (Deep Deterministic Policy Gradient). An off-policy actor–critic that you can read as "DQN for continuous actions". A deterministic actor \(\mu_\theta(s)\) outputs the action directly, so the critic's \(\max\) is replaced by \(Q(s, \mu_\theta(s))\) — no inner optimization. It inherits DQN's replay buffer and target networks (with Polyak updates, EQ R5.3) and adds exploration noise to the actor's output. Powerful, but notoriously sensitive to hyperparameters. TD3 (Twin Delayed DDPG). Three targeted fixes for DDPG's pathologies: twin critics (take the minimum of two Q-networks to fight the over-estimation bias DQN also suffers); delayed actor updates (update the policy less often than the critic, so it chases a more settled target); and target-policy smoothing (add noise to the target action so the critic cannot exploit sharp peaks). Together they make off-policy continuous control far more reliable. SAC (Soft Actor–Critic). The current default for continuous control. SAC is built on maximum-entropy RL: the objective adds an entropy bonus, so the agent is rewarded not only for return but for staying as random as it can while still doing well. This yields strong, automatic exploration, robustness to hyperparameters, and excellent sample efficiency. 
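Returning to TD3 for a moment: two of its three fixes, target-policy smoothing and the twin-critic minimum, both live inside the bootstrap target and can be sketched together. The code below is a minimal NumPy illustration under stated assumptions: the function name `td3_target`, the toy quadratic critics `q1` and `q2`, and the noise settings are all hypothetical stand-ins for learned target networks and tuned hyperparameters.

```python
import numpy as np

def td3_target(r, done, mu_next, q1, q2, gamma, rng,
               noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Bootstrap target with target-policy smoothing and a twin-critic minimum.

    mu_next : target actor's action at s' (array)
    q1, q2  : callables standing in for the two target critics, a -> Q(s', a)
    """
    noise = np.clip(rng.normal(0.0, noise_std, size=mu_next.shape),
                    -noise_clip, noise_clip)
    a_next = np.clip(mu_next + noise, -act_limit, act_limit)  # smoothed action
    q_min = np.minimum(q1(a_next), q2(a_next))                # pessimistic value
    return r + gamma * (1.0 - done) * q_min

rng = np.random.default_rng(0)
q1 = lambda a: 1.0 - a ** 2            # toy critic peaked at a = 0
q2 = lambda a: 1.2 - 2.0 * a ** 2      # second toy critic with a sharper peak
y = td3_target(r=0.5, done=0.0, mu_next=np.array([0.0]), q1=q1, q2=q2,
               gamma=0.9, rng=rng)
print(y)
```

The third fix, delayed actor updates, is a scheduling choice (update the policy only every few critic steps) and is not shown here.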
EQ R5.5 — MAXIMUM-ENTROPY OBJECTIVE (SAC) $$ J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}\!\Big[\, R(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big], \qquad \mathcal{H}(\pi) = -\!\sum_{a} \pi(a\mid s)\log \pi(a\mid s) $$ The familiar return, plus a per-step entropy bonus \(\alpha\,\mathcal{H}(\pi)\) that pays the agent to keep its action distribution spread out. The temperature \(\alpha\) sets the price of randomness: large \(\alpha\) keeps the policy exploratory and stochastic, \(\alpha \to 0\) recovers ordinary reward maximization. Entropy turns exploration from a bolt-on heuristic (the ε of DQN) into a first-class term of the objective — and modern SAC tunes \(\alpha\) automatically to hold entropy at a target, removing one of the most painful knobs in RL. The price: a continuous, off-policy method that, like DDPG/TD3, leans on replay buffers and target networks for stability. A useful mental map: PPO is the robust, on-policy default when you can afford to throw away data after each batch (and it dominates RLHF for that simplicity). SAC is the sample-efficient, off-policy default when interactions are expensive — a real robot, a slow simulator. DQN and its descendants own discrete-action problems. There is no universal winner; the right choice is dictated by action space, sample budget, and how much tuning you can tolerate. 5.5 Stability & reproducibility Deep RL works — and it is also, honestly, the least reproducible corner of mainstream machine learning. The reason traces straight back to §5.1: the agent generates its own data, so a tiny early difference in behavior steers it toward an entirely different region of experience, and the gap compounds. The most uncomfortable symptom is seed sensitivity: the same algorithm, the same code, the same hyperparameters, changing only the random seed, can produce wildly different learning curves — one seed solving the task, another never leaving the floor. 
This is not a rumor; it was documented carefully. Henderson et al. (2018) showed that reported results in deep-RL papers were routinely driven by a handful of lucky seeds, that the choice of seed could matter as much as the choice of algorithm, and that comparisons drawn from too few runs were often statistically meaningless. The practical consequences are now widely accepted:

Report many seeds, not one. A single learning curve is an anecdote. Five to ten independent seeds, with the spread shown — not just the best or the mean — is the minimum honest unit of evidence.

Show the distribution. Mean ± standard deviation, or better, the interquartile range and confidence intervals; aggregate protocols like RLiable exist precisely to stop cherry-picking. The variance across seeds is itself a result — a high-variance method may be worse in practice than a lower-mean but reliable one.

Pin the stack. Environment version, library version, hardware, and every hyperparameter, because deep-RL outcomes are sensitive to all of them — implementation details that sound cosmetic (reward scaling, observation normalization, the exact advantage estimator) routinely swing final performance more than the headline algorithm does.

EQ R5.6 — WHAT A SINGLE SEED HIDES
$$ \bar{G} = \frac{1}{N}\sum_{i=1}^{N} G^{(i)}, \qquad \mathrm{SE} = \frac{s}{\sqrt{N}}, \qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\big(G^{(i)} - \bar{G}\big)^2 $$
\(G^{(i)}\) is the final return of seed \(i\); \(\bar{G}\) the mean across \(N\) seeds; \(s\) the sample standard deviation; \(\mathrm{SE}\) the standard error of the mean, which shrinks only as \(1/\sqrt{N}\). With \(N = 1\) there is no \(s\) and no \(\mathrm{SE}\) — the number you report has an error bar you simply cannot see. Because deep-RL seed variance is large, the \(\sqrt{N}\) in the denominator is brutal: halving your uncertainty costs four times the compute. This is the arithmetic behind "run more seeds".
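The interquartile range and a confidence interval mentioned above take only a few more lines than the mean. The sketch below is one simple way to compute them, assuming a percentile bootstrap over seeds; the function name `summarize_seeds` is hypothetical, the returns are made-up numbers rather than results from any experiment, and dedicated evaluation libraries offer more careful aggregate protocols.

```python
import numpy as np

def summarize_seeds(final_returns, n_boot=2000, seed=0):
    """Mean, standard error (EQ R5.6), IQR, and a bootstrap 95% CI of the mean."""
    g = np.asarray(final_returns, dtype=float)
    n = len(g)
    mean = g.mean()
    se = g.std(ddof=1) / np.sqrt(n)                 # s / sqrt(N)
    q25, q75 = np.percentile(g, [25, 75])           # interquartile range
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_boot, n))      # resample seeds with replacement
    boot_means = g[idx].mean(axis=1)
    ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
    return {"mean": mean, "se": se, "iqr": (q25, q75), "ci95": (ci_lo, ci_hi)}

# made-up final returns for 8 seeds: some runs solve the task, some stall
returns = [182.0, 35.0, 171.0, 190.0, 48.0, 176.0, 22.0, 185.0]
stats = summarize_seeds(returns)
print(stats)
```

With a bimodal outcome like this one, the interval around the mean is wide and the IQR spans both modes, which is exactly the information a single learning curve hides.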
PYTHON · RUNNABLE IN-BROWSER
# Why one seed lies: variance of final return across seeds (EQ R5.6)
import numpy as np

# simulate 8 seeds of a high-variance deep-RL run: some solve it, some stall
rng = np.random.default_rng(7)
seeds = 8
# bimodal outcome: ~60% reach a good return ~180, ~40% get stuck near ~40
solved = rng.random(seeds) < 0.6
final = np.where(solved, rng.normal(180, 15, seeds), rng.normal(40, 20, seeds)).clip(0)

mean = final.mean()
s = final.std(ddof=1)        # sample std (N-1)
se = s / np.sqrt(seeds)      # standard error of the mean

print("per-seed final return:", final.round(1).tolist())
print(f"mean G_bar = {mean:6.1f}")
print(f"std s = {s:6.1f} (this is what one seed cannot show you)")
print(f"std-err SE = {se:6.1f} (shrinks only as 1/sqrt(N))")
print(f"if you reported ONLY seed 0: {final[0]:.1f} <- anecdote, not evidence")
# plot_scatter is not defined in this cell; it is assumed to come from the site's in-browser runtime
plot_scatter(list(range(seeds)), final.tolist(), solved.astype(int).tolist())

INSTRUMENT R5.3 — REWARD-CURVE VARIANCE ACROSS SEEDS
SAME ALGORITHM · DIFFERENT SEEDS · EQ R5.6
Controls: seeds shown N (default 8), run-to-run noise (default 1.00). Readouts: mean final return, std across seeds, standard error (s/√N).
Every faint curve is one seed of the same deep-RL agent — identical code, identical hyperparameters, only the random seed differs. The bright mint line is the mean across them; the shaded band is one standard deviation. Drag N down to 1 and you are left with a single anecdotal curve that could be the lucky run or the doomed one — you cannot tell. Drag it up and the mean steadies while the band reveals the true spread the field learned to report. Crank the noise to feel why high-variance methods demand many seeds before any comparison means anything.

PITFALLS The deep-RL reproducibility checklist. (1) One-seed results are anecdotes — report at least 5 seeds, ideally with IQR/CIs. (2) The deadly triad can diverge silently; watch the Q-values, not just the reward.
(3) Reward scaling and observation normalization swing outcomes more than the algorithm name — log them. (4) "Beats SOTA" from a different env version or evaluation protocol is not a comparison. (5) Tuning on the test environment is a contamination, exactly as in supervised learning. NEXT The clip that stabilized policy gradients is about to stabilize something far larger. Chapter 06 turns PPO outward: the same clipped objective, with a language model as the policy and a learned reward model standing in for the environment, is the engine of RLHF — and its leaner successors, DPO and GRPO, that align the models you talk to every day. 5.R References Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — the DQN paper; experience replay and the frozen target network (EQ R5.2) learning Atari from pixels. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 — the clipped surrogate objective (EQ R5.4) at the heart of §5.3. Schulman, J., Levine, S., Abbeel, P., Jordan, M. & Moritz, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477 — the KL-constrained trust region PPO approximates with a first-order clip. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor. arXiv:1801.01290 — the maximum-entropy objective (EQ R5.5) and the continuous-control default of §5.4. van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016 (arXiv:1509.06461) — double-DQN, decoupling action selection from evaluation to curb over-estimation. Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. arXiv:1509.02971 — DDPG, the deterministic actor–critic for continuous actions (§5.4). Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. 
arXiv:1802.09477 — TD3; twin critics, delayed updates, and target smoothing. Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI 2018 (arXiv:1709.06560) — the reproducibility and seed-variance study behind §5.5 and EQ R5.6.
## RL · RL Meets LLMs (https://ai-encyclopedia.com/rl/06-rl-and-llms.html)
REINFORCEMENT LEARNING · CHAPTER 06 / 06 RL Meets LLMs — RLHF, DPO & GRPO For most of this volume the agent acted in a maze, a game, or a control loop. Now the environment is a conversation and the agent is a language model, yet the machinery barely changes. The same reward-maximizing apparatus that mastered Atari and Go now aligns language models, and the latest variants skip the reward model entirely. This chapter traces the line from a contextual bandit, through the reward model and PPO of RLHF, to DPO's closed-form shortcut and GRPO's group-relative, value-network-free objective driving the reasoning models of 2025. LEVEL ADVANCED READING TIME ≈ 30 MIN BUILDS ON RL 05 · POLICY GRADIENTS INSTRUMENTS RM PIPELINE · DPO vs PPO · REWARD HACK IN THIS CHAPTER 6.1 Bandits & the contextual case 6.2 RLHF — learning from preferences 6.3 PPO for language models 6.4 DPO — preferences without RL 6.5 GRPO & RLVR — verifiable rewards 6.R References 6.1 Bandits & the contextual case Before the conversation, the slot machine. A multi-armed bandit is the smallest non-trivial RL problem: one state, \(K\) actions ("arms"), each returning a noisy reward, and a single dilemma — explore arms you are unsure about, or exploit the one that has paid best so far. There are no transitions and no credit assignment across time, which is exactly what makes it the cleanest laboratory for the explore–exploit trade-off of Chapter 01. Add a twist and you get the frame that makes the rest of this chapter click. In a contextual bandit, before each pull the agent sees a context \(x\) and chooses an arm \(a\) conditioned on it; the reward depends on both. The episode is exactly one step long: observe, act, get rewarded, done. There is no \(s_{t+1}\) to plan toward, so the discount factor and the Bellman recursion fall away entirely.
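The one-step episode just described (observe, act, get rewarded, done) can be sketched as a tiny simulation. Everything here is an illustrative assumption rather than something the chapter prescribes: the reward table is hypothetical, the policy is a softmax over per-context logits, and the update is a REINFORCE-style step with a running-mean baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical reward table r(x, a): 2 contexts (rows) x 3 arms (columns)
reward_table = np.array([[1.0, 0.2, 0.0],
                         [0.0, 0.3, 1.0]])
n_contexts, n_arms = reward_table.shape

theta = np.zeros((n_contexts, n_arms))   # softmax logits, one row per context
baseline = np.zeros(n_contexts)          # running mean reward per context
lr = 0.1                                 # assumed step size

def policy(x):
    """pi(. | x): softmax over the logits of context x."""
    z = theta[x] - theta[x].max()
    p = np.exp(z)
    return p / p.sum()

for episode in range(5000):
    x = rng.integers(n_contexts)         # observe a context
    p = policy(x)
    a = rng.choice(n_arms, p=p)          # take exactly one action
    r = reward_table[x, a]               # receive one reward; the episode ends
    grad_logp = -p                       # d log pi(a|x) / d theta[x] = onehot(a) - p
    grad_logp[a] += 1.0
    theta[x] += lr * (r - baseline[x]) * grad_logp
    baseline[x] += 0.05 * (r - baseline[x])

greedy = theta.argmax(axis=1)
print("greedy arm per context:", greedy.tolist())
print("best arm per context:  ", reward_table.argmax(axis=1).tolist())
```

There is no next state and no discount anywhere in the loop: each iteration is a complete episode, which is the property the rest of the chapter relies on.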
EQ R6.1 — THE CONTEXTUAL-BANDIT OBJECTIVE $$ \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\; \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[\, r(x, a) \,\big] $$ \(\mathcal{D}\) is the distribution of contexts; \(\pi(a \mid x)\) the policy; \(r(x,a)\) the reward for taking action \(a\) in context \(x\). Compare the full RL return (Vol RL · EQ R1.3): the sum over future steps has collapsed to a single expected reward, because the horizon is one. This is the exact shape of LLM alignment. Read \(x\) as the prompt, \(a\) as the entire generated response, and \(r(x,a)\) as "how good was that answer" — and RLHF is nothing more than a contextual bandit over an astronomically large action space. That reframing is the load-bearing idea of the whole chapter. An LLM response is a single action drawn from a policy \(\pi_\theta(y \mid x)\) — yes, it is built token by token, but the reward arrives once, on the finished sequence, so the optimization is bandit-shaped even though the generation is sequential. The action space is the set of all token sequences, combinatorially huge, which is why we never enumerate arms; we sample, score, and nudge the sampling distribution. Two things are missing from EQ R6.1, and supplying them is the entire history that follows: where does \(r(x,a)\) come from when no environment hands it to us, and how do we optimize it when we cannot try every arm. A caveat experts insist on: treating an LLM rollout as one bandit action throws away all intermediate structure. Per-token credit assignment (the dense-reward, token-level MDP view) is an active research frontier, and process-reward models that score reasoning steps rather than only final answers are exactly an attempt to reintroduce the horizon the bandit framing discards. The bandit picture is the right first model — not the last word. A contextual bandit episode is exactly how many environment steps long (observe context, take one action, receive one reward, terminate)? Enter the integer. 
There is one context, one action, one reward, then termination — no \(s_{t+1}\). The horizon is 1, which is why the discounted return of Chapter 01 collapses to a single expected reward (EQ R6.1). 6.2 RLHF — learning from human preferences The reward in EQ R6.1 is the problem. "How good is this answer" has no closed form — helpfulness, honesty, and tone are not functions you can write down. The insight that unlocked modern alignment, due to Christiano and colleagues in 2017 and scaled to language by InstructGPT in 2022, is that people cannot reliably score a response on an absolute scale, but they can reliably compare two. So do not ask for a number; ask which of two completions is better, and learn a reward function that explains those choices. The bridge from comparisons to a scalar is the Bradley–Terry model, a century-old model of paired comparisons. Assign each response a latent reward \(r_\phi(x,y)\); the probability that response \(y_w\) is preferred over \(y_l\) is the sigmoid of their reward difference. EQ R6.2 — BRADLEY–TERRY PREFERENCE MODEL $$ P\big(y_w \succ y_l \mid x\big) \;=\; \frac{\exp r_\phi(x, y_w)}{\exp r_\phi(x, y_w) + \exp r_\phi(x, y_l)} \;=\; \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) $$ \(\sigma\) is the logistic sigmoid; \(y_w\) ("win") is the preferred completion, \(y_l\) ("lose") the rejected one. Only the difference of rewards matters — the model is invariant to adding any constant to every reward, so the scale is fixed only up to a shift. Equal rewards give exactly \(\sigma(0) = 0.5\): a coin flip when the two answers are equally good. Fitting \(r_\phi\) is then a binary-classification problem on preference pairs. The reward model \(r_\phi\) is itself a transformer — usually the supervised-fine-tuned policy with its token head replaced by a single scalar head reading the final hidden state. 
It is trained by maximum likelihood on a dataset of comparisons \(\{(x, y_w, y_l)\}\): minimize the negative log-likelihood of the human's choice under EQ R6.2. EQ R6.3 — REWARD-MODEL LOSS $$ \mathcal{L}_{\text{RM}}(\phi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Big[\, \log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \Big] $$ This is logistic regression on the reward gap. The gradient pushes \(r_\phi(x,y_w)\) up and \(r_\phi(x,y_l)\) down until the model's predicted preference probability matches the humans'. A subtlety that bites in practice: the reward model is a frozen snapshot of human judgment, and as the policy drifts to exploit it (§6.5's reward hacking), its scores grow unreliable on exactly the off-distribution outputs the policy is now producing. With a learned \(r_\phi\) standing in for the human, the contextual-bandit objective of EQ R6.1 is finally concrete: maximize the reward model's score over completions the policy generates. The classic RLHF pipeline is three stages — supervised fine-tuning (SFT) to teach the format, reward-model training on preferences, then policy optimization against the reward model — and the third stage is the subject of §6.3. 
PYTHON · RUNNABLE IN-BROWSER
# Bradley-Terry: fit a scalar reward per item from pairwise preferences (EQ R6.2-3)
import numpy as np
rng = np.random.default_rng(0)

# 4 responses with hidden "true" qualities; we only get to SEE comparisons
true_r = np.array([2.0, 1.0, 0.0, -1.0])
n = len(true_r)

# draw 600 random pairs, drop self-pairs, then sample each winner by Bradley-Terry
pairs = rng.integers(0, n, (600, 2))
pairs = pairs[pairs[:,0] != pairs[:,1]]
p_win = 1 / (1 + np.exp(-(true_r[pairs[:,0]] - true_r[pairs[:,1]])))
i_wins = rng.random(len(pairs)) < p_win          # True => left item won

r = np.zeros(n)                                  # learned rewards, start at 0
for step in range(400):                          # gradient descent on EQ R6.3
    w = np.where(i_wins, pairs[:,0], pairs[:,1]) # winner index per pair
    l = np.where(i_wins, pairs[:,1], pairs[:,0]) # loser index per pair
    pred = 1 / (1 + np.exp(-(r[w] - r[l])))      # P(winner beats loser) under model
    g = np.zeros(n)                              # dL/dr; (pred-1) flows to winner
    np.add.at(g, w, (pred - 1))
    np.add.at(g, l, (1 - pred))
    r -= 0.05 * g / len(pairs)
    r -= r.mean()                                # rewards fixed only up to a shift

print("true (centered):", (true_r - true_r.mean()).round(2))
print("learned(centered):", r.round(2))
print("ranking recovered:", list(np.argsort(-r)), "== ", list(np.argsort(-true_r)))

INSTRUMENT R6.1 — PREFERENCE → REWARD-MODEL PIPELINE
BRADLEY–TERRY · EQ R6.2 · LIVE
Controls: reward r(y_w) of the chosen completion (default 2.0), reward r(y_l) of the rejected completion (default 0.0). Readouts: reward gap Δ = r_w − r_l, P(chosen ≻ rejected) = σ(Δ), RM loss −log σ(Δ).
The reward model only ever sees the gap between two completions, never an absolute score (EQ R6.2). Slide the two rewards: when they are equal the preference probability sits at exactly 0.50 — a coin flip — and the loss is \(\log 2 \approx 0.69\). Push the chosen response above the rejected one and the loss falls toward zero, while the sigmoid curve marks how confidently the model now predicts the human's pick.
Make the gap negative (rate the rejected answer higher) and watch the loss explode: the model is being told it ranked the pair backwards. 6.3 PPO for language models Stage three optimizes the policy against the reward model. The workhorse is Proximal Policy Optimization (PPO), a policy-gradient method (Chapter 05) chosen for one property above all: it takes small, conservative steps. That conservatism is not incidental. The reward model is a fragile, frozen approximation; optimize against it too aggressively and the policy sprints off-distribution into regions where \(r_\phi\) is meaningless — and produces fluent nonsense that the reward model nonetheless loves. PPO's mechanism is the clipped surrogate objective. Let \(\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) be the probability ratio between the updated and the data-collecting policy, and \(\hat A_t\) the advantage estimate. PPO maximizes the smaller of the unclipped and clipped products, which caps how far one update can move the policy. EQ R6.4 — PPO CLIPPED SURROGATE $$ \mathcal{L}^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\Big[\, \min\big(\rho_t\,\hat A_t,\; \operatorname{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon)\,\hat A_t\big) \Big] $$ The ratio \(\rho_t\) is clipped to \([1-\varepsilon,\, 1+\varepsilon]\) (typically \(\varepsilon = 0.2\)). When the advantage is positive, the objective stops rewarding the update once \(\rho_t > 1+\varepsilon\); when negative, once \(\rho_t < 1-\varepsilon\). The \(\min\) makes the bound pessimistic — it removes the incentive to move the policy too far in a single step, a cheap surrogate for the trust region of TRPO without the second-order machinery. On top of the clip, RLHF adds a second leash: a per-token KL penalty against the original SFT model. The reward actually optimized is not \(r_\phi\) alone but \(r_\phi\) minus a penalty for drifting away from where the policy started. 
EQ R6.5 — THE KL-REGULARIZED RLHF REWARD $$ \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(y\mid x)\,\|\,\pi_{\text{ref}}(y\mid x)\big) $$ \(\pi_{\text{ref}}\) is the frozen SFT reference; \(\beta\) sets the strength of the leash. The KL term keeps the policy near a region where the reward model is trustworthy and where the model still speaks fluent, on-distribution language. The whole RLHF objective is this one line — and §6.4 shows it has a closed-form optimum, which is the crack DPO pries open. Standard PPO-RLHF needs four models in memory at once: policy, reference, reward model, and a value/critic network. The cost is the story. Four large models resident simultaneously, online rollouts at every step, a separate value network to train, and a notorious sensitivity to hyperparameters — PPO-RLHF works, and it produced InstructGPT, ChatGPT, and the first generation of aligned assistants, but it is heavy, finicky, and hard to reproduce. Every method that follows is, in part, an attempt to keep RLHF's results while shedding its weight. In PPO with \(\varepsilon = 0.2\), the new policy makes an action four times as likely as the old policy, so \(\rho_t = 4\), and the advantage \(\hat A_t\) is positive. Using EQ R6.4, what effective ratio multiplies \(\hat A_t\) in the clipped objective? For positive \(\hat A_t\) the objective is the \(\min\), which selects the clipped branch once \(\rho_t > 1+\varepsilon\). With \(\varepsilon = 0.2\) the cap is \(1 + 0.2 = \) 1.2: pushing the ratio from 1.2 toward 4 buys no extra objective, so PPO has no incentive to take the giant step. 
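The KL-regularized reward of EQ R6.5 can be made concrete for a single sampled response. This is a minimal sketch under stated assumptions: the log-probabilities and reward-model score are hypothetical numbers, `kl_shaped_rewards` is an illustrative helper name, and placing the reward-model score on the final token is a common convention assumed here rather than something the chapter specifies. The summed log-ratio over the sampled tokens is a one-sample estimate of the KL term.

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards: a KL penalty at every token plus the RM score at the end."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    log_ratio = logp_policy - logp_ref    # log pi_theta - log pi_ref, per token
    per_token = -beta * log_ratio         # penalty for drifting from the reference
    per_token[-1] += rm_score             # assumption: r_phi arrives on the last token
    return per_token

# hypothetical per-token log-probs of one sampled response (4 tokens)
logp_policy = [-0.5, -1.0, -0.2, -0.8]
logp_ref = [-0.7, -1.1, -0.9, -0.8]
rm_score = 1.5                            # hypothetical reward-model score r_phi(x, y)
beta = 0.1

rewards = kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta)
total = rewards.sum()
kl_estimate = float(np.sum(np.array(logp_policy) - np.array(logp_ref)))

print("per-token shaped rewards:", rewards.round(3).tolist())
print(f"one-sample KL estimate = {kl_estimate:.3f}")
print(f"shaped return = r_phi - beta * KL_estimate = {total:.3f}")
```

Raising `beta` in this sketch shrinks the shaped return for any response whose tokens the policy likes more than the reference does, which is the leash acting on a single sample.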
PYTHON · RUNNABLE IN-BROWSER
# PPO clipped surrogate vs the raw ratio objective (EQ R6.4)
import numpy as np

eps = 0.2
ratio = np.linspace(0.0, 2.5, 26)       # pi_new / pi_old

def clip_obj(rho, A, eps=0.2):
    return np.minimum(rho * A, np.clip(rho, 1-eps, 1+eps) * A)

A_pos, A_neg = 1.0, -1.0
obj_pos = clip_obj(ratio, A_pos, eps)   # good action: A > 0
obj_neg = clip_obj(ratio, A_neg, eps)   # bad action: A < 0

print(" ratio clip(A=+1) clip(A=-1)")
for r, op, on in list(zip(ratio, obj_pos, obj_neg))[::4]:
    print(f" {r:5.2f} {op:8.3f} {on:9.3f}")

# the objective FLATTENS past the clip edges -> no reward for a giant step
print(f"\nA>0 objective is flat for ratio >= {1+eps}: ", np.allclose(obj_pos[ratio >= 1+eps], 1+eps))
print(f"A<0 objective is flat for ratio <= {1-eps}: ", np.allclose(obj_neg[ratio <= 1-eps], -(1-eps)))
# plot_xy is not defined in this cell; it is assumed to come from the site's in-browser runtime
plot_xy(ratio.tolist(), obj_pos.tolist())

6.4 DPO — preferences without RL Here is the elegant turn. The KL-regularized objective of EQ R6.5 is not an open-ended search — it has a known, closed-form optimal policy. For a fixed reward \(r\), the policy that maximizes "expected reward minus \(\beta\)-KL to the reference" is the reference distribution reweighted by the exponentiated reward: EQ R6.6 — THE OPTIMAL KL-REGULARIZED POLICY $$ \pi_r(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y}\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big) $$ This is a standard result (a Gibbs / Boltzmann distribution); the partition function \(Z(x)\) is intractable because it sums over all sequences, which is why RLHF resorts to PPO instead of using it directly. But invert it — solve for \(r\) in terms of \(\pi_r\) — and the reward becomes a function of the policy itself, with \(Z(x)\) appearing as an additive term that depends only on \(x\).
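On a toy problem small enough to enumerate, the closed form of EQ R6.6 and its inversion can be checked numerically. This is a sketch under stated assumptions: one prompt with five candidate responses, and a made-up reference distribution and reward vector. With only five responses the partition function is a five-term sum, so nothing here is intractable.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5

# hypothetical toy problem: one prompt, 5 candidate responses
pi_ref = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
reward = np.array([0.0, 1.0, 2.0, -1.0, 3.0])

def objective(pi):
    """Expected reward minus beta * KL(pi || pi_ref), as in EQ R6.5."""
    return float(np.sum(pi * reward) - beta * np.sum(pi * np.log(pi / pi_ref)))

# EQ R6.6: reweight the reference by exp(r / beta), then normalize by Z
weights = pi_ref * np.exp(reward / beta)
Z = weights.sum()
pi_star = weights / Z

# no randomly drawn policy should beat the closed-form optimum
random_policies = rng.dirichlet(np.ones(5), size=1000)
best_random = max(objective(p) for p in random_policies)

# inverting EQ R6.6: r = beta * log(pi_star / pi_ref) + beta * log Z
recovered = beta * np.log(pi_star / pi_ref) + beta * np.log(Z)

print("pi_star:", pi_star.round(4).tolist())
print(f"objective at pi_star = {objective(pi_star):.4f}   beta*log Z = {beta * np.log(Z):.4f}")
print(f"best of 1000 random policies = {best_random:.4f}")
print("reward recovered from the policy:", recovered.round(4).tolist())
```

The last line is the step DPO exploits: the reward is recoverable from the policy up to the additive term \(\beta \log Z(x)\), which is the same for every response to the same prompt.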
Rafailov and colleagues (2023) made the leap: substitute that inverted reward into the Bradley–Terry preference model (EQ R6.2). The intractable \(Z(x)\) is the same for both completions of a pair, so in the difference \(r(x,y_w) - r(x,y_l)\) it cancels exactly. What remains is a reward expressed purely as a log-ratio of the policy to the reference — and the entire reward-model-plus-RL pipeline collapses into a single supervised loss on preference pairs. EQ R6.7 — THE DPO LOSS $$ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)}\!\left[\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right)\right] $$ The bracketed term is the implicit reward \(\hat r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\): the policy is its own reward model. Minimizing this raises the likelihood of \(y_w\) and lowers that of \(y_l\), each measured relative to the reference. No reward model is trained, no rollouts are sampled, no RL loop runs — just a forward/backward pass on a fixed dataset, like ordinary supervised fine-tuning. The \(\beta\) that was the KL strength in EQ R6.5 reappears here as the loss temperature. The gradient makes the behavior vivid. Its magnitude scales with how badly the implicit reward model currently ranks the pair — pairs the model already gets right contribute little, pairs it gets backwards contribute a lot — and its direction increases \(\log\pi_\theta(y_w\mid x)\) while decreasing \(\log\pi_\theta(y_l\mid x)\). DPO is preference learning that looks and runs exactly like supervised learning, and that simplicity made it the default for budget alignment almost overnight. Honest caveats, because the field is not settled. 
DPO is offline: it optimizes on a fixed preference set and cannot explore beyond it, so it is sensitive to how well that data covers the policy's behavior, and the implicit reward can drift on out-of-distribution completions. Online and iterative variants (sampling fresh pairs, IPO's bounded objective, KTO's prospect-theory single-label loss) exist precisely to patch these gaps. Several careful studies find well-tuned PPO still edges out DPO on the hardest tasks; DPO's win is overwhelmingly one of simplicity and cost, not a clean dominance on quality. True or false: DPO removes the need for a separately trained reward model and for an online RL optimization loop, optimizing preferences with a single supervised loss instead. (Answer true or false.) EQ R6.7 depends only on the policy \(\pi_\theta\) and the frozen reference \(\pi_{\text{ref}}\) — the intractable \(Z(x)\) cancelled and the reward model became implicit in the policy. There is no \(r_\phi\) to train and no rollout loop; a single supervised gradient step on preference pairs suffices. The statement is true. For a preference pair, the policy assigns the chosen completion twice the reference likelihood (\(\pi_\theta/\pi_{\text{ref}} = 2\)) and the rejected completion half (\(\pi_\theta/\pi_{\text{ref}} = 0.5\)). With \(\beta = 1\), the implicit-reward gap is \(\beta(\ln 2 - \ln 0.5) = 2\ln 2 \approx 1.386\). What preference probability \(\sigma(\text{gap})\) does the model now assign to the chosen completion? (Use \(\sigma(z) = 1/(1+e^{-z})\).) Since \(e^{2\ln 2} = (e^{\ln 2})^2 = 2^2 = 4\), we have \(e^{-1.386} = 1/4 = 0.25\). So \(\sigma(2\ln 2) = \dfrac{1}{1 + 0.25} = \dfrac{1}{1.25} = \) 0.8. The DPO gradient keeps pushing this toward 1 — raising \(\pi_\theta(y_w)\), lowering \(\pi_\theta(y_l)\). 
PYTHON · RUNNABLE IN-BROWSER
# DPO loss on toy preferred/rejected pairs; check the gradient DIRECTION (EQ R6.7)
import numpy as np

beta = 1.0
# log-probs (policy and frozen reference) for chosen y_w and rejected y_l
lp_pi_w, lp_ref_w = -2.0, -2.3      # policy already prefers y_w a bit
lp_pi_l, lp_ref_l = -1.5, -2.4      # but policy still over-likes y_l

# implicit reward = beta * (log pi - log ref) -- the policy IS the reward model
r_w = beta * (lp_pi_w - lp_ref_w)
r_l = beta * (lp_pi_l - lp_ref_l)
gap = r_w - r_l
p_pref = 1 / (1 + np.exp(-gap))     # P(y_w > y_l) under EQ R6.2
loss = -np.log(p_pref)
print(f"implicit reward r_w={r_w:+.3f} r_l={r_l:+.3f} gap={gap:+.3f}")
print(f"P(chosen preferred) = {p_pref:.3f} DPO loss = {loss:.3f}")

# dL/d(logprob): coefficient (p_pref - 1) < 0 => descent RAISES logpi(y_w), LOWERS logpi(y_l)
coef = p_pref - 1.0
g_w = beta * coef * (+1)            # gradient of the loss wrt log pi(y_w)
g_l = beta * coef * (-1)            # gradient of the loss wrt log pi(y_l)
print(f"\ngrad wrt logpi(y_w) = {g_w:+.3f} (negative -> descent RAISES y_w)")
print(f"grad wrt logpi(y_l) = {g_l:+.3f} (positive -> descent LOWERS y_l)")
print("direction: push probability mass from the rejected toward the chosen answer.")

INSTRUMENT R6.2 — DPO vs PPO
SAME OBJECTIVE · TWO MACHINES · EQ R6.5–R6.7
Control: optimizer (DPO or PPO-RLHF). Readouts: models in memory, online rollouts, separate reward model.
Both pipelines optimize the same KL-regularized objective (EQ R6.5). Toggle between them: PPO-RLHF trains a reward model, then samples online rollouts and runs four models at once (policy, reference, reward, critic); DPO proves that objective has a closed-form optimum (EQ R6.6), folds the reward into the policy (EQ R6.7), and reduces the whole thing to one supervised loss over a fixed preference set — two models, no rollouts, no reward model. The stages light up to show exactly which pieces each pipeline keeps.
6.5 GRPO & RLVR — verifiable rewards DPO and PPO both lean on human preferences, with all the noise, expense, and gameability that entails. But for some tasks the reward needs no human at all: a math answer is right or wrong, code passes the unit tests or it does not. This is RLVR — reinforcement learning from verifiable rewards: replace the learned, hackable reward model with a deterministic checker that returns a clean signal that is far harder, though not impossible, to game. It is the engine behind the reasoning models — DeepSeek-R1, OpenAI's o-series, and their kin — that surged through 2024–2025. The optimizer of choice is GRPO — Group Relative Policy Optimization, introduced with DeepSeekMath. Its central move attacks PPO's most expensive component: the value network (the critic) that estimates a baseline for the advantage. GRPO deletes it. Instead, for each prompt it samples a group of \(G\) complete responses, scores them all, and uses the group's own statistics as the baseline — the advantage of a response is simply how far above or below the group average its reward sits. EQ R6.8 — GRPO GROUP-RELATIVE ADVANTAGE $$ \hat A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}, \qquad i = 1, \ldots, G $$ \(r_i\) is the reward of the \(i\)-th sampled response to the same prompt; the baseline is the group mean and the scale is the group standard deviation. No learned value network is needed — the baseline that PPO spends a whole second model to estimate, GRPO reads straight off a batch of samples. A response beats its peers \(\Rightarrow\) positive advantage \(\Rightarrow\) its tokens are reinforced; it lags \(\Rightarrow\) negative \(\Rightarrow\) suppressed. The normalized advantage then enters a PPO-style clipped objective (EQ R6.4) with the usual KL leash to the reference.
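How the group-relative advantage of EQ R6.8 feeds the clipped objective of EQ R6.4 can be sketched as follows. This is a deliberately simplified illustration, not the published GRPO objective: it uses one sequence-level probability ratio per response, omits the KL term, and the log-probabilities are hypothetical. `grpo_surrogate` is an illustrative helper name.

```python
import numpy as np

def grpo_surrogate(rewards, logp_new, logp_old, eps=0.2):
    """Group-relative advantages (EQ R6.8) inside a clipped surrogate (EQ R6.4)."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # no critic needed
    ratio = np.exp(np.asarray(logp_new, dtype=float)
                   - np.asarray(logp_old, dtype=float))          # pi_new / pi_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return adv, float(np.mean(np.minimum(unclipped, clipped)))

# one prompt, G = 4 sampled responses, pass/fail rewards
rewards = [1, 0, 0, 1]
logp_old = np.array([-3.0, -2.5, -4.0, -3.5])    # hypothetical sequence log-probs

# before any update the ratio is 1 everywhere, so the surrogate is the mean advantage
adv, at_old = grpo_surrogate(rewards, logp_old, logp_old)

# a small step that raises the correct responses and lowers the wrong ones
logp_new = logp_old + np.array([0.1, -0.1, -0.1, 0.1])
_, after = grpo_surrogate(rewards, logp_new, logp_old)

print("advantages:", adv.round(3).tolist())
print(f"surrogate at the old policy = {at_old:.4f}")
print(f"surrogate after the step    = {after:.4f}")
```

Because the advantages are centered within the group, the surrogate starts at zero and only becomes positive when probability mass moves from below-average responses toward above-average ones.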
Strip away the value network and what remains is almost startlingly simple: sample several answers, reward each (often just 1 for correct, 0 for wrong), standardize the rewards within the group, and push the policy toward the above-average answers. Run that loop on verifiable math and code, and reasoning behavior — longer chains of thought, self-checking, backtracking — emerges without any of it being explicitly supervised. That emergence, more than the algorithm itself, is what made GRPO the defining method of the reasoning era. REWARD HACKING The recurring failure of every method in this chapter. The policy optimizes the measured reward, not the intended one — so any gap between them gets exploited. A reward model that slightly favors longer answers breeds verbosity; one that likes confident tone breeds confident wrongness; a verifiable checker with a loophole gets gamed by answers that pass the test without solving the task. This is Goodhart's law in a gradient: when a measure becomes a target, it ceases to be a good measure. The KL leash (EQ R6.5) is the main defense — it keeps the policy near the trustworthy region — but it only slows the drift, it does not remove the incentive. True or false: GRPO estimates the advantage of each response from the statistics of a group of sampled outputs for the same prompt, removing the need for a separately learned value (critic) network. (Answer true or false.) EQ R6.8's baseline is the group mean and its scale the group std — both read directly off a batch of \(G\) sampled responses, never from a learned critic. That is precisely how GRPO drops PPO's value network. The statement is true. A GRPO group of \(G = 4\) responses to one prompt scores rewards \((1, 0, 0, 1)\) (1 = correct). Using EQ R6.8, what is the standardized advantage \(\hat A_i\) of a correct response? (Mean \(= 0.5\); population std \(= 0.5\).) Mean \(= (1+0+0+1)/4 = 0.5\). Variance \(= \frac{1}{4}\big[(0.5)^2\cdot 4\big] = 0.25\), so std \(= 0.5\). 
A correct response: \(\hat A = (1 - 0.5)/0.5 = \) 1. A wrong one gets \((0-0.5)/0.5 = -1\): the two are symmetric, and no extra network is needed to supply the baseline.

PYTHON · RUNNABLE IN-BROWSER
# GRPO group-relative advantage from a group of sampled outputs (EQ R6.8)
import numpy as np

# one prompt, G=8 sampled responses; verifiable reward = 1 if correct else 0
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)   # RLVR: pass/fail
G = len(correct)

mean = correct.mean()
std = correct.std() + 1e-8          # population std, EQ R6.8
adv = (correct - mean) / std        # group-relative advantage

print(f"rewards: {correct.astype(int).tolist()}")
print(f"group mean (baseline) = {mean:.3f} group std = {std:.3f}")
print("advantages:", adv.round(3).tolist())
print("\ncorrect responses get +adv (reinforced), wrong get -adv (suppressed);")
print("the baseline is the GROUP itself -- no learned value network anywhere.")

# if EVERY sample is correct, std -> 0: the group gives no learning signal
allright = np.ones(G)
adv0 = (allright - allright.mean()) / (allright.std() + 1e-8)
print("\nall-correct group advantages:", adv0.round(3).tolist(), "-> zero signal (nothing to prefer)")

INSTRUMENT R6.3 — REWARD HACKING
PROXY REWARD vs TRUE QUALITY · EQ R6.5
Controls: optimization pressure in steps (default 40), KL leash β (default 0.20). Readouts: proxy reward r_φ, true quality, KL from reference.
The mint curve is what the reward model measures; the blue curve is the true quality you actually want. Early optimization lifts both — the proxy is a decent stand-in near the reference. Crank up the pressure and they diverge: the proxy keeps climbing while true quality peaks and falls as the policy learns to exploit the reward model's blind spots. This gap is reward hacking, and the dashed line is where true quality turns over. Tighten the KL leash β and the policy stays near the reference — flatter proxy gains, but the divergence is delayed and shallower.
Loosen it toward 0 and the hack arrives fast and hard. NEXT Every method here turned a goal into a number and maximized it — and every failure was a player gaming the rules. Preference learning, reward hacking, and self-play are all strategic interaction in disguise. The Game Theory volume opens with the formal language for that: players, payoffs, strategies, and the equilibria that emerge when every agent optimizes against every other — including against the very humans whose preferences we just spent a chapter learning. 6.R References Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — the original preference-to-reward pipeline and the Bradley–Terry reward model (EQ R6.2–R6.3) that RLHF scaled to language. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS — the three-stage SFT → reward model → PPO RLHF recipe behind ChatGPT; source of the KL-regularized objective (EQ R6.5). Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv — the clipped surrogate objective (EQ R6.4) used as the RLHF policy optimizer. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS — DPO; the closed-form optimal policy (EQ R6.6) and the supervised preference loss (EQ R6.7) that skip the reward model and RL loop. Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv — introduces GRPO; the group-relative advantage (EQ R6.8) that removes PPO's value network. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv — RLVR with GRPO at scale; reasoning behavior emerging from verifiable rewards (§6.5). Stiennon, N. et al. (2020). 
Learning to Summarize from Human Feedback. NeurIPS — the reward-hacking dynamics of over-optimizing a learned reward model (Instrument R6.3 §6.5).
========================================================================
GAME THEORY
========================================================================
## GAME · Games & Equilibria (https://ai-encyclopedia.com/game-theory/01-games-equilibria.html)
GAME THEORY · CHAPTER 01 / 03 Games & Equilibria In single-agent optimization, "optimal" means picking the action with the highest payoff. Once a player's reward depends on what other rational agents do, and theirs on what the player does, that definition no longer applies: there is no fixed objective to maximize against. The replacement is the Nash equilibrium, a strategy profile in which no player can gain by changing their own move alone. This chapter develops the supporting machinery: games and payoffs, dominance, the equilibrium itself, the structure of zero-sum conflict, and the randomized strategies that guarantee equilibria exist. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON RL · PROBABILITY INSTRUMENTS PAYOFF MATRIX · MINIMAX · SIMPLEX IN THIS CHAPTER 1.1 Games, players, strategies & payoffs 1.2 Dominant strategies & iterated elimination 1.3 Nash equilibrium 1.4 Zero-sum games & minimax 1.5 Mixed strategies 1.R References 1.1 Games, players, strategies & payoffs A game in the sense of this volume is not a pastime; it is a model of interaction between agents whose outcomes are intertwined. The minimal data is a triple. There is a set of players \(N = \{1, 2, \ldots, n\}\). Each player \(i\) has a set of strategies \(S_i\) — the actions available to them. And each player has a payoff function \(u_i\) that assigns a real number to every combination of choices, one from each player: EQ G1.1 — NORMAL-FORM GAME $$ \Gamma = \big(N,\; \{S_i\}_{i \in N},\; \{u_i\}_{i \in N}\big), \qquad u_i: S_1 \times S_2 \times \cdots \times S_n \to \mathbb{R} $$ A choice of one strategy per player is a strategy profile \(s = (s_1, \ldots, s_n)\). The crucial feature — the one that breaks ordinary optimization — is that \(u_i\) depends on the whole profile, not just on \(s_i\). Player \(i\) controls only the \(i\)-th coordinate; the rest is chosen by others.
We write \(s = (s_i, s_{-i})\), splitting player \(i\)'s move from everyone else's \(s_{-i}\). For two players with finitely many actions, the game is a payoff matrix: rows are player 1's strategies, columns are player 2's, and each cell holds an ordered pair \((u_1, u_2)\). The canonical example is the Prisoner's Dilemma. Two suspects each choose to Cooperate (stay silent) or Defect (betray). Higher numbers are better; the cell entries are (row payoff, column payoff): row ↓ / col → Cooperate Defect Cooperate (2, 2) (0, 3) Defect (3, 0) (1, 1) A central modeling assumption runs through everything below: players are rational (each maximizes their own payoff) and this rationality is common knowledge (each knows the others are rational, knows that they know it, and so on). This is a strong, often unrealistic idealization — real humans deviate systematically, and behavioral game theory exists precisely to map those deviations. We will be honest about where the idealization bites. But it is the assumption that gives the predictions their teeth. The tool that turns this raw data into prediction is the best response. Holding everyone else's choices \(s_{-i}\) fixed, player \(i\)'s best responses are the strategies that maximize their payoff: EQ G1.2 — BEST RESPONSE $$ \mathrm{BR}_i(s_{-i}) \;=\; \operatorname*{arg\,max}_{s_i \in S_i} \; u_i(s_i,\, s_{-i}) $$ A set, not a single point — ties are allowed. Almost every equilibrium concept in game theory is a fixed point of mutual best response: a profile in which everyone is simultaneously best-responding to everyone else. The entire chapter is, in one sentence, the study of when such fixed points exist and how to find them. In the Prisoner's Dilemma above, suppose the column player has chosen Cooperate. Apply EQ G1.2: what is the row player's payoff \(u_1\) when they play their best response to a cooperating opponent? Holding the column at Cooperate, row compares Cooperate (payoff \(2\)) against Defect (payoff \(3\)). 
The best response is Defect, paying \(u_1 = \) 3 — the "temptation" payoff. This is exactly why mutual cooperation is unstable: from \((2,2)\), a unilateral defection jumps you to \(3\). 1.2 Dominant strategies & iterated elimination Sometimes a strategy is good no matter what anyone else does. Strategy \(s_i\) strictly dominates \(s_i'\) when it pays more against every possible choice of the opponents: EQ G1.3 — STRICT DOMINANCE $$ u_i(s_i,\, s_{-i}) \;>\; u_i(s_i',\, s_{-i}) \qquad \text{for every } s_{-i} \in S_{-i} $$ Replace the strict \(>\) with \(\ge\) (strict somewhere) and you get weak dominance. A rational player never plays a strictly dominated strategy — it is beaten regardless of what the world does, so reasoning about the opponent is unnecessary. This is the rare corner of game theory where a player's optimal move requires no belief about the others. In the Prisoner's Dilemma, Defect strictly dominates Cooperate for both players: against a Cooperating opponent, \(3 > 2\); against a Defecting opponent, \(1 > 0\). Each player has a strictly dominant strategy — Defect — so the predicted outcome is mutual defection, paying \((1, 1)\). And here is the dilemma's sting: both would have preferred \((2, 2)\), but \((2, 2)\) is not stable, because each could unilaterally jump to \(3\) by defecting. Individual rationality drives the pair to a jointly worse outcome. This single fact underwrites everything from arms races to tragedy-of-the-commons depletion to why multi-agent AI systems trained only on self-interest can converge on collectively destructive policies. Most games have no dominant strategy. But dominance still buys leverage through iterated elimination of strictly dominated strategies (IESDS): delete a dominated strategy, and in the smaller game some other strategy may now be dominated, so delete that too, and repeat. With strict dominance the order of deletion does not change the result. 
If the process leaves exactly one strategy per player, the game is dominance-solvable and we have a prediction without ever invoking equilibrium. INSTRUMENT G1.1 — PAYOFF-MATRIX EXPLORER EDIT A 2×2 GAME · FIND THE NASH EQUILIBRIA EACH CELL = (ROW PAYOFF, COLUMN PAYOFF) · EDIT ANY NUMBER · UNDERLINED = A BEST RESPONSE · MINT CELL = PURE NASH PRESET GAME PRISONER'S DILEMMA COORDINATION CHICKEN MATCHING PENNIES PURE NASH EQUILIBRIA — ROW DOMINANT STRATEGY — COLUMN DOMINANT STRATEGY — A pure Nash equilibrium is a cell where both numbers are underlined — row is best-responding in its column and column is best-responding in its row simultaneously. Prisoner's Dilemma has one (mutual defect); Coordination has two (both pure equilibria are stable but uncoordinated play is not); Matching Pennies has none — the underlines chase each other around the matrix, which is exactly why §1.5 needs randomization. Edit any payoff and watch the equilibria move.
PYTHON · RUNNABLE IN-BROWSER
# Pure Nash equilibria of a 2x2 game by best response (EQ G1.2)
import numpy as np
# Prisoner's Dilemma, "higher is better". A = row payoffs, B = column payoffs.
A = np.array([[2, 0],    # row plays Cooperate
              [3, 1]])   # row plays Defect
B = np.array([[2, 3],    # column payoffs, same cell layout
              [0, 1]])
acts = ["Cooperate", "Defect"]
# A cell (i,j) is Nash if i maximizes row's payoff in column j
# AND j maximizes column's payoff in row i.
row_br = (A == A.max(axis=0, keepdims=True))   # best row responses per column
col_br = (B == B.max(axis=1, keepdims=True))   # best col responses per row
nash = row_br & col_br
print("row best-response mask (per column):\n", row_br.astype(int))
print("col best-response mask (per row):\n", col_br.astype(int))
for i in range(2):
    for j in range(2):
        if nash[i, j]:
            print(f"PURE NASH -> (row={acts[i]}, col={acts[j]}), "
                  f"payoffs=({A[i,j]}, {B[i,j]})")
1.3 Nash equilibrium Dominance handles only the easy games.
The concept that handles all of them — John Nash's 1950 contribution, and the reason game theory became the lingua franca of economics, biology, and multi-agent AI — is a notion of mutual stability. A strategy profile \(s^\star\) is a Nash equilibrium when no single player can improve their payoff by unilaterally changing their own strategy, holding everyone else's fixed: EQ G1.4 — NASH EQUILIBRIUM $$ u_i\big(s_i^\star,\, s_{-i}^\star\big) \;\ge\; u_i\big(s_i,\, s_{-i}^\star\big) \qquad \text{for every player } i \text{ and every } s_i \in S_i $$ Equivalently, every player is best-responding to everyone else at once: \(s_i^\star \in \mathrm{BR}_i(s_{-i}^\star)\) for all \(i\). The word "unilaterally" is load-bearing — equilibrium says nothing about coordinated deviations by coalitions (that is cooperative game theory, next chapter). It is a statement about no profitable solo move, and that is exactly why it can be self-enforcing without any contract. Read the definition as a test, not a recipe: given a candidate profile, you check it by asking each player in turn, "could you do better by switching, assuming the others stand pat?" If every answer is no, it is an equilibrium. The Prisoner's Dilemma's \((D, D)\) passes: from payoff \(1\), unilaterally cooperating drops you to \(0\). The cooperative \((C, C)\) fails: from \(2\), unilaterally defecting jumps you to \(3\). Two cautions the textbooks insist on, and rightly. First, equilibria need not be unique — coordination games have several, and the theory alone does not say which one rational players will land on (this is the equilibrium selection problem, genuinely unsettled). Second, equilibrium is not optimality: the Prisoner's Dilemma equilibrium is Pareto-dominated by mutual cooperation — every player prefers the non-equilibrium outcome. Nash equilibrium predicts what self-interested rational agents will do, not what is collectively best. Conflating the two is the most common error in applying the concept. 
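The "test, not a recipe" reading of EQ G1.4 translates almost line for line into code. The sketch below is a minimal illustration rather than the site's implementation: the helper names `pure_nash` and `pareto_dominated` and the coordination-game payoffs are assumptions introduced here, while the Prisoner's Dilemma matrices are the ones from §1.1. It illustrates both cautions at once: the coordination game passes the test at two cells, and the Prisoner's Dilemma equilibrium is Pareto-dominated.

```python
import numpy as np

def pure_nash(A, B):
    """Return every cell (i, j) where no player gains from a solo deviation (EQ G1.4)."""
    A, B = np.asarray(A), np.asarray(B)
    cells = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            row_ok = A[i, j] >= A[:, j].max()   # row cannot do better within column j
            col_ok = B[i, j] >= B[i, :].max()   # column cannot do better within row i
            if row_ok and col_ok:
                cells.append((i, j))
    return cells

def pareto_dominated(A, B, cell):
    """True if another cell is at least as good for both players and strictly better for one."""
    A, B = np.asarray(A), np.asarray(B)
    i, j = cell
    at_least_as_good = (A >= A[i, j]) & (B >= B[i, j])
    strictly_better = (A > A[i, j]) | (B > B[i, j])
    return bool((at_least_as_good & strictly_better).any())

# Hypothetical coordination game (illustrative payoffs, not taken from the text).
coord_A = np.array([[2, 0], [0, 1]])
coord_B = np.array([[2, 0], [0, 1]])

# Prisoner's Dilemma from section 1.1 (index 0 = Cooperate, 1 = Defect).
pd_A = np.array([[2, 0], [3, 1]])
pd_B = np.array([[2, 3], [0, 1]])

coord_eq = pure_nash(coord_A, coord_B)
pd_eq = pure_nash(pd_A, pd_B)

print("coordination equilibria:", coord_eq)   # two cells: equilibria need not be unique
print("PD equilibria:", pd_eq)                # only mutual defection
print("PD equilibrium Pareto-dominated?", pareto_dominated(pd_A, pd_B, pd_eq[0]))
```

Only pure strategies are checked, so a game such as Matching Pennies would return an empty list here even though a mixed equilibrium exists.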
Nash's theorem — proved with the Kakutani fixed-point theorem — guarantees that every finite game has at least one equilibrium, provided we allow mixed strategies (randomization over actions, §1.5). Existence is the deep result; the Prisoner's Dilemma happens to have a pure one, but games like Matching Pennies have an equilibrium only once randomization is permitted. By the definition in EQ G1.4: at a Nash equilibrium, no player can increase their own payoff by deviating unilaterally (changing only their own strategy while everyone else holds fixed). Is this statement true or false ? (Answer "true" or "false".) This is the definition itself. EQ G1.4 states \(u_i(s_i^\star, s_{-i}^\star) \ge u_i(s_i, s_{-i}^\star)\) for every player \(i\) and every alternative \(s_i\) — i.e., no unilateral deviation pays more. So the statement is true. (Note the equilibrium says nothing about coordinated multi-player deviations, only solo ones.) 1.4 Zero-sum games & minimax A two-player zero-sum game is pure conflict: whatever one player wins, the other loses, so \(u_1 + u_2 = 0\) in every cell. We can then describe the whole game by a single matrix \(A\), where \(A_{ij}\) is the row player's payoff (the column player's is \(-A_{ij}\)). The row player maximizes; the column player minimizes. This is the setting von Neumann solved in 1928, decades before the general Nash concept — and it is far better behaved. The conservative move for the row player is to choose the strategy whose worst case is best — the maximin. The column player symmetrically picks the strategy whose worst case (from their side) is best — the minimax: EQ G1.5 — MAXIMIN AND MINIMAX $$ \underline{v} = \max_{i} \min_{j} A_{ij} \;\le\; \overline{v} = \min_{j} \max_{i} A_{ij} $$ \(\underline{v}\) is the most the row player can guarantee regardless of the opponent; \(\overline{v}\) is the most the column player can be forced to concede. 
The inequality \(\underline{v} \le \overline{v}\) always holds (the second mover never does worse). When the two coincide at a single cell, that cell is a saddle point — a pure-strategy equilibrium that is also the value of the game. Von Neumann's Minimax Theorem is the crown jewel: in any finite two-player zero-sum game, once mixed strategies are allowed, the maximin and minimax are always equal. Their common value \(v\) is the value of the game, and the optimal mixed strategies are interchangeable and worst-case optimal: EQ G1.6 — THE MINIMAX THEOREM $$ \max_{p \in \Delta(S_1)} \min_{q \in \Delta(S_2)} \; p^{\top} A\, q \;=\; \min_{q \in \Delta(S_2)} \max_{p \in \Delta(S_1)} \; p^{\top} A\, q \;=\; v $$ \(p\) and \(q\) are probability distributions over the row and column actions (\(\Delta\) is the simplex). The order of "max then min" no longer matters — a property that fails for general-sum games. This is why zero-sum is uniquely tractable: a unique value, no equilibrium-selection ambiguity, and the optimal strategy is computable by linear programming. It is the mathematical backbone of self-play in AlphaZero-style systems and of robust/adversarial training, where the "opponent" is a worst-case perturbation. INSTRUMENT G1.2 — MINIMAX SOLVER 2×2 ZERO-SUM · ROW MAXIMIZES, COLUMN MINIMIZES · EQ G1.5–G1.6 ROW-PAYOFF MATRIX A (COLUMN GETS −A) · EDIT ANY CELL PRESET MATCHING PENNIES HAS A SADDLE SKEWED MAXIMIN v (PURE) — MINIMAX v (PURE) — SADDLE POINT? — ROW MIX p* (TOP, BOTTOM) — COLUMN MIX q* (LEFT, RIGHT) — VALUE OF GAME v* — When the maximin and minimax agree on a single cell, that pure cell is a saddle point and you are done — no randomization needed ("HAS A SADDLE"). When they disagree (Matching Pennies: maximin \(-1\), minimax \(+1\)), there is no pure equilibrium and the solver falls back to the closed-form mixed solution of EQ G1.7. The value sits between the pure maximin and minimax — exactly the gap that randomization closes. 
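As a numerical cross-check on EQ G1.6, both sides of the theorem can be brute-forced on a grid of mixed strategies. This is a sketch under stated assumptions: the grid resolution and variable names are choices made here, the matrix is the one from the worked exercise in §1.5, and a grid search stands in for the linear program a real solver would use.

```python
import numpy as np

A = np.array([[4.0, 0.0],
              [1.0, 3.0]])                  # row payoffs; column receives -A

grid = np.linspace(0.0, 1.0, 3001)          # candidate probabilities on the first action
mixes = np.stack([grid, 1.0 - grid], axis=1)  # each row is a mixed strategy (p, 1-p)

# Pure strategies only (EQ G1.5): the two sides can differ.
pure_maximin = A.min(axis=1).max()
pure_minimax = A.max(axis=0).min()

# Row side: for each mix p the column player answers with the worst column for row.
row_guarantee = (mixes @ A).min(axis=1)
mixed_maximin = row_guarantee.max()
p_star = grid[row_guarantee.argmax()]

# Column side: for each mix q the row player answers with the best row.
col_concession = (A @ mixes.T).max(axis=0)
mixed_minimax = col_concession.min()
q_star = grid[col_concession.argmin()]

print(f"pure : maximin = {pure_maximin:.3f}, minimax = {pure_minimax:.3f}")
print(f"mixed: maximin = {mixed_maximin:.3f}, minimax = {mixed_minimax:.3f}")
print(f"p* ~ {p_star:.3f}, q* ~ {q_star:.3f}")
```

With pure strategies the two sides are separated by a gap; once mixing is allowed they meet at a single value, which is what the theorem asserts. The grid only approximates the optimum, so agreement is up to the grid spacing.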
PYTHON · RUNNABLE IN-BROWSER
# Solve a 2x2 zero-sum game's mixed strategy + value (EQ G1.6-G1.7)
import numpy as np
A = np.array([[ 1.0, -1.0],    # Matching Pennies, row payoffs
              [-1.0,  1.0]])
# First check for a pure saddle point.
maximin = A.min(axis=1).max()   # best worst-case row
minimax = A.max(axis=0).min()   # best worst-case column
print(f"pure maximin = {maximin:+.3f}, pure minimax = {minimax:+.3f}")
if np.isclose(maximin, minimax):
    print("saddle point exists -> pure value =", maximin)
else:
    a, b, c, d = A.ravel()
    denom = a - b - c + d        # EQ G1.7 denominator
    p = (d - c) / denom          # P(row plays top)
    q = (d - b) / denom          # P(col plays left)
    v = (a*d - b*c) / denom      # value of the game
    print("no saddle -> mix needed")
    print(f"row mix p* = ({p:.3f}, {1-p:.3f})")
    print(f"col mix q* = ({q:.3f}, {1-q:.3f})")
    print(f"value v* = {v:+.3f}")
    # sanity: with these mixes the column is indifferent (both columns equal)
    col_payoffs = np.array([p, 1-p]) @ A
    print("row's mix makes columns equal:", np.round(col_payoffs, 6))
1.5 Mixed strategies Matching Pennies has no pure equilibrium — any pure choice you make, the opponent can exploit. The escape is to randomize. A mixed strategy for player \(i\) is a probability distribution \(\sigma_i\) over their actions; a pure strategy is the degenerate case that puts all mass on one action. Payoffs become expected payoffs, and the strategy space is now the simplex \(\Delta(S_i)\) — for two actions, just the line segment of probabilities \((p, 1-p)\). The key to solving for a mixed equilibrium is the indifference principle: if a player is randomizing between several actions in equilibrium, every action they put positive weight on must yield the same expected payoff. Otherwise they would shift all their probability to the better one — so the opponent must mix precisely so as to leave them indifferent.
This flips the intuition inside out: your mixing probabilities are pinned down by making the other player indifferent, not yourself. For a 2×2 zero-sum game with row-payoff matrix \(A = \begin{psmallmatrix} a & b \\ c & d \end{psmallmatrix}\) and no saddle point, the indifference conditions give closed forms. The row player mixes with probability \(p\) on the top row so that the column player's two columns pay equally; the column player mixes with \(q\) on the left column likewise: EQ G1.7 — 2×2 ZERO-SUM MIXED SOLUTION $$ p^\star = \frac{d - c}{a - b - c + d}, \qquad q^\star = \frac{d - b}{a - b - c + d}, \qquad v = \frac{a\,d - b\,c}{a - b - c + d} $$ \(p^\star\) is the probability the row player puts on the top row; \(q^\star\) the probability the column player puts on the left column; \(v\) the value of the game. The shared denominator \(a - b - c + d\) is guaranteed to be nonzero whenever no saddle point exists. The converse does not hold: a game with a saddle point can also have a nonzero denominator, and there the formulas should not be used. For Matching Pennies (\(a = d = 1,\; b = c = -1\)): denominator \(= 1 - (-1) - (-1) + 1 = 4\), so \(p^\star = (1-(-1))/4 = 0.5\), \(q^\star = 0.5\), and \(v = (1 - 1)/4 = 0\). Each side plays heads and tails with equal probability, and the game is fair. The mixed equilibrium has a striking robustness: at \((p^\star, q^\star)\), neither player can be exploited, because each has made the other indifferent across all their options. There is nothing to grab. This is the precise sense in which a randomized strategy can be safer than any deterministic one — an idea that reappears, dressed differently, in adversarial robustness and in the stochastic policies of reinforcement learning. INSTRUMENT G1.3 — MIXED-STRATEGY SIMPLEX MATCHING PENNIES · EXPECTED PAYOFF vs ROW MIX p · EQ G1.7 ROW MIX p = P(top row) 0.50 PAYOFF IF COL PLAYS LEFT — PAYOFF IF COL PLAYS RIGHT — ROW'S GUARANTEE (min) — The two lines are the row player's expected payoff against a pure-Left and a pure-Right column, as a function of their own mix \(p\).
The column player will always exploit you down to the lower of the two lines — your guarantee is the mint curve (the lower envelope). It peaks exactly where the lines cross, at \(p^\star = 0.5\), giving value \(v = 0\). Drag \(p\) away from \(0.5\) and watch your guarantee fall: any tilt hands the opponent something to exploit. The crossing point is the indifference principle made visible. In Matching Pennies (row-payoff matrix \(\begin{psmallmatrix} +1 & -1 \\ -1 & +1 \end{psmallmatrix}\)), what probability does each player assign to each of their two actions at the unique mixed Nash equilibrium? (Give the probability of a single action, e.g. heads.) By EQ G1.7 with \(a=d=1,\ b=c=-1\): denominator \(= 1-(-1)-(-1)+1 = 4\), so \(p^\star = (d-c)/4 = (1-(-1))/4 = 2/4 = \) 0.5, and \(q^\star = 0.5\) by the same arithmetic. Each action — heads and tails — is played with probability 0.5. Any deviation from 0.5 lets the opponent tilt their own mix to exploit the imbalance, so 0.5 is the only unexploitable choice. A 2×2 zero-sum game has row-payoff matrix \(\begin{psmallmatrix} a & b \\ c & d \end{psmallmatrix} = \begin{psmallmatrix} 4 & 0 \\ 1 & 3 \end{psmallmatrix}\) and no saddle point. Using EQ G1.7, what is the row player's equilibrium probability \(p^\star\) of playing the top row? (Give a decimal.) Denominator \(= a - b - c + d = 4 - 0 - 1 + 3 = 6\). Then \(p^\star = \dfrac{d - c}{6} = \dfrac{3 - 1}{6} = \dfrac{2}{6} = \dfrac{1}{3} \approx \) 0.333. (As a check, the value is \(v = \dfrac{ad - bc}{6} = \dfrac{12 - 0}{6} = 2\), comfortably between the pure maximin of \(1\) and minimax of \(3\).) NEXT One-shot games answer "what is stable?" — but life is repeated, and repetition changes everything. When the Prisoner's Dilemma is played again and again, cooperation can become rational through the shadow of the future, and entire strategies (tit-for-tat, grim trigger) emerge that have no meaning in a single round. 
Chapter 02 takes up repeated games, the Folk Theorem, and the cooperative side of game theory. 1.R References von Neumann, J. & Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press — the founding text; normal-form games (EQ G1.1), the minimax theorem (EQ G1.6), and expected-utility theory. Nash, J. F. (1950). Equilibrium Points in n-Person Games. PNAS 36(1), 48–49 — the existence theorem for the equilibrium concept of EQ G1.4 in general finite games. Nash, J. F. (1951). Non-Cooperative Games. Annals of Mathematics 54(2), 286–295 — the full development of non-cooperative equilibrium, dominance, and the proof via Kakutani's fixed-point theorem. von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 — the original minimax theorem for two-player zero-sum games (§1.4). Osborne, M. J. & Rubinstein, A. (1994). A Course in Game Theory. MIT Press — standard graduate reference for dominance, IESDS, Nash equilibrium, and mixed strategies as presented in §§1.2–1.5. Easley, D. & Kleinberg, J. (2010). Networks, Crowds, and Markets. Cambridge University Press (Ch. 6) — an accessible, freely available treatment of best response, dominant strategies, and equilibrium used to frame this chapter.
## GAME · Repeated & Cooperative Games (https://ai-encyclopedia.com/game-theory/02-repeated-cooperative.html)
GAME THEORY · CHAPTER 02 / 03 Repeated & Cooperative Games In a single encounter, rational self-interest can drive two players to an outcome both of them reject, as the Prisoner's Dilemma demonstrates. Almost no real interaction happens exactly once. Cooperation that is irrational in one shot becomes rational when the game repeats, because the prospect of future rounds changes the calculation. This chapter traces that idea from the one-shot dilemma through tit-for-tat and Axelrod's tournament into evolutionary stability, then turns to cooperative game theory and the question of how a jointly produced payoff should be divided fairly. LEVEL CORE READING TIME ≈ 26 MIN BUILDS ON GT 01 · NASH EQUILIBRIUM INSTRUMENTS IPD ARENA · REPLICATOR · SHAPLEY IN THIS CHAPTER 2.1 The shadow of the future 2.2 The Prisoner's Dilemma 2.3 Tit-for-tat & Axelrod 2.4 Evolutionary games & ESS 2.5 Cooperative games & Shapley 2.R References 2.1 Repeated games & the shadow of the future Chapter 01 left us with an uncomfortable fact: a Nash equilibrium can be jointly terrible. Two players, each best-responding to the other, can lock into an outcome that both would gladly trade away if only they could trust one another. The escape is not a cleverer one-shot argument — there is none. The escape is repetition. When the same players meet again and again, today's defection can be punished tomorrow, and the prospect of that punishment makes cooperation a credible, self-enforcing equilibrium. A repeated game takes a one-shot game — the stage game — and plays it over and over, the same opponents each round. What changes is that a player's strategy is no longer a single action; it is a plan that can condition on history. "Cooperate, but defect forever the moment you defect on me" is only expressible when there is a future to threaten.
The value of that future is governed by a single number, the discount factor \(\delta\): how much a payoff one round from now is worth today. EQ G2.1 — DISCOUNTED PAYOFF OF A REPEATED GAME $$ U \;=\; \sum_{t=0}^{\infty} \delta^{t}\, u_t \;=\; u_0 + \delta\,u_1 + \delta^{2} u_2 + \cdots, \qquad \delta \in [0, 1) $$ \(u_t\) is the stage payoff in round \(t\); \(\delta\) discounts the future. \(\delta\) is the "shadow of the future" — it can be read as patience, or as the per-round probability the relationship continues. A constant stream of \(c\) per round is worth \(c/(1-\delta)\), the same geometric sum that tames returns in reinforcement learning (Vol RL · EQ R1.3). The larger \(\delta\), the more a future punishment outweighs a one-round gain from cheating — which is exactly the lever that makes cooperation rational. This is not a vague hope; it is a theorem. The Folk Theorem says that in an infinitely repeated game with players patient enough (\(\delta\) close to 1), any outcome in which every player does at least as well as their guaranteed minimum (their minmax payoff) can be sustained as a subgame-perfect equilibrium. Cooperation is one such outcome — but so are many others, which is both the power and the embarrassment of the result: repetition explains how cooperation can arise, not which equilibrium will be selected. We will be honest about that gap throughout. The contrast with the one-shot world is stark. In a single play, a strategy is just an action and the only stable thing is mutual defection. In the repeated world, a strategy is a policy over histories and the set of equilibria explodes. The rest of this chapter is about which of those equilibria are robust — to a clever opponent, to mutation, to noise. A relationship yields a constant payoff of \(c = 5\) every round, discounted at \(\delta = 0.75\). Using \(U = c/(1-\delta)\) (EQ G2.1 with constant \(u_t\)), what is the total discounted value \(U\) of cooperating forever? 
\(U = \dfrac{c}{1-\delta} = \dfrac{5}{1 - 0.75} = \dfrac{5}{0.25} = \) 20. The future is worth four rounds of present payoff — patient players have a lot to lose, which is precisely what deters them from cheating. 2.2 The Prisoner's Dilemma The Prisoner's Dilemma is the cleanest specimen of the conflict between individual and collective rationality. Two suspects, held separately, can each cooperate (stay silent) or defect (betray the other). The payoffs are usually written with four letters: Temptation \(T\) (defect on a cooperator), Reward \(R\) (mutual cooperation), Punishment \(P\) (mutual defection), and Sucker \(S\) (cooperate against a defector).
You ↓ / Them →    Cooperate      Defect
Cooperate         R, R = 3, 3    S, T = 0, 5
Defect            T, S = 5, 0    P, P = 1, 1
A game is a Prisoner's Dilemma whenever the payoffs obey two inequalities. The first makes defection dominant; the second makes mutual cooperation the socially better outcome: EQ G2.2 — WHAT MAKES IT A DILEMMA $$ T > R > P > S \qquad\text{and}\qquad 2R > T + S $$ The first chain, \(T > R > P > S\), means that whatever the opponent does, defecting pays more: against a cooperator \(T > R\); against a defector \(P > S\). So defect strictly dominates cooperate — and two rational players both defect, landing on \((P,P) = (1,1)\). The second condition, \(2R > T + S\), ensures that mutual cooperation \((R,R)\) beats the average of taking turns exploiting, so alternating is not a way out. With our numbers: \(5 > 3 > 1 > 0\) ✓ and \(6 > 5\) ✓. The tragedy is exact: the unique Nash equilibrium \((1,1)\) is an outcome both players would pay to escape in favor of \((3,3)\). In one shot, that is the end of the story. There is no trick of reasoning that recovers cooperation, no "if I cooperate maybe they will too" — the dominance argument is airtight, and the chapter on equilibria proved it. Defection is not a failure of intelligence; it is what intelligence prescribes when the game is played exactly once.
The dilemma is real, and it is why the one-shot answer to the headline true/false below is unambiguous. True or false: in a one-shot Prisoner's Dilemma satisfying EQ G2.2, the dominant strategy for a rational player is to defect. (Answer true or false.) Against a cooperating opponent, defect pays \(T = 5\) versus cooperate's \(R = 3\); against a defecting opponent, defect pays \(P = 1\) versus cooperate's \(S = 0\). Defecting beats cooperating in both columns, so it strictly dominates and a rational one-shot player defects. The statement is true. (The whole point of this chapter is that repetition overturns this — but only when the game repeats.) Now repeat the game. Suppose both players adopt Grim Trigger: cooperate every round, but if the opponent ever defects, defect forever after. Is mutual cooperation stable? A player tempted to defect once gains \(T - R\) this round, then is punished with \(P\) instead of \(R\) for the rest of time. Cooperation is an equilibrium exactly when the one-time gain does not beat the discounted stream of forfeited rewards: EQ G2.3 — WHEN DOES COOPERATION HOLD? $$ \underbrace{T - R}_{\text{tempting gain now}} \;\le\; \underbrace{\frac{\delta}{1-\delta}\,(R - P)}_{\text{discounted future loss}} \qquad\Longleftrightarrow\qquad \delta \;\ge\; \frac{T - R}{T - P} $$ Deviating buys \(T-R\) once, but trades a future of \(R\) for a future of \(P\), starting next round — a loss of \((R-P)\) per round discounted by \(\delta/(1-\delta)\). Cooperation survives iff the future loss dominates the present gain. For our payoffs the threshold is \(\delta \ge (5-3)/(5-1) = 0.5\): any pair patient enough to value tomorrow at least half as much as today can sustain cooperation forever. Below \(\delta = 0.5\), the future is too faint a threat and the relationship collapses back to mutual defection. This single inequality is the engine of the whole chapter.
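The equivalence in EQ G2.3 can be read off by comparing the two discounted streams directly. A short derivation, assuming Grim Trigger punishment lasts forever, \(\delta \in [0,1)\), and \(T > P\) as in EQ G2.2:

```latex
\begin{aligned}
\underbrace{\frac{R}{1-\delta}}_{\text{cooperate forever}}
  \;&\ge\; \underbrace{T + \frac{\delta\,P}{1-\delta}}_{\text{defect once, then } P \text{ forever}} \\[4pt]
R \;&\ge\; (1-\delta)\,T + \delta\,P
  && \text{(multiply both sides by } 1-\delta > 0\text{)} \\[4pt]
\delta\,(T - P) \;&\ge\; T - R
  && \text{(collect the } \delta \text{ terms)} \\[4pt]
\delta \;&\ge\; \frac{T - R}{T - P}
  && \text{(divide by } T - P > 0\text{)}
\end{aligned}
```

With \(T = 5,\ R = 3,\ P = 1\) the last line gives \(\delta \ge 2/4 = 0.5\), and at exactly \(\delta = 0.5\) both streams are worth \(6\), so the deviator is indifferent.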
With \(T = 5,\ R = 3,\ P = 1\), what is the minimum discount factor \(\delta\) at which Grim Trigger sustains cooperation, \(\delta = \dfrac{T-R}{T-P}\) (EQ G2.3)? \(\delta = \dfrac{T-R}{T-P} = \dfrac{5-3}{5-1} = \dfrac{2}{4} = \) 0.5. At or above \(\delta = 0.5\) the future is heavy enough that the threat of "defect forever" deters a one-round betrayal; below it, cooperation unravels.
PYTHON · RUNNABLE IN-BROWSER
# Grim-Trigger: when is "cooperate forever" worth more than "defect once"?
import numpy as np
T, R, P, S = 5, 3, 1, 0   # PD payoffs (EQ G2.2)
# value of always cooperating vs deviating once then being punished forever
def coop_value(delta):
    return R / (1 - delta)                  # R, R, R,...
def defect_value(delta):
    return T + delta * P / (1 - delta)      # T now, then P forever
thresh = (T - R) / (T - P)                  # closed-form threshold (EQ G2.3)
print(f"theoretical cooperation threshold: delta >= {thresh:.3f}\n")
print(" delta coop_value defect_value cooperation holds?")
for d in (0.30, 0.49, 0.50, 0.75, 0.95):
    c, x = coop_value(d), defect_value(d)
    print(f" {d:4} {c:9.2f} {x:11.2f} {c >= x}")
print("\ncooperation becomes the better choice exactly at delta = 0.5,")
print("matching (T-R)/(T-P). The shadow of the future has a sharp edge.")
# plot_xy is not defined in this cell; it is assumed to be a plotting helper
# supplied by the site's in-browser runtime.
plot_xy([0.3,0.49,0.5,0.75,0.95], [coop_value(d)-defect_value(d) for d in (0.3,0.49,0.5,0.75,0.95)])
2.3 Tit-for-tat & Axelrod's tournament Theory tells us cooperation can be an equilibrium. It does not tell us which strategy a real population will land on. In 1980 the political scientist Robert Axelrod ran an experiment to find out: he invited game theorists to submit computer strategies for the iterated Prisoner's Dilemma, then played them all against each other in a round-robin and summed each one's score.
The winner — submitted by Anatol Rapoport, and the shortest program entered — was Tit-for-Tat (TFT): cooperate on the first move, then on every move after, simply copy what the opponent did last round. EQ G2.4 — TIT-FOR-TAT $$ a^{\text{TFT}}_{t} = \begin{cases} \texttt{C} & t = 0 \\ a^{\text{opp}}_{t-1} & t \ge 1 \end{cases} $$ One line, no memory beyond the last round. Axelrod distilled its success into four properties. It is nice — never the first to defect; retaliatory — it punishes a defection immediately, so it is not a patsy; forgiving — it returns to cooperation the instant the opponent does, so grudges do not spiral; and clear — its behavior is trivially legible, which lets opponents learn that cooperating pays. Tit-for-tat never beats any single opponent — it can only tie or lose by one defection — yet it wins the tournament, because it is not trying to beat opponents, it is trying to elicit cooperation. That last point is the deep lesson, and it is genuinely counter-intuitive. In a zero-sum world you win by making the other player lose. The iterated PD is not zero-sum: two cooperators each score \(R\) per round, far more than two defectors' \(P\). TFT racks up its total not by exploiting anyone but by spending most of its rounds in the lucrative \((R,R)\) groove with other nice strategies. Strategies that tried to be clever — probing for weakness, defecting "just once" — poisoned their own relationships and scored worse. Greed was self-defeating in a way that only repetition makes visible. TFT is not a flawless oracle, and honesty requires the caveats. First, it is fragile to noise: if a single move is misimplemented — a cooperate flips to a defect by error — two TFT players fall into an endless echo of mutual recrimination, each punishing the other's punishment. Variants like Tit-for-Two-Tats (retaliate only after two defections) and Generous TFT (forgive a defection with some probability) were designed to break that echo. 
Second, TFT's victory is tournament-dependent: change the population of opponents and a different strategy can top the table. Against a field with no exploitable cooperators, the relentless defector ALLD can win; Axelrod's result holds because his fields were rich in nice, retaliatory strategies. The instrument below lets you feel exactly this — and the one after it shows what happens when strategies must survive, not just score.

PYTHON · RUNNABLE IN-BROWSER
# A mini Axelrod tournament: round-robin, sum each strategy's total score
import numpy as np

T, R, P, S = 5, 3, 1, 0
rng = np.random.default_rng(0)

def TFT(me, opp):  return 'C' if not opp else opp[-1]          # copy last move
def ALLD(me, opp): return 'D'                                  # always defect
def ALLC(me, opp): return 'C'                                  # always cooperate
def GRIM(me, opp): return 'D' if 'D' in opp else 'C'           # never forgive
def RAND(me, opp): return 'C' if rng.random() < 0.5 else 'D'   # coin flip (0.5 is an assumed probability)

# Everything from the "< 0.5" above onward is a minimal completion (an assumption,
# not necessarily the site's code) so that the cell runs end to end.
PAY = {'CC': R, 'CD': S, 'DC': T, 'DD': P}   # payoff to the first player
def match(s1, s2, rounds=200):
    h1, h2, t1 = [], [], 0
    for _ in range(rounds):
        a, b = s1(h1, h2), s2(h2, h1)
        h1.append(a); h2.append(b); t1 += PAY[a + b]
    return t1

strats = [TFT, ALLD, ALLC, GRIM, RAND]
for s1 in strats:   # full round-robin, self-play included
    print(f"{s1.__name__:5s} {sum(match(s1, s2) for s2 in strats)}")

INSTRUMENT G2.1 — ITERATED PRISONER'S-DILEMMA ARENA
TIT-FOR-TAT · ALWAYS-DEFECT · RANDOM · ROUND-ROBIN
Controls: ROUNDS PER MATCH (default 200), NOISE (MOVE FLIP %, default 0%). Readouts: TOURNAMENT WINNER, TFT TOTAL, ALLD TOTAL.
Three strategies — Tit-for-Tat, Always-Defect, and Random — meet in a full round-robin (each plays every strategy, itself included), and the bars show total score. With no noise, TFT wins: it ties ALLD to within a single defection, mutually cooperates with itself, and harvests Random. Now drag the noise slider up. A single mistaken move sends two TFTs into a retaliation echo, their score collapses, and the unforgiving defector climbs the table — the precise fragility that motivated Generous-TFT. Cooperation is robust, but not unconditionally.

2.4 Evolutionary game theory & ESS
Axelrod's tournament scored strategies once. But in biology — and in any population of learning agents — a successful strategy does more than score: it reproduces. Strategies that earn more payoff become more common; strategies that earn less die out.
This shift in perspective, from a rational chooser to a population under selection, is evolutionary game theory, introduced by John Maynard Smith and George Price in 1973. It needs no assumption that players are rational — only that fitter strategies spread.

The central solution concept is the Evolutionarily Stable Strategy (ESS): a strategy that, if adopted by the whole population, cannot be invaded by a small group of mutants playing anything else. Formally, an incumbent strategy \(x\) is an ESS if, against itself, it does at least as well as any mutant \(y\) — and, in the knife-edge case where they tie against the incumbent, \(x\) beats \(y\) when the mutant has to play against other mutants.

EQ G2.5 — EVOLUTIONARILY STABLE STRATEGY $$ \text{either } u(x, x) > u(y, x), \quad\text{or}\quad u(x, x) = u(y, x) \ \text{ and } \ u(x, y) > u(y, y) \qquad \forall\, y \neq x $$ \(u(a, b)\) is the payoff to a player using \(a\) against an opponent using \(b\). The first clause says the incumbent strictly out-earns the mutant in the prevailing population (which is almost all incumbents). The second handles the tie: if the mutant matches the incumbent against incumbents, the mutant must do strictly worse against other mutants than the incumbent does, so a rare mutant cluster cannot get a foothold. Every ESS is a Nash equilibrium, but not every Nash equilibrium is an ESS — ESS is a strict refinement that adds robustness to invasion, which is exactly what "stable" should mean.

How a population gets to an ESS is described by the replicator dynamics: the share of each strategy grows in proportion to how much its fitness beats the population average. Strategies above average expand; strategies below average shrink. Every Nash equilibrium is a rest point of this flow (as is every population made up of a single strategy, Nash or not), and every ESS is an asymptotically stable rest point.
EQ G2.6 — REPLICATOR DYNAMICS $$ \dot{x}_i \;=\; x_i\,\big(\, f_i(x) - \bar{f}(x) \,\big), \qquad f_i(x) = (A x)_i, \quad \bar{f}(x) = x^{\top} A x $$ \(x_i\) is the fraction of the population playing strategy \(i\); \(A\) is the payoff matrix (\(A_{ij}\) = payoff to \(i\) against \(j\)); \(f_i\) is strategy \(i\)'s fitness against the current mix and \(\bar f\) the mean fitness. The bracket is strategy \(i\)'s advantage over the average: a strategy with a positive advantage grows, one with a negative advantage shrinks, and the simplex \(\sum_i x_i = 1\) is preserved. Note the form: it is the soft, population-level cousin of a policy-gradient step (Vol RL), pushing mass toward above-average strategies. Run it on a Hawk–Dove game with \(C > V\) and the Hawk share flows to the mixed ESS from any mixed starting population, never to a pure one. (With only two strategies the state is one-dimensional, so the approach is direct rather than a spiral.)

The canonical illustration is Hawk–Dove, Maynard Smith's own example. Animals contest a resource of value \(V\). Hawks escalate and risk injury of cost \(C\); Doves display and retreat. Two Hawks split the resource minus the expected injury, \((V-C)/2\); a Hawk meeting a Dove takes the whole \(V\); two Doves share it, \(V/2\). When \(C > V\), neither pure strategy is stable — a population of all Hawks is invadable by Doves (who avoid the ruinous fights) and vice versa. The unique ESS is a mixed population with a Hawk fraction \(p^{*} = V/C\), and the replicator dynamics converges there from almost any start.

True or false: every Nash equilibrium is automatically an Evolutionarily Stable Strategy. (Answer true or false.)
The implication runs the other way. Every ESS is a Nash equilibrium (an ESS must be a best response to itself), but ESS adds a strict invasion-resistance condition (EQ G2.5) that some Nash equilibria fail — for example, equilibria sustained only by weakly-best-response ties can be invaded by neutral mutants. So the statement is false: ESS is a strict refinement of Nash.
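The two clauses of EQ G2.5 can be checked numerically for Hawk–Dove. The sketch below is illustrative only: the helper names `payoff` and `is_ess_on_grid` are made up here, and testing against a finite grid of mutant mixes is a sanity check under that assumption, not a proof of evolutionary stability. It uses the payoffs from the text with \(V = 2,\ C = 4\).

```python
import numpy as np

def payoff(a, b, A):
    """Expected payoff u(a, b) to mixed strategy a against mixed strategy b."""
    return a @ A @ b

def is_ess_on_grid(x, mutants, A, tol=1e-9):
    """Test the two clauses of EQ G2.5 for incumbent x against each listed mutant."""
    for y in mutants:
        if np.allclose(x, y):
            continue                                  # the condition only concerns y != x
        first = payoff(x, x, A) - payoff(y, x, A)
        if first > tol:
            continue                                  # clause 1: incumbent strictly ahead
        tie = abs(first) <= tol
        if tie and payoff(x, y, A) - payoff(y, y, A) > tol:
            continue                                  # clause 2: tie broken against mutants
        return False                                  # this mutant can invade
    return True

V, C = 2.0, 4.0
A = np.array([[(V - C) / 2, V],
              [0.0,         V / 2]])                  # rows/cols: 0 = Hawk, 1 = Dove

# Hypothetical test set: 101 evenly spaced Hawk fractions between 0 and 1.
grid = [np.array([p, 1.0 - p]) for p in np.linspace(0.0, 1.0, 101)]

mixed = np.array([V / C, 1.0 - V / C])
hawk = np.array([1.0, 0.0])
dove = np.array([0.0, 1.0])

for name, x in [("mixed V/C", mixed), ("pure Hawk", hawk), ("pure Dove", dove)]:
    print(f"{name}: passes ESS check on grid = {is_ess_on_grid(x, grid, A)}")
```

With these payoffs the mixed strategy passes through the tie-breaking clause (every mutant earns the same against it), while both pure strategies fail the first clause, in line with the claim that neither pure population is stable when \(C > V\).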
PYTHON · RUNNABLE IN-BROWSER
# Replicator dynamics on Hawk-Dove: converge to the mixed ESS p* = V/C
import numpy as np

V, C = 2.0, 4.0   # resource value, injury cost (C > V)

# rows/cols: 0 = Hawk, 1 = Dove; A[i,j] = payoff to i against j (EQ G2.6)
A = np.array([[(V - C) / 2, V    ],
              [ 0.0,        V / 2]])
p_star = V / C    # predicted ESS Hawk fraction
print(f"predicted mixed ESS: Hawk fraction p* = V/C = {p_star:.3f}\n")

x = np.array([0.90, 0.10])   # start: mostly Hawks
dt = 0.05
traj = []
for step in range(600):
    f = A @ x                    # fitness of each strategy
    phi = x @ f                  # mean fitness x^T A x
    x = x + dt * x * (f - phi)   # replicator update
    x = x / x.sum()              # renormalize onto the simplex
    if step % 120 == 0:
        print(f"step {step:3d}: Hawk={x[0]:.3f} Dove={x[1]:.3f}")
    traj.append(x[0])

print(f"\nconverged Hawk fraction: {x[0]:.3f} (matches V/C = {p_star:.3f})")
print("neither pure strategy survives: the population settles at the mix.")
plot_xy(list(range(len(traj))), traj)

INSTRUMENT G2.2 — REPLICATOR-DYNAMICS SIMULATOR
HAWK–DOVE · CONVERGENCE TO THE MIXED ESS · EQ G2.6
Controls: RESOURCE VALUE V (default 2.0), INJURY COST C (default 4.0), INITIAL HAWK SHARE (default 0.90). Readouts: PREDICTED ESS p* = V/C, CONVERGED HAWK SHARE, REGIME.
The red curve is the Hawk fraction over time; the mint dashed line is the predicted ESS \(p^{*} = V/C\). Start the population anywhere and watch it flow to the same interior mix — that is what makes the ESS attracting, not merely an equilibrium. Push \(C\) below \(V\) and the math changes character entirely: injuries become cheap, \(V/C\) exceeds 1, and Hawk becomes a pure ESS that sweeps the population — the regime readout flips to tell you which world you are in. No interaction is needed to see the answer: the default \(V=2,\,C=4\) already converges to a 50/50 mix.

2.5 Cooperative games & the Shapley value
Everything so far has been non-cooperative game theory: players choose actions independently and we ask what they will do.
Cooperative (or coalitional) game theory asks a different question. Suppose the players can form binding agreements and pool their efforts — the only question is how to divide the joint payoff fairly. The primitive is no longer a payoff matrix but a characteristic function \(v(S)\): for every possible coalition \(S\) of players, the total value that coalition can guarantee on its own. The most celebrated answer to "what is each player's fair share?" is the Shapley value, introduced by Lloyd Shapley in 1953. Its idea is disarmingly simple: a player's worth is their average marginal contribution across every order in which the coalition could have been assembled. Imagine the players walking into a room one at a time in a random order; each player is credited with how much they add to the value of those already present. Average that credit over all \(n!\) orders, and you have the Shapley value. EQ G2.7 — THE SHAPLEY VALUE $$ \phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\big(n - |S| - 1\big)!}{n!}\,\Big[\, v(S \cup \{i\}) - v(S) \,\Big] $$ \(N\) is the set of all \(n\) players; the sum runs over every coalition \(S\) that excludes \(i\); the bracket \([v(S\cup\{i\}) - v(S)]\) is \(i\)'s marginal contribution to that coalition; and the weight \(\tfrac{|S|!\,(n-|S|-1)!}{n!}\) is the probability that, in a uniformly random arrival order, \(i\) arrives exactly after the players in \(S\). So \(\phi_i\) is precisely the expected marginal contribution over a random ordering of arrivals. The shares always exhaust the grand coalition's value, \(\sum_i \phi_i = v(N)\) — nothing is created or lost in the division. The Shapley value is the unique allocation satisfying four axioms that any reasonable notion of fairness should demand. Efficiency: the shares sum to the total value \(v(N)\). Symmetry: two players who contribute identically to every coalition get equal shares. Null player: a player who adds nothing to any coalition gets zero. 
Additivity: the value of two games played together is the sum of the values played separately. That a single formula is forced by these four innocuous requirements is the result's quiet power — and it is why the Shapley value reaches far beyond economics. That reach is the bridge to the next chapter. In machine learning, SHAP (SHapley Additive exPlanations) treats a model's input features as "players" cooperating to produce a prediction, and uses the Shapley value to attribute the prediction fairly among them — the dominant method for feature attribution in 2026, and the rare interpretability tool with an axiomatic guarantee. The same idea credits data points for a model's accuracy (data valuation) and apportions cost in shared infrastructure. Fair division, it turns out, is a computational problem the field cannot stop needing. True or false: the Shapley value distributes a coalition's total payoff to each player according to their average marginal contribution across all possible orderings of the players. (Answer true or false.) EQ G2.7 is exactly an average of the marginal contributions \(v(S\cup\{i\}) - v(S)\), weighted by the probability of each arrival order — i.e. the expected marginal contribution over a uniformly random ordering. So the statement is true; this is the defining intuition of the Shapley value. Three players have a characteristic function \(v\) with \(v(\{0\})=10,\ v(\{1\})=20,\ v(\{2\})=30,\ v(\{0,1\})=50,\ v(\{0,2\})=60,\ v(\{1,2\})=70,\ v(\{0,1,2\})=100\) (and \(v(\varnothing)=0\)). What is player 0's Shapley value \(\phi_0\)? (Round to two decimals.) Average player 0's marginal contribution over all \(3! = 6\) orders. Orders and player 0's marginal: (0,1,2)→10; (0,2,1)→10; (1,0,2)→\(50-20=30\); (2,0,1)→\(60-30=30\); (1,2,0)→\(100-70=30\); (2,1,0)→\(100-70=30\). Sum \(= 10+10+30+30+30+30 = 140\); divide by 6: \(\phi_0 = 140/6 = \) 23.33. 
(Check: the same arithmetic gives \(\phi_1 = 200/6 = 33.33\) and \(\phi_2 = 260/6 = 43.33\), and the three shares sum to exactly \(100 = v(N)\) — efficiency.)

PYTHON · RUNNABLE IN-BROWSER
# Shapley value of a 3-player game, by averaging over all arrival orders
import numpy as np
from itertools import permutations

# characteristic function v(S): value each coalition can secure (EQ G2.7)
v = {(): 0, (0,): 10, (1,): 20, (2,): 30,
     (0,1): 50, (0,2): 60, (1,2): 70, (0,1,2): 100}
def val(S): return v[tuple(sorted(S))]

n = 3
phi = np.zeros(n)
orders = list(permutations(range(n)))   # all n! = 6 arrival orders
for order in orders:
    present = set()
    for p in order:
        before = val(present)            # value without player p
        present.add(p)
        phi[p] += val(present) - before  # p's marginal contribution
phi /= len(orders)                       # average over orderings

for i in range(n):
    print(f"player {i}: Shapley value phi = {phi[i]:.3f}")
print(f"\nsum of shares = {phi.sum():.3f}")
print(f"grand-coalition v(N) = {val(range(n))}")

INSTRUMENT G2.3 — SHAPLEY-VALUE CALCULATOR
3-PLAYER COALITIONAL GAME · EQ G2.7
Controls (defaults): v(A) = 10, v(B) = 20, v(C) = 30, v(AB) = 50, v(AC) = 60, v(BC) = 70, v(ABC), the grand coalition, = 100. Readouts: φ(A), φ(B), φ(C), Σ φ (should equal v(ABC)).
Set the value every coalition can secure on its own, and the calculator splits the grand-coalition payoff by each player's average marginal contribution across all 6 arrival orders (EQ G2.7). The bars are the three Shapley shares; the sum readout always equals \(v(\text{ABC})\) — that is efficiency, baked in. Try making one player a null player (set every coalition's value identical with and without them) and watch their share fall to zero; make two players symmetric and their bars equalize. The defaults already give the worked-exercise answer: \(\phi = (23.3,\,33.3,\,43.3)\).

NEXT We have seen how games stabilize cooperation and how to divide its rewards fairly. Now watch these ideas leave the blackboard.
Chapter 03 follows game theory into modern AI: self-play that bootstrapped superhuman Go and poker, GANs as a literal two-player minimax, multi-agent reinforcement learning, RLHF as a game between a model and a reward, and SHAP — the Shapley value of this chapter — as the field's leading tool for explaining what a model decided and why.

2.R References
Axelrod, R. (1984). The Evolution of Cooperation. Basic Books — the round-robin tournaments and the four properties of tit-for-tat (EQ G2.4, §2.3); the founding popular text on repeated cooperation.
Axelrod, R. & Hamilton, W. D. (1981). The Evolution of Cooperation. Science 211(4489) — the peer-reviewed account of the iterated-PD tournaments and the evolutionary stability of tit-for-tat.
Maynard Smith, J. & Price, G. R. (1973). The Logic of Animal Conflict. Nature 246 — introduces the Evolutionarily Stable Strategy (EQ G2.5) and the Hawk–Dove game (§2.4).
Shapley, L. S. (1953). A Value for n-Person Games. In Contributions to the Theory of Games II, Princeton University Press — defines the Shapley value (EQ G2.7) and its axiomatic characterization (§2.5).
Maynard Smith, J. (1982). Evolution and the Theory of Games. Cambridge University Press — the book-length development of ESS and the replicator perspective (EQ G2.6).
Friedman, J. W. (1971). A Non-cooperative Equilibrium for Supergames. Review of Economic Studies 38(1) — Grim-Trigger equilibria and an early form of the Folk Theorem behind EQ G2.3 (§2.1).
Lundberg, S. M. & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 30 (SHAP) — the Shapley value applied to machine-learning feature attribution, the §2.5 bridge into Chapter 03.
## GAME · Games in AI (https://ai-encyclopedia.com/game-theory/03-games-in-ai.html)
GAME THEORY · CHAPTER 03 / 03
Games in AI — Self-Play, GANs & Multi-Agent
Supervised learning is bounded by its teacher: a model can only chase the labels a human already wrote. Framing learning as a game lets the agents generate their own curriculum. Each improvement in one player redefines the problem for the other, so the difficulty rises in step with the system's capability. Self-play and adversarial objectives are how AI moved from imitating human experts to surpassing them.
LEVEL ADVANCED READING TIME ≈ 28 MIN BUILDS ON GAME THEORY 01–02 INSTRUMENTS GAN PAYOFF · SELF-PLAY LADDER · COORDINATION
IN THIS CHAPTER 3.1 When learning becomes a game 3.2 GANs as a minimax game 3.3 Self-play — AlphaZero & beyond 3.4 Multi-agent RL 3.5 Mechanism design & robustness § References

3.1 When learning becomes a game
The first two chapters treated games as a model of the world: rational players, payoff matrices, equilibria you solve for. This chapter inverts the relationship. Here the game is a training objective — a structure we impose on optimization so that the loss surface is no longer fixed but co-created by the learner itself. The defining feature is a moving target: the thing a model is trying to beat improves whenever the model does.

Static supervised learning has a ceiling. The objective is a frozen dataset, and the best you can do is fit it; once you match the labels, the gradient goes quiet. A game-based objective never goes quiet, because the opponent (an adversary, a past version of yourself, a population of peers) keeps raising the bar.
Three families dominate modern practice:

Setup | The two sides | What the game produces | Canonical system
Adversarial | generator vs critic | A learned loss function that sharpens as samples improve | GANs
Self-play | agent vs its own past | An automatic curriculum of ever-stronger opponents | AlphaZero
Multi-agent | N agents in a shared world | Emergent strategy, cooperation, and convention | Pluribus, MADDPG

What unifies them is the minimax skeleton from Chapter 01: a value that one party maximizes and another minimizes. The mathematics of saddle points, best responses and equilibria — built for analyzing rational agents — turns out to be exactly the mathematics of training them. The catch, returned to throughout, is that gradient descent was designed to find minima, not saddle points, so these games are notoriously harder to optimize than ordinary losses.

FRAME A useful slogan: supervised learning imitates a teacher; a game manufactures one. Everything below is a different answer to the question "where does the next, slightly-harder training example come from?"

3.2 GANs as a minimax game
A Generative Adversarial Network pits a generator \(G\), which maps noise \(z \sim p_z\) to fake samples \(G(z)\), against a discriminator \(D\), which outputs the probability that a sample is real. \(D\) wants to label reals as 1 and fakes as 0; \(G\) wants \(D(G(z))\) to read as 1. Goodfellow et al. (2014) wrote this as a single two-player zero-sum game on one value function:

EQ G3.1 — THE GAN MINIMAX $$ \min_{G}\,\max_{D}\; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D(G(z))\big)\big] $$ \(D\) maximizes \(V\) (push \(D(x)\to 1\) on reals, \(D(G(z))\to 0\) on fakes); \(G\) minimizes it by fooling \(D\). It is zero-sum in spirit: every bit \(D\) gains, \(G\) loses. The generator never sees the data directly — its only teacher is the gradient flowing back through \(D\).
That is the whole trick: the loss function is itself learned, and it grows more discerning exactly as the generator improves.

Fix \(G\) and ask for the best discriminator. For any \(x\), \(V\) is maximized pointwise, and calculus gives the optimal critic in closed form:

EQ G3.2 — THE OPTIMAL DISCRIMINATOR $$ D^{*}_{G}(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)} $$ where \(p_g\) is the generator's induced distribution. When the generator has won — \(p_g = p_{\text{data}}\) everywhere — the optimal discriminator reads \(D^{*}(x) = \tfrac{1}{2}\) for every input: it can do no better than a coin flip. That fixed point is the Nash equilibrium of the game.

Substituting \(D^{*}_G\) back collapses the game onto a divergence between the two distributions:

EQ G3.3 — VALUE AT THE OPTIMAL DISCRIMINATOR $$ C(G) \;=\; \max_{D} V(D,G) \;=\; -\log 4 \;+\; 2\cdot \mathrm{JSD}\!\big(p_{\text{data}}\,\|\,p_g\big) $$ \(C(G)\) is the game value for a given generator once the discriminator has been played optimally. \(\mathrm{JSD}\ge 0\) is the Jensen–Shannon divergence, zero only when \(p_g = p_{\text{data}}\). So the global minimum of \(C(G)\) is \(-\log 4 \approx -1.386\), attained exactly when the generator matches the data. Training a GAN is, in this idealized analysis, minimizing JSD by a game instead of by an explicit formula — which matters because JSD itself is intractable to compute on real high-dimensional data.

The honest caveats. The clean theory assumes \(D\) is trained to optimality at every step and that both networks have unlimited capacity. Neither holds. In practice GANs are infamous for training instability and mode collapse (the generator parks all its mass on a few outputs that reliably fool the current \(D\)). JSD also saturates — its gradient vanishes when the distributions barely overlap — which motivated Wasserstein GANs (Arjovsky et al., 2017), replacing JSD with an Earth-Mover distance whose gradient stays informative.
The minimax framing is the right mental model; the optimization is genuinely hard, and as of 2026 diffusion and autoregressive models have largely displaced GANs for frontier image and audio synthesis, even as the adversarial idea persists everywhere from super-resolution to robustness training. At the GAN's Nash equilibrium the generator matches the data, so \(p_g(x) = p_{\text{data}}(x)\) for every \(x\). Plug this into EQ G3.2: what value does the optimal discriminator \(D^{*}(x)\) output everywhere? \(D^{*}(x) = \dfrac{p_{\text{data}}}{p_{\text{data}} + p_g} = \dfrac{p_{\text{data}}}{2\,p_{\text{data}}} = \dfrac{1}{2} = \) 0.5. The perfect critic is reduced to a coin flip — it can no longer tell real from fake. Using EQ G3.3, what is the value \(C(G)\) at the global optimum, where \(\mathrm{JSD} = 0\)? (Give the natural-log value of \(-\log 4\), to three decimals.) \(C(G) = -\log 4 + 2\cdot 0 = -\log 4 = -1.38629\ldots \approx \) −1.386. This is the floor of the game; a generator that has matched the data can drive the value no lower. A GAN is a zero-sum (minimax) game between the generator and the discriminator over a single value function \(V(D,G)\). True or false? (Answer true or false.) EQ G3.1 is literally \(\min_G \max_D V(D,G)\): the discriminator maximizes the same quantity the generator minimizes, so it is a two-player minimax (zero-sum) game. The answer is true. PYTHON · RUNNABLE IN-BROWSER # GAN minimax value on a toy: two distributions over 5 discrete bins. # Optimal D is closed-form (EQ G3.2); the game value collapses to JSD (EQ G3.3). 
import numpy as np

p_data = np.array([0.05, 0.15, 0.40, 0.25, 0.15])   # the real distribution

def value(p_g):
    p_g = np.asarray(p_g, float); p_g /= p_g.sum()
    D = p_data / (p_data + p_g)                            # EQ G3.2, optimal critic
    V = (p_data * np.log(D) + p_g * np.log(1 - D)).sum()   # EQ G3.1 at D*
    m = 0.5 * (p_data + p_g)                               # JSD, base-e
    jsd = 0.5*(p_data*np.log(p_data/m)).sum() + 0.5*(p_g*np.log(p_g/m)).sum()
    return V, jsd, D

for name, pg in [("bad ", [0.40,0.30,0.10,0.10,0.10]),
                 ("closer", [0.10,0.20,0.30,0.25,0.15]),
                 ("matched", p_data.copy())]:
    V, jsd, D = value(pg)
    print(f"{name}: value V={V:+.4f} JSD={jsd:.4f} check(-log4+2*JSD)={-np.log(4)+2*jsd:+.4f}")

print(f"\nfloor of the game: -log 4 = {-np.log(4):+.4f} (reached only when p_g == p_data)")
print("at the match, every D* entry equals 0.5:", np.round(value(p_data.copy())[2], 3))

INSTRUMENT G3.1 — GAN PAYOFF VISUALIZER
EQ G3.1–G3.3 · ONE BIN, CLOSED-FORM D*
Control: GENERATOR MASS p_g (vs fixed p_data = 0.50), default 0.20. Readouts: OPTIMAL D*(x), GAME VALUE C(G), JSD(p_data ‖ p_g).
A single point with \(p_{\text{data}} = 0.5\); slide the generator's mass \(p_g\). The curve is the game value \(C(G)\) over all \(p_g\); the dot is where you are. It bottoms out at \(p_g = 0.5\) where \(D^{*} = 0.5\), JSD \(= 0\) and \(C(G) = -\log 4\). Push \(p_g\) to either extreme and the discriminator wins decisively — exactly the regime where \(G\)'s gradient (the slope of the curve) goes flat and learning stalls.

3.3 Self-play — AlphaZero & beyond
The cleanest game-as-curriculum is an agent playing against itself. There is no human data, no teacher, no fixed opponent: the agent's current policy is both the player and the environment it must beat. Because the opponent is a copy of you, the difficulty tracks your skill automatically — a perfectly calibrated curriculum that needs no designer.

AlphaGo Zero (Silver et al., 2017) made this concrete for Go and then chess and shogi (AlphaZero).
A single network \(f_\theta(s) = (\boldsymbol{p}, v)\) outputs a move-probability vector \(\boldsymbol{p}\) and a scalar value \(v \in [-1, 1]\) estimating who wins from state \(s\). Monte-Carlo Tree Search (MCTS) uses the network to look ahead, producing improved move counts \(\boldsymbol{\pi}\); the game is then played to a result \(z \in \{-1, +1\}\). Training pulls the network toward its own searched-and-played behavior: EQ G3.4 — ALPHAZERO'S SELF-PLAY LOSS $$ \ell(\theta) \;=\; (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \boldsymbol{p} \;+\; c\,\lVert \theta \rVert^2 $$ First term: regress the value head toward the actual game outcome \(z\). Second term: cross-entropy pulling the raw policy \(\boldsymbol{p}\) toward the search-improved policy \(\boldsymbol{\pi}\) — the network distills its own lookahead back into its instincts. Third: weight decay. The data is generated entirely by the current network playing itself; tomorrow's training set is produced by today's model, which is what makes the curriculum self-generating. Each generation is stronger, so each generation's self-play games are harder, so the next network must improve to keep winning — a ratchet. The same ratchet drives AlphaStar (StarCraft II), OpenAI Five (Dota 2), and the policy-improvement loops inside RLHF, where a reward model plays the critic. The mechanism that powers Pluribus (Brown & Sandholm, 2019) — superhuman six-player poker — is self-play too, but in an imperfect-information game, so it computes a blueprint via counterfactual regret minimization and refines it with real-time search; its solution concept is approximate Nash rather than a hard win/loss value. The minimal engine behind self-play improvement is a value bootstrap: a state's value is estimated from the values of the states it leads to, and those estimates pull each other toward consistency. 
In a two-player zero-sum game the backup is a minimax — you assume the opponent (your own copy) plays its best reply:

EQ G3.5 — MINIMAX VALUE BACKUP $$ V(s) \;\leftarrow\; \max_{a}\; \Big[\, r(s,a) \;-\; \gamma\, V\big(s'(s,a)\big) \,\Big] $$ The sign flips on the child value because what is good for you is bad for the opponent who moves next (a "negamax" backup, \(\gamma\) the discount). For \(\gamma < 1\) this map is a contraction, and on a finite game tree the backup reaches the exact values after finitely many sweeps even at \(\gamma = 1\); either way the values converge to the game's true minimax values regardless of where you start. No labels were ever supplied — the targets are bootstrapped from the agent's own evolving estimates.

In self-play, the agent's own games against copies of itself produce the training data, so each generation's opponents are as strong as the current model — i.e. self-play generates its own training curriculum. True or false? (Answer true or false.)
There is no external dataset; the agent plays itself, and as it improves, the games it generates get harder, supplying a steadily-harder curriculum with no human in the loop. The answer is true.

PYTHON · RUNNABLE IN-BROWSER
# Tiny self-play value bootstrap on a toy game.
# A 6-node game tree: leaves have true outcomes; internal nodes back up by
# negamax (EQ G3.5). We start from a WRONG guess and let it self-correct.
import numpy as np

# children[node] = list of child indices ([] means a leaf)
children = {0:[1,2], 1:[3,4], 2:[4,5], 3:[], 4:[], 5:[]}
leaf_val = {3:+1.0, 4:-1.0, 5:+1.0}   # zero-sum outcomes from mover's view
gamma = 1.0

V = {n: (leaf_val[n] if n in leaf_val else 0.7) for n in children}   # bad init
print("init:", {k: round(v,3) for k,v in V.items()})

for sweep in range(5):
    for n in [2,1,0]:   # back up internal nodes, leaves to root
        if children[n]:
            V[n] = max(-gamma*V[c] for c in children[n])   # negamax backup
    print(f"sweep {sweep}:", {k: round(v,3) for k,v in V.items()})

best = max(children[0], key=lambda c: -V[c])
print(f"\nroot value V(0) = {V[0]:+.0f}; mover should play toward child {best}.")
print("targets were never labeled -- they bootstrapped from the leaves up.")

INSTRUMENT G3.2 — SELF-PLAY LADDER SIMULATOR
ELO RATCHET · GENERATION-OVER-GENERATION
Controls: LEARNING RATE (ELO GAINED PER WIN-MARGIN), default 32; NOISE (TRAINING VARIANCE), default 20; OPPONENT, either SELF-PLAY (LATEST) or FIXED TEACHER (ELO 1500). Readouts: FINAL ELO, VS FIXED TEACHER, GENERATIONS (40).
Forty training generations. In SELF-PLAY the opponent is always the latest version, so each win nudges the rating up and the bar rises with it — open-ended growth, the AlphaZero ratchet. Switch to FIXED TEACHER and watch the curve flatten the moment the agent matches the teacher: a frozen opponent is a ceiling. The dashed line marks the teacher's strength — self-play sails past it without ever being shown a stronger example.

3.4 Multi-agent reinforcement learning
Two players is the easy case. Multi-agent reinforcement learning (MARL) drops \(N\) learners into a shared environment, each with its own policy \(\pi_i\) and reward \(r_i\). The hard part is structural: from any single agent's view, the others are part of the environment, and they are changing as they learn. The world is non-stationary — the ground that gradient descent assumes is fixed is, in fact, moving under every step.
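A two-action example makes the non-stationarity concrete. The sketch below assumes a matching-pennies payoff matrix (chosen here for illustration, not taken from this section) and a hypothetical helper `action_values`; it shows the expected payoff of each of agent 1's actions, and hence its best response, changing as the other agent's policy drifts.

```python
import numpy as np

# Agent 1's payoffs in matching pennies (an assumed example game):
# rows = agent 1's action (Heads, Tails), columns = agent 2's action (Heads, Tails).
A = np.array([[+1.0, -1.0],
              [-1.0, +1.0]])
ACTIONS = ("Heads", "Tails")

def action_values(q_heads):
    """Expected payoff of each of agent 1's actions when agent 2 plays Heads w.p. q_heads."""
    opponent_policy = np.array([q_heads, 1.0 - q_heads])
    return A @ opponent_policy

# Agent 2's policy drifts as it learns; agent 1 itself is unchanged throughout.
for q in (0.2, 0.4, 0.6, 0.8):
    vals = action_values(q)
    best = ACTIONS[int(np.argmax(vals))]
    print(f"opponent P(Heads)={q:.1f}  value(Heads)={vals[0]:+.2f}  "
          f"value(Tails)={vals[1]:+.2f}  best response: {best}")
```

Nothing about agent 1 changes between rows; only the opponent's policy moves, yet the best response flips from Tails to Heads. Value estimates learned against yesterday's opponent therefore go stale, which is the problem the rest of this section addresses.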
The right object is the Markov (stochastic) game: state transitions and each agent's reward depend on the joint action \((a_1,\ldots,a_N)\). The solution concept is a Nash equilibrium of policies — no agent can improve by unilaterally changing its own. Cooperation, competition and mixtures all live here, distinguished only by how the reward functions relate:

Reward structure | Game | What agents learn | Example
Fully aligned | cooperative | Coordination, role assignment, shared conventions | Team play, traffic
Fully opposed | zero-sum | Robust, minimax-optimal strategies | Go, poker
Mixed | general-sum | Negotiation, reciprocity, social dilemmas | Markets, Diplomacy

The workhorse algorithmic idea is centralized training, decentralized execution (CTDE). During training a critic may see everyone's observations and actions — making its target stationary — while each agent's actor learns a policy that runs on its own local view alone. MADDPG (Lowe et al., 2017) is the canonical instance. The key intuition is that one agent's policy-gradient sign depends on what the others do:

EQ G3.6 — DECENTRALIZED POLICY GRADIENT (CTDE) $$ \nabla_{\theta_i} J_i \;=\; \mathbb{E}\!\left[\, \nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\; Q_i^{\boldsymbol{\pi}}\!\big(s,\, a_1,\ldots,a_N\big) \,\right] $$ Agent \(i\)'s actor depends only on its local observation \(o_i\), but its centralized critic \(Q_i\) is conditioned on the joint action — so it can attribute outcomes correctly even when the cause was a teammate's move. This is what tames non-stationarity at training time. At deployment the critic is discarded and each policy acts on \(o_i\) alone.

The deepest lessons in MARL come from the simplest games. A coordination game can have several equilibria, and which one a population lands on is a matter of risk and history, not just payoff. The textbook case is the Stag Hunt: hunting a stag together pays best but only if your partner also commits; hunting hare is a safe solo payoff.
There are two pure Nash equilibria — (stag, stag), which is payoff-dominant, and (hare, hare), which is risk-dominant — and learners frequently converge to the safe-but-worse one. EQ G3.7 — STAG HUNT & RISK DOMINANCE $$ \begin{array}{c|cc} & \text{Stag} & \text{Hare}\\\hline \text{Stag} & (4,4) & (0,3)\\ \text{Hare} & (3,0) & (3,3) \end{array} \qquad \mathbb{E}[\text{Stag}] = 4q,\;\; \mathbb{E}[\text{Hare}] = 3 $$ With partner probability \(q\) of choosing Stag, hunting stag beats hunting hare only when \(4q > 3\), i.e. \(q > 0.75\). So you must believe your partner cooperates more than three-quarters of the time before cooperation is rational. (stag, stag) earns more but demands trust; (hare, hare) is safe. This single threshold — not a payoff comparison — is why coordination is hard, and why learned conventions and communication matter. In the Stag Hunt of EQ G3.7, your partner plays Stag with probability \(q\). Your expected payoff is \(4q\) for Stag and \(3\) for Hare. At what \(q\) are you exactly indifferent (the threshold above which cooperating is rational)? Set \(4q = 3\): \(q = 3/4 = \) 0.75. Below this you should defect to Hare; above it, hunt Stag. Cooperation requires believing your partner cooperates more than 75% of the time. INSTRUMENT G3.3 — MULTI-AGENT COORDINATION TOY STAG HUNT · BEST-RESPONSE DYNAMICS · EQ G3.7 INITIAL FRACTION HUNTING STAG 0.60 EXPLORATION (MUTATION RATE) 0.03 BASIN THRESHOLD q* 0.75 CONVERGES TO — FINAL STAG FRACTION — A population plays Stag Hunt and each round shifts toward the better reply (replicator-style best response). The basin boundary sits at \(q^{*} = 0.75\): start above it and the population climbs to the payoff-dominant all-Stag equilibrium; start below and it collapses to the risk-dominant all-Hare. Set the initial fraction near 0.75 to feel the knife-edge — a tiny change in starting beliefs flips the entire outcome. That sensitivity is the core difficulty of cooperative MARL. Where the field actually is (2026). 
MARL works well in two-team zero-sum settings (it inherits self-play's stability) and in tightly cooperative ones with CTDE. General-sum, partially-observable, many-agent settings remain hard: equilibria may not be unique or even exist in tractable form, credit assignment across agents is brittle, and emergent behavior is difficult to specify or guarantee. The standout recent result, Meta's CICERO playing Diplomacy, needed to fuse a planning engine with a language model precisely because raw self-play does not by itself produce the negotiation and trust-building a mixed-motive game demands. 3.5 Mechanism design & adversarial robustness Two more places where the game frame is load-bearing — one about designing games, one about defending against them. Mechanism design: the inverse game Ordinary game theory takes the rules as given and predicts behavior. Mechanism design runs the arrow backward: choose the rules so that self-interested play produces the outcome you want. It is the theory behind auctions, voting, and — increasingly — AI training. RLHF is a mechanism: the reward model is an incentive structure designed so that maximizing it yields helpful behavior, and reward hacking is what happens when the mechanism is mis-specified and the agent finds an unintended winning strategy. A central result is incentive compatibility — make truth-telling a dominant strategy — exemplified by the second-price (Vickrey) auction, where bidding your true value is optimal no matter what others do. Adversarial robustness: the game against your inputs A deployed model faces an implicit adversary: an attacker choosing the worst input within a small budget. 
Training a model to survive this is, again, a minimax game — but now the inner maximizer perturbs the data, not a network: EQ G3.8 — ADVERSARIAL TRAINING (ROBUST OPTIMIZATION) $$ \min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\, \max_{\lVert \delta \rVert_p \le \epsilon}\; L\big(f_\theta(x + \delta),\, y\big) \,\Big] $$ The inner \(\max\) (Madry et al., 2018) finds the most damaging perturbation \(\delta\) inside an \(\epsilon\)-ball; the outer \(\min\) hardens \(\theta\) against it. It is a self-generating curriculum of hard examples — the model manufactures its own worst case at every step. The persistent caveat: robustness usually costs clean accuracy, and an \(\ell_p\)-ball is a narrow proxy for the open-ended threats a real deployment faces. The same shape recurs across modern safety work: red-teaming a model is an adversary searching for a prompt that breaks it; constitutional and debate-style training pit models against each other to surface flaws; GAN-style discriminators reappear as learned detectors. The lesson of the chapter, stated once more: whenever you want a system to be robust, train it against an adversary that improves alongside it. A fixed test set is a teacher with a ceiling; a learning opponent is a teacher without one. NEXT You have now seen the through-line of the whole volume: from the rational agents of Chapter 01, to repeated cooperation in Chapter 02, to games as the engine of modern AI here. The minimax skeleton that began as a way to analyze strategic behavior turned out to be the way to create it. Return to the Index to branch into the deep-learning and reinforcement-learning volumes where these games are implemented at scale. 3.R References Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014). Generative Adversarial Networks. NeurIPS — the minimax game of EQ G3.1–G3.3. Silver, D. et al. (2017). Mastering the game of Go without human knowledge. 
Nature 550 — AlphaGo Zero / AlphaZero self-play (EQ G3.4). Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science 365 — Pluribus, self-play in imperfect information. Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein GAN. ICML — replacing JSD with an Earth-Mover distance for stable gradients. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. & Mordatch, I. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS — MADDPG and the CTDE gradient of EQ G3.6. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR — adversarial training as the robust min-max of EQ G3.8. Meta FAIR Diplomacy Team et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning (CICERO). Science 378 — mixed-motive multi-agent play with negotiation.
========================================================================
TIME SERIES & ECONOMETRICS
========================================================================
## TIME · Time Series Fundamentals (https://ai-encyclopedia.com/timeseries/01-fundamentals.html)
Time Series Fundamentals — AI Encyclopedia AI // ENCYCLOPEDIA / TIME SERIES / 01 / FUNDAMENTALS INDEX NEXT: 02 ARIMA → TIME SERIES & ECONOMETRICS · CHAPTER 01 / 06 Time Series Fundamentals Most models assume the rows are interchangeable, so shuffling them loses nothing. Attach a clock and that assumption fails: yesterday shapes today, the order is the signal, and ordinary error bars understate uncertainty. A time index breaks the i.i.d. assumption every other model relies on, and stationarity is the weaker condition that replaces it. LEVEL INTRO READING TIME ≈ 24 MIN BUILDS ON STATS 01–03 INSTRUMENTS DECOMPOSER · ACF/PACF · RANDOM WALK IN THIS CHAPTER 1.1 Trend, seasonality & noise 1.2 Stationarity & why it matters 1.3 Autocorrelation — ACF & PACF 1.4 White noise & the random walk 1.5 Differencing & transforms 1.R References 1.1 Trend, seasonality & noise A time series is a sequence of observations indexed by time, \(y_1, y_2, \ldots, y_T\), where the index is not a label but a coordinate: \(y_t\) and \(y_{t+1}\) are neighbours, and that adjacency carries information. The first reflex of the field is to read the series as a sum of structured parts plus what is left over. The classical decomposition is additive: EQ T1.1 — ADDITIVE DECOMPOSITION $$ y_t \;=\; T_t \;+\; S_t \;+\; R_t $$ \(T_t\) is the trend-cycle — the slow drift (a growing user base, a warming climate); \(S_t\) is the seasonal component — a pattern that repeats every \(m\) steps (weekly traffic, yearly retail); \(R_t\) is the remainder — everything the first two cannot explain, ideally structureless noise. When the seasonal swings grow with the level of the series, a multiplicative form \(y_t = T_t \times S_t \times R_t\) fits better — and taking logs turns multiplication back into the additive form above, the first hint that a transform can simplify structure. 
This split is descriptive, not causal: it is a lens, and choosing additive versus multiplicative, or the seasonal period \(m\), is a modelling decision you make by looking. The remainder \(R_t\) is the part we actually want to be boring. If \(R_t\) still wiggles in a predictable way — if knowing \(R_{t-1}\) helps you guess \(R_t\) — then the decomposition left structure on the table, and the chapters that follow (ARIMA, ETS, GARCH) exist to mop it up. A note on honesty. The classical additive split assumes the trend is smooth and the season has a fixed period and shape. Real series violate both — holidays move, regimes shift, the period itself drifts. Robust modern decompositions (STL, the loess-based method) allow the seasonal shape to evolve and resist outliers; treat any decomposition as a hypothesis to check, not a fact to trust. Under the additive model (EQ T1.1), at a given month the trend is \( T_t = 100 \), the seasonal term is \( S_t = 25 \), and the remainder is \( R_t = -5 \). What is the observed value \( y_t \)? The additive decomposition simply sums the parts: \( y_t = T_t + S_t + R_t = 100 + 25 + (-5) = \) 120. Each component pulls the level up or down; the remainder is the small correction the structured terms missed. INSTRUMENT T1.1 — TIME-SERIES DECOMPOSER COMPOSE T + S + R · EQ T1.1 TREND SLOPE 0.40 SEASONAL AMPLITUDE 8 NOISE LEVEL σ 3.0 OBSERVED RANGE — SEASONAL PERIOD m 12 REMAINDER VARIANCE — Four stacked panels: the observed series on top, then the three components that built it — trend, seasonal, remainder. Push the trend slope negative to watch the whole series tilt down; the seasonal panel never moves, because season is independent of level in the additive model. Crank noise up and the remainder panel fills with hash while the observed series gets ragged — that hash is exactly the \(R_t\) the next chapters try to model. With noise at zero, the observed series is a clean sum of two smooth curves: a perfect, and unrealistic, world. 
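The additive split of EQ T1.1 can be sketched in a few lines of NumPy. This is a minimal illustration of the classical moving-average approach, not the code behind the instrument: the function name `classical_additive_decompose` is hypothetical, and the synthetic series simply reuses the instrument's default settings (slope 0.40, amplitude 8, noise σ = 3.0, period m = 12) as assumptions.

```python
import numpy as np

def classical_additive_decompose(y, m):
    """Split y into trend, seasonal and remainder so that y = T + S + R (EQ T1.1).

    Hypothetical helper for illustration: centered moving-average trend,
    phase-averaged seasonal term, remainder as whatever is left over.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Centered moving average spanning one full period.
    # For even m the two end points get half weight (a "2 x m" average).
    if m % 2 == 0:
        w = np.r_[0.5, np.ones(m - 1), 0.5] / m
    else:
        w = np.ones(m) / m
    half = len(w) // 2
    trend = np.full(n, np.nan)            # undefined within half a period of either end
    trend[half:n - half] = np.convolve(y, w, mode="valid")
    # Seasonal term: mean detrended value at each phase, shifted to average zero.
    detrended = y - trend
    phase_means = np.array([np.nanmean(detrended[p::m]) for p in range(m)])
    phase_means -= phase_means.mean()
    seasonal = phase_means[np.arange(n) % m]
    remainder = y - trend - seasonal      # NaN wherever the trend is undefined
    return trend, seasonal, remainder

# Synthetic series built from the three parts (assumed values, see lead-in).
rng = np.random.default_rng(0)
t = np.arange(120)
m = 12
y = 100 + 0.40 * t + 8 * np.sin(2 * np.pi * t / m) + rng.normal(0, 3.0, t.size)

trend, seasonal, remainder = classical_additive_decompose(y, m)
inner = ~np.isnan(trend)
print("seasonal amplitude estimate:", round(float(seasonal.max()), 2))
print("remainder std (interior):   ", round(float(remainder[inner].std()), 2))
```

The moving average spans exactly one period, so a fixed seasonal pattern averages out of the trend estimate. That is also why this sketch inherits the limits named in the note on honesty above: it assumes a fixed period and a fixed seasonal shape.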
1.2 Stationarity & why it matters

Here is the assumption almost every classical model needs, and the one a clock loves to break. A series is (weakly) stationary if its statistical character does not depend on when you look at it. Concretely, three things must hold for all \(t\) and all lags \(k\):

EQ T1.2 — WEAK (COVARIANCE) STATIONARITY
$$ \mathbb{E}[y_t] = \mu \;\;(\text{constant}), \qquad \mathrm{Var}(y_t) = \sigma^2 \;\;(\text{constant}), \qquad \mathrm{Cov}(y_t,\, y_{t+k}) = \gamma_k \;\;(\text{depends on } k \text{ only}) $$
The mean is flat, the variance is flat, and the covariance between two points depends only on the gap \(k\) between them, never on their absolute position. A series with a trend fails the first condition; a series whose swings widen over time fails the second; a series with a moving seasonal pattern fails the third. Stationarity is what lets the past stand in for the future — if the rules of the game keep changing, a model fit on history is estimating a target that no longer exists.

Why is this the load-bearing assumption? Independent-and-identically-distributed (i.i.d.) data is the comfortable world of the rest of this encyclopedia: each row drawn fresh from one fixed distribution, so a sample average converges to the truth and a single split estimates generalization (the holdout logic of MLOPS · §1.1). A time series is emphatically not i.i.d. — the points are dependent by construction. Stationarity is the weaker substitute: it does not require independence, only that the dependence structure be stable over time. That stability is enough to make estimation and forecasting well-posed.

| Series | Violates | Stationary? | Fix (§1.5) |
|---|---|---|---|
| Linear upward trend | constant mean | no | difference once |
| Variance grows with level | constant variance | no | log / Box–Cox |
| Seasonal sales | const. mean & \(\gamma_k\) | no | seasonal difference |
| White noise | nothing | yes | already there |
| Stable AR(1), \(\lvert\phi\rvert<1\) | nothing | yes | already there |

Strict vs weak.
The definition above is weak (second-order) stationarity — it constrains only the first two moments. Strict stationarity asks that the entire joint distribution be time-invariant, a much stronger demand. For Gaussian processes the two coincide, which is why the weak form is the working definition in practice. Most of forecasting lives on the assumption that, after some transform, the series is weakly stationary.

A company's monthly revenue grows steadily year after year along a clear upward trend. Is that raw revenue series stationary in the sense of EQ T1.2? (Answer yes or no.)

No. A persistent upward trend means \(\mathbb{E}[y_t]\) climbs with \(t\) — the mean is not constant, so the first condition of EQ T1.2 fails. The series is not stationary; differencing it (§1.5) removes the trend and usually restores stationarity.

1.3 Autocorrelation — ACF & PACF

If the points are dependent, the natural question is: how dependent, and at what range? The autocorrelation function (ACF) answers it by correlating the series with a delayed copy of itself. At lag \(k\) it is the covariance \(\gamma_k\) from EQ T1.2, normalized by the variance so it lives in \([-1, +1]\):

EQ T1.3 — THE AUTOCORRELATION FUNCTION
$$ \rho_k \;=\; \frac{\gamma_k}{\gamma_0} \;=\; \frac{\mathrm{Cov}(y_t,\, y_{t+k})}{\mathrm{Var}(y_t)}, \qquad \hat{\rho}_k = \frac{\sum_{t=1}^{T-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2} $$
\(\rho_0 = 1\) always (a series is perfectly correlated with itself). The plot of \(\hat{\rho}_k\) against \(k\) is a correlogram. Under the null of pure white noise, the estimates scatter inside a band of roughly \(\pm 1.96/\sqrt{T}\) — bars that poke outside it are evidence of real structure. The shape of the ACF is a fingerprint: a slow geometric decay says "autoregressive memory"; a sharp cut-off after a few lags says "moving-average"; a single tall spike at lag \(m\) says "seasonality of period \(m\)".

The ACF has a blind spot.
If today depends on yesterday, then today also correlates with the day before — not directly, but through yesterday. The ACF cannot tell a direct link from a relayed one. The partial autocorrelation function (PACF) closes that gap: \(\alpha_k\) is the correlation between \(y_t\) and \(y_{t-k}\) after removing the linear effect of all the lags in between. It is the direct dependence at range \(k\), with the relayed paths stripped out. EQ T1.4 — ACF / PACF SIGNATURES $$ \text{AR}(p): \quad \text{ACF decays},\;\; \text{PACF cuts off after lag } p; \qquad \text{MA}(q): \quad \text{ACF cuts off after lag } q,\;\; \text{PACF decays} $$ This duality is the classic Box–Jenkins identification rule, and it is why both plots are read together. An AR(\(p\)) process — each value a weighted sum of its own past — shows a PACF that drops to zero past lag \(p\), because once you condition on the first \(p\) lags there is no direct link left. An MA(\(q\)) process — each value a weighted sum of past shocks — is its mirror image. Chapter 02 turns these fingerprints into fitted models. For the workhorse AR(1) process \(y_t = \phi\, y_{t-1} + \varepsilon_t\), the theory is exact and worth memorizing: the ACF is a clean geometric decay, \(\rho_k = \phi^k\), and the PACF is a single spike of height \(\phi\) at lag 1 and exactly zero everywhere after. That pair — exponential ACF, one-spike PACF — is the textbook AR(1) signature, and it is what the next instrument lets you see. An AR(1) process \( y_t = \phi\,y_{t-1} + \varepsilon_t \) has \( \phi = 0.7 \). Using the AR(1) result \( \rho_k = \phi^{k} \), what is its theoretical autocorrelation at lag \( k = 3 \)? For an AR(1), the ACF decays geometrically: \( \rho_3 = \phi^3 = 0.7^3 = 0.7 \times 0.7 \times 0.7 = \) 0.343. Memory fades by a constant factor \(\phi\) per step — the defining shape of an autoregressive correlogram. PYTHON · RUNNABLE IN-BROWSER # Simulate AR(1), then compute and plot its sample ACF (EQ T1.3). 
import numpy as np

rng = np.random.default_rng(0)
phi, T = 0.7, 600
eps = rng.normal(0, 1, T)
y = np.zeros(T)
for t in range(1, T):            # y_t = phi * y_{t-1} + eps_t
    y[t] = phi * y[t-1] + eps[t]
y = y - y.mean()                 # center so the ACF formula is clean

def acf(x, K):                   # sample autocorrelation up to lag K
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x)-k] * x[k:]) / denom for k in range(K+1)])

K = 12
r = acf(y, K)
band = 1.96 / np.sqrt(T)         # +/- white-noise significance band
print(" lag sample ACF theory phi^k")
for k in range(K+1):
    flag = " *" if abs(r[k]) > band and k > 0 else ""
    print(f" {k:3d} {r[k]:8.3f} {phi**k:8.3f}{flag}")
print(f"\nwhite-noise band +/-{band:.3f}; bars marked * are real memory.")
print("note the sample ACF tracks the geometric phi^k decay of an AR(1).")
plot_xy(list(range(K+1)), list(r))   # plot_xy is the site's in-browser plotting helper

INSTRUMENT T1.2 — ACF / PACF EXPLORER · AR & MA SERIES → CORRELOGRAMS · EQ T1.4 (interactive on the site; controls: process AR(1) / MA(1) / WHITE NOISE, coefficient with default 0.70, RESHUFFLE; readouts: process, signature, white-noise band)

Top panel: a simulated realization. Bottom two: its sample ACF and PACF, with the grey \(\pm 1.96/\sqrt{T}\) band — bars inside it are indistinguishable from noise. Pick AR(1) and watch the ACF decay smoothly while the PACF shows one spike and quits (EQ T1.4); flip to MA(1) and the two plots swap roles. WHITE NOISE keeps almost every bar inside the band — the look of a series with no exploitable memory. Drag the coefficient negative to make the correlogram alternate sign, and press RESHUFFLE to feel how much a finite sample wobbles around the theory.

1.4 White noise & the random walk

Two reference processes anchor the whole subject — one the picture of "no structure," the other the most important non-stationary series in practice. White noise is the boring ideal: a sequence of uncorrelated, zero-mean, constant-variance shocks. It is stationary by construction and, crucially, unforecastable beyond its mean.
EQ T1.5 — WHITE NOISE $$ \varepsilon_t \;\sim\; (0,\, \sigma^2), \qquad \mathbb{E}[\varepsilon_t] = 0, \quad \mathrm{Var}(\varepsilon_t) = \sigma^2, \quad \mathrm{Cov}(\varepsilon_t,\, \varepsilon_{t+k}) = 0 \;\; \text{for } k \neq 0 $$ Every autocorrelation past lag 0 is zero, so its ACF is a single spike at the origin and flat thereafter. White noise is the goal, not the enemy: when the residuals of a fitted model look like white noise, you have extracted all the linear structure the data offered. Tools like the Ljung–Box test formalize "do these residuals look white?" by checking whether a batch of autocorrelations is jointly indistinguishable from zero. Now cumulate that noise. A random walk sets each value equal to the previous one plus a fresh independent shock — it is the running sum of white noise, and it is the canonical model for an unpredictable price, a diffusing particle, or any quantity that wanders without an anchor: EQ T1.6 — RANDOM WALK $$ y_t \;=\; y_{t-1} + \varepsilon_t \;=\; y_0 + \sum_{i=1}^{t} \varepsilon_i, \qquad \mathrm{Var}(y_t) = t\,\sigma^2 $$ It is the AR(1) of §1.3 pushed to its boundary, \(\phi = 1\) — a unit root. That single fact is decisive: the variance \(t\sigma^2\) grows without bound, so the constant-variance condition of EQ T1.2 fails and a random walk is not stationary. There is no fixed mean to revert to; a shock today is never forgotten, it is baked permanently into every future value. This is why "the series looks like it has momentum" is so often just a random walk fooling the eye — and why distinguishing a true trend from a unit root (the Dickey–Fuller test, Chapter 03) is one of the field's defining problems. The contested part, stated plainly. 
Whether a given real series — GDP, a stock index, an exchange rate — is "trend-stationary" (a deterministic trend plus stationary noise) or "difference-stationary" (a random walk with drift) is genuinely hard to decide from finite data, and decades of econometrics have been spent arguing specific cases. The two imply very different forecasts and very different long-run behaviour. Unit-root tests give evidence, not certainty; honest practice reports the ambiguity rather than hiding it.

Is a random walk \( y_t = y_{t-1} + \varepsilon_t \) a stationary process? (Answer yes or no.)

No. From EQ T1.6, \(\mathrm{Var}(y_t) = t\,\sigma^2\) grows without bound as \(t\) increases, violating the constant-variance condition of EQ T1.2 — and there is no fixed mean to revert to. A random walk is not stationary; its first difference \(y_t - y_{t-1} = \varepsilon_t\) is white noise, which is.

INSTRUMENT T1.3 — RANDOM WALK vs STATIONARY AR(1) · φ → 1 IS A UNIT ROOT · EQ T1.6 (interactive on the site; controls: AR coefficient φ with default 0.50, NEW SHOCKS; readouts: regime, theoretical Var(y∞), stationary?)

Five independent paths share one set of shocks but differ only in \(\phi\). Down near \(\phi = 0.5\) every path is a tight, mean-reverting AR(1): pulled back toward zero, finite variance \(\sigma^2/(1-\phi^2)\), the dashed envelope holds them in. Slide \(\phi\) toward 1 and the envelope flares open — at exactly \(\phi = 1\) it becomes a random walk, the paths wander off and never come home, and the readout's variance goes to ∞. That divergence at the unit root is the loss of stationarity, made visible. Press NEW SHOCKS to redraw.

1.5 Differencing & transforms to stationarity

So a great many real series are not stationary — and the entire toolkit needs them to be. The fix is a pair of cheap, reversible transforms that attack the two ways stationarity fails: a non-constant mean, and a non-constant variance. The mean problem — trend — is killed by differencing: replace the series with the step-to-step changes.
Define the difference operator \(\nabla y_t = y_t - y_{t-1}\). One difference removes a linear trend; a second difference removes a quadratic one. The payoff is exact for the random walk: EQ T1.7 — FIRST DIFFERENCING $$ \nabla y_t \;=\; y_t - y_{t-1}, \qquad \text{random walk} \;\Rightarrow\; \nabla y_t = (y_{t-1} + \varepsilon_t) - y_{t-1} = \varepsilon_t $$ Differencing a random walk returns pure white noise — the non-stationary unit root is annihilated in one step. A series that needs \(d\) differences to become stationary is called integrated of order \(d\), written \(I(d)\); a random walk is \(I(1)\), white noise is \(I(0)\). That little \(d\) is precisely the "I" in ARIMA (Chapter 02). For seasonal trends, the seasonal difference \(\nabla_m y_t = y_t - y_{t-m}\) does the same job at lag \(m\). Caution: over-differencing injects artificial negative autocorrelation and inflates variance — difference only as much as you must. The variance problem — swings that widen as the series grows — is killed by a variance-stabilizing transform. The log is the everyday choice; the Box–Cox family generalizes it with a single tunable power \(\lambda\), smoothly spanning from "no transform" (\(\lambda = 1\)) through "square root" (\(\lambda = 0.5\)) to "log" (\(\lambda \to 0\)): EQ T1.8 — THE BOX–COX TRANSFORM $$ y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[4pt] \ln y_t & \lambda = 0 \end{cases} \qquad (y_t > 0) $$ Choose \(\lambda\) so the spread of the series stops depending on its level. Because \(\ln\) turns a multiplicative seasonal pattern into an additive one (recall §1.1), the log is also what converts a multiplicative decomposition into the friendly additive form. The standard recipe stacks the two: first stabilize the variance with a transform, then stabilize the mean with differencing — variance before mean, because differencing a heteroscedastic series just relocates the problem. 
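The "variance before mean" recipe can be tried on a synthetic series. The sketch below is only an illustration under stated assumptions: `box_cox` and `spread_by_half` are hypothetical helper names, `box_cox` implements EQ T1.8 directly, and the series is an assumed random walk with drift on the log scale, so that its swings widen with its level.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of EQ T1.8; defined only for strictly positive data."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox needs y > 0")
    if lam == 0:
        return np.log(y)                 # the lambda -> 0 limit
    return (y ** lam - 1.0) / lam

def spread_by_half(x):
    """Standard deviation of the first and second half of x."""
    h = len(x) // 2
    return float(x[:h].std()), float(x[h:].std())

rng = np.random.default_rng(0)
T = 500
# Assumed example: a random walk with drift on the log scale,
# so the raw series grows multiplicatively.
log_level = np.cumsum(0.01 + rng.normal(0, 0.05, T))
y = 100.0 * np.exp(log_level)

raw_steps = np.diff(y)                          # difference only
stabilized_steps = np.diff(box_cox(y, 0.0))     # log first, then difference

print("std of raw differences    (1st half, 2nd half):", spread_by_half(raw_steps))
print("std of log-then-difference (1st half, 2nd half):", spread_by_half(stabilized_steps))
```

Comparing the two halves is a rough check on whether the spread still depends on the level. For this construction the log-then-difference steps are the i.i.d. increments of the log series, so their spread has no reason to drift; differencing the raw series leaves the step size tied to the current level.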
Apply the first-difference operator \(\nabla y_t = y_t - y_{t-1}\) to the series \( [\,2,\ 5,\ 9,\ 14\,] \). What is the last value of the differenced series?

The differences are \(5-2 = 3\), \(9-5 = 4\), \(14-9 = 5\), giving \([\,3,\ 4,\ 5\,]\). Differencing shortens the series by one (you cannot difference the first point), and the last value is 5. Notice the gaps are themselves rising by 1 each step — a hint this series has quadratic curvature that a second difference would flatten.

PYTHON · RUNNABLE IN-BROWSER
# Difference a trending series and watch the variance collapse (EQ T1.7).
import numpy as np

rng = np.random.default_rng(1)
T = 400
trend = 0.5 * np.arange(T)                   # a steady linear climb: non-stationary mean
y = trend + np.cumsum(rng.normal(0, 1, T))   # trend + a random-walk wander on top
d1 = np.diff(y)                              # first difference: nabla y_t = y_t - y_{t-1}
d2 = np.diff(d1)                             # second difference

def stats(name, x):
    print(f"{name:18s} mean {x.mean():8.3f} variance {x.var():12.1f}")

print("level vs differenced series:")
stats("y (level)", y)                        # huge variance: the trend dominates
stats("diff once (d=1)", d1)                 # variance plummets; mean ~ the slope 0.5
stats("diff twice (d=2)", d2)                # flat mean ~0; over-differenced -> var rises again
print("\none difference removes the trend (mean -> the slope, variance collapses);")
print("a SECOND difference over-does it -- variance climbs back. Difference sparingly.")
plot_xy(list(range(len(d1))), list(d1))      # the stationary-looking differenced series (plot_xy is the site's in-browser plotting helper)

NEXT

You now have the vocabulary; ARIMA gives it grammar. Once a series is stationary — variance-stabilized, then differenced \(d\) times — its leftover memory is exactly the AR and MA structure the correlograms revealed. Chapter 02 fuses the three letters: the Integration order \(d\) from this chapter, and the AutoRegression and Moving Average orders \(p\) and \(q\) read off the ACF and PACF, into one of the most widely used forecasting models.
1.R References Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the canonical text; the ACF/PACF identification method (§1.3) and integration order \(d\) (§1.5) are its core. Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press — the graduate-level reference for stationarity, unit roots, and the econometric theory behind §1.2 and §1.4. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts (free online) — the modern practitioner's guide; decomposition (§1.1), STL, and Box–Cox (§1.5). Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the unit-root test that decides random walk vs stationary (§1.4). Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. J. R. Stat. Soc. B 26(2) — the variance-stabilizing power transform of EQ T1.8. Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the portmanteau test for "are these residuals white noise?" (§1.4). Yule, G. U. (1927). On a Method of Investigating Periodicities in Disturbed Series. Phil. Trans. R. Soc. A 226 — the paper that introduced the autoregressive model (§1.3).
## TIME · AR, MA, ARIMA & SARIMA (https://ai-encyclopedia.com/timeseries/02-arima.html)
AR, MA, ARIMA & SARIMA — AI Encyclopedia AI // ENCYCLOPEDIA / TIME SERIES / 02 / ARIMA INDEX NEXT: 03 EXPONENTIAL SMOOTHING → TIME SERIES & ECONOMETRICS · CHAPTER 02 / 06 AR, MA, ARIMA & SARIMA Before neural networks reached forecasting, Box and Jenkins reduced it to a procedure that could be taught. The recipe has three steps: difference the series until it is stationary, read the ACF and PACF to choose orders, then fit autoregressive and moving-average terms. Now a one-line call in every statistics library, it remains the baseline that more elaborate models are measured against. LEVEL CORE READING TIME ≈ 28 MIN BUILDS ON TIME SERIES 01 · STATS 04 INSTRUMENTS ARIMA LAB · AR ROOTS · BOX-JENKINS IN THIS CHAPTER 2.1 Autoregressive (AR) 2.2 Moving-average (MA) 2.3 ARMA & ARIMA 2.4 Seasonal ARIMA 2.5 Box-Jenkins 2.R References 2.1 Autoregressive (AR) models The simplest honest forecast is "tomorrow looks like today, plus a nudge." An autoregressive model formalizes that intuition: regress the series on its own past. An AR of order \(p\) — written AR(\(p\)) — predicts the current value as a weighted sum of the previous \(p\) values plus a fresh shock: EQ T2.1 — AUTOREGRESSIVE MODEL AR(p) $$ y_t \;=\; c \;+\; \phi_1 y_{t-1} \;+\; \phi_2 y_{t-2} \;+\; \cdots \;+\; \phi_p y_{t-p} \;+\; \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0,\sigma^2) $$ \(\phi_1,\dots,\phi_p\) are the AR coefficients, \(c\) a constant tied to the long-run mean (\(\mu = c/(1-\sum\phi_i)\)), and \(\varepsilon_t\) is white noise — zero-mean, constant-variance, serially uncorrelated. The model has memory: a shock at time \(t\) propagates forward through the \(\phi\)'s, decaying geometrically. AR(1) is the workhorse — \(\phi\) is literally the one-step persistence of the series. Persistence is the whole story for AR(1), \(y_t = c + \phi y_{t-1} + \varepsilon_t\). 
A \(\phi\) near \(+1\) means shocks linger (a slow, trending-looking series); a \(\phi\) near \(0\) means the series snaps back to its mean almost immediately (near white noise); a negative \(\phi\) makes it oscillate, flipping sign each step. The catch is stationarity: the process only has a stable mean and variance if its shocks don't compound forever. For AR(1) that means \(|\phi| < 1\); for higher orders the condition lives in the roots of the characteristic polynomial. EQ T2.2 — STATIONARITY: ROOTS OUTSIDE THE UNIT CIRCLE $$ \Phi(z) \;=\; 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p \;=\; 0 \quad\Longrightarrow\quad |z_i| > 1 \;\; \forall i $$ Write the model with the lag operator \(L\) (where \(L\,y_t = y_{t-1}\)) as \(\Phi(L)\,y_t = c + \varepsilon_t\). The process is stationary iff every root of \(\Phi(z)\) lies strictly outside the unit circle — equivalently, every reciprocal root lies inside it. For AR(1), \(1-\phi z = 0\) gives \(z = 1/\phi\), so \(|z|>1\) is exactly \(|\phi|<1\). A root on the circle (\(|z|=1\)) is a unit root — the boundary case of a random walk, which §2.3 differences away. WORKED EXAMPLE ▾ 01 Take AR(2) with \(\phi_1 = 0.5,\ \phi_2 = 0.3\). The characteristic polynomial is \(\Phi(z) = 1 - 0.5z - 0.3z^2\). 02 Solve \(0.3z^2 + 0.5z - 1 = 0\): \(z = \dfrac{-0.5 \pm \sqrt{0.25 + 1.2}}{0.6} = \dfrac{-0.5 \pm 1.204}{0.6}\), giving \(z_1 = 1.174,\ z_2 = -2.840\). 03 Both \(|z_i| > 1\), so the process is stationary. Equivalently \(\phi_1 + \phi_2 = 0.8 < 1\), \(\phi_2 - \phi_1 = -0.2 < 1\), and \(|\phi_2| < 1\) — the three sides of the AR(2) stationarity triangle. RESULT: roots 1.17 and −2.84 — both outside the unit circle → stationary An AR(1) process with no constant is \( y_t = 0.5\,y_{t-1} + \varepsilon_t \). The last observed value is \( y_{t-1} = 10 \). What is the one-step forecast \( \hat{y}_t = 0.5\,y_{t-1} \) (the expected next value, since \( \mathbb{E}[\varepsilon_t] = 0 \))? 
The minimum-mean-squared-error forecast sets the unknown shock to its mean, \( \mathbb{E}[\varepsilon_t] = 0 \), so \( \hat{y}_t = 0.5 \times 10 = \) 5. With \( \phi = 0.5 \) the series gives back half its current deviation each step — fast mean reversion. How do you estimate the \(\phi\)'s from data? The classical route is the Yule-Walker equations, which connect the AR coefficients to the series' autocorrelations. They say: the autocorrelation at lag \(k\) equals the same weighted combination of nearby autocorrelations that the model imposes on the values themselves. EQ T2.3 — YULE-WALKER EQUATIONS $$ \rho_k \;=\; \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}, \quad k = 1,\dots,p \qquad\Longleftrightarrow\qquad R\,\boldsymbol{\phi} = \mathbf{r} $$ \(\rho_k\) is the autocorrelation at lag \(k\); \(R\) is the \(p\times p\) Toeplitz matrix of autocorrelations \(\rho_{|i-j|}\) and \(\mathbf{r} = (\rho_1,\dots,\rho_p)\). Estimate the \(\rho_k\) from the data, plug them in, and solve the linear system \(\boldsymbol{\phi} = R^{-1}\mathbf{r}\) — a closed-form fit with no iteration. For AR(2) this unpacks to \(\phi_1 = \dfrac{\rho_1(1-\rho_2)}{1-\rho_1^2}\), \(\phi_2 = \dfrac{\rho_2 - \rho_1^2}{1-\rho_1^2}\). Maximum likelihood is usually preferred in production, but Yule-Walker is the transparent estimator that shows where the numbers come from. PYTHON · RUNNABLE IN-BROWSER # Fit an AR(2) by Yule-Walker (EQ T2.3) in pure numpy. No statsmodels needed. 
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2 = 0.5, 0.3                   # the TRUE coefficients we will recover
n = 600
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(2, n):                   # simulate the AR(2) process (EQ T2.1)
    y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
y = y - y.mean()                        # center: Yule-Walker works on the mean-removed series

def acf(x, k):                          # sample autocorrelation at lag k
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)

r1, r2 = acf(y, 1), acf(y, 2)
R = np.array([[1.0, r1], [r1, 1.0]])    # Toeplitz matrix of autocorrelations
r = np.array([r1, r2])
phi_hat = np.linalg.solve(R, r)         # phi = R^{-1} r (EQ T2.3)
print(f"sample autocorrelations: rho1={r1:.3f} rho2={r2:.3f}")
print(f"Yule-Walker estimate: phi1={phi_hat[0]:.3f} phi2={phi_hat[1]:.3f}")
print(f"true coefficients: phi1={phi1} phi2={phi2}")
# phi1 + phi2 < 1 is one of the three AR(2) triangle conditions: necessary, not sufficient.
print(f"sum phi (persistence): {phi_hat.sum():.3f} (< 1 is required for stationarity)")

INSTRUMENT T2.1 — AR-COEFFICIENT STABILITY & ROOTS · AR(2) STATIONARITY TRIANGLE · EQ T2.2 (interactive on the site; controls: φ₁ with default 0.50, φ₂ with default 0.30; readouts: status, roots |z|, φ₁ + φ₂)

The mint triangle is the AR(2) stationarity region — its three sides are \(\phi_1+\phi_2<1\), \(\phi_2-\phi_1<1\) and \(\phi_2>-1\). Drag the coefficients and watch the marker: inside the triangle the roots of \(\Phi(z)\) sit outside the unit circle and the process is stable; cross a side and a root crosses the circle, the variance blows up, and the forecast diverges. The lower parabola \(\phi_2 = -\phi_1^2/4\) splits real roots (above) from the complex-root region (below), where the series oscillates with a pseudo-period rather than decaying monotonically.

2.2 Moving-average (MA) models

An AR model remembers past values. A moving-average model remembers past shocks.
MA(\(q\)) writes the current value as the current white-noise shock plus a weighted sum of the last \(q\) shocks:

EQ T2.4 — MOVING-AVERAGE MODEL MA(q)
$$ y_t \;=\; \mu \;+\; \varepsilon_t \;+\; \theta_1 \varepsilon_{t-1} \;+\; \theta_2 \varepsilon_{t-2} \;+\; \cdots \;+\; \theta_q \varepsilon_{t-q} $$
\(\mu\) is the mean, \(\theta_1,\dots,\theta_q\) the MA coefficients, and the \(\varepsilon\)'s are unobserved white-noise shocks. The defining property: an MA(\(q\)) has finite memory. A shock affects the current observation and the next \(q\), then vanishes completely. So an MA process is always stationary (it is a finite sum of stationary terms), and its autocorrelation function cuts off sharply after lag \(q\). That clean cutoff is exactly the fingerprint §2.5 uses to identify \(q\).

The two model families are mirror images, and the mirror is precise. An invertible MA(\(q\)) — one whose MA polynomial \(\Theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q\) also has all roots outside the unit circle — can be rewritten as an infinite-order AR, and a stationary AR(\(p\)) can be rewritten as an infinite-order MA. The duality has a sharp diagnostic consequence:

Model | ACF (autocorrelation) | PACF (partial autocorr.)
AR(p) | tails off (decays geometrically / sinusoidally) | cuts off after lag p
MA(q) | cuts off after lag q | tails off (decays geometrically)
ARMA(p,q) | tails off after lag q | tails off after lag p

This table is the heart of classical model identification. The ACF measures correlation between \(y_t\) and \(y_{t-k}\); the partial ACF measures the same after removing the influence of the intervening lags. A sharp ACF cutoff at lag \(q\) with a slowly tailing PACF screams MA(\(q\)); the reverse screams AR(\(p\)). When both tail off, you are in mixed ARMA territory — and these "cutoffs" are statistical, blurred by sampling noise, so read them as strong hints, not certainties.
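The fingerprint table can be checked numerically. The sketch below is a minimal illustration, not the site's own cell: it assumes the PACF at lag \(k\) is computed as the last coefficient of a Yule-Walker AR(\(k\)) fit (EQ T2.3), and the helper names `sample_acf` and `sample_pacf`, the seed, and the coefficients are all illustrative choices.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag of a series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(max_lag + 1)])

def sample_pacf(x, max_lag):
    """PACF at lag k = last coefficient of the Yule-Walker AR(k) fit."""
    rho = sample_acf(x, max_lag)
    pacf = [1.0]
    for k in range(1, max_lag + 1):
        # k x k Toeplitz matrix of autocorrelations rho_{|i-j|}
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        phi_k = np.linalg.solve(R, rho[1:k + 1])
        pacf.append(phi_k[-1])
    return np.array(pacf)

rng = np.random.default_rng(42)          # illustrative seed
n, max_lag = 5000, 6
e = rng.normal(0.0, 1.0, n + 2)

# AR(2) with phi = (0.5, 0.3), driven by the shocks e
ar = np.zeros(n + 2)
for t in range(2, n + 2):
    ar[t] = 0.5 * ar[t - 1] + 0.3 * ar[t - 2] + e[t]
ar = ar[2:]

# MA(2) with theta = (0.7, 0.4), driven by the same shocks
ma = e[2:] + 0.7 * e[1:-1] + 0.4 * e[:-2]

acf_ar, pacf_ar = sample_acf(ar, max_lag), sample_pacf(ar, max_lag)
acf_ma, pacf_ma = sample_acf(ma, max_lag), sample_pacf(ma, max_lag)
band = 1.96 / np.sqrt(n)                 # rough white-noise band

print(f"noise band: +/-{band:.3f}")
print("lag  AR2-ACF  AR2-PACF   MA2-ACF  MA2-PACF")
for k in range(1, max_lag + 1):
    print(f"{k:3d}  {acf_ar[k]:+.3f}   {pacf_ar[k]:+.3f}    "
          f"{acf_ma[k]:+.3f}   {pacf_ma[k]:+.3f}")
```

Reading down the columns, the AR(2) series should show a slowly decaying ACF with a PACF that drops into the noise band after lag 2, and the MA(2) series the reverse. With a shorter series the cutoffs blur, which is the sampling-noise caveat from the paragraph above.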
The sample autocorrelation function of a series is large at lags 1 and 2, then drops to essentially zero (inside the noise band) from lag 3 onward, while the PACF tails off slowly. This is the fingerprint of an MA(\(q\)) model. What is \( q \)?
An MA(\(q\)) process has autocorrelations that cut off — are exactly zero — for all lags greater than \(q\) (EQ T2.4). The last significant spike is at lag 2, so the memory is two shocks deep: \( q = \) 2.

PYTHON · RUNNABLE IN-BROWSER
# MA(2) fingerprint: its ACF cuts off after lag 2, the PACF tails off (the table).
import numpy as np
rng = np.random.default_rng(1)
theta1, theta2 = 0.7, 0.4
n = 4000
e = rng.normal(0, 1, n + 2)
y = e[2:] + theta1 * e[1:-1] + theta2 * e[:-2]   # MA(2) per EQ T2.4
y = y - y.mean()
def acf(x, k):
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)
# theoretical MA(2) autocorrelations for comparison
denom = 1 + theta1**2 + theta2**2
th = [1.0, (theta1 + theta1*theta2)/denom, theta2/denom, 0.0, 0.0]
print(" lag sample ACF theory ACF")
for k in range(5):
    print(f" {k:2d} {acf(y,k):+.3f} {th[k]:+.3f}")
print("\nACF is large at lags 1-2 then ~0 -> the MA(2) cutoff. q reads straight off.")
plot_xy(list(range(8)), [acf(y, k) for k in range(8)])

2.3 ARMA & ARIMA
Combine the two memories and you get ARMA(\(p,q\)): the value depends on its own past and on past shocks. In lag-operator form the symmetry is plain — an AR polynomial acting on the values equals an MA polynomial acting on the noise:

EQ T2.5 — ARMA(p,q)
$$ \underbrace{\Big(1 - \phi_1 L - \cdots - \phi_p L^p\Big)}_{\Phi(L)}\, y_t \;=\; c + \underbrace{\Big(1 + \theta_1 L + \cdots + \theta_q L^q\Big)}_{\Theta(L)}\, \varepsilon_t $$
\(L\) is the lag operator (\(L^k y_t = y_{t-k}\)). ARMA needs the series to be stationary already: \(\Phi(z)\) must have its roots outside the unit circle (EQ T2.2). Most real series — prices, sales, GDP — are not stationary; they trend or wander. The fix is the "I" in ARIMA.
The I stands for integrated: an ARIMA(\(p,d,q\)) is an ARMA(\(p,q\)) fitted to the series after taking \(d\) differences. Differencing is the operator \(\Delta y_t = y_t - y_{t-1} = (1-L)y_t\); apply it \(d\) times to strip out trend. A series that becomes stationary after \(d\) differences is "integrated of order \(d\)", written \(I(d)\). Most economic series are \(I(1)\): one difference — turning levels into changes — is enough. EQ T2.6 — ARIMA(p,d,q) $$ \Phi(L)\,\underbrace{(1-L)^d\, y_t}_{\text{differenced }d\text{ times}} \;=\; c + \Theta(L)\,\varepsilon_t $$ Three integers fully specify the model: \(p\) AR terms, \(d\) differences, \(q\) MA terms. The familiar special cases all fall out of this one equation: ARIMA(\(p,0,0\)) is plain AR(\(p\)); ARIMA(\(0,0,q\)) is MA(\(q\)); ARIMA(\(0,1,0\)) is \(\Delta y_t = \varepsilon_t\), the random walk; ARIMA(\(0,1,0\)) with a constant is a random walk with drift. Over-differencing is a real hazard — it injects spurious negative autocorrelation and inflates the variance — so difference the minimum needed, checked with a unit-root test (ADF / KPSS), not reflexively. WORKED EXAMPLE ▾ 01 Set \(p=0,\ d=1,\ q=0\). EQ T2.6 becomes \((1)(1-L)\,y_t = \varepsilon_t\), i.e. \(y_t - y_{t-1} = \varepsilon_t\). 02 Rearrange: \(y_t = y_{t-1} + \varepsilon_t\). Each value is the previous value plus an unpredictable shock — the definition of a random walk. 03 Its best forecast is therefore the last observation, \(\hat{y}_{t+1} = y_t\) ("naïve forecast"), and forecast uncertainty grows like \(\sqrt{h}\) with the horizon \(h\) — the variance accumulates because nothing pulls it back. RESULT: ARIMA(0,1,0) ≡ random walk, forecast = last value An ARIMA(0,1,0) model — zero AR terms, one difference, zero MA terms — is exactly a random walk, \( y_t = y_{t-1} + \varepsilon_t \). True or false? (Answer true or false.) With \(p=q=0\) and \(d=1\), EQ T2.6 reduces to \((1-L)y_t = \varepsilon_t\), i.e. 
\(y_t - y_{t-1} = \varepsilon_t\), which rearranges to \(y_t = y_{t-1} + \varepsilon_t\) — the textbook random walk. So the statement is true. (Add a constant and it becomes a random walk with drift.)

PYTHON · RUNNABLE IN-BROWSER
# One-step ARIMA(1,1,1) forecast BY HAND on a tiny series (EQ T2.6).
import numpy as np
y = np.array([10., 12., 11., 14., 16., 15.])   # the observed levels
phi, theta = 0.6, 0.4                          # ARMA(1,1) on the differences
w = np.diff(y)                                 # d=1: work on changes w_t = y_t - y_{t-1}
print("differences w:", w)
# Recover the unobserved shocks e_t by the ARMA(1,1) recursion (start e_0 = 0):
# w_t = phi*w_{t-1} + e_t + theta*e_{t-1} => e_t = w_t - phi*w_{t-1} - theta*e_{t-1}
e = np.zeros(len(w))
for t in range(1, len(w)):
    e[t] = w[t] - phi * w[t-1] - theta * e[t-1]
print("residuals e:", np.round(e, 3))
w_next = phi * w[-1] + theta * e[-1]           # forecast the NEXT difference (E[e_next]=0)
y_next = y[-1] + w_next                        # integrate back: undo the differencing
print(f"\nforecast next change w_hat = {w_next:.3f}")
print(f"forecast next level y_hat = y[-1] + w_hat = {y[-1]:.1f} + ({w_next:.3f})"
      f" = {y_next:.3f}")

INSTRUMENT T2.2 — ARIMA(p,d,q) PLAYGROUND · SIMULATE · DIFFERENCE · FORECAST · EQ T2.6
A fixed white-noise driver builds an ARIMA(\(p,d,q\)) series (grey), then the model forecasts the next 24 steps (mint) with its \(\sqrt{h}\)-growing uncertainty band. Set \(d=0\) and the forecast reverts to the mean; set \(d=1\) and it persists from the last value (the random-walk limit at \(p=q=0\)); set \(d=2\) and a trend extrapolates. Push \(\phi\) toward \(\pm0.9\) to feel persistence vs oscillation, and watch how a higher \(d\) widens the forecast cone — uncertainty that compounds is the price of differencing away a trend.
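The \(\sqrt{h}\) growth of random-walk forecast uncertainty, and the fact that differencing is undone by a cumulative sum, can both be checked in a few lines. This is a Monte Carlo sketch under stated assumptions: Gaussian shocks with \(\sigma = 1\), an arbitrary seed, and illustrative variable names that do not come from the original text.

```python
import numpy as np

rng = np.random.default_rng(7)            # illustrative seed
sigma = 1.0
horizons = np.array([1, 4, 9, 16])
n_paths = 20000

# For ARIMA(0,1,0) the h-step forecast is the last value, so the forecast
# error y_{t+h} - y_t is just the sum of the next h shocks.
shocks = rng.normal(0.0, sigma, size=(n_paths, horizons.max()))
errors = np.cumsum(shocks, axis=1)        # column h-1 holds the h-step error

emp_sd = errors[:, horizons - 1].std(axis=0)
theory_sd = sigma * np.sqrt(horizons)

print("  h   empirical sd   sigma*sqrt(h)")
for h, s_emp, s_th in zip(horizons, emp_sd, theory_sd):
    print(f"{h:3d}   {s_emp:10.3f}   {s_th:10.3f}")

# Differencing and integrating back are inverse operations (the "I" in ARIMA).
y = 100.0 + errors[0]                     # one simulated level path
w = np.diff(y)                            # d = 1: levels -> changes
y_rebuilt = np.concatenate(([y[0]], y[0] + np.cumsum(w)))
print("round trip exact:", np.allclose(y, y_rebuilt))
```

Quadrupling the horizon should roughly double the standard deviation, which is why the forecast cone in the playground widens without ever levelling off when \(d = 1\).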
2.4 Seasonal ARIMA (SARIMA) Monthly sales peak every December; electricity demand cycles every 24 hours; retail repeats every 7 days. A plain ARIMA can chase a trend but it has no machinery for a repeating season. SARIMA bolts a second, seasonal ARIMA onto the first — same AR/I/MA logic, but operating at the seasonal lag \(s\) (12 for monthly-with-yearly-cycle, 7 for daily-with-weekly-cycle): EQ T2.7 — SARIMA(p,d,q)(P,D,Q)ₛ $$ \Phi_p(L)\,\Phi_P(L^s)\,(1-L)^d\,(1-L^s)^D\, y_t \;=\; c + \Theta_q(L)\,\Theta_Q(L^s)\,\varepsilon_t $$ The lowercase \((p,d,q)\) handle the short-range dynamics; the uppercase \((P,D,Q)_s\) handle the seasonal dynamics at multiples of the period \(s\). \((1-L^s)^D\) is seasonal differencing — subtract the value one full season ago (\(y_t - y_{t-s}\)) to remove a stable seasonal pattern, just as \((1-L)^d\) removes trend. The seasonal polynomials \(\Phi_P(L^s),\,\Theta_Q(L^s)\) act only at lags \(s, 2s, \dots\). The classic "airline model", SARIMA(0,1,1)(0,1,1)₁₂, fits a startling range of monthly business series with just two parameters — it is the seasonal baseline to beat. The two layers multiply rather than add, which is what lets one shock leave both a short-range footprint (the next few periods) and a seasonal echo (the same period next year). In practice you almost never need \(D > 1\): one seasonal difference plus one ordinary difference removes both a trend and a yearly cycle, and stacking more differences over-differences just as fast as in the non-seasonal case. The cost of the extra flexibility is parameters — six orders to choose instead of three — which is exactly why automated order selection (§2.5) became indispensable for SARIMA. You fit a SARIMA model to monthly data with a yearly cycle and apply one seasonal difference, \( y_t - y_{t-s} \). What is the seasonal lag \( s \) — i.e. how many steps back does the seasonal difference reach? 
Monthly data with a yearly cycle repeats every 12 observations, so the seasonal period is \( s = \) 12: the seasonal difference subtracts the value from the same month one year earlier, \( y_t - y_{t-12} \). (Daily-with-weekly would be \( s = 7 \); hourly-with-daily, \( s = 24 \).) SARIMA's honest limits. It assumes a single, fixed seasonal period with constant amplitude. Multiple overlapping seasonalities (a daily series with both weekly and yearly cycles), seasonality that grows with the level, or non-integer periods all break it — and that is where TBATS, Fourier-term regressors with ARIMA errors, Prophet, and modern ML forecasters earn their place. SARIMA remains the right first tool for one clean season, and a strong baseline even when it is not the final one. 2.5 The Box-Jenkins methodology Box and Jenkins did not just propose models — they proposed a procedure, an iterative loop that turns a raw series into a fitted forecast. It is the recipe the whole chapter has been building toward, and it has three stages that you cycle until the residuals are clean: EQ T2.8 — THE BOX-JENKINS LOOP $$ \textbf{Identify} \;\longrightarrow\; \textbf{Estimate} \;\longrightarrow\; \textbf{Diagnose} \;\;\xrightarrow{\text{residuals not white?}}\;\; \textbf{back to Identify} $$ Identify — make the series stationary (difference / log-transform; confirm with ADF or KPSS), then read the ACF/PACF (and seasonal lags) to propose orders \((p,d,q)(P,D,Q)_s\). Estimate — fit the coefficients by maximum likelihood. Diagnose — check that the residuals are indistinguishable from white noise (Ljung-Box test, residual ACF); if not, the model missed structure, so revise the orders and loop. The terminal condition is white-noise residuals: when nothing predictable is left in the errors, you have extracted all the linear signal. Stage one — identification — is where the AR/MA fingerprint table from §2.2 does its work. 
Pick orders by competing candidates on an information criterion that rewards fit and penalizes complexity, rather than by eyeballing the ACF alone: EQ T2.9 — AKAIKE INFORMATION CRITERION (AIC) $$ \mathrm{AIC} \;=\; -2\,\ln \hat{L} \;+\; 2k, \qquad k = p + q + P + Q + (\text{constant}) $$ \(\hat{L}\) is the maximized likelihood and \(k\) the number of estimated parameters. The first term rewards goodness of fit; \(+2k\) is the complexity penalty that stops you from adding terms that only chase noise. Lower AIC wins. This is the engine inside auto.arima: it searches over candidate \((p,d,q)(P,D,Q)_s\) orders and keeps the lowest-AIC model. AICc adds a small-sample correction; BIC penalizes complexity harder (\(\ln(n)\,k\)) and prefers sparser models. None of them replaces the white-noise residual check — a low AIC with autocorrelated residuals is still a failed model. The genuinely contested part is whether to trust automation. auto.arima (and Python's pmdarima) made order selection a one-liner, and for well-behaved series it usually finds a sensible model. But it optimizes in-sample fit, can be fooled by outliers and structural breaks, and will happily return a model whose residuals still carry seasonality it failed to difference away. The defensible practice in 2026 is the same as in 1976: let the search propose, then diagnose — plot the residual ACF, run Ljung-Box, and back-test on held-out horizons before shipping. PITFALLS Four ways Box-Jenkins goes wrong: (1) over-differencing — differencing a series that was already stationary injects negative autocorrelation and inflates variance; difference the minimum and confirm with ADF/KPSS. (2) reading ACF/PACF too literally — sampling noise blurs the cutoffs, so a "spike at lag 7" may be chance, not weekly seasonality. (3) trusting AIC over residuals — the lowest-AIC model can still have autocorrelated errors; the residual diagnostics are the real gate. 
(4) forecasting far past the data's regime — ARIMA extrapolates its fitted linear dynamics and is blind to structural breaks, so long-horizon intervals are optimistic. INSTRUMENT T2.3 — BOX-JENKINS IDENTIFICATION GUIDE DECISION TREE FROM ACF / PACF · EQ T2.8 OBSERVED ACF / PACF PATTERN TREND / SLOW-DECAY ACF PACF CUTS OFF ACF CUTS OFF BOTH TAIL OFF SPIKE AT LAG s DIAGNOSIS — SUGGESTED MODEL — — Pick the pattern you see in a stationary series' correlograms and the tree maps it to a model order, drawing a stylized ACF ( mint) and PACF ( blue) so you can match the shape. This is the §2.2 fingerprint table made operational: PACF cutoff → AR, ACF cutoff → MA, both tail off → ARMA, slow ACF decay → difference first, spike at a seasonal lag → add a seasonal term. It is a teaching guide, not a substitute for fitting and diagnosing — real correlograms are noisier than these idealized stems. NEXT ARIMA fits the conditional mean by least squares on lagged values; exponential smoothing weights the past geometrically instead. The two families overlap more than their notation suggests — simple exponential smoothing is an ARIMA(0,1,1) in disguise. Chapter 03: the exponential-smoothing family, from simple to Holt-Winters, the state-space (ETS) formulation that gives it likelihoods and intervals, and when its decaying-memory view beats ARIMA's algebra. 2.R References Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the foundational text that defined the AR/MA/ARIMA/SARIMA framework and the identify-estimate-diagnose loop (EQ T2.8). Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts — the modern, free standard reference; its ARIMA chapter and auto.arima sit behind §2.3–§2.5. Akaike, H. (1974). A New Look at the Statistical Model Identification. 
IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9).
Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the Ljung-Box portmanteau test used in the diagnostic stage to check for white-noise residuals.
Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the Dickey-Fuller unit-root test; its augmented form (ADF) is the test used to decide how many differences \(d\) a series needs.
Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5).
## TIME · Exponential Smoothing & Holt-Winters (https://ai-encyclopedia.com/timeseries/03-exponential-smoothing.html)
TIME SERIES & ECONOMETRICS · CHAPTER 03 / 06
Exponential Smoothing & Holt-Winters
Where ARIMA works through correlations of past errors, exponential smoothing makes a simpler assumption and performs well on it. It weights the recent past more heavily than the distant past, a single idea that still places near the top of forecasting competitions. Three short recurrences for level, trend, and season turn that idea into a method that runs in one pass over the data and forecasts millions of series a day in retail, supply-chain, and energy systems.
LEVEL CORE · READING TIME ≈ 24 MIN · BUILDS ON TIME SERIES 01–02 · INSTRUMENTS SES WEIGHTS · HOLT-WINTERS · α OPTIMIZER
IN THIS CHAPTER 3.1 Simple exponential smoothing 3.2 Holt's linear trend 3.3 Holt-Winters seasonal 3.4 The ETS state-space framework 3.5 Choosing the parameters 3.R References

3.1 Simple exponential smoothing
Start with a series that has no trend and no season — just a level that wanders, buried in noise. A naïve forecast uses only the last value; a long moving average uses many values but weights them all equally, which is plainly wrong: a reading from a year ago should not count as much as yesterday's. Simple exponential smoothing (SES) resolves the tension with one parameter. Maintain a running estimate of the level \(\ell_t\) and, at every new observation, nudge it toward the latest value by a fraction \(\alpha\):

EQ T3.1 — THE SMOOTHING RECURRENCE
$$ \ell_t \;=\; \alpha\, y_t + (1-\alpha)\,\ell_{t-1}, \qquad 0 < \alpha < 1, \qquad \hat{y}_{t+1\mid t} = \ell_t $$
\(\ell_t\) is the smoothed level after seeing \(y_t\); the one-step-ahead forecast is simply that level, and so is the forecast for every horizon (a flat line — SES has no trend).
\(\alpha\) is the learning rate: \(\alpha \to 1\) recovers the naïve "repeat the last value" forecast; \(\alpha \to 0\) freezes the level at its initial estimate, a long-run average. The whole method is this single line, applied once per observation — \(O(n)\) time, \(O(1)\) memory.

WORKED EXAMPLE ▾
01 Take \(\alpha = 0.3\) and a current level \(\ell_{t-1} = 10\). A new observation arrives: \(y_t = 20\).
02 Apply EQ T3.1: \(\ell_t = 0.3 \times 20 + 0.7 \times 10 = 6 + 7\).
03 So \(\ell_t = 13\). The level moved 3 points from 10 toward the new reading of 20, which is exactly \(\alpha = 30\%\) of the \(10\)-point gap.
RESULT: updated level \(\ell_t = 13\); next forecast \(\hat{y}_{t+1} = 13\)

The error-correction form makes the "learning rate" reading explicit. Rearranging EQ T3.1 around the one-step forecast error \(e_t = y_t - \ell_{t-1}\):

EQ T3.2 — ERROR-CORRECTION FORM
$$ \ell_t \;=\; \ell_{t-1} + \alpha\,(y_t - \ell_{t-1}) \;=\; \ell_{t-1} + \alpha\, e_t $$
Read it as gradient descent on squared error with step size \(\alpha\): each forecast miss \(e_t\) pulls the level a fraction \(\alpha\) of the way toward correcting it. This is the same shape as the perceptron and Widrow-Hoff (LMS) update — exponential smoothing is, quite literally, online learning of a moving level, decades before that name existed.

Why "exponential"? Unrolling the recurrence shows the forecast is a weighted average of all past observations, with weights that decay geometrically into the past:

EQ T3.3 — GEOMETRIC WEIGHTING OF THE PAST
$$ \hat{y}_{t+1\mid t} \;=\; \alpha \sum_{k=0}^{t-1} (1-\alpha)^{k}\, y_{t-k} \;+\; (1-\alpha)^{t}\,\ell_0, \qquad \sum_{k=0}^{\infty} \alpha\,(1-\alpha)^{k} = 1 $$
The weight on the observation \(k\) steps back is \(\alpha(1-\alpha)^k\) — largest for the most recent point and shrinking by a constant factor \((1-\alpha)\) each step. The weights are a geometric series that sums to one, so the forecast is a genuine weighted average.
This is the entire idea of the chapter in one line: the past is never thrown away, it just fades. A small \(\alpha\) means a long memory (slow fade); a large \(\alpha\) means a short one. The instrument below draws this decay.

Simple exponential smoothing with \(\alpha = 0.3\). The current level is \(\ell_{t-1} = 10\) and a new observation arrives, \(y_t = 20\). What is the updated level \(\ell_t\)?
EQ T3.1: \(\ell_t = \alpha\,y_t + (1-\alpha)\,\ell_{t-1} = 0.3 \times 20 + 0.7 \times 10 = 6 + 7 = \) 13. The level moves 30% of the way from 10 toward the new reading.

With \(\alpha = 0.3\), what weight does EQ T3.3 place on the observation two steps in the past, \(y_{t-2}\)? (Use \(k = 2\): \(\alpha(1-\alpha)^k\).)
\(\alpha(1-\alpha)^2 = 0.3 \times 0.7^2 = 0.3 \times 0.49 = \) 0.147. Compare \(y_t\)'s weight of \(0.30\) and \(y_{t-1}\)'s of \(0.21\): each step back loses a factor of \(0.7\).

PYTHON · RUNNABLE IN-BROWSER
# Simple exponential smoothing in numpy: fit a level, print fitted vs actual
import numpy as np
rng = np.random.default_rng(0)
# a wandering level (random walk) plus observation noise -- no trend, no season
n = 24
level_true = 50 + np.cumsum(rng.normal(0, 1.2, n))
y = level_true + rng.normal(0, 2.0, n)
alpha = 0.3
ell = y[0]                                   # initialise the level at the first observation
fitted = np.empty(n)
fitted[0] = ell
for t in range(1, n):
    fitted[t] = ell                          # one-step forecast BEFORE seeing y[t] is the old level
    ell = alpha * y[t] + (1 - alpha) * ell   # EQ T3.1 update
sse = np.sum((y[1:] - fitted[1:]) ** 2)
print(f"alpha = {alpha}, one-step SSE = {sse:.2f}, final level = {ell:.2f}")
print(" t actual forecast error")
for t in range(1, 8):
    print(f"{t:2d} {y[t]:7.2f} {fitted[t]:8.2f} {y[t]-fitted[t]:+7.2f}")
plot_xy(list(range(n)), list(y))             # the noisy series; fitted line tracks its level

INSTRUMENT T3.1 — EXPONENTIAL-SMOOTHING EXPLORER GEOMETRIC WEIGHTS · EQ T3.3 · LIVE SMOOTHING α 0.30 WEIGHT ON LAST OBS — EFFECTIVE MEMORY
(½-LIFE) — WEIGHT IN LAST 5 OBS —
Each mint bar is the weight EQ T3.3 places on an observation that many steps in the past; they form a geometric decay that sums to one. Drag α toward 1 and the forecast collapses onto the most recent point (a spike at lag 0 — short memory, twitchy). Drag it toward 0 and the bars flatten into a long, even tail — the method becomes a slow long-run average. The half-life readout, \(\ln 2 / -\ln(1-\alpha)\), is how many steps back the cumulative weight reaches 50%.

3.2 Holt's linear trend method
SES forecasts a flat line, so it lags badly on any series that is climbing or falling: it is forever chasing a level that has already moved on. Holt (1957) added a second smoothed component — a trend \(b_t\), the estimated change per period — updated by its own smoothing parameter \(\beta\). Now two recurrences run in lockstep, and the forecast extrapolates the trend forward:

EQ T3.4 — HOLT'S LINEAR (DOUBLE) SMOOTHING
$$ \begin{aligned} \ell_t &= \alpha\, y_t + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\ b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\ \hat{y}_{t+h\mid t} &= \ell_t + h\, b_t \end{aligned} $$
The level update now smooths toward \(y_t\) but starts from \(\ell_{t-1}+b_{t-1}\) — last level plus where the trend said it would go. The trend update smooths the latest observed slope \((\ell_t - \ell_{t-1})\) against the old trend. The forecast is no longer flat: it is a straight line of slope \(b_t\), projected \(h\) steps out. Set \(\beta = 0\) and \(b_0 = 0\) (a trend frozen at zero) and Holt degenerates back to SES; with \(\beta = 0\) alone the trend is simply held constant at \(b_0\).

One honest caveat: a linear trend projected far into the future is usually too aggressive — real series flatten.
The standard fix is the damped trend of Gardner & McKenzie (1985), which multiplies the trend by a damping factor \(0 < \phi < 1\) so the forecast bends toward a horizontal asymptote:

EQ T3.5 — DAMPED TREND
$$ \hat{y}_{t+h\mid t} \;=\; \ell_t + (\phi + \phi^2 + \cdots + \phi^{h})\,b_t, \qquad \lim_{h\to\infty} \hat{y}_{t+h\mid t} = \ell_t + \frac{\phi}{1-\phi}\,b_t $$
With \(\phi = 1\) this is exactly Holt's undamped line; with \(\phi < 1\) the per-step contribution of the trend shrinks geometrically and the forecast saturates at a finite ceiling. The damped-trend method is one of the most reliable automatic forecasters known — it was the benchmark to beat across the M-competitions, and a hard one.

PYTHON · RUNNABLE IN-BROWSER
# Holt's linear method: vary alpha and beta, forecast h steps ahead (EQ T3.4)
import numpy as np
# a trending series: level rises ~1.5/period with a little noise
n = 30
y = 10 + 1.5 * np.arange(n) + np.array([0,1,-1,2,0,-2,1,3,-1,0,
                                        2,-1,1,0,-2,1,2,-1,0,1,
                                        -1,2,0,1,-2,0,1,-1,2,0], float)
def holt(y, alpha, beta, h=4):
    ell, b = y[0], y[1] - y[0]                        # init: level=y0, trend=first difference
    for t in range(1, len(y)):
        prev = ell
        ell = alpha * y[t] + (1 - alpha) * (ell + b)  # level
        b = beta * (ell - prev) + (1 - beta) * b      # trend
    fc = [ell + (i + 1) * b for i in range(h)]        # straight-line forecast
    return ell, b, fc
print(" alpha beta | final level trend 4-step forecast")
for alpha, beta in [(0.8, 0.2), (0.5, 0.1), (0.3, 0.05)]:
    ell, b, fc = holt(y, alpha, beta)
    print(f" {alpha:4.2f} {beta:4.2f} | {ell:9.2f} {b:6.3f} " + " ".join(f"{v:6.1f}" for v in fc))
print("\nhigher beta -> trend reacts faster to slope changes (and to noise).")
plot_xy(list(range(n)), list(y))

A naming map for the confused. SES is "single" smoothing; Holt is "double"; Holt-Winters (next) is "triple". The labels just count how many recurrences run — one per component you choose to track: level, then trend, then season.
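To see how damping bends the forecast, the sketch below evaluates only the forecast function of EQ T3.5 from a given final level and trend. The function name `damped_forecast` and the numbers (level 100, trend 2, \(\phi = 0.9\)) are illustrative assumptions. A full damped-trend fit would also place \(\phi\) inside the level and trend recurrences, which this sketch does not attempt.

```python
import numpy as np

def damped_forecast(level, trend, phi, horizons):
    """EQ T3.5: level + (phi + phi^2 + ... + phi^h) * trend for each horizon h."""
    h = np.asarray(horizons, dtype=float)
    if phi == 1.0:
        damp = h                                   # undamped: Holt's straight line
    else:
        damp = phi * (1.0 - phi ** h) / (1.0 - phi)  # closed form of the geometric sum
    return level + damp * trend

level, trend = 100.0, 2.0        # hypothetical final state of a fitted model
h = np.arange(1, 41)

fc_holt = damped_forecast(level, trend, 1.0, h)    # phi = 1
fc_damped = damped_forecast(level, trend, 0.9, h)  # phi = 0.9
ceiling = level + 0.9 / (1.0 - 0.9) * trend        # the limit in EQ T3.5

print("  h    Holt   damped")
for k in (1, 5, 10, 20, 40):
    print(f"{k:3d}  {fc_holt[k - 1]:6.1f}  {fc_damped[k - 1]:7.2f}")
print(f"damped ceiling: {ceiling:.1f}")
```

The undamped line keeps climbing by the full trend every step, while the damped forecast approaches its ceiling and effectively stops. That saturation is the behaviour the paragraph above credits for the method's robustness at long horizons.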
3.3 Holt-Winters seasonal method Most operational series breathe on a calendar: weekly retail, daily electricity, monthly tourism. Winters (1960) completed Holt's method by adding a third smoothed component — a vector of \(m\) seasonal indices \(s_t\) (one per position in the cycle, \(m=12\) for monthly, \(m=7\) for daily-of-week), each updated by its own parameter \(\gamma\). The result, Holt-Winters, smooths level, trend, and season simultaneously. There are two flavours, depending on whether seasonal swings are a fixed amount or a fixed fraction of the level. EQ T3.6 — HOLT-WINTERS (ADDITIVE SEASONALITY) $$ \begin{aligned} \ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\ b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\ s_t &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-m} \\ \hat{y}_{t+h\mid t} &= \ell_t + h\, b_t + s_{t+h-m(k+1)} \end{aligned} $$ Compared with Holt (EQ T3.4), the level now smooths the deseasonalised observation \(y_t - s_{t-m}\), and a third recurrence smooths the seasonal index from the detrended residual \(y_t - \ell_t\). The forecast adds back the matching seasonal index, where \(k = \lfloor (h-1)/m \rfloor\) just selects the right slot in the last estimated cycle. The seasonal indices are conventionally normalised to sum to zero each cycle so they do not absorb the level. EQ T3.7 — HOLT-WINTERS (MULTIPLICATIVE SEASONALITY) $$ \ell_t = \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1}+b_{t-1}), \qquad s_t = \gamma\,\frac{y_t}{\ell_t} + (1-\gamma)\,s_{t-m}, \qquad \hat{y}_{t+h\mid t} = (\ell_t + h\,b_t)\, s_{t+h-m(k+1)} $$ Here seasonal indices are multipliers around 1 (e.g. December = 1.4× the level), normalised to average one per cycle. Use additive when the seasonal swing is a constant size; use multiplicative when the swing grows with the level — the classic airline-passengers series, whose December peaks balloon as traffic grows, is the textbook case for multiplicative. 
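The additive recurrences of EQ T3.6 translate almost line for line into code. The sketch below is a minimal illustration on a synthetic monthly series. The initialisation (trend from the first two cycle means, seasonal indices from the detrended first cycle), the function name `holt_winters_additive`, and the parameter values are assumptions made for this example; production libraries estimate the initial states and do not normalise or initialise exactly this way.

```python
import numpy as np

def holt_winters_additive(y, m, alpha, beta, gamma, h):
    """Additive Holt-Winters (EQ T3.6) with a simple heuristic initialisation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mean1, mean2 = y[:m].mean(), y[m:2 * m].mean()
    trend = (mean2 - mean1) / m                        # average change per period
    # detrended first cycle -> initial seasonal indices s_0 .. s_{m-1}
    base = mean1 + trend * (np.arange(m) - (m - 1) / 2)
    season = list(y[:m] - base)
    level = mean1 + trend * (m - 1) / 2                # level at the end of cycle 1
    fitted = []
    for t in range(m, n):
        s_old = season[t - m]
        fitted.append(level + trend + s_old)           # one-step forecast of y[t]
        prev = level
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        season.append(gamma * (y[t] - level) + (1 - gamma) * s_old)
    # forecast: reuse the last estimated cycle, slot picked by k = floor((h-1)/m)
    fc = np.array([level + (i + 1) * trend + season[n + i - m * (i // m + 1)]
                   for i in range(h)])
    return np.array(fitted), fc

rng = np.random.default_rng(5)                         # illustrative seed
m, n, h = 12, 72, 12
t_all = np.arange(n + h)
signal = 50 + 0.5 * t_all + 10 * np.sin(2 * np.pi * t_all / m)
y = signal[:n] + rng.normal(0.0, 1.0, n)

fitted, fc = holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.3, h=h)
flat = np.full(h, y[-m:].mean())                       # naive flat benchmark
mae_hw = np.mean(np.abs(fc - signal[n:]))
mae_flat = np.mean(np.abs(flat - signal[n:]))
print(f"12-step MAE vs true signal: Holt-Winters {mae_hw:.2f}, flat {mae_flat:.2f}")
```

Swapping the subtractions and additions of the seasonal term for divisions and multiplications gives the multiplicative form of EQ T3.7, provided the series stays strictly positive.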
Holt's method (EQ T3.4) smooths two components: a level and a trend. Holt-Winters adds a third recurrence. Which component does it add? (one word)
Winters added a seasonal component — the vector of indices \(s_t\) updated by \(\gamma\) in EQ T3.6/T3.7. SES (single) tracks level; Holt (double) adds trend; Holt-Winters (triple) adds season.

A multiplicative Holt-Winters model has level \(\ell_t = 200\), zero trend, and a December seasonal multiplier \(s = 1.4\). What is the one-step December forecast \(\hat{y} = \ell_t \cdot s\), expressed as a multiple of the level (i.e. give \(s\))? Equivalently: the forecast is 280, which is the level times what factor?
\(\hat{y} = \ell_t \cdot s = 200 \times 1.4 = 280\). The factor relative to the level is \(280/200 = \) 1.4 — December runs 40% above the deseasonalised level.

INSTRUMENT T3.2 — HOLT-WINTERS DECOMPOSITION · SEASONAL SERIES · m = 12 · EQ T3.6
The grey line is a synthetic monthly series (rising trend + 12-month season + noise); the mint line is the Holt-Winters one-step fit, and the blue segment past the divider is its 12-step seasonal forecast. Push γ up and the seasonal indices chase every wobble (overfit); push it down and the model holds a stable seasonal shape. Watch the SSE readout: the seasonal recurrence is what lets the fit hug the peaks and troughs an SES line would slice straight through.

3.4 The ETS state-space framework
For forty years exponential smoothing was a bag of recurrences with no probability model behind them — you could forecast, but you could not say how uncertain the forecast was, nor choose a method by a principled criterion. Hyndman, Koehler, Snyder & Grose (2002), extended in book form by Hyndman, Koehler, Ord & Snyder (2008), fixed that by showing every smoothing method is the point forecast of an underlying state-space model with a single source of error. This is the ETS family: Error · Trend · Season.
EQ T3.8 — ETS AS A STATE-SPACE MODEL (additive-error, "innovations" form)
$$ \underbrace{y_t = \ell_{t-1} + b_{t-1} + s_{t-m} + \varepsilon_t}_{\text{measurement}}, \qquad \underbrace{\ell_t = \ell_{t-1} + b_{t-1} + \alpha\,\varepsilon_t,\;\; b_t = b_{t-1} + \beta\,\varepsilon_t,\;\; s_t = s_{t-m} + \gamma\,\varepsilon_t}_{\text{state update}} $$
A single shock \(\varepsilon_t \sim \mathcal{N}(0,\sigma^2)\) drives both the observation and every state update — hence "single source of error". Substituting \(\varepsilon_t = y_t - \hat{y}_{t\mid t-1}\) recovers the recurrences of EQ T3.6, up to a rescaling of the trend and seasonal constants (the state-space \(\beta\) corresponds to \(\alpha\beta\) in EQ T3.4). The payoff is enormous: a likelihood you can maximise, AIC/BIC for model selection, and — most importantly — model-based prediction intervals, which the old recurrences could never produce.

ETS classifies a model by a three-letter code: Error ∈ {A, M}, Trend ∈ {N, A, A_d}, Season ∈ {N, A, M}. So ETS(A,N,N) is SES with additive noise, ETS(A,A,N) is Holt, ETS(A,A,A) is additive Holt-Winters, and ETS(M,A,M) is multiplicative Holt-Winters with multiplicative errors. These options give 2 × 3 × 3 = 18 combinations; admitting multiplicative trends (M, M_d) as well extends the taxonomy to 30. The practical recipe is to let software fit the candidates and pick by AIC.

Method | Components | ETS code (E,T,S) | Forecast shape
SES | level only | (A,N,N) | flat line
Holt | level + trend | (A,A,N) | straight line
Damped Holt | level + damped trend | (A,A_d,N) | bends to asymptote
Additive HW | level + trend + season | (A,A,A) | line + fixed season
Multiplicative HW | level + trend + ×season | (M,A,M) | line × growing season

The empirical verdict. In the M3 competition (3,003 series) and again in M4 (100,000 series), simple exponential-smoothing and ETS variants — especially damped trend — were brutally hard to beat; the M4 winner was a hybrid that combined exponential smoothing with a neural net (Smyl's ES-RNN). The lesson the field keeps relearning: for a single, short, noisy series, a one-parameter smoother often beats a deep model, and any serious forecaster keeps ETS as the baseline that earns its keep.
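The prediction-interval payoff can be made concrete for the simplest member, ETS(A,N,N). Iterating EQ T3.8 with no trend or season gives \(y_{t+h} = \ell_t + \alpha\sum_{j=1}^{h-1}\varepsilon_{t+j} + \varepsilon_{t+h}\), so the \(h\)-step forecast variance is \(\sigma^2\,[1 + (h-1)\alpha^2]\). The sketch below checks that formula by simulation; the values of \(\alpha\), \(\sigma\), the starting level, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)                  # illustrative seed
alpha, sigma, level_t = 0.3, 2.0, 50.0          # hypothetical fitted values
H, n_paths = 12, 40000

# Simulate the innovations form of ETS(A,N,N) forward from level_t:
#   y_{t+j}   = level_{t+j-1} + eps_{t+j}          (measurement)
#   level_{t+j} = level_{t+j-1} + alpha * eps_{t+j}  (state update)
eps = rng.normal(0.0, sigma, size=(n_paths, H))
cum = np.cumsum(eps, axis=1)
level_prev = level_t + alpha * np.hstack([np.zeros((n_paths, 1)), cum[:, :-1]])
y_future = level_prev + eps

h = np.arange(1, H + 1)
sd_mc = y_future.std(axis=0)
sd_theory = sigma * np.sqrt(1.0 + (h - 1) * alpha ** 2)

# 95% interval around the flat SES point forecast (Gaussian errors assumed)
lower = level_t - 1.96 * sd_theory
upper = level_t + 1.96 * sd_theory

print("  h   sd (sim)  sd (formula)   95% interval")
for k in (1, 2, 6, 12):
    print(f"{k:3d}   {sd_mc[k - 1]:7.3f}   {sd_theory[k - 1]:9.3f}   "
          f"[{lower[k - 1]:.2f}, {upper[k - 1]:.2f}]")
```

The point forecast stays flat at the level, but the interval widens with the horizon, and it widens faster for larger \(\alpha\) because more of each shock is absorbed permanently into the level.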
3.5 Choosing the smoothing parameters You do not set \(\alpha, \beta, \gamma\) by hand. The standard procedure picks them — together with the initial states \(\ell_0, b_0, s_0\) — by minimising the in-sample sum of squared one-step errors (equivalently, maximising the Gaussian likelihood of EQ T3.8): EQ T3.9 — PARAMETER ESTIMATION BY MINIMISING SSE $$ (\hat{\alpha}, \hat{\beta}, \hat{\gamma},\, \hat{\ell}_0, \hat{b}_0, \hat{s}_0) \;=\; \arg\min \; \sum_{t=1}^{n} \big(y_t - \hat{y}_{t\mid t-1}\big)^2 \;=\; \arg\min \; \sum_{t=1}^{n} e_t^2 $$ Each \(\hat{y}_{t\mid t-1}\) is the model's one-step forecast computed from the recurrences, so the objective is a nonlinear function of the parameters — solved by numerical optimisation (Nelder-Mead, L-BFGS). The smoothing parameters are box-constrained to \((0,1)\); some references add an "admissible region" constraint that keeps the implied state-space model stable. SSE is minimised on one-step errors, not on the long-horizon forecast — a subtlety that matters when the two disagree. Two cautions experts will raise. First, do not minimise SSE on the data you will also report accuracy on; hold out the tail of the series, or use time-series cross-validation (rolling-origin evaluation, Time Series 01), or trust the AIC from the likelihood. Second, an optimiser will happily push \(\alpha \to 1\) on a series that is really a random walk — a correct answer that looks like overfitting but is not. The instrument below traces the SSE objective for SES so you can see its shape: usually convex with a clear minimum, occasionally flat (the data barely constrains \(\alpha\)). INSTRUMENT T3.3 — SMOOTHING-PARAMETER OPTIMIZER SES · SSE(α) CURVE · EQ T3.9 NOISE LEVEL σ 2.0 LEVEL DRIFT 1.0 YOUR α 0.30 SSE AT YOUR α — OPTIMAL α* — SSE AT α* — The mint curve is the SES objective SSE(α) swept across the whole \((0,1)\) range on a freshly simulated series; the blue dot marks the grid-search minimum α* and the grey dot marks your slider's α. 
Crank the noise up and the minimum slides left (a smoother level filters out observation noise); crank the drift up and it slides right (the level is genuinely moving, so trust recent data more). When the curve goes flat, the data simply does not pin α down — the honest answer is "any value in this basin forecasts about the same". NEXT Exponential smoothing models the mean of a series and treats the variance as a constant nuisance. For financial returns that assumption is exactly backwards: the mean is near-unforecastable but the variance clusters — calm begets calm, a shock begets shocks. Time Series 04 turns the smoothing machinery loose on the variance itself: ARCH, GARCH, and the volatility models that price risk. 3.R References Holt, C. C. (2004, orig. 1957). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20(1) — reprint of the 1957 ONR memorandum that introduced double smoothing (EQ T3.4). Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science 6(3) — adds the seasonal component, completing Holt-Winters (EQ T3.6/T3.7). Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer — the definitive treatment of the ETS innovations state-space framework (EQ T3.8). Hyndman, R. J., Koehler, A. B., Snyder, R. D. & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18(3) — the taxonomy of 30 ETS models and automatic AIC selection (§3.4). Gardner, E. S. & McKenzie, E. (1985). Forecasting trends in time series. Management Science 31(10) — the damped-trend method (EQ T3.5), a perennial competition benchmark. Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. 
International Journal of Forecasting 36(1) — the modern evidence that exponential smoothing remains a top baseline (§3.4). Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.), Ch. 8. OTexts — the freely available standard textbook treatment of SES, Holt-Winters, and ETS.
## TIME · Volatility Modeling (https://ai-encyclopedia.com/timeseries/04-volatility-garch.html)
Volatility Modeling — ARCH & GARCH — AI Encyclopedia AI // ENCYCLOPEDIA / TIME SERIES / 04 / GARCH INDEX NEXT: MULTIVARIATE (VAR) → TIME SERIES & ECONOMETRICS · CHAPTER 04 / 06 Volatility Modeling — ARCH & GARCH Returns are close to unforecastable, but their size is not. Volatility clusters: calm periods follow calm periods and large moves follow large moves. GARCH captures this by writing today's variance as an explicit function of yesterday's surprise and yesterday's variance, which is what lets you forecast risk, scale positions, and quantify the loss you should be prepared to absorb. LEVEL ADVANCED READING TIME ≈ 28 MIN BUILDS ON TIME SERIES 01–03 INSTRUMENTS GARCH SIM · RETURNS+VOL · TERM STRUCTURE IN THIS CHAPTER 4.1 Volatility clustering 4.2 ARCH 4.3 GARCH(1,1) 4.4 Asymmetry: EGARCH & GJR 4.5 Forecasting & VaR 4.R References 4.1 Volatility clustering — the stylized fact Plot the daily returns of any liquid asset and one thing jumps out: the wild days come in bunches. October 2008, March 2020, August 2024 — each is a dense thicket of large moves up and down, separated by long stretches of placid drift. Mandelbrot noticed it in 1963: "large changes tend to be followed by large changes, of either sign, and small changes by small changes." This is volatility clustering, and it is the single most robust empirical regularity in all of finance. The classical models of the previous chapters cannot represent it. They assume homoscedasticity — a constant variance \(\sigma^2\) for the noise term. Under that assumption a calm Tuesday and a panicked Thursday are draws from the same distribution, which is plainly false. What clustering demands instead is conditional heteroscedasticity: a variance that changes through time and, crucially, is predictable from the past even when the return itself is not. 
EQ T4.1 — THE STYLIZED FACTS, FORMALLY $$ \mathrm{Corr}(r_t,\, r_{t-k}) \approx 0 \qquad\text{but}\qquad \mathrm{Corr}(r_t^2,\, r_{t-k}^2) > 0 \;\; \text{for many lags } k $$ Raw returns \(r_t\) are serially uncorrelated (you cannot predict tomorrow's sign — markets are near-efficient). Yet squared (or absolute) returns are strongly positively autocorrelated, and that autocorrelation decays slowly. The level is noise; the magnitude has memory. Returns are also fat-tailed (leptokurtic) and, in equities, negatively skewed — a model of volatility must reproduce all three. Two more facts complete the picture and motivate everything below. First, the unconditional return distribution has fat tails — far more 4σ and 5σ days than a Gaussian allows — even when daily returns are conditionally normal, because mixing normals of different variances manufactures kurtosis for free. Second, in equity markets volatility responds asymmetrically: a 3% drop raises tomorrow's expected volatility more than a 3% gain does. That leverage effect (§4.4) is why the family did not stop at GARCH. A subtlety worth stating up front: GARCH does not predict returns, and it would be a category error to expect it to. It predicts the scale of returns — the width of tomorrow's distribution, not its center. That is exactly the quantity risk management, option pricing, and position sizing actually need. 4.2 ARCH — conditional variance Robert Engle's 1982 insight — which won the 2003 Nobel — was to let the variance of the current shock depend on the magnitudes of recent shocks. Write the return (after removing any mean) as a standardized innovation scaled by a time-varying volatility: EQ T4.2 — THE ARCH(q) MODEL $$ r_t = \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t\, z_t, \quad z_t \overset{\text{iid}}{\sim} \mathcal{N}(0,1), \qquad \sigma_t^2 = \omega + \sum_{i=1}^{q}\alpha_i\, \varepsilon_{t-i}^2 $$ \(z_t\) is the unpredictable part — pure white noise of unit variance. 
All the structure lives in \(\sigma_t^2\), the conditional variance: a baseline \(\omega > 0\) plus a weighted sum of recent squared shocks. A big move yesterday (\(\varepsilon_{t-1}^2\) large) mechanically inflates today's variance, then feeds forward — that is clustering, written as a recursion. For the variance to stay positive we need \(\omega > 0,\ \alpha_i\ge 0\); for it to be stationary we need \(\sum_i \alpha_i < 1\). ARCH works, but it is clumsy. Real volatility persistence decays over many weeks, so capturing it with a finite sum of squared shocks forces a large \(q\) — often 5 to 10 lags — and a long parameter vector that is awkward to estimate and prone to overfitting. The model also imposes that variance reacts only to a fixed, short window of past shocks, with hard cutoffs. Engle's student Bollerslev fixed both problems in one stroke. Parameters are fit by maximum likelihood: choose \((\omega, \alpha)\) to maximize the Gaussian log-likelihood of the observed returns under the recursively computed \(\sigma_t^2\). The objective is non-linear but smooth, and the runnable cells below show the recursion that any optimizer would evaluate at each step. EQ T4.3 — THE GAUSSIAN LOG-LIKELIHOOD (WHAT MLE MAXIMIZES) $$ \ell(\theta) = -\frac{1}{2}\sum_{t=1}^{T}\!\left[\log(2\pi) + \log \sigma_t^2(\theta) + \frac{\varepsilon_t^2}{\sigma_t^2(\theta)}\right] $$ Each term rewards a \(\sigma_t^2\) that is large when the shock is large and small when it is small: the \(\varepsilon_t^2/\sigma_t^2\) penalty punishes underestimating a violent day, while \(\log\sigma_t^2\) punishes crying wolf on a calm one. Maximizing this is exactly learning to size tomorrow's distribution from today's. Heavy-tailed innovations (Student-t) replace the Gaussian when residuals stay fat-tailed after fitting — common for daily equity data. 
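The likelihood of EQ T4.3 is short enough to write out directly. The sketch below evaluates it for an ARCH(q) model as defined in EQ T4.2. The function name `arch_loglik`, the parameter packing `(mu, omega, alpha_1, ..., alpha_q)`, and the choice to seed the pre-sample squared shocks with the sample variance are all assumptions made for illustration; other initialisations are common.

```python
import numpy as np

def arch_loglik(params, r):
    """Gaussian log-likelihood (EQ T4.3) of an ARCH(q) model (EQ T4.2).

    params = (mu, omega, alpha_1, ..., alpha_q); illustrative packing only.
    """
    mu, omega = params[0], params[1]
    alphas = np.asarray(params[2:], dtype=float)
    q = len(alphas)
    eps = np.asarray(r, dtype=float) - mu
    # Assumption: the q pre-sample squared shocks are set to the sample variance.
    eps2 = np.concatenate([np.full(q, eps.var()), eps ** 2])
    sigma2 = np.empty(len(eps))
    for t in range(len(eps)):
        lags = eps2[t:t + q][::-1]          # eps_{t-1}^2, ..., eps_{t-q}^2
        sigma2[t] = omega + alphas @ lags   # conditional variance recursion
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + eps ** 2 / sigma2)


r = np.array([0.01, -0.02, 0.015, -0.03, 0.005])
print("log-likelihood:", arch_loglik((0.0, 2e-4, 0.3), r))
```

In practice one would pass the negative of this function to a numerical optimiser, with bounds that keep \(\omega > 0\) and \(\alpha_i \ge 0\).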
4.3 GARCH(1,1) — the workhorse
The Generalized ARCH model adds one term, yesterday's variance, and that single addition is why GARCH(1,1) has been the default for forty years. It captures slow-decaying persistence with just three parameters, an unbeatable parsimony-to-realism ratio:
EQ T4.4 — GARCH(1,1)
$$ \sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2, \qquad \omega > 0,\ \alpha\ge 0,\ \beta\ge 0,\ \alpha+\beta < 1 $$
Three forces set tomorrow's variance: a constant floor \(\omega\); the news / reaction term \(\alpha\,\varepsilon_{t-1}^2\) (how hard yesterday's surprise hits); and the memory / persistence term \(\beta\,\sigma_{t-1}^2\) (how much of yesterday's variance carries over). Unrolling the recursion gives \(\sigma_t^2 = \omega/(1-\beta) + \alpha\sum_{j\ge 0}\beta^{\,j}\varepsilon_{t-1-j}^2\): GARCH(1,1) is an infinite exponentially-weighted sum of all past squared shocks, an ARCH(∞), which is exactly why three numbers do the work of ten.
Two derived quantities carry most of the intuition. The persistence is the sum \(\alpha+\beta\): it governs how slowly a volatility shock dies out, and for daily equity indices it is famously close to one, typically \(0.95\) to \(0.99\). The unconditional (long-run) variance is the level the recursion reverts to:
EQ T4.5 — LONG-RUN VARIANCE & MEAN REVERSION
$$ \bar{\sigma}^2 \;=\; \frac{\omega}{1 - \alpha - \beta}, \qquad \sigma_t^2 - \bar{\sigma}^2 \;=\; \alpha\big(\varepsilon_{t-1}^2 - \sigma_{t-1}^2\big) + (\alpha+\beta)\big(\sigma_{t-1}^2 - \bar{\sigma}^2\big) $$
Taking expectations of EQ T4.4 in the stationary state gives \(\bar\sigma^2(1-\alpha-\beta)=\omega\). The first term on the right, the surprise \(\varepsilon_{t-1}^2 - \sigma_{t-1}^2\), has conditional mean zero, so in expectation the deviation from \(\bar\sigma^2\) shrinks by the factor \((\alpha+\beta)\) each step. Variance always pulls back toward \(\bar\sigma^2\): after a spike it decays, after a lull it rises. The closer \(\alpha+\beta\) is to 1, the slower that pull; at \(\alpha+\beta=1\) shocks never fully fade (the IGARCH boundary, where \(\bar\sigma^2\) is undefined and the EWMA / RiskMetrics model lives). The half-life of a variance shock, \(\log(0.5)/\log(\alpha+\beta)\), condenses this into a single number, and it is what a risk manager reads first.
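The three summary quantities above (persistence, long-run variance, half-life) follow from the fitted parameters with a few lines of arithmetic. The sketch below is a minimal illustration; the helper name `garch_summary` is hypothetical, and the parameter values are the ones used in this section's examples.

```python
import math

def garch_summary(omega, alpha, beta):
    """Persistence, long-run variance, and shock half-life of a GARCH(1,1).

    Illustrative helper, not a library function.
    """
    persistence = alpha + beta
    if not 0 < persistence < 1:
        raise ValueError("need 0 < alpha + beta < 1 for a finite long-run variance")
    long_run_var = omega / (1.0 - persistence)                 # sigma-bar squared
    half_life = math.log(0.5) / math.log(persistence)          # in periods (days)
    return persistence, long_run_var, half_life


persistence, long_run_var, half_life = garch_summary(1e-5, 0.10, 0.85)
print(f"persistence      : {persistence:.3f}")
print(f"long-run variance: {long_run_var:.6f}")
print(f"long-run daily vol: {math.sqrt(long_run_var):.4f}")
print(f"half-life (days) : {half_life:.1f}")
```

The guard on the persistence mirrors the stationarity condition: at or above 1 the long-run variance is undefined, so the function refuses rather than returning a misleading number.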
A GARCH(1,1) model has \(\omega = 0.00001\), \(\alpha = 0.1\), \(\beta = 0.85\). Yesterday's conditional variance was \(\sigma_{t-1}^2 = 0.0004\) and yesterday's squared shock was \(\varepsilon_{t-1}^2 = 0.0009\). What is today's conditional variance \(\sigma_t^2\)? Apply EQ T4.4 term by term: \(\alpha\,\varepsilon_{t-1}^2 = 0.1 \times 0.0009 = 0.00009\); \(\beta\,\sigma_{t-1}^2 = 0.85 \times 0.0004 = 0.00034\). Sum with \(\omega\): \(0.00001 + 0.00009 + 0.00034 = \) 0.00044. (Today's volatility is \(\sqrt{0.00044} \approx 0.021\), i.e. about a 2.1% daily move.) True or false: when \(\alpha + \beta\) is close to 1, a shock to volatility decays slowly, so today's turbulence stays elevated for a long time. (Answer true or false.) From EQ T4.5 the deviation \(\sigma_t^2 - \bar\sigma^2\) shrinks by a factor of \((\alpha+\beta)\) each step. If \(\alpha+\beta\) is near 1 that factor is near 1, so the deviation barely shrinks per day and volatility reverts to its mean only over many sessions — long memory, slow decay. The statement is true. 
PYTHON · RUNNABLE IN-BROWSER
# Simulate a GARCH(1,1) process; plot returns and conditional vol
import numpy as np
rng = np.random.default_rng(1)
omega, alpha, beta = 1e-5, 0.10, 0.85      # persistence alpha+beta = 0.95
T = 600
var_lr = omega / (1 - alpha - beta)        # long-run (unconditional) variance
r = np.zeros(T)
s2 = np.zeros(T); s2[0] = var_lr           # start at the long-run level
for t in range(1, T):
    s2[t] = omega + alpha * r[t-1]**2 + beta * s2[t-1]
    r[t] = np.sqrt(s2[t]) * rng.standard_normal()
vol = np.sqrt(s2)
print(f"long-run daily vol: {np.sqrt(var_lr):.4f} (annualized ~{np.sqrt(var_lr)*np.sqrt(252):.1%})")
print(f"realised daily vol: {r.std():.4f}")
print(f"max |return|: {np.abs(r).max():.4f} on day {int(np.argmax(np.abs(r)))}")
print(f"corr(r, lag r): {np.corrcoef(r[1:], r[:-1])[0,1]:+.3f} (near 0: level is noise)")
print(f"corr(r^2, lag r^2): {np.corrcoef(r[1:]**2, (r[:-1])**2)[0,1]:+.3f} (positive: magnitude has memory)")
# plot_xy is not defined in this cell; it is presumably supplied by the site's in-browser runtime
plot_xy(list(range(T)), vol)               # the clustering, made visible

PYTHON · RUNNABLE IN-BROWSER
# Run the GARCH(1,1) variance recursion on returns; print the one-step vol
import numpy as np
omega, alpha, beta = 1e-5, 0.10, 0.85
# a short return series with two violent days in the middle, then three calmer ones
r = np.array([0.004, -0.006, 0.002, -0.003, 0.005,
              -0.028, 0.031, -0.004, 0.007, -0.002])
var_lr = omega / (1 - alpha - beta)
s2 = var_lr                                # variance in force on day 0, seeded at the long-run level
print(" day   return   sigma^2        sigma (daily vol)")
for t, rt in enumerate(r):
    # s2 is sigma_t^2: it depends only on returns before day t
    print(f" {t:2d}   {rt:+.4f}   {s2:.6e}   {np.sqrt(s2):.4%}")
    s2 = omega + alpha * rt**2 + beta * s2 # EQ T4.4: roll forward to day t+1
# after the loop s2 already contains the last observed shock and variance,
# so it is the one-step-ahead forecast (applying r[-1] again would count it twice)
print(f"\none-step-ahead sigma^2: {s2:.6e}")
print(f"one-step-ahead vol: {np.sqrt(s2):.4%} (note the spike that lingers)")

INSTRUMENT T4.1 — GARCH(1,1) SIMULATOR EQ T4.4 · CLUSTERING & PERSISTENCE · SEEDED
Controls: REACTION α (default 0.10), PERSISTENCE β (default 0.85). Readouts: α + β (persistence), shock half-life, long-run daily vol.
The mint line is the conditional volatility \(\sigma_t\); the faint grey bars are the simulated returns it scales. Push α up and volatility reacts violently to each shock but the spikes are jagged and short. Push β up and the spikes smooth into long plateaus: memory. As α + β rises past about 0.97 the half-life lengthens quickly (about 23 days at 0.97, about 69 days at 0.99), and as the sum approaches 1 the series stops mean-reverting on any human timescale. That limit is the IGARCH regime, where the model says "today's storm is the new normal until further notice." The same seed is reused so you compare regimes, not luck.
4.4 Asymmetry — EGARCH & GJR-GARCH
Plain GARCH has a blind spot baked into its algebra: it reacts to \(\varepsilon_{t-1}^2\), and squaring throws away the sign. A −4% day and a +4% day produce identical forecasts. But equity volatility is emphatically not symmetric: bad news raises future volatility far more than equally-sized good news. This leverage effect (a falling stock raises its debt-to-equity ratio, mechanically raising risk; and falling prices trigger forced selling and fear) is one of the most reliable patterns in markets, and two extensions of GARCH were built to capture it. GJR-GARCH (Glosten–Jagannathan–Runkle, 1993) is the minimal fix: add one term that switches on only for negative shocks.
EQ T4.6 — GJR-GARCH(1,1)
$$ \sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \gamma\, \mathbb{1}_{\{\varepsilon_{t-1} < 0\}}\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2 $$
The indicator \(\mathbb{1}_{\{\varepsilon_{t-1} < 0\}}\) equals 1 after a down day and 0 otherwise, so a negative shock contributes \((\alpha+\gamma)\varepsilon_{t-1}^2\) while a positive one contributes only \(\alpha\,\varepsilon_{t-1}^2\). A positive \(\gamma\) is the leverage effect made into a parameter, and for equity indices \(\gamma\) is reliably positive and often larger than \(\alpha\) itself.
Persistence becomes \(\alpha + \beta + \tfrac{1}{2}\gamma\) (the \(\tfrac12\) is the probability a shock is negative). EGARCH (Nelson, 1991) takes a different route: model the log of variance, which guarantees positivity without any constraints on the signs of the coefficients, and let the news term depend on both the magnitude and the sign of the standardized shock \(z_{t-1} = \varepsilon_{t-1}/\sigma_{t-1}\). EQ T4.7 — EGARCH(1,1) $$ \log \sigma_t^2 = \omega + \beta \log \sigma_{t-1}^2 + \alpha\Big(\,|z_{t-1}| - \mathbb{E}|z_{t-1}|\,\Big) + \theta\, z_{t-1} $$ The \(\alpha\) term is the symmetric magnitude response; the \(\theta\, z_{t-1}\) term is the asymmetry — with \(\theta < 0\), a negative \(z\) (a down day) raises \(\log\sigma_t^2\) more than a positive one of equal size. Because it works in logs, EGARCH needs no positivity constraints and can express richer news-impact curves, at the cost of a likelihood that is fiddlier to optimize. Forecasting multiple steps ahead is also messier than GARCH's clean linear recursion. The news-impact curve — next period's variance plotted against this period's shock, holding \(\sigma_{t-1}^2\) fixed — is the cleanest way to see the difference. Plain GARCH gives a symmetric parabola centered at zero; GJR and EGARCH tilt it, steepening the left (bad-news) arm. INSTRUMENT T4.2 — NEWS-IMPACT CURVE GARCH vs GJR · σ²ₜ vs εₜ₋₁ · EQ T4.4 / T4.6 REACTION α 0.06 LEVERAGE γ 0.10 σ² AFTER −3% DAY — σ² AFTER +3% DAY — DOWN / UP RATIO — The blue parabola is symmetric GARCH — it does not care which way the market moved. The mint curve is GJR: raise the leverage γ and watch the left (loss) arm steepen while the right arm stays put, the kink at zero growing sharper. The down/up ratio is how much more a 3% loss inflates tomorrow's variance than a 3% gain — set γ = 0 and it snaps to exactly 1.0, recovering plain GARCH. For real equity indices this ratio is routinely 2 or more. Which to use is genuinely contested. 
GJR is simpler, nests GARCH cleanly (test \(\gamma=0\)), and is easy to forecast, so most practitioners reach for it first. EGARCH is more flexible and unconstrained but harder to fit and to project forward, and its log scale makes the parameters less directly interpretable. Hansen & Lunde's large 2005 horse race, which compared 330 ARCH-type models, is a useful humility check against over-engineering, but it cuts both ways: on exchange-rate data nothing reliably beat a plain GARCH(1,1), whereas on IBM stock returns GARCH(1,1) was clearly inferior to models that accommodate a leverage effect.
4.5 Forecasting volatility & the VaR link
The payoff of a fitted GARCH model is a forecast of future variance, and the recursion makes multi-step forecasts almost free. The one-step forecast is just the recursion evaluated at the last observed values. For horizons beyond one, the unknown future shock \(\varepsilon_{t+h-1}^2\) is replaced by its expectation, which under the model is the forecast variance itself, and the whole thing collapses to clean geometric mean reversion toward \(\bar\sigma^2\):
EQ T4.8 — h-STEP VARIANCE FORECAST
$$ \mathbb{E}_t\!\left[\sigma_{t+h}^2\right] \;=\; \bar{\sigma}^2 \;+\; (\alpha+\beta)^{\,h-1}\big(\sigma_{t+1}^2 - \bar{\sigma}^2\big), \qquad h = 1, 2, 3, \ldots $$
The forecast is the long-run level \(\bar\sigma^2\) plus the current deviation, geometrically discounted by the persistence \((\alpha+\beta)\) per step. From a calm start it climbs toward \(\bar\sigma^2\); from a panic it decays toward it. This is the term structure of volatility. High persistence flattens the curve (a slow approach), low persistence snaps it back fast. Aggregating to an \(H\)-day variance sums these: \(\sum_{h=1}^{H}\mathbb{E}_t[\sigma_{t+h}^2]\). Under iid returns the sum would just be \(H\sigma^2\), the famous \(\sqrt{H}\) scaling of volatility, which GARCH corrects whenever you are not already at the long-run level.
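The forecast of EQ T4.8 and its aggregation over a horizon can be sketched in a few lines. The function name `garch_variance_path` is hypothetical, and the inputs (a 1.5% long-run daily vol, a 3.5% vol forecast for tomorrow, and the split of the 0.94 persistence into α = 0.09 and β = 0.85) are assumed values chosen only to show a panic decaying back toward the long-run level.

```python
import math

def garch_variance_path(sigma2_next, omega, alpha, beta, horizon):
    """Expected conditional variances for h = 1..horizon, following EQ T4.8."""
    persistence = alpha + beta
    if not 0 <= persistence < 1:
        raise ValueError("need 0 <= alpha + beta < 1 for a finite long-run variance")
    long_run = omega / (1.0 - persistence)
    return [long_run + persistence ** (h - 1) * (sigma2_next - long_run)
            for h in range(1, horizon + 1)]


# Hypothetical inputs, chosen for illustration only.
alpha, beta = 0.09, 0.85
long_run_var = 0.015 ** 2                       # 1.5% long-run daily vol
omega = long_run_var * (1.0 - alpha - beta)     # implied by EQ T4.5
sigma2_next = 0.035 ** 2                        # 3.5% vol forecast for tomorrow

path = garch_variance_path(sigma2_next, omega, alpha, beta, horizon=10)
vol_10d_garch = math.sqrt(sum(path))            # aggregate the variance, then take the root
vol_10d_naive = math.sqrt(10) * math.sqrt(sigma2_next)

print(f"10-day vol, GARCH term structure: {vol_10d_garch:.4f}")
print(f"10-day vol, naive sqrt(10) rule : {vol_10d_naive:.4f}")
```

Starting above the long-run level, the GARCH aggregate comes out below the naive square-root-of-time figure because each later day's variance has already partly reverted; starting below the long-run level the ordering flips.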
This term structure is precisely what an option's implied-volatility surface tries to price, and what a risk system needs to project losses over a 1-day or 10-day horizon. The most consequential application is Value-at-Risk (VaR): the loss threshold a portfolio will not exceed with probability \(1-p\) over a given horizon. Plug the GARCH conditional volatility into the quantile of the innovation distribution: EQ T4.9 — CONDITIONAL VALUE-AT-RISK (PARAMETRIC) $$ \mathrm{VaR}_{t}^{\,p} \;=\; -\Big(\mu + z_{p}\,\sigma_{t}\Big), \qquad z_{p} = \Phi^{-1}(p) $$ \(z_p\) is the lower-tail quantile of the standardized innovation (\(\Phi^{-1}(0.01) \approx -2.326\) for a Gaussian 1% VaR; use the Student-t quantile for fat tails). Because \(\sigma_t\) is conditional, the VaR breathes with the market — it widens automatically in turbulent clusters and tightens in calm, unlike a static historical VaR that lags the regime badly. The 10-day regulatory VaR scales by the GARCH variance forecast \(\sqrt{\sum_{h=1}^{10}\mathbb{E}_t[\sigma_{t+h}^2]}\), not by a naive \(\sqrt{10}\,\sigma_t\) — the difference is exactly the mean reversion of EQ T4.8. KEY Why a conditional VaR matters. A static VaR built on a trailing 250-day window treats March 2020 and a sleepy summer as equally likely tomorrow. It under-warns going into a crisis (the window is still full of calm days) and over-warns coming out of one (the window is still full of the crash). GARCH-based VaR reacts within a day or two because \(\sigma_t\) is recomputed every step — the practical reason banks adopted conditional volatility models for capital after 1996. INSTRUMENT T4.3 — VOLATILITY FORECAST TERM STRUCTURE EQ T4.8 · MEAN REVERSION TO σ̄ · LIVE PERSISTENCE α+β 0.94 TODAY'S DAILY VOL σₜ₊₁ 3.5% LONG-RUN DAILY VOL σ̄ — 10-DAY VOL (GARCH) — 10-DAY 99% VaR — Long-run vol is pinned at σ̄ = 1.5%/day (≈ 24% annualized). 
Start above it — a panic — and the mint term-structure curve decays back toward the dashed long-run line; drag today's vol below σ̄ and it climbs. Crank persistence toward 1 and the curve flattens to a near-horizontal plateau (shocks barely revert). The 10-day VaR reads off the aggregated GARCH variance with a Gaussian 99% quantile (z ≈ 2.326) — compare it mentally to a naive √10 × σₜ₊₁ and notice how much the mean reversion matters when you start far from σ̄. GARCH is not the last word. It assumes the variance process is driven only by past returns; realized-volatility models (HAR-RV) instead feed high-frequency intraday data straight in and routinely forecast better. Stochastic-volatility models give variance its own innovation term rather than making it a deterministic function of past shocks — more flexible, harder to estimate. And implied volatility from options markets is forward-looking in a way no return-based model can be. But for a three-parameter model you can fit in milliseconds and explain on a napkin, GARCH(1,1) remains the benchmark every richer model must beat — and frequently does not. NEXT We have modeled the volatility of one series in isolation — but risk lives in how series move together. A portfolio's variance is a quadratic form in a whole covariance matrix, and in a crisis correlations snap toward one exactly when diversification is supposed to save you. Chapter 05 turns the dial from one dimension to many: Vector Autoregression (VAR) for the joint dynamics of several series, the cross-correlations and Granger causality they encode, and the multivariate-GARCH machinery (DCC) that lets the whole covariance matrix breathe through time. 4.R References Engle, R. F. (1982). Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 50(4) — the original ARCH model (EQ T4.2); the work cited for Engle's 2003 Nobel Prize. Bollerslev, T. (1986). 
Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31(3) — adds the lagged-variance term, giving GARCH(1,1) (EQ T4.4), the field's workhorse. Nelson, D. B. (1991). Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica 59(2) — the EGARCH model (EQ T4.7), capturing the leverage effect in log-variance. Glosten, L. R., Jagannathan, R. & Runkle, D. E. (1993). On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. Journal of Finance 48(5) — the GJR-GARCH asymmetric extension (EQ T4.6). Hansen, P. R. & Lunde, A. (2005). A Forecast Comparison of Volatility Models: Does Anything Beat a GARCH(1,1)? Journal of Applied Econometrics 20(7) — the large horse race: nothing beat GARCH(1,1) on exchange-rate data, but leverage-aware models beat it on IBM equity returns. Engle, R. F. (2002). Dynamic Conditional Correlation: A Simple Class of Multivariate GARCH Models. Journal of Business & Economic Statistics 20(3) — the DCC model bridging to the multivariate chapter. Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4) — the first clear statement of volatility clustering and fat tails (EQ T4.1).
## TIME · Multivariate Time Series (https://ai-encyclopedia.com/timeseries/05-multivariate.html)
Multivariate Time Series — VAR, VECM & Cointegration — AI Encyclopedia AI // ENCYCLOPEDIA / TIME SERIES / 05 / VAR & COINTEGRATION INDEX NEXT: FORECASTING IN PRACTICE → TIME SERIES & ECONOMETRICS · CHAPTER 05 / 06 Multivariate Time Series — VAR, VECM & Cointegration When several series move together, a Vector Autoregression captures their feedback: every variable is regressed on the recent past of all the others, so the model encodes which series lead which. When those series are individually non-stationary random walks, cointegration identifies a long-run tether between them, a linear combination that does not wander and that anchors an error-correction dynamic pulling the system back toward equilibrium. LEVEL ADVANCED READING TIME ≈ 30 MIN BUILDS ON TIME SERIES 01–04 · STATS 06 INSTRUMENTS VAR · IRF · COINTEGRATION IN THIS CHAPTER 5.1 From AR to VAR 5.2 Estimation & order selection 5.3 Impulse response & FEVD 5.4 Cointegration & the VECM 5.5 Granger causality 5.R References 5.1 From AR to Vector Autoregression (VAR) A scalar autoregression \(y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t\) (the AR\((p)\) of the earlier chapters) explains a series by its own past. But the world rarely hands you one series in isolation: interest rates, output and inflation move together; an order book's bid and ask co-evolve. 
The Vector Autoregression is the minimal generalization — stack \(K\) series into a vector \(y_t \in \mathbb{R}^K\) and let every component depend on the recent past of every component, itself included: EQ T5.1 — VAR(p), REDUCED FORM $$ y_t \;=\; c \;+\; A_1\, y_{t-1} \;+\; A_2\, y_{t-2} \;+\; \cdots \;+\; A_p\, y_{t-p} \;+\; \varepsilon_t, \qquad \varepsilon_t \sim (0,\ \Sigma) $$ \(y_t\) is \(K\times 1\); each \(A_i\) is a \(K\times K\) matrix of lag coefficients; \(c\) is a \(K\times 1\) intercept; the innovations \(\varepsilon_t\) are serially uncorrelated with contemporaneous covariance \(\Sigma\) (generally not diagonal — the series are shocked together). The off-diagonal entries of the \(A_i\) are the whole point: \(\big(A_1\big)_{12}\neq 0\) means yesterday's series 2 helps predict today's series 1. A VAR is just \(K\) ordinary regressions sharing the same right-hand side. Sims (1980) proposed the VAR as a deliberate rebellion against the "incredible" identifying restrictions of large structural macro models: let the data speak by regressing everything on lagged everything, then ask the model questions afterward (§5.3). Each equation has an intercept plus \(K\) coefficients per lag — so a VAR\((p)\) on \(K\) variables carries \(K(Kp + 1)\) parameters in the mean, and the count explodes as \(K\) and \(p\) grow. That parameter profligacy is the VAR's defining tension, and the reason §5.2 obsesses over order selection. A VAR(1) is fitted on \(K = 3\) variables. Ignoring the intercept, how many lag coefficients does each equation contain (the number of entries in one row of \(A_1\))? Each equation regresses one variable on the previous values of all \(K = 3\) variables, and there is \(p = 1\) lag, so a row of \(A_1\) has \(K\cdot p = 3\times 1 = \) 3 coefficients. (A VAR\((2)\) on 3 variables would carry \(3\times 2 = 6\) lag coefficients per equation.) For analysis it is convenient to fold a VAR\((p)\) into a VAR\((1)\) on a stacked state. 
Define the companion matrix \(F\), an exact analogue of the scalar companion form: a single matrix whose powers generate the entire dynamics. EQ T5.2 — COMPANION FORM & STABILITY $$ \underbrace{\begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{pmatrix}}_{Y_t} = \underbrace{\begin{pmatrix} A_1 & A_2 & \cdots & A_p \\ I & 0 & \cdots & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I & 0 \end{pmatrix}}_{F}\, Y_{t-1} + \, E_t, \qquad \text{stable} \iff \max_i \lvert \lambda_i(F) \rvert < 1 $$ \(F\) is \(Kp \times Kp\). The VAR is stationary (stable) exactly when every eigenvalue of \(F\) lies strictly inside the unit circle — equivalently, every root of \(\det(I - A_1 z - \cdots - A_p z^p) = 0\) lies outside it. Shocks then decay geometrically and the process has a finite, time-invariant mean \((I - A_1 - \cdots - A_p)^{-1}c\). An eigenvalue on the unit circle is a unit root — the gateway to cointegration in §5.4. A VAR also has a clean infinite-history rewrite, the Wold / VMA(\(\infty\)) representation \(y_t = \mu + \sum_{i=0}^{\infty} \Psi_i\,\varepsilon_{t-i}\) with \(\Psi_0 = I\). The matrices \(\Psi_i\) are precisely the impulse responses of §5.3, and for a VAR\((1)\) they are simply \(\Psi_i = A_1^{\,i}\) — powers of the coefficient matrix. INSTRUMENT T5.1 — TWO-VARIABLE VAR SIMULATOR COEFFICIENT MATRIX A₁ DRIVES COUPLED DYNAMICS · EQ T5.1 a₁₁ (y₁ ← y₁) 0.50 a₁₂ (y₁ ← y₂) 0.30 a₂₁ (y₂ ← y₁) 0.20 a₂₂ (y₂ ← y₂) 0.40 SPECTRAL RADIUS |λ|ₘₐₓ — REGIME — CROSS-FEEDBACK a₁₂·a₂₁ — A single fixed seed of shocks drives both series so you compare apples to apples. Push the diagonal terms toward ±1 and the system slows and wanders; raise the off-diagonals and watch the two series lock into shared swings (cross-feedback). The instant the spectral radius crosses 1 the regime flips to UNSTABLE and trajectories diverge — exactly the eigenvalue boundary of EQ T5.2. 
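The stability condition of EQ T5.2 is easy to check numerically: build the companion matrix from the lag matrices and look at its largest eigenvalue modulus. The sketch below assumes the lag matrices arrive as a list of K × K NumPy arrays; the helper names `companion` and `is_stable` are illustrative, not a library API.

```python
import numpy as np

def companion(A_list):
    """Stack lag matrices A_1..A_p (each K x K) into the Kp x Kp companion F."""
    p = len(A_list)
    K = A_list[0].shape[0]
    F = np.zeros((K * p, K * p))
    F[:K, :] = np.hstack(A_list)            # top block row: [A_1 A_2 ... A_p]
    F[K:, :-K] = np.eye(K * (p - 1))        # identity blocks shift the lags down
    return F

def is_stable(A_list):
    """Return (stable?, spectral radius) for the VAR defined by A_list."""
    radius = float(np.max(np.abs(np.linalg.eigvals(companion(A_list)))))
    return radius < 1.0, radius


A1 = np.array([[0.5, 0.3],
               [0.2, 0.4]])                 # default matrix of Instrument T5.1
stable, radius = is_stable([A1])
print("stable:", stable, "| spectral radius:", round(radius, 3))
```

For this default matrix the eigenvalues are 0.7 and 0.2, so the system is comfortably inside the unit circle. Passing two or more lag matrices exercises the stacked form, where the same eigenvalue test applies to the larger \(Kp \times Kp\) matrix.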
5.2 Estimation & order selection Because every equation of a reduced-form VAR has the identical regressor set — a constant plus the same stacked lags — the seemingly-unrelated-regressions efficiency gain vanishes: equation-by-equation OLS is the (conditional) maximum-likelihood estimator under Gaussian errors, and it is consistent and asymptotically normal whether or not \(\Sigma\) is diagonal. You can fit a VAR with nothing more than the normal equations. EQ T5.3 — MULTIVARIATE LEAST SQUARES $$ \widehat{B} \;=\; \big( Z^{\top} Z \big)^{-1} Z^{\top} Y, \qquad \widehat{\Sigma} \;=\; \frac{1}{T - Kp - 1}\, \widehat{U}^{\top}\widehat{U}, \qquad \widehat{U} = Y - Z\widehat{B} $$ Stack the \(T\) observations as rows of \(Y\) (\(T\times K\)); each row of the design \(Z\) is \([\,1,\ y_{t-1}^{\top},\ \ldots,\ y_{t-p}^{\top}\,]\). Then \(\widehat{B}\) holds the intercept and all \(A_i\) at once — one matrix solve recovers the entire VAR. \(\widehat\Sigma\) is the residual covariance, divided by the degrees of freedom \(T - (Kp+1)\) per equation. OLS row-by-row equals system MLE here because the regressors are shared. The hard part is not estimation but order selection: too few lags and residuals stay autocorrelated (biasing everything downstream); too many and you burn degrees of freedom on noise. The standard tools are information criteria, which add a complexity penalty to the log-likelihood. Let \(n = K(Kp+1)\) be the total parameter count and \(\lvert\widehat\Sigma_p\rvert\) the residual-covariance determinant at lag \(p\): EQ T5.4 — INFORMATION CRITERIA FOR VAR ORDER $$ \mathrm{AIC}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{2}{T}\,n, \qquad \mathrm{BIC}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{\ln T}{T}\,n, \qquad \mathrm{HQ}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{2\ln\ln T}{T}\,n $$ All three reward fit (the determinant term falls as \(p\) grows) and punish parameters \(n = K(Kp+1)\). 
The penalties differ in strength: BIC's \(\ln T\) is harshest and is consistent (it selects the true order as \(T\to\infty\)); AIC's \(2\) is mild, tends to over-fit, but is asymptotically efficient for forecasting; Hannan–Quinn sits between. Pick the \(p\) that minimizes the criterion, and in practice prefer BIC for inference, AIC when prediction is the goal. Always confirm the chosen model leaves white-noise residuals.
The curse of dimensionality is real here. A VAR\((4)\) on 8 macro variables already has \(8\times(8\times 4 + 1) = 264\) parameters, easily more than a typical quarterly sample of post-war data. The modern responses are Bayesian VARs with shrinkage priors (the Minnesota prior pulls coefficients toward a random walk), factor-augmented VARs that compress many series into a few factors, and large-VAR estimators with elementwise penalties. None of that changes the OLS skeleton above; these methods add a prior or penalty on \(B\).
PYTHON · RUNNABLE IN-BROWSER
# Fit a VAR(1) on two simulated series by OLS (EQ T5.3) and recover A1
import numpy as np
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.3],      # the true coefficient matrix A1
              [0.2, 0.4]])     # off-diagonals = cross-feedback
T = 600
y = np.zeros((T, 2))
for t in range(1, T):          # simulate the VAR(1) data
    y[t] = A @ y[t - 1] + rng.normal(0, 1, 2)
Z = y[:-1]                     # regressors: y_{t-1} (T-1 x 2)
Y = y[1:]                      # targets: y_t (T-1 x 2)
B = np.linalg.solve(Z.T @ Z, Z.T @ Y).T    # OLS, one solve -> A1_hat (2x2)
np.set_printoptions(precision=3, suppress=True)
print("true A1:\n", A)
print("OLS A1_hat:\n", B)
ev = np.linalg.eigvals(B)
print("\neigenvalue moduli:", np.round(np.abs(ev), 3),
      "-> stable" if np.all(np.abs(ev) < 1) else "-> UNSTABLE")
# the tail of the next string was lost in extraction; it is closed minimally here
print("max |lambda| =", round(float(np.max(np.abs(ev))), 3), "(must be < 1)")

5.3 Impulse-response & variance decomposition
A fitted VAR is a dense block of coefficients that almost no one can read directly.
The two devices that make it interpretable are the impulse-response function (IRF) — how the whole system reacts over time to a one-off shock — and the forecast-error variance decomposition (FEVD) — what share of each variable's unpredictability traces back to each shock. Both fall straight out of the VMA(\(\infty\)) coefficients \(\Psi_i\). EQ T5.5 — IMPULSE-RESPONSE FUNCTION $$ \Psi_i \;=\; \frac{\partial\, y_{t+i}}{\partial\, \varepsilon_t^{\top}}, \qquad \Psi_0 = I, \qquad \Psi_i = \sum_{j=1}^{\min(i,p)} A_j\,\Psi_{i-j} \quad\Big(\text{VAR(1): } \Psi_i = A_1^{\,i}\Big) $$ \((\Psi_i)_{mn}\) is the response of variable \(m\) at horizon \(i\) to a unit reduced-form shock in variable \(n\) at time 0. In a stable VAR the \(\Psi_i\) decay to zero, so every shock is transient. For the worked default \(A_1=\big(\begin{smallmatrix}0.5&0.3\\0.2&0.4\end{smallmatrix}\big)\), a unit shock to variable 1 traces \((\Psi_i)_{11} = 1,\ 0.5,\ 0.31,\ 0.209,\ldots\) and the long-run cumulative multiplier is \((I-A_1)^{-1}\) — for variable 1 onto itself, \(2.5\). The IRF turns a coefficient matrix into a story. The identification caveat — say it out loud. Reduced-form shocks \(\varepsilon_t\) are contemporaneously correlated (\(\Sigma\) is not diagonal), so "a shock to variable 1 alone" is not well defined: in the data, variable 2 tends to move at the same instant. To read structural impulse responses you must impose identifying assumptions that orthogonalize the shocks. The textbook choice is a Cholesky (recursive) ordering — factor \(\Sigma = P P^{\top}\) with \(P\) lower-triangular and report \(\Psi_i P\) — which assumes a causal ordering of the variables (those earlier in the list can shock those later within the period, but not vice versa). 
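The recursive identification just described can be sketched numerically. This is an illustration under stated assumptions, not the chapter's own code: the residual covariance `Sigma` is a made-up value, and `irf` and `cholesky_impact` are hypothetical helper names. The sketch builds \(\Psi_i = A_1^{\,i}\) for the worked default, forms the orthogonalized responses \(\Psi_i P\), and repeats the exercise with the two variables in reverse order.

```python
import numpy as np

A1 = np.array([[0.5, 0.3],
               [0.2, 0.4]])        # worked default from EQ T5.5
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])     # hypothetical residual covariance (assumption)

def irf(A1, horizons):
    """Reduced-form responses Psi_0..Psi_h of a VAR(1): Psi_i = A1^i."""
    psi = [np.eye(A1.shape[0])]
    for _ in range(horizons):
        psi.append(A1 @ psi[-1])
    return np.array(psi)           # shape (horizons + 1, K, K)

def cholesky_impact(Sigma, order):
    """Impact matrix P with Sigma = P P', recursive in the given variable order."""
    idx = np.array(order)
    L = np.linalg.cholesky(Sigma[np.ix_(idx, idx)])   # lower-triangular in permuted order
    inv = np.argsort(idx)
    return L[np.ix_(inv, inv)]                        # mapped back to the original order

psi = irf(A1, horizons=8)
P_12 = cholesky_impact(Sigma, order=[0, 1])   # variable 1 ordered first
P_21 = cholesky_impact(Sigma, order=[1, 0])   # variable 2 ordered first
theta_12 = psi @ P_12                         # orthogonalized IRFs, Theta_i = Psi_i P
theta_21 = psi @ P_21

np.set_printoptions(precision=3, suppress=True)
print("reduced-form (Psi_i)_11:", psi[:4, 0, 0])
print("long-run multiplier (1,1):", np.linalg.inv(np.eye(2) - A1)[0, 0])
print("impact matrix, ordering 1 -> 2:\n", P_12)
print("impact matrix, ordering 2 -> 1:\n", P_21)
print("response of y1 to shock 2, ordering 1 -> 2:", theta_12[:4, 0, 1])
print("response of y1 to shock 2, ordering 2 -> 1:", theta_21[:4, 0, 1])
```

Both impact matrices reproduce the same `Sigma`, yet they disagree about which variable may react within the period, so the orthogonalized responses differ from horizon 0 onward.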
Different orderings give different stories, and that ambiguity is the central, contested limitation of VAR analysis; sign restrictions, long-run (Blanchard–Quah) restrictions, and external-instrument (proxy-SVAR) methods are the modern alternatives. EQ T5.6 — FORECAST-ERROR VARIANCE DECOMPOSITION $$ \mathrm{MSE}\big(y_{t+h}\big) = \sum_{i=0}^{h-1} \Theta_i\,\Theta_i^{\top}, \quad \Theta_i = \Psi_i P, \qquad \omega_{mn}(h) = \frac{\sum_{i=0}^{h-1} \big(\Theta_i\big)_{mn}^2}{\big(\mathrm{MSE}(y_{t+h})\big)_{mm}} $$ \(\omega_{mn}(h)\) is the fraction of the \(h\)-step forecast-error variance of variable \(m\) attributable to (orthogonalized) shock \(n\); each row sums to one, \(\sum_n \omega_{mn}(h) = 1\). At \(h=1\) a variable's variance is typically dominated by its own shock; as \(h\) grows, cross-effects accumulate and the decomposition reveals how much of one series' long-run uncertainty is really imported from another. FEVD answers "where does this variable's surprise come from?", and like the IRF it inherits the ordering dependence above. INSTRUMENT T5.2 — IMPULSE-RESPONSE EXPLORER Ψᵢ = A₁ⁱ · UNIT SHOCK · EQ T5.5 a₁₁ 0.60 a₁₂ 0.20 a₂₁ 0.10 a₂₂ 0.50 A unit reduced-form shock hits the chosen variable at horizon 0; the curves are \(\Psi_i = A_1^{\,i}\) applied to that impulse. In a stable system both responses decay to zero; the long-run multiplier is the limit of the cumulative response, \(\sum_{i\ge 0}\Psi_i = (I-A_1)^{-1}\). Note that the instrument uses reduced-form shocks; structural IRFs would require an identifying assumption such as the Cholesky factor \(P\) used in EQ T5.6. 5.4 Cointegration & the VECM Everything above assumed stability: eigenvalues strictly inside the unit circle. But most economic and financial level series are integrated of order one, \(I(1)\): they have a unit root, wander like random walks, and only their differences are stationary.
Run a VAR on the raw levels of two \(I(1)\) series and OLS will happily report a high \(R^2\) that is mostly spurious regression — two independent random walks look correlated purely because both trend. The remarkable exception is cointegration. Two (or more) \(I(1)\) series are cointegrated if some linear combination of them is \(I(0)\) — stationary. Intuitively, the series share a common stochastic trend, and although each wanders without bound, they cannot wander independently: a spread between them is mean-reverting. Engle and Granger (1987) formalized this and, crucially, the Granger representation theorem proved that cointegration is equivalent to the existence of an error-correction representation. EQ T5.7 — COINTEGRATION $$ y_t \sim I(1)^K, \qquad \exists\, \beta \neq 0:\ \ \beta^{\top} y_t \sim I(0). \qquad \text{Common-trend form: } \begin{cases} x_t = w_t + u_t \\ z_t = w_t + v_t \end{cases},\ \ w_t = w_{t-1} + \eta_t $$ \(\beta\) is a cointegrating vector; \(\beta^\top y_t\) is the stationary equilibrium error. In the two-series common-trend example, \(x_t\) and \(z_t\) each inherit the random walk \(w_t\) and are individually \(I(1)\), yet \(x_t - z_t = u_t - v_t\) cancels the trend and is \(I(0)\): here \(\beta = (1,-1)^\top\). The number of independent cointegrating vectors is the cointegration rank \(r\), \(0 \le r < K\). \(r=0\) means no long-run tie (difference everything and fit a VAR); \(r=K\) would mean the levels were stationary all along. True or false: if two series are each \(I(1)\) but some linear combination of them is stationary (\(I(0)\)), the series are cointegrated. (Answer true or false.) This is the definition of cointegration (EQ T5.7): individually non-stationary \(I(1)\) series whose linear combination \(\beta^\top y_t\) is stationary share a common stochastic trend that the combination cancels. The statement is true. The error-correction form is the Vector Error-Correction Model (VECM). 
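Before the VECM algebra, the common-trend example of EQ T5.7 can be checked by simulation. The sketch below is illustrative only: the sample size, the unit noise variances, and the helper name `lag1_autocorr` are assumptions, and lag-1 autocorrelation is used as a rough persistence gauge, not as a formal unit-root or cointegration test.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000

# Common stochastic trend w_t = w_{t-1} + eta_t, plus stationary noise per series.
w = np.cumsum(rng.normal(0, 1, T))
u = rng.normal(0, 1, T)
v = rng.normal(0, 1, T)
x = w + u                      # I(1): inherits the random walk
z = w + v                      # I(1): inherits the same random walk

beta = np.array([1.0, -1.0])   # cointegrating vector from EQ T5.7
spread = np.column_stack([x, z]) @ beta   # beta' y_t = x_t - z_t = u_t - v_t

def lag1_autocorr(s):
    """Sample lag-1 autocorrelation: near 1 for a random walk, near 0 for white noise."""
    d = s - s.mean()
    return float(d[1:] @ d[:-1] / (d @ d))

print("lag-1 autocorr of x      :", round(lag1_autocorr(x), 3))
print("lag-1 autocorr of z      :", round(lag1_autocorr(z), 3))
print("lag-1 autocorr of spread :", round(lag1_autocorr(spread), 3))
print("variance of x, first vs second half:",
      round(float(x[:T // 2].var()), 1), round(float(x[T // 2:].var()), 1))
print("variance of spread, first vs second half:",
      round(float(spread[:T // 2].var()), 2), round(float(spread[T // 2:].var()), 2))
```

Each level series is highly persistent, while the spread behaves like white noise around zero. That mean-reverting equilibrium error is the quantity the error-correction term responds to.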
Re-parameterize the levels VAR in differences, with one term in levels left behind: EQ T5.8 — VECTOR ERROR-CORRECTION MODEL $$ \Delta y_t \;=\; \Pi\, y_{t-1} \;+\; \sum_{i=1}^{p-1} \Gamma_i\, \Delta y_{t-i} \;+\; c \;+\; \varepsilon_t, \qquad \Pi \;=\; \alpha\,\beta^{\top}, \quad \mathrm{rank}(\Pi) = r $$ Everything except \(\Pi y_{t-1}\) is in stationary differences. The long-run information lives entirely in \(\Pi\): its rank is the cointegration rank \(r\). When \(0