The bias–variance decomposition
Assume the world generates labels as \(y = f(x) + \varepsilon\): a true function \(f\) corrupted by noise with variance \(\sigma^2\). You never see \(f\) — you see one training set \(\mathcal{D}\), a finite sample of that process, and you fit \(\hat{f}_{\mathcal{D}}\) to it. Had the sample come out differently, your model would have too. The honest question is therefore an average over training sets: how wrong is the procedure, not just this one fit? For squared error the answer splits exactly into three parts:
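At a fixed input \(x\), and assuming the noise has zero mean and is independent of the training set (the standard assumptions, stated here for completeness), the split can be written as below. This is the textbook identity in this section's notation; the label EQ M6.1 used later in the section is taken to refer to it.

```latex
\mathbb{E}_{\mathcal{D},\varepsilon}\!\left[\big(y - \hat{f}_{\mathcal{D}}(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

The first term measures how far the average fit sits from the truth, the second how much individual fits scatter around that average, and the third is the noise no model can remove.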
The archery reading: bias is your sights being misaligned — every arrow lands off-center the same way. Variance is an unsteady hand — arrows scatter, even though they center on the bullseye. A degree-1 polynomial fit to a cubic has misaligned sights: resample the data all you like, the average line is still a line, still wrong. A degree-12 polynomial has a violently unsteady hand: each resample produces a different contortion, and only their unreachable average is close to the truth.
The decomposition is exact for squared loss. For classification under 0–1 loss the clean additive split breaks down (bias and variance interact through the decision boundary), but the qualitative trade-off survives and the vocabulary is used everywhere regardless. Honest usage: treat EQ M6.1 as a precise statement about regression and a sharp metaphor for everything else.
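The decomposition can be checked numerically by actually resampling training sets. The sketch below reuses the cubic truth and noise level from this page's other snippets; the number of resamples, the training-set size, the query grid, and the helper name `decompose` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA, N_TRAIN, N_SETS = 0.18, 18, 500   # noise std from this page; the other two are assumptions

def f(x):
    return 1.5 * x**3 - 0.9 * x

x_grid = np.linspace(-0.9, 0.9, 50)      # fixed query points (assumption)

def decompose(degree):
    """Estimate bias^2 and variance of a polynomial fit by resampling training sets."""
    preds = np.empty((N_SETS, x_grid.size))
    for i in range(N_SETS):
        x = rng.uniform(-1, 1, N_TRAIN)
        y = f(x) + rng.normal(0, SIGMA, N_TRAIN)
        w, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
        preds[i] = np.vander(x_grid, degree + 1) @ w
    mean_pred = preds.mean(axis=0)                        # the "average fit"
    bias2 = float(np.mean((mean_pred - f(x_grid))**2))    # squared bias, averaged over the grid
    variance = float(np.mean(preds.var(axis=0)))          # scatter around the average fit
    err_to_f = float(np.mean((preds - f(x_grid))**2))     # error against the noiseless truth
    return bias2, variance, err_to_f

print(f"{'deg':>4}{'bias^2':>12}{'variance':>12}{'bias^2+var':>14}{'err vs f':>12}")
for d in (1, 3, 12):
    b2, var, err = decompose(d)
    print(f"{d:>4}{b2:>12.4f}{var:>12.4f}{b2 + var:>14.4f}{err:>12.4f}")
print(f"add sigma^2 = {SIGMA**2:.4f} to the last column for expected held-out MSE")
```

The last two columns agree because the split is an algebraic identity over the resampled fits, not an approximation.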
Capacity and the U-curve
Capacity is the informal name for how rich a function family you are fitting: polynomial degree, tree depth, parameter count, training time. As capacity rises, training error can only fall or stay flat: when a bigger family contains the smaller one, the best fit in the bigger family is at least as good on the points it sees. Held-out error does something entirely different. It falls while added capacity is buying down bias, bottoms out, then climbs as the model starts spending its freedom on the noise. That is the classical U-curve, and the diagnosis table that goes with it is one of the most-used decision procedures in applied ML:
| Observation | Diagnosis | The move |
|---|---|---|
| Train error high, held-out error high and close to it | underfit · bias-dominated | More capacity, better features, train longer, weaken regularization |
| Train error near zero, held-out error far above it | overfit · variance-dominated | More data, stronger regularization, less capacity, early stopping |
| Both errors near the noise floor \(\sigma^2\) | converged | Stop. Further gains require better data, not a better model. |
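The table can be read as a small decision function. The sketch below is one hypothetical encoding: the function name `diagnose` and the thresholds `tol` and `gap` are assumptions made for illustration, and in practice the noise floor is rarely known and has to be estimated.

```python
def diagnose(train_err, val_err, noise_floor, tol=1.5, gap=2.0):
    """Map a (train, held-out) error pair to a row of the diagnosis table.

    tol and gap are hypothetical thresholds, not values from the text:
    tol: how close to the noise floor counts as "near" it
    gap: how many times larger than train error counts as "far above" it
    """
    if val_err <= tol * noise_floor:
        return "converged"      # held-out error is already near the noise floor
    if val_err > gap * train_err:
        return "overfit"        # large train/held-out gap: variance-dominated
    return "underfit"           # both errors high and close: bias-dominated

print(diagnose(0.050, 0.055, 0.0324))
print(diagnose(0.001, 0.400, 0.0324))
print(diagnose(0.030, 0.034, 0.0324))
```

The order of the checks matters: a model whose held-out error is already at the noise floor has nothing left to gain, whatever its training error.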
import numpy as np

rng = np.random.default_rng(0)

def f(x):  # the truth, unknown to the model
    return 1.5*x**3 - 0.9*x

x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_te = rng.uniform(-1, 1, 200); y_te = f(x_te) + rng.normal(0, 0.18, 200)

def fit(x, y, d):  # least squares on the Vandermonde matrix
    w, *_ = np.linalg.lstsq(np.vander(x, d + 1), y, rcond=None)
    return w

print(f"{'deg':>4}{'train MSE':>12}{'test MSE':>11}")
for d in (1, 3, 11):
    w = fit(x_tr, y_tr, d)
    tr = np.mean((np.vander(x_tr, d+1) @ w - y_tr)**2)
    te = np.mean((np.vander(x_te, d+1) @ w - y_te)**2)
    print(f"{d:>4}{tr:>12.4f}{te:>11.4f}")
print(f"\nirreducible noise floor sigma^2 = {0.18**2:.4f}")
The U-curve is true but incomplete. Push capacity far past the point where the model can interpolate its training data exactly, and held-out error often falls a second time — double descent (Belkin et al. 2019; Nakkiran et al. 2019, who also found it epoch-wise). In the heavily overparameterized regime, gradient descent among the many zero-train-error solutions implicitly prefers low-norm, smooth ones — the optimizer regularizes even when you don't ask it to. This is the regime modern LLMs live in, and part of why "bigger is better" holds there (Vol II · Ch 04 scaling laws). Honest status: the classical U still governs the small-data regime — this page's instruments, most tabular work, most fine-tunes — and a complete theory unifying both regimes remains open.
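The implicit preference for low-norm solutions can be seen exactly in the simplest overparameterized case. The sketch below assumes plain linear least squares with more parameters than data points and gradient descent started from zero; the sizes, step count, and random data are arbitrary choices, and it makes no claim about how deep networks behave.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 40                              # more parameters than points (assumed sizes)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

eta = 1.0 / np.linalg.norm(X, 2)**2        # step size 1/L, L = largest eigenvalue of X^T X
w = np.zeros(p)                            # zero initialization matters for this result
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)           # gradient of 0.5 * ||Xw - y||^2

w_min = np.linalg.pinv(X) @ y              # minimum-norm interpolating solution

# Build a different interpolator by adding a direction X cannot see (its null space)
e0 = np.eye(p)[0]
null_dir = e0 - np.linalg.pinv(X) @ (X @ e0)
w_other = w_min + null_dir

print("train MSE, gradient descent :", np.mean((X @ w - y)**2))
print("train MSE, other solution   :", np.mean((X @ w_other - y)**2))
print("distance GD to min-norm     :", np.linalg.norm(w - w_min))
print("norms (min-norm, other)     :", np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both solutions fit the training data exactly; gradient descent lands on the one with the smallest norm without any penalty being added to the loss.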
Regularization: paying for smoothness
Choosing capacity by deleting parameters (degree 3, not 9) is a blunt dial. Regularization keeps the big model and instead charges it for complexity: add a penalty on the size of the weights to the training loss, and let a continuous knob \(\lambda\) set the price. The two canonical currencies differ only in which norm they tax — and that one choice changes everything about the solution's character.
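In symbols, for a design matrix \(X\) and targets \(y\) (notation assumed here to match the code on this page), the two penalized objectives can be written as follows. The ridge closed form is the expression the page's code solves under the label EQ M6.2; the lasso has no closed form in general.

```latex
\hat{w}_{\text{ridge}}
  = \arg\min_{w}\; \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y,
\qquad
\hat{w}_{\text{lasso}}
  = \arg\min_{w}\; \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_1 .
```

When the intercept is left unpenalized, as in the page's code, the corresponding diagonal entry of \(I\) is set to zero.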
Weight decay is L2, with one large caveat. For plain SGD, adding \(\tfrac{\lambda}{2} \lVert w \rVert_2^2\) to the loss and multiplying weights by \((1 - \eta\lambda)\) each step are the same update (with the unhalved penalty \(\lambda \lVert w \rVert_2^2\) the factor becomes \(1 - 2\eta\lambda\), which only rescales \(\lambda\)). For adaptive optimizers they are not: Adam rescales the penalty's gradient per-coordinate along with everything else, quietly distorting the regularizer. AdamW fixes this by decoupling the decay from the adaptive machinery (Vol II · EQ 4.3), which is why so many modern LLM recipes say "AdamW, weight decay 0.1" rather than "L2 in the loss". Same idea, different plumbing, measurably different result.
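A single hand-written update step makes the distinction concrete. The sketch below is a simplified illustration, not any library's implementation: the gradient `g` is a made-up fixed vector, `adam_step` is a hypothetical helper that performs only the first Adam step from zero moment state, and the L2 penalty is taken as \(\tfrac{\lambda}{2}\lVert w\rVert_2^2\) so that its gradient is \(\lambda w\).

```python
import numpy as np

eta, lam = 0.1, 0.01
w0 = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 0.003, -0.03])    # hypothetical data-loss gradient

# Plain SGD: penalty (lam/2)*||w||^2 in the loss vs. multiplicative weight decay
w_l2 = w0 - eta * (g + lam * w0)
w_decay = (1 - eta * lam) * w0 - eta * g

def adam_step(w, grad, decay=0.0, b1=0.9, b2=0.999, eps=1e-8):
    """First Adam step (t = 1, zero initial moments) with optional decoupled decay."""
    m = (1 - b1) * grad
    v = (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1), v / (1 - b2)      # bias correction at t = 1
    return w - eta * m_hat / (np.sqrt(v_hat) + eps) - eta * decay * w

w_adam_l2 = adam_step(w0, g + lam * w0)    # penalty gradient passes through the rescaling
w_adamw = adam_step(w0, g, decay=lam)      # decay applied outside the rescaling

print("SGD   L2 vs decay :", w_l2, w_decay)
print("Adam  L2 in loss  :", w_adam_l2)
print("AdamW decoupled   :", w_adamw)
```

The two SGD results coincide, while the two adaptive results do not: once the penalty gradient is divided by the running gradient scale, it no longer shrinks each weight in proportion to its size.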
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 1.5*x**3 - 0.9*x

x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_te = rng.uniform(-1, 1, 300); y_te = f(x_te) + rng.normal(0, 0.18, 300)

d = 9
Xtr, Xte = np.vander(x_tr, d + 1), np.vander(x_te, d + 1)
I = np.eye(d + 1); I[-1, -1] = 0.0  # vander puts the intercept last; leave it unpenalized

lams, mses = np.logspace(-6, 2, 41), []
for lam in lams:
    w = np.linalg.solve(Xtr.T @ Xtr + lam * I, Xtr.T @ y_tr)  # EQ M6.2
    mses.append(float(np.mean((Xte @ w - y_te)**2)))

b = int(np.argmin(mses))
print(f"near-zero lam = 1e-6 : test MSE = {mses[0]:.4f}")
print(f"best lam = {lams[b]:<8.3g}: test MSE = {mses[b]:.4f}")
print(f"crushing lam = 1e2 : test MSE = {mses[-1]:.4f}")

# plot_xy is assumed to be the page's plotting helper; it is not defined in this snippet
if "plot_xy" in globals():
    plot_xy(np.log10(lams), np.array(mses))  # the regularization U-curve
Validation discipline
Every dial in this chapter — degree, \(\lambda\), stopping epoch — must be tuned against data the fit never saw, which forces the three-way split: train (fit parameters), validation (choose hyperparameters), test (touch once, report, stop). When data is scarce, k-fold cross-validation recycles it: split into \(k\) folds (5 or 10 is standard), train \(k\) times each holding out a different fold, and average the held-out scores. The average is a far lower-variance estimate of generalization than any single split — at \(k\times\) the compute. Once hyperparameters are chosen, refit on everything. If you also want an honest estimate of the whole selection pipeline, nest a second CV loop around it; people skip this and quietly report optimistic numbers.
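The k-fold recipe is short enough to write out by hand. The sketch below uses it to choose \(\lambda\) for a degree-9 ridge fit on the same cubic used elsewhere on this page; the sample size, the \(\lambda\) grid, and the helper names `ridge` and `cv_mse` are assumptions for illustration. The folds are drawn once and shared across all \(\lambda\) values so the scores are comparable.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return 1.5 * x**3 - 0.9 * x

x = rng.uniform(-1, 1, 40)
y = f(x) + rng.normal(0, 0.18, 40)
X = np.vander(x, 10)                          # degree-9 features, intercept column last

def ridge(X, y, lam):
    I = np.eye(X.shape[1]); I[-1, -1] = 0.0   # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

def cv_mse(X, y, lam, folds):
    """Average held-out MSE over the folds, each held out exactly once."""
    scores = []
    for i, val in enumerate(folds):
        trn = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        w = ridge(X[trn], y[trn], lam)
        scores.append(np.mean((X[val] @ w - y[val])**2))
    return float(np.mean(scores))

folds = np.array_split(rng.permutation(len(y)), 5)    # k = 5, fixed for every lam
lams = np.logspace(-6, 2, 17)
cv = [cv_mse(X, y, lam, folds) for lam in lams]

best_lam = lams[int(np.argmin(cv))]
w_final = ridge(X, y, best_lam)                       # refit on everything
print(f"best lam = {best_lam:.3g}, CV MSE = {min(cv):.4f}")
```

The reported CV score was itself used to pick `best_lam`, so it is slightly optimistic; an unbiased estimate of the whole pipeline needs the nested loop described above or an untouched test set.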
The dominant failure mode is not bad math — it is leakage: information from the evaluation side contaminating the training side. Leakage produces beautiful validation scores and production disasters, and it is almost always a pipeline bug, not a modeling bug:
| Leak | Horror story | The fix |
|---|---|---|
| Preprocessing leak | Scaler / imputer / feature-selector fit on the full dataset before splitting — test-set statistics seep into training | fit every transform inside the training fold only |
| Duplicate leak | Near-identical rows land on both sides of the split; the model "generalizes" to data it memorized | dedup before splitting; fuzzy-match, not exact-match |
| Temporal leak | Random split of time-ordered data — the model trains on the future it will be asked to predict | split by time; validate strictly forward |
| Group leak | Same patient's scans in train and test; the model learns the patient, scores brilliantly, transfers to nobody | split by group id, never by row |
| Target leak | A feature is a downstream echo of the label ("account_closed_date" predicting churn) | audit features for post-outcome information |
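Two of the fixes in the table are mechanical enough to sketch: splitting by group id and fitting a transform on the training side only. The data below is synthetic, and the group ids stand in for something like patient ids; every name and size is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(5.0, 2.0, size=(100, 3))        # synthetic features
groups = rng.integers(0, 20, size=100)         # hypothetical group (e.g. patient) ids

# Group leak fix: hold out whole groups, never individual rows
held_out = rng.choice(np.unique(groups), size=5, replace=False)
test_mask = np.isin(groups, held_out)
X_tr, X_te = X[~test_mask], X[test_mask]

# Preprocessing leak fix: estimate scaling statistics from training rows only,
# then apply those same statistics to both sides
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_s = (X_tr - mu) / sd
X_te_s = (X_te - mu) / sd

shared = set(groups[test_mask]) & set(groups[~test_mask])
print("groups on both sides of the split:", len(shared))
print("train means after scaling:", X_tr_s.mean(axis=0).round(6))
print("test means after scaling :", X_te_s.mean(axis=0).round(3))
```

The test-side means come out close to zero but not exactly zero, which is the expected signature of a scaler that has never seen the test rows.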
The test set is an instrument you can use once. Every decision influenced by test numbers — "try one more λ", "rerun with the other seed" — silently moves test data into the training loop; iterate enough and the test score becomes fiction. Kaggle's public-vs-private leaderboard shakeups are this effect measured at scale. The same failure operates on LLMs as eval contamination — benchmarks leaking into web-scale training corpora (Vol II · Ch 04 decontamination, and the fine-tuning pitfall list in Vol II · Ch 06). Different scale, identical sin: testing on something the model has, in any form, already seen.
Early stopping & dropout as regularizers
Two of the most-used regularizers never touch the loss function. Early stopping exploits the fact that training time is itself a capacity dial: gradient descent fits broad, smooth structure first and noise last, so the validation curve traces the familiar U over epochs. The recipe is mechanical — evaluate on validation each epoch, checkpoint the best, stop after \(p\) epochs without improvement (patience), restore the best checkpoint. It is not merely a heuristic: for linear least squares, gradient descent stopped at step \(t\) is approximately ridge regression with \(\lambda \propto 1/(\eta t)\) — each direction of the solution gets pulled in at a rate set by its singular value, so stopping early leaves the weak, noise-dominated directions still near zero. Stopping early and penalizing weights are the same medicine through different needles.
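The patience recipe translates almost line for line into code. The sketch below applies it to full-batch gradient descent on a degree-9 polynomial least-squares problem, so each step plays the role of an epoch; the data sizes, the patience value, and the step cap are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return 1.5 * x**3 - 0.9 * x

x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_va = rng.uniform(-1, 1, 100); y_va = f(x_va) + rng.normal(0, 0.18, 100)
d = 9
Xtr, Xva = np.vander(x_tr, d + 1), np.vander(x_va, d + 1)

eta = 1.0 / np.linalg.norm(Xtr, 2)**2      # step size 1/L for 0.5 * ||Xw - y||^2
patience = 200                             # hypothetical patience, in steps
w = np.zeros(d + 1)
best_w, best_val, best_step, wait = w.copy(), np.inf, 0, 0

for step in range(1, 50_001):
    w -= eta * Xtr.T @ (Xtr @ w - y_tr)            # one full-batch gradient step
    val = float(np.mean((Xva @ w - y_va)**2))      # evaluate on validation
    if val < best_val:                             # checkpoint the best
        best_w, best_val, best_step, wait = w.copy(), val, step, 0
    else:
        wait += 1
        if wait >= patience:                       # no improvement for `patience` steps
            break

w = best_w                                         # restore the best checkpoint
print(f"stopped at step {step}, best step {best_step}, best val MSE {best_val:.4f}")
```

Restoring `best_w` is the step most often forgotten: without it, the model that ships is the one from `patience` steps past the optimum.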
Dropout attacks variance from a different angle: during training, zero each hidden unit independently with probability \(p\) (and scale survivors by \(1/(1-p)\) — "inverted dropout" — so activation magnitudes match at inference, when nothing is dropped). Two readings coexist. The ensemble view: each step trains a different random subnetwork, and inference approximates averaging exponentially many of them — and averaging is variance reduction by construction. The co-adaptation view: no unit can rely on a specific partner that might vanish, so features are forced to be individually useful. For linear models, dropout works out to an L2-like penalty scaled by each feature's second moment (Wager et al. 2013) — once again, a familiar uniform.
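Inverted dropout is a few lines of array code. The sketch below is a minimal NumPy version under the definition given above; the function name `dropout` and its signature are assumptions for illustration, not any framework's API.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h                              # inference: nothing is dropped or rescaled
    mask = rng.random(h.shape) >= p           # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(5)
h = np.ones((1000, 512))                      # stand-in activations
out = dropout(h, 0.3, rng)

print("fraction zeroed      :", float(np.mean(out == 0)))
print("mean activation      :", float(out.mean()))
print("unchanged at inference:", dropout(h, 0.3, rng, training=False) is h)
```

Because survivors are scaled up during training, the expected activation stays the same, and the inference path needs no correction at all.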
Honest modern footnote: dropout has largely vanished from LLM pre-training — one epoch over trillions of tokens means the binding constraint is underfitting, not overfitting — while weight decay and early stopping never left. But shrink the data and the classics return instantly: small-data fine-tunes ship with dropout on the adapters (the LoRA default of 0.05 in Vol II · Ch 06's recipe) and validation-based stopping. Regularization never became obsolete; it just follows the data-to-parameter ratio around.
The full toolbox, ordered by how often it is the right answer: more data (the only regularizer with no downside), weight decay / L2, early stopping, dropout, data augmentation, smaller model, L1 when you need the zeros. All of them buy the same thing, lower variance, and every one except more data charges the same currency: a little added bias.
You now own the budget every model must balance. Chapter 07 builds the first machine with enough capacity to need all of it: the multi-layer perceptron — perceptrons, hidden layers, activation functions, and a tiny network you can train on XOR in the page while you watch the decision boundary bend.
Further reading
- Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. — the paper that introduced the bias–variance decomposition to the learning community.
- Tikhonov, A. N. (1963). Solution of Incorrectly Formulated Problems and the Regularization Method. — the origin of L2 (ridge) regularization as a cure for ill-posed problems.
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. — introduces the L1 penalty and the sparsity it induces.
- Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. — the canonical reference for dropout as stochastic regularization.
- Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. — the formal foundation of cross-validation and held-out model selection.
- Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-off. — the "double descent" result that complicates the classic U-curve.