06 · Generalization: Bias, Variance & Regularization

6.1

The bias–variance decomposition

Assume the world generates labels as $y = f(x) + \varepsilon$: a true function $f$ corrupted by zero-mean noise $\varepsilon$ with variance $\sigma^2$, drawn independently of the training sample. You never see $f$ — you see one training set $\mathcal{D}$, a finite sample of that process, and you fit $\hat{f}_{\mathcal{D}}$ to it. Had the sample come out differently, your model would have too. The honest question is therefore an average over training sets: how wrong is the procedure, not just this one fit? For squared error the answer splits exactly into three parts:

EQ M6.1 — BIAS–VARIANCE DECOMPOSITION $$ \mathbb{E}_{\mathcal{D},\,\varepsilon}\!\left[\big(y - \hat{f}_{\mathcal{D}}(x)\big)^{2}\right] \;=\; \underbrace{\big(f(x) - \bar{f}(x)\big)^{2}}_{\text{bias}^{2}} \;+\; \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x)\big)^{2}\right]}_{\text{variance}} \;+\; \underbrace{\sigma^{2}}_{\text{noise}} \qquad \bar{f}(x) = \mathbb{E}_{\mathcal{D}}\big[\hat{f}_{\mathcal{D}}(x)\big] $$

$\bar{f}$ is the average model — what your procedure produces averaged over all training sets it might have been dealt. Bias² is how far that average sits from the truth: the error your model family makes systematically, even with infinite resamples. Variance is how much any single fit scatters around that average: the error of trusting one particular sample. Noise $\sigma^2$ is the floor — no model, however clever, beats it. Capacity buys down bias by paying in variance; the exchange rate is the subject of this chapter.

At a probe point a procedure has $\text{bias}^2 = 0.49$, variance $= 0.04$, and label-noise variance $\sigma^2 = 0.09$. By EQ M6.1, what is the expected squared error?

The decomposition is additive: expected error $= \text{bias}^2 + \text{variance} + \sigma^2 = 0.49 + 0.04 + 0.09 = $ 0.62. Bias² dominates, so this family is underfitting — add capacity, not data.

The archery reading: bias is your sights being misaligned — every arrow lands off-center the same way. Variance is an unsteady hand — arrows scatter, even though they center on the bullseye. A degree-1 polynomial fit to a cubic has misaligned sights: resample the data all you like, the average line is still a line, still wrong. A degree-12 polynomial has a violently unsteady hand: each resample produces a different contortion, and only their unreachable average is close to the truth.
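The archery picture can be checked by simulation. The sketch below is a minimal Monte Carlo estimate of bias² and variance at one probe point, under assumptions that are not part of the text above: the truth is the cubic used in this chapter's runnable examples, each training set has 18 points, and the probe point `x0 = 0.5` and the helper name `bias_variance_at` are illustrative choices.

```python
import numpy as np

def f(x):
    # Assumed truth: the cubic from this chapter's runnable examples
    return 1.5 * x**3 - 0.9 * x

def bias_variance_at(x0, degree, n_train=18, sigma=0.18, n_resamples=500, seed=0):
    """Monte Carlo estimate of (bias^2, variance) of a polynomial fit at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_resamples)
    for i in range(n_resamples):
        # Deal a fresh training set from the same world
        x = rng.uniform(-1, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        w, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
        preds[i] = np.polyval(w, x0)      # vander and polyval share coefficient order
    f_bar = preds.mean()                  # the average model at x0
    return (f(x0) - f_bar) ** 2, preds.var()

results = {}
for d in (1, 3, 12):
    results[d] = bias_variance_at(0.5, d)
    print(f"degree {d:>2}: bias^2 = {results[d][0]:.4f}   variance = {results[d][1]:.4f}")
```

Under these assumptions the degree-1 row should show the misaligned sights (bias² well above the degree-3 row) and the degree-12 row the unsteady hand (variance well above the degree-3 row). The exact numbers depend on the seed and the probe point.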

The decomposition is exact for squared loss. For classification under 0–1 loss the clean additive split breaks down (bias and variance interact through the decision boundary), but the qualitative trade-off survives and the vocabulary is used everywhere regardless. Honest usage: treat EQ M6.1 as a precise statement about regression and a sharp metaphor for everything else.

6.2

Capacity and the U-curve

Capacity is the informal name for how rich a function family you are fitting — polynomial degree, tree depth, parameter count, training time. As capacity rises, training error falls monotonically: for nested families such as polynomials of increasing degree, a bigger family contains the smaller one, so the best achievable fit to the points it sees can only improve. Held-out error does something entirely different — it falls while added capacity is buying down bias, bottoms out, then climbs as the model starts spending its freedom on the noise. That is the classical U-curve, and the diagnosis table that goes with it is the most-used decision procedure in applied ML:

Observation	Diagnosis	The move
Train error high, held-out error high and close to it	underfit · bias-dominated	More capacity, better features, train longer, weaken regularization
Train error near zero, held-out error far above it	overfit · variance-dominated	More data, stronger regularization, less capacity, early stopping
Both errors near the noise floor $\sigma^2$	converged	Stop. Further gains require better data, not a better model.
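
The table can be read as a small decision function. The sketch below is one hedged encoding of it: the function name `diagnose` and both thresholds (`gap_ratio`, `floor_tol`) are hypothetical choices made for illustration, not values given in the text, and real projects would tune them to their own error scales.

```python
def diagnose(train_mse, heldout_mse, noise_floor, gap_ratio=2.0, floor_tol=1.5):
    """Map a (train, held-out) error pair to a regime from the diagnosis table.

    gap_ratio and floor_tol are illustrative thresholds, not canonical values.
    """
    if heldout_mse > gap_ratio * train_mse:
        # Held-out error far above train error: variance-dominated
        return "overfit"
    if heldout_mse <= floor_tol * noise_floor:
        # Both errors sit near the irreducible noise
        return "converged"
    # Errors are close to each other but well above the floor: bias-dominated
    return "underfit"

sigma2 = 0.18 ** 2   # noise floor used in this chapter's examples
for tr, te in [(0.20, 0.22), (0.001, 0.5), (0.030, 0.035)]:
    print(f"train={tr:.3f}  held-out={te:.3f}  ->  {diagnose(tr, te, sigma2)}")
```

The overfit check comes first on purpose: a large gap is the stronger signal, and a near-zero train error would otherwise be mistaken for convergence.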

The truth at a probe point is $f(x_0) = 4.5$. The same procedure trained on three resampled datasets predicts 4, 5, and 6 there. What is the squared bias, $(f - \bar f)^2$?

First the average model: $\bar f = (4 + 5 + 6)/3 = 5$. Then bias$^2 = (f - \bar f)^2 = (4.5 - 5)^2 = (-0.5)^2 = $ 0.25. Bias measures how far the procedure's average prediction sits from the truth — independent of how much any single fit scatters.
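
The same arithmetic in code, as a minimal sketch using only the numbers from the exercise. It also computes the variance of the three predictions and checks that, with label noise left out, the mean squared distance from the truth equals bias² plus variance.

```python
import numpy as np

f_true = 4.5                          # truth at the probe point
preds = np.array([4.0, 5.0, 6.0])     # one prediction per resampled dataset

f_bar = preds.mean()                  # average model: 5.0
bias_sq = (f_true - f_bar) ** 2       # squared bias: 0.25
variance = np.mean((preds - f_bar) ** 2)   # scatter around the average: 2/3

# Noise-free part of EQ M6.1: mean squared error against f splits additively
mse_vs_truth = np.mean((f_true - preds) ** 2)

print(f"f_bar = {f_bar}, bias^2 = {bias_sq}, variance = {variance:.4f}")
print(f"bias^2 + variance = {bias_sq + variance:.4f}, direct MSE vs truth = {mse_vs_truth:.4f}")
```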

INSTRUMENT M6.1 — DEGREE DIAL · 18 NOISY POINTS · TRUE f IS A CUBIC · NORMAL EQUATIONS, LIVE

[Interactive instrument. Controls: POLYNOMIAL DEGREE d (shown at 3) and a RESAMPLE THE WORLD button. Lower chart: TRAIN vs HELD-OUT MSE across all degrees, log scale, sweet spot marked. Readouts: TRAIN MSE (18 pts), HELD-OUT MSE (160 pts), GEN. GAP (held-out − train), REGIME.]

The dashed ghost is the true cubic; the model never sees it. At d = 1, click RESAMPLE THE WORLD repeatedly: the line barely moves but is always wrong — pure bias. At d = 12, train MSE collapses while the curve thrashes wildly between resamples and held-out MSE explodes — pure variance. The lower chart is EQ M6.1 made empirical: train error only falls, held-out error is a U, and the sweet spot hugs the true degree 3.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(0)

def f(x):                              # the truth — unknown to the model
    return 1.5*x**3 - 0.9*x

x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_te = rng.uniform(-1, 1, 200); y_te = f(x_te) + rng.normal(0, 0.18, 200)

def fit(x, y, d):                      # least squares on the Vandermonde matrix
    w, *_ = np.linalg.lstsq(np.vander(x, d + 1), y, rcond=None)
    return w

print(f"{'deg':>4}{'train MSE':>12}{'test MSE':>11}")
for d in (1, 3, 11):
    w  = fit(x_tr, y_tr, d)
    tr = np.mean((np.vander(x_tr, d+1) @ w - y_tr)**2)
    te = np.mean((np.vander(x_te, d+1) @ w - y_te)**2)
    print(f"{d:>4}{tr:>12.4f}{te:>11.4f}")
print(f"\nirreducible noise floor sigma^2 = {0.18**2:.4f}")

edits are live — try d = 17, or 50 training points

MODERN

The U-curve is true but incomplete. Push capacity far past the point where the model can interpolate its training data exactly, and held-out error often falls a second time — double descent (Belkin et al. 2019; Nakkiran et al. 2019, who also found it epoch-wise). In the heavily overparameterized regime, gradient descent among the many zero-train-error solutions implicitly prefers low-norm, smooth ones — the optimizer regularizes even when you don't ask it to. This is the regime modern LLMs live in, and part of why "bigger is better" holds there (Vol II · Ch 04 scaling laws). Honest status: the classical U still governs the small-data regime — this page's instruments, most tabular work, most fine-tunes — and a complete theory unifying both regimes remains open.
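
The implicit-regularization claim has an exact, checkable form in the simplest setting. The sketch below assumes a toy overparameterized linear least-squares problem (10 random points, 40 weights, all invented for illustration) and shows that gradient descent started from zero lands on the minimum-norm interpolating solution, the one `np.linalg.pinv` returns. It says nothing directly about deep networks, where the picture is less settled.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 40                          # more weights than data points
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Plain gradient descent on 0.5 * ||Xw - y||^2, started from w = 0
w = np.zeros(p)
eta = 1.0 / np.linalg.norm(X, 2) ** 2  # step size 1 / sigma_max^2
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)

w_min = np.linalg.pinv(X) @ y          # minimum-norm interpolant

# Another zero-train-error solution: add a component from the null space of X
null_dir = rng.normal(size=p)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)
w_other = w_min + null_dir

print("GD matches min-norm solution:", np.allclose(w, w_min))
print("train residual of other solution:", np.linalg.norm(X @ w_other - y))
print("norms (min-norm, other):", np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both `w_min` and `w_other` fit the training data exactly. Gradient descent never leaves the row space of `X` when it starts at zero, which is why it picks the smaller one without being asked.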

6.3

Regularization: paying for smoothness

Choosing capacity by deleting parameters (degree 3, not 9) is a blunt dial. Regularization keeps the big model and instead charges it for complexity: add a penalty on the size of the weights to the training loss, and let a continuous knob $\lambda$ set the price. The two canonical currencies differ only in which norm they tax — and that one choice changes everything about the solution's character.

EQ M6.2 — RIDGE (L2) $$ \hat{w}_{\text{ridge}} \;=\; \arg\min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \lambda \lVert w \rVert_2^2 \;=\; \big(X^{\top}X + \lambda I\big)^{-1} X^{\top} y $$

Still closed-form — the penalty just fattens the diagonal of $X^\top X$, which is also why it cures the numerical singularity of high-degree fits. In the SVD picture, the component of the solution along a singular direction with singular value $\sigma_i$ gets multiplied by $\sigma_i^2 / (\sigma_i^2 + \lambda)$: strong, well-supported directions pass almost untouched while weak, noise-amplifying directions are crushed. Ridge shrinks every weight toward zero but never exactly to zero.
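
The SVD reading can be verified numerically. This is a minimal sketch on random data (the shapes, seed, and $\lambda = 2$ are arbitrary assumptions): it computes ridge once from the closed form of EQ M6.2 and once by taking the OLS solution in the singular basis and multiplying each component by $\sigma_i^2 / (\sigma_i^2 + \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

# Route 1: closed form from EQ M6.2
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Route 2: shrink the OLS solution direction by direction
U, s, Vt = np.linalg.svd(X, full_matrices=False)
ols_coords = (U.T @ y) / s            # OLS solution expressed in the singular basis
shrink = s**2 / (s**2 + lam)          # per-direction ridge filter factor
w_svd = Vt.T @ (shrink * ols_coords)

print("singular values :", np.round(s, 3))
print("shrink factors  :", np.round(shrink, 3))
print("routes agree    :", np.allclose(w_closed, w_svd))
```

Every factor lies strictly between 0 and 1, and the smallest singular value gets the smallest factor: the weakest direction is shrunk hardest.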

One-feature ridge regression with $X^\top X = 6$ and $X^\top y = 12$. At penalty $\lambda = 2$, the closed form is $\hat w = X^\top y / (X^\top X + \lambda)$. Compute $\hat w$.

Ridge fattens the denominator: $\hat w = 12 / (6 + 2) = 12/8 = $ 1.5. The unpenalized OLS weight would be $12/6 = 2.0$, so the penalty shrinks it by a factor $6/8 = 0.75$ — toward zero, but never reaching it.

EQ M6.3 — LASSO (L1) $$ \hat{w}_{\text{lasso}} \;=\; \arg\min_{w}\; \lVert y - Xw \rVert_2^2 \;+\; \lambda \lVert w \rVert_1 \qquad \lVert w \rVert_1 = \textstyle\sum_{j} \lvert w_j \rvert $$

No closed form — the kink of $\lvert \cdot \rvert$ at zero breaks the calculus, so lasso is solved by coordinate descent or proximal methods, whose core operation is the soft threshold $S(z, \lambda) = \mathrm{sign}(z)\max(\lvert z \rvert - \lambda,\, 0)$. That $\max$ is the point: weights whose evidence is weaker than $\lambda$ are set to exactly zero. Lasso doesn't just shrink — it selects features, and the surviving support is often the deliverable.

Apply the lasso soft threshold $S(z,\lambda) = \mathrm{sign}(z)\,\max(|z| - \lambda,\, 0)$ to the candidate weight $z = 0.7$ with penalty $\lambda = 0.3$. What is $S(z,\lambda)$?

The magnitude $|z| = 0.7$ exceeds the price $\lambda = 0.3$, so $\max(0.7 - 0.3,\, 0) = 0.4$; the sign is positive, giving $S = $ 0.4. The weight survives but is pulled 0.3 toward zero. Had $|z|$ been below 0.3, the result would have been exactly 0 — that is feature selection.
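
To see the soft threshold doing feature selection, here is a minimal coordinate-descent sketch for the objective in EQ M6.3. The synthetic data, the value `lam = 20`, and the function names are assumptions made for illustration. One detail follows from the algebra rather than from the text: because EQ M6.3 writes the squared loss without a factor of ½, the per-coordinate threshold works out to $\lambda/2$.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for ||y - Xw||_2^2 + lam * ||w||_1 (a sketch, not tuned)."""
    w = np.zeros(X.shape[1])
    col_sq = (X**2).sum(axis=0)                 # x_j^T x_j for every feature
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r_j = y - X @ w + X[:, j] * w[j]    # residual with feature j removed
            rho = X[:, j] @ r_j                 # evidence for feature j
            w[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])   # only two real features
y = X @ w_true + rng.normal(0, 0.1, 100)

w_hat = lasso_cd(X, y, lam=20.0)
print("worked example S(0.7, 0.3) =", soft_threshold(0.7, 0.3))
print("lasso weights:", np.round(w_hat, 3))
```

The irrelevant features should come back as exact zeros, not merely small numbers, while the two real ones survive with slightly shrunken magnitudes.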

FIG M6.A · WHY L1 ZEROS WEIGHTS AND L2 ONLY SHRINKS THEM

Penalized fitting ≡ minimizing loss subject to a weight-norm budget. Blue ellipses are loss contours around the unconstrained minimum; the fit lands where the smallest reachable contour first touches the constraint set. The L2 ball is round, so first contact is almost never on an axis; the L1 diamond has corners on the axes, and corners win — that geometry is the entire reason lasso produces sparse models.

Weight decay is L2 — with one large caveat. For plain SGD, adding $\tfrac{\lambda}{2} \lVert w \rVert_2^2$ to the loss (whose gradient is $\lambda w$) and multiplying weights by $(1 - \eta\lambda)$ each step are the same update. For adaptive optimizers they are not: Adam rescales the penalty's gradient per-coordinate along with everything else, quietly distorting the regularizer. AdamW fixes this by decoupling the decay from the adaptive machinery (Vol II · EQ 4.3) — which is why every modern LLM recipe says "AdamW, weight decay 0.1" rather than "L2 in the loss". Same idea, different plumbing, measurably different result.
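
A single optimizer step is enough to see both halves of that claim. The sketch below assumes the penalty is written as $\tfrac{\lambda}{2}\lVert w \rVert_2^2$, so its gradient is $\lambda w$, and uses a random vector `g` as a stand-in for the data-loss gradient. The helper `adam_first_step` is a hand-written first Adam step with bias correction, not a call into any library.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
g = rng.normal(size=5)        # stand-in for the gradient of the data loss at w
eta, lam = 0.1, 0.1

# Plain SGD: L2 in the loss vs. multiplicative weight decay
w_sgd_l2 = w - eta * (g + lam * w)
w_sgd_wd = (1 - eta * lam) * w - eta * g

def adam_first_step(w, grad, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step from zero moment estimates, with bias correction."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad**2) / (1 - beta2)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps)

# Adam + L2: the penalty gradient is rescaled along with the data gradient
w_adam_l2 = adam_first_step(w, g + lam * w, eta)
# AdamW-style: adaptive step on the data gradient only, decay applied separately
w_adamw = adam_first_step(w, g, eta) - eta * lam * w

print("SGD: L2 == weight decay  :", np.allclose(w_sgd_l2, w_sgd_wd))
print("Adam: L2 == decoupled    :", np.allclose(w_adam_l2, w_adamw))
```

In the Adam + L2 route the decay term is divided by the same running gradient scale as everything else, so a weight with large gradients is decayed less than one with small gradients. The decoupled route shrinks every weight by the same fraction.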

INSTRUMENT M6.2 — RIDGE PATH · DEGREE-9 FIT · λ FROM 1e-4 TO 1e2 · EQ M6.2 LIVE

[Interactive instrument. Control: PENALTY λ (log slider, shown at 1.0e-4). Chart: coefficient magnitudes |w₀| … |w₉| on a log bar scale, with the unpenalized intercept in grey. Readouts: TRAIN MSE, HELD-OUT MSE, ‖w‖₂ (excl. w₀).]

Same data-generating world as Instrument M6.1 (a different draw of 18 points), but the model keeps all ten degree-9 coefficients and pays λ for their size. Drag right from 1e-4: the wiggle flattens, the coefficient bars collapse, and held-out MSE traces a U — too little λ re-creates overfitting, too much re-creates underfitting (the fit sags toward a flat line). λ is a capacity dial with infinite resolution. The intercept is conventionally left unpenalized; shrinking it would just bias predictions away from the data's mean.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(1)

def f(x): return 1.5*x**3 - 0.9*x
x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_te = rng.uniform(-1, 1, 300); y_te = f(x_te) + rng.normal(0, 0.18, 300)

d = 9
Xtr, Xte = np.vander(x_tr, d + 1), np.vander(x_te, d + 1)
I = np.eye(d + 1); I[-1, -1] = 0.0     # vander puts the intercept last — leave it unpenalized

lams, mses = np.logspace(-6, 2, 41), []
for lam in lams:
    w = np.linalg.solve(Xtr.T @ Xtr + lam * I, Xtr.T @ y_tr)   # EQ M6.2
    mses.append(float(np.mean((Xte @ w - y_te)**2)))

b = int(np.argmin(mses))
print(f"near-zero lam = 1e-6   : test MSE = {mses[0]:.4f}")
print(f"best      lam = {lams[b]:<8.3g}: test MSE = {mses[b]:.4f}")
print(f"crushing  lam = 1e2    : test MSE = {mses[-1]:.4f}")
# plot_xy is assumed to be the in-browser runtime's plotting helper; it is not part of NumPy
plot_xy(np.log10(lams), np.array(mses))   # the regularization U-curve

x-axis is log10 λ — the U should bottom out mid-range

6.4

Validation discipline

Every dial in this chapter — degree, $\lambda$, stopping epoch — must be tuned against data the fit never saw, which forces the three-way split: train (fit parameters), validation (choose hyperparameters), test (touch once, report, stop). When data is scarce, k-fold cross-validation recycles it: split into $k$ folds (5 or 10 is standard), train $k$ times each holding out a different fold, and average the held-out scores. The average is a far lower-variance estimate of generalization than any single split — at $k\times$ the compute. Once hyperparameters are chosen, refit on everything. If you also want an honest estimate of the whole selection pipeline, nest a second CV loop around it; people skip this and quietly report optimistic numbers.
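
The k-fold recipe is short enough to write out. This is a minimal NumPy sketch, assuming the chapter's cubic world with 40 points and a degree-9 ridge fit; the helper names `ridge_fit` and `kfold_mse` are hypothetical, and for brevity the intercept is penalized along with the other weights, unlike the ridge example earlier in the chapter.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed form from EQ M6.2 (intercept penalized too, to keep the sketch short)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average held-out MSE over k folds for one value of lam."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)          # fit never sees fold i
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 1.5 * x**3 - 0.9 * x + rng.normal(0, 0.18, 40)
X = np.vander(x, 10)                                    # degree-9 features

lams = np.logspace(-6, 2, 17)
cv = [kfold_mse(X, y, lam) for lam in lams]             # same folds for every lam
best_lam = lams[int(np.argmin(cv))]
w_final = ridge_fit(X, y, best_lam)                     # refit on everything

print(f"best lam = {best_lam:.3g}, CV MSE = {min(cv):.4f}")
```

Reusing the same fold assignment for every candidate makes the comparison between λ values paired, which removes one source of noise from the selection. The CV score of the winning λ is itself optimistic as an estimate of final performance, which is the gap the nested loop closes.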

The dominant failure mode is not bad math — it is leakage: information from the evaluation side contaminating the training side. Leakage produces beautiful validation scores and production disasters, and it is almost always a pipeline bug, not a modeling bug:

Leak	Horror story	The fix
Preprocessing leak	Scaler / imputer / feature-selector fit on the full dataset before splitting — test-set statistics seep into training	fit every transform inside the training fold only
Duplicate leak	Near-identical rows land on both sides of the split; the model "generalizes" to data it memorized	dedup before splitting; fuzzy-match, not exact-match
Temporal leak	Random split of time-ordered data — the model trains on the future it will be asked to predict	split by time; validate strictly forward
Group leak	Same patient's scans in train and test; the model learns the patient, scores brilliantly, transfers to nobody	split by group id, never by row
Target leak	A feature is a downstream echo of the label ("account_closed_date" predicting churn)	audit features for post-outcome information
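
Two of those fixes, the group split and the train-only transform, fit in a few lines. The sketch below uses made-up data: `groups` plays the role of hypothetical patient ids, and `group_split` is an illustrative helper, not a library function.

```python
import numpy as np

def group_split(groups, test_frac=0.25, seed=0):
    """Split row indices so that no group id appears on both sides."""
    rng = np.random.default_rng(seed)
    uniq = rng.permutation(np.unique(groups))
    n_test = max(1, int(round(test_frac * len(uniq))))
    test_mask = np.isin(groups, uniq[:n_test])
    return np.where(~test_mask)[0], np.where(test_mask)[0]

rng = np.random.default_rng(1)
groups = rng.integers(0, 12, size=100)          # hypothetical patient ids
X = rng.normal(5.0, 2.0, size=(100, 3))         # hypothetical raw features

tr, te = group_split(groups)

# Fit the scaler on the training rows only, then apply it to both sides
mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
X_tr = (X[tr] - mu) / sd
X_te = (X[te] - mu) / sd                        # test rows never influence mu or sd

print("shared groups:", set(groups[tr]) & set(groups[te]))
print("train rows:", len(tr), " test rows:", len(te))
```

The test features will not come out with exactly zero mean and unit variance, and that is the point: they are scaled with statistics the model was allowed to know.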

HYGIENE

The test set is an instrument you can use once. Every decision influenced by test numbers — "try one more λ", "rerun with the other seed" — silently moves test data into the training loop; iterate enough and the test score becomes fiction. Kaggle's public-vs-private leaderboard shakeups are this effect measured at scale. The same failure operates on LLMs as eval contamination — benchmarks leaking into web-scale training corpora (Vol II · Ch 04 decontamination, and the fine-tuning pitfall list in Vol II · Ch 06). Different scale, identical sin: testing on something the model has, in any form, already seen.

6.5

Early stopping & dropout as regularizers

Two of the most-used regularizers never touch the loss function. Early stopping exploits the fact that training time is itself a capacity dial: gradient descent fits broad, smooth structure first and noise last, so the validation curve traces the familiar U over epochs. The recipe is mechanical — evaluate on validation each epoch, checkpoint the best, stop after $p$ epochs without improvement (patience), restore the best checkpoint. It is not merely a heuristic: for linear least squares, gradient descent stopped at step $t$ is approximately ridge regression with $\lambda \propto 1/(\eta t)$ — each direction of the solution gets pulled in at a rate set by its singular value, so stopping early leaves the weak, noise-dominated directions still near zero. Stopping early and penalizing weights are the same medicine through different needles.
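
The patience recipe, written out as a minimal sketch. It assumes the chapter's cubic world, a degree-9 polynomial trained by full-batch gradient descent, and a patience of 50 epochs; all of these are illustrative choices, and whether the loop stops early or runs to the epoch cap depends on the draw.

```python
import numpy as np

rng = np.random.default_rng(0)
def f(x): return 1.5 * x**3 - 0.9 * x

x_tr = rng.uniform(-1, 1, 18);  y_tr = f(x_tr) + rng.normal(0, 0.18, 18)
x_va = rng.uniform(-1, 1, 100); y_va = f(x_va) + rng.normal(0, 0.18, 100)
Xtr, Xva = np.vander(x_tr, 10), np.vander(x_va, 10)

eta = 1.0 / np.linalg.norm(Xtr, 2) ** 2     # safe step size for least squares
w = np.zeros(10)
patience, bad = 50, 0
best_val, best_w, best_epoch = np.inf, w.copy(), 0

for epoch in range(1, 20001):
    w -= eta * Xtr.T @ (Xtr @ w - y_tr)     # one gradient step on the train loss
    val = np.mean((Xva @ w - y_va) ** 2)    # evaluate on validation
    if val < best_val:
        best_val, best_w, best_epoch, bad = val, w.copy(), epoch, 0   # checkpoint
    else:
        bad += 1
        if bad >= patience:                 # no improvement for `patience` epochs
            break

w = best_w                                  # restore the best checkpoint
print(f"stopped at epoch {epoch}, best epoch {best_epoch}, best val MSE {best_val:.4f}")
```

Copying the weights at each improvement matters: without `w.copy()` the checkpoint would be a reference to the array the loop keeps mutating.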

Dropout attacks variance from a different angle: during training, zero each hidden unit independently with probability $p$ (and scale survivors by $1/(1-p)$ — "inverted dropout" — so activation magnitudes match at inference, when nothing is dropped). Two readings coexist. The ensemble view: each step trains a different random subnetwork, and inference approximates averaging exponentially many of them — and averaging is variance reduction by construction. The co-adaptation view: no unit can rely on a specific partner that might vanish, so features are forced to be individually useful. For linear models, dropout works out to an L2-like penalty scaled by each feature's second moment (Wager et al. 2013): once again, a familiar penalty in a different uniform.
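
Inverted dropout is a two-line operation. The sketch below is a minimal NumPy version, assuming activations arrive as a plain array; the function name and the `training` flag are illustrative, not any framework's API.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    if not training or p == 0.0:
        return h                              # inference: nothing is dropped
    keep = rng.random(h.shape) >= p           # True with probability 1 - p
    return h * keep / (1.0 - p)               # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((1000, 200))                      # stand-in hidden activations
out = dropout(h, 0.3, rng)

print("values present :", np.unique(out))
print("mean activation:", out.mean())         # stays close to the input mean of 1
print("fraction zeroed:", np.mean(out == 0))  # close to p
```

Because the rescaling happens at training time, the inference path is the identity, which is why deployed models carry no trace of the dropout rate.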

Honest modern footnote: dropout has largely vanished from LLM pre-training — one epoch over trillions of tokens means the binding constraint is underfitting, not overfitting — while weight decay and early stopping never left. But shrink the data and the classics return instantly: small-data fine-tunes ship with dropout on the adapters (the LoRA default of 0.05 in Vol II · Ch 06's recipe) and validation-based stopping. Regularization never became obsolete; it just follows the data-to-parameter ratio around.

The full toolbox, ordered by how often it is the right answer: more data (the only regularizer with no downside), weight decay / L2, early stopping, dropout, data augmentation, smaller model, L1 when you need the zeros. Everything after the first item buys the same thing, lower variance, and charges the same currency: a little added bias.

You now own the budget every model must balance. Chapter 07 builds the first machine with enough capacity to need all of it: the multi-layer perceptron — perceptrons, hidden layers, activation functions, and a tiny network you can train on XOR in the page while you watch the decision boundary bend.
