Why a single split lies
You want to know how a model will perform on data it has never seen. The honest target is generalization error — the expected loss over the whole data-generating distribution \(\mathcal{D}\), not over the rows you happen to hold. You cannot compute it; you can only estimate it from a finite sample. The cheapest estimator is the holdout: carve off a test set once, train on the rest, score once.
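A single holdout score wobbles from one random split to the next. Where does the wobble come from? A finite test set of size \(m\) estimates an accuracy \(p\) the way \(m\) flips of a coin with bias \(p\) estimate that bias. For a 0–1 loss the test-set accuracy \(\hat{p}\) has a binomial standard error:

\[ \mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{m}} \]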
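To put numbers on the binomial standard error \(\sqrt{p(1-p)/m}\), the sketch below evaluates it for a few test-set sizes. The accuracy of 0.80 and the sizes are illustrative assumptions, not values taken from the text, and the 95% band uses the usual normal approximation.

```python
import math

def binomial_se(p, m):
    """Standard error of an accuracy estimated from m independent 0-1 outcomes."""
    return math.sqrt(p * (1.0 - p) / m)

p = 0.80  # assumed true accuracy, for illustration only
for m in (80, 320, 1280):
    se = binomial_se(p, m)
    # A rough 95% band is about +/- 1.96 standard errors (normal approximation).
    print(f"m = {m:4d}: SE = {se:.4f}, ~95% band = +/- {1.96 * se:.3f}")
```

Quadrupling the test set only halves the standard error, which is why buying stability with a bigger holdout is so costly on small datasets.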
The holdout therefore forces a bad trade. A big test set gives a stable estimate of a worse model (you trained on fewer rows); a small test set gives a noisy estimate of a better model. On small and medium datasets there is no good place to stand. Cross-validation refuses the trade: it lets every row serve as test data exactly once and as training data the rest of the time, then averages the per-fold estimates to cut the variance.
"My model gets 91% — here is the number." A point estimate with no spread is not a result, it is an anecdote. The first question any reviewer should ask is "across how many splits, and with what standard deviation?" A reported metric without a band silently claims a precision the single split cannot deliver. The fix is the rest of this chapter.
# The single split is a coin flip: 200 random holdouts of the SAME data + model.
import numpy as np
import matplotlib.pyplot as plt  # assumed available; stands in for the undefined plot_xy helper

rng = np.random.default_rng(0)

# One fixed dataset: a 5-D Gaussian blob whose label is a noisy linear rule.
N, d = 400, 5
X = rng.normal(0, 1, (N, d))
w_true = rng.normal(0, 1, d)
y = ((X @ w_true + rng.normal(0, 1.2, N)) > 0).astype(int)

def fit_predict(Xtr, ytr, Xte):  # ridge-ish least-squares classifier
    w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1))
    return (Xte @ w > 0).astype(int)

accs = []
for _ in range(200):  # vary ONLY the split, nothing else
    perm = rng.permutation(N)
    te, tr = perm[:80], perm[80:]  # 80-row test set each time
    pred = fit_predict(X[tr], y[tr], X[te])
    accs.append((pred == y[te]).mean())
accs = np.array(accs)

print(f"holdout accuracy ranges over {accs.min():.3f} .. {accs.max():.3f}")
print(f"mean = {accs.mean():.3f} std across splits = {accs.std():.3f}")
print(f"so two single splits can disagree by ~{accs.max()-accs.min():.2f} on luck alone.")

# Sorted accuracies: the spread you'd never see from one split.
plt.plot(np.arange(len(accs)), np.sort(accs))
plt.xlabel("split (sorted by accuracy)")
plt.ylabel("holdout accuracy")
plt.show()
k-fold cross-validation
k-fold cross-validation partitions the data into \(k\) disjoint folds of (near-)equal size. It then runs \(k\) experiments: in round \(i\), fold \(i\) is the validation set and the other \(k-1\) folds are the training set. Every row is validated exactly once. The cross-validation estimate is the average of the \(k\) fold scores \(s_1, \dots, s_k\):

\[ \widehat{\mathrm{CV}} = \frac{1}{k} \sum_{i=1}^{k} s_i \qquad \text{(EQ V1.3)} \]
The choice of \(k\) is a bias–variance dial. Small \(k\) (e.g. 2) trains each model on much less data, so each fold model is weaker and \(\widehat{\mathrm{CV}}\) is pessimistically biased. Large \(k\) trains on almost all the data: at \(k = N\) you get leave-one-out CV (LOOCV), which is nearly unbiased but costs \(N\) model fits and yields \(N\) tightly correlated fold scores whose average can have high variance. The empirical sweet spot, supported by Kohavi's (1995) study (which recommended stratified ten-fold CV) and still the common default, is \(k = 5\) or \(k = 10\): low enough bias, manageable variance, affordable compute.
| k | Train size per fold | Bias of estimate | Variance / cost |
|---|---|---|---|
| 2 | N / 2 | high (pessimistic) | low cost, low variance |
| 5 | 0.8 N | small | the common default |
| 10 | 0.9 N | smaller | 2× the cost of k = 5 |
| N (LOOCV) | N − 1 | ~unbiased | N fits; high variance |
The total compute is exactly \(k\) model fits, each on a fraction \((k-1)/k\) of the data. That \(k\)-fold multiplier is the price of the error bars, and it is why §1.5's nested scheme — CV inside CV — is the expensive-but-honest end of the spectrum.
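The cost arithmetic above can be tabulated in a few lines. This is a minimal sketch: `kfold_cost` is a hypothetical helper, the dataset size of 1000 is made up, and "row-fits" (fits times training rows) is only a crude proxy for compute that assumes training time grows linearly with the number of rows.

```python
def kfold_cost(n_rows, k):
    """Return (number of fits, training rows per fit) for k-fold CV.

    Assumes near-equal folds; the training size is rounded down.
    """
    train_rows = n_rows * (k - 1) // k
    return k, train_rows

N = 1000  # hypothetical dataset size
for k in (2, 5, 10, N):
    fits, train_rows = kfold_cost(N, k)
    label = "LOOCV" if k == N else f"k={k}"
    print(f"{label:>6}: {fits:4d} fits, {train_rows} training rows each, "
          f"{fits * train_rows} row-fits in total")
```

Under that linear-cost assumption, moving from k = 5 to k = 10 slightly more than doubles the work, because each of the ten fits also sees more rows.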
# k-fold CV from scratch in numpy: report mean +/- std of the metric.
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 300, 6, 5
X = rng.normal(0, 1, (N, d))
w_true = rng.normal(0, 1, d)
y = ((X @ w_true + rng.normal(0, 1.0, N)) > 0).astype(int)

def fit_predict(Xtr, ytr, Xte):
    w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1))
    return (Xte @ w > 0).astype(int)

idx = rng.permutation(N)        # shuffle once, then cut into k folds
folds = np.array_split(idx, k)  # k disjoint, near-equal index blocks

scores = []
for i in range(k):
    val = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = fit_predict(X[tr], y[tr], X[val])
    acc = (pred == y[val]).mean()
    scores.append(acc)
    print(f"fold {i+1}: train {tr.size:3d} val {val.size:3d} acc {acc:.3f}")
scores = np.array(scores)

se = scores.std(ddof=1) / np.sqrt(k)  # EQ V1.3 (optimistic: folds correlate)
print(f"\nCV accuracy = {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} (std)")
print(f" = {scores.mean():.3f} +/- {se:.3f} (std error of the mean)")
print("One number with a band -- not a point estimate pretending to be the truth.")
Stratified & grouped k-fold
Plain k-fold shuffles rows and cuts blindly. That fails in two common situations, and both have a fix that costs nothing but a smarter partition.
Stratified k-fold: preserve the class balance
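On a 1% positive fraud dataset, a random fold can land with very few positives, or none at all when the dataset is small, which makes its score meaningless and inflates the variance across folds. Stratified k-fold partitions within each class so every fold mirrors the overall label distribution. For every class \(c\) and every fold \(F_i\):

\[ \frac{|\{ j \in F_i : y_j = c \}|}{|F_i|} \approx \frac{|\{ j : y_j = c \}|}{N} \]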
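How likely is a positive-free fold under blind splitting? A rough calculation, assuming a fold is a uniformly random subset of rows, treats it as a hypergeometric draw. The helper name `p_fold_without_positives` and the dataset sizes are illustrative assumptions.

```python
from math import comb

def p_fold_without_positives(n_rows, n_pos, fold_size):
    """P(a uniformly random fold of fold_size rows contains no positive row).

    Hypergeometric: the whole fold must be drawn from the n_rows - n_pos negatives.
    """
    return comb(n_rows - n_pos, fold_size) / comb(n_rows, fold_size)

k = 5
for n_rows in (200, 500, 1000):
    n_pos = n_rows // 100  # 1% positives
    p = p_fold_without_positives(n_rows, n_pos, n_rows // k)
    print(f"N = {n_rows:4d}, positives = {n_pos:2d}: "
          f"P(a given fold has none) = {p:.3f}")
```

The risk shrinks as the dataset grows, but at 1% prevalence it remains far from negligible even at a thousand rows, which is the case for stratifying by default.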
Grouped k-fold: respect dependence between rows
If multiple rows share a hidden identity — several visits from one patient, many frames of one video, repeated measurements of one sensor — then a row in training and its sibling in validation creates leakage: the model effectively sees the answer. Grouped k-fold keeps every group entirely on one side of each split, so no group straddles the train/validation boundary.
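A minimal numpy sketch of grouped folds, under stated assumptions: the `groups` array (one hypothetical patient id per row) is synthetic, and whole groups are simply shuffled and dealt round-robin, so fold sizes are only roughly equal when group sizes differ. It is one simple way to build such a partition, not the only one.

```python
import numpy as np

def grouped_folds(groups, k, rng):
    """Split row indices into k folds so each group lands in exactly one fold."""
    unique = rng.permutation(np.unique(groups))
    fold_of_group = {g: j % k for j, g in enumerate(unique)}  # deal groups round-robin
    fold_id = np.array([fold_of_group[g] for g in groups])    # fold of every row
    return [np.where(fold_id == j)[0] for j in range(k)]

rng = np.random.default_rng(0)
groups = rng.integers(0, 30, size=200)  # hypothetical: 200 visits from up to 30 patients
folds = grouped_folds(groups, k=5, rng=rng)

for i, val in enumerate(folds):
    tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
    shared = np.intersect1d(groups[tr], groups[val])  # groups on both sides of the split
    print(f"fold {i + 1}: val rows {val.size:3d}, groups shared with train: {shared.size}")
```

Every fold should report zero shared groups; a non-zero count would mean some group straddles the train/validation boundary.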
The most expensive bug in applied ML is a leak you cannot see. If patient #42 has rows in both the training fold and the validation fold, your reported accuracy measures memorization of patient #42, not generalization to new patients — and it will collapse in production. The same trap appears with near-duplicate images, augmented copies, and any preprocessing (scaling, imputation, target encoding) fit on the full dataset before splitting. Rule: every transform must be fit inside the training fold only, and grouped splits are mandatory whenever rows are not independent.
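The "fit inside the training fold only" rule can be sketched with a standardization step. The helpers `standardize_fit` and `standardize_apply` and the random feature matrix are hypothetical; the point is only that the mean and standard deviation are learned from training rows and then reused, unchanged, on the validation rows.

```python
import numpy as np

def standardize_fit(X_train):
    """Learn per-column mean and std from the training fold only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return mu, sigma

def standardize_apply(X, mu, sigma):
    """Apply statistics learned elsewhere; never re-estimates them."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(100, 4))  # hypothetical feature matrix
folds = np.array_split(rng.permutation(len(X)), 5)

for i, val in enumerate(folds):
    tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
    mu, sigma = standardize_fit(X[tr])            # fit on training rows only
    X_tr = standardize_apply(X[tr], mu, sigma)
    X_val = standardize_apply(X[val], mu, sigma)  # reuse the training statistics
    # A model would be fit on X_tr and scored on X_val here.
    print(f"fold {i + 1}: train mean {X_tr.mean():+.3f}, val mean {X_val.mean():+.3f}")
```

The validation mean is generally not exactly zero, and that is expected: the validation rows played no part in estimating the statistics.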
These choices compose: stratified group k-fold keeps groups intact and balances classes across folds, the standard recipe for imbalanced, clustered data. The honest caveat: when groups are few and uneven, perfect stratification and perfect grouping can conflict, and you accept an approximate balance.
# Stratified vs blind folds on a 5%-positive set: blind folds vary wildly.
import numpy as np

rng = np.random.default_rng(3)
N, k = 1000, 5
y = (rng.random(N) < 0.05).astype(int)  # ~5% positives -> ~50 of them
print(f"dataset positives: {y.sum()} / {N} ({100*y.mean():.1f}%)\n")

def blind_folds(idx):
    return np.array_split(rng.permutation(idx), k)

def stratified_folds(y):
    folds = [[] for _ in range(k)]
    for c in (0, 1):  # deal each class round-robin into folds
        members = rng.permutation(np.where(y == c)[0])
        for j, row in enumerate(members):
            folds[j % k].append(row)
    return [np.array(f) for f in folds]

print("blind fold positive-rates:", end=" ")
for f in blind_folds(np.arange(N)):
    print(f"{y[f].mean():.3f}", end=" ")
print("\nstrat. fold positive-rates:", end=" ")
for f in stratified_folds(y):
    print(f"{y[f].mean():.3f}", end=" ")
print("\n\nBlind folds scatter (one may even hit 0.00 -> a useless fold);")
print("stratified folds all sit near the 0.05 base rate, by construction.")
Time-series cross-validation
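Everything above assumes the rows are exchangeable, meaning that shuffling is harmless. For temporally ordered data it is not. Shuffling lets the model train on the future and validate on the past, which is impossible at deployment and produces gloriously optimistic, completely fake scores. The cardinal rule of temporal validation is brutal and simple: every training row must be strictly older than every validation row. For each split, with training timestamps \(t_{\text{train}}\) and validation timestamps \(t_{\text{val}}\):

\[ \max t_{\text{train}} < \min t_{\text{val}} \qquad \text{(EQ V1.5)} \]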
Two schemes both satisfy EQ V1.5; they differ in what they do with old data:
- Expanding window. The training set grows each fold — every split keeps all history up to the cut and validates on the next block. Uses all data; assumes the past stays relevant; training cost grows over folds.
- Rolling (sliding) window. The training set is a fixed-length window that slides forward, dropping the oldest data as it adds new. Constant training cost, and — more importantly — it adapts to non-stationarity and concept drift, where ancient history actively misleads.
Which to prefer is genuinely contested and data-dependent: expanding windows win when the process is stable and data is scarce; rolling windows win when the world is drifting. Either way you typically report the average score across the forward-chained folds, exactly as in EQ V1.3 — just with splits that never look ahead.
# Forward-chained splits: expanding vs rolling. Verify NO split looks ahead.
import numpy as np

N, k = 24, 5
order = np.arange(N)    # already time-ordered: 0 = oldest
fold = N // (k + 1)     # size of each validation block
roll_train = 2 * fold   # fixed window width for the rolling scheme

print("EXPANDING window (training set grows):")
ok = True
for i in range(1, k + 1):
    tr = order[: i * fold]
    va = order[i * fold : (i + 1) * fold]
    leak = tr.max() >= va.min()
    ok &= not leak
    print(f" fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d}) "
          f"val {va.min():2d}..{va.max():2d} leak? {leak}")

print("\nROLLING window (fixed width, slides forward):")
for i in range(1, k + 1):
    end = i * fold
    tr = order[max(0, end - roll_train): end]
    va = order[end: end + fold]
    if va.size == 0:
        break
    leak = tr.max() >= va.min()
    ok &= not leak
    print(f" fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d}) "
          f"val {va.min():2d}..{va.max():2d} leak? {leak}")

print(f"\nany split that trained on the future? {not ok} "
      "(EQ V1.5 holds <=> this is False)")
Nested CV for honest tuning
Here is the subtle, costly mistake that even careful practitioners make. You run k-fold CV, try a hundred hyperparameter settings, pick the one with the best CV score, and report that score as the model's performance. That number is biased upward, sometimes badly. You used the validation folds twice: once to tune and once to report. Selecting the maximum over many noisy estimates is selecting partly for noise, so the winner's CV score is an optimistic estimate of its true performance. This is the cross-validation cousin of the multiple-comparisons problem (STATS · §4.6).
Nested cross-validation separates selection from evaluation with two loops. The outer loop's folds are used only to estimate performance. Inside each outer training set, a separate inner CV loop performs the entire hyperparameter search and refits the chosen model. The outer fold, never seen by the inner search, then scores it. Because selection and evaluation use disjoint data, the outer score is an honest estimate of the whole pipeline, tuning included.
# Optimistic bias of tuning on the test fold vs nested CV (pure noise data).
import numpy as np

rng = np.random.default_rng(7)
N, k, G = 120, 5, 40       # G = number of hyperparameter settings tried
y = rng.integers(0, 2, N)  # labels are PURE NOISE: true acc = 0.50

# Each "config" is a random predictor independent of y -> all truly ~50% accurate.
def config_preds(seed, idx):         # deterministic per (config, row)
    r = np.random.default_rng(seed)
    return r.integers(0, 2, N)[idx]  # one fixed guess per row, then pick rows idx

def cv_acc(g, idx):                  # k-fold accuracy of config g on rows idx
    folds = np.array_split(rng.permutation(idx), k)
    accs = [(config_preds(g, f) == y[f]).mean() for f in folds]
    return np.mean(accs)

# WRONG: tune AND report on the same CV -> pick the max over G noisy 0.5s.
flat = [cv_acc(g, np.arange(N)) for g in range(G)]
naive = max(flat)

# NESTED: inner CV selects the best config; the held-out outer fold scores it.
outer = np.array_split(rng.permutation(np.arange(N)), k)
nested = []
for i in range(k):
    test = outer[i]
    train = np.concatenate([outer[j] for j in range(k) if j != i])
    best = max(range(G), key=lambda g: cv_acc(g, train))         # select on inner data
    nested.append((config_preds(best, test) == y[test]).mean())  # score on sealed fold

print("truth (labels are noise) : 0.500")
print(f"naive 'best CV' (tune==report) : {naive:.3f} <- optimistic, > 0.5 on noise")
print(f"nested CV outer mean : {np.mean(nested):.3f} <- honest, hugs 0.5")
print(f"selection bias removed : {naive - np.mean(nested):+.3f}")
When is the full nested machinery worth it? When you must report a trustworthy performance number after tuning — a benchmark, a paper, a go/no-go decision. For the cheaper everyday workflow, a fixed three-way split (train / validation / test) approximates one outer fold: tune on validation, report once on the untouched test set. Nested CV is simply that idea applied \(k\) times so the honest estimate itself gets error bars. The cost — outer × inner × grid model fits — is the reason it is reserved for when honesty is non-negotiable.
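The outer × inner × grid cost can be made concrete with a small fit counter. This is a sketch under assumptions: both function names are hypothetical, the budget of 5 outer folds, 5 inner folds and 40 configurations is made up, and the optional extra refit of the winning configuration on each outer training set is an assumption about the workflow.

```python
def nested_cv_fits(outer_k, inner_k, n_configs, refit=True):
    """Count model fits in nested CV.

    Each outer fold runs an inner_k-fold CV for every configuration; with
    refit=True the winner is refit once on the full outer training set.
    """
    per_outer = inner_k * n_configs + (1 if refit else 0)
    return outer_k * per_outer

def flat_cv_fits(k, n_configs):
    """Fits needed to tune with a single, non-nested k-fold CV."""
    return k * n_configs

# Hypothetical budget: 5 outer folds, 5 inner folds, 40 configurations.
print("flat CV  :", flat_cv_fits(5, 40), "fits")
print("nested CV:", nested_cv_fits(5, 5, 40), "fits")
```

With these numbers the nested scheme costs about five times the flat search, which is the price of getting error bars on an estimate that includes the tuning step.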
Cross-validation tells you how to score a configuration honestly; it does not tell you which configurations to try. The inner loop of nested CV was a hand-wave — "search the hyperparameters." Chapter 02 opens that loop: grid and random search, Bayesian optimization, Hyperband and successive halving, and the budget arithmetic that decides how many of those expensive inner fits you can actually afford.
References
- Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions.
- Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
- Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection.
- Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation.
- Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection.
- Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.