Resampling & Cross-Validation

1.1

Why a single split lies

You want to know how a model will perform on data it has never seen. The honest target is generalization error — the expected loss over the whole data-generating distribution $\mathcal{D}$, not over the rows you happen to hold. You cannot compute it; you can only estimate it from a finite sample. The cheapest estimator is the holdout: carve off a test set once, train on the rest, score once.

EQ V1.1 — GENERALIZATION ERROR vs ITS HOLDOUT ESTIMATE $$ \mathrm{Err} = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\,L(y,\, \hat{f}(x))\,\big], \qquad \widehat{\mathrm{Err}}_{\text{holdout}} = \frac{1}{|\mathcal{T}|}\sum_{(x,y)\in\mathcal{T}} L\big(y,\, \hat{f}(x)\big) $$

$L$ is the loss (0–1 error, squared error, log loss…), $\hat{f}$ the model trained on the non-test rows, and $\mathcal{T}$ the held-out test set. The holdout estimate is unbiased for the error of the model trained on that particular training set — but it is a single random draw, and its variance is large precisely when the test set is small. Two unlucky splits of the same data can disagree by several points of accuracy on nothing but the luck of the draw.

Where does the wobble come from? A finite test set of size $m$ estimates an accuracy $p$ the way a coin of bias $p$ estimates its bias from $m$ flips. For a 0–1 loss the test-set accuracy has a binomial standard error:

EQ V1.2 — STANDARD ERROR OF A HOLDOUT ACCURACY $$ \mathrm{SE}(\hat{p}) = \sqrt{\frac{p\,(1-p)}{m}} \qquad\Longrightarrow\qquad \hat{p} \pm 1.96\,\mathrm{SE}(\hat{p}) \;\;(\text{95\% interval}) $$

At $p = 0.85$ on a test set of $m = 200$ rows, $\mathrm{SE} = \sqrt{0.85\cdot0.15/200} \approx 0.025$ — a 95% interval of roughly $\pm 5$ points. Two models that differ by three points of test accuracy may be statistically indistinguishable. Shrinking that interval means a larger test set — which steals rows from training — or reusing every row for both roles, which is the whole idea of cross-validation. This is the $1/\sqrt{m}$ law of STATS · EQ S4.4 wearing a model-evaluation hat.
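EQ V1.2 is easy to check numerically. The sketch below is a minimal illustration, assuming a hypothetical helper named `holdout_interval` (not part of the original text) that returns the binomial standard error and the normal-approximation 95% half-width. The test-set sizes in the loop are illustrative.

```python
import numpy as np

def holdout_interval(p, m, z=1.96):
    """Binomial standard error and normal-approximation half-width (EQ V1.2).

    `holdout_interval` is a hypothetical helper name used only for this sketch.
    """
    se = np.sqrt(p * (1.0 - p) / m)
    return se, z * se

se, half = holdout_interval(0.85, 200)
print(f"m = 200: SE = {se:.4f}, 95% interval = 0.85 +/- {half:.3f}")

# Quadrupling the test set only halves the interval: the 1/sqrt(m) law.
for m in (200, 800, 3200):
    se_m, half_m = holdout_interval(0.85, m)
    print(f"m = {m:4d}: SE = {se_m:.4f}, half-width = {half_m:.3f}")
```

Halving the interval costs four times as many test rows, which is why a larger holdout is rarely an affordable fix on small datasets.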

The holdout therefore forces a bad trade. A big test set gives a stable estimate of a worse model (you trained on fewer rows); a small test set gives a noisy estimate of a better model. On small and medium datasets there is no good place to stand. Cross-validation refuses the trade: it lets every row serve as test data exactly once and as training data the rest of the time, then averages the per-fold estimates to cut the variance.

A COMMON ERROR

"My model gets 91% — here is the number." A point estimate with no spread is not a result, it is an anecdote. The first question any reviewer should ask is "across how many splits, and with what standard deviation?" A reported metric without a band silently claims a precision the single split cannot deliver. The fix is the rest of this chapter.

A model scores accuracy $ p = 0.85 $ on a holdout test set of $ m = 200 $ rows. Using EQ V1.2, what is the standard error of that accuracy, $ \sqrt{p(1-p)/m} $?

$ p(1-p) = 0.85 \times 0.15 = 0.1275 $; divide by $ m = 200 $ to get $ 0.0006375 $; the square root is $ \sqrt{0.0006375} = $ 0.025. A 95% interval is therefore about $ \pm 0.049 $ — nearly five points of accuracy of pure sampling noise around a single number.

PYTHON · RUNNABLE IN-BROWSER

# The single split is a coin flip: 200 random holdouts of the SAME data + model.
import numpy as np
rng = np.random.default_rng(0)

# One fixed dataset: a 5-D Gaussian blob (d = 5) whose label is a noisy linear rule.
N, d = 400, 5
X = rng.normal(0, 1, (N, d))
w_true = rng.normal(0, 1, d)
y = ((X @ w_true + rng.normal(0, 1.2, N)) > 0).astype(int)

def fit_predict(Xtr, ytr, Xte):            # ridge-ish least-squares classifier
    w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1))
    return (Xte @ w > 0).astype(int)

accs = []
for _ in range(200):                        # vary ONLY the split, nothing else
    perm = rng.permutation(N)
    te, tr = perm[:80], perm[80:]           # 80-row test set each time
    pred = fit_predict(X[tr], y[tr], X[te])
    accs.append((pred == y[te]).mean())
accs = np.array(accs)

print(f"holdout accuracy ranges over {accs.min():.3f} .. {accs.max():.3f}")
print(f"mean = {accs.mean():.3f}   std across splits = {accs.std():.3f}")
print(f"so two single splits can disagree by ~{accs.max()-accs.min():.2f} on luck alone.")
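# plot_xy is assumed to be a plotting helper supplied by the in-browser runner; drop this line elsewhere.
plot_xy(np.arange(len(accs)), np.sort(accs))   # sorted: the spread you'd never see once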


1.2

k-fold cross-validation

k-fold cross-validation partitions the data into $k$ equal, disjoint folds. It then runs $k$ experiments: in round $i$, fold $i$ is the validation set and the other $k-1$ folds are the training set. Every row is validated exactly once. The cross-validation estimate is the average of the $k$ fold scores:

EQ V1.3 — THE k-FOLD CV ESTIMATE $$ \widehat{\mathrm{CV}} = \frac{1}{k}\sum_{i=1}^{k} \frac{1}{|F_i|}\sum_{(x,y)\in F_i} L\big(y,\, \hat{f}^{\,(-i)}(x)\big), \qquad \widehat{\mathrm{SE}} = \frac{s}{\sqrt{k}} $$

$F_i$ is the $i$-th fold and $\hat{f}^{\,(-i)}$ is the model trained on everything except $F_i$. The estimate $\widehat{\mathrm{CV}}$ is the mean of the $k$ fold scores; $s$ is their sample standard deviation, and $s/\sqrt{k}$ is the usual standard error of that mean. Averaging $k$ estimates is what buys the error bars the single split could not give you. A caveat experts insist on: the $k$ fold scores are not independent (their training sets overlap heavily), so $s/\sqrt{k}$ understates the true uncertainty — treat it as a useful indicator, not a calibrated interval.

The choice of $k$ is a bias–variance dial. Small $k$ (e.g. 2) trains each model on much less data, so each fold model is weaker and $\widehat{\mathrm{CV}}$ is pessimistically biased. Large $k$ trains on almost all the data — at $k = N$ you get leave-one-out CV (LOOCV), nearly unbiased but with $N$ tightly correlated, high-variance fold scores and $N$ model fits. The empirical sweet spot, established by Kohavi's classic study and unchanged in 2026, is $k = 5$ or $k = 10$: low enough bias, manageable variance, affordable compute.

k	Train size per fold	Bias of estimate	Variance / cost
2	N / 2	high (pessimistic)	low cost, low variance
5	0.8 N	small	the common default
10	0.9 N	smaller	2× the cost of k = 5
N (LOOCV)	N − 1	~unbiased	N fits; high variance

The total compute is exactly $k$ model fits, each on a fraction $(k-1)/k$ of the data. That $k$-fold multiplier is the price of the error bars, and it is why §1.5's nested scheme — CV inside CV — is the expensive-but-honest end of the spectrum.
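The arithmetic behind the table can be tabulated directly. This is a minimal sketch: `kfold_cost` is a hypothetical helper, the dataset size of 1000 rows is illustrative, and the function assumes the row count divides evenly by $k$.

```python
def kfold_cost(n_rows, k):
    """Rows per training set, rows per validation fold, and model fits for k-fold CV.

    Assumes n_rows is divisible by k, so every fold has the same size.
    """
    val_rows = n_rows // k
    train_rows = n_rows - val_rows
    return train_rows, val_rows, k

N = 1000  # illustrative dataset size, not from the text
for k in (2, 5, 10, N):
    train_rows, val_rows, fits = kfold_cost(N, k)
    print(f"k = {k:4d}: train {train_rows:4d} rows ({train_rows / N:.1%}), "
          f"val {val_rows:3d} rows, {fits} fits")
```

The last row is LOOCV: the training set is almost the full dataset, but the number of fits equals the number of rows.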

You run 5-fold cross-validation on a dataset of $ N = 100 $ rows. With equal folds, how many rows are in the validation set of each fold ($ N/k $)?

k-fold splits the data into $k$ equal disjoint folds, and each fold is the validation set exactly once. So each fold holds $ N/k = 100/5 = $ 20 rows for validation, leaving the other 80 for training that round.

PYTHON · RUNNABLE IN-BROWSER

# k-fold CV from scratch in numpy: report mean +/- std of the metric.
import numpy as np
rng = np.random.default_rng(1)

N, d, k = 300, 6, 5
X = rng.normal(0, 1, (N, d))
w_true = rng.normal(0, 1, d)
y = ((X @ w_true + rng.normal(0, 1.0, N)) > 0).astype(int)

def fit_predict(Xtr, ytr, Xte):
    w = np.linalg.solve(Xtr.T @ Xtr + 1e-2*np.eye(d), Xtr.T @ (2*ytr - 1))
    return (Xte @ w > 0).astype(int)

idx = rng.permutation(N)                      # shuffle once, then cut into k folds
folds = np.array_split(idx, k)               # k disjoint, near-equal index blocks

scores = []
for i in range(k):
    val = folds[i]
    tr  = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = fit_predict(X[tr], y[tr], X[val])
    acc  = (pred == y[val]).mean()
    scores.append(acc)
    print(f"fold {i+1}: train {tr.size:3d}  val {val.size:3d}  acc {acc:.3f}")

scores = np.array(scores)
se = scores.std(ddof=1) / np.sqrt(k)         # EQ V1.3 (optimistic: folds correlate)
print(f"\nCV accuracy = {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} (std)")
print(f"            = {scores.mean():.3f} +/- {se:.3f} (std error of the mean)")
print("One number with a band -- not a point estimate pretending to be the truth.")


INSTRUMENT V1.1 — FOLD VISUALIZER & VARIANCE · SINGLE SPLIT vs k-FOLD · EQ V1.3

[Interactive instrument. Controls: number of folds k (default 5), estimator (single split or k-fold). Readouts: CV / holdout estimate, spread across reshuffles (std), models fit.]

The bar of 30 cells is your dataset; mint cells are validation, grey are training, one row per fold. Press RESHUFFLE a dozen times and watch the right-hand readout. In SINGLE SPLIT the estimate jumps around wildly between reshuffles — the coin flip of §1.1. Switch to k-FOLD: the same reshuffles now barely move the averaged estimate, because the $k$ folds cancel each other's luck. Raise $k$ to shrink the spread further, at the cost of more model fits.

1.3

Stratified & grouped k-fold

Plain k-fold shuffles rows and cuts blindly. That fails in two common situations, and both have a fix that costs nothing but a smarter partition.

Stratified k-fold: preserve the class balance

On a fraud dataset with 1% positives, a random fold of a couple of hundred rows can land with zero positives, which makes its score meaningless and inflates the variance across folds. Stratified k-fold partitions within each class so every fold mirrors the overall label distribution:

EQ V1.4 — STRATIFICATION CONSTRAINT $$ \frac{|\{(x,y)\in F_i : y = c\}|}{|F_i|} \;\approx\; \frac{|\{(x,y)\in \mathcal{D} : y = c\}|}{N} \quad\text{for every fold } F_i \text{ and class } c $$

Each fold's class proportions match the dataset's, up to rounding. For classification, stratification is the default, not an option — it removes a needless source of fold-to-fold variance and is essential under class imbalance, where a non-stratified fold may contain no minority examples at all. The same idea extends to regression by stratifying on binned targets.
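Stratifying a regression target on bins can be sketched as follows. This is a minimal illustration under stated assumptions: the lognormal target, the dataset size, and the choice of ten quantile bins are all illustrative and not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, n_bins = 500, 5, 10                 # illustrative sizes, not from the text
y = rng.lognormal(0.0, 1.0, N)            # a skewed continuous target

# Bin the target at its deciles so each bin holds roughly N / n_bins rows.
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
bins = np.digitize(y, edges)              # integer bin label in 0 .. n_bins - 1

# Deal each bin's rows round-robin into the k folds, as in class stratification.
folds = [[] for _ in range(k)]
for b in range(n_bins):
    members = rng.permutation(np.where(bins == b)[0])
    for j, row in enumerate(members):
        folds[j % k].append(row)
folds = [np.array(f) for f in folds]

print(f"overall target mean: {y.mean():.3f}")
for i, f in enumerate(folds):
    print(f"fold {i + 1}: {f.size} rows, target mean {y[f].mean():.3f}")
```

Each fold receives an equal share of every target decile, so no fold is dominated by the long right tail.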

Grouped k-fold: respect dependence between rows

If multiple rows share a hidden identity — several visits from one patient, many frames of one video, repeated measurements of one sensor — then a row in training and its sibling in validation creates leakage: the model effectively sees the answer. Grouped k-fold keeps every group entirely on one side of each split, so no group straddles the train/validation boundary.
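A grouped split partitions the group identifiers rather than the rows. The sketch below is a minimal illustration on synthetic data: the 30 hypothetical patients and their row counts are assumptions for the example, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, k = 30, 5                            # illustrative sizes, not from the text
# Each hypothetical patient contributes between 2 and 8 rows.
groups = np.repeat(np.arange(n_groups), rng.integers(2, 9, n_groups))

# Shuffle the group ids, then cut the ids (not the rows) into k blocks.
group_blocks = np.array_split(rng.permutation(n_groups), k)

for i, block in enumerate(group_blocks):
    val_mask = np.isin(groups, block)          # rows whose group is held out
    val, tr = np.where(val_mask)[0], np.where(~val_mask)[0]
    shared = np.intersect1d(groups[tr], groups[val])
    print(f"fold {i + 1}: train {tr.size:3d} rows, val {val.size:3d} rows, "
          f"groups on both sides: {shared.size}")
```

Because groups differ in size, the folds hold equal numbers of groups but unequal numbers of rows; that imbalance is the price of keeping every group intact.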

LEAKAGE

The most expensive bug in applied ML is a leak you cannot see. If patient #42 has rows in both the training fold and the validation fold, your reported accuracy measures memorization of patient #42, not generalization to new patients — and it will collapse in production. The same trap appears with near-duplicate images, augmented copies, and any preprocessing (scaling, imputation, target encoding) fit on the full dataset before splitting. Rule: every transform must be fit inside the training fold only, and grouped splits are mandatory whenever rows are not independent.
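The rule that every transform is fit inside the training fold can be sketched with standardization. This is a minimal illustration on synthetic data (the sizes and the distribution are assumptions, not from the text): the mean and standard deviation come from the training rows only and are then reused, unchanged, on the validation rows.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 3, 5                        # illustrative sizes, not from the text
X = rng.normal(5.0, 2.0, (N, d))

folds = np.array_split(rng.permutation(N), k)
for i in range(k):
    val = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit the scaler on the training rows ONLY ...
    mu, sigma = X[tr].mean(axis=0), X[tr].std(axis=0)
    # ... then apply those same statistics to both sides of the split.
    X_tr = (X[tr] - mu) / sigma
    X_val = (X[val] - mu) / sigma

    print(f"fold {i + 1}: scaled train mean {X_tr.mean():+.3f}, "
          f"scaled val mean {X_val.mean():+.3f}")
```

The scaled validation mean is close to zero but not exactly zero, which is the expected signature of a transform that never saw the validation rows.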

These choices compose: stratified group k-fold keeps groups intact and balances classes across folds, the standard recipe for imbalanced, clustered data. The honest caveat: when groups are few and uneven, perfect stratification and perfect grouping can conflict, and you accept an approximate balance.

A dataset has $ N = 20{,}000 $ rows with a $ 1\% $ positive rate. How many positive rows are there in total ($ 0.01 \times N $)?

$ 0.01 \times 20{,}000 = $ 200 positives. With only 200 positives spread across folds, a blind random split leaves each fold's count to chance: a fold can land noticeably below its share, and on a smaller dataset it can end up with no positives at all. That is exactly the failure mode stratification (EQ V1.4) is built to prevent.

For that same dataset (200 positives total), under 5-fold stratified CV, how many positives sit in each validation fold ($ 200/5 $)?

Stratification forces each fold to carry the dataset's class proportions, so the positives are divided evenly: $ 200 / 5 = $ 40 per fold. Every fold therefore has enough minority examples to produce a meaningful score — the whole point of EQ V1.4.

PYTHON · RUNNABLE IN-BROWSER

# Stratified vs blind folds on a 5%-positive set: blind folds vary wildly.
import numpy as np
rng = np.random.default_rng(3)

N, k = 1000, 5
y = (rng.random(N) < 0.05).astype(int)       # ~5% positives -> ~50 of them
print(f"dataset positives: {y.sum()} / {N}  ({100*y.mean():.1f}%)\n")

def blind_folds(idx):
    return np.array_split(rng.permutation(idx), k)

def stratified_folds(y):
    folds = [[] for _ in range(k)]
    for c in (0, 1):                          # deal each class round-robin into folds
        members = rng.permutation(np.where(y == c)[0])
        for j, row in enumerate(members):
            folds[j % k].append(row)
    return [np.array(f) for f in folds]

print("blind   fold positive-rates:", end=" ")
for f in blind_folds(np.arange(N)):
    print(f"{y[f].mean():.3f}", end=" ")
print("\nstrat.  fold positive-rates:", end=" ")
for f in stratified_folds(y):
    print(f"{y[f].mean():.3f}", end=" ")
print("\n\nBlind folds scatter (one may even hit 0.00 -> a useless fold);")
print("stratified folds all sit near the 0.05 base rate, by construction.")


1.4

Time-series cross-validation

Everything above assumes the rows are exchangeable — that shuffling is harmless. For temporally ordered data it is not. Shuffling lets the model train on the future and validate on the past, which is impossible at deployment and produces gloriously optimistic, completely fake scores. The cardinal rule of temporal validation is brutal and simple:

EQ V1.5 — THE FORWARD-CHAINING CONSTRAINT $$ \max_{t \in \text{train}_i} t \;<\; \min_{t \in \text{val}_i} t \qquad \text{for every fold } i $$

Every timestamp used for training must precede every timestamp used for validation, in every fold. This is forward chaining (also "walk-forward" or "rolling-origin" validation): the validation window always lives strictly in the future relative to its training window. Shuffled k-fold breaks the constraint in essentially every fold: about half of all (training row, validation row) pairs put the training row later in time than the validation row. It is therefore invalid for any series with temporal structure. A further refinement inserts an embargo / purge gap between train and validation to kill leakage from overlapping feature windows or label horizons (the López de Prado correction for financial data).

Two schemes both satisfy EQ V1.5; they differ in what they do with old data:

Expanding window. The training set grows each fold — every split keeps all history up to the cut and validates on the next block. Uses all data; assumes the past stays relevant; training cost grows over folds.
Rolling (sliding) window. The training set is a fixed-length window that slides forward, dropping the oldest data as it adds new. Constant training cost, and — more importantly — it adapts to non-stationarity and concept drift, where ancient history actively misleads.

Which to prefer is genuinely contested and data-dependent: expanding windows win when the process is stable and data is scarce; rolling windows win when the world is drifting. Either way you typically report the average score across the forward-chained folds, exactly as in EQ V1.3 — just with splits that never look ahead.
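The embargo gap mentioned above can be added to an expanding-window split in a few lines. This is a minimal sketch: `forward_splits` is a hypothetical helper, the 24 periods and 5 splits are illustrative, and the gap is carved out of the end of the training window so the validation blocks stay where they are.

```python
import numpy as np

def forward_splits(n, k, gap=0):
    """Expanding-window splits with `gap` periods dropped before each validation block.

    `forward_splits` is a hypothetical helper for this sketch. The embargo is taken
    from the end of the training window, so validation blocks stay the same.
    """
    fold = n // (k + 1)
    order = np.arange(n)                      # already time-ordered: 0 = oldest
    for i in range(1, k + 1):
        val = order[i * fold:(i + 1) * fold]
        train = order[:max(0, i * fold - gap)]
        if train.size:                        # skip folds the gap leaves with no history
            yield train, val

for train, val in forward_splits(24, 5, gap=2):
    print(f"train {train.min():2d}..{train.max():2d}  "
          f"embargo {train.max() + 1:2d}..{val.min() - 1:2d}  "
          f"val {val.min():2d}..{val.max():2d}")
```

The discarded periods are the cost of the embargo: every fold trains on `gap` fewer rows than it otherwise could.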

INSTRUMENT V1.2 — TIME-SERIES SPLIT VISUALIZER · EXPANDING vs ROLLING · FORWARD-CHAINED · EQ V1.5

[Interactive instrument. Controls: splits (default 5), embargo gap (default 0), window (expanding or rolling). Readouts: scheme, forward-chained?, train blocks from fold 1 to last.]

Time runs left → right across 24 ordered periods, one row per fold. Grey is training, mint is validation, and any blue cell is the embargo gap that is thrown away to prevent leakage. Notice that validation is always to the right of training — the future is never used to predict the past. Switch to ROLLING and the grey training block becomes a fixed-width window that slides forward, forgetting the oldest data; EXPANDING keeps accumulating it. Raise the embargo to punch a blue moat between the two.

In time-series cross-validation, the training data must always come before the validation data in time (no future rows in training). True or false?

This is the forward-chaining constraint of EQ V1.5: $\max_{t\in\text{train}} t < \min_{t\in\text{val}} t$ in every fold. Training on the future to predict the past is impossible at deployment and produces fake optimism, so the statement is true.

PYTHON · RUNNABLE IN-BROWSER

# Forward-chained splits: expanding vs rolling. Verify NO split looks ahead.
import numpy as np

N, k = 24, 5
order = np.arange(N)                          # already time-ordered: 0 = oldest
fold = N // (k + 1)                           # size of each validation block
roll_train = 2 * fold                         # fixed window width for the rolling scheme

print("EXPANDING window (training set grows):")
ok = True
for i in range(1, k + 1):
    tr = order[: i * fold]
    va = order[i * fold : (i + 1) * fold]
    leak = tr.max() >= va.min()
    ok &= not leak
    print(f"  fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d})  "
          f"val {va.min():2d}..{va.max():2d}   leak? {leak}")

print("\nROLLING window (fixed width, slides forward):")
for i in range(1, k + 1):
    end = i * fold
    tr = order[max(0, end - roll_train): end]
    va = order[end: end + fold]
    if va.size == 0: break
    leak = tr.max() >= va.min()
    ok &= not leak
    print(f"  fold {i}: train {tr.min():2d}..{tr.max():2d} ({tr.size:2d})  "
          f"val {va.min():2d}..{va.max():2d}   leak? {leak}")

print(f"\nany split that trained on the future? {not ok}  "
      "(EQ V1.5 holds <=> this is False)")


1.5

Nested CV for honest tuning

Here is the subtle, costly mistake that even careful practitioners make. You run k-fold CV, try a hundred hyperparameter settings, pick the one with the best CV score, and report that score as the model's performance. That number is biased upward — sometimes badly. You used the validation folds twice: once to tune and once to report. Selecting the maximum over many noisy estimates is selecting partly for noise, so the winner's CV score is an optimistic estimate of its true error. This is the cross-validation cousin of the multiple-comparisons problem (STATS · §4.6).

EQ V1.6 — THE OPTIMISM OF SELECTION $$ \mathbb{E}\Big[\min_{\theta\in\Theta}\widehat{\mathrm{CV}}(\theta)\Big] \;\le\; \min_{\theta\in\Theta}\,\mathbb{E}\big[\widehat{\mathrm{CV}}(\theta)\big] \qquad\text{(Jensen / max-of-noisy-estimates)} $$

The expected score of the selected configuration is better (lower error) than the true error of the best configuration — the gap is pure selection bias, and it grows with the number of configurations tried $|\Theta|$ and with the noise in each estimate. The fold you select on can no longer give an unbiased estimate of performance. The fix is to wall off an estimation set the selection never touches.
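The growth of the gap with $|\Theta|$ can be seen in a small simulation. This is a minimal sketch under stated assumptions: every configuration has the same true error, each CV estimate is that error plus independent Gaussian noise, and the numbers (0.30, 0.03, 2000 trials) are illustrative rather than taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
true_err, noise_sd, trials = 0.30, 0.03, 2000   # illustrative values, not from the text

# Every configuration has the SAME true error; only its CV estimate is noisy.
gaps = {}
for n_configs in (1, 10, 100):
    est = true_err + rng.normal(0.0, noise_sd, (trials, n_configs))
    selected = est.min(axis=1)                   # the "winner" reported in each trial
    gaps[n_configs] = true_err - selected.mean() # optimism: how far below the truth
    print(f"|Theta| = {n_configs:3d}: mean selected error {selected.mean():.3f}  "
          f"(optimism {gaps[n_configs]:+.3f})")
```

With a single configuration there is nothing to select and the estimate is unbiased; the optimism appears only once a minimum is taken, and it widens as more configurations compete.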

Nested cross-validation does exactly that with two loops. The outer loop's folds are used only to estimate performance. Inside each outer training set, a separate inner CV loop performs the entire hyperparameter search and refits the chosen model. The outer fold — never seen by the inner search — then scores it. Because selection and evaluation use disjoint data, the outer score is an honest estimate of the whole pipeline, tuning included.

INSTRUMENT V1.3 — NESTED CV STRUCTURE · OUTER = SCORE · INNER = SELECT · EQ V1.6

[Interactive instrument. Controls: outer folds (default 3), inner folds (default 3), highlighted outer fold (default 1). Readouts: total model fits (outer × inner × grid), outer score is unbiased.]

The top band is one highlighted outer split: grey = outer-train, mint = the outer-test fold that is sealed away. The lower bands show the inner CV that runs inside the outer-train portion to pick hyperparameters — and never touches the mint band. Drag HIGHLIGHT OUTER FOLD to step through outer rounds. The fit-count readout makes the cost concrete: nested CV runs (outer × inner × grid-size) fits, which is why people reach for it only when an honest number actually matters.

PYTHON · RUNNABLE IN-BROWSER

# Optimistic bias of tuning on the test fold vs nested CV (pure noise data).
import numpy as np
rng = np.random.default_rng(7)

N, k, G = 120, 5, 40                          # G = number of hyperparameter settings tried
y = rng.integers(0, 2, N)                     # labels are PURE NOISE: true acc = 0.50

# Each "config" is a random predictor independent of y -> all truly ~50% accurate.
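def config_preds(seed, idx):                  # fixed per-row predictions for config `seed`
    r = np.random.default_rng(1000 + seed)    # offset keeps config streams distinct from the label rng (seed 7)
    return r.integers(0, 2, N)[idx]           # one prediction per dataset row, then select rows idx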

def cv_acc(g, idx):                           # k-fold accuracy of config g on rows idx
    folds = np.array_split(rng.permutation(idx), k)
    accs = [(config_preds(g, f) == y[f]).mean() for f in folds]
    return np.mean(accs)

# WRONG: tune AND report on the same CV -> pick the max over G noisy 0.5s.
flat = [cv_acc(g, np.arange(N)) for g in range(G)]
naive = max(flat)

# NESTED: inner CV selects the best config; the held-out outer fold scores it.
outer = np.array_split(rng.permutation(np.arange(N)), k)
nested = []
for i in range(k):
    test = outer[i]
    train = np.concatenate([outer[j] for j in range(k) if j != i])
    best = max(range(G), key=lambda g: cv_acc(g, train))      # select on inner data
    nested.append((config_preds(best, test) == y[test]).mean())  # score on sealed fold

print(f"truth (labels are noise)        : 0.500")
print(f"naive 'best CV' (tune==report)  : {naive:.3f}  <- optimistic, > 0.5 on noise")
print(f"nested CV outer mean            : {np.mean(nested):.3f}  <- honest, hugs 0.5")
print(f"selection bias removed          : {naive - np.mean(nested):+.3f}")


When is the full nested machinery worth it? When you must report a trustworthy performance number after tuning: a benchmark, a paper, a go/no-go decision. For the cheaper everyday workflow, a fixed three-way split (train / validation / test) approximates one outer fold: tune on validation, report once on the untouched test set. Nested CV is simply that idea applied $k$ times so the honest estimate itself gets error bars. The cost, outer × inner × grid model fits for the search plus one refit of the winner per outer fold, is the reason it is reserved for when honesty is non-negotiable.
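The budget arithmetic can be made concrete with a small helper. This is a minimal sketch: `nested_cv_fits` and `flat_cv_fits` are hypothetical names, the 5 × 5 × 40 budget is illustrative, and the count assumes the chosen configuration is refit once per outer fold, as in the nested procedure described in this section.

```python
def nested_cv_fits(outer, inner, grid, refit=True):
    """Model fits for nested CV: the inner search, plus one optional refit per outer fold.

    `nested_cv_fits` is a hypothetical helper used only for this sketch.
    """
    search_fits = outer * inner * grid
    return search_fits + (outer if refit else 0)

def flat_cv_fits(k, grid):
    """Model fits for a single (non-nested) k-fold search over the same grid."""
    return k * grid

outer, inner, grid = 5, 5, 40                  # illustrative budget, not from the text
print(f"flat 5-fold search : {flat_cv_fits(inner, grid)} fits")
print(f"nested 5x5 search  : {nested_cv_fits(outer, inner, grid)} fits")
```

The nested count is roughly the flat count multiplied by the number of outer folds, which is the price of an estimate the search never touched.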

Cross-validation tells you how to score a configuration honestly; it does not tell you which configurations to try. The inner loop of nested CV was a hand-wave — "search the hyperparameters." Chapter 02 opens that loop: grid and random search, Bayesian optimization, Hyperband and successive halving, and the budget arithmetic that decides how many of those expensive inner fits you can actually afford.

1.R

References

Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. B 36(2) — the foundational formalization of cross-validation for model assessment.
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI 1995 — the empirical case for stratified 10-fold CV (§1.2).
Varma, S. & Simon, R. (2006). Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinformatics 7:91 — the selection-bias result behind nested CV (EQ V1.6).
Bergmeir, C. & Benítez, J. M. (2012). On the Use of Cross-Validation for Time Series Predictor Evaluation. Information Sciences 191 — forward-chaining validation for temporal data (§1.4).
Arlot, S. & Celisse, A. (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys 4 — the comprehensive modern reference on CV variants and their bias/variance.
Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning and reporting on the same folds inflates scores.