DATA & FEATURE ENGINEERING · CHAPTER 01 / 05

The Data Problem — Quality, Leakage & Splits

Most experiments are decided before a single gradient is computed. A model is only as trustworthy as its data split, and the most expensive bugs in machine learning are leakage bugs, not math errors. This chapter covers what counts as good data, the train/validation/test contract that makes a measurement defensible, and the three families of leakage that inflate scores right up until deployment.

LEVEL: INTRO · READING TIME: ≈ 22 MIN · BUILDS ON: STATS 01–08 · INSTRUMENTS: LEAKAGE · SPLITS · ALLOCATOR
1.1

Garbage in, gospel out — the data-quality dimensions

The old slogan is "garbage in, garbage out." The modern danger is worse: garbage in, gospel out. A model trained on broken data does not announce its brokenness — it produces a confident number, a clean dashboard, a benchmark you can put in a slide. The failure is invisible precisely because the machinery downstream is so good at manufacturing certainty. So the first skill is not modelling; it is auditing the inputs along a handful of concrete dimensions.

| Dimension | The question it asks | How it bites a model |
| --- | --- | --- |
| Accuracy | Do recorded values match reality? | Mislabeled targets cap achievable accuracy; the model learns the error. |
| Completeness | Are values present where they should be? | Missingness is rarely random (Chapter 02); naïve imputation injects bias. |
| Consistency | Do the same facts agree across sources? | "NY" vs "New York" vs "N.Y." fractures a category into noise. |
| Timeliness | Was the value knowable when the prediction is made? | A value that arrives after the decision point is temporal leakage (§1.3). |
| Uniqueness | Are records deduplicated? | Duplicates that straddle the split leak the answer and inflate the score. |
| Validity | Do values obey their schema and range? | An age of −3 or 999 is a sentinel masquerading as data. |
| Representativeness | Does the sample resemble deployment? | If train ≠ production, every metric is measuring the wrong world (§1.4). |

None of these is exotic, and that is the point. The expensive mistakes in machine learning are almost never a missing regularizer or the wrong learning rate — they are a duplicated row, a label that encodes the answer, a timestamp read in the wrong direction. The rest of this chapter is a tour of how those mistakes hide, and the protocol that smokes them out.
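Several of these dimensions can be checked mechanically before any modelling starts. The sketch below is a minimal illustration in plain Python, assuming rows arrive as dictionaries. The field names (`age`, `state`), the valid age range, and the alias map are hypothetical stand-ins for whatever your own schema defines.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"id": 1, "age": 34,   "state": "NY"},
    {"id": 2, "age": -3,   "state": "New York"},   # invalid age
    {"id": 3, "age": None, "state": "N.Y."},       # missing age
    {"id": 4, "age": 999,  "state": "CA"},         # sentinel value
    {"id": 1, "age": 34,   "state": "NY"},         # exact duplicate of the first row
]

# Assumed canonical spellings for the consistency check.
STATE_ALIASES = {"New York": "NY", "N.Y.": "NY"}

def audit(rows, age_range=(0, 120)):
    """Count simple completeness, validity, uniqueness, and consistency issues."""
    lo, hi = age_range
    missing = sum(r["age"] is None for r in rows)
    invalid = sum(r["age"] is not None and not (lo <= r["age"] <= hi) for r in rows)
    # Two rows are duplicates when every field matches.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(c - 1 for c in counts.values())
    raw_states = {r["state"] for r in rows}
    canonical_states = {STATE_ALIASES.get(s, s) for s in raw_states}
    return {
        "missing_age": missing,
        "invalid_age": invalid,
        "duplicate_rows": duplicates,
        "raw_state_values": len(raw_states),
        "canonical_state_values": len(canonical_states),
    }

report = audit(rows)
print(report)
```

A gap between `raw_state_values` and `canonical_state_values` is the fractured category from the Consistency row of the table. Accuracy, timeliness, and representativeness cannot be checked this way; they need knowledge of how the data was produced.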

Labels are data too — and often the worst of it. Benchmark audits have found error rates of a few percent in the test sets of canonical datasets such as ImageNet and the others surveyed by Northcutt et al. (2021). When the yardstick itself is bent, two models a percentage point apart may simply be fitting different label mistakes. Read your data by hand before you trust any number computed from it.

1.2

The train / validation / test contract

Machine learning makes one promise — this model will generalize to data it has never seen — and that promise is only credible if you have held some data genuinely unseen. The standard protocol splits the dataset into three disjoint roles, and the discipline is in keeping them disjoint:

  • Training set — the model fits its parameters here. It is allowed to memorize this.
  • Validation set — used to choose things the training does not: hyperparameters, model family, when to stop. Every decision you make from it spends a little of its independence.
  • Test set — touched once, at the very end, to estimate generalization. The moment you tune against it, it stops being a test set and becomes a second validation set.
EQ D1.1 — THE THREE-WAY SPLIT $$ \mathcal{D} = \mathcal{D}_{\text{train}} \,\sqcup\, \mathcal{D}_{\text{val}} \,\sqcup\, \mathcal{D}_{\text{test}}, \qquad N = N_{\text{train}} + N_{\text{val}} + N_{\text{test}} $$
The \(\sqcup\) is a disjoint union: no row, and no information about a row, may appear in more than one part. A common allocation is 70 / 15 / 15, but the right ratio depends on size — at \(N = 10^6\) a 1% test set already holds 10,000 rows, plenty to measure with. The split is a contract, not a suggestion: the entire value of the test number is that nothing downstream of it ever saw the test data.

Why three and not two? Because evaluating on the same data you used to select a model is optimistic in exactly the way it is convenient. Each time you peek at the validation set and pick the better-scoring option, you fit a little to its particular noise. After fifty such peeks, the validation number is part wisdom, part overfitting to that one sample — so you keep a separate test set, used once, to get an honest read.
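The selection effect is easy to simulate. In the sketch below, fifty hypothetical "models" are pure coin-flippers with a true accuracy of 0.5; the set sizes and the seed are arbitrary assumptions. Choosing the best scorer on a small validation set reports something above 0.5, while a fresh test set pulls the same kind of model back toward chance.

```python
import random

rng = random.Random(0)
N_VAL, N_TEST, N_MODELS = 200, 2000, 50   # illustrative sizes

def coin_flip_accuracy(n_rows):
    """Accuracy of a model that guesses at random on n_rows rows."""
    return sum(rng.random() < 0.5 for _ in range(n_rows)) / n_rows

# Every candidate has the same true skill (0.5); only validation noise differs.
val_scores = [coin_flip_accuracy(N_VAL) for _ in range(N_MODELS)]
best = max(range(N_MODELS), key=val_scores.__getitem__)
best_val = val_scores[best]

# The selected model is still a coin-flipper, so a fresh test set shows it.
test_score = coin_flip_accuracy(N_TEST)

print(f"best validation accuracy after {N_MODELS} peeks: {best_val:.3f}")
print(f"test accuracy of the selected model          : {test_score:.3f}")
```

The validation winner looks skilled only because it was chosen for its luck on that one sample.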

You have a dataset of 1000 rows and use a 70 / 15 / 15 train/validation/test split. How many rows land in the test set?
The test fraction is \(15\% = 0.15\), so \(N_{\text{test}} = 0.15 \times 1000 = \) 150 rows. (Train gets \(0.70 \times 1000 = 700\); validation the remaining \(150\). They sum back to 1000 — a disjoint union, EQ D1.1.)
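The same arithmetic can be written as a small helper. This is a minimal sketch using only the standard library; the function name `three_way_split` and its defaults are illustrative, not a fixed API.

```python
import random

def three_way_split(n_rows, fracs=(0.70, 0.15, 0.15), seed=0):
    """Return disjoint (train, val, test) index lists covering range(n_rows)."""
    if abs(sum(fracs) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # random assignment of rows to roles
    n_train = round(fracs[0] * n_rows)
    n_val = round(fracs[1] * n_rows)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]              # remainder, so sizes always sum to n_rows
    return train, val, test

train, val, test = three_way_split(1000)
print(len(train), len(val), len(test))
```

Slicing one shuffled index list is what makes the parts disjoint by construction. A random shuffle is only appropriate when rows are exchangeable; §1.3 and §1.4 cover the cases where they are not.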

The estimate you read off the test set is itself a random quantity — a different test sample would give a slightly different number. For a metric that is an average over the test rows (accuracy is the mean of correct/incorrect indicators), the spread of that estimate shrinks with the test-set size:

EQ D1.2 — STANDARD ERROR OF A TEST METRIC $$ \mathrm{SE}(\hat{a}) = \sqrt{\frac{\hat{a}\,(1 - \hat{a})}{N_{\text{test}}}} $$
\(\hat{a}\) is a measured proportion (e.g. accuracy); \(N_{\text{test}}\) is the number of test rows. The error falls like \(1/\sqrt{N_{\text{test}}}\): to halve your uncertainty you must quadruple the test set. This is the real trade in EQ D1.1 — every row you move into the test set sharpens the measurement but starves the model, and vice versa. The allocator instrument below makes that tension tangible.
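EQ D1.2 is a one-liner in code. The sketch below also reports a 95% half-width as 1.96 standard errors, which assumes the usual normal approximation to the binomial; that approximation is reasonable for moderately large test sets and accuracies not too close to 0 or 1.

```python
import math

def standard_error(acc, n_test):
    """EQ D1.2: standard error of a proportion measured on n_test rows."""
    return math.sqrt(acc * (1.0 - acc) / n_test)

def ci95_half_width(acc, n_test):
    """Normal-approximation 95% half-width: 1.96 standard errors."""
    return 1.96 * standard_error(acc, n_test)

se_100 = standard_error(0.85, 100)
se_400 = standard_error(0.85, 400)

for n in (100, 400, 1600):
    print(f"N_test={n:5d}  SE={standard_error(0.85, n):.4f}  "
          f"95% half-width={ci95_half_width(0.85, n):.4f}")
```

At 100 test rows the half-width is roughly seven accuracy points, so an 85% model and an 80% model are not separable; each fourfold increase in rows halves it.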

When data is scarce, fold instead of waste. A single small validation set is itself a noisy sample. \(k\)-fold cross-validation rotates the validation role across \(k\) disjoint folds, trains \(k\) times, and averages — turning one anxious split into \(k\) estimates and using every row for both training and validation (never simultaneously). It costs \(k\times\) the compute and assumes the rows are exchangeable, an assumption §1.3 and §1.4 will complicate. A held-out test set still sits outside the whole cross-validation loop.
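The rotation can be sketched without any library. In the code below, `k_fold_indices` and `cross_validate` are hypothetical helper names, and `score_fn` stands in for whatever routine trains on one index list and scores on the other.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k disjoint, near-equal folds
    for i in range(k):
        val = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, val

def cross_validate(n_rows, score_fn, k=5):
    """Average a user-supplied score_fn(train_idx, val_idx) over the k folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(23, k=5))
print([len(va) for _, va in splits])
```

Any held-out test rows should be removed before `n_rows` is counted, so the loop never sees them.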

INSTRUMENT D1.3: TRAIN / VAL / TEST ALLOCATOR · EQ D1.1 · EQ D1.2 · SE READOUT
[Interactive figure. Bar: split of the dataset into train, val, and test. Readouts: train / val / test rows; test SE at a = 0.85; 95% CI half-width.]
Drag the size and the two percentages. The bar shows the disjoint split (EQ D1.1); the readouts apply EQ D1.2 to a model measured at 85% accuracy. Shrink the test set and watch the standard error — and the ±95% confidence band — balloon: a 100-row test set cannot tell an 85% model from an 80% one. Quadruple N and the band halves.
1.3

Data leakage — the most expensive bug in ML

Leakage is any path by which information from outside the training data — most dangerously, information that will not exist at prediction time — reaches the model and inflates its measured performance. It is the canonical reason a model scores beautifully in the notebook and falls apart in production. Kaufman et al. (2012) gave the field its working definition and taxonomy; the failure has not gotten rarer since. Three families cover almost every case.

Target leakage

A feature is a proxy for the answer because it is a consequence of the label, not a cause available beforehand. The classic example: predicting whether a patient has a disease, with prescribed_treatment_for_that_disease sitting in the feature table. The feature is enormously predictive and completely useless — at prediction time, the diagnosis hasn't happened, so the treatment doesn't exist yet. The model has learned to read the answer key.

EQ D1.3 — THE LEAKAGE CRITERION $$ \text{feature } x_j \text{ is admissible} \iff x_j \text{ is knowable at prediction time } t_{\text{pred}}, \text{ before } y \text{ is realized} $$
The single test that catches most leakage: for every feature, ask whether its value would actually be available, with that value, at the moment the model must predict. If the feature is computed from the future, from the label, or from the test rows, it fails. This is a question about your process, not your statistics — no correlation analysis substitutes for it.
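Once the availability of each feature has been established from process knowledge, enforcing EQ D1.3 is mechanical. The sketch below assumes a hand-built catalogue that records when each feature's value first exists for a case; the feature names, dates, and `T_PRED` are all hypothetical.

```python
from datetime import datetime

# Assumed moment at which the model must produce its prediction.
T_PRED = datetime(2024, 3, 1, 9, 0)

# Hypothetical catalogue: when does each value first exist for this case?
feature_available_at = {
    "age":                  datetime(2024, 1, 10),
    "lab_result_at_intake": datetime(2024, 3, 1, 8, 30),
    "prescribed_treatment": datetime(2024, 3, 4),   # recorded after the diagnosis
    "discharge_code":       datetime(2024, 3, 9),   # recorded after the outcome
}

def admissible_features(available_at, t_pred):
    """Split features into those knowable at t_pred and those that are not."""
    keep = sorted(f for f, t in available_at.items() if t <= t_pred)
    drop = sorted(f for f, t in available_at.items() if t > t_pred)
    return keep, drop

keep, drop = admissible_features(feature_available_at, T_PRED)
print("keep:", keep)
print("drop:", drop)
```

The code only enforces the catalogue. Filling it in correctly is the hard part, and that is the process question no correlation analysis can answer.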

Temporal leakage

Information from the future bleeds into the past. The subtle version isn't an obvious "future" column — it's a statistic computed over the whole dataset and then used on early rows. Fitting a scaler, an imputer, an encoder, or any preprocessing on the full data before splitting lets the training rows "see" the mean, variance, or category set of the test rows. The leak is small per feature and devastating in aggregate, because it happens silently inside a one-line call.

SILENT KILLER

Fit preprocessing inside the split, never across it. scaler.fit(X) then train_test_split(...) is leakage: the scaler's mean and variance were computed using test rows. The correct order is split first, then scaler.fit(X_train) and scaler.transform(X_test) with the training statistics. Inside cross-validation, the scaler must be re-fit on each fold's training portion — which is exactly what an sklearn Pipeline does for you, and why pipelines exist.
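The correct ordering can be shown without sklearn. This is a minimal sketch on one synthetic feature using the standard library; the data, sizes, and seed are assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)
values = [rng.gauss(50.0, 10.0) for _ in range(200)]   # one synthetic feature

# 1. Split first.
idx = list(range(len(values)))
rng.shuffle(idx)
train_idx, test_idx = idx[:150], idx[150:]

# 2. Fit the scaler on training rows only.
train_vals = [values[i] for i in train_idx]
mu = statistics.fmean(train_vals)
sd = statistics.pstdev(train_vals)

def transform(x):
    """Standardize with the TRAIN statistics, whichever split x comes from."""
    return (x - mu) / sd

# 3. Apply the same train-fitted transform to both parts.
train_scaled = [transform(values[i]) for i in train_idx]
test_scaled = [transform(values[i]) for i in test_idx]

print(f"train mean after scaling: {statistics.fmean(train_scaled):+.3f}")
print(f"test  mean after scaling: {statistics.fmean(test_scaled):+.3f}")
```

The scaled test mean is generally not exactly zero, and that is expected: the test rows had no say in `mu` or `sd`.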

Group leakage

Rows are not independent: several belong to the same entity (the same patient, user, document, or molecule), and a random split scatters that entity across train and test. The model then "generalizes" to a test patient it has already met under a different row. The fix is a grouped split: partition by the group key so an entire entity lands wholly on one side. The split visualizer in §1.4 (Instrument D1.2) shows random, time-based, and grouped splits side by side.
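A grouped split shuffles entities rather than rows. The sketch below is one simple way to do it with the standard library; `grouped_split` and the patient IDs are hypothetical.

```python
import random

def grouped_split(group_keys, test_frac=0.2, seed=0):
    """Split row indices so every group lands wholly in train or in test."""
    groups = sorted(set(group_keys))
    random.Random(seed).shuffle(groups)       # shuffle entities, not rows
    n_test_groups = max(1, round(test_frac * len(groups)))
    test_groups = set(groups[:n_test_groups])
    train = [i for i, g in enumerate(group_keys) if g not in test_groups]
    test = [i for i, g in enumerate(group_keys) if g in test_groups]
    return train, test

# Hypothetical patient IDs, several rows per patient.
patient_of_row = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train, test = grouped_split(patient_of_row, test_frac=0.33)

train_patients = {patient_of_row[i] for i in train}
test_patients = {patient_of_row[i] for i in test}
print("train patients:", sorted(train_patients))
print("test patients :", sorted(test_patients))
```

Here `test_frac` applies to the number of groups, so the share of rows in the test set is only approximate when group sizes vary.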

You compute the mean of a feature over the full dataset and use it to standardize every value, then split into train and test. Is this data leakage? (Answer yes or no.)
The full-data mean was computed using the test rows, so the training rows were transformed with information about the test set — a textbook case of temporal/preprocessing leakage. The honest procedure splits first, fits the mean on \(\mathcal{D}_{\text{train}}\) only, and applies that same mean to the test rows. Answer: yes.
PYTHON · RUNNABLE IN-BROWSER
# Target leakage: a feature that is a consequence of the label
# A logistic model fit WITH vs WITHOUT the leaked column.
import numpy as np
rng = np.random.default_rng(0)
N = 600
y = rng.integers(0, 2, N)                       # the label
honest = 0.6*y + rng.normal(0, 1.0, N)          # a weak, legitimate signal
leak   = y + rng.normal(0, 0.05, N)             # a near-copy of the label (leak!)

def fit_score(cols):
    X = np.column_stack(cols + [np.ones(N)])     # add bias
    tr, te = slice(0, 400), slice(400, N)        # honest holdout
    w = np.zeros(X.shape[1])
    for _ in range(400):                          # gradient-descent logistic fit
        p = 1/(1+np.exp(-X[tr] @ w))
        w -= 0.05 * X[tr].T @ (p - y[tr]) / 400
    pred = (1/(1+np.exp(-X[te] @ w))) > 0.5
    return (pred == y[te]).mean()

print(f"honest features only      : {fit_score([honest]):.3f} test accuracy")
print(f"+ leaked 'future' feature : {fit_score([honest, leak]):.3f} test accuracy")
print("\nThe leak looks like a miracle feature -- because it IS the answer.")
print("On real holdout where 'leak' is unavailable, that gain evaporates.")
PYTHON · RUNNABLE IN-BROWSER
# Scaling-before-split leakage: same model, two preprocessing orders.
# Manual 5-fold CV; the only difference is WHERE the scaler is fit.
import numpy as np
rng = np.random.default_rng(1)
N, d = 400, 8
X = rng.normal(0, 1, (N, d))
y = (X[:, 0] + 0.5*rng.normal(0, 1, N) > 0).astype(float)   # weak signal

def cv(leaky):
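    # Fixed fold seed: both runs share folds, so only the scaler placement differs.
    folds = np.array_split(np.random.default_rng(2).permutation(N), 5); accs = []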
    for k in range(5):
        te = folds[k]; tr = np.concatenate([folds[j] for j in range(5) if j != k])
        rows = slice(None) if leaky else tr     # WRONG vs RIGHT: which rows scale?
        mu, sd = X[rows].mean(0), X[rows].std(0) + 1e-9
        Xb = np.column_stack([(X - mu) / sd, np.ones(N)])
        w = np.zeros(d + 1)
        for _ in range(300):
            p = 1/(1+np.exp(-Xb[tr] @ w)); w -= 0.1*Xb[tr].T @ (p - y[tr])/len(tr)
        accs.append(((1/(1+np.exp(-Xb[te] @ w)) > 0.5) == y[te]).mean())
    return np.mean(accs)

print(f"scaler fit on FULL data (leaky) : CV acc {cv(True):.3f}")
print(f"scaler fit INSIDE each fold     : CV acc {cv(False):.3f}")
print("\nAny gap is the leak. With plain standardization on i.i.d. rows it can be tiny or zero;")
print("a Pipeline re-fits the scaler per fold so the honest number is all you see.")
INSTRUMENT D1.1: LEAKAGE DEMONSTRATOR · VALIDATION (LEAKY) vs TRUE HOLDOUT · EQ D1.3
[Interactive figure. Readouts: validation accuracy (reported); true holdout accuracy; illusion (the drop).]
With the leak OFF, both bars sit at the model's honest skill (~0.82). Turn the leak ON and the validation bar climbs toward 1.0 as you raise leak strength — because the validation rows share the leaked feature — while the true holdout bar barely moves, since the leaked column is unavailable at real prediction time (EQ D1.3). The gap between the bars is the size of the lie you would have shipped.
1.4

Sampling, representativeness & distribution shift

A split keeps the test data unseen, but it does not guarantee the data resembles the world the model will face. The whole evaluation rests on one assumption — that training, test, and deployment data are drawn from the same distribution. When that fails, even a flawless split measures the wrong thing.

EQ D1.4 — THE i.i.d. ASSUMPTION (AND ITS FAILURE) $$ \text{evaluation is valid} \iff p_{\text{train}}(x, y) \approx p_{\text{test}}(x, y) \approx p_{\text{deploy}}(x, y) $$
"i.i.d." = independent and identically distributed. Covariate shift moves \(p(x)\) (the input mix changes — new users, new regions); label shift moves \(p(y)\) (the base rate changes — fraud surges); concept drift moves \(p(y\mid x)\) (the rule itself changes — last year's spam looks innocuous today). A random split hides all three, because it makes train and test identical by construction while deployment quietly diverges.
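One way to keep the three shifts apart is to factor the joint distribution two ways. The factorization is a standard identity; the "unchanged" conditions are the usual textbook definitions of each shift, stated here as an assumption about how the terms are being used.

```latex
\begin{aligned}
p(x, y) &= p(x)\,p(y \mid x) = p(y)\,p(x \mid y) \\
\text{covariate shift:}\quad & p_{\text{train}}(x) \neq p_{\text{deploy}}(x), \quad p(y \mid x) \text{ unchanged} \\
\text{label shift:}\quad & p_{\text{train}}(y) \neq p_{\text{deploy}}(y), \quad p(x \mid y) \text{ unchanged} \\
\text{concept drift:}\quad & p_{\text{train}}(y \mid x) \neq p_{\text{deploy}}(y \mid x)
\end{aligned}
```

In practice more than one factor can move at once, so these are labels for diagnosis rather than mutually exclusive cases.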

The remedy depends on the structure of the data. When time matters — anything forecasting, anything where today's model predicts tomorrow — a random split is a lie, because it lets the model train on the future and test on the past. The honest protocol is a time-based split: train on the past, validate and test on strictly later periods, exactly as deployment will run. When records cluster by entity, use a grouped split so no entity straddles the boundary (§1.3). Sometimes you need both at once.
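A time-based split needs only a cutoff. The sketch below assumes each row carries a timestamp; `time_based_split`, the dates, and the cutoff are hypothetical.

```python
from datetime import date

def time_based_split(timestamps, cutoff):
    """Train on rows strictly before cutoff, test on rows at or after it."""
    train = [i for i, t in enumerate(timestamps) if t < cutoff]
    test = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train, test

# Hypothetical event dates, deliberately stored out of order.
event_dates = [date(2023, 5, 2), date(2023, 1, 15), date(2023, 9, 30),
               date(2023, 3, 8), date(2023, 11, 21), date(2023, 7, 4)]

train, test = time_based_split(event_dates, cutoff=date(2023, 8, 1))
latest_train = max(event_dates[i] for i in train)
earliest_test = min(event_dates[i] for i in test)
print("train rows:", train, "latest  :", latest_train)
print("test rows :", test, "earliest:", earliest_test)
```

Filtering on the timestamp rather than on row position keeps the split correct even when the table is not sorted. Combining this with a grouped split means applying both constraints and discarding rows that violate either.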

Sampling bias is upstream of all of this. If the data was collected in a way that over- or under-represents part of the world — survivorship bias, self-selection, a sensor that only logged failures — no split or model can recover what was never sampled. Stratified sampling (preserving class or subgroup proportions in every split) protects measurement when classes are imbalanced, but it cannot conjure a population that was never observed. The cheapest fix to a representativeness problem is almost always collecting better data, not a cleverer estimator.
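Stratification amounts to splitting each class separately and then pooling the parts. The sketch below is a minimal standard-library version; `stratified_split` and the 5% positive rate are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so each class keeps roughly its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for y in sorted(by_class):
        rows = by_class[y]
        rng.shuffle(rows)                     # random within each class
        n_test = round(test_frac * len(rows))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

# Hypothetical imbalanced labels: 50 positives among 1000 rows.
labels = [1] * 50 + [0] * 950
train, test = stratified_split(labels, test_frac=0.2)

pos_rate_train = sum(labels[i] for i in train) / len(train)
pos_rate_test = sum(labels[i] for i in test) / len(test)
print(f"positive rate  train: {pos_rate_train:.3f}  test: {pos_rate_test:.3f}")
```

Both parts keep the 5% base rate of the sample. If the sample's base rate is itself wrong for deployment, stratification faithfully preserves that error.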

INSTRUMENT D1.2: SPLIT VISUALIZER · RANDOM · TIME-BASED · GROUPED
[Interactive figure. Control: strategy. Readouts: entities spanning the split; future→past order violations.]
Each cell is one row, ordered left-to-right by time, its letter the entity it belongs to. RANDOM scatters train (mint) and test (grey) freely — and the readouts flag both group leakage (an entity in both sets) and time violations (test rows earlier than train rows). TIME-BASED puts every test row strictly after every train row: zero order violations. GROUPED keeps each lettered entity wholly on one side: zero spanning entities. Notice no single strategy zeroes out every risk — that is the real lesson.
1.5

Building the modeling dataset — a protocol

Pulling the pieces together, here is the order of operations that keeps every later number honest. The sequence matters more than any single step: most leakage is an ordering bug, a transformation that happened one line too early.

# A leakage-safe pipeline. The ORDER is the point, not any one line.
1 define:     the prediction target y AND the exact moment t_pred it is made
2 audit:      every feature against EQ D1.3 — knowable at t_pred? drop if not
3 dedup:      remove duplicate / near-duplicate rows BEFORE splitting
4 split:      choose random / time-based / grouped to match the real task
             (split FIRST — everything below sees only its own partition)
5 fit prep:   fit scalers / imputers / encoders on TRAIN only
6 transform:  apply TRAIN-fitted statistics to val and test
7 decontaminate: check no train row (or its duplicate) is in val/test
8 evaluate:   tune on val; touch test ONCE; report with EQ D1.2 error bars

Two habits make this durable. First, wrap steps 5–6 in a single object — an sklearn Pipeline or its equivalent — so the preprocessing is re-fit automatically inside every cross-validation fold and can never accidentally span the split. Second, treat decontamination as a first-class step: hash your rows and confirm no training example (or a trivial variant of one) appears in validation or test. This is the same discipline that fine-tuning a language model demands against its eval sets (Vol II · CH 06), and the same arithmetic that information theory gives the loss it minimizes (STATS · 08).
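The hashing check can be sketched in a few lines. The normalization below (lowercasing and trimming whitespace) is an assumed definition of "trivial variant"; `row_fingerprint`, `contaminated`, and the example rows are hypothetical.

```python
import hashlib

def row_fingerprint(row):
    """Hash a row after trivial normalization (case and surrounding spaces)."""
    canon = "\x1f".join(str(v).strip().lower() for v in row)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def contaminated(train_rows, eval_rows):
    """Return indices of eval rows whose fingerprint also occurs in train."""
    seen = {row_fingerprint(r) for r in train_rows}
    return [i for i, r in enumerate(eval_rows) if row_fingerprint(r) in seen]

# Hypothetical rows; the second test row is a trivial variant of a train row.
train_rows = [("Alice", "NY", 34), ("Bob", "CA", 51)]
test_rows = [("Carol", "TX", 29), (" alice ", "ny", 34)]

leaks = contaminated(train_rows, test_rows)
print("contaminated test rows:", leaks)
```

Exact hashing catches only variants that the normalization erases. Near-duplicates such as paraphrased text or slightly perturbed measurements need a fuzzier similarity check, which this sketch does not attempt.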

Same 70 / 15 / 15 split on the 1000-row dataset from §1.2. You split first and then fit your scaler on the training data only. How many rows is that scaler fit on, i.e. how many training rows are there?
The training fraction is \(70\% = 0.70\), so \(N_{\text{train}} = 0.70 \times 1000 = \) 700 rows. The scaler's mean and variance are computed from these 700 rows alone (step 5), then applied unchanged to the 150 validation and 150 test rows (step 6) — never the reverse.
NEXT

This chapter assumed your rows were at least present. They rarely are. The most common quality defect — the empty cell — turns out to carry information of its own: why a value is missing often predicts the value itself, and the wrong imputation quietly biases everything downstream. Next: Data · 02 — Missing Data, where we make the absence itself a feature.

1.R

References

  1. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM TKDD 6(4) — the field's working definition and taxonomy of leakage (§1.3, EQ D1.3).
  2. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; cross-validation, the train/test contract, and the right vs wrong way to cross-validate (§1.2, §1.5).
  3. Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks — measured label-error rates in ImageNet and nine other canonical test sets (§1.1).
  4. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 366(6464) — a real-world label / proxy that leaked the wrong target into a deployed model (§1.1, §1.4).
  5. Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. JMLR 11 — why tuning on the test set inflates results, and nested cross-validation as the fix (§1.2).
  6. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press — covariate shift, label shift, and concept drift formalized (§1.4, EQ D1.4).