Garbage in, gospel out — the data-quality dimensions
The old slogan is "garbage in, garbage out." The modern danger is worse: garbage in, gospel out. A model trained on broken data does not announce its brokenness — it produces a confident number, a clean dashboard, a benchmark you can put in a slide. The failure is invisible precisely because the machinery downstream is so good at manufacturing certainty. So the first skill is not modelling; it is auditing the inputs along a handful of concrete dimensions.
| Dimension | The question it asks | How it bites a model |
|---|---|---|
| Accuracy | Do recorded values match reality? | Mislabeled targets cap achievable accuracy; the model learns the error. |
| Completeness | Are values present where they should be? | Missingness is rarely random (Chapter 02); naïve imputation injects bias. |
| Consistency | Do the same facts agree across sources? | "NY" vs "New York" vs "N.Y." fractures a category into noise. |
| Timeliness | Is the value knowable at the moment the prediction is made? | A value that arrives after the decision point is temporal leakage (§1.3). |
| Uniqueness | Are records deduplicated? | Duplicates that straddle the split leak the answer and inflate the score. |
| Validity | Do values obey their schema and range? | An age of −3 or 999 is a sentinel masquerading as data. |
| Representativeness | Does the sample resemble deployment? | If train ≠ production, every metric is measuring the wrong world (§1.4). |
None of these is exotic, and that is the point. The expensive mistakes in machine learning are almost never a missing regularizer or the wrong learning rate — they are a duplicated row, a label that encodes the answer, a timestamp read in the wrong direction. The rest of this chapter is a tour of how those mistakes hide, and the protocol that smokes them out.
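Several of the dimensions in the table reduce to one-line checks. The sketch below is illustrative only: the toy table, its column names, and the 0 to 120 age range are assumptions made for the example, not part of any real schema.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and valid ranges are illustrative assumptions.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5],
    "age":        [34, 51, 51, -3, 999, np.nan],
    "state":      ["NY", "New York", "New York", "N.Y.", "CA", "CA"],
})

def audit(frame, age_range=(0, 120)):
    """Return a few data-quality counts for the toy schema above."""
    lo, hi = age_range
    age = frame["age"]
    return {
        # Completeness: share of missing cells per column.
        "missing_rate": frame.isna().mean().to_dict(),
        # Uniqueness: rows that exactly repeat an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Validity: present values that fall outside the allowed range.
        "invalid_age": int(((age < lo) | (age > hi)).sum()),
        # Consistency: distinct spellings observed for one categorical field.
        "state_spellings": sorted(frame["state"].dropna().unique()),
    }

report = audit(df)
for name, value in report.items():
    print(f"{name:16s}: {value}")
```

Checks like these catch only the mechanical defects. Accuracy, timeliness, and representativeness still need a human who knows how the data was produced.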
Labels are data too, and often the worst of it. Northcutt et al. (2021) audited the test sets of ten widely used benchmarks and estimated an average label-error rate of at least 3.3%, rising to at least 6% for the ImageNet validation set. When the yardstick itself is bent, two models a percentage point apart may simply be fitting different label mistakes. Read your data by hand before you trust any number computed from it.
The train / validation / test contract
Machine learning makes one promise — this model will generalize to data it has never seen — and that promise is only credible if you have held some data genuinely unseen. The standard protocol splits the dataset into three disjoint roles, and the discipline is in keeping them disjoint:
- Training set — the model fits its parameters here. It is allowed to memorize this.
- Validation set — used to choose things the training does not: hyperparameters, model family, when to stop. Every decision you make from it spends a little of its independence.
- Test set — touched once, at the very end, to estimate generalization. The moment you tune against it, it stops being a test set and becomes a second validation set.
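The three roles can be sketched as one shuffle followed by two cuts. This is a minimal sketch assuming the rows are exchangeable, so a random split is appropriate; the function name and the 70/15/15 proportions are illustrative choices.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices once and cut them into disjoint train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    n_val = int(round(val_frac * n_rows))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Working with index arrays rather than copies of the data makes it easy to assert later that the three sets never overlap.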
Why three and not two? Because evaluating on the same data you used to select a model is optimistic in exactly the way it is convenient. Each time you peek at the validation set and pick the better-scoring option, you fit a little to its particular noise. After fifty such peeks, the validation number is part wisdom, part overfitting to that one sample — so you keep a separate test set, used once, to get an honest read.
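A small simulation makes the cost of peeking concrete. In this sketch the labels are coin flips and every "candidate model" is a set of random guesses, an assumption chosen so that any apparent skill can only be selection noise. The set sizes and the count of 50 candidates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_candidates = 200, 2000, 50

# Labels are fair coin flips, so no classifier can truly beat 50% accuracy.
y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

# Each "candidate model" is just a fixed set of random guesses.
val_preds = rng.integers(0, 2, (n_candidates, n_val))
test_preds = rng.integers(0, 2, (n_candidates, n_test))

val_acc = (val_preds == y_val).mean(axis=1)
test_acc = (test_preds == y_test).mean(axis=1)

best = int(val_acc.argmax())   # the "peek": keep whichever scored best on validation
print(f"best-of-{n_candidates} validation accuracy: {val_acc[best]:.3f}")
print(f"same candidate on the test set   : {test_acc[best]:.3f}")
```

The winning validation score sits well above chance even though nothing was learned, while the untouched test set reports roughly 50%.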
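The estimate you read off the test set is itself a random quantity: a different test sample would give a slightly different number. For a metric that is an average over the test rows (accuracy is the mean of correct/incorrect indicators), the spread of that estimate shrinks with the square root of the test-set size \(n\). For an accuracy \(\hat{a}\) measured on \(n\) independent rows, the standard binomial form is

\[ \mathrm{SE}(\hat{a}) = \sqrt{\frac{\hat{a}\,(1-\hat{a})}{n}} \]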
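A minimal sketch of how that spread scales, assuming independent test rows and using the standard binomial standard error for accuracy. The function names are hypothetical, and the interval relies on a normal approximation that is rough for small test sets or accuracies near 0 or 1.

```python
import math

def accuracy_standard_error(acc, n_test):
    """Binomial standard error of an accuracy estimate from n_test independent rows."""
    return math.sqrt(acc * (1.0 - acc) / n_test)

def normal_interval(acc, n_test, z=1.96):
    """Approximate 95% interval via the normal approximation, clipped to [0, 1]."""
    half = z * accuracy_standard_error(acc, n_test)
    return max(0.0, acc - half), min(1.0, acc + half)

for n in (100, 1_000, 10_000):
    lo, hi = normal_interval(0.90, n)
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>6}: 0.90 +/- {se:.4f}  -> [{lo:.3f}, {hi:.3f}]")
```

Quadrupling the test set only halves the error bar, which is why a one-point gap between two models on a few hundred rows is rarely meaningful.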
When data is scarce, fold instead of waste. A single small validation set is itself a noisy sample. \(k\)-fold cross-validation rotates the validation role across \(k\) disjoint folds, trains \(k\) times, and averages — turning one anxious split into \(k\) estimates and using every row for both training and validation (never simultaneously). It costs \(k\times\) the compute and assumes the rows are exchangeable, an assumption §1.3 and §1.4 will complicate. A held-out test set still sits outside the whole cross-validation loop.
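The rotation can be sketched in a few lines of NumPy. The nearest-centroid classifier and the synthetic data are assumptions made to keep the example short; any model could take their place. The test rows are carved off before the loop starts.

```python
import numpy as np

def kfold_indices(n_rows, k, rng):
    """Yield (train_idx, val_idx) pairs; each row is in validation exactly once."""
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def centroid_accuracy(X, y, train_idx, eval_idx):
    """Toy model: predict the class whose training centroid is nearer."""
    c0 = X[train_idx][y[train_idx] == 0].mean(axis=0)
    c1 = X[train_idx][y[train_idx] == 1].mean(axis=0)
    d0 = np.linalg.norm(X[eval_idx] - c0, axis=1)
    d1 = np.linalg.norm(X[eval_idx] - c1, axis=1)
    return float(((d1 < d0).astype(int) == y[eval_idx]).mean())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.normal(0, 1, (300, 4)) + 1.0 * y[:, None]   # synthetic data with a class-dependent shift

# The test rows are set aside first and never enter the cross-validation loop.
order = rng.permutation(300)
test_idx, dev_idx = order[:60], order[60:]

scores = [
    centroid_accuracy(X, y, dev_idx[tr], dev_idx[va])
    for tr, va in kfold_indices(len(dev_idx), k=5, rng=rng)
]
print(f"5-fold CV accuracy    : {np.mean(scores):.3f} (fold std {np.std(scores):.3f})")
print(f"held-out test accuracy: {centroid_accuracy(X, y, dev_idx, test_idx):.3f}")
```

The fold-to-fold standard deviation gives a rough sense of how noisy a single split would have been.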
Data leakage — the most expensive bug in ML
Leakage is any path by which information from outside the training data — most dangerously, information that will not exist at prediction time — reaches the model and inflates its measured performance. It is the canonical reason a model scores beautifully in the notebook and falls apart in production. Kaufman et al. (2012) gave the field its working definition and taxonomy; the failure has not gotten rarer since. Three families cover almost every case.
Target leakage
A feature is a proxy for the answer because it is a consequence of the label, not a cause available beforehand. The classic example: predicting whether a patient has a disease, with prescribed_treatment_for_that_disease sitting in the feature table. The feature is enormously predictive and completely useless — at prediction time, the diagnosis hasn't happened, so the treatment doesn't exist yet. The model has learned to read the answer key.
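One way to make this audit mechanical is to record, for every feature, when its value comes into existence relative to the prediction moment. The catalog below is entirely hypothetical: the feature names other than the treatment column and all the day offsets are invented for illustration.

```python
# Hypothetical feature catalog: days between the prediction moment t_pred and the
# moment each value is recorded (negative = already known, positive = recorded later).
AVAILABILITY_DAYS = {
    "age": -3650,
    "blood_pressure_at_intake": -1,
    "lab_result_ordered_at_visit": 0,
    "prescribed_treatment_for_that_disease": 14,   # exists only after the diagnosis
}

def split_by_availability(catalog):
    """Partition feature names into (usable, leaky) by whether they exist at t_pred."""
    usable = [name for name, offset in catalog.items() if offset <= 0]
    leaky = [name for name, offset in catalog.items() if offset > 0]
    return usable, leaky

usable, leaky = split_by_availability(AVAILABILITY_DAYS)
print("usable at prediction time:", usable)
print("drop (recorded afterwards):", leaky)
```

The code is trivial. The real work is filling in the catalog honestly, which usually means asking whoever owns the source system when each field is actually written.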
Temporal leakage
Information from the future bleeds into the past. The subtle version isn't an obvious "future" column — it's a statistic computed over the whole dataset and then used on early rows. Fitting a scaler, an imputer, an encoder, or any preprocessing on the full data before splitting lets the training rows "see" the mean, variance, or category set of the test rows. The leak is small per feature and devastating in aggregate, because it happens silently inside a one-line call.
Fit preprocessing inside the split, never across it. scaler.fit(X) then train_test_split(...) is leakage: the scaler's mean and variance were computed using test rows. The correct order is split first, then scaler.fit(X_train) and scaler.transform(X_test) with the training statistics. Inside cross-validation, the scaler must be re-fit on each fold's training portion — which is exactly what an sklearn Pipeline does for you, and why pipelines exist.
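A minimal sketch of that order of operations with scikit-learn, assuming synthetic data and a logistic-regression model chosen only for illustration. `make_pipeline`, `StandardScaler`, `cross_val_score`, and `train_test_split` are standard scikit-learn calls; the data and split sizes are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, (400, 8))                    # synthetic features (assumption)
y = (X[:, 0] + rng.normal(0, 1.5, 400) > 5.0).astype(int)

# Split first: the test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler lives inside the pipeline, so cross_val_score re-fits it on each
# fold's training portion instead of on all of X_train.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)                           # scaler statistics come from X_train only
print(f"5-fold CV accuracy : {cv_scores.mean():.3f}")
print(f"test accuracy      : {model.score(X_test, y_test):.3f}")
```

Because the scaler and the classifier travel as one object, there is no separate `fit` call that could accidentally be handed the full dataset.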
Group leakage
Rows are not independent — several belong to the same entity (the same patient, user, document, or molecule), and a random split scatters that entity across train and test. The model then "generalizes" to a test patient it has already met under a different row. The fix is a grouped split: partition by the group key so an entire entity lands wholly in one side. The instrument below shows random, time-based, and grouped splits side by side.
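A grouped split can be sketched by shuffling the entity keys rather than the rows. The patient table here is hypothetical, and the function is a bare-bones illustration. scikit-learn's `GroupKFold` and `GroupShuffleSplit` provide the same idea with more options.

```python
import numpy as np

def grouped_split(groups, test_frac=0.25, seed=0):
    """Hold out whole groups: roughly test_frac of the distinct group keys go to test."""
    rng = np.random.default_rng(seed)
    unique_groups = rng.permutation(np.unique(groups))
    n_test_groups = max(1, int(round(test_frac * len(unique_groups))))
    test_groups = unique_groups[:n_test_groups]
    test_mask = np.isin(groups, test_groups)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical data: 40 patients with 5 visits (rows) each.
patient_id = np.repeat(np.arange(40), 5)
train_idx, test_idx = grouped_split(patient_id)

shared = np.intersect1d(patient_id[train_idx], patient_id[test_idx])
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}, shared patients: {len(shared)}")
```

The final intersection check is worth keeping in real code: it is the assertion that fails when someone later swaps in a row-level shuffle.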
# Target leakage: a feature that is a consequence of the label.
# A logistic model fit WITH vs WITHOUT the leaked column.
import numpy as np

rng = np.random.default_rng(0)
N = 600
y = rng.integers(0, 2, N)                     # the label
honest = 0.6*y + rng.normal(0, 1.0, N)        # a weak, legitimate signal
leak = y + rng.normal(0, 0.05, N)             # a near-copy of the label (leak!)

def fit_score(cols):
    X = np.column_stack(cols + [np.ones(N)])  # add bias
    tr, te = slice(0, 400), slice(400, N)     # honest holdout
    w = np.zeros(X.shape[1])
    for _ in range(400):                      # gradient-descent logistic fit
        p = 1/(1+np.exp(-X[tr] @ w))
        w -= 0.05 * X[tr].T @ (p - y[tr]) / 400
    pred = (1/(1+np.exp(-X[te] @ w))) > 0.5
    return (pred == y[te]).mean()

print(f"honest features only : {fit_score([honest]):.3f} test accuracy")
print(f"+ leaked 'future' feature : {fit_score([honest, leak]):.3f} test accuracy")
print("\nThe leak looks like a miracle feature -- because it IS the answer.")
print("On real holdout where 'leak' is unavailable, that gain evaporates.")

# Scaling-before-split leakage: same model, two preprocessing orders.
# Manual 5-fold CV; the only difference is WHERE the scaler is fit.
import numpy as np

rng = np.random.default_rng(1)
N, d = 400, 8
X = rng.normal(0, 1, (N, d))
y = (X[:, 0] + 0.5*rng.normal(0, 1, N) > 0).astype(float)  # label driven by feature 0 plus noise
folds = np.array_split(rng.permutation(N), 5)  # drawn once so both runs share the same folds

def sigmoid(z):
    return 1/(1+np.exp(-z))

def cv(leaky):
    accs = []
    for k in range(5):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(5) if j != k])
        rows = slice(None) if leaky else tr   # WRONG vs RIGHT: which rows scale?
        mu, sd = X[rows].mean(0), X[rows].std(0) + 1e-9
        Xb = np.column_stack([(X - mu) / sd, np.ones(N)])  # scaled features + bias
        w = np.zeros(d + 1)
        for _ in range(300):                  # gradient-descent logistic fit
            p = sigmoid(Xb[tr] @ w)
            w -= 0.1 * Xb[tr].T @ (p - y[tr]) / len(tr)
        accs.append(((sigmoid(Xb[te] @ w) > 0.5) == y[te]).mean())
    return np.mean(accs)

print(f"scaler fit on FULL data (leaky) : CV acc {cv(True):.3f}")
print(f"scaler fit INSIDE each fold : CV acc {cv(False):.3f}")
print("\nAny gap between the two is the leak. It is tiny per feature and pure illusion;")
print("a Pipeline re-fits the scaler per fold so the honest number is all you see.")
Sampling, representativeness & distribution shift
A split keeps the test data unseen, but it does not guarantee the data resembles the world the model will face. The whole evaluation rests on one assumption — that training, test, and deployment data are drawn from the same distribution. When that fails, even a flawless split measures the wrong thing.
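One cheap way to probe that assumption is to compare each feature's distribution in the training data against a sample from deployment. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand on synthetic data. The 0.8 mean shift is an arbitrary illustration, and a per-feature check like this will miss shifts that only show up jointly across features or in the labels.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 2000)
same_world = rng.normal(0.0, 1.0, 2000)       # deployment data from the same distribution
shifted_world = rng.normal(0.8, 1.0, 2000)    # deployment data whose mean has drifted

print(f"KS vs same distribution   : {ks_statistic(train_feature, same_world):.3f}")
print(f"KS vs shifted distribution: {ks_statistic(train_feature, shifted_world):.3f}")
```

A statistic near zero is consistent with no shift in that feature. A large one is a prompt to investigate before trusting any held-out metric.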
The remedy depends on the structure of the data. When time matters — anything forecasting, anything where today's model predicts tomorrow — a random split is a lie, because it lets the model train on the future and test on the past. The honest protocol is a time-based split: train on the past, validate and test on strictly later periods, exactly as deployment will run. When records cluster by entity, use a grouped split so no entity straddles the boundary (§1.3). Sometimes you need both at once.
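A time-based split is a pair of cutoffs rather than a shuffle. The sketch below assumes hypothetical daily records for 2023 and arbitrary cutoff dates. For rolling evaluation, scikit-learn's `TimeSeriesSplit` offers an expanding-window variant of the same idea.

```python
import numpy as np

def time_based_split(timestamps, val_start, test_start):
    """Train strictly before val_start, validate in [val_start, test_start), test from test_start on."""
    timestamps = np.asarray(timestamps)
    train_idx = np.flatnonzero(timestamps < val_start)
    val_idx = np.flatnonzero((timestamps >= val_start) & (timestamps < test_start))
    test_idx = np.flatnonzero(timestamps >= test_start)
    return train_idx, val_idx, test_idx

# Hypothetical daily records for 2023, deliberately stored out of order.
days = np.arange("2023-01-01", "2024-01-01", dtype="datetime64[D]")
rng = np.random.default_rng(0)
timestamps = rng.permutation(days)

train_idx, val_idx, test_idx = time_based_split(
    timestamps, np.datetime64("2023-09-01"), np.datetime64("2023-11-01")
)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting on the timestamp value rather than on row position means the result does not depend on how the table happens to be sorted.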
Sampling bias is upstream of all of this. If the data was collected in a way that over- or under-represents part of the world — survivorship bias, self-selection, a sensor that only logged failures — no split or model can recover what was never sampled. Stratified sampling (preserving class or subgroup proportions in every split) protects measurement when classes are imbalanced, but it cannot conjure a population that was never observed. The cheapest fix to a representativeness problem is almost always collecting better data, not a cleverer estimator.
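Stratification can be sketched as a separate shuffle inside each class. The labels below are a hypothetical 5%-positive sample. In practice `train_test_split(..., stratify=y)` or `StratifiedKFold` in scikit-learn does the same job.

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Hold out test_frac of the rows within each class so both sides keep the class mix."""
    rng = np.random.default_rng(seed)
    test_parts = []
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_test = max(1, int(round(test_frac * len(members))))
        test_parts.append(members[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Hypothetical imbalanced labels: 5% positives.
y = np.array([1] * 50 + [0] * 950)
train_idx, test_idx = stratified_split(y)
print(f"positive rate  train: {y[train_idx].mean():.3f}  test: {y[test_idx].mean():.3f}")
```

With a plain random split of the same data, the handful of positives in the test set would vary from draw to draw, and so would any metric computed on them.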
Building the modeling dataset — a protocol
Pulling the pieces together, here is the order of operations that keeps every later number honest. The sequence matters more than any single step: most leakage is an ordering bug, a transformation that happened one line too early.
# A leakage-safe pipeline. The ORDER is the point, not any one line.
1 define: the prediction target y AND the exact moment t_pred it is made
2 audit: every feature against EQ D1.3 — knowable at t_pred? drop if not
3 dedup: remove duplicate / near-duplicate rows BEFORE splitting
4 split: choose random / time-based / grouped to match the real task
(split FIRST — everything below sees only its own partition)
5 fit prep: fit scalers / imputers / encoders on TRAIN only
6 transform: apply TRAIN-fitted statistics to val and test
7 decontaminate: check no train row (or its duplicate) is in val/test
8 evaluate: tune on val; touch test ONCE; report with EQ D1.2 error bars
Two habits make this durable. First, wrap steps 5–6 in a single object — an sklearn Pipeline or its equivalent — so the preprocessing is re-fit automatically inside every cross-validation fold and can never accidentally span the split. Second, treat decontamination as a first-class step: hash your rows and confirm no training example (or a trivial variant of one) appears in validation or test. This is the same discipline that fine-tuning a language model demands against its eval sets (Vol II · CH 06), and the same arithmetic that information theory gives the loss it minimizes (STATS · 08).
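The hashing check can be sketched with the standard library alone. The normalization used here (strip whitespace, lowercase) is an assumption about what counts as a "trivial variant". Real near-duplicate detection may need something stronger, and the example rows are invented.

```python
import hashlib

def row_fingerprint(row):
    """Hash a row after trivial normalization (strip, lowercase) so near-identical copies collide."""
    canonical = "\x1f".join(str(value).strip().lower() for value in row)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def contaminated_rows(train_rows, eval_rows):
    """Return indices of eval rows whose fingerprint also appears in the training rows."""
    seen = {row_fingerprint(row) for row in train_rows}
    return [i for i, row in enumerate(eval_rows) if row_fingerprint(row) in seen]

# Hypothetical text rows; the second test row differs from a training row only in case and spacing.
train_rows = [("The cat sat.", "animals"), ("Rates rose again.", "finance")]
test_rows = [("A new sentence.", "misc"), ("  the cat sat. ", "Animals")]

print("contaminated test rows:", contaminated_rows(train_rows, test_rows))
```

Running the same function with validation rows in place of test rows covers the other boundary. An empty list from both is the condition step 7 of the protocol asks for.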
This chapter assumed your rows were at least present. They rarely are. The most common quality defect — the empty cell — turns out to carry information of its own: why a value is missing often predicts the value itself, and the wrong imputation quietly biases everything downstream. Next: Data · 02 — Missing Data, where we make the absence itself a feature.
References
- Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).
- Northcutt, C. G., Athalye, A. & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.
- Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.
- Cawley, G. C. & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. (2009). Dataset Shift in Machine Learning.