Missingness mechanisms: MCAR, MAR, MNAR
Before you fill a single cell, ask why it is empty. Donald Rubin's 1976 framework — still the field's bedrock — sorts the reason into three mechanisms by asking what the probability of a value being missing depends on. Write \(R\) for the missingness indicator (\(R=1\) if a cell is observed, \(0\) if missing), \(X_{\text{obs}}\) for the data you can see, and \(X_{\text{mis}}\) for the values that are hidden.
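In that notation, the three mechanisms are usually written as conditions on the distribution of \(R\). The display below is a standard textbook formalisation, sketched here for orientation; it glosses over the "for all values" qualifiers that a rigorous statement needs.

```latex
\begin{aligned}
\text{MCAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R \mid X_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) \ \text{still depends on } X_{\text{mis}} \text{ after conditioning on } X_{\text{obs}}
\end{aligned}
```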
The distinction is not academic — it dictates what is recoverable:
| Mechanism | Missingness depends on | Complete-case analysis | Imputation |
|---|---|---|---|
| MCAR | Nothing | Unbiased (just less efficient) | Optional; a simple fill does not bias the column mean |
| MAR | Observed \(X_{\text{obs}}\) | Biased in general | Recoverable: condition on the observed predictors |
| MNAR | Unseen \(X_{\text{mis}}\) | Biased | Not fixable from the data alone; needs a model of the missingness |
You cannot test MAR vs MNAR from the data. The only difference between them lives in the values you never saw, so no statistic computed on the observed data can distinguish them — this is the contested, uncomfortable heart of the field. In practice you assume MAR (it makes the math tractable and the assumption is often defensible once you condition on enough covariates), then probe sensitivity to MNAR with explicit what-if models. Honesty about this assumption is the difference between an imputation that helps and one that launders bias into a clean-looking table.
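A small simulation can make the table concrete. The sketch below is an illustration under invented assumptions (a linear relation between a hypothetical observed `x` and a partly hidden `y`, and logistic missingness probabilities); it compares the complete-case mean of `y` with a regression-adjusted mean that conditions on `x`. The true mean of `y` is 0 by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
x = rng.normal(0.0, 1.0, n)        # always observed
y = x + rng.normal(0.0, 1.0, n)    # partly hidden; its true mean is 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# True = y is missing in that row. Only the driver of the probability differs.
missing = {
    "MCAR": rng.random(n) < 0.5,               # depends on nothing
    "MAR": rng.random(n) < sigmoid(2.0 * x),   # depends on the observed x
    "MNAR": rng.random(n) < sigmoid(2.0 * y),  # depends on the hidden y itself
}

cc_mean, adj_mean = {}, {}
for name, miss in missing.items():
    obs = ~miss
    cc_mean[name] = y[obs].mean()                      # complete-case estimate
    slope, intercept = np.polyfit(x[obs], y[obs], 1)   # fit y ~ x on observed rows
    adj_mean[name] = (intercept + slope * x).mean()    # predict y for every row, then average
    print(f"{name}: complete-case mean = {cc_mean[name]:+.3f}, "
          f"adjusted for x = {adj_mean[name]:+.3f}")
```

Under these assumptions, conditioning on `x` should pull the MAR estimate back toward 0, while the MNAR estimate stays biased because the regression itself was fitted on a selected sample.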
Simple imputation — and what it costs
The reflex fix is to replace every missing entry of a column with a single constant: the mean (numeric, roughly symmetric), the median (numeric, skewed or outlier-prone), or the mode (categorical). It is one line of code and it is the most over-used tool in applied machine learning.
The variance damage is exact and worth committing to memory. If you mean-impute \(m\) of the \(n\) entries in a column, the population variance of the completed column is the original observed variance scaled down by the fraction of real data:
\[ \hat\sigma^2_{\text{imputed}} = \frac{n-m}{n}\,\hat\sigma^2_{\text{observed}} \]
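A short derivation shows why the factor is exact rather than approximate. The symbols \(\bar{x}_{\text{obs}}\), \(\hat\sigma^2_{\text{observed}}\) and \(\hat\sigma^2_{\text{imputed}}\) are notation introduced for this sketch. Filling with the observed mean leaves the column mean unchanged, and every filled cell contributes a squared deviation of zero, so only the divisor changes.

```latex
\begin{aligned}
\bar{x}_{\text{imputed}}
  &= \frac{(n-m)\,\bar{x}_{\text{obs}} + m\,\bar{x}_{\text{obs}}}{n} = \bar{x}_{\text{obs}} \\[4pt]
\hat\sigma^2_{\text{imputed}}
  &= \frac{1}{n}\Bigl[\sum_{i \in \text{obs}} (x_i - \bar{x}_{\text{obs}})^2 + m \cdot 0\Bigr]
   = \frac{n-m}{n}\cdot\frac{1}{n-m}\sum_{i \in \text{obs}} (x_i - \bar{x}_{\text{obs}})^2
   = \frac{n-m}{n}\,\hat\sigma^2_{\text{observed}}
\end{aligned}
```

For example, with 35% of a column filled, the variance keeps 65% of its observed value and the standard deviation keeps \(\sqrt{0.65} \approx 0.81\) of it.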
# Mean vs kNN imputation: RMSE-to-truth on a masked, correlated column
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.normal(0, 1, n)                # a predictor we always observe
y = 2.0 * x + rng.normal(0, 0.4, n)    # truth: y is strongly tied to x
mask = rng.random(n) < 0.35            # 35% of y goes missing, independent of x and y (MCAR)
y_obs = y.copy()
y_obs[mask] = np.nan

# (1) mean imputation: one flat number for every hole
y_mean = y_obs.copy()
y_mean[mask] = np.nanmean(y_obs)

# (2) kNN imputation in x-space: average the k nearest observed neighbours' y
def knn_impute(x, y_obs, mask, k=7):
    out = y_obs.copy()
    obs = ~mask
    for i in np.where(mask)[0]:
        d = np.abs(x[obs] - x[i])      # distance in the observed feature
        nn = np.argsort(d)[:k]         # positions of the k closest observed rows
        out[i] = y_obs[obs][nn].mean()
    return out

y_knn = knn_impute(x, y_obs, mask)

def rmse(filled):
    # error on the masked cells only, i.e. the ones we had to guess
    return np.sqrt(np.mean((filled[mask] - y[mask]) ** 2))

print(f"mean-impute RMSE to truth : {rmse(y_mean):.3f}")
print(f"kNN-impute RMSE to truth  : {rmse(y_knn):.3f}")
print(f"std(observed y)           : {np.nanstd(y_obs):.3f}")
print(f"std(after mean-impute)    : {np.std(y_mean):.3f} <- shrunk")
# plot_scatter(x[mask], y[mask], [0] * mask.sum())  # the points we had to guess; plot_scatter is not defined in this listing
Median and mode share the same structural flaw — a single constant per column — but resist outliers (median) and apply to categories (mode). They are reasonable defaults for a quick baseline or a column you do not believe carries much signal; they are never the right answer for a feature whose relationships matter.
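To see the outlier resistance in numbers, here is a minimal sketch with a made-up `income` column containing one extreme value and a made-up categorical `colour` column. Both columns and their values are hypothetical, chosen only for illustration.

```python
import numpy as np
from collections import Counter

# Numeric column with two holes and one extreme outlier (900)
income = np.array([31.0, 35.0, np.nan, 38.0, 40.0, np.nan, 42.0, 900.0])
mean_fill = np.nanmean(income)       # dragged upward by the outlier
median_fill = np.nanmedian(income)   # stays with the bulk of the data
income_filled = np.where(np.isnan(income), median_fill, income)

# Categorical column: fill with the most frequent observed category
colour = ["red", "blue", None, "blue", "green", None, "blue"]
mode_fill = Counter(c for c in colour if c is not None).most_common(1)[0][0]
colour_filled = [mode_fill if c is None else c for c in colour]

print(f"mean fill   = {mean_fill:.1f}")
print(f"median fill = {median_fill:.1f}")
print(f"mode fill   = {mode_fill}")
```

The mean fill lands far above every typical value, while the median fill sits among them; either way, every hole in the column receives the same constant.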
kNN imputation: borrow from your neighbours
The first real upgrade is to stop filling with a global constant and start filling with a local one. k-nearest-neighbour imputation finds the \(k\) most similar rows (by distance over the columns you do observe) and fills each hole with their average — a weighted average if you weight by distance. It made its name imputing DNA microarray expression matrices, where it beat row-average filling decisively.
Two choices decide its behaviour: the number of neighbours \(k\) and how distance is measured. Small \(k\) is flexible but noisy, since a single odd neighbour swings the fill; large \(k\) smooths toward the global mean and re-introduces the very shrinkage you were trying to avoid. As always with kNN, you must scale your features first: an unscaled distance is dominated by whichever column happens to have the largest units, and the "nearest" neighbours become an artifact of measurement choice rather than similarity.
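The scaling point is easy to demonstrate. In the sketch below, the columns `age`, `income` and `score`, and the helper `knn_fill`, are hypothetical names invented for the example; `score` depends on `age` only, but `income` is measured in units roughly a thousand times larger. The comparison is between neighbours found on raw features and on standardised ones.

```python
import numpy as np

def knn_fill(features, target, k=5, weighted=False):
    """Fill NaNs in `target` from the k nearest donor rows (Euclidean distance over `features`)."""
    out = target.copy()
    donors = ~np.isnan(target)
    for i in np.where(~donors)[0]:
        d = np.linalg.norm(features[donors] - features[i], axis=1)
        nn = np.argsort(d)[:k]
        # optional inverse-distance weights; otherwise a plain average
        w = 1.0 / (d[nn] + 1e-12) if weighted else np.ones(len(nn))
        out[i] = np.average(target[donors][nn], weights=w)
    return out

rng = np.random.default_rng(0)
n = 400
age = rng.uniform(20, 70, n)                # years
income = rng.uniform(20_000, 120_000, n)    # dollars: far larger units
truth = 0.1 * age + rng.normal(0, 0.3, n)   # target depends on age only
hole = rng.random(n) < 0.25
score = truth.copy()
score[hole] = np.nan

raw = np.column_stack([age, income])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # z-score each column

def rmse(filled):
    return np.sqrt(np.mean((filled[hole] - truth[hole]) ** 2))

filled_raw = knn_fill(raw, score)
filled_scaled = knn_fill(scaled, score)
rmse_raw, rmse_scaled = rmse(filled_raw), rmse(filled_scaled)
print(f"RMSE with raw features    : {rmse_raw:.3f}")
print(f"RMSE with scaled features : {rmse_scaled:.3f}")
```

On raw features the neighbours are effectively chosen by `income` alone, so the fill ignores the one column that matters. Passing `weighted=True` switches to the distance-weighted average.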
MICE: multiple imputation by chained equations
Every method so far fills a single best guess and then proceeds as if it were ground truth — which pretends the imputed values carry no uncertainty. Multiple imputation fixes that at the root: generate several complete datasets, each with plausibly different fills, analyse each, and pool the results so the extra variance from imputation flows into your final standard errors. MICE (also called fully conditional specification) is the dominant way to generate those datasets.
The chained-equations idea is elegant. Initialize every hole with a simple fill, then sweep the columns one at a time: for each column with missing data, regress it on all the others using the currently-filled rows, and draw new imputations from that conditional model. Repeat the sweep until the fills stabilise; with random draws they never freeze exactly, so what settles is their distribution.
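The payoff is the pooling step, Rubin's rules: average the \(M\) point estimates, and combine their variances so the total reflects both within-imputation and between-imputation uncertainty:
\[ \bar{Q} = \frac{1}{M}\sum_{j=1}^{M} \hat{Q}_j, \qquad T = \bar{U} + \left(1 + \frac{1}{M}\right) B \]
where \(\bar{U} = \frac{1}{M}\sum_{j} U_j\) is the average within-imputation variance and \(B = \frac{1}{M-1}\sum_{j} (\hat{Q}_j - \bar{Q})^2\) is the between-imputation variance.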
# Mini-MICE: iteratively regress each column on the others; watch it converge
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# correlated columns: a shared factor plus noise
f = rng.normal(0, 1, (n, 1))
X = f * np.array([1.0, 0.8, -0.6]) + rng.normal(0, 0.5, (n, p))
M = rng.random((n, p)) < 0.20          # 20% missing, scattered (True = missing)
Xm = X.copy()
Xm[M] = np.nan

col_mean = np.nanmean(Xm, axis=0)
Xf = Xm.copy()
for c in range(p):                     # step 0: mean-init every hole
    Xf[M[:, c], c] = col_mean[c]

for sweep in range(8):                 # chained equations
    prev = Xf.copy()
    for c in range(p):                 # regress column c on the rest
        rows = M[:, c]
        if not rows.any():
            continue
        others = [k for k in range(p) if k != c]
        A = np.column_stack([np.ones(n), Xf[:, others]])
        beta, *_ = np.linalg.lstsq(A[~rows], Xf[~rows, c], rcond=None)
        Xf[rows, c] = A[rows] @ beta   # conditional-mean fill (no random draw)
    delta = np.abs(Xf - prev)[M].mean()
    print(f"sweep {sweep+1}: mean change in filled cells = {delta:.5f}")

err = np.sqrt(np.mean((Xf[M] - X[M])**2))
print(f"\nfinal RMSE of MICE fills to truth: {err:.3f}")
print("change shrinks toward 0 -> the chained equations reached a fixed point.")
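The listing above produces one deterministic fill, so it cannot feed Rubin's rules on its own. The sketch below shows the pooling step on a simpler two-column problem. It makes several assumptions that are not from the text: the helper name `pool_rubin`, the estimand (the mean of `y`), and a bootstrap-then-draw step as one simple way to let the imputations vary in both the regression parameters and the residual noise.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                       # pooled point estimate
    u_bar = u.mean()                       # within-imputation variance
    b = q.var(ddof=1)                      # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b    # total variance of the pooled estimate
    return q_bar, u_bar, b, total

rng = np.random.default_rng(7)
n, n_imputations = 500, 20
x = rng.normal(0, 1, n)                         # always observed
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)         # partly missing
miss = rng.random(n) < 1 / (1 + np.exp(-x))     # MAR: depends on x only
obs_idx = np.where(~miss)[0]

estimates, variances = [], []
for _ in range(n_imputations):
    # resample observed rows so the regression parameters vary between imputations
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y[boot], 1)
    resid_sd = np.std(y[boot] - (intercept + slope * x[boot]), ddof=2)
    # draw each fill from the fitted conditional model, noise included
    y_fill = y.copy()
    y_fill[miss] = intercept + slope * x[miss] + rng.normal(0, resid_sd, miss.sum())
    estimates.append(y_fill.mean())             # estimate on this completed dataset
    variances.append(y_fill.var(ddof=1) / n)    # its squared standard error

q_bar, u_bar, b, total = pool_rubin(estimates, variances)
print(f"pooled mean of y      : {q_bar:.3f}")
print(f"naive SE (within only): {np.sqrt(u_bar):.3f}")
print(f"pooled SE (Rubin)     : {np.sqrt(total):.3f}")
```

The pooled standard error is never smaller than the within-only one; the difference is the uncertainty that a single imputation would have hidden.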
The honest caveat. Chained equations specify each column's conditional separately, so there is no guarantee that a coherent joint distribution exists that matches all of them. Even so, the procedure is remarkably robust in practice, and it is the approach implemented by R's mice package and scikit-learn's IterativeImputer. Convergence is monitored by eye (trace plots of imputed means across sweeps), not by a hard stopping rule.
Model-based & indicator strategies; choosing in practice
Two more tools round out the kit. Model-based imputation fits a single probabilistic model of the whole feature matrix and reads the missing values off it — Gaussian/EM imputation (maximum-likelihood under a multivariate-normal assumption), low-rank matrix completion (SVD/soft-impute, the engine behind recommender systems), and increasingly tree- and neural-network-based imputers. The missing-indicator method adds a binary "was-this-missing" column alongside the (imputed) feature, letting a flexible model learn whether the fact of missingness is itself predictive — which it very often is under MNAR.
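The missing-indicator method takes only a few lines. In this sketch the `income` column and its MNAR mechanism (higher earners withhold more often) are invented for illustration; the point is only the shape of the resulting feature matrix and the fact that the flag carries information about the hidden value.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
income = rng.lognormal(mean=10.0, sigma=0.5, size=n)

# Assumed MNAR mechanism: the larger the income, the likelier it is withheld
p_hide = 1.0 / (1.0 + np.exp(-3.0 * (np.log(income) - 10.5)))
hidden = rng.random(n) < p_hide
income_obs = np.where(hidden, np.nan, income)

was_missing = np.isnan(income_obs).astype(float)    # the indicator column
income_filled = np.where(np.isnan(income_obs),
                         np.nanmedian(income_obs),  # simple fill for the feature itself
                         income_obs)
X = np.column_stack([income_filled, was_missing])   # imputed feature + flag

# How different were the hidden values from the visible ones (log scale)?
gap = np.log(income[hidden]).mean() - np.log(income[~hidden]).mean()
print(f"rows flagged as missing: {int(was_missing.sum())} of {n}")
print(f"mean log-income gap (hidden minus observed): {gap:.3f}")
```

A downstream model can then split or weight on the flag column. scikit-learn's `SimpleImputer` offers an `add_indicator` option that builds the same kind of column.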
| Strategy | Preserves variance | Preserves correlation | Quantifies uncertainty | Reach for it when… |
|---|---|---|---|---|
| Mean / median / mode | no | no | no | Quick baseline; a low-signal column; MCAR and you only need a point estimate |
| kNN | partly | yes (local) | no | Nonlinear local structure, modest dimensionality, features already scaled |
| MICE | yes | yes | yes | Inference, reported standard errors, MAR data — the statistical gold standard |
| Model-based (EM / low-rank) | yes | yes | partly | A defensible global model; wide sparse matrices (completion) |
| Missing-indicator | n/a | adds signal | no | Suspected MNAR; tree/GBM models that can use the flag directly |
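For the low-rank completion row, a bare-bones soft-impute loop gives the flavour: keep the observed cells, fill the rest from the current low-rank estimate, and shrink the singular values on each pass. This is a minimal sketch on a synthetic, exactly rank-2 matrix; the function name `soft_impute`, the penalty `lam=1.0` and the fixed iteration count are illustrative choices, not tuned values.

```python
import numpy as np

def soft_impute(X, lam=1.0, n_iter=200):
    """Fill NaNs in X by iterating a soft-thresholded SVD (a simple soft-impute sketch)."""
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)                        # current low-rank estimate
    for _ in range(n_iter):
        filled = np.where(observed, X, Z)       # real data where seen, current guess elsewhere
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)            # soft-threshold the singular values
        Z = (U * s) @ Vt                        # rebuild the shrunken low-rank matrix
    return np.where(observed, X, Z)

rng = np.random.default_rng(11)
truth = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))   # exactly rank 2
hole = rng.random(truth.shape) < 0.3
X = truth.copy()
X[hole] = np.nan

completed = soft_impute(X)
mean_filled = np.where(hole, np.nanmean(X, axis=0), X)         # column-mean baseline

def rmse(filled):
    return np.sqrt(np.mean((filled[hole] - truth[hole]) ** 2))

rmse_soft, rmse_mean = rmse(completed), rmse(mean_filled)
print(f"soft-impute RMSE : {rmse_soft:.3f}")
print(f"column-mean RMSE : {rmse_mean:.3f}")
```

In practice `lam` would be chosen by holding out some observed cells, and the loop would stop on a convergence tolerance rather than a fixed count.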
A few rules survive contact with reality. Impute inside the cross-validation fold, never before — fitting the imputer on the full dataset leaks test information into training and inflates your scores. Match the method to the mechanism: MCAR forgives anything, MAR rewards conditioning on observed predictors (kNN, MICE, model-based), MNAR demands you model the missingness explicitly and report a sensitivity analysis. And when you need honest standard errors, single imputation is not enough — multiple imputation is the only one of these that carries the uncertainty of the guess into the final answer. Some learners (notably gradient-boosted trees like XGBoost and LightGBM) handle NaN natively by learning a default split direction, which is frequently the strongest baseline of all — try it before you impute.
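The fold rule comes down to where the fill statistics are computed. A minimal sketch, using a hypothetical helper `impute_with_train_stats` and a synthetic matrix: the per-column means are learned from the training rows of each fold and then reused, unchanged, on that fold's held-out rows.

```python
import numpy as np

def impute_with_train_stats(X_train, X_test):
    """Learn per-column fill values on the training rows and apply them to both splits."""
    fill = np.nanmean(X_train, axis=0)                   # never sees the test rows
    X_train_filled = np.where(np.isnan(X_train), fill, X_train)
    X_test_filled = np.where(np.isnan(X_test), fill, X_test)
    return X_train_filled, X_test_filled, fill

rng = np.random.default_rng(5)
n = 200
X = rng.normal(50.0, 10.0, (n, 3))
X[rng.random((n, 3)) < 0.2] = np.nan                     # scatter 20% missing cells

folds = np.array_split(rng.permutation(n), 5)            # 5 disjoint test folds
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    X_tr, X_te, fill = impute_with_train_stats(X[train_idx], X[test_idx])
    # ... fit the model on X_tr and score it on X_te here ...
    print(f"fold {i}: fill values from {len(train_idx)} training rows = {np.round(fill, 2)}")
```

The leaky version would call `np.nanmean(X, axis=0)` once, before the loop. In scikit-learn the same discipline usually comes from placing the imputer inside a `Pipeline` so that cross-validation refits it per fold.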
The four ways imputation goes wrong: (1) imputing before the train/test split — leakage that makes offline metrics fiction; (2) mean-filling a feature whose correlations matter — quiet variance collapse and washed-out relationships; (3) assuming MAR when the value hides itself — MNAR bias dressed up as a tidy table; (4) reporting single-imputation standard errors as if the fill were certain — overconfident intervals.
Once the holes are filled, the values still need to be made comparable. kNN and most distance- or gradient-based methods assume features share a scale and that categories are numbers a model can read — Chapter 03 covers encoding categoricals and scaling numerics, the step that makes everything in this chapter actually work.
References
- Rubin, D. B. (1976). Inference and Missing Data.
- Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.).
- van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.).
- Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays.
- White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice.
- scikit-learn developers. Imputation of missing values (User Guide).