DATA & FEATURE ENGINEERING · CHAPTER 02 / 05

Missing Data & Imputation

Real datasets arrive with holes, and the holes are rarely random. How a value went missing constrains how you may fill it, and naive mean-imputation degrades the relationships a model depends on. This chapter starts with Rubin's three missingness mechanisms, then works through the fixes: simple fills, kNN, multiple imputation by chained equations (MICE), and model-based strategies, noting where each one fails.

LEVEL: CORE · READING TIME ≈ 22 MIN · BUILDS ON: DATA 01 · INSTRUMENTS: IMPUTATION COMPARATOR · MECHANISM TOY · VARIANCE SHRINKAGE
2.1

Missingness mechanisms: MCAR, MAR, MNAR

Before you fill a single cell, ask why it is empty. Donald Rubin's 1976 framework — still the field's bedrock — sorts the reason into three mechanisms by asking what the probability of a value being missing depends on. Write \(R\) for the missingness indicator (\(R=1\) if a cell is observed, \(0\) if missing), \(X_{\text{obs}}\) for the data you can see, and \(X_{\text{mis}}\) for the values that are hidden.

EQ D2.1 — THE THREE MECHANISMS $$ \begin{aligned} \textbf{MCAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R) \\ \textbf{MAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R \mid X_{\text{obs}}) \\ \textbf{MNAR:}\quad & P(R \mid X_{\text{obs}}, X_{\text{mis}}) \text{ depends on } X_{\text{mis}} \end{aligned} $$
MCAR (missing completely at random): the holes are pure coincidence — a dropped sensor reading, a corrupted row. MAR (missing at random): the chance of missingness depends only on things you did observe — older respondents skip the income question, but you recorded age. MNAR (missing not at random): missingness depends on the hidden value itself — high earners refuse to state their income because it is high. The names are notoriously misleading: MAR is not "random", it is "explainable by observed data".

The distinction is not academic — it dictates what is recoverable:

| Mechanism | Depends on | Complete-case analysis | Imputation |
|---|---|---|---|
| MCAR | nothing | Unbiased (just less efficient) | Optional; any sensible fill is safe |
| MAR | observed \(X_{\text{obs}}\) | Biased in general | Recoverable: condition on the observed predictors |
| MNAR | unseen \(X_{\text{mis}}\) | Biased | Not fixable from the data alone; needs a model of the missingness |
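
The rows of the table above can be checked on synthetic data. The sketch below is a toy under stated assumptions, not a general test: the `age` and `income` columns, the linear relationship between them, and the logistic missingness probabilities are all invented for illustration. It reports the complete-case bias of the income mean under each mechanism, and the bias left after filling holes with a regression on the observed `age`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(0.0, 1.0, n)                  # always observed (standardised)
income = 0.8 * age + rng.normal(0.0, 0.6, n)   # the column that will get holes


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that income is MISSING under each mechanism (illustrative choices).
p_missing = {
    "MCAR": np.full(n, 0.3),                 # depends on nothing
    "MAR": sigmoid(1.5 * age - 1.0),         # depends only on the observed age
    "MNAR": sigmoid(1.5 * income - 1.0),     # depends on the hidden income itself
}

true_mean = income.mean()
design = np.column_stack([np.ones(n), age])
cc_bias = {}    # complete-case bias: mean of observed incomes minus the true mean
reg_bias = {}   # bias after filling holes with a regression of income on age

for name, p in p_missing.items():
    missing = rng.random(n) < p
    cc_bias[name] = income[~missing].mean() - true_mean
    # Fit on complete cases only, then predict the hidden incomes from age.
    beta, *_ = np.linalg.lstsq(design[~missing], income[~missing], rcond=None)
    filled = np.where(missing, design @ beta, income)
    reg_bias[name] = filled.mean() - true_mean
    print(f"{name}: complete-case bias {cc_bias[name]:+.3f}, "
          f"regression-fill bias {reg_bias[name]:+.3f}")
```

In this setup the complete-case mean should be close to the truth only under MCAR, conditioning on `age` should repair the MAR case, and neither approach should remove the MNAR bias, because the information needed to do so was never observed.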

You cannot test MAR vs MNAR from the data. The only difference between them lives in the values you never saw, so no statistic computed on the observed data can distinguish them — this is the contested, uncomfortable heart of the field. In practice you assume MAR (it makes the math tractable and the assumption is often defensible once you condition on enough covariates), then probe sensitivity to MNAR with explicit what-if models. Honesty about this assumption is the difference between an imputation that helps and one that launders bias into a clean-looking table.
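
One common form of the "what-if" probe is a delta adjustment: impute under MAR, then shift the imputed values by a range of offsets and watch how far the estimate moves. The sketch below assumes a hypothetical two-column dataset and arbitrary offsets (`0.0`, `0.5`, `1.0`); in a real analysis the offsets come from domain knowledge about how different the hidden values could plausibly be.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, n)                 # observed predictor
y = 2.0 * x + rng.normal(0.0, 1.0, n)       # column with holes
missing = rng.random(n) < 0.3

# Step 1: a MAR-style fill, regressing y on x over the observed rows.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
y_mar = np.where(missing, A @ beta, y)

# Step 2: suppose the hidden values were really `delta` higher than MAR predicts.
estimates = {}
for delta in (0.0, 0.5, 1.0):
    y_shift = y_mar.copy()
    y_shift[missing] += delta
    estimates[delta] = y_shift.mean()
    print(f"delta = {delta:.1f} -> estimated mean {estimates[delta]:+.3f}")
```

The estimated mean moves by `delta` times the missing fraction. If a conclusion survives every offset you consider plausible, the MAR assumption is doing little work; if it flips, the report should say so.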

Under which missingness mechanism is a complete-case analysis (simply dropping rows with any missing value) guaranteed unbiased — losing only efficiency, not correctness? Answer with the acronym.
Only when missingness is independent of everything, observed and unobserved, are the complete cases a representative subsample of the full data. That is the definition of missing completely at random: MCAR. Under MAR or MNAR the surviving rows are generally a skewed slice, so dropping the incomplete rows biases estimates.
INSTRUMENT D2.1 — MECHANISM TOY · MEAN-IMPUTATION BIAS BY MECHANISM
Readouts: TRUE MEAN · OBSERVED MEAN (FILL) · BIAS
Each column is a value of \(Y\); grey dots are hidden, mint dots observed. Mean-imputation fills holes at the observed mean (mint line) and reports it as the estimate. Under MCAR the observed mean tracks the true mean (dashed) — bias near zero. Switch to MNAR, where the largest values hide themselves, and watch the fill collapse downward: the estimate is biased no matter how cleverly you fill, because the information is gone.
2.2

Simple imputation — and what it costs

The reflex fix is to replace every missing entry of a column with a single constant: the mean (numeric, roughly symmetric), the median (numeric, skewed or outlier-prone), or the mode (categorical). It is one line of code and it is the most over-used tool in applied machine learning.

EQ D2.2 — MEAN IMPUTATION $$ \hat{x}_i = \bar{x}_{\text{obs}} = \frac{1}{|O|}\sum_{j \in O} x_j, \qquad O = \{\, j : x_j \text{ observed} \,\} $$
Every hole in the column gets the same number. This is unbiased for the column mean only under MCAR, and even then it commits two quieter crimes: it shrinks the variance (every imputed point sits exactly on the mean, contributing zero spread) and it destroys correlations (a flat fill is unrelated to every other column). Your downstream model sees a column that is artificially calm and artificially independent.

The variance damage is exact and worth committing to memory. If you mean-impute \(m\) of the \(n\) entries in a column, the population variance of the completed column is the original observed variance scaled down by the fraction of real data:

EQ D2.3 — VARIANCE SHRINKAGE $$ \mathrm{Var}_{\text{filled}} \;=\; \frac{n-m}{n}\,\mathrm{Var}_{\text{obs}}, \qquad \text{so } 40\% \text{ missing} \Rightarrow \text{variance} \times 0.6 $$
The mean of the column is preserved, but the spread is not: \(m\) points contribute a squared deviation of exactly zero. Standard errors computed downstream are too small, confidence intervals too narrow, and significance overstated — the model is confident about data it never had. The collapse is linear in the missing fraction, which is why mean-imputing a column that is 50% empty halves its variance.
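
EQ D2.3 can be confirmed numerically. The sketch below uses an invented normal column with \(n = 1000\) and \(m = 400\); the distribution and seed are arbitrary, and the identity holds for any values as long as both variances are population variances (`ddof=0`, NumPy's default).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 400
x = rng.normal(50.0, 10.0, n)

# Hide exactly m entries, chosen at random.
missing = np.zeros(n, dtype=bool)
missing[rng.choice(n, size=m, replace=False)] = True

var_obs = np.var(x[~missing])                        # variance of the observed values
filled = np.where(missing, x[~missing].mean(), x)    # mean-impute the holes
var_filled = np.var(filled)

ratio = var_filled / var_obs
print(f"variance multiplier: {ratio:.6f}   (n - m) / n = {(n - m) / n:.6f}")
```

With the sample variance (`ddof=1`) the multiplier becomes \((n-m-1)/(n-1)\) instead, which is nearly the same for large columns.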
You mean-impute the column \([\,2,\ 4,\ \text{NA},\ 8\,]\) per EQ D2.2. What single value fills the missing entry?
Average the observed entries only: \(\bar{x}_{\text{obs}} = \dfrac{2 + 4 + 8}{3} = \dfrac{14}{3} = 4.6\overline{6} \approx\) 4.67. Every hole in the column would be filled with this same number.
A column has \(n = 1000\) rows, of which \(m = 400\) are missing and get mean-imputed. By what factor is the column's variance multiplied, relative to the variance of the observed values (EQ D2.3)?
The multiplier is \(\dfrac{n-m}{n} = \dfrac{1000 - 400}{1000} = \dfrac{600}{1000} = \) 0.6. The filled column keeps 60% of the observed variance; the 400 imputed entries all sit on the mean and contribute nothing.
PYTHON · RUNNABLE IN-BROWSER
# Mean vs kNN imputation: RMSE-to-truth on a masked, correlated column
import numpy as np
rng = np.random.default_rng(0)
n = 300
x = rng.normal(0, 1, n)                       # a predictor we always observe
y = 2.0 * x + rng.normal(0, 0.4, n)           # truth: y is strongly tied to x

mask = rng.random(n) < 0.35                    # 35% of y goes missing, independent of x and y (MCAR)
y_obs = y.copy(); y_obs[mask] = np.nan

# (1) mean imputation: one flat number for every hole
y_mean = y_obs.copy()
y_mean[mask] = np.nanmean(y_obs)

# (2) kNN imputation in x-space: average the k nearest observed neighbours' y
def knn_impute(x, y_obs, mask, k=7):
    out = y_obs.copy()
    obs = ~mask
    for i in np.where(mask)[0]:
        d = np.abs(x[obs] - x[i])             # distance in the observed feature
        nn = np.argsort(d)[:k]
        out[i] = y_obs[obs][nn].mean()
    return out
y_knn = knn_impute(x, y_obs, mask)

rmse = lambda a: np.sqrt(np.mean((a[mask] - y[mask])**2))
print(f"mean-impute RMSE to truth : {rmse(y_mean):.3f}")
print(f"kNN-impute  RMSE to truth : {rmse(y_knn):.3f}")
print(f"std(observed y)           : {np.nanstd(y_obs):.3f}")
print(f"std(after mean-impute)    : {np.std(y_mean):.3f}   <- shrunk")
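# plot_scatter is assumed to be a helper supplied by the in-browser runner; skip it elsewhere
if "plot_scatter" in globals():
    plot_scatter(x[mask], y[mask], [0] * int(mask.sum()))   # the points we had to guess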
INSTRUMENT D2.2 — VARIANCE SHRINKAGE · EQ D2.3 · MEAN-FILL COLLAPSES A DISTRIBUTION
Readouts: ORIGINAL VARIANCE · AFTER MEAN-FILL · MULTIPLIER (n−m)/n
The mint curve is the true distribution; the bar at the mean is the spike of imputed points that mean-fill manufactures. As you raise the missing fraction, real spread is replaced by a stack of identical values at the center — the variance multiplier drops exactly as \((n-m)/n\). At 80% missing, four-fifths of the column is a single repeated number masquerading as data.

Median and mode share the same structural flaw — a single constant per column — but resist outliers (median) and apply to categories (mode). They are reasonable defaults for a quick baseline or a column you do not believe carries much signal; they are never the right answer for a feature whose relationships matter.

2.3

kNN imputation: borrow from your neighbours

The first real upgrade is to stop filling with a global constant and start filling with a local one. k-nearest-neighbour imputation finds the \(k\) most similar rows (by distance over the columns you do observe) and fills each hole with their average — a weighted average if you weight by distance. It made its name imputing DNA microarray expression matrices, where it beat row-average filling decisively.

EQ D2.4 — WEIGHTED kNN FILL $$ \hat{x}_{ic} \;=\; \frac{\sum_{j \in N_k(i)} w_{ij}\, x_{jc}}{\sum_{j \in N_k(i)} w_{ij}}, \qquad w_{ij} = \frac{1}{d(i,j) + \varepsilon}, \quad d(i,j) = \!\!\sqrt{\sum_{c' \in O_{ij}} (x_{ic'} - x_{jc'})^2} $$
\(N_k(i)\) are the \(k\) donors nearest to row \(i\); the distance \(d(i,j)\) is computed only over columns \(O_{ij}\) that both rows observe (so missingness does not poison the metric). Because the fill is conditioned on a row's own neighbourhood, kNN preserves local structure and inter-column correlation that a flat mean erases. The price: distances need features on a comparable scale (Chapter 03), it is sensitive to the curse of dimensionality, and scoring is \(O(n^2)\) in the naive form.
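
A direct transcription of EQ D2.4 may make the pieces concrete. This is a minimal sketch, not a production imputer: the function name `knn_fill`, the tiny matrix, and the choice `k=2` are all illustrative assumptions.

```python
import numpy as np


def knn_fill(X, i, c, k=2, eps=1e-8):
    """Fill X[i, c] with the inverse-distance-weighted mean of k donor rows."""
    donors, dists = [], []
    for j in range(X.shape[0]):
        if j == i or np.isnan(X[j, c]):
            continue                                  # a donor must observe column c
        shared = ~np.isnan(X[i]) & ~np.isnan(X[j])    # the co-observed set O_ij
        if not shared.any():
            continue                                  # no common columns: distance undefined
        dists.append(np.sqrt(np.sum((X[i, shared] - X[j, shared]) ** 2)))
        donors.append(j)
    donors, dists = np.array(donors), np.array(dists)
    nearest = np.argsort(dists)[:k]                   # positions of the k nearest donors
    w = 1.0 / (dists[nearest] + eps)                  # w_ij = 1 / (d(i, j) + eps)
    return np.sum(w * X[donors[nearest], c]) / np.sum(w)


X = np.array([
    [1.0, 10.0, np.nan],   # row 0: column 2 is the hole to fill
    [1.1, 10.5, 5.0],
    [0.9, np.nan, 4.0],    # shares only column 0 with row 0
    [5.0, 50.0, 20.0],     # far away: should not be chosen as a donor
    [1.2, 9.5, 6.0],
])
fill = knn_fill(X, i=0, c=2, k=2)
print(f"imputed value: {fill:.3f}")
```

One design point worth noticing: the raw sum over \(O_{ij}\) tends to favour donors that share few columns with the target, since fewer squared terms are added. Some implementations rescale the partial distance by the fraction of columns present to compensate; scikit-learn's `nan_euclidean` metric is one example.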

Two choices decide its behaviour: the neighbourhood size \(k\) and the scale of the features that feed the distance. Small \(k\) is flexible but noisy: a single odd neighbour swings the fill. Large \(k\) smooths toward the global mean and re-introduces the very shrinkage you were trying to avoid. As always with kNN, scale your features first: an unscaled distance is dominated by whichever column happens to have the largest units, and the "nearest" neighbours become an artifact of measurement choice rather than similarity.
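
The scaling point is easy to see on a handful of rows. The example below is hypothetical: the ages and incomes are made up, and plain z-scoring stands in for whichever scaler you would actually use.

```python
import numpy as np

# Columns: age (years), income (dollars). Row 0 is the query row.
rows = np.array([
    [30.0, 50_000.0],   # query
    [31.0, 51_000.0],   # A: similar age, similar income
    [70.0, 50_500.0],   # B: very different age, slightly closer income
    [50.0, 80_000.0],
    [25.0, 20_000.0],
])


def nearest_to_query(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    d = np.sqrt(((data[1:] - data[0]) ** 2).sum(axis=1))
    return 1 + int(np.argmin(d))


raw_nn = nearest_to_query(rows)

# Standardise each column to zero mean and unit variance, then search again.
scaled = (rows - rows.mean(axis=0)) / rows.std(axis=0)
scaled_nn = nearest_to_query(scaled)

print(f"nearest neighbour, raw units   : row {raw_nn}")
print(f"nearest neighbour, standardised: row {scaled_nn}")
```

In raw units the dollar column swamps the age column, so the 70-year-old (row 2) is picked as the closest match to a 30-year-old; after standardising, row 1 wins.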

INSTRUMENT D2.3 — IMPUTATION COMPARATOR · MEAN vs kNN vs REGRESSION · RMSE TO TRUTH
Readouts: METHOD · RMSE TO TRUTH · CORR x↔ŷ RECOVERED
A fixed scatter of \(y\) against \(x\); the largest-\(x\) points have their \(y\) hidden (open circles) and each method guesses them (mint crosses). MEAN fills a flat horizontal line — RMSE high, the \(x\)–\(y\) correlation gone. kNN tracks the local trend; the \(k\) slider trades noise for over-smoothing. REGRESSION fits the line and lands the crosses on it — lowest RMSE here precisely because the truth is linear. Change the method and read how RMSE and recovered correlation move.
2.4

MICE: multiple imputation by chained equations

Every method so far fills a single best guess and then proceeds as if it were ground truth — which pretends the imputed values carry no uncertainty. Multiple imputation fixes that at the root: generate several complete datasets, each with plausibly different fills, analyse each, and pool the results so the extra variance from imputation flows into your final standard errors. MICE (also called fully conditional specification) is the dominant way to generate those datasets.

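The chained-equations idea is elegant. Initialize every hole with a simple fill, then sweep the columns one at a time: for each column with missing data, fit a regression of it on all the other columns (using the rows where that column is observed, with the other columns' holes at their current fills), and draw new imputations for its missing cells from that conditional model. Repeat the sweep until the fills stabilize; because each fill is a random draw, that means their distribution settles, not that individual values stop moving.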

EQ D2.5 — ONE MICE SWEEP (FULLY CONDITIONAL) $$ \text{for each } c:\quad x^{(t+1)}_{\cdot c\,\in\,\text{mis}} \;\sim\; P\!\left(x_{\cdot c} \,\middle|\, x_{\cdot 1}, \ldots, x_{\cdot c-1}, x_{\cdot c+1}, \ldots, x_{\cdot p};\ \hat{\theta}_c\right) $$
Each column gets its own conditional model with fitted parameters \(\hat{\theta}_c\) (linear regression for a continuous column, logistic for binary, and so on) fit on the other columns. Sweeping cycles until convergence: a Gibbs-sampler-style procedure that, under MAR and when the conditional models are mutually compatible, approximates draws from the joint predictive distribution of the missing data. Run it \(M\) times with different random draws to get \(M\) complete datasets. Drawing from the conditional distribution, not just taking its mean, is what injects honest uncertainty: take the conditional mean instead and you get a sharper point estimate but lose the variance MICE exists to preserve.
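
The difference between filling with the conditional mean and drawing from the conditional distribution shows up directly in the spread of the completed column. The sketch below is a simplification on invented data: it adds only residual noise to the regression prediction, whereas a proper multiple-imputation draw would also sample the regression parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, n)                 # fully observed predictor
y = 2.0 * x + rng.normal(0.0, 1.0, n)       # column that will get holes
missing = rng.random(n) < 0.4

# Fit y ~ x on the observed rows and estimate the residual spread.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
resid_sd = np.std(y[~missing] - A[~missing] @ beta, ddof=2)

pred = A @ beta
y_cond_mean = np.where(missing, pred, y)                               # fill on the line
y_draw = np.where(missing, pred + rng.normal(0.0, resid_sd, n), y)     # line plus noise

std_true, std_mean, std_draw = np.std(y), np.std(y_cond_mean), np.std(y_draw)
print(f"std of true column        : {std_true:.3f}")
print(f"std after conditional mean: {std_mean:.3f}")
print(f"std after random draws    : {std_draw:.3f}")
```

The conditional-mean fill places every imputed point exactly on the fitted line, so the completed column is too tidy; adding the residual noise back restores a spread close to the truth, at the cost of a worse per-cell RMSE.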

The payoff is the pooling step, Rubin's rules: average the \(M\) point estimates, and combine their variances so the total reflects both within-imputation and between-imputation uncertainty:

EQ D2.6 — RUBIN'S POOLING $$ \bar{Q} = \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m, \qquad T = \underbrace{\frac{1}{M}\sum_{m=1}^{M} U_m}_{\text{within } \bar{U}} \;+\; \left(1 + \tfrac{1}{M}\right) \underbrace{\frac{1}{M-1}\sum_{m=1}^{M}\!\big(\hat{Q}_m - \bar{Q}\big)^2}_{\text{between } B} $$
\(\hat{Q}_m\) is your estimate (a coefficient, a mean) from imputed dataset \(m\); \(U_m\) is its own variance. The total variance \(T = \bar{U} + (1 + 1/M)\,B\) adds the between-imputation spread \(B\), the part single imputation throws away. The \((1 + 1/M)\) factor corrects for using a finite number of imputations. This is why 5 to 20 imputations (\(5 \le M \le 20\)) beat one perfect-looking fill: the disagreement between them is the uncertainty you would otherwise hide.
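
The pooling rule is short enough to write out. In this sketch the function name `pool_rubin` and the five estimates and variances are made up for illustration; in practice each pair comes from running the same analysis on one imputed dataset.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules (EQ D2.6)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = q.size
    q_bar = q.mean()                       # pooled point estimate
    u_bar = u.mean()                       # within-imputation variance
    b = q.var(ddof=1)                      # between-imputation variance, divisor M - 1
    t = u_bar + (1.0 + 1.0 / M) * b        # total variance
    return q_bar, u_bar, b, t


# Hypothetical results from M = 5 imputed datasets.
q_bar, u_bar, b, t = pool_rubin([2.1, 1.9, 2.0, 2.2, 1.8], [0.04] * 5)
print(f"pooled estimate {q_bar:.3f}, within {u_bar:.3f}, between {b:.3f}, total {t:.3f}")
print(f"pooled SE {np.sqrt(t):.3f} vs single-imputation SE {np.sqrt(u_bar):.3f}")
```

Here \(\bar{Q} = 2.0\), \(\bar{U} = 0.04\), \(B = 0.025\), and \(T = 0.04 + 1.2 \times 0.025 = 0.07\), so the pooled standard error is about a third larger than any single imputed dataset would report.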
PYTHON · RUNNABLE IN-BROWSER
# Mini-MICE (deterministic variant): each sweep regresses a column on the others and
# fills with the conditional mean, with no random draw; watch it converge
import numpy as np
rng = np.random.default_rng(1)
n, p = 200, 3
# correlated columns: a shared factor plus noise
f = rng.normal(0, 1, (n, 1))
X = f * np.array([1.0, 0.8, -0.6]) + rng.normal(0, 0.5, (n, p))

M = rng.random((n, p)) < 0.20                  # 20% missing, scattered
Xm = X.copy(); Xm[M] = np.nan
col_mean = np.nanmean(Xm, axis=0)
Xf = Xm.copy()
for c in range(p):                            # step 0: mean-init every hole
    Xf[M[:, c], c] = col_mean[c]

for sweep in range(8):                         # chained equations
    prev = Xf.copy()
    for c in range(p):                         # regress column c on the rest
        rows = M[:, c]
        if not rows.any():
            continue
        others = [k for k in range(p) if k != c]
        A = np.column_stack([np.ones(n), Xf[:, others]])
        beta, *_ = np.linalg.lstsq(A[~rows], Xf[~rows, c], rcond=None)
        Xf[rows, c] = A[rows] @ beta           # conditional-mean fill
    delta = np.abs(Xf - prev)[M].mean()
    print(f"sweep {sweep+1}: mean change in filled cells = {delta:.5f}")

err = np.sqrt(np.mean((Xf[M] - X[M])**2))
print(f"\nfinal RMSE of MICE fills to truth: {err:.3f}")
print("change shrinks toward 0 -> the chained equations reached a fixed point.")

The honest caveat. Chained equations specify each column's conditional separately, so there is no guarantee that a coherent joint distribution exists that matches all of them. Yet the procedure is remarkably robust in practice, and it is the approach implemented by R's mice package and scikit-learn's IterativeImputer (which returns a single imputation unless sample_posterior=True is set). Convergence is monitored by eye (trace plots of imputed means across sweeps), not by a hard stopping rule.

2.5

Model-based & indicator strategies; choosing in practice

Two more tools round out the kit. Model-based imputation fits a single probabilistic model of the whole feature matrix and reads the missing values off it — Gaussian/EM imputation (maximum-likelihood under a multivariate-normal assumption), low-rank matrix completion (SVD/soft-impute, the engine behind recommender systems), and increasingly tree- and neural-network-based imputers. The missing-indicator method adds a binary "was-this-missing" column alongside the (imputed) feature, letting a flexible model learn whether the fact of missingness is itself predictive — which it very often is under MNAR.
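
The missing-indicator method amounts to widening the feature matrix. A minimal sketch, assuming a small made-up matrix and a median fill (any fill would do, since the flag columns carry the missingness signal):

```python
import numpy as np

X = np.array([
    [1.0, np.nan],
    [2.0, 5.0],
    [np.nan, 7.0],
    [4.0, 9.0],
])

was_missing = np.isnan(X)                          # one boolean flag per cell
col_median = np.nanmedian(X, axis=0)               # per-column fill value
X_filled = np.where(was_missing, col_median, X)    # imputed features

# Append one 0/1 indicator column per original feature.
X_aug = np.hstack([X_filled, was_missing.astype(float)])
print(X_aug)
```

A model trained on `X_aug` can learn a separate effect for "this value was absent". In scikit-learn, `SimpleImputer(add_indicator=True)` produces a similar layout, adding indicators only for features that had missing values at fit time.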

| Strategy | Preserves variance | Preserves correlation | Quantifies uncertainty | Reach for it when… |
|---|---|---|---|---|
| Mean / median / mode | no | no | no | Quick baseline; a low-signal column; MCAR and you only need a point estimate |
| kNN | partly | yes (local) | no | Nonlinear local structure, modest dimensionality, features already scaled |
| MICE | yes | yes | yes | Inference, reported standard errors, MAR data (the statistical gold standard) |
| Model-based (EM / low-rank) | yes | yes | partly | A defensible global model; wide sparse matrices (completion) |
| Missing-indicator | n/a | adds signal | no | Suspected MNAR; tree/GBM models that can use the flag directly |

A few rules survive contact with reality. Impute inside the cross-validation fold, never before — fitting the imputer on the full dataset leaks test information into training and inflates your scores. Match the method to the mechanism: MCAR forgives anything, MAR rewards conditioning on observed predictors (kNN, MICE, model-based), MNAR demands you model the missingness explicitly and report a sensitivity analysis. And when you need honest standard errors, single imputation is not enough — multiple imputation is the only one of these that carries the uncertainty of the guess into the final answer. Some learners (notably gradient-boosted trees like XGBoost and LightGBM) handle NaN natively by learning a default split direction, which is frequently the strongest baseline of all — try it before you impute.
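
The first rule, imputing inside the fold, can be sketched with a plain mean fill. Everything here is illustrative: the single synthetic column, the five folds, and the mean as the imputer. The point is only where the fill statistic is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_folds = 200, 5
x = rng.normal(10.0, 2.0, n)
x[rng.random(n) < 0.25] = np.nan            # scatter holes through the column

folds = np.array_split(rng.permutation(n), n_folds)
fills = []
for f in range(n_folds):
    test_idx = folds[f]
    train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])

    fill = np.nanmean(x[train_idx])         # "fit" the imputer on training rows only
    x_train = np.where(np.isnan(x[train_idx]), fill, x[train_idx])
    x_test = np.where(np.isnan(x[test_idx]), fill, x[test_idx])   # reuse the training fill

    fills.append(fill)
    # ... fit the model on x_train and score it on x_test here ...
    print(f"fold {f}: fill value learned from training rows = {fill:.4f}")
```

Each fold learns its own fill value and the held-out rows never contribute to it. With scikit-learn, placing the imputer inside a `Pipeline` that is passed to the cross-validation routine gives the same guarantee.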

PITFALLS

The four ways imputation goes wrong: (1) imputing before the train/test split — leakage that makes offline metrics fiction; (2) mean-filling a feature whose correlations matter — quiet variance collapse and washed-out relationships; (3) assuming MAR when the value hides itself — MNAR bias dressed up as a tidy table; (4) reporting single-imputation standard errors as if the fill were certain — overconfident intervals.

NEXT

Once the holes are filled, the values still need to be made comparable. kNN and most distance- or gradient-based methods assume features share a scale and that categories are numbers a model can read — Chapter 03 covers encoding categoricals and scaling numerics, the step that makes everything in this chapter actually work.

2.R

References

  1. Rubin, D. B. (1976). Inference and Missing Data. Biometrika 63(3):581–592. The paper that formalised missing at random and laid the framework behind MCAR, MAR, and MNAR.
  2. Little, R. J. A. & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley — the canonical textbook on mechanisms, likelihood-based, and multiple imputation.
  3. van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press — the practical, freely-readable reference for MICE / chained equations.
  4. Troyanskaya, O. et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525 — the kNN-impute (KNNimpute) paper.
  5. White, I. R., Royston, P. & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4):377–399 — practical guidance on running and pooling MICE.
  6. scikit-learn developers. Imputation of missing values (User Guide). Official docs — SimpleImputer, KNNImputer, and IterativeImputer (MICE).