Distribution shift — covariate, label & concept drift
Supervised learning rests on one quiet assumption: the data you serve is drawn from the same distribution as the data you trained on. Write the joint distribution of inputs \(x\) and labels \(y\) as \(P(x, y) = P(y \mid x)\,P(x)\). Training estimates \(\hat{f}\) against a fixed \(P_{\text{train}}\); production feeds it some \(P_{\text{prod}}\). When the two diverge, the model is being asked a question it was never taught to answer. There are three textbook ways for them to diverge, and they are not interchangeable.
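The three cases can be told apart by asking which factor of the joint moves. The summary below uses the standard definitions from the dataset-shift literature; the label-shift factorisation \(P(x, y) = P(x \mid y)\,P(y)\) is not written out above and is added here as an assumption about what the text intends.

```latex
\begin{aligned}
\text{Covariate shift:}\quad & P_{\text{prod}}(x) \neq P_{\text{train}}(x),
  & P_{\text{prod}}(y \mid x) &= P_{\text{train}}(y \mid x) \\
\text{Label shift:}\quad & P_{\text{prod}}(y) \neq P_{\text{train}}(y),
  & P_{\text{prod}}(x \mid y) &= P_{\text{train}}(x \mid y) \\
\text{Concept drift:}\quad & P_{\text{prod}}(y \mid x) \neq P_{\text{train}}(y \mid x)
\end{aligned}
```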
The distinction is operational, not academic, because it dictates the fix. Covariate shift can sometimes be corrected by importance weighting — reweight training examples by \(w(x) = P_{\text{prod}}(x)/P_{\text{train}}(x)\) so the old data resembles the new — without any fresh labels. Label shift is corrected by re-estimating the priors. Concept drift admits no such trick: the mapping moved, so the model must relearn it from freshly labelled data. Worse, concept drift can be real (the world genuinely changed — a new fraud tactic) or virtual (only \(P(x)\) moved, \(P(y\mid x)\) is intact); Gama et al. carefully separate the two, because virtual drift may need nothing more than a wider training set.
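Both label-free corrections can be sketched in a few lines. This is a toy illustration, not a production recipe: the two Gaussian densities are assumed known (in practice the ratio \(w(x)\) has to be estimated), the production priors are assumed to have been estimated already, and the helper names `log_normal_pdf` and `adjust_for_label_shift` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Covariate shift: importance weighting --------------------------------
# Assumed toy densities: P_train(x) = N(0, 1), P_prod(x) = N(1, 1).
x_train = rng.normal(0.0, 1.0, 20_000)

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# w(x) = P_prod(x) / P_train(x), computed in log space for stability.
w = np.exp(log_normal_pdf(x_train, 1.0, 1.0) - log_normal_pdf(x_train, 0.0, 1.0))

plain_mean = x_train.mean()                       # near 0, the training mean
weighted_mean = np.sum(w * x_train) / np.sum(w)   # near 1, the production mean

# --- Label shift: re-estimating the priors --------------------------------
def adjust_for_label_shift(posterior, train_prior, prod_prior):
    """Rescale P_train(y|x) by prod_prior / train_prior, then renormalise.

    Valid only if P(x|y) is unchanged, which is the label-shift assumption.
    """
    scaled = posterior * (prod_prior / train_prior)
    return scaled / scaled.sum(axis=-1, keepdims=True)

posterior = np.array([[0.5, 0.5], [0.8, 0.2]])    # model outputs, training priors
adjusted = adjust_for_label_shift(
    posterior,
    train_prior=np.array([0.5, 0.5]),
    prod_prior=np.array([0.9, 0.1]),
)

print(f"plain mean    : {plain_mean:+.3f}")
print(f"weighted mean : {weighted_mean:+.3f}")
print("adjusted posteriors:\n", adjusted.round(3))
```

The weights make old training points stand in for the new input distribution, and the prior rescaling moves an uninformative 50/50 prediction to the new base rate. Neither step touches \(P(y \mid x)\), which is why neither helps with real concept drift.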
Drift also has a shape in time, and the shape decides how you watch for it. The canonical taxonomy (Gama et al., 2014):
| Pattern | What happens | Example |
|---|---|---|
| Sudden | An abrupt jump to a new concept | a sensor is replaced; a regulation flips overnight |
| Gradual | The new concept slowly overtakes the old, the two coexisting for a while | a product preference migrating between cohorts |
| Incremental | A slow, continuous slide through intermediate concepts | inflation eroding a price model month by month |
| Recurring | Old concepts return on a cycle (seasonality) | holiday shopping, weekday/weekend traffic |
Seasonality is the great impostor. A recurring pattern looks like drift to a naïve detector but needs no retraining — only a model that already encodes the cycle, or a baseline that compares like-for-like (this December against last December, not against November). Treating seasonality as drift is the most common false alarm in production monitoring, and the reason §5.4 insists on a sensible reference window.
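The like-for-like idea can be illustrated with synthetic data. In this sketch the basket values, their means, and the `psi` helper (decile bins frozen on the reference sample) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(reference, live, n_bins=10):
    """PSI with decile bins frozen on the reference sample (tails open-ended)."""
    inner = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    e = np.bincount(np.digitize(reference, inner), minlength=n_bins) / len(reference)
    a = np.bincount(np.digitize(live, inner), minlength=n_bins) / len(live)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical basket values: December runs higher than November every year.
nov_this = rng.normal(50, 10, 5000)
dec_last = rng.normal(65, 12, 5000)
dec_this = rng.normal(65, 12, 5000)

psi_vs_nov = psi(nov_this, dec_this)        # naive baseline: previous month
psi_vs_last_dec = psi(dec_last, dec_this)   # like-for-like baseline

print(f"December vs November      : PSI = {psi_vs_nov:.3f}")
print(f"December vs last December : PSI = {psi_vs_last_dec:.3f}")
```

The month-over-month comparison raises an alarm on a pattern that repeats every year, while the year-over-year comparison stays quiet because nothing about December itself has changed.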
Population Stability Index (PSI) & CSI
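Before you can react to drift you have to measure it, and the industry's workhorse, born in credit-risk scorecards and now ubiquitous, is the Population Stability Index. Take a feature (or the model's output score), bin it once on a reference period to get expected proportions \(E_i\), then count the same bins on the live period to get actual proportions \(A_i\). PSI is the symmetric relative-entropy-style sum over the \(B\) bins (EQ V5.2):
\[ \mathrm{PSI} = \sum_{i=1}^{B} (A_i - E_i)\,\ln\frac{A_i}{E_i} \]
Each summand is one bin's contribution (EQ V5.3). It is never negative, because \(A_i - E_i\) and \(\ln(A_i/E_i)\) always share a sign, so PSI is zero only when every bin matches. The sum equals \(\mathrm{KL}(A\,\|\,E) + \mathrm{KL}(E\,\|\,A)\), which is why it is symmetric in the two periods.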
PSI earns its keep because, empirically, its magnitude maps onto a stable rule of thumb that has survived decades of scorecard practice:
| PSI | Interpretation | Action |
|---|---|---|
| < 0.10 | No significant population change | continue monitoring |
| 0.10 – 0.25 | Moderate shift — worth investigating | investigate, watch closely |
| > 0.25 | Significant shift | act — retrain or recalibrate |
Those thresholds (0.1 and 0.25) are heuristic, not theorems: they predate any distributional theory and assume roughly 10 bins of reasonable size. Treat them as alarm levels, not laws. They also ignore sample size. With tiny samples, sampling noise alone inflates PSI, so two draws from the same population can land in the "moderate" band. With very large samples PSI is estimated precisely, but a shift that clears a threshold can still be harmless to the model. Always pair the number with a look at which bins moved.
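The small-sample effect is easy to see in simulation. The sketch below draws both the reference and the live sample from the same standard normal, so any PSI it reports is pure noise; the sample sizes, repeat count, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 10

# Decile edges of a standard normal, estimated once from a large pool.
pool = rng.normal(0.0, 1.0, 1_000_000)
inner_edges = np.quantile(pool, np.linspace(0, 1, N_BINS + 1)[1:-1])

def psi(reference, live):
    """PSI over fixed decile bins; the outer bins are open-ended."""
    e = np.bincount(np.digitize(reference, inner_edges), minlength=N_BINS) / len(reference)
    a = np.bincount(np.digitize(live, inner_edges), minlength=N_BINS) / len(live)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

def mean_null_psi(n, repeats=200):
    """Average PSI between two size-n samples drawn from the SAME N(0, 1)."""
    return float(np.mean([psi(rng.normal(0, 1, n), rng.normal(0, 1, n))
                          for _ in range(repeats)]))

small = mean_null_psi(200)
large = mean_null_psi(20_000)

print(f"no real shift, n = 200    : mean PSI = {small:.4f}")
print(f"no real shift, n = 20,000 : mean PSI = {large:.4f}")
```

A second-order approximation puts the expected PSI under no shift at roughly \((B-1)(1/N + 1/M)\) for \(B\) bins and sample sizes \(N\) and \(M\), so the noise floor falls in proportion to sample size and a given threshold means less on a small window than on a large one.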
Apply EQ V5.2 to the model's output score and people call it PSI; apply the identical formula to a single input feature and the same community calls it the Characteristic Stability Index (CSI). The math is the same; only the target differs — and the pairing is diagnostic. A stable PSI with a drifting CSI says an input moved but the model's score has so far absorbed it; a drifting PSI tells you the score distribution itself has shifted, which is what actually feeds downstream decisions and cutoffs.
```python
# PSI between an expected (reference) and actual (live) binned distribution.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 5000)    # training-time score distribution
live = rng.normal(0.5, 1.1, 5000)   # production: shifted right + wider

# 10 equal-width bins, chosen once and frozen; both periods reuse the same edges.
edges = np.linspace(-4, 4, 11)
# Clip so out-of-range scores land in the outer bins instead of being dropped.
E = np.histogram(np.clip(ref, edges[0], edges[-1]), edges)[0] / len(ref)
A = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0] / len(live)

eps = 1e-6                          # guard empty bins (ln 0 is undefined)
E = np.clip(E, eps, None)
A = np.clip(A, eps, None)

terms = (A - E) * np.log(A / E)     # EQ V5.3, one per bin
psi = terms.sum()                   # EQ V5.2

print("bin      E      A   contribution")
for i in range(len(E)):
    print(f"{i:2d}   {E[i]:.3f}  {A[i]:.3f}   {terms[i]:+.4f}")

band = "STABLE" if psi < 0.10 else "MODERATE" if psi < 0.25 else "SIGNIFICANT SHIFT"
print(f"\nPSI = {psi:.3f} -> {band}")
print(f"biggest single bucket: bin {int(np.argmax(terms))} "
      f"({terms.max():.4f}) -- read the breakdown, not just the total.")
```
Detecting drift in production
PSI is a batch statistic: you compute it over a window. Production also wants streaming detectors that raise a flag the moment a process changes, and they split cleanly by what they watch.
Watch the inputs (label-free)
Labels usually arrive late — a loan defaults months after approval, a churn label resolves a quarter later — so the first line of defence watches the feature distribution, which is available instantly. The tools are statistical two-sample tests between a reference window and a recent window:
- Kolmogorov–Smirnov for a continuous feature: the maximum gap between the two empirical CDFs.
- Chi-squared for a categorical feature: observed-vs-expected counts per category.
- PSI / CSI (§5.2) as a thresholded scalar, the operations-friendly summary.
- Maximum Mean Discrepancy (MMD) for the joint multivariate input, when per-feature tests miss a shift in the correlations.
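Two of the tests in this list, KS and MMD, can be sketched with NumPy alone. This is a minimal illustration under assumed synthetic data: it computes the test statistics only (no p-values), the RBF bandwidth of 1.0 is an arbitrary choice, and the function names are hypothetical. For real monitoring, `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency` also return p-values.

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian (RBF) kernel."""
    def kernel(p, q):
        sq = np.sum(p**2, axis=1)[:, None] + np.sum(q**2, axis=1)[None, :] - 2 * p @ q.T
        return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))
    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean())

# Reference: two independent features. Live: same marginals, now correlated.
n, rho = 1000, 0.9
ref = rng.normal(size=(n, 2))
z = rng.normal(size=(n, 2))
live = np.column_stack([z[:, 0], rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, 1]])
null = rng.normal(size=(n, 2))      # a second draw from the reference distribution

ks_per_feature = [ks_statistic(ref[:, j], live[:, j]) for j in range(2)]
mmd_shift = mmd2_rbf(ref, live)
mmd_null = mmd2_rbf(ref, null)

print("per-feature KS        :", [round(k, 3) for k in ks_per_feature])
print(f"MMD^2, ref vs live    : {mmd_shift:.4f}")
print(f"MMD^2, ref vs ref-like: {mmd_null:.4f}")
```

Each marginal is still a standard normal, so the per-feature KS statistics stay small, while the joint test picks up the new correlation. That is the case the MMD bullet is describing.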
The hard truth of label-free detection: it can only ever see covariate shift. A pure concept drift — \(P(y\mid x)\) moves while \(P(x)\) stays put — leaves every input test silent while accuracy quietly rots. Input monitoring is necessary and cheap, but it is not sufficient.
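A small simulation makes the blind spot concrete. Everything here is an assumed toy setup: a one-dimensional input, a fixed threshold "model", and a concept that moves its decision boundary from 0 to 1 while the inputs keep the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def model(x):
    """A frozen model that learned the original concept exactly: y = 1 if x > 0."""
    return (x > 0).astype(int)

n = 5000
x_before = rng.normal(size=n)
x_after = rng.normal(size=n)             # P(x) is unchanged
y_before = (x_before > 0).astype(int)    # original concept
y_after = (x_after > 1).astype(int)      # concept drift: boundary moved to x > 1

ks = ks_statistic(x_before, x_after)
acc_before = float(np.mean(model(x_before) == y_before))
acc_after = float(np.mean(model(x_after) == y_after))

print(f"KS on inputs    : {ks:.3f}  (input monitor sees nothing)")
print(f"accuracy before : {acc_before:.3f}")
print(f"accuracy after  : {acc_after:.3f}")
```

The input test compares two samples from the same distribution and stays silent, yet roughly a third of the predictions are now wrong. Only a monitor that sees labels can catch this.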
Watch the errors (label-dependent)
The only thing that directly sees concept drift is the model's own error stream. The classic online detector is DDM (Drift Detection Method): treat the per-example error as a Bernoulli sequence whose error rate \(p_t\) should fall or hold as a stable model sees more data. Track the running rate and its standard deviation \(s_t = \sqrt{p_t(1-p_t)/t}\), remember the minimum point \((p_{\min}, s_{\min})\) reached, and alarm when the current point drifts a few standard deviations above that best (EQ V5.4):
\[ p_t + s_t \ge p_{\min} + 2\,s_{\min} \;\;\text{(warning)}, \qquad p_t + s_t \ge p_{\min} + 3\,s_{\min} \;\;\text{(drift)} \]
The honest framing is a detection-theory trade-off, not a free lunch: a sensitive detector catches drift early but cries wolf on noise and seasonality; a conservative one is quiet but lets the model rot longer before it fires. There is no setting that is both early and silent — you tune the operating point to the cost of a missed drift versus the cost of a needless retrain.
```python
# Concept-drift detection with a rolling error monitor (DDM-style, EQ V5.4).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# A stream of 0/1 errors: stable ~8% for 600 steps, then concept drift -> ~32%.
n1, n2 = 600, 400
errors = np.concatenate([rng.random(n1) < 0.08,
                         rng.random(n2) < 0.32]).astype(int)

WARMUP = 30                     # skip the first steps: p and s are too noisy
p_min, s_min = 1.0, 1.0         # best (lowest) point reached so far
warn_at = drift_at = None
err_sum = 0

for t in range(1, len(errors) + 1):
    err_sum += errors[t - 1]
    p = err_sum / t                       # running error rate
    s = np.sqrt(p * (1 - p) / t)          # its standard deviation
    if t <= WARMUP:
        continue
    # New best -> reset the floor. Requiring s > 0 stops an error-free start
    # from pinning the floor at (0, 0), which would alarm on everything.
    if s > 0 and p + s < p_min + s_min:
        p_min, s_min = p, s
    if warn_at is None and p + s >= p_min + 2 * s_min:
        warn_at = t
    if p + s >= p_min + 3 * s_min:
        drift_at = t
        break                             # a live system would retrain or reset here

print(f"true change point         : {n1}")
print(f"DDM warning raised at step: {warn_at}")
print(f"DDM drift declared at step: {drift_at}")
if drift_at is None:
    print("no drift declared on this stream")
else:
    print(f"detection delay           : {drift_at - n1} steps after the real shift")

# Running error rate over the whole stream.
steps = np.arange(1, len(errors) + 1)
plt.plot(steps, np.cumsum(errors) / steps)
plt.xlabel("step")
plt.ylabel("running error rate")
plt.show()
```
Monitoring & retraining triggers
Detection is only half the loop. A monitoring system has to turn a signal into a decision: do nothing, alert a human, or retrain. Three trigger philosophies, roughly in order of maturity:
- Scheduled retraining. Refit on a fixed cadence — nightly, weekly, monthly. Dead simple and predictable, but it is both wasteful (you retrain when nothing changed) and dangerous (you wait until the next cycle while the model rots). It is a default, not an answer.
- Performance-triggered. Retrain when a live metric — accuracy, AUC, calibration, a business KPI — crosses a threshold. The gold standard, because it reacts to what you actually care about, but it needs ground-truth labels, and those often arrive with a long, costly delay.
- Drift-triggered. Retrain when an input statistic (PSI/CSI, KS, a streaming detector) crosses a threshold. Available immediately and label-free — the proxy you reach for while labels are in flight — but it can fire on harmless covariate shift and stay silent on pure concept drift. In practice you run drift triggers as an early warning and performance triggers as the authoritative one.
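One way to combine the last two philosophies is a small decision function in which the performance trigger is authoritative and the drift trigger is an early warning. This is a sketch only: the `Thresholds` fields, the AUC floor of 0.70, and the four action names are hypothetical choices, and the PSI levels are the rule-of-thumb values from §5.2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Thresholds:
    psi_warn: float = 0.10   # PSI rule of thumb: moderate shift
    psi_act: float = 0.25    # PSI rule of thumb: significant shift
    min_auc: float = 0.70    # hypothetical performance floor

def decide(psi: float, auc: Optional[float], th: Thresholds = Thresholds()) -> str:
    """Map monitoring signals to an action.

    `auc` is None while ground-truth labels for the window are still in flight.
    """
    # Authoritative trigger: measured performance has fallen below the floor.
    if auc is not None and auc < th.min_auc:
        return "retrain"
    if psi > th.psi_act:
        # Inputs moved a lot. Without labels this is only an early warning;
        # with healthy labels it may be harmless covariate shift.
        return "alert" if auc is None else "investigate"
    if psi >= th.psi_warn:
        return "investigate"
    return "ok"

print(decide(psi=0.05, auc=0.80))   # ok
print(decide(psi=0.30, auc=None))   # alert
print(decide(psi=0.30, auc=0.80))   # investigate
print(decide(psi=0.05, auc=0.60))   # retrain
```

The last call is the pure concept-drift case: the inputs look stable, and only the labelled metric reveals the problem.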
Every trigger needs a reference window to compare against, and the choice is consequential. A fixed reference (the training set) detects drift relative to the world the model actually learned — the correct baseline for "is my model still valid?" A sliding reference (last month) detects change but normalizes away slow incremental drift, so the model can boil like the proverbial frog while every week looks like the last. Most mature stacks keep the training distribution as the anchor and add seasonality-aware comparisons on top.
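The boiling-frog effect shows up clearly in simulation. The sketch below assumes a single feature whose mean creeps up by 0.1 standard deviations per month; the drift rate, sample sizes, and the `psi` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(reference, live, n_bins=10):
    """PSI with decile bins frozen on the reference sample (tails open-ended)."""
    inner = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    e = np.bincount(np.digitize(reference, inner), minlength=n_bins) / len(reference)
    a = np.bincount(np.digitize(live, inner), minlength=n_bins) / len(live)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical feature: the mean rises by 0.1 sd every month for a year.
n, months, step = 20_000, 12, 0.1
train = rng.normal(0.0, 1.0, n)
windows = [rng.normal(step * m, 1.0, n) for m in range(1, months + 1)]

# Fixed reference: every month is compared with the training set.
psi_fixed = [psi(train, w) for w in windows]
# Sliding reference: every month is compared with the month before it.
psi_sliding = [psi(prev, cur) for prev, cur in zip([train] + windows[:-1], windows)]

for m, (f, s) in enumerate(zip(psi_fixed, psi_sliding), start=1):
    print(f"month {m:2d}   vs training: {f:.3f}   vs previous month: {s:.3f}")
```

The sliding comparison never leaves the "no significant change" band, because each month really does look like the last. The fixed comparison climbs steadily past 0.25, which is the signal that the model's training world is gone.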
The cost side has its own arithmetic. Suppose drift erodes value at a roughly linear rate after each retrain, so the average performance gap you carry scales with the time between retrains. Retrain too often and you pay compute and review for nothing; too rarely and you eat accumulating decay. The optimum balances the two, a classic inventory-style trade-off in which a fixed cost per retrain is weighed against a carrying cost that grows with the interval.
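One way to make the trade-off concrete, offered as a sketch under stated assumptions rather than as the chapter's own formula: let \(T\) be the interval between retrains, \(d\) the decay rate, \(C\) the cost of one retrain, and \(v\) the value of one unit of performance per unit time (\(C\) and \(v\) are symbols introduced here). Linear decay from a zero gap up to \(dT\) gives an average gap of \(dT/2\), so the cost per unit time and its minimiser are:

```latex
J(T) = \frac{C}{T} + \frac{v\,d\,T}{2},
\qquad
J'(T) = -\frac{C}{T^{2}} + \frac{v\,d}{2} = 0
\;\Longrightarrow\;
T^{*} = \sqrt{\frac{2C}{v\,d}}
```

Under these assumptions the best interval grows with the square root of the retraining cost and shrinks with the square root of the decay rate, the same shape as the economic order quantity in inventory theory.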
Four ways drift monitoring goes wrong:
- Alarm fatigue. A detector tuned so hot it fires on every Monday; teams learn to ignore it and miss the real one.
- Seasonality mistaken for drift. Comparing December to November instead of to last December.
- Retraining on contaminated data. The freshly buffered window includes the very anomaly that triggered the alarm, so you retrain the model to expect the disaster.
- Silent label delay. Your performance trigger cannot fire because the labels for the drifted period have not arrived yet, and your input triggers cannot see concept drift; the gap between them is where models die quietly.
A model decaying in the wild
Put the pieces together and a deployed model's life has a characteristic arc: a fresh fit performs near its validation score, holds for a while, then bends downward as the world drifts away from the snapshot it learned. The slope of that bend is the decay rate \(d\); a retrain snaps performance back toward the top and the clock restarts. The whole job of this chapter is to see the bend early enough — through PSI on the inputs and error monitors on the outputs — to retrain on the way down rather than at the bottom.
```python
# Sawtooth decay (EQ V5.6): no-retrain vs periodic retrain -> value recovered.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
T = 180                                  # days in service
acc0, d, noise = 0.90, 0.0015, 0.004     # ceiling, decay/day, measurement noise
days = np.arange(T)

# Scenario A: never retrain -> monotone decay from the ceiling.
never = acc0 - d * days + rng.normal(0, noise, T)

# Scenario B: retrain every `period` days -> reset the clock each cycle.
period = 30
retr = acc0 - d * (days % period) + rng.normal(0, noise, T)

print(f"day 0   : never {never[0]:.3f}   retrained {retr[0]:.3f}")
print(f"day 90  : never {never[90]:.3f}   retrained {retr[90]:.3f}")
print(f"day 179 : never {never[179]:.3f}   retrained {retr[179]:.3f}")
print(f"\nmean accuracy, never-retrain : {never.mean():.3f}")
print(f"mean accuracy, retrain @{period}d : {retr.mean():.3f}")
print(f"value recovered by retraining: {retr.mean() - never.mean():+.3f} acc")

plt.plot(days, never, label="never retrain")
plt.plot(days, retr, label=f"retrain every {period} days")  # the sawtooth: decay, snap, decay, snap
plt.xlabel("day in service")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```
Drift monitoring tells you that something changed; it never tells you why. When PSI spikes and accuracy bends, the next question is always "which feature, which interaction, which case?", and answering it is the job of the explainability toolkit. Chapter 06 covers SHAP and its game-theoretic guarantees, LIME's local surrogates, partial dependence and ICE, and the honest limits of post-hoc explanation.
References
- Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation.
- Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift.
- Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM).
- Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN).
- Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD).
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning.