MODEL VALIDATION & RISK · CHAPTER 05 / 07

Stability & Drift

A model is trained once on a fixed snapshot, then deployed into an environment that keeps changing. As the input distribution and the input-output relationships shift, an unchanged model gradually loses accuracy. Every deployed model decays; the open question is whether you detect the drift before users do.

LEVEL: CORE | READING TIME: ≈ 27 MIN | BUILDS ON: MLOPS 01 · STATS 04 | INSTRUMENTS: PSI · STREAM DETECTOR · DECAY
5.1

Distribution shift — covariate, label & concept drift

Supervised learning rests on one quiet assumption: the data you serve is drawn from the same distribution as the data you trained on. Write the joint distribution of inputs \(x\) and labels \(y\) as \(P(x, y) = P(y \mid x)\,P(x)\). Training estimates \(\hat{f}\) against a fixed \(P_{\text{train}}\); production feeds it some \(P_{\text{prod}}\). When the two diverge, the model is being asked a question it was never taught to answer. There are three textbook ways for them to diverge, and they are not interchangeable.

EQ V5.1 — THE THREE SHIFTS $$ \underbrace{P_{\text{prod}}(x)\neq P_{\text{train}}(x)}_{\text{covariate shift}},\qquad \underbrace{P_{\text{prod}}(y)\neq P_{\text{train}}(y)}_{\text{label / prior shift}},\qquad \underbrace{P_{\text{prod}}(y\mid x)\neq P_{\text{train}}(y\mid x)}_{\text{concept drift}} $$
Covariate shift moves the inputs (\(P(x)\) changes) while the rule \(P(y\mid x)\) holds: your traffic now skews toward regions of feature space the model rarely saw. Label / prior shift moves the class balance \(P(y)\): fraud spikes, the base rate moves. Concept drift is the dangerous one: the relationship itself, \(P(y\mid x)\), changes, so the function you learned is now wrong, not merely under-sampled. Critically, input monitoring sees only what moves \(P(x)\): covariate shift directly, and label shift indirectly through the change in feature mix it induces. Pure concept drift can be invisible in the features and surface only as a collapse in accuracy, which is why you monitor both inputs and errors.

The distinction is operational, not academic, because it dictates the fix. Covariate shift can sometimes be corrected by importance weighting — reweight training examples by \(w(x) = P_{\text{prod}}(x)/P_{\text{train}}(x)\) so the old data resembles the new — without any fresh labels. Label shift is corrected by re-estimating the priors. Concept drift admits no such trick: the mapping moved, so the model must relearn it from freshly labelled data. Worse, concept drift can be real (the world genuinely changed — a new fraud tactic) or virtual (only \(P(x)\) moved, \(P(y\mid x)\) is intact); Gama et al. carefully separate the two, because virtual drift may need nothing more than a wider training set.
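As a rough illustration of the importance-weighting idea, the sketch below estimates \(w(x) = P_{\text{prod}}(x)/P_{\text{train}}(x)\) for a single feature by dividing one histogram by another. The synthetic data, the bin edges, and the helper name `density_ratio_weights` are assumptions made for this example; histogram ratios are only the simplest possible estimator and tend to be noisy in sparse bins.

```python
import numpy as np

def density_ratio_weights(x_train, x_prod, edges):
    """Histogram estimate of w(x) = P_prod(x) / P_train(x) for each training point."""
    p_train = np.histogram(x_train, edges)[0] / len(x_train)
    p_prod = np.histogram(x_prod, edges)[0] / len(x_prod)
    eps = 1e-6  # guard against empty bins
    ratio = np.clip(p_prod, eps, None) / np.clip(p_train, eps, None)
    # Map every training point to its bin, then look up that bin's ratio.
    idx = np.clip(np.digitize(x_train, edges) - 1, 0, len(edges) - 2)
    return ratio[idx]

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 20000)   # training-time feature (synthetic)
x_prod = rng.normal(0.5, 1.0, 20000)    # production: covariate shift to the right

edges = np.linspace(-5, 5, 41)          # illustrative binning, not a recommendation
w = density_ratio_weights(x_train, x_prod, edges)

raw_mean = x_train.mean()
weighted_mean = np.average(x_train, weights=w)
print(f"training mean, unweighted : {raw_mean:+.3f}")
print(f"training mean, reweighted : {weighted_mean:+.3f}")
print(f"production mean           : {x_prod.mean():+.3f}")
```

The reweighted training mean should land close to the production mean, which is the sense in which "the old data resembles the new". The same weights could then be passed as per-sample weights when refitting a model, with no fresh labels involved.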

Drift also has a shape in time, and the shape decides how you watch for it. The canonical taxonomy (Gama et al., 2014):

Pattern | What happens | Example
Sudden | An abrupt jump to a new concept | a sensor is replaced; a regulation flips overnight
Gradual | The new concept slowly overtakes the old, the two coexisting for a while | a product preference migrating between cohorts
Incremental | A slow, continuous slide through intermediate concepts | inflation eroding a price model month by month
Recurring | Old concepts return on a cycle (seasonality) | holiday shopping, weekday/weekend traffic

Seasonality is the great impostor. A recurring pattern looks like drift to a naïve detector but needs no retraining — only a model that already encodes the cycle, or a baseline that compares like-for-like (this December against last December, not against November). Treating seasonality as drift is the most common false alarm in production monitoring, and the reason §5.4 insists on a sensible reference window.
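The like-for-like point can be made concrete with a small simulation. The sketch below scores the same December data against two baselines, using the Population Stability Index of §5.2 as the drift score. The feature, its holiday bump of 0.8, and the bin edges are all invented for illustration.

```python
import numpy as np

def psi(expected, actual, edges, eps=1e-6):
    """PSI between two samples on shared bin edges (out-of-range values go to the outer bins)."""
    lo, hi = edges[0], edges[-1]
    e = np.histogram(np.clip(expected, lo, hi), edges)[0] / len(expected)
    a = np.histogram(np.clip(actual, lo, hi), edges)[0] / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)   # guard empty bins
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
# Hypothetical feature with the same holiday bump every December.
nov_this_year = rng.normal(0.0, 1.0, 5000)
dec_last_year = rng.normal(0.8, 1.0, 5000)
dec_this_year = rng.normal(0.8, 1.0, 5000)

edges = np.linspace(-4, 5, 11)   # 10 bins, shared by both comparisons
psi_vs_november = psi(nov_this_year, dec_this_year, edges)
psi_vs_last_december = psi(dec_last_year, dec_this_year, edges)
print(f"December vs November      : PSI = {psi_vs_november:.3f}")
print(f"December vs last December : PSI = {psi_vs_last_december:.3f}")
```

Against November the detector reports a large shift; against last December the same data looks stable. Nothing about the model changed between the two comparisons, only the choice of baseline.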

A spam filter's inputs look unchanged, but spammers adopt a brand-new phrasing so the same words now mean something different and accuracy collapses. Which of the three shifts is this — covariate, label, or concept? (one word)
The feature distribution \(P(x)\) is stable, but the mapping \(P(y\mid x)\) — which words imply spam — has moved. That is concept drift, the one invisible in the inputs and the one that genuinely requires relearning the function (EQ V5.1).
5.2

Population Stability Index (PSI) & CSI

Before you can react to drift you have to measure it, and the industry's workhorse — born in credit-risk scorecards and now ubiquitous — is the Population Stability Index. Take a feature (or the model's output score), bin it once on a reference period to get expected proportions \(E_i\), then count the same bins on the live period to get actual proportions \(A_i\). PSI is the symmetric relative-entropy-style sum over bins:

EQ V5.2 — POPULATION STABILITY INDEX $$ \mathrm{PSI} \;=\; \sum_{i=1}^{B}\big(A_i - E_i\big)\,\ln\!\frac{A_i}{E_i} $$
\(B\) bins; \(E_i\) is the expected (reference) fraction of mass in bin \(i\), \(A_i\) the actual (current) fraction; both sets sum to 1. Each term is \(\ge 0\) — a bin that gained or lost mass contributes a positive amount, and the larger the relative move, the larger the term. PSI is exactly the symmetrized KL divergence (the Jeffreys divergence) between the two binned distributions: \(\mathrm{KL}(A\Vert E) + \mathrm{KL}(E\Vert A)\). It is zero only when every bin matches and grows without bound as mass migrates. The number is a single scalar you can alarm on.
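The identity with the Jeffreys divergence is easy to confirm numerically. The sketch below uses two made-up five-bin distributions (the proportions are arbitrary, chosen only so that each set sums to 1) and compares EQ V5.2 against the sum of the two KL terms.

```python
import numpy as np

E = np.array([0.10, 0.20, 0.40, 0.20, 0.10])   # reference proportions (sum to 1)
A = np.array([0.05, 0.15, 0.35, 0.30, 0.15])   # live proportions (sum to 1)

terms = (A - E) * np.log(A / E)                # one non-negative term per bin
psi = terms.sum()                              # EQ V5.2
kl_a_e = np.sum(A * np.log(A / E))             # KL(A || E)
kl_e_a = np.sum(E * np.log(E / A))             # KL(E || A)

print("per-bin terms         :", np.round(terms, 4))
print(f"PSI                   = {psi:.6f}")
print(f"KL(A||E) + KL(E||A)   = {kl_a_e + kl_e_a:.6f}")
```

The two totals agree because \((A_i - E_i)\ln(A_i/E_i) = A_i\ln(A_i/E_i) + E_i\ln(E_i/A_i)\) bin by bin.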

PSI earns its keep because, empirically, its magnitude maps onto a stable rule of thumb that has survived decades of scorecard practice:

PSI | Interpretation | Action
< 0.10 | No significant population change | continue monitoring
0.10 – 0.25 | Moderate shift, worth investigating | investigate, watch closely
> 0.25 | Significant shift | act: retrain or recalibrate

Those thresholds (0.1 and 0.25) are heuristic, not theorems: they predate any distributional theory and assume roughly 10 bins of reasonable size. Treat them as alarm levels, not laws. PSI measures how far the population moved, not how much the model suffers, so a shift that clears 0.25 can still be harmless; and with tiny samples, sampling noise alone inflates PSI even when nothing has changed. Always pair the number with a look at which bins moved.
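The small-sample caveat is worth seeing once. In the sketch below both samples are drawn from the same distribution, so the true shift is zero and any PSI is pure noise. The sample sizes, the 50 repetitions, and the binning are arbitrary choices for illustration; part of the inflation at small \(n\) comes from the empty-bin guard `eps`, which is itself a common practical choice rather than part of EQ V5.2.

```python
import numpy as np

def psi(expected, actual, edges, eps=1e-6):
    """PSI between two samples on shared bin edges (out-of-range values go to the outer bins)."""
    lo, hi = edges[0], edges[-1]
    e = np.histogram(np.clip(expected, lo, hi), edges)[0] / len(expected)
    a = np.histogram(np.clip(actual, lo, hi), edges)[0] / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)   # guard empty bins
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
edges = np.linspace(-4, 4, 11)   # 10 equal-width bins

mean_psi = {}
for n in (100, 1000, 10000, 100000):
    # Reference and live samples share one distribution: no real drift at all.
    runs = [psi(rng.normal(size=n), rng.normal(size=n), edges) for _ in range(50)]
    mean_psi[n] = float(np.mean(runs))
    print(f"n = {n:>6d}   mean PSI under no shift = {mean_psi[n]:.4f}")
```

The average PSI shrinks steadily as the samples grow, which suggests reading the 0.10 / 0.25 bands with the window size in mind rather than as fixed constants.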

Apply EQ V5.2 to the model's output score and people call it PSI; apply the identical formula to a single input feature and the same community calls it the Characteristic Stability Index (CSI). The math is the same; only the target differs — and the pairing is diagnostic. A stable PSI with a drifting CSI says an input moved but the model's score has so far absorbed it; a drifting PSI tells you the score distribution itself has shifted, which is what actually feeds downstream decisions and cutoffs.

EQ V5.3 — ONE PSI BUCKET'S CONTRIBUTION $$ \mathrm{psi}_i \;=\; (A_i - E_i)\,\ln\!\frac{A_i}{E_i}, \qquad \mathrm{PSI} = \sum_i \mathrm{psi}_i $$
The per-bucket term is the unit you actually reason about. A bucket whose expected mass was \(E_i = 0.20\) and whose actual mass rose to \(A_i = 0.30\) contributes \((0.30-0.20)\ln(0.30/0.20) = 0.10 \times \ln 1.5 = 0.10 \times 0.405 = \mathbf{0.0405}\). Sum these across bins and one or two large terms usually dominate — read the bucket breakdown, not just the total, because it points straight at the feature region that moved.
Using EQ V5.3, a PSI bucket has expected proportion \( E_i = 0.20 \) and actual proportion \( A_i = 0.30 \). What is this single bucket's contribution to PSI, \( (A_i - E_i)\ln(A_i/E_i) \)? (Use \( \ln 1.5 = 0.405 \).)
\( A_i - E_i = 0.30 - 0.20 = 0.10 \); the ratio is \( A_i/E_i = 0.30/0.20 = 1.5 \), and \( \ln 1.5 = 0.405 \). The contribution is \( 0.10 \times 0.405 = \) 0.0405. It takes seven buckets of that size to push the total past the 0.25 alarm (\(6 \times 0.0405 = 0.243\), \(7 \times 0.0405 = 0.2835\)).
A PSI above 0.25 usually signals a significant population shift that warrants action (retrain or recalibrate). True or false? (Answer true or false.)
By the standard scorecard rule of thumb, PSI < 0.1 is stable, 0.1–0.25 is a moderate shift worth investigating, and PSI > 0.25 is a significant shift that calls for action. So the statement is true — with the honest caveat that the 0.25 line is a heuristic, not a proof, and must be read alongside sample size and the per-bucket breakdown.
PYTHON · RUNNABLE IN-BROWSER
# PSI between an expected (reference) and actual (live) binned distribution.
import numpy as np

# Fixed reference scores; fit 10 equal-width bins ONCE on the reference.
rng = np.random.default_rng(0)
ref  = rng.normal(0.0, 1.0, 5000)            # training-time score distribution
live = rng.normal(0.5, 1.1, 5000)            # production: shifted right + wider

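edges = np.linspace(-4, 4, 11)               # 10 equal-width bins, frozen for both periods
# Clip so scores beyond +/-4 land in the outer bins and each histogram sums to 1.
E = np.histogram(np.clip(ref,  -4, 4), edges)[0] / len(ref)
A = np.histogram(np.clip(live, -4, 4), edges)[0] / len(live)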

eps = 1e-6                                    # guard empty bins (ln 0 is undefined)
E = np.clip(E, eps, None); A = np.clip(A, eps, None)
terms = (A - E) * np.log(A / E)              # EQ V5.3, one per bin
psi = terms.sum()                            # EQ V5.2

print("bin   E      A      contribution")
for i in range(len(E)):
    print(f"{i:2d}  {E[i]:.3f}  {A[i]:.3f}    {terms[i]:+.4f}")
band = "STABLE" if psi < 0.10 else "MODERATE" if psi < 0.25 else "SIGNIFICANT SHIFT"
print(f"\nPSI = {psi:.3f}  ->  {band}")
print(f"biggest single bucket: bin {int(np.argmax(terms))} "
      f"({terms.max():.4f}) -- read the breakdown, not just the total.")
INSTRUMENT V5.1: PSI CALCULATOR · SHIFT A DISTRIBUTION · CROSS 0.10 / 0.25 · EQ V5.2
READOUTS: PSI · VERDICT · TOP BUCKET TERM
The grey outline is the reference (expected) distribution; the mint bars are the live (actual) one. Bins are frozen on the reference. Push MEAN SHIFT from 0 and watch PSI climb through the dashed 0.10 line into the 0.25 danger zone; widening the spread alone moves both tails and also raises PSI even with zero mean shift. Add bins to see the total wobble — PSI is bin-count sensitive, which is why a fixed binning matters.
5.3

Detecting drift in production

PSI is a batch statistic: you compute it over a window. Production also wants streaming detectors that raise a flag the moment a process changes, and they split cleanly by what they watch.

Watch the inputs (label-free)

Labels usually arrive late — a loan defaults months after approval, a churn label resolves a quarter later — so the first line of defence watches the feature distribution, which is available instantly. The tools are statistical two-sample tests between a reference window and a recent window:

  • Kolmogorov–Smirnov for a continuous feature: the maximum gap between the two empirical CDFs.
  • Chi-squared for a categorical feature: observed-vs-expected counts per category.
  • PSI / CSI (§5.2) as a thresholded scalar, the operations-friendly summary.
  • Maximum Mean Discrepancy (MMD) for the joint multivariate input, when per-feature tests miss a shift in the correlations.

The hard truth of label-free detection: it can only ever see covariate shift. A pure concept drift — \(P(y\mid x)\) moves while \(P(x)\) stays put — leaves every input test silent while accuracy quietly rots. Input monitoring is necessary and cheap, but it is not sufficient.
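To make the blind spot concrete, the sketch below implements the KS statistic by hand (the largest gap between two empirical CDFs) and runs it on two synthetic scenarios. The data, the toy rule "y = 1 when x > 0", and the helper names are assumptions for this example. In practice a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value, would likely be used instead of the hand-rolled statistic.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def predict(x):
    """Toy model: it learned the rule 'y = 1 when x > 0' on the reference data."""
    return (x > 0).astype(int)

rng = np.random.default_rng(0)
n = 5000
x_ref = rng.normal(0.0, 1.0, n)

# Case 1, covariate shift: the inputs move, the rule stays the same.
x_cov = rng.normal(0.7, 1.0, n)
y_cov = (x_cov > 0).astype(int)

# Case 2, pure concept drift: same input distribution, but the rule flips.
x_con = rng.normal(0.0, 1.0, n)
y_con = (x_con < 0).astype(int)

ks_cov, ks_con = ks_statistic(x_ref, x_cov), ks_statistic(x_ref, x_con)
acc_cov = float(np.mean(predict(x_cov) == y_cov))
acc_con = float(np.mean(predict(x_con) == y_con))
print(f"covariate shift : KS = {ks_cov:.3f}   accuracy = {acc_cov:.3f}")
print(f"concept drift   : KS = {ks_con:.3f}   accuracy = {acc_con:.3f}")
```

The covariate-shift case trips the input test while the model stays accurate; the concept-drift case leaves the input test quiet while accuracy collapses. Neither signal substitutes for the other.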

Watch the errors (label-dependent)

The only thing that directly sees concept drift is the model's own error stream. The classic online detector is DDM (Drift Detection Method): treat the per-example error as a Bernoulli sequence whose error rate \(p_t\) should fall or hold as a stable model sees more data. Track the running rate and its standard deviation \(s_t = \sqrt{p_t(1-p_t)/t}\), remember the minimum point \((p_{\min}, s_{\min})\) reached, and alarm when the current point drifts a few standard deviations above that best:

EQ V5.4 — DDM WARNING & DRIFT LEVELS $$ \text{warning: } p_t + s_t \ge p_{\min} + 2\,s_{\min}, \qquad \text{drift: } p_t + s_t \ge p_{\min} + 3\,s_{\min} $$
As long as the model is stable, \(p_t\) drifts down and \(p_{\min}+2s_{\min}\) tracks the best-so-far error. When the error climbs two standard deviations above that floor, DDM enters a warning zone (start buffering recent data); at three it declares drift (the buffered window becomes the retraining set). The \(2\sigma/3\sigma\) bands are the Gaussian-tail logic of a control chart applied to a learning curve. Variants — EDDM (watches the distance between errors, better for gradual drift), ADWIN (an adaptive window with a formal false-positive bound), Page-Hinkley (a CUSUM on the error) — trade sensitivity against false alarms.
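Of the variants named above, Page-Hinkley is the shortest to write down. The sketch below is one common formulation: accumulate each observation's excess over the running mean, minus a tolerance, and alarm when the accumulated sum rises far enough above its own minimum. The parameter names `delta` and `lam`, their values, and the deterministic error stream are all assumptions chosen so the behaviour is easy to follow; they are not tuned recommendations.

```python
import numpy as np

def page_hinkley(stream, delta=0.005, lam=5.0):
    """Return the first 1-based step at which the Page-Hinkley test fires, or None.

    delta is the tolerated drift magnitude, lam the alarm threshold (both illustrative).
    """
    mean = 0.0     # running mean of the stream
    m = 0.0        # cumulative deviation above the running mean
    m_min = 0.0    # smallest cumulative deviation seen so far
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        m += x - mean - delta
        m_min = min(m_min, m)
        if m - m_min > lam:
            return t
    return None

# Deterministic 0/1 error stream: one error every 12 steps (about 8%)
# for 600 steps, then one error every 3 steps (about 33%).
stable = (np.arange(1, 601) % 12 == 0).astype(int)
drifted = (np.arange(1, 401) % 3 == 0).astype(int)
errors = np.concatenate([stable, drifted])

fired_at = page_hinkley(errors)
print("true change point : 600")
print(f"Page-Hinkley fired: {fired_at}")
```

Lowering `lam` would shorten the delay after step 600 but leave less headroom for ordinary fluctuations before it, which is the same sensitivity versus false-alarm dial the other detectors expose.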

The honest framing is a detection-theory trade-off, not a free lunch: a sensitive detector catches drift early but cries wolf on noise and seasonality; a conservative one is quiet but lets the model rot longer before it fires. There is no setting that is both early and silent — you tune the operating point to the cost of a missed drift versus the cost of a needless retrain.

PYTHON · RUNNABLE IN-BROWSER
# Concept-drift detection with a rolling error monitor (DDM-style, EQ V5.4).
import numpy as np
rng = np.random.default_rng(1)

# A stream of 0/1 errors: stable ~8% for 600 steps, then concept drift -> ~32%.
n1, n2 = 600, 400
errors = np.concatenate([rng.random(n1) < 0.08,
                         rng.random(n2) < 0.32]).astype(int)

p, s = 0.0, 0.0                       # running error rate and its std
p_min, s_min = 1.0, 1.0              # best (lowest) point reached so far
warn_at = drift_at = None
err_sum = 0
for t in range(1, len(errors) + 1):
    err_sum += errors[t - 1]
    p = err_sum / t
    s = np.sqrt(p * (1 - p) / t)
    if t <= 30:                      # warm-up: too few samples to trust p and s
        continue
    if s > 0 and p + s < p_min + s_min:   # new best -> lower the floor
        p_min, s_min = p, s
    # Test the weaker 2-sigma level first so a warning never post-dates the drift.
    if warn_at is None and p + s >= p_min + 2 * s_min:
        warn_at = t
    if drift_at is None and p + s >= p_min + 3 * s_min:
        drift_at = t

print(f"true change point         : {n1}")
print(f"DDM warning raised at step: {warn_at}")
print(f"DDM drift declared at step: {drift_at}")
delay = "n/a" if drift_at is None else drift_at - n1   # avoid a crash if nothing fired
print(f"detection delay           : {delay} steps after the real shift")
plot_xy(np.arange(len(errors)), np.cumsum(errors) / np.arange(1, len(errors) + 1))
INSTRUMENT V5.2: STREAMING DRIFT DETECTOR · ROLLING z-TEST ON A FEATURE · WARNING → DRIFT
READOUTS: DETECTED AT STEP · DETECTION DELAY · FALSE ALARMS (PRE-DRIFT)
A feature streams across 240 steps. It is stationary until the dashed change line at step 120, then its mean jumps by DRIFT MAGNITUDE. The detector keeps a reference window and a recent window of width \(W\) and fires when their means differ by more than \(z\) standard errors; the mint marker is the first detection. Crank sensitivity up (low \(z\)) to catch tiny drifts at the cost of false alarms before the change; raise \(z\) for a detector that is quiet but late. There is no setting that is both early and quiet: that is the detection trade-off made visible.
5.4

Monitoring & retraining triggers

Detection is only half the loop. A monitoring system has to turn a signal into a decision: do nothing, alert a human, or retrain. Three trigger philosophies, roughly in order of maturity:

  • Scheduled retraining. Refit on a fixed cadence — nightly, weekly, monthly. Dead simple and predictable, but it is both wasteful (you retrain when nothing changed) and dangerous (you wait until the next cycle while the model rots). It is a default, not an answer.
  • Performance-triggered. Retrain when a live metric — accuracy, AUC, calibration, a business KPI — crosses a threshold. The gold standard, because it reacts to what you actually care about, but it needs ground-truth labels, and those often arrive with a long, costly delay.
  • Drift-triggered. Retrain when an input statistic (PSI/CSI, KS, a streaming detector) crosses a threshold. Available immediately and label-free — the proxy you reach for while labels are in flight — but it can fire on harmless covariate shift and stay silent on pure concept drift. In practice you run drift triggers as an early warning and performance triggers as the authoritative one.

Every trigger needs a reference window to compare against, and the choice is consequential. A fixed reference (the training set) detects drift relative to the world the model actually learned — the correct baseline for "is my model still valid?" A sliding reference (last month) detects change but normalizes away slow incremental drift, so the model can boil like the proverbial frog while every week looks like the last. Most mature stacks keep the training distribution as the anchor and add seasonality-aware comparisons on top.
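The boiling-frog effect shows up clearly in a toy stream. In the sketch below a feature's mean creeps up by 0.1 per month for a year; PSI is computed once against the fixed training snapshot and once against the previous month. The drift rate, sample sizes, and binning are invented for the example.

```python
import numpy as np

def psi(expected, actual, edges, eps=1e-6):
    """PSI between two samples on shared bin edges (out-of-range values go to the outer bins)."""
    lo, hi = edges[0], edges[-1]
    e = np.histogram(np.clip(expected, lo, hi), edges)[0] / len(expected)
    a = np.histogram(np.clip(actual, lo, hi), edges)[0] / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)   # guard empty bins
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
edges = np.linspace(-4, 6, 11)
n = 5000
# Month 0 is the training snapshot; the mean then slides by 0.1 every month.
months = [rng.normal(0.1 * k, 1.0, n) for k in range(13)]

psi_fixed = [psi(months[0], months[k], edges) for k in range(1, 13)]
psi_sliding = [psi(months[k - 1], months[k], edges) for k in range(1, 13)]

print("month   PSI vs training   PSI vs previous month")
for k in range(1, 13):
    print(f"{k:5d}   {psi_fixed[k - 1]:15.3f}   {psi_sliding[k - 1]:21.3f}")
```

The sliding comparison stays in the "stable" band every single month, while the fixed comparison climbs through the thresholds as the cumulative shift builds. That is the case for keeping the training distribution as the anchor.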

The cost side has its own arithmetic. Suppose drift erodes value at a roughly linear rate after each retrain, so the average performance gap you carry scales with the time between retrains. Retrain too often and you pay compute and review for nothing; too rarely and you eat accumulating decay. The optimum balances the two — a classic inventory-style trade-off:

EQ V5.5 — RETRAIN-CADENCE COST $$ \text{Cost}(T) \;=\; \underbrace{\frac{c_{\text{retrain}}}{T}}_{\text{amortized retrain}} \;+\; \underbrace{\tfrac{1}{2}\,d\,T}_{\text{average decay carried}} \qquad\Longrightarrow\qquad T^\star = \sqrt{\frac{2\,c_{\text{retrain}}}{d}} $$
\(T\) is the interval between retrains, \(c_{\text{retrain}}\) the cost (compute + validation + risk) of one retrain, and \(d\) the per-unit-time rate at which value decays after a fresh fit. The first term falls with \(T\) (retrain less, amortize more); the second rises with \(T\) (carry more accumulated decay on average). Setting the derivative to zero gives the square-root cadence \(T^\star=\sqrt{2c_{\text{retrain}}/d}\) — the same shape as the economic-order-quantity rule. Faster-drifting models (large \(d\)) should retrain more often; expensive retrains (large \(c\)) push the cadence out. It is a back-of-envelope model, not gospel — real decay is rarely linear and seasonality breaks the smoothness — but it gives the right instinct for the dial.
Using EQ V5.5, one retrain costs \( c_{\text{retrain}} = 200 \) units and value decays at \( d = 1 \) unit per day. What is the cost-optimal interval between retrains, \( T^\star = \sqrt{2c_{\text{retrain}}/d} \), in days?
\( 2 c_{\text{retrain}}/d = 2 \times 200 / 1 = 400 \), and \( \sqrt{400} = \) 20 days. Halve the retrain cost and the cadence tightens to \(\sqrt{200}\approx14\) days; double the drift rate and it tightens to \(\sqrt{200}\approx14\) days too — the square root makes the dial gentle.
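A quick numerical check of EQ V5.5 with the worked values \(c_{\text{retrain}} = 200\) and \(d = 1\): the sketch evaluates the cost on a grid of whole-day intervals and compares the grid minimum with the closed form. The function name `cadence_cost` and the 1 to 100 day grid are illustrative choices.

```python
import numpy as np

def cadence_cost(T, c_retrain, d):
    """EQ V5.5: amortized retrain cost plus the average decay carried."""
    return c_retrain / T + 0.5 * d * T

c_retrain, d = 200.0, 1.0
T_grid = np.arange(1, 101)                       # candidate intervals, in days
costs = cadence_cost(T_grid, c_retrain, d)

T_numeric = int(T_grid[np.argmin(costs)])        # brute-force minimum on the grid
T_closed = float(np.sqrt(2 * c_retrain / d))     # square-root rule

print(f"grid minimum : T = {T_numeric} days")
print(f"closed form  : T* = {T_closed:.1f} days")
for T in (10, 20, 40):
    print(f"Cost({T:2d}) = {cadence_cost(T, c_retrain, d):.2f}")
```

Halving or doubling the interval around the optimum raises the cost only moderately, so landing in the right neighbourhood matters more than hitting \(T^\star\) exactly.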
PITFALLS

Four ways drift monitoring goes wrong: (1) alarm fatigue — a detector tuned so hot it fires on every Monday; teams learn to ignore it and miss the real one. (2) seasonality mistaken for drift — comparing December to November instead of to last December. (3) retraining on contaminated data — the freshly buffered window includes the very anomaly that triggered the alarm, so you retrain the model to expect the disaster. (4) silent label delay — your performance trigger cannot fire because the labels for the drifted period have not arrived yet, and your input triggers cannot see concept drift; the gap between them is where models die quietly.

5.5

A model decaying in the wild

Put the pieces together and a deployed model's life has a characteristic arc: a fresh fit performs near its validation score, holds for a while, then bends downward as the world drifts away from the snapshot it learned. The slope of that bend is the decay rate \(d\); a retrain snaps performance back toward the top and the clock restarts. The whole job of this chapter is to see the bend early enough — through PSI on the inputs and error monitors on the outputs — to retrain on the way down rather than at the bottom.

EQ V5.6 — PERFORMANCE DECAY & SAWTOOTH RECOVERY $$ \mathrm{Acc}(t) \;=\; \mathrm{Acc}_0 \;-\; d\,(t - t_{\text{last}}) \;+\; \varepsilon_t, \qquad \text{retrain at } t \;\Rightarrow\; t_{\text{last}}\leftarrow t,\;\; \mathrm{Acc}\leftarrow \mathrm{Acc}_0 $$
Between retrains, accuracy falls roughly linearly from its post-fit ceiling \(\mathrm{Acc}_0\) at rate \(d\), buried in measurement noise \(\varepsilon_t\); a retrain resets the elapsed-time clock \(t-t_{\text{last}}\) and lifts performance back toward the ceiling. Trace this over many cycles and you get the familiar sawtooth: decay, snap, decay, snap. The area between the ceiling and the sawtooth is the value lost to drift — and retraining more often trades compute to shrink it, exactly the EQ V5.5 balance. Real curves are noisier, sometimes step rather than slope, and a retrain on bad data can fail to recover at all.
PYTHON · RUNNABLE IN-BROWSER
# Sawtooth decay (EQ V5.6): no-retrain vs periodic retrain -> value recovered.
import numpy as np
rng = np.random.default_rng(2)

T = 180                                  # days in service
acc0, d, noise = 0.90, 0.0015, 0.004    # ceiling, decay/day, measurement noise

# Scenario A: never retrain -> monotone decay from the ceiling.
never = acc0 - d * np.arange(T) + rng.normal(0, noise, T)

# Scenario B: retrain every 30 days -> reset the clock each cycle.
period = 30
retr = acc0 - d * (np.arange(T) % period) + rng.normal(0, noise, T)

print(f"day  0  : never {never[0]:.3f}   retrained {retr[0]:.3f}")
print(f"day 90  : never {never[90]:.3f}   retrained {retr[90]:.3f}")
print(f"day 179 : never {never[179]:.3f}   retrained {retr[179]:.3f}")
print(f"\nmean accuracy, never-retrain : {never.mean():.3f}")
print(f"mean accuracy, retrain @30d  : {retr.mean():.3f}")
print(f"value recovered by retraining: {retr.mean() - never.mean():+.3f} acc")
plot_xy(np.arange(T), retr)              # the sawtooth: decay, snap, decay, snap
INSTRUMENT V5.3: PERFORMANCE-DECAY SIMULATOR · SAWTOOTH RECOVERY · EQ V5.5 / V5.6
READOUTS: MEAN ACCURACY HELD · TOTAL COST (RETRAIN + DECAY) · COST-OPTIMAL T★
The grey ceiling is the post-fit accuracy \(\mathrm{Acc}_0\); the mint sawtooth is live accuracy decaying at rate \(d\) and snapping back at every retrain. The shaded gap between them is value lost to drift. Slide RETRAIN EVERY down to chase the ceiling — but watch TOTAL COST, which adds the price of all those retrains via EQ V5.5. The readout marks \(T^\star=\sqrt{2c/d}\): set the interval near it and the total cost sits in its valley. Raise \(d\) (faster-drifting world) and the optimal cadence tightens; raise \(c\) and it loosens.
NEXT

Drift monitoring tells you that the world around the model changed; it never tells you why. When PSI spikes and accuracy bends, the next question is always "which feature, which interaction, which case?", and answering it is the job of the explainability toolkit. Chapter 06: SHAP and its game-theoretic guarantees, LIME's local surrogates, partial dependence and ICE, and the honest limits of post-hoc explanation.

5.R

References

  1. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys 46(4) — the canonical taxonomy of drift types and adaptation strategies (§5.1).
  2. Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L. & Petitjean, F. (2016). Characterizing Concept Drift. Data Mining and Knowledge Discovery 30 — a quantitative framework for describing how concepts drift over time.
  3. Gama, J., Medas, P., Castillo, G. & Rodrigues, P. (2004). Learning with Drift Detection (DDM). SBIA 2004, LNCS 3171 — the error-rate drift detector behind EQ V5.4.
  4. Bifet, A. & Gavaldà, R. (2007). Learning from Time-Changing Data with Adaptive Windowing (ADWIN). SIAM SDM 2007 — an adaptive-window detector with a formal false-positive bound.
  5. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. (2012). A Kernel Two-Sample Test (MMD). JMLR 13 — the maximum-mean-discrepancy test for multivariate covariate-shift detection (§5.3).
  6. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. (eds.) (2009). Dataset Shift in Machine Learning. MIT Press — the reference volume formalizing covariate, prior, and concept shift (EQ V5.1).