MODEL VALIDATION & RISK · CHAPTER 07 / 07

MLOps & Model Governance

Training a model is the easy part. Keeping it trustworthy after the notebook closes requires a reproducible pipeline, a registry that records which artifact is live, monitoring that catches drift, and an audit trail an examiner will accept. MLOps is the set of practices that turns a one-off model into a maintained production asset with monitoring, lineage, and sign-off.

LEVEL: ADVANCED · READING TIME: ≈ 28 MIN · BUILDS ON: MLOPS 01–06 · ML 06 · INSTRUMENTS: MATURITY · PIPELINE DAG · RETRAIN TRIGGER
7.1

From notebook to production pipeline

Almost every real ML failure happens outside the model. The famous diagram from Sculley et al. makes the point: the box labelled "ML code" is a small square surrounded by configuration, data collection, feature extraction, serving infrastructure, monitoring, and process management — the model is a few percent of the system. A notebook captures only that small square, and it captures it badly: hidden cell-execution order, an un-pinned environment, a CSV that was edited by hand, a random seed nobody set. None of that survives a redeploy.

The discipline that fixes this is to treat the path from raw data to served prediction as a single, versioned, re-runnable pipeline — a directed acyclic graph (DAG) of typed stages. Every edge is an artifact (a dataset, a feature table, a model file, an eval report); every node is a deterministic transform pinned to a code commit and a config. The asset you ship is not the weights file — it is the recipe that regenerates the weights file.

EQ V7.1 — REPRODUCIBILITY AS A FUNCTION OF INPUTS $$ \text{artifact} \;=\; f\big(\,\text{data}_{\,v},\ \text{code}_{\,c},\ \text{config}_{\,h},\ \text{env}_{\,e},\ \text{seed}_{\,s}\,\big) $$
A run is reproducible iff fixing all five inputs fixes the output. \(v\) is a content hash of the data snapshot, \(c\) a git commit, \(h\) the hyperparameter config, \(e\) the pinned environment (container digest + library versions), \(s\) the RNG seed. Drop any one and you have a story, not a result. The single most common reproducibility failure is an un-versioned \(\text{data}_v\): the same code on "today's table" silently trains a different model tomorrow. Pipelines exist to make all five explicit and to cache stages whose inputs have not changed.
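One way to make the five inputs of EQ V7.1 explicit is to collapse them into a single run fingerprint. The sketch below is only an illustration under stated assumptions: the function name `run_fingerprint`, the field names, and every example value are hypothetical, not part of any particular framework.

```python
import hashlib
import json

def run_fingerprint(data_hash, code_commit, config, env_digest, seed):
    """Collapse the five inputs of EQ V7.1 into one hex digest (illustrative only)."""
    payload = {
        "data": data_hash,      # v: content hash of the data snapshot
        "code": code_commit,    # c: git commit
        "config": config,       # h: hyperparameter config
        "env": env_digest,      # e: container digest + library versions
        "seed": seed,           # s: RNG seed
    }
    # sort_keys makes the serialization independent of dict insertion order
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# Hypothetical values, for illustration only.
base = dict(data_hash="sha256:aaa", code_commit="abc1234",
            config={"lr": 0.01, "depth": 6}, env_digest="sha256:bbb", seed=42)

same = run_fingerprint(**base) == run_fingerprint(**base)
changed_data = run_fingerprint(**{**base, "data_hash": "sha256:ccc"})
print("identical inputs, identical fingerprint:", same)
print("new data snapshot changes fingerprint  :", changed_data != run_fingerprint(**base))
```

Two runs with the same fingerprint claim to be the same experiment; if their artifacts differ, some input is still hiding outside the five.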

The payoff is concrete. If stage inputs are content-addressed, a pipeline can skip any stage whose inputs are unchanged and rerun only what is downstream of an edit — the same idea as a build system, applied to data and models. Change one feature definition and the framework knows exactly which models must be retrained and which evals must be rerun; change nothing and the whole pipeline is a cache hit.
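The build-system analogy can be sketched in a few lines. This is a toy in-memory cache, not a real framework: `run_stage`, `cache_key`, the `cache` dict, and the two stage functions are hypothetical, and a production system would also hash each stage's code and config into the key.

```python
import hashlib
import json

cache = {}        # cache key -> stage output (a real system would persist artifacts)
executed = []     # log of stages that actually ran, to make cache hits visible

def cache_key(stage_name, inputs):
    """Content-address a stage invocation by its name and its JSON-serializable inputs."""
    blob = json.dumps({"stage": stage_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def run_stage(stage_name, fn, inputs):
    """Run fn(inputs) only if this exact (stage, inputs) pair has not been seen before."""
    key = cache_key(stage_name, inputs)
    if key not in cache:
        executed.append(stage_name)
        cache[key] = fn(inputs)
    return cache[key]

def pipeline(raw_rows, scale):
    # "features" and "train" are stand-ins for real transforms
    feats = run_stage("features", lambda x: [r * x["scale"] for r in x["rows"]],
                      {"rows": raw_rows, "scale": scale})
    model = run_stage("train", lambda x: sum(x["feats"]) / len(x["feats"]),
                      {"feats": feats})
    return model

pipeline([1, 2, 3], scale=2)      # cold run: both stages execute
pipeline([1, 2, 3], scale=2)      # nothing changed: full cache hit
pipeline([1, 2, 3], scale=3)      # feature definition changed: both stages rerun
print(executed)
```

The second call adds nothing to `executed`, which is the "whole pipeline is a cache hit" case; the third call reruns both stages because the changed feature output changes the key of everything downstream.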

NOTEBOOK · 1 machine: Hidden state, manual order, un-pinned env. Reproducible by luck.
SCRIPT + CONFIG · N runs: Deterministic given inputs, but no lineage and no caching.
VERSIONED PIPELINE · DAG: Typed stages, content-addressed artifacts, partial reruns, full lineage.

There is an honest tension here. Notebooks are unmatched for exploration — the friction of a full pipeline would kill the iteration speed that finds the model in the first place. The mature workflow is therefore not "no notebooks" but a clear promotion boundary: explore freely in a notebook, then graduate the winning recipe into pipeline stages before anything touches production. The maturity instrument below is exactly a tour of that boundary.

INSTRUMENT V7.1 — MLOPS MATURITY SELF-ASSESSMENT · DECISION-TREE WALKTHROUGH · LEVELS 0–4
Answer four yes/no questions about your own team. The path walks the standard MLOps maturity ladder — Level 0 (manual notebook) → 1 (automated pipeline) → 2 (CI/CD for the pipeline) → 3 (automated retraining) → 4 (full governance with continuous monitoring and sign-off). The "next move" is the single highest-leverage thing to build next. The lesson: maturity is a ladder, and you do not get to skip rungs — automated retraining (Level 3) is dangerous without the monitoring and registry of the levels below it.
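The ladder logic can be written down directly. In this sketch the four question strings are hypothetical stand-ins (the actual wording of the walkthrough is not reproduced here), and the rule that the first "no" stops the climb is an assumed encoding of "you do not get to skip rungs".

```python
# Hypothetical yes/no questions, one per rung; the original wording may differ.
QUESTIONS = [
    "Is training an automated, re-runnable pipeline?",        # reaches Level 1
    "Is the pipeline itself under CI/CD?",                     # reaches Level 2
    "Is retraining triggered automatically?",                  # reaches Level 3
    "Is there continuous monitoring with governed sign-off?",  # reaches Level 4
]
LEVELS = ["manual notebook", "automated pipeline", "CI/CD for the pipeline",
          "automated retraining", "full governance"]

def maturity_level(answers):
    """Climb one rung per consecutive 'yes'; the first 'no' stops the climb."""
    level = 0
    for yes in answers:
        if not yes:
            break
        level += 1
    return level

def next_move(level):
    """The next thing to build is whatever the first unmet question asks for."""
    return QUESTIONS[level] if level < len(QUESTIONS) else "maintain and audit"

answers = [True, False, True, False]   # retraining is automated, but CI/CD is missing
lvl = maturity_level(answers)
print(f"Level {lvl}: {LEVELS[lvl]}")
print("Next move:", next_move(lvl))
```

The example team automates retraining but lacks CI/CD, so it scores Level 1, not Level 3: the later "yes" does not count until the rung below it exists.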
7.2

Experiment tracking & model registries

Two systems sit at the heart of any serious ML platform, and they answer two different questions.

An experiment tracker answers "what did we try, and what happened?" Every run logs its parameters, its metrics, the data snapshot hash, the git commit, and the produced artifacts. Months later you can ask "which run produced this checkpoint, on what data, with what learning rate, and what was its held-out AUC?" and get an exact answer instead of an archaeology project. The tracker is the lab notebook the literal notebook never was — searchable, comparable, immutable.

A model registry answers a sharper, scarier question: "which artifact is live right now, who approved it, and what do I roll back to?" The registry is not storage — it is a state machine over model versions, with explicit stages and gated transitions:

EQ V7.2 — THE REGISTRY STATE MACHINE $$ \texttt{None} \;\xrightarrow{\text{register}}\; \texttt{Staging} \;\xrightarrow{\;\text{eval + sign-off}\;}\; \texttt{Production} \;\xrightarrow{\;\text{superseded}\;}\; \texttt{Archived} $$
Each arrow is a guarded transition: a model may only enter Production when it passes the gate (offline evals clear thresholds, a human with the right role approves, the deployment config is pinned). The registry records who pulled the lever and when. The one invariant that matters: at most one version is Production per deployment slot, and you can name it in one query. A team that cannot answer "what is live?" in seconds does not have a registry — it has a folder.

The registry is what makes a rollback a one-line operation instead of a 2 a.m. incident. Because every version's full lineage (EQ V7.1) is attached, reverting to the previous Production model is just re-pointing the serving slot at an immutable, already-validated artifact — no rebuild, no retrain, no guessing. The same machinery powers champion/challenger rollouts (§7.3) and multi-tenant serving where many model versions coexist behind one gateway.
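A toy version of the state machine makes the guarded transitions and the one-Production-per-slot invariant concrete. Everything here is a hypothetical sketch: the class `ModelRegistry`, its method names, and the slot and approver names are invented for illustration, and the `rollback` method re-activates an archived version, which is an assumption that goes one step beyond the arrows drawn in EQ V7.2.

```python
class ModelRegistry:
    """Toy registry: a stage per version, plus at most one Production version per slot."""

    def __init__(self):
        self.stage = {}      # version -> "Staging" | "Production" | "Archived"
        self.live = {}       # slot -> version currently in Production
        self.history = {}    # slot -> previously live versions, oldest first
        self.audit = []      # (action, version, slot, actor) tuples

    def register(self, version):
        if version in self.stage:
            raise ValueError(f"{version} is already registered")
        self.stage[version] = "Staging"
        self.audit.append(("register", version, None, None))

    def promote(self, version, slot, evals_passed, approver):
        # Guarded transition: only Staging -> Production, and only through the gate.
        if self.stage.get(version) != "Staging":
            raise ValueError(f"{version} is not in Staging")
        if not evals_passed or not approver:
            raise PermissionError("gate not cleared: need passing evals and an approver")
        previous = self.live.get(slot)
        if previous is not None:
            self.stage[previous] = "Archived"               # superseded
            self.history.setdefault(slot, []).append(previous)
        self.stage[version] = "Production"
        self.live[slot] = version
        self.audit.append(("promote", version, slot, approver))

    def rollback(self, slot, actor):
        # Re-point the slot at the most recent previously validated version.
        if not self.history.get(slot):
            raise LookupError(f"no earlier Production version for slot {slot}")
        bad = self.live[slot]
        restored = self.history[slot].pop()
        self.stage[bad] = "Archived"
        self.stage[restored] = "Production"
        self.live[slot] = restored
        self.audit.append(("rollback", restored, slot, actor))
        return restored

reg = ModelRegistry()
for v in ("v1", "v2"):
    reg.register(v)
reg.promote("v1", "fraud-scoring", evals_passed=True, approver="alice")
reg.promote("v2", "fraud-scoring", evals_passed=True, approver="bob")
print("live:", reg.live["fraud-scoring"])
print("after rollback:", reg.rollback("fraud-scoring", actor="carol"))
```

"What is live?" is a single dictionary lookup, and rollback touches no training code: it only moves a pointer and appends to the audit log.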

| System | Answers | Keyed on | Failure if absent |
| --- | --- | --- | --- |
| Experiment tracker | What did we try & what happened? | run id | Can't reproduce or compare past results |
| Model registry | What is live, who approved, roll back to what? | model version | No fast rollback; "what's in prod?" is unanswerable |
| Artifact / data store | Where are the bytes, by content hash? | content digest | Lineage breaks; artifacts mutate under you |

A pragmatic caveat: in 2026 the tracker and registry are often the same platform (MLflow, Weights & Biases, Vertex, SageMaker, and others bundle both), and for LLM/agent systems a "model version" increasingly means a tuple of base-model id, adapter or system-prompt version, and tool schema. The abstractions are unchanged; only the artifact got more interesting.
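For an LLM or agent system, the "tuple as version" idea might look like the following. This is a sketch under assumptions: the dataclass `LLMSystemVersion`, its field names, and the example identifiers are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LLMSystemVersion:
    base_model_id: str        # hypothetical identifier of the base model
    adapter_or_prompt: str    # adapter or system-prompt version tag
    tool_schema: str          # the tool schema, serialized (e.g. as JSON text)

    def digest(self):
        """A stable short id: a change to any component yields a new version."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]

v_a = LLMSystemVersion("base-model-x", "prompt-v7", '{"tools": ["search"]}')
v_b = LLMSystemVersion("base-model-x", "prompt-v8", '{"tools": ["search"]}')
print(v_a.digest(), v_b.digest(), v_a.digest() != v_b.digest())
```

A prompt edit alone produces a new digest, so the registry treats it as a new version that must pass the same gate as a retrained model.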

By the registry invariant in EQ V7.2, how many model versions may be in the Production stage for a single deployment slot at one time?
The registry is a state machine whose key invariant is that each deployment slot has at most one live version — that is precisely what lets you answer "what is in prod?" in one query and roll back deterministically. So the answer is 1. (Several versions may sit in Staging or Archived; only one is Production per slot.)
7.3

CI/CD & automated retraining

Software CI/CD tests code. ML CI/CD must also test data and models: three things change independently, and any one can break production. The mature pipeline therefore runs three layers of gates, in the spirit of the ML Test Score (Breck et al.): tests for the data (schema, distributions, expected-value constraints), tests for the model (does training converge, does it beat a baseline, is it robust to perturbations), and tests for the infrastructure (can it be served, rolled back, reproduced). The rubric itself adds a fourth category, monitoring tests, which §7.4 takes up.
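The data layer of those gates is the easiest to sketch. The function below is a minimal illustration, not a substitute for a real validation library: `validate_batch`, the schema layout, the 5% null-rate limit, and both example batches are assumptions made for the example.

```python
def validate_batch(rows, schema, max_null_rate=0.05):
    """Return a list of human-readable failures; an empty list means the batch passes.

    schema maps column name -> (expected type, min value, max value).
    """
    failures = []
    for col, (typ, lo, hi) in schema.items():
        values = [r.get(col) for r in rows]
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > max_null_rate:
            failures.append(f"{col}: null rate {nulls / len(rows):.0%} exceeds {max_null_rate:.0%}")
        present = [v for v in values if v is not None]
        if any(not isinstance(v, typ) for v in present):
            failures.append(f"{col}: unexpected type (wanted {typ.__name__})")
            continue                      # a range check is meaningless on wrong types
        if any(v < lo or v > hi for v in present):
            failures.append(f"{col}: value outside [{lo}, {hi}]")
    return failures

# Hypothetical schema and batches, for illustration only.
schema = {"age": (int, 0, 120), "amount": (float, 0.0, 1e6)}
good = [{"age": 34, "amount": 12.5}, {"age": 51, "amount": 990.0}]
bad  = [{"age": 34, "amount": 12.5}, {"age": -3, "amount": None}]

print("good batch:", validate_batch(good, schema) or "pass")
print("bad batch :", validate_batch(bad, schema))
```

In a pipeline this check would sit in the validate stage and fail the run before any training compute is spent on a malformed batch.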

A model never goes live just because it trained. It goes live only if it clears an offline gate against the current Production model on a frozen holdout, and — for high-stakes systems — survives an online gate (a canary or A/B test on real traffic). The offline decision is the champion/challenger rule: the newly trained challenger replaces the live champion only if it is decisively better.

EQ V7.3 — CHAMPION / CHALLENGER PROMOTION RULE $$ \text{promote} \;\iff\; \big(M_{\text{chal}} - M_{\text{champ}} \;>\; \delta\big)\ \ \wedge\ \ \big(G_{\text{chal}} \;\ge\; G_{\min}\big) $$
\(M\) is the primary holdout metric (AUC, F1, revenue-per-session…), measured for both models on the same frozen evaluation set. \(\delta > 0\) is a margin that must exceed the metric's noise (recall the holdout standard error of MLOPS · EQ V1.2) so you are not promoting on a coin flip. \(G\) are guardrail metrics (latency, fairness gaps, calibration, a forbidden-behavior rate) that must each clear a bound: the floor \(G_{\min}\) as written for higher-is-better metrics, or the mirror-image ceiling for lower-is-better ones such as latency and fairness gaps. The challenger is presumed guilty: it must beat the champion by a real margin and break no guardrail, or the champion stays. A challenger that wins on the headline metric while quietly regressing latency or a subgroup's error rate must not ship.

The same logic, applied to a stream of automatically retrained models, gives continuous training (CT): on a schedule or a trigger (§7.4), the pipeline retrains on fresh data, runs the full test suite, and proposes a challenger to the gate. Crucially, automated retraining does not mean automated deployment — the gate (and, for regulated models, a human sign-off) stays in the loop. Fully closed-loop retraining without a gate is how a feedback bug or a poisoned data window silently degrades a model over weeks.

In a champion/challenger setup, the challenger is promoted to production only if it beats the current champion on the holdout metric (by a margin, and without breaking guardrails). True or false? (Answer true or false.)
This is exactly the promotion rule of EQ V7.3: \(M_{\text{chal}} - M_{\text{champ}} > \delta\) and the guardrails hold. The incumbent is the default; a challenger must earn its place by a real margin. So the statement is true.
A challenger is scored on a frozen holdout of \( m = 2000 \) rows where the champion's accuracy is \( p = 0.90 \). To promote only on real signal, set the margin to the 95% half-width of the holdout estimate, \( \delta = 1.96\sqrt{p(1-p)/m} \). What is \( \delta \), to three decimals?
\( p(1-p) = 0.90 \times 0.10 = 0.09 \); divide by \( m = 2000 \) → \( 4.5\times10^{-5} \); square root → \( 0.006708 \) (the standard error). Multiply by \( 1.96 \): \( 1.96 \times 0.006708 = 0.01315 \approx \) 0.013. A challenger must beat the champion by at least ~1.3 accuracy points here, or the gap is indistinguishable from sampling noise — the same \(1/\sqrt{m}\) law from EQ V1.2.
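The same arithmetic as a reusable helper. The name `noise_margin` is hypothetical, and the formula is the normal-approximation half-width for a single proportion used in the exercise above; a paired comparison of two models on the same rows would call for a different standard error.

```python
import math

def noise_margin(p, m, z=1.96):
    """Half-width of the normal-approximation interval for a proportion p on m rows."""
    return z * math.sqrt(p * (1.0 - p) / m)

delta = noise_margin(p=0.90, m=2000)
print(f"standard error : {math.sqrt(0.90 * 0.10 / 2000):.6f}")
print(f"delta (95%)    : {delta:.3f}")

# The 1/sqrt(m) law: quadrupling the holdout halves the margin.
print(f"delta at m=8000: {noise_margin(0.90, 8000):.4f}")
```

Because the margin shrinks only with the square root of the holdout size, a small evaluation set forces a large \(\delta\) and makes modest real improvements impossible to promote.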
PYTHON · RUNNABLE IN-BROWSER
# Champion/challenger promotion from holdout metrics (EQ V7.3).

def promote(M_champ, M_chal, delta, guardrails):
    # guardrails: list of (name, value, bound, higher_is_better);
    # bound is a floor when higher_is_better, a ceiling otherwise.
    metric_ok = (M_chal - M_champ) > delta      # must beat the champion by the margin
    breaches = []
    for name, val, bound, higher in guardrails:
        ok = (val >= bound) if higher else (val <= bound)
        if not ok:
            breaches.append(name)
    decision = metric_ok and not breaches       # promote only on a clear win with no breach
    return decision, metric_ok, breaches

# Frozen-holdout AUC for both models; margin must beat metric noise.
M_champ, M_chal, delta = 0.842, 0.857, 0.005
guardrails = [                       # (name, challenger value, floor, higher_is_better)
    ("p99_latency_ms", 180.0, 200.0, False),   # must be <= 200ms  -> OK
    ("fairness_gap",   0.030, 0.050, False),   # must be <= 0.05   -> OK
    ("calibration_ece",0.021, 0.040, False),   # must be <= 0.04   -> OK
]

dec, mok, breaches = promote(M_champ, M_chal, delta, guardrails)
print(f"champion AUC   : {M_champ:.3f}")
print(f"challenger AUC : {M_chal:.3f}   ({M_chal-M_champ:+.3f}, margin needed {delta})")
print(f"beats margin?  : {mok}")
print(f"guardrail breaches: {breaches if breaches else 'none'}")
print(f"\nDECISION: {'PROMOTE challenger' if dec else 'KEEP champion'}")

# Counterfactual: same AUC win, but latency now blows the guardrail.
g2 = guardrails[:]; g2[0] = ("p99_latency_ms", 240.0, 200.0, False)
print("if p99 latency were 240ms ->",
      "PROMOTE" if promote(M_champ, M_chal, delta, g2)[0] else "KEEP champion (guardrail)")
INSTRUMENT V7.2 — PIPELINE-DAG ANATOMY · TYPED STAGES · ARTIFACTS · GATES
Click any stage to mark it edited. The DAG is a real ML pipeline: ingest → validate → features → train → evaluate → register → serve, with evaluate as the champion/challenger gate before register. Editing a stage dirties it and everything downstream (mint) while upstream stages stay cached (grey). Click features and watch train/evaluate/register/serve all light up; click serve and nothing upstream reruns. This is why content-addressed pipelines are cheap to iterate: you only pay for what actually changed.
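The dirty-propagation rule described above is a reachability walk over the DAG. The sketch below encodes the seven stages as an adjacency list; the function name `stages_to_rerun` is hypothetical, and a real pipeline would usually branch rather than form a single chain.

```python
from collections import deque

# The linear pipeline described above (stage -> direct downstream stages).
DAG = {
    "ingest": ["validate"],
    "validate": ["features"],
    "features": ["train"],
    "train": ["evaluate"],
    "evaluate": ["register"],
    "register": ["serve"],
    "serve": [],
}

def stages_to_rerun(dag, edited):
    """Breadth-first walk: the edited stage plus everything reachable downstream of it."""
    dirty, queue = {edited}, deque([edited])
    while queue:
        for child in dag[queue.popleft()]:
            if child not in dirty:
                dirty.add(child)
                queue.append(child)
    return [s for s in dag if s in dirty]       # keep pipeline order for display

for edited in ("features", "serve"):
    rerun = stages_to_rerun(DAG, edited)
    cached = [s for s in DAG if s not in rerun]
    print(f"edit {edited:9s} -> rerun {rerun} | cache hits {cached}")
```

Editing `features` dirties five stages and leaves `ingest` and `validate` cached; editing `serve` dirties only itself.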
7.4

Monitoring, lineage & reproducibility

A deployed model decays even though its weights never change, because the world the weights describe keeps moving. Two distinct decays matter, and confusing them is a classic mistake:

  • Data drift (covariate shift). The input distribution \(P(x)\) moves — a new traffic source, a seasonal effect, an upstream feature that started arriving null. The model is still "correct," but it is now answering questions about a population it was not trained on.
  • Concept drift. The relationship \(P(y \mid x)\) itself changes — fraud tactics evolve, user tastes shift, a competitor changes the market. Even on identical inputs, the right answer is now different. Only concept drift necessarily degrades accuracy; data drift may or may not.

Labels arrive late or never, so you cannot always watch accuracy directly. The first line of defence is therefore an unsupervised drift signal on the inputs and the predictions. The workhorse is the Population Stability Index (PSI), which compares a baseline (training) distribution against a recent production window, bucketed:

EQ V7.4 — POPULATION STABILITY INDEX $$ \mathrm{PSI} \;=\; \sum_{i=1}^{B} \big(a_i - e_i\big)\,\ln\!\frac{a_i}{e_i} $$
For each of \(B\) buckets, \(e_i\) is the expected (baseline) fraction of mass and \(a_i\) the actual (recent) fraction. Industry rule of thumb: PSI < 0.1 = stable, 0.1–0.25 = moderate shift (investigate), > 0.25 = significant shift (act). PSI is a symmetrized cousin of the KL divergence (INFO THEORY · EQ S2.3): each term is \((a_i-e_i)\ln(a_i/e_i)\) rather than \(a_i\ln(a_i/e_i)\), which makes the sum equal to the two directed divergences added together, \(\mathrm{KL}(a\,\|\,e) + \mathrm{KL}(e\,\|\,a)\). Every term is therefore non-negative, and the value is unchanged if the baseline and the recent window swap roles. Two practical caveats. The formula is undefined when a bucket is empty on either side (\(a_i = 0\) or \(e_i = 0\)), so implementations floor or merge sparse buckets. And PSI computed feature by feature detects marginal drift only: a change in the joint distribution that leaves every marginal unchanged is invisible to it.
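EQ V7.4 translates into a few lines of NumPy. This is a sketch under two assumed conventions that the formula itself does not fix: bucket edges are taken from baseline quantiles, and empty buckets are floored at a small `eps` so the logarithm stays finite. The function name `psi` and the synthetic data are illustrative only.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline (EQ V7.4).

    Bucket edges are baseline quantiles; eps floors empty buckets so the log stays
    finite. Both choices are conventions assumed here, not part of the formula.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    inner = edges[1:-1]                       # interior cut points; outer buckets are open-ended
    e = np.bincount(np.searchsorted(inner, expected), minlength=bins) / len(expected)
    a = np.bincount(np.searchsorted(inner, actual), minlength=bins) / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 20_000)
same     = rng.normal(0.0, 1.0, 20_000)      # fresh sample, same distribution
shifted  = rng.normal(0.8, 1.0, 20_000)      # mean moved by 0.8 standard deviations

print(f"PSI, same distribution: {psi(baseline, same):.4f}")
print(f"PSI, shifted mean     : {psi(baseline, shifted):.4f}")
```

Quantile buckets give every baseline bucket roughly equal mass, which keeps the statistic from being dominated by a few sparse tails; fixed-width buckets are an equally common choice.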

Drift on its own is only a warning. The decisive signal, when labels eventually land, is a service-level objective (SLO) on the live metric, with an alert that fires on a sustained breach rather than a single bad point — one noisy day is not an incident, a week below the floor is.

In a PSI computation (EQ V7.4), one bucket had expected mass \( e = 0.20 \) at baseline but actual mass \( a = 0.30 \) in the recent window. What is that single bucket's contribution \( (a-e)\ln(a/e) \)? (Use \( \ln 1.5 = 0.405 \).)
\( a - e = 0.30 - 0.20 = 0.10 \); \( a/e = 1.5 \), so \( \ln(a/e) = 0.405 \). The contribution is \( 0.10 \times 0.405 = 0.0405 \approx \) 0.04. A handful of buckets shifting like this can push total PSI past the 0.1 "investigate" line, the trigger the retraining-policy instrument below explores.
PYTHON · RUNNABLE IN-BROWSER
# Model-monitoring SLO-breach flag from a daily metric stream.
import numpy as np
rng = np.random.default_rng(11)

# 30 days of live accuracy: stable, then a drift-driven slide after day 18.
days = np.arange(30)
base = np.where(days < 18, 0.91, 0.91 - 0.006 * (days - 18))   # slow decay
acc  = np.clip(base + rng.normal(0, 0.012, 30), 0, 1)          # daily noise

SLO       = 0.88     # contractual floor on accuracy
WINDOW    = 5        # smooth over a rolling window (ignore one-day noise)
N_BREACH  = 3        # alert only after this many consecutive sub-SLO smoothed days

roll = np.convolve(acc, np.ones(WINDOW)/WINDOW, mode="valid")  # len 30-WINDOW+1
below = roll < SLO
# Track the longest run of consecutive sub-SLO smoothed days and the day the alert first fires.
run = longest = 0
fire_day = None
for i, b in enumerate(below):
    run = run + 1 if b else 0
    if run >= N_BREACH and fire_day is None:
        fire_day = i + WINDOW - 1          # map rolling index back to a calendar day
    longest = max(longest, run)

print(f"SLO floor          : {SLO:.2f}   rolling window: {WINDOW}d")
print(f"min rolling acc    : {roll.min():.3f}  (raw min {acc.min():.3f})")
print(f"longest breach run : {longest} day(s)   threshold: {N_BREACH}")
print(f"BREACH ALERT       : {'FIRE on day '+str(fire_day) if fire_day is not None else 'none'}")
# plot_xy is not defined in this snippet; it appears to be a helper supplied by the
# page's in-browser runtime, so it is called only when that helper is present.
if "plot_xy" in globals():
    plot_xy(np.arange(WINDOW-1, 30), roll)     # the smoothed curve crossing the SLO floor

Behind every alert sits lineage: the graph that connects a live prediction back through the model version, the training run, the data snapshot, and the feature code that produced it (EQ V7.1). When an incident hits, lineage answers the only questions that matter at 2 a.m. — which model is responsible, what was it trained on, what changed since it was clean, and what do we roll back to? A monitor without lineage tells you the patient has a fever; lineage tells you why.
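A lineage lookup is, at its core, a chain of joins. The sketch below uses plain dictionaries as stand-ins for the tracker and registry; every id, hash, name, and the functions `trace` and `changed_since` are made up for illustration.

```python
# Hypothetical lineage records; every id, hash, and path below is invented.
PREDICTIONS = {"pred-981": {"model_version": "v2"}}
MODEL_VERSIONS = {
    "v1": {"run_id": "run-17", "approved_by": "alice"},
    "v2": {"run_id": "run-23", "approved_by": "bob"},
}
RUNS = {
    "run-17": {"data_hash": "sha256:aaa", "commit": "abc1234", "feature_code": "features.py@abc1234"},
    "run-23": {"data_hash": "sha256:ccc", "commit": "def5678", "feature_code": "features.py@def5678"},
}

def trace(prediction_id):
    """Walk prediction -> model version -> training run and flatten the result."""
    version = PREDICTIONS[prediction_id]["model_version"]
    run_id = MODEL_VERSIONS[version]["run_id"]
    return {"prediction": prediction_id, "model_version": version,
            "run_id": run_id, **RUNS[run_id]}

def changed_since(clean_version, suspect_version):
    """Which lineage fields differ between a known-clean version and the suspect one?"""
    a = RUNS[MODEL_VERSIONS[clean_version]["run_id"]]
    b = RUNS[MODEL_VERSIONS[suspect_version]["run_id"]]
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

print(trace("pred-981"))
print(changed_since("v1", "v2"))
```

`trace` answers "which model, trained on what"; `changed_since` answers "what changed since it was clean", which narrows an incident to a data snapshot, a commit, or both.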

7.5

Model risk management & governance

Everything so far is engineering. Governance is the layer that makes those engineering controls accountable — who is allowed to deploy, who signed off, what evidence exists, and what happens when the model causes harm. In regulated industries this is not optional. The canonical reference is the US Federal Reserve / OCC supervisory letter SR 11-7 (2011), "Guidance on Model Risk Management", which defines model risk as the potential for adverse consequences from decisions based on incorrect or misused models, and prescribes three controls that map almost one-to-one onto good MLOps.

EQ V7.5 — MODEL RISK (SR 11-7 FRAMING) $$ \text{Model risk} \;=\; \underbrace{P(\text{model is wrong})}_{\text{fundamental error}} \;+\; \underbrace{P(\text{model is misused})}_{\text{wrong context / inputs}} $$
SR 11-7's central insight is that risk has two sources, not one: a model can be wrong (bad data, bad assumptions, overfitting), and a perfectly good model can be misused (applied outside its validated domain, fed inputs it never saw, trusted beyond its accuracy). Both must be managed. The guidance's three pillars are: (1) robust development & documentation — the pipeline, lineage, and reproducibility of §§7.1–7.4; (2) independent validation — a second team, not the builders, challenges the model before and after deployment; (3) governance, policies & controls — an inventory of every model, defined ownership, sign-off, and ongoing monitoring. "Effective challenge" — critical review by people with the authority and incentive to push back — is the phrase the document hangs everything on.

This regulatory framing has since been generalized far beyond banking. The EU AI Act (in force from 2024, with high-risk obligations phasing in through 2026–2027) imposes risk-tiered duties — risk management systems, data governance, logging, human oversight, and post-market monitoring — that are recognisably the same controls. The NIST AI Risk Management Framework (2023) and ISO/IEC 42001 (2023, the first AI management-system standard) give voluntary but increasingly expected scaffolding. The through-line across all of them is a small set of governance artifacts every mature ML organisation now maintains:

| Artifact | Question it answers | Lineage to MLOps |
| --- | --- | --- |
| Model inventory | Which models exist, who owns each, what is their risk tier? | registry (§7.2) |
| Model card / documentation | Intended use, training data, metrics, limitations, fairness | tracker + lineage |
| Validation report | Independent challenge: does it work, where does it fail? | eval gate (§7.3) |
| Sign-off / approval record | Who authorized production, on what evidence, when? | registry transition |
| Monitoring & incident log | How is it behaving live; what went wrong and when? | monitors (§7.4) |
CONTESTED

Governance can calcify into theatre. The honest tension in 2026: heavyweight model-risk processes designed for slow-moving credit models fit awkwardly onto fast-iterating ML and especially onto LLM/agent systems, where the "model" is a prompt-plus-tools assembly that changes weekly and whose failure modes (hallucination, prompt injection, jailbreaks) are not what SR 11-7 imagined. Two failure modes bracket the debate: too little governance ships unvalidated models into high-stakes decisions; too much produces a compliance pantomime where teams generate documents nobody reads to satisfy a checklist, while real risk goes unmonitored. The defensible middle is risk-tiered governance: match the weight of the controls to the stakes of the decision, automate the evidence-gathering so documentation is a by-product of the pipeline rather than a separate chore, and keep "effective challenge" genuinely effective.

US SR 11-7 is regulatory supervisory guidance on model risk management (development & documentation, independent validation, and governance/controls). True or false? (Answer true or false.)
SR 11-7 is the 2011 supervisory letter issued by the US Federal Reserve and the OCC, "Guidance on Model Risk Management." It defines model risk (EQ V7.5) and lays out the three pillars listed beneath that equation. So the statement is true, and it is the document most ML governance programs still trace their lineage to.
INSTRUMENT V7.3 — RETRAINING-TRIGGER POLICY EXPLORER · PSI · METRIC SLO · SCHEDULE · EQ V7.4
Three independent triggers can each demand a retrain: input drift (PSI past the 0.25 act-line, or 0.1 watch-line), a performance SLO breach (live accuracy below the 0.88 floor), or a staleness deadline (a max-age schedule). Slide each control and watch which bars cross their threshold. The lesson is policy design: a good retraining policy is the OR of a few cheap, observable signals — and even when a trigger fires, the action is "retrain & propose a challenger to the gate," never "auto-deploy." Drift alone never ships a model; the gate of §7.3 still has to say yes.
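The OR-of-signals policy fits in one function. This is a sketch under assumptions: the name `retrain_policy` is hypothetical, the 90-day maximum age is a placeholder (no deadline is specified above), and treating the 0.1 watch-line as "investigate" rather than as a retrain trigger is one possible reading of the two PSI lines.

```python
def retrain_policy(psi, live_accuracy, model_age_days,
                   psi_watch=0.10, psi_act=0.25, slo_floor=0.88, max_age_days=90):
    """OR together three cheap signals. max_age_days=90 is an assumed placeholder;
    the PSI lines and the 0.88 floor follow the thresholds quoted above."""
    fired = []
    if psi > psi_act:
        fired.append("drift")
    if live_accuracy < slo_floor:
        fired.append("slo_breach")
    if model_age_days > max_age_days:
        fired.append("staleness")
    if fired:
        action = "retrain and propose a challenger to the gate"   # never auto-deploy
    elif psi > psi_watch:
        action = "investigate drift"
    else:
        action = "no action"
    return fired, action

for case in [dict(psi=0.05, live_accuracy=0.91, model_age_days=20),
             dict(psi=0.15, live_accuracy=0.91, model_age_days=20),
             dict(psi=0.31, live_accuracy=0.86, model_age_days=20)]:
    print(case, "->", retrain_policy(**case))
```

No branch of the function deploys anything: the strongest action it can return is a request to retrain and hand the result to the promotion gate of §7.3.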
NEXT

You now have the operational backbone — pipelines, registries, monitoring, and the governance that makes a model an accountable asset. That closes the Model Validation & Risk track. From here the manual turns to the model itself: the LLM Field Manual opens with foundations — tokens, embeddings, and the next-token objective that everything in production is ultimately serving.

7.R

References

  1. Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015 — the "ML code is a small box" argument behind §7.1.
  2. Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management. Supervisory letter — the model-risk framework and three pillars of §7.5 (EQ V7.5).
  3. Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017 — the data/model/infra test layers of §7.3.
  4. Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv:2205.02302 — a current reference architecture for pipelines, CI/CD, and CT.
  5. National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1 — the Govern/Map/Measure/Manage scaffolding generalizing §7.5.
  6. European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act. Official Journal — risk-tiered obligations (risk management, data governance, logging, human oversight) phasing in through 2026–2027.