From notebook to production pipeline
Most production ML failures originate outside the model. The well-known diagram from Sculley et al. (2015) makes the point: the box labelled "ML code" is a small square surrounded by configuration, data collection, feature extraction, serving infrastructure, monitoring, and process management, so the model is only a small fraction of the system. A notebook captures only that small square, and it captures it badly: hidden cell-execution order, an unpinned environment, a CSV that was edited by hand, a random seed nobody set. None of that survives a redeploy.
The discipline that fixes this is to treat the path from raw data to served prediction as a single, versioned, re-runnable pipeline — a directed acyclic graph (DAG) of typed stages. Every edge is an artifact (a dataset, a feature table, a model file, an eval report); every node is a deterministic transform pinned to a code commit and a config. The asset you ship is not the weights file — it is the recipe that regenerates the weights file.
The payoff is concrete. If stage inputs are content-addressed, a pipeline can skip any stage whose inputs are unchanged and rerun only what is downstream of an edit — the same idea as a build system, applied to data and models. Change one feature definition and the framework knows exactly which models must be retrained and which evals must be rerun; change nothing and the whole pipeline is a cache hit.
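The skip-if-unchanged behaviour described above can be sketched in a few lines. This is a minimal illustration, not a real framework: the `Pipeline` class, the `digest` helper, and the two toy stages are all hypothetical names, and the "model" is a single number so the example stays self-contained.

```python
import hashlib
import json

def digest(obj) -> str:
    """Content hash of any JSON-serialisable value (hypothetical helper)."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

class Pipeline:
    """Toy pipeline: a stage is cached under a key built from its name,
    code version, config, and the digests of its inputs."""

    def __init__(self):
        self.cache = {}      # cache key -> stage output
        self.executed = []   # names of stages that actually ran

    def run_stage(self, name, fn, code_version, config, inputs):
        key = digest({
            "stage": name,
            "code": code_version,
            "config": config,
            "inputs": [digest(x) for x in inputs],
        })
        if key not in self.cache:          # cache miss: run and record
            self.cache[key] = fn(*inputs, **config)
            self.executed.append(name)
        return self.cache[key]

def build_features(rows, scale):
    return [[x * scale for x in row] for row in rows]

def train(features, lr):
    # Stand-in for training: a deterministic number derived from the features.
    return {"weight": round(lr * sum(map(sum, features)), 6)}

def run_all(p, raw, scale, lr):
    feats = p.run_stage("features", build_features, "v1", {"scale": scale}, [raw])
    return p.run_stage("train", train, "v1", {"lr": lr}, [feats])

raw = [[1, 2], [3, 4]]
p = Pipeline()

run_all(p, raw, scale=1.0, lr=0.1)   # cold start: both stages run
after_cold = len(p.executed)
run_all(p, raw, scale=1.0, lr=0.1)   # nothing changed: full cache hit
after_repeat = len(p.executed)
run_all(p, raw, scale=1.0, lr=0.2)   # training config changed: only train reruns
after_lr_edit = list(p.executed)
run_all(p, raw, scale=2.0, lr=0.2)   # feature definition changed: both rerun
after_feature_edit = list(p.executed)

print(p.executed)
```

Because the training stage is keyed on the digest of the feature table it receives, an upstream edit invalidates it automatically, while an edit to the training config leaves the cached features untouched.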
There is an honest tension here. Notebooks are unmatched for exploration — the friction of a full pipeline would kill the iteration speed that finds the model in the first place. The mature workflow is therefore not "no notebooks" but a clear promotion boundary: explore freely in a notebook, then graduate the winning recipe into pipeline stages before anything touches production. The maturity instrument below is exactly a tour of that boundary.
Experiment tracking & model registries
Two systems sit at the heart of any serious ML platform, and they answer two different questions.
An experiment tracker answers "what did we try, and what happened?" Every run logs its parameters, its metrics, the data snapshot hash, the git commit, and the produced artifacts. Months later you can ask "which run produced this checkpoint, on what data, with what learning rate, and what was its held-out AUC?" and get an exact answer instead of an archaeology project. The tracker is the lab notebook the literal notebook never was — searchable, comparable, immutable.
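As a rough sketch of what such a tracker stores, the record below carries exactly the fields listed above. The `Run` and `Tracker` classes, the run ids, the commit hashes, and the checkpoint names are all hypothetical; real trackers add storage, a UI, and access control on top of the same idea.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)           # attributes cannot be reassigned after logging
class Run:
    run_id: str
    params: dict
    metrics: dict
    data_hash: str                # digest of the training-data snapshot
    git_commit: str
    artifacts: tuple = ()

class Tracker:
    """In-memory, append-only store of runs (illustrative only)."""

    def __init__(self):
        self._runs = {}

    def log(self, run: Run) -> None:
        if run.run_id in self._runs:
            raise ValueError(f"run {run.run_id} already logged; runs are append-only")
        self._runs[run.run_id] = run

    def find_by_artifact(self, artifact: str) -> list:
        return [r for r in self._runs.values() if artifact in r.artifacts]

    def best(self, metric: str) -> Run:
        return max(self._runs.values(), key=lambda r: r.metrics[metric])

# Made-up snapshot digest and runs, purely for illustration.
snapshot = hashlib.sha256(b"training snapshot bytes").hexdigest()[:12]
tracker = Tracker()
tracker.log(Run("run-001", {"lr": 0.01}, {"holdout_auc": 0.842},
                snapshot, "a1b2c3d", ("ckpt-001.pt",)))
tracker.log(Run("run-002", {"lr": 0.003}, {"holdout_auc": 0.857},
                snapshot, "e4f5a6b", ("ckpt-002.pt",)))

# "Which run produced this checkpoint, on what data, with what learning rate?"
origin = tracker.find_by_artifact("ckpt-002.pt")[0]
print(origin.run_id, origin.params["lr"], origin.data_hash, origin.git_commit)
print("best run by holdout AUC:", tracker.best("holdout_auc").run_id)
```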
A model registry answers a sharper, scarier question: "which artifact is live right now, who approved it, and what do I roll back to?" The registry is not storage; it is a state machine over model versions, with explicit stages and gated transitions. A version is promoted to Production only when it passes the gate (offline evals clear thresholds, a human with the right role approves, the deployment config is pinned); every other version sits in a non-Production stage such as Staging or Archived. The registry records who pulled the lever and when. The one invariant that matters: at most one version is Production per deployment slot, and you can name it in one query. A team that cannot answer "what is live?" in seconds does not have a registry; it has a folder.
The registry is what makes a rollback a one-line operation instead of a 2 a.m. incident. Because every version's full lineage (EQ V7.1) is attached, reverting to the previous Production model is just re-pointing the serving slot at an immutable, already-validated artifact: no rebuild, no retrain, no guessing. The same machinery powers champion/challenger rollouts (§7.3) and multi-tenant serving where many model versions coexist behind one gateway.
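The state-machine view of the registry can be sketched as follows. This is a toy under stated assumptions: the `Registry` class, the three-stage set (Staging, Production, Archived), the slot name, and the approver names are illustrative, and the gate is reduced to a boolean that the caller supplies.

```python
from enum import Enum

class Stage(Enum):
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"

class Registry:
    """Toy registry enforcing at most one Production version per slot."""

    def __init__(self):
        self.stage = {}     # (slot, version) -> Stage
        self.history = {}   # slot -> versions that have been Production, oldest first
        self.audit = []     # (slot, version, new stage, approver)

    def _set(self, slot, version, stage, approver):
        self.stage[(slot, version)] = stage
        self.audit.append((slot, version, stage.value, approver))

    def register(self, slot, version):
        self.stage[(slot, version)] = Stage.STAGING

    def live(self, slot):
        """The one query that answers 'what is live?'."""
        hits = [v for (s, v), st in self.stage.items()
                if s == slot and st is Stage.PRODUCTION]
        assert len(hits) <= 1, "invariant violated: several Production versions"
        return hits[0] if hits else None

    def promote(self, slot, version, approver, gate_passed):
        if (slot, version) not in self.stage:
            raise KeyError(f"{version} is not registered for slot {slot}")
        if not gate_passed:
            raise PermissionError("promotion gate not passed")
        current = self.live(slot)
        if current == version:
            return                                   # already live, nothing to do
        if current is not None:
            self._set(slot, current, Stage.ARCHIVED, approver)
        self._set(slot, version, Stage.PRODUCTION, approver)
        self.history.setdefault(slot, []).append(version)

    def rollback(self, slot, approver):
        """Re-point the slot at the previous Production version; no retraining."""
        past = self.history.get(slot, [])
        if len(past) < 2:
            raise RuntimeError("no earlier Production version to roll back to")
        bad = past.pop()
        previous = past[-1]
        self._set(slot, bad, Stage.ARCHIVED, approver)
        self._set(slot, previous, Stage.PRODUCTION, approver)
        return previous

reg = Registry()
reg.register("fraud", "v1")
reg.promote("fraud", "v1", approver="alice", gate_passed=True)
reg.register("fraud", "v2")
reg.promote("fraud", "v2", approver="bob", gate_passed=True)
live_before = reg.live("fraud")
restored = reg.rollback("fraud", approver="carol")
live_after = reg.live("fraud")
print(live_before, "->", live_after)
```

The audit list is what lets the registry say who pulled the lever: every stage change, including the rollback, is appended with its approver.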
| System | Answers | Keyed on | Failure if absent |
|---|---|---|---|
| Experiment tracker | What did we try & what happened? | run id | Can't reproduce or compare past results |
| Model registry | What is live, who approved, roll back to what? | model version | No fast rollback; "what's in prod?" is unanswerable |
| Artifact / data store | Where are the bytes, by content hash? | content digest | Lineage breaks; artifacts mutate under you |
A pragmatic caveat: in 2026 the tracker and registry are often the same platform (MLflow, Weights & Biases, Vertex, SageMaker, and others bundle both), and for LLM/agent systems a "model version" increasingly means a tuple of base-model id, adapter or system-prompt version, and tool schema. The abstractions are unchanged; only the artifact got more interesting.
CI/CD & automated retraining
Software CI/CD tests code. ML CI/CD must also test data and models — three things change independently, and any one can break production. The mature pipeline therefore runs three layers of gates, often summarized as the ML Test Score (Breck et al.): tests for the data (schema, distributions, expected-value constraints), tests for the model (does training converge, does it beat a baseline, is it robust to perturbations), and tests for the infrastructure (can it be served, rolled back, reproduced).
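The data layer of those gates is the easiest to illustrate. The sketch below checks a batch for type violations, excessive nulls, and out-of-range values before it is allowed to reach training. The `check_batch` function, the column names, and the thresholds are assumptions made for the example, not part of the ML Test Score rubric itself.

```python
def check_batch(rows, schema, ranges, max_null_rate=0.01):
    """Return a list of human-readable failures; an empty list means the batch passes.

    rows   : list of dicts, one per record
    schema : column name -> expected Python type
    ranges : column name -> (low, high) inclusive bounds for numeric columns
    """
    failures = []
    for col, typ in schema.items():
        values = [r.get(col) for r in rows]

        # Null-rate constraint (a missing key counts as a null).
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > max_null_rate:
            failures.append(f"{col}: null rate {nulls / len(rows):.0%} too high")

        # Schema constraint: every non-null value has the expected type.
        wrong = [v for v in values if v is not None and not isinstance(v, typ)]
        if wrong:
            failures.append(f"{col}: {len(wrong)} value(s) not of type {typ.__name__}")

        # Expected-value constraint: numeric values stay inside known bounds.
        if col in ranges:
            low, high = ranges[col]
            outside = [v for v in values if isinstance(v, typ) and not low <= v <= high]
            if outside:
                failures.append(f"{col}: {len(outside)} value(s) outside [{low}, {high}]")
    return failures

schema = {"age": int, "amount": float, "country": str}
ranges = {"age": (0, 120), "amount": (0.0, 1e6)}

good = [{"age": 34, "amount": 12.5, "country": "DE"},
        {"age": 51, "amount": 80.0, "country": "FR"}]
bad = [{"age": 34, "amount": 12.5, "country": "DE"},
       {"age": -3, "amount": None, "country": "FR"}]

good_failures = check_batch(good, schema, ranges)
bad_failures = check_batch(bad, schema, ranges)
print("good batch:", good_failures or "passes")
print("bad batch :", bad_failures)
```

In a pipeline, a non-empty failure list would stop the run before training starts, which is cheaper than discovering the problem in an eval report.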
A model never goes live just because it trained. It goes live only if it clears an offline gate against the current Production model on a frozen holdout, and — for high-stakes systems — survives an online gate (a canary or A/B test on real traffic). The offline decision is the champion/challenger rule: the newly trained challenger replaces the live champion only if it is decisively better.
The same logic, applied to a stream of automatically retrained models, gives continuous training (CT): on a schedule or a trigger (§7.4), the pipeline retrains on fresh data, runs the full test suite, and proposes a challenger to the gate. Crucially, automated retraining does not mean automated deployment — the gate (and, for regulated models, a human sign-off) stays in the loop. Fully closed-loop retraining without a gate is how a feedback bug or a poisoned data window silently degrades a model over weeks.
# Champion/challenger promotion from holdout metrics (EQ V7.3).
import numpy as np

def promote(M_champ, M_chal, delta, guardrails):
    # guardrails: list of (name, value, floor, higher_is_better)
    metric_ok = (M_chal - M_champ) > delta
    breaches = []
    for name, val, floor, higher in guardrails:
        ok = (val >= floor) if higher else (val <= floor)
        if not ok:
            breaches.append(name)
    decision = metric_ok and not breaches
    return decision, metric_ok, breaches

# Frozen-holdout AUC for both models; margin must beat metric noise.
M_champ, M_chal, delta = 0.842, 0.857, 0.005
guardrails = [  # (name, challenger value, floor, higher_is_better)
    ("p99_latency_ms",  180.0, 200.0, False),  # must be <= 200ms -> OK
    ("fairness_gap",    0.030, 0.050, False),  # must be <= 0.05  -> OK
    ("calibration_ece", 0.021, 0.040, False),  # must be <= 0.04  -> OK
]

dec, mok, breaches = promote(M_champ, M_chal, delta, guardrails)
print(f"champion AUC      : {M_champ:.3f}")
print(f"challenger AUC    : {M_chal:.3f} ({M_chal-M_champ:+.3f}, margin needed {delta})")
print(f"beats margin?     : {mok}")
print(f"guardrail breaches: {breaches if breaches else 'none'}")
print(f"\nDECISION: {'PROMOTE challenger' if dec else 'KEEP champion'}")

# Counterfactual: same AUC win, but latency now blows the guardrail.
g2 = guardrails[:]
g2[0] = ("p99_latency_ms", 240.0, 200.0, False)
print("if p99 latency were 240ms ->",
      "PROMOTE" if promote(M_champ, M_chal, delta, g2)[0] else "KEEP champion (guardrail)")
Monitoring, lineage & reproducibility
A deployed model decays even though its weights never change, because the world the weights describe keeps moving. Two distinct decays matter, and confusing them is a classic mistake:
- Data drift (covariate shift). The input distribution \(P(x)\) moves — a new traffic source, a seasonal effect, an upstream feature that started arriving null. The model is still "correct," but it is now answering questions about a population it was not trained on.
- Concept drift. The relationship \(P(y \mid x)\) itself changes — fraud tactics evolve, user tastes shift, a competitor changes the market. Even on identical inputs, the right answer is now different. Only concept drift necessarily degrades accuracy; data drift may or may not.
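Labels arrive late or never, so you cannot always watch accuracy directly. The first line of defence is therefore an unsupervised drift signal on the inputs and the predictions. The workhorse is the Population Stability Index (PSI), which buckets a feature or score and then compares the baseline (training) distribution against a recent production window, bucket by bucket:
\[ \mathrm{PSI} = \sum_{i=1}^{B} (a_i - e_i)\,\ln\frac{a_i}{e_i} \]
where \(e_i\) is the fraction of the baseline falling in bucket \(i\), \(a_i\) is the fraction of the production window in the same bucket, and \(B\) is the number of buckets. PSI is zero when the two distributions match and grows as they diverge.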
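The PSI comparison can be sketched as follows, using the standard definition: the sum over buckets of (production fraction minus baseline fraction) times the log of their ratio. The `psi` function name, the choice of ten quantile buckets taken from the baseline, and the small `eps` floor that avoids a log of zero are assumptions of this sketch; the synthetic Gaussian data is made up for illustration.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample.

    Buckets are quantiles of the baseline, so each holds roughly 1/n_bins of it.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bucket edges; values beyond them fall into the first or last bucket.
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")

    e = np.bincount(e_idx, minlength=n_bins) / len(expected)
    a = np.bincount(a_idx, minlength=n_bins) / len(actual)

    # Floor empty buckets so the logarithm stays finite.
    e = np.clip(e, eps, None)
    a = np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
same = rng.normal(0.0, 1.0, 10_000)       # production window, no drift
shifted = rng.normal(1.0, 1.0, 10_000)    # production window, mean moved by 1 sd

psi_same = psi(baseline, same)
psi_shifted = psi(baseline, shifted)
print(f"PSI, no drift : {psi_same:.4f}")
print(f"PSI, shifted  : {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees and are worth calibrating per feature.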
Drift on its own is only a warning. The decisive signal, when labels eventually land, is a service-level objective (SLO) on the live metric, with an alert that fires on a sustained breach rather than a single bad point — one noisy day is not an incident, a week below the floor is.
# Model-monitoring SLO-breach flag from a daily metric stream.
import numpy as np

rng = np.random.default_rng(11)

# 30 days of live accuracy: stable, then a drift-driven slide after day 18.
days = np.arange(30)
base = np.where(days < 18, 0.91, 0.91 - 0.006 * (days - 18))  # slow decay
acc = np.clip(base + rng.normal(0, 0.012, 30), 0, 1)          # daily noise

SLO = 0.88     # contractual floor on accuracy
WINDOW = 5     # smooth over a rolling window (ignore one-day noise)
N_BREACH = 3   # alert only after this many consecutive sub-SLO smoothed days

roll = np.convolve(acc, np.ones(WINDOW) / WINDOW, mode="valid")  # len 30-WINDOW+1
below = roll < SLO

# Track the longest run of consecutive sub-SLO smoothed days,
# and the calendar day on which the alert would first fire.
run = longest = 0
fire_day = None
for i, b in enumerate(below):
    run = run + 1 if b else 0
    if run >= N_BREACH and fire_day is None:
        fire_day = i + WINDOW - 1  # map rolling index back to a calendar day
    longest = max(longest, run)

print(f"SLO floor          : {SLO:.2f}   rolling window: {WINDOW}d")
print(f"min rolling acc    : {roll.min():.3f}  (raw min {acc.min():.3f})")
print(f"longest breach run : {longest} day(s)   threshold: {N_BREACH}")
print(f"BREACH ALERT       : {'FIRE on day ' + str(fire_day) if fire_day is not None else 'none'}")

# plot_xy is not defined in this listing; it is assumed to be a plotting helper
# supplied by the host environment, so it is only called when it exists.
if "plot_xy" in globals():
    plot_xy(np.arange(WINDOW - 1, 30), roll)  # the smoothed curve crossing the SLO floor
Behind every alert sits lineage: the graph that connects a live prediction back through the model version, the training run, the data snapshot, and the feature code that produced it (EQ V7.1). When an incident hits, lineage answers the only questions that matter at 2 a.m. — which model is responsible, what was it trained on, what changed since it was clean, and what do we roll back to? A monitor without lineage tells you the patient has a fever; lineage tells you why.
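A lineage graph of this kind is just a mapping from each artifact to the artifacts it was built from, and the incident questions become graph walks. The sketch below is illustrative only: the node ids, the `upstream` helper, and the two model versions are hypothetical.

```python
# Each node maps to the artifacts it was directly produced from.
lineage = {
    "prediction:req-123": ["model:fraud-v2"],
    "model:fraud-v2": ["run:run-002"],
    "model:fraud-v1": ["run:run-001"],
    "run:run-002": ["data:snapshot-b", "code:features@e4f5a6b"],
    "run:run-001": ["data:snapshot-a", "code:features@e4f5a6b"],
}

def upstream(node, graph):
    """Every ancestor of `node`, in the order a depth-first walk discovers them."""
    seen, order, stack = set(), [], [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                stack.append(parent)
    return order

# "Which model is responsible, and what was it trained on?"
trace = upstream("prediction:req-123", lineage)

# "What changed since it was clean?" Compare the live version with the last good one.
changed = sorted(set(upstream("model:fraud-v2", lineage))
                 - set(upstream("model:fraud-v1", lineage)))

print("trace  :", trace)
print("changed:", changed)
```

In this made-up example the feature code is shared by both versions, so the diff points at the new training run and its data snapshot, which is where an investigation would start.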
Model risk management & governance
Everything so far is engineering. Governance is the layer that makes those engineering controls accountable: who is allowed to deploy, who signed off, what evidence exists, and what happens when the model causes harm. In regulated industries this is not optional. The canonical reference is the US Federal Reserve / OCC supervisory letter SR 11-7 (2011), "Guidance on Model Risk Management", which defines model risk as the potential for adverse consequences from decisions based on incorrect or misused models, and prescribes three controls that map almost one-to-one onto good MLOps: sound model development, implementation, and use; independent validation built on "effective challenge"; and governance, policies, and controls.
This regulatory framing has since been generalized far beyond banking. The EU AI Act (in force from 2024, with high-risk obligations phasing in through 2026–2027) imposes risk-tiered duties — risk management systems, data governance, logging, human oversight, and post-market monitoring — that are recognisably the same controls. The NIST AI Risk Management Framework (2023) and ISO/IEC 42001 (2023, the first AI management-system standard) give voluntary but increasingly expected scaffolding. The through-line across all of them is a small set of governance artifacts every mature ML organisation now maintains:
| Artifact | Question it answers | Lineage to MLOps |
|---|---|---|
| Model inventory | Which models exist, who owns each, what is their risk tier? | registry (§7.2) |
| Model card / documentation | Intended use, training data, metrics, limitations, fairness | tracker + lineage |
| Validation report | Independent challenge: does it work, where does it fail? | eval gate (§7.3) |
| Sign-off / approval record | Who authorized production, on what evidence, when? | registry transition |
| Monitoring & incident log | How is it behaving live; what went wrong and when? | monitors (§7.4) |
Governance can calcify into theatre. The honest tension in 2026: heavyweight model-risk processes designed for slow-moving credit models fit awkwardly onto fast-iterating ML and especially onto LLM/agent systems, where the "model" is a prompt-plus-tools assembly that changes weekly and whose failure modes (hallucination, prompt injection, jailbreaks) are not what SR 11-7 imagined. Two failure modes bracket the debate: too little governance ships unvalidated models into high-stakes decisions; too much produces a compliance pantomime where teams generate documents nobody reads to satisfy a checklist, while real risk goes unmonitored. The defensible middle is risk-tiered governance: match the weight of the controls to the stakes of the decision, automate the evidence-gathering so documentation is a by-product of the pipeline rather than a separate chore, and keep "effective challenge" genuinely effective.
You now have the operational backbone — pipelines, registries, monitoring, and the governance that makes a model an accountable asset. That closes the Model Validation & Risk track. From here the manual turns to the model itself: the LLM Field Manual opens with foundations — tokens, embeddings, and the next-token objective that everything in production is ultimately serving.
References
- Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems.
- Board of Governors of the Federal Reserve System & OCC (2011). SR 11-7: Guidance on Model Risk Management.
- Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.
- Kreuzberger, D., Kühl, N. & Hirschl, S. (2022). Machine Learning Operations (MLOps): Overview, Definition, and Architecture.
- National Institute of Standards and Technology (2023). AI Risk Management Framework (AI RMF 1.0).
- European Union (2024). Regulation (EU) 2024/1689 — the Artificial Intelligence Act.