ROC curves & AUC
A binary classifier that emits a score (a probability, a logit, a credit grade) does not commit to a decision until you pick a threshold. Sweep the threshold from high to low and you trace out the full menu of operating points the model can offer. The Receiver Operating Characteristic curve plots two of them against each other: the true positive rate (recall, sensitivity) on the vertical axis and the false positive rate (1 − specificity) on the horizontal.
The single-number summary is the Area Under the ROC Curve (AUC, or AUROC). Its value is not a coincidence of geometry — it equals a probability:
Computing AUC by sweeping thresholds is the slow way; the pair-counting identity is the fast and exact way. Sort by score, walk the list, and accumulate how many negatives each positive outranks — \(O(n \log n)\), no integration error.
# ROC points and AUC from scores, two ways: threshold sweep vs pair-counting.
import numpy as np
rng = np.random.default_rng(0)
# 600 negatives ~ N(0,1), 400 positives ~ N(1.1,1): overlapping but separable.
neg = rng.normal(0.0, 1.0, 600)
pos = rng.normal(1.1, 1.0, 400)
scores = np.concatenate([neg, pos])
y = np.concatenate([np.zeros(600), np.ones(400)]).astype(int)
# --- ROC points by sweeping every distinct score as a threshold (EQ V4.1) ---
order = np.argsort(-scores) # high score first
ys = y[order]
P, Nn = ys.sum(), (1 - ys).sum()
tpr = np.cumsum(ys) / P # caught positives so far
fpr = np.cumsum(1 - ys) / Nn # false alarms so far
auc_curve = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2) # trapezoid area
# --- AUC by the Mann-Whitney pair-counting identity (EQ V4.2) ---
ranks = scores.argsort().argsort() + 1 # average-free rank of each score
auc_rank = (ranks[y == 1].sum() - P*(P+1)/2) / (P*Nn)
print(f"positives: {int(P)} negatives: {int(Nn)}")
print(f"AUC (threshold sweep / trapezoid) : {auc_curve:.4f}")
print(f"AUC (Mann-Whitney pair counting) : {auc_rank:.4f}")
print(f"the two agree to rounding : {abs(auc_curve-auc_rank) < 1e-3}")
plot_xy(fpr, tpr) # the ROC curve itself
Precision–recall curves
ROC's prevalence-invariance is a feature when you want to judge a ranker in the abstract — and a trap when you deploy it. On a 1%-positive fraud problem, a model can post a gorgeous 0.95 AUC and still flag fifty false alarms for every real fraud, because the false-positive rate is measured against the vast negative pool. The precision–recall curve tells the story ROC hides: it plots precision (of the cases I flagged, what fraction were right?) against recall (of the real positives, what fraction did I catch?).
The practical rule, widely repeated since Saito & Rehmsmeier's 2015 study and still the consensus in 2026: use ROC/AUC to compare rankers and report discrimination; use PR/AP when positives are rare and the cost of false alarms is concrete. A change that is invisible on ROC can be dramatic on PR precisely because the rare class is where the action is. The two are not rivals — they answer different questions about the same ranking.
"0.97 AUC" is not a deployment guarantee. AUC conditions on the true class, so it cannot see that your negatives outnumber positives 100-to-1. Two models with identical AUC can have wildly different false-alarm volumes at any usable operating point. Before you ship a rare-event detector, look at the PR curve and the absolute counts at your chosen threshold — precision, not AUC, is what your reviewers and on-call team will actually feel.
# Same ranking, two prevalences: AUC barely moves, PR-AUC collapses.
import numpy as np
rng = np.random.default_rng(2)
def auc_ap(pos, neg):
s = np.concatenate([pos, neg])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]).astype(int)
order = np.argsort(-s); ys = y[order]
P, N = ys.sum(), (1 - ys).sum()
tpr = np.cumsum(ys) / P
fpr = np.cumsum(1 - ys) / N
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2) # trapezoid area
prec = np.cumsum(ys) / np.arange(1, len(ys) + 1) # precision at each cutoff
rec = tpr
ap = np.sum(np.diff(np.concatenate([[0], rec])) * prec) # area under PR
return auc, ap, P / (P + N)
# Identical separability; only the negative pool grows.
mu = 1.3
for n_neg in (500, 5000, 50000):
pos = rng.normal(mu, 1.0, 500)
neg = rng.normal(0.0, 1.0, n_neg)
auc, ap, pi = auc_ap(pos, neg)
print(f"prevalence {100*pi:5.1f}% -> AUC {auc:.3f} PR-AUC {ap:.3f} baseline {pi:.3f}")
print("\nAUC is nearly constant (it conditions on the true class);")
print("PR-AUC sinks toward the shrinking baseline as positives get rarer.")
PR-AUC is summarized two ways and they differ: Average Precision (a step-wise sum, the scikit-learn default) and the trapezoidal area (which can be optimistic because linear interpolation between PR points is not achievable). Report which one you mean — and never compare an AP from one library to a trapezoidal PR-AUC from another.
The KS statistic & Gini (credit scoring)
Credit risk has its own ranking dialect, inherited from decades of scorecard practice. Two numbers dominate model documentation in banking: the Kolmogorov–Smirnov statistic and the Gini coefficient. Both measure the same thing AUC does — how well the score separates good from bad — but in coordinates a risk committee reads fluently.
The KS statistic is the largest gap between the two cumulative distributions of the score: the cumulative share of positives (bads) versus the cumulative share of negatives (goods), as you walk the score from one end to the other.
The Gini coefficient is just AUC rescaled to put a random model at zero and a perfect model at one:
# KS statistic and Gini from two score distributions (goods vs bads).
import numpy as np
rng = np.random.default_rng(5)
bads = rng.normal(0.65, 0.18, 800).clip(0, 1) # higher score = riskier
goods = rng.normal(0.40, 0.18, 4000).clip(0, 1)
# KS: walk a common grid of thresholds, compare cumulative shares (EQ V4.4).
grid = np.linspace(0, 1, 501)
F_bad = np.searchsorted(np.sort(bads), grid, side="right") / len(bads)
F_good = np.searchsorted(np.sort(goods), grid, side="right") / len(goods)
gap = np.abs(F_bad - F_good)
ks = gap.max()
ks_at = grid[gap.argmax()]
# AUC by pair-counting -> Gini = 2*AUC - 1 (EQ V4.2, V4.5).
s = np.concatenate([bads, goods])
y = np.concatenate([np.ones(len(bads)), np.zeros(len(goods))])
ranks = s.argsort().argsort() + 1
P, N = len(bads), len(goods)
auc = (ranks[y == 1].sum() - P*(P+1)/2) / (P*N)
gini = 2*auc - 1
print(f"AUC = {auc:.4f}")
print(f"Gini = 2*AUC - 1 = {gini:.4f}")
print(f"KS = {ks:.4f} (max gap at score ~ {ks_at:.2f})")
print("\nKS is the widest separation of the two cumulative curves;")
print("Gini is AUC stretched so chance=0 and perfect=1.")
plot_xy(grid, gap) # the KS gap as a function of cutoff
A note on PSI. The Population Stability Index — the workhorse for detecting that today's score distribution has drifted from the development sample — lives in the same credit-scoring toolbox and shares KS's distribution-comparison spirit, but it answers a different question: not "how well does the score separate good from bad?" but "has the input or score population shifted since the model was built?" PSI is therefore a stability and drift diagnostic, and Chapter 05 develops it in full alongside characteristic-stability and drift detection. Here it is enough to know that KS/Gini measure discrimination, PSI measures population shift, and a healthy KS today says nothing about whether PSI has quietly crept past its alarm threshold.
Calibration — reliability curves & Brier score
Everything so far judged ordering. None of it cares about the actual value of the score, because you can apply any strictly increasing transform — square it, pass it through a sigmoid, raise it to the tenth power — and AUC, KS, and Gini are all unchanged. But a score that drives a decision usually has to mean something: an expected-loss calculation needs a real probability of default, a triage tool needs to say "this patient has a 12% chance," not merely "this patient ranks 47th." Calibration is the property that closes the gap between the number and the world.
You inspect calibration with a reliability curve: bin the predicted probabilities, and for each bin plot the mean prediction against the observed positive frequency. Perfect calibration is the 45° diagonal. A curve that sags below it means the model is over-confident (it says 0.9 but only 0.7 actually happen); a curve that bows above means it is under-confident. The classic shapes have classic causes: modern neural nets and boosted trees tend to over-confidence, naive Bayes pushes probabilities toward the extremes, and a well-regularized logistic regression is often calibrated almost for free.
The standard scalar summary is the Brier score — the mean squared error of the probabilities themselves:
When a model ranks well but is miscalibrated, you do not retrain — you recalibrate the output with a cheap monotone post-processor fit on held-out data: Platt scaling (a one-parameter logistic on the scores) or isotonic regression (a free-form non-decreasing step function). Both preserve the ranking exactly — AUC, KS, and Gini are untouched — while bending the reliability curve back onto the diagonal. Isotonic is more flexible but needs more data and can overfit; Platt is robust on small validation sets. This is the standard fix established by Niculescu-Mizil & Caruana and unchanged in practice today.
# Discrimination vs calibration are orthogonal: same ranking, three Brier scores.
import numpy as np
rng = np.random.default_rng(11)
n = 4000
p_true = rng.beta(2, 5, n) # the genuine probabilities
y = (rng.random(n) < p_true).astype(int) # outcomes drawn from them
def brier(p): return np.mean((p - y) ** 2)
def auc(p):
r = p.argsort().argsort() + 1
P, N = y.sum(), (1 - y).sum()
return (r[y == 1].sum() - P*(P+1)/2) / (P*N)
def warp(p, gamma): # scale the logit -> a monotone re-map
logit = np.log(p / (1 - p)) * gamma # gamma>1 sharpens, gamma<1 softens
return 1 / (1 + np.exp(-logit))
calibrated = p_true # honest: gamma = 1
overconf = warp(p_true, 2.2) # over-confident: probs pushed to extremes
underconf = warp(p_true, 0.45) # under-confident: probs squashed to 0.5
print(f"{'model':<16}{'AUC':>8}{'Brier':>9}")
for name, p in [("calibrated", calibrated),
("over-confident", overconf),
("under-confident", underconf)]:
print(f"{name:<16}{auc(p):>8.3f}{brier(p):>9.4f}")
print("\nAUC is IDENTICAL for all three -- warp() is monotone, so the ranking")
print("never changes. Brier separates them: only the calibrated model is honest")
print("about its probabilities. Discrimination cannot see what calibration measures.")
Cutoff selection by cost
The ROC curve hands you every operating point the model can reach; it does not tell you which one to stand on. The default of 0.5 is almost always wrong — it is correct only when classes are balanced and the two error types cost the same, which is to say almost never. The right threshold is the one that minimizes expected cost, and that depends on numbers the model never sees: the price of a false positive, the price of a false negative, and the prevalence.
For a model that emits a true probability \(\hat{p}\), the cost-minimizing rule has a clean closed form. Flagging a case is worth it when its expected cost of being positive falls below the expected cost of being negative, which rearranges to a single threshold on the probability:
Two things follow. First, the whole pipeline composes: rank well (§4.1–4.3), calibrate the probabilities (§4.4), then place the cutoff by cost (§4.5). Skip the middle step and the last one is meaningless. Second, when costs are uncertain — as they usually are — do not pick a single \(t^{\star}\); sweep the cost ratio and present the operating frontier, so the business owner can see the trade and choose with open eyes rather than inherit a hidden 0.5.
# Cost-based cutoff: sweep thresholds, find the minimum-cost operating point.
import numpy as np
rng = np.random.default_rng(9)
n = 6000
y = (rng.random(n) < 0.15).astype(int) # 15% positive
p = np.clip(0.15 + 0.55*y + rng.normal(0, 0.20, n), 0.001, 0.999) # calibrated-ish
c_fp, c_fn = 1.0, 9.0 # a miss costs 9x a false alarm
ts = np.linspace(0.01, 0.99, 99)
costs = []
for t in ts:
pred = (p >= t).astype(int)
fp = int(((pred == 1) & (y == 0)).sum())
fn = int(((pred == 0) & (y == 1)).sum())
costs.append(c_fp*fp + c_fn*fn)
costs = np.array(costs)
t_star_grid = ts[costs.argmin()] # empirically optimal cutoff
t_star_formula = c_fp / (c_fp + c_fn) # EQ V4.9 closed form
print(f"closed-form t* = c_FP/(c_FP+c_FN) = {t_star_formula:.3f}")
print(f"grid-search t* (min cost) = {t_star_grid:.3f}")
print(f"cost at t=0.50 (naive default) = {costs[np.argmin(np.abs(ts-0.5))]:.0f}")
print(f"cost at t* (cost-optimal) = {costs.min():.0f}")
print("\nThe default 0.5 leaves money on the table whenever costs are asymmetric.")
plot_xy(ts, costs) # the cost-vs-threshold curve (U-shaped)
These metrics all assume the world stays still — the population you scored yesterday is the population you score today. It never does. Chapter 05 turns to stability and drift: the Population Stability Index (PSI) and characteristic stability that catch a shifting input distribution, covariate and concept drift, and the monitoring that tells you when a once-excellent AUC has quietly stopped describing reality.
References
- Fawcett, T. (2006). An Introduction to ROC Analysis.
- Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.
- Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning.
- Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets.
- Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability.
- Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.
- Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.