Ranking, Calibration, ROC, KS & PSI

4.1

ROC curves & AUC

A binary classifier that emits a score (a probability, a logit, a credit grade) does not commit to a decision until you pick a threshold. Sweep the threshold from high to low and you trace out the full menu of operating points the model can offer. The Receiver Operating Characteristic curve plots two of them against each other: the true positive rate (recall, sensitivity) on the vertical axis and the false positive rate (1 − specificity) on the horizontal.

EQ V4.1 — THE TWO RATES OF THE ROC AXES $$ \mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)}, \qquad \mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t) + \mathrm{TN}(t)} $$

At threshold $t$, everything scoring $\ge t$ is called positive. TPR is the fraction of actual positives the model catches; FPR is the fraction of actual negatives it wrongly flags. As $t \to \infty$ you predict nothing positive and sit at $(0,0)$; as $t \to -\infty$ you predict everything positive and sit at $(1,1)$. Crucially, both rates condition on the true class, so the ROC curve is invariant to class prevalence. A 1%-positive fraud set and a balanced one produce the same ROC (up to sampling noise) when the score distributions within each class are the same, which is exactly why it is the standard summary of a model's discrimination.
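To make the two rates concrete, here is a minimal sketch that evaluates EQ V4.1 at a single threshold and then repeats every negative ten times to show that neither rate moves. The helper name `rates_at` and the toy scores are illustrative assumptions, not part of the chapter's runnable code.

```python
def rates_at(scores, labels, t):
    """Return (TPR, FPR) when everything scoring >= t is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Toy data (made up): 4 positives, 6 negatives, i.e. 40% prevalence.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
balanced = rates_at(scores, labels, t=0.5)

# Repeat every negative 10 times: prevalence drops to 4 / 64, about 6%.
pos = [(s, y) for s, y in zip(scores, labels) if y == 1]
neg = [(s, y) for s, y in zip(scores, labels) if y == 0]
rare = pos + neg * 10
rare_rates = rates_at([s for s, _ in rare], [y for _, y in rare], t=0.5)

print("40% prevalence (TPR, FPR):", balanced)
print(" 6% prevalence (TPR, FPR):", rare_rates)
```

Both lines print the same pair, because each rate is a ratio taken within one true class and the class sizes cancel.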

The single-number summary is the Area Under the ROC Curve (AUC, or AUROC). Its value is not a coincidence of geometry — it equals a probability:

EQ V4.2 — AUC AS A RANKING PROBABILITY $$ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\,\big(\mathrm{FPR}^{-1}(u)\big)\,du \;=\; \Pr\big(\,s(X^{+}) > s(X^{-})\,\big) + \tfrac{1}{2}\Pr\big(\,s(X^{+}) = s(X^{-})\,\big) $$

Draw one random positive and one random negative; AUC is the probability the model scores the positive higher (ties split evenly). This is the Wilcoxon–Mann–Whitney statistic. AUC = 1.0 is a perfect ranker, 0.5 is a coin flip, and below 0.5 means your score is backwards (flip its sign and you are above 0.5 again). Because it asks only "is the positive ranked above the negative?", AUC measures ordering and is completely blind to whether the scores are calibrated probabilities — the gap §4.4 exists to fill.

Computing AUC by sweeping thresholds is the slow way; the pair-counting identity is the fast and exact way. Sort by score, walk the list, and accumulate how many negatives each positive outranks — $O(n \log n)$, no integration error.
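As a cross-check on EQ V4.2, the sketch below counts pairs by brute force (ties worth one half) and compares the result with the rank-sum shortcut using tie-averaged ranks. The function names and toy data are illustrative assumptions; the brute-force loop costs $O(P \cdot N)$ and is meant only for verification.

```python
def auc_pairs(pos, neg):
    """EQ V4.2 by brute force: a win counts 1, a tie counts 1/2."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def auc_midrank(pos, neg):
    """Same quantity from the rank sum of the positives, O(n log n)."""
    pooled = sorted(pos + neg)
    rank = {}                      # distinct score -> average 1-based rank
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    n_pos, n_neg = len(pos), len(neg)
    rank_sum = sum(rank[p] for p in pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

pos = [0.9, 0.7, 0.7, 0.4]         # toy scores with a three-way tie at 0.7
neg = [0.8, 0.7, 0.3, 0.2, 0.1]
print(auc_pairs(pos, neg), auc_midrank(pos, neg))
```

Averaging the ranks inside a tied group is what makes the shortcut agree with the one-half tie rule; plain ordinal ranks only match when no positive ties with a negative.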

A perfect classifier assigns every positive a higher score than every negative. Using EQ V4.2 (AUC = probability a random positive outranks a random negative), what is its AUC?

If every positive outranks every negative, then for every positive–negative pair the positive wins: the concordant fraction is $1$ and there are no ties, so $\mathrm{AUC} = \Pr(s(X^+) > s(X^-)) = $ 1.0. The ROC curve hugs the top-left corner, passing through $(0,1)$.

PYTHON · RUNNABLE IN-BROWSER

# ROC points and AUC from scores, two ways: threshold sweep vs pair-counting.
import numpy as np
rng = np.random.default_rng(0)

# 600 negatives ~ N(0,1), 400 positives ~ N(1.1,1): overlapping but separable.
neg = rng.normal(0.0, 1.0, 600)
pos = rng.normal(1.1, 1.0, 400)
scores = np.concatenate([neg, pos])
y      = np.concatenate([np.zeros(600), np.ones(400)]).astype(int)

# --- ROC points by sweeping every distinct score as a threshold (EQ V4.1) ---
order = np.argsort(-scores)                 # high score first
ys = y[order]
P, Nn = ys.sum(), (1 - ys).sum()
tpr = np.cumsum(ys) / P                      # caught positives so far
fpr = np.cumsum(1 - ys) / Nn                 # false alarms so far
auc_curve = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid area

# --- AUC by the Mann-Whitney pair-counting identity (EQ V4.2) ---
ranks = scores.argsort().argsort() + 1      # ordinal ranks, no tie-averaging (fine here: scores are continuous)
auc_rank = (ranks[y == 1].sum() - P*(P+1)/2) / (P*Nn)

print(f"positives: {int(P)}   negatives: {int(Nn)}")
print(f"AUC (threshold sweep / trapezoid) : {auc_curve:.4f}")
print(f"AUC (Mann-Whitney pair counting)  : {auc_rank:.4f}")
print(f"the two agree to rounding         : {abs(auc_curve-auc_rank) < 1e-3}")
plot_xy(fpr, tpr)                            # the ROC curve itself


INSTRUMENT V4.1 · ROC / PR / KS EXPLORER · DRAG THE TWO CLASS DISTRIBUTIONS · EQ V4.1–V4.2

Controls: class separation (Δμ, shown at 1.40), positive spread (σ⁺, shown at 1.00), prevalence (% positive, shown at 40%), view. Readouts: AUC (AUROC), KS statistic, average precision (PR-AUC).

The two bell curves are the score distributions of negatives and positives. Slide SEPARATION to zero and the two distributions coincide, so the ROC collapses onto the diagonal: AUC → 0.5, a useless ranker. Pull them apart and the ROC bows toward the top-left corner. The KS gap marked on the ROC view is the largest vertical distance between TPR and FPR (§4.3). Switch to PRECISION–RECALL and drop PREVALENCE to 2% to watch the lesson of §4.2: the ROC barely moves, but the PR curve collapses, because precision pays the rent on rarity.

4.2

Precision–recall curves

ROC's prevalence-invariance is a feature when you want to judge a ranker in the abstract — and a trap when you deploy it. On a 1%-positive fraud problem, a model can post a gorgeous 0.95 AUC and still flag fifty false alarms for every real fraud, because the false-positive rate is measured against the vast negative pool. The precision–recall curve tells the story ROC hides: it plots precision (of the cases I flagged, what fraction were right?) against recall (of the real positives, what fraction did I catch?).

EQ V4.3 — PRECISION, RECALL, AND THE PR BASELINE $$ \mathrm{Precision}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)}, \qquad \mathrm{Recall}(t) = \mathrm{TPR}(t), \qquad \text{baseline} = \frac{P}{P + N} = \pi $$

Precision has $\mathrm{FP}$ in its denominator, and $\mathrm{FP}$ scales with the size of the negative pool — so precision is acutely sensitive to prevalence in a way TPR and FPR are not. The no-skill baseline of a PR curve is the positive rate $\pi$ (a random classifier holds precision $\pi$ at every recall), versus the fixed diagonal at AUC = 0.5 for ROC. The area under the PR curve is summarized by Average Precision (AP), the precision averaged over the recall levels at which a new positive is retrieved.
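The prevalence sensitivity can be made explicit by rewriting precision in terms of an ROC operating point: $\mathrm{Precision} = \pi\,\mathrm{TPR} \,/\, \big(\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}\big)$, which follows from $\mathrm{TP} = \pi n\,\mathrm{TPR}$ and $\mathrm{FP} = (1-\pi)\,n\,\mathrm{FPR}$. A small sketch, with an illustrative function name and a made-up operating point:

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by an ROC operating point at a given prevalence."""
    true_alerts = prevalence * tpr            # share of all cases that are TP
    false_alerts = (1.0 - prevalence) * fpr   # share of all cases that are FP
    return true_alerts / (true_alerts + false_alerts)

# One fixed operating point (TPR 0.90, FPR 0.05) at three prevalences.
for pi in (0.50, 0.10, 0.01):
    prec = precision_from_rates(0.90, 0.05, pi)
    print(f"prevalence {pi:4.0%} -> precision {prec:.3f}")
```

The ROC point never changes, yet precision falls from about 0.95 at 50% prevalence to about 0.15 at 1%.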

The practical rule, widely repeated since Saito & Rehmsmeier's 2015 study and still the consensus in 2026: use ROC/AUC to compare rankers and report discrimination; use PR/AP when positives are rare and the cost of false alarms is concrete. A change that is invisible on ROC can be dramatic on PR precisely because the rare class is where the action is. The two are not rivals — they answer different questions about the same ranking.

THE PREVALENCE TRAP

"0.97 AUC" is not a deployment guarantee. AUC conditions on the true class, so it cannot see that your negatives outnumber positives 100-to-1. Two models with identical AUC can have wildly different false-alarm volumes at any usable operating point. Before you ship a rare-event detector, look at the PR curve and the absolute counts at your chosen threshold — precision, not AUC, is what your reviewers and on-call team will actually feel.

At your chosen threshold the model flags 50 cases as positive; 30 of them are truly positive ($\mathrm{TP} = 30$, $\mathrm{FP} = 20$). What is the precision, $\dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$?

Of the 50 flagged, 30 are correct and 20 are false alarms: $\mathrm{Precision} = \dfrac{30}{30+20} = \dfrac{30}{50} = $ 0.6. Sixty percent of your alerts are real — a number ROC's two rates never put in front of you.

PYTHON · RUNNABLE IN-BROWSER

# Same ranking, two prevalences: AUC barely moves, PR-AUC collapses.
import numpy as np
rng = np.random.default_rng(2)

def auc_ap(pos, neg):
    s = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]).astype(int)
    order = np.argsort(-s); ys = y[order]
    P, N = ys.sum(), (1 - ys).sum()
    tpr = np.cumsum(ys) / P
    fpr = np.cumsum(1 - ys) / N
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid area
    prec = np.cumsum(ys) / np.arange(1, len(ys) + 1)   # precision at each cutoff
    rec  = tpr
    ap = np.sum(np.diff(np.concatenate([[0], rec])) * prec)  # area under PR
    return auc, ap, P / (P + N)

# Identical separability; only the negative pool grows.
mu = 1.3
for n_neg in (500, 5000, 50000):
    pos = rng.normal(mu, 1.0, 500)
    neg = rng.normal(0.0, 1.0, n_neg)
    auc, ap, pi = auc_ap(pos, neg)
    print(f"prevalence {100*pi:5.1f}%  ->  AUC {auc:.3f}   PR-AUC {ap:.3f}   baseline {pi:.3f}")

print("\nAUC is nearly constant (it conditions on the true class);")
print("PR-AUC sinks toward the shrinking baseline as positives get rarer.")


PR-AUC is summarized two ways and they differ: Average Precision (a step-wise sum, the scikit-learn default) and the trapezoidal area (which can be optimistic because linear interpolation between PR points is not achievable). Report which one you mean — and never compare an AP from one library to a trapezoidal PR-AUC from another.
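The gap between the two summaries is easiest to see when scores are coarse. The sketch below is a toy case under stated assumptions: only two distinct score values, described by made-up per-group counts, and a trapezoid that starts at recall 0 with the first point's precision (one common convention, not the only one). Function names are illustrative.

```python
def pr_points(groups):
    """(recall, precision) after each tied-score group, highest score first.

    groups: list of (n_pos, n_neg) counts, one entry per distinct score.
    """
    total_pos = sum(p for p, _ in groups)
    pts, tp, flagged = [], 0, 0
    for n_pos, n_neg in groups:
        tp += n_pos
        flagged += n_pos + n_neg
        pts.append((tp / total_pos, tp / flagged))
    return pts

def average_precision(pts):
    """Step-wise sum: precision times the recall gained at that cutoff."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pts:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def trapezoid_pr_auc(pts):
    """Area under straight lines drawn between successive PR points."""
    pts = [(0.0, pts[0][1])] + pts          # assumed starting convention
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

# Top score: 5 positives, 0 negatives. Lower score: 5 positives, 90 negatives.
pts = pr_points([(5, 0), (5, 90)])
print(f"AP        = {average_precision(pts):.3f}")
print(f"trapezoid = {trapezoid_pr_auc(pts):.3f}")
```

The straight line between (0.5, 1.0) and (1.0, 0.1) credits the model with precisions it cannot reach at intermediate recalls, which is why the trapezoid comes out higher here.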

4.3

The KS statistic & Gini (credit scoring)

Credit risk has its own ranking dialect, inherited from decades of scorecard practice. Two numbers dominate model documentation in banking: the Kolmogorov–Smirnov statistic and the Gini coefficient. Both measure the same thing AUC does — how well the score separates good from bad — but in coordinates a risk committee reads fluently.

The KS statistic is the largest gap between the two cumulative distributions of the score: the cumulative share of positives (bads) versus the cumulative share of negatives (goods), as you walk the score from one end to the other.

EQ V4.4 — THE KS STATISTIC $$ \mathrm{KS} = \max_{t}\;\big|\, F_{+}(t) - F_{-}(t)\,\big| \;=\; \max_{t}\;\big|\,\mathrm{TPR}(t) - \mathrm{FPR}(t)\,\big| $$

$F_{+}$ and $F_{-}$ are the cumulative distribution functions of the score among positives and negatives. Because $\mathrm{TPR} = 1 - F_{+}$ and $\mathrm{FPR} = 1 - F_{-}$ up to orientation, the two gaps are equal in absolute value, and KS is exactly the maximum vertical distance between the ROC curve and the diagonal: the most-separated operating point. KS ranges from 0 (curves identical, no separation) to 1 (perfectly disjoint), and credit practice usually quotes it as a percentage. On that scale, a KS in the 30s–40s is a healthy retail application scorecard; above ~75 usually means a leak, not a triumph. The threshold at which the gap is maximized is a natural cutoff, though rarely the cost-optimal one (§4.5).

The Gini coefficient is just AUC rescaled to put a random model at zero and a perfect model at one:

EQ V4.5 — GINI FROM AUC $$ \mathrm{Gini} = 2\,\mathrm{AUC} - 1 \qquad\Longleftrightarrow\qquad \mathrm{AUC} = \frac{\mathrm{Gini} + 1}{2} $$

Gini is the ratio of the area between the ROC curve and the diagonal to the area between the perfect curve and the diagonal, which works out to twice the area AUC adds above 0.5. A model with AUC 0.80 has Gini 0.60; AUC 0.5 → Gini 0; AUC 1.0 → Gini 1. KS, Gini, and AUC all measure a model's discrimination, but only two of them are interchangeable: Gini is a fixed increasing function of AUC, so the two always rank models identically, whereas KS depends on the shape of the separation and can order two models the opposite way from AUC. Banks report all three because regulators expect them, and because a model strong on KS but weak on Gini (or vice versa) signals an unusual score distribution worth a second look.
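A small numerical illustration of KS and AUC disagreeing, using two hypothetical piecewise-linear ROC curves whose vertices are made up for this sketch. For a piecewise-linear curve the largest TPR minus FPR gap sits at a vertex, so checking the vertices is enough.

```python
def auc_and_ks(roc):
    """AUC (trapezoid) and KS (max TPR - FPR) from ROC vertices (fpr, tpr)."""
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
    ks = max(tpr - fpr for fpr, tpr in roc)
    return auc, ks

# Model A: one sharp corner. Model B: a smoother, rounder curve.
model_a = [(0.0, 0.0), (0.10, 0.70), (1.0, 1.0)]
model_b = [(0.0, 0.0), (0.05, 0.45), (0.30, 0.85), (1.0, 1.0)]

for name, roc in (("A", model_a), ("B", model_b)):
    auc, ks = auc_and_ks(roc)
    print(f"model {name}: AUC {auc:.3f}   Gini {2 * auc - 1:.3f}   KS {ks:.3f}")
```

Model B wins on AUC and Gini while model A wins on KS: A concentrates its separation at one cutoff, B spreads it across the score range.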

The KS statistic is the maximum gap between the two classes' cumulative distribution functions of the score (equivalently, the largest vertical distance between the ROC curve and the diagonal). True or false? (Answer true or false.)

By definition (EQ V4.4), $\mathrm{KS} = \max_t |F_{+}(t) - F_{-}(t)| = \max_t |\mathrm{TPR}(t) - \mathrm{FPR}(t)|$ — precisely the maximum separation between the cumulative distributions of positives and negatives, which is the largest vertical gap between the ROC curve and the chance diagonal. The statement is true.

A scorecard reports $\mathrm{AUC} = 0.80$. Using EQ V4.5, what is its Gini coefficient ($2\,\mathrm{AUC} - 1$)?

$\mathrm{Gini} = 2 \times 0.80 - 1 = 1.60 - 1 = $ 0.6. Equivalently, the model captures 60% of the way from a coin flip (Gini 0) to a perfect ranker (Gini 1).

PYTHON · RUNNABLE IN-BROWSER

# KS statistic and Gini from two score distributions (goods vs bads).
import numpy as np
rng = np.random.default_rng(5)

bads  = rng.normal(0.65, 0.18, 800).clip(0, 1)   # higher score = riskier
goods = rng.normal(0.40, 0.18, 4000).clip(0, 1)

# KS: walk a common grid of thresholds, compare cumulative shares (EQ V4.4).
grid = np.linspace(0, 1, 501)
F_bad  = np.searchsorted(np.sort(bads),  grid, side="right") / len(bads)
F_good = np.searchsorted(np.sort(goods), grid, side="right") / len(goods)
gap = np.abs(F_bad - F_good)
ks = gap.max()
ks_at = grid[gap.argmax()]

# AUC by pair-counting -> Gini = 2*AUC - 1 (EQ V4.2, V4.5).
s = np.concatenate([bads, goods])
y = np.concatenate([np.ones(len(bads)), np.zeros(len(goods))])
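# Tie-averaged (mid) ranks: clip() piles scores up at 0 and 1, so ties are real here.
_, inv, cnt = np.unique(s, return_inverse=True, return_counts=True)
ranks = (np.cumsum(cnt) - (cnt - 1) / 2)[inv]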
P, N = len(bads), len(goods)
auc = (ranks[y == 1].sum() - P*(P+1)/2) / (P*N)
gini = 2*auc - 1

print(f"AUC  = {auc:.4f}")
print(f"Gini = 2*AUC - 1 = {gini:.4f}")
print(f"KS   = {ks:.4f}  (max gap at score ~ {ks_at:.2f})")
print("\nKS is the widest separation of the two cumulative curves;")
print("Gini is AUC stretched so chance=0 and perfect=1.")
plot_xy(grid, gap)                                # the KS gap as a function of cutoff


A note on PSI. The Population Stability Index — the workhorse for detecting that today's score distribution has drifted from the development sample — lives in the same credit-scoring toolbox and shares KS's distribution-comparison spirit, but it answers a different question: not "how well does the score separate good from bad?" but "has the input or score population shifted since the model was built?" PSI is therefore a stability and drift diagnostic, and Chapter 05 develops it in full alongside characteristic-stability and drift detection. Here it is enough to know that KS/Gini measure discrimination, PSI measures population shift, and a healthy KS today says nothing about whether PSI has quietly crept past its alarm threshold.
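For orientation only, here is a minimal sketch of the form in which PSI is commonly written, $\sum_i (a_i - e_i)\,\ln(a_i / e_i)$ over matching score bands, where $e_i$ is the development-sample share and $a_i$ the current share. The banding, the `eps` guard for empty bands, and the toy shares are all assumptions of this sketch; Chapter 05 is where the document treats PSI properly.

```python
import math

def psi(expected_share, actual_share, eps=1e-6):
    """Population Stability Index over matching bands of two distributions."""
    total = 0.0
    for e, a in zip(expected_share, actual_share):
        e, a = max(e, eps), max(a, eps)      # guard against empty bands
        total += (a - e) * math.log(a / e)
    return total

dev   = [0.10, 0.20, 0.40, 0.20, 0.10]   # development share per score band (toy)
today = [0.08, 0.17, 0.38, 0.24, 0.13]   # current share per band (toy)

print(f"PSI, today vs development  = {psi(dev, today):.4f}")
print(f"PSI, development vs itself = {psi(dev, dev):.4f}")
```

Note that no outcome labels appear anywhere: PSI compares two populations, which is why it can move while KS and Gini stay put.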

4.4

Calibration — reliability curves & Brier score

Everything so far judged ordering. None of it cares about the actual value of the score, because you can apply any strictly increasing transform (cube it, pass it through a sigmoid, raise a probability to the tenth power) and AUC, KS, and Gini are all unchanged. But a score that drives a decision usually has to mean something: an expected-loss calculation needs a real probability of default, a triage tool needs to say "this patient has a 12% chance," not merely "this patient ranks 47th." Calibration is the property that closes the gap between the number and the world.

EQ V4.6 — PERFECT CALIBRATION $$ \Pr\big(\,Y = 1 \,\mid\, \hat{p}(X) = p\,\big) = p \qquad \text{for all } p \in [0, 1] $$

Among all cases the model assigns probability $p$, a fraction $p$ should actually be positive. Calibration and discrimination are orthogonal. A model can be perfectly calibrated yet useless at ranking (predict the base rate $\pi$ for everyone — calibrated, AUC = 0.5), or a flawless ranker yet badly miscalibrated (AUC = 1.0 with every probability squashed toward 0.5). You need both, and you must measure them separately because no single ranking metric will ever catch a calibration failure.

You inspect calibration with a reliability curve: bin the predicted probabilities, and for each bin plot the mean prediction against the observed positive frequency. Perfect calibration is the 45° diagonal. A curve that sags below the diagonal at high predicted probabilities means the model is over-confident there (it says 0.9 but only 0.7 actually happen); a curve that bows above means it is under-confident. A model that is over-confident at both ends falls below the diagonal at the top and above it at the bottom, so its curve looks flatter than 45°. The classic shapes have classic causes: modern neural nets tend toward over-confidence (Guo et al.), naive Bayes pushes probabilities toward the extremes, the boosted trees in Niculescu-Mizil & Caruana's study pulled probabilities away from 0 and 1, and a well-regularized logistic regression is often calibrated almost for free.
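The binning recipe can be sketched in a few lines. Everything here is an illustrative assumption: equal-width bins, simulated outcomes drawn from known probabilities, and an over-confident variant made by doubling the logit. The scalar at the end is the count-weighted mean gap between prediction and outcome rate, often called Expected Calibration Error.

```python
import math
import random

def reliability_bins(p_hat, y, n_bins=10):
    """Per-bin (mean prediction, observed positive rate, count), equal-width bins."""
    acc = [[0.0, 0, 0] for _ in range(n_bins)]    # [sum of p, positives, count]
    for p, label in zip(p_hat, y):
        b = min(int(p * n_bins), n_bins - 1)      # p == 1.0 goes in the last bin
        acc[b][0] += p
        acc[b][1] += label
        acc[b][2] += 1
    return [(sp / n, pos / n, n) for sp, pos, n in acc if n > 0]

def expected_calibration_error(bins):
    """Count-weighted mean absolute gap between prediction and outcome rate."""
    total = sum(n for _, _, n in bins)
    return sum(n / total * abs(mean_p - rate) for mean_p, rate, n in bins)

def sharpen(p, gamma=2.0):
    """Scale the logit by gamma; gamma > 1 gives over-confident forecasts."""
    return 1 / (1 + math.exp(-gamma * math.log(p / (1 - p))))

rng = random.Random(0)
p_true = [rng.uniform(0.02, 0.98) for _ in range(20000)]
y = [1 if rng.random() < p else 0 for p in p_true]

for name, p_hat in (("calibrated", p_true),
                    ("over-confident", [sharpen(p) for p in p_true])):
    bins = reliability_bins(p_hat, y)
    print(f"{name:<15} ECE = {expected_calibration_error(bins):.4f}")
    for mean_p, rate, n in bins:
        print(f"    predicted {mean_p:.2f}   observed {rate:.2f}   n = {n}")
```

In the over-confident table the observed rate is lower than the prediction in the top bins and higher in the bottom bins, which is the flattened shape described above.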

The standard scalar summary is the Brier score — the mean squared error of the probabilities themselves:

EQ V4.7 — THE BRIER SCORE $$ \mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{p}_i - y_i\big)^2, \qquad y_i \in \{0, 1\} $$

Lower is better; $0$ is perfect, and predicting the base rate $\pi$ for everyone gives $\pi(1-\pi)$. The Brier score is a strictly proper scoring rule: it is uniquely minimized in expectation by the true probabilities, so you cannot game it by shading your forecasts. Its great virtue is also its limit: it bundles two things together. The Murphy decomposition splits it into reliability minus resolution plus uncertainty, that is, a calibration term (reliability) plus a refinement term (uncertainty minus resolution). A low Brier score can therefore come from sharp-and-calibrated forecasts or, when positives are rare and $\pi(1-\pi)$ is small, from a timid model hugging the base rate. Read it alongside the reliability curve, never alone; for a pure calibration number, the Expected Calibration Error (the average bin-wise gap from the diagonal, weighted by bin size) is the common companion.
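The decomposition can be checked numerically. It is exact when forecasts take a finite set of values, so this sketch uses made-up forecasts restricted to 0.1, 0.5 and 0.9 and verifies that Brier equals reliability minus resolution plus uncertainty. The function name is illustrative.

```python
from collections import defaultdict

def murphy(p_hat, y):
    """Return (brier, reliability, resolution, uncertainty) for discrete forecasts."""
    n = len(y)
    base = sum(y) / n
    groups = defaultdict(list)               # forecast value -> outcomes
    for p, label in zip(p_hat, y):
        groups[p].append(label)
    # Reliability: squared gap between each forecast and its observed rate.
    rel = sum(len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()) / n
    # Resolution: how far the observed rates spread away from the base rate.
    res = sum(len(g) * (sum(g) / len(g) - base) ** 2 for g in groups.values()) / n
    unc = base * (1 - base)                  # Brier score of the base-rate forecast
    brier = sum((p - label) ** 2 for p, label in zip(p_hat, y)) / n
    return brier, rel, res, unc

p_hat = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
y = [0] * 9 + [1] + [0] * 6 + [1] * 4 + [0] * 2 + [1] * 8
brier, rel, res, unc = murphy(p_hat, y)
print(f"Brier {brier:.4f}")
print(f"reliability {rel:.4f} - resolution {res:.4f} + uncertainty {unc:.4f} "
      f"= {rel - res + unc:.4f}")
```

Reliability is the part recalibration can remove; resolution rewards forecasts that separate cases with different outcome rates; uncertainty depends only on the base rate.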

When a model ranks well but is miscalibrated, you do not need to retrain; you recalibrate the output with a cheap monotone post-processor fit on held-out data: Platt scaling (a two-parameter logistic, slope and intercept, on the scores) or isotonic regression (a free-form non-decreasing step function). Platt scaling with a positive slope preserves the ranking exactly, so AUC, KS, and Gini are untouched; isotonic regression can only merge neighboring scores into ties, so those metrics move little if at all. Both bend the reliability curve back toward the diagonal. Isotonic is more flexible but needs more data and can overfit; Platt is robust on small validation sets. This is the standard fix established by Niculescu-Mizil & Caruana and unchanged in practice today.
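A dependency-free sketch of both recalibrators, under stated assumptions: the raw score is a logit-scale value three times too large (simulated), Platt scaling is fit by plain Newton steps on the log loss without Platt's original target smoothing, and isotonic regression uses pool-adjacent-violators. All function names are illustrative; a production system would normally use a tested library implementation.

```python
import bisect
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def platt_fit(scores, labels, n_iter=25):
    """Fit p = sigmoid(a*score + b) by Newton steps on the log loss."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        ga = gb = haa = hab = hbb = 0.0
        for s, y in zip(scores, labels):
            q = sigmoid(a * s + b)
            w = q * (1.0 - q)
            ga += (q - y) * s
            gb += q - y
            haa += w * s * s
            hab += w * s
            hbb += w
        det = haa * hbb - hab * hab + 1e-12   # 2x2 Hessian determinant
        a -= (hbb * ga - hab * gb) / det
        b -= (haa * gb - hab * ga) / det
    return a, b

def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: list of (upper score edge, fitted probability)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []                               # [sum of labels, count, max score]
    for i in order:
        blocks.append([labels[i], 1, scores[i]])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c, top = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = top
    return [(top, s / c) for s, c, top in blocks]

def brier(p_hat, y):
    return sum((p - t) ** 2 for p, t in zip(p_hat, y)) / len(y)

rng = random.Random(1)
def draw(n, gamma=3.0):
    """Toy data: outcomes follow p, but the raw score is gamma * logit(p)."""
    p = [rng.uniform(0.02, 0.98) for _ in range(n)]
    y = [1 if rng.random() < pi else 0 for pi in p]
    return [gamma * math.log(pi / (1 - pi)) for pi in p], y

s_fit, y_fit = draw(4000)                     # held-out calibration set
s_new, y_new = draw(4000)                     # fresh data for evaluation

a, b = platt_fit(s_fit, y_fit)
steps = isotonic_fit(s_fit, y_fit)
edges = [top for top, _ in steps]

def isotonic_predict(s):
    return steps[min(bisect.bisect_left(edges, s), len(steps) - 1)][1]

print(f"raw      Brier {brier([sigmoid(s) for s in s_new], y_new):.4f}")
print(f"Platt    Brier {brier([sigmoid(a * s + b) for s in s_new], y_new):.4f}"
      f"   (a = {a:.2f}, b = {b:.2f})")
print(f"isotonic Brier {brier([isotonic_predict(s) for s in s_new], y_new):.4f}")
```

Because the simulated distortion is a pure logit rescaling, the fitted slope should land near 1/3 and both methods should lower the Brier score on the fresh data.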

Two predictions, both for true positives ($y = 1$): $\hat{p}_1 = 0.8$ and $\hat{p}_2 = 0.9$. Using EQ V4.7, what is the Brier score $\tfrac{1}{2}\big[(\hat{p}_1 - 1)^2 + (\hat{p}_2 - 1)^2\big]$?

$(0.8 - 1)^2 = (-0.2)^2 = 0.04$ and $(0.9 - 1)^2 = (-0.1)^2 = 0.01$. The mean is $\tfrac{1}{2}(0.04 + 0.01) = \tfrac{0.05}{2} = $ 0.025 — the squared-error penalty grows fast as a probability drifts from the truth.

PYTHON · RUNNABLE IN-BROWSER

# Discrimination vs calibration are orthogonal: same ranking, three Brier scores.
import numpy as np
rng = np.random.default_rng(11)

n = 4000
p_true = rng.beta(2, 5, n)                    # the genuine probabilities
y = (rng.random(n) < p_true).astype(int)      # outcomes drawn from them

def brier(p): return np.mean((p - y) ** 2)
def auc(p):
    r = p.argsort().argsort() + 1
    P, N = y.sum(), (1 - y).sum()
    return (r[y == 1].sum() - P*(P+1)/2) / (P*N)

def warp(p, gamma):                           # scale the logit -> a monotone re-map
    logit = np.log(p / (1 - p)) * gamma       # gamma>1 sharpens, gamma<1 softens
    return 1 / (1 + np.exp(-logit))

calibrated  = p_true                          # honest: gamma = 1
overconf    = warp(p_true, 2.2)               # over-confident: probs pushed to extremes
underconf   = warp(p_true, 0.45)              # under-confident: probs squashed to 0.5

print(f"{'model':<16}{'AUC':>8}{'Brier':>9}")
for name, p in [("calibrated", calibrated),
                ("over-confident", overconf),
                ("under-confident", underconf)]:
    print(f"{name:<16}{auc(p):>8.3f}{brier(p):>9.4f}")

print("\nAUC is IDENTICAL for all three -- warp() is monotone, so the ranking")
print("never changes. Brier separates them: only the calibrated model is honest")
print("about its probabilities. Discrimination cannot see what calibration measures.")


INSTRUMENT V4.2 · RELIABILITY CURVE & BRIER · OVER- vs UNDER-CONFIDENT MODELS · EQ V4.6–V4.7

Controls: confidence (γ, shown at 1.00), bin count (shown at 10). Readouts: regime, Brier score, expected calibration error.

The model's probabilities are warped by an exponent $\gamma$: the dots are binned predictions, the dashed line is perfect calibration. At $\gamma = 1$ the model sits on the diagonal — honest. Push $\gamma$ above 1 to make it over-confident (the curve sags below the line; it claims more certainty than it has) and below 1 to make it under-confident (the curve bows above). Watch the Brier score and ECE bottom out exactly at $\gamma = 1$ — and note the ranking never changes, because $\gamma$ is a monotone transform: this is calibration moving while discrimination stands still.

4.5

Cutoff selection by cost

The ROC curve hands you every operating point the model can reach; it does not tell you which one to stand on. The default of 0.5 is almost always wrong: for a calibrated probability it is correct only when the two error types cost the same (EQ V4.9), which is to say almost never. The right threshold is the one that minimizes expected cost, and that depends on numbers the model never sees: the price of a false positive, the price of a false negative, and the prevalence.

EQ V4.8 — EXPECTED COST OF A THRESHOLD $$ \mathbb{E}[\text{cost}](t) = c_{\mathrm{FP}}\cdot\mathrm{FP}(t) + c_{\mathrm{FN}}\cdot\mathrm{FN}(t) \;\;\Big(- \, b_{\mathrm{TP}}\cdot\mathrm{TP}(t) - b_{\mathrm{TN}}\cdot\mathrm{TN}(t)\Big) $$

Each cell of the confusion matrix carries a cost (or benefit); the total is their weighted sum, and you choose the $t$ that minimizes it. The benefit terms in parentheses are optional: because $\mathrm{TP} = P - \mathrm{FN}$ and $\mathrm{TN} = N - \mathrm{FP}$, they can be folded into the error costs (use $c_{\mathrm{FP}} + b_{\mathrm{TN}}$ and $c_{\mathrm{FN}} + b_{\mathrm{TP}}$) plus a constant that does not move the optimum. The optimal threshold is governed by the cost ratio, not by 0.5. If a missed fraud costs ten times a false alarm, you should accept far more false alarms to catch it, and the cutoff slides down accordingly.

For a model that emits a true probability $\hat{p}$, the cost-minimizing rule has a clean closed form. Flagging a case is worth it when the expected cost of flagging it, $(1-\hat{p})\,c_{\mathrm{FP}}$, is no larger than the expected cost of letting it pass, $\hat{p}\,c_{\mathrm{FN}}$, which rearranges to a single threshold on the probability:

EQ V4.9 — THE COST-OPTIMAL PROBABILITY THRESHOLD $$ t^{\star} = \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}} \qquad\Longleftrightarrow\qquad \text{predict positive when } \hat{p} \;\ge\; t^{\star} $$

The optimal cutoff depends only on the ratio of error costs. Equal costs ($c_{\mathrm{FP}} = c_{\mathrm{FN}}$) give $t^{\star} = 0.5$ — the only case the default is right. If a false negative costs $9\times$ a false positive, $t^{\star} = \frac{1}{1+9} = 0.1$: flag anything above a 10% probability. This formula is only valid if $\hat{p}$ is calibrated — which is precisely why §4.4 comes before §4.5. Feed it the over-confident scores of an uncalibrated model and the "optimal" threshold is optimal for a world that does not exist. Calibrate first, then optimize the cutoff; otherwise you are tuning a decision on a lie.
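The algebra behind EQ V4.9 is short enough to write out. Assuming $\hat{p}$ is calibrated and correct decisions carry no cost or benefit, flagging a case risks a false positive with probability $1-\hat{p}$, while letting it pass risks a false negative with probability $\hat{p}$:

$$ (1-\hat{p})\,c_{\mathrm{FP}} \;\le\; \hat{p}\,c_{\mathrm{FN}} \;\Longleftrightarrow\; c_{\mathrm{FP}} \;\le\; \hat{p}\,\big(c_{\mathrm{FP}} + c_{\mathrm{FN}}\big) \;\Longleftrightarrow\; \hat{p} \;\ge\; \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}} = t^{\star} $$

With $c_{\mathrm{FP}} = 1$ and $c_{\mathrm{FN}} = 9$, a case at exactly $\hat{p} = 0.1$ costs $0.9 \times 1 = 0.9$ in expectation if flagged and $0.1 \times 9 = 0.9$ if passed: the point of indifference, as the formula predicts.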

Two things follow. First, the whole pipeline composes: rank well (§4.1–4.3), calibrate the probabilities (§4.4), then place the cutoff by cost (§4.5). Skip the middle step and the last one is meaningless. Second, when costs are uncertain — as they usually are — do not pick a single $t^{\star}$; sweep the cost ratio and present the operating frontier, so the business owner can see the trade and choose with open eyes rather than inherit a hidden 0.5.
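One way to present that frontier is a small table with one row per candidate cost ratio. The sketch below assumes calibrated probabilities (simulated by drawing outcomes from the probabilities themselves), fixes $c_{\mathrm{FP}} = 1$, and uses illustrative names and ratios throughout.

```python
import random

rng = random.Random(3)
# Toy calibrated model: each outcome is drawn from its own probability.
p = [rng.betavariate(1.2, 6.8) for _ in range(5000)]    # mean about 0.15
y = [1 if rng.random() < pi else 0 for pi in p]

def operating_point(ratio):
    """EQ V4.9 threshold for c_FP = 1, c_FN = ratio, with the resulting counts."""
    t = 1.0 / (1.0 + ratio)
    fp = sum(1 for pi, yi in zip(p, y) if pi >= t and yi == 0)
    fn = sum(1 for pi, yi in zip(p, y) if pi < t and yi == 1)
    return t, fp, fn

print(f"{'c_FN/c_FP':>10}{'t*':>8}{'FP':>8}{'FN':>8}")
for ratio in (1, 2, 5, 9, 20, 50):
    t, fp, fn = operating_point(ratio)
    print(f"{ratio:>10}{t:>8.3f}{fp:>8}{fn:>8}")
```

Each row is one defensible operating point; reading down the table shows how many extra false alarms each reduction in misses costs, which is the trade the business owner actually has to choose.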

A false positive costs $c_{\mathrm{FP}} = 1$ and a false negative costs $c_{\mathrm{FN}} = 9$. Using EQ V4.9, at what calibrated probability $t^{\star}$ should you start predicting positive ($\tfrac{c_{\mathrm{FP}}}{c_{\mathrm{FP}}+c_{\mathrm{FN}}}$)?

$t^{\star} = \dfrac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}} = \dfrac{1}{1 + 9} = \dfrac{1}{10} = $ 0.1. Because a miss is nine times as expensive as a false alarm, you flag any case with at least a 10% probability — far below the naive 0.5.

PYTHON · RUNNABLE IN-BROWSER

# Cost-based cutoff: sweep thresholds, find the minimum-cost operating point.
import numpy as np
rng = np.random.default_rng(9)

n = 6000
p = rng.beta(1.2, 6.8, n)                     # calibrated probabilities, mean 0.15
y = (rng.random(n) < p).astype(int)           # outcomes drawn from p, so ~15% positive

c_fp, c_fn = 1.0, 9.0                          # a miss costs 9x a false alarm
ts = np.linspace(0.01, 0.99, 99)
costs = []
for t in ts:
    pred = (p >= t).astype(int)
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    costs.append(c_fp*fp + c_fn*fn)
costs = np.array(costs)

t_star_grid = ts[costs.argmin()]              # empirically optimal cutoff
t_star_formula = c_fp / (c_fp + c_fn)         # EQ V4.9 closed form

print(f"closed-form t* = c_FP/(c_FP+c_FN) = {t_star_formula:.3f}")
print(f"grid-search    t* (min cost)      = {t_star_grid:.3f}")
print(f"cost at t=0.50 (naive default)    = {costs[np.argmin(np.abs(ts-0.5))]:.0f}")
print(f"cost at t*     (cost-optimal)     = {costs.min():.0f}")
print("\nThe default 0.5 leaves money on the table whenever costs are asymmetric.")
plot_xy(ts, costs)                            # the cost-vs-threshold curve (U-shaped)


INSTRUMENT V4.3 · COST-BASED CUTOFF OPTIMIZER · SWEEP THE THRESHOLD · EQ V4.8–V4.9

Controls: false-positive cost (shown at 1), false-negative cost (shown at 9), prevalence (% positive, shown at 15%). Readouts: cost-optimal t*, cost @ t* vs @ 0.50, savings vs default.

The U-shaped curve is total expected cost (EQ V4.8) as the threshold sweeps left to right; the mint marker is the cost-minimizing $t^{\star}$, the grey line is the naive 0.5. Raise FALSE-NEGATIVE COST and watch $t^{\star}$ slide left (you accept more false alarms in exchange for fewer expensive misses), landing near the closed form $c_{\mathrm{FP}}/(c_{\mathrm{FP}}+c_{\mathrm{FN}})$ of EQ V4.9. The "savings vs default" readout is the money the standard 0.5 quietly throws away whenever your costs are asymmetric.

These metrics all assume the world stays still — the population you scored yesterday is the population you score today. It never does. Chapter 05 turns to stability and drift: the Population Stability Index (PSI) and characteristic stability that catch a shifting input distribution, covariate and concept drift, and the monitoring that tells you when a once-excellent AUC has quietly stopped describing reality.

4.R

References

Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8) — the canonical tutorial on ROC curves, AUC, and the pair-counting identity (§4.1).
Hand, D. J. (2009). Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve. Machine Learning 77(1) — the influential critique of AUC and the proposed H-measure.
Niculescu-Mizil, A. & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. ICML 2005 — calibration behavior across model families and the Platt / isotonic fixes (§4.4).
Saito, T. & Rehmsmeier, M. (2015). The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10(3) — the empirical case for PR over ROC under class imbalance (§4.2).
Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original mean-squared-error scoring rule for probabilities (§4.4).
Hanley, J. A. & McNeil, B. J. (1982). The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143(1) — the AUC = Wilcoxon–Mann–Whitney equivalence (EQ V4.2).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017 — modern deep networks are systematically over-confident; temperature scaling as a fix (§4.4).