MODEL VALIDATION & RISK · CHAPTER 03 / 07

Metrics — Regression & Classification

A metric is not just a final report; it is the objective the pipeline optimizes toward, and it determines which errors the model is willing to make. The metric you optimize is the behavior you get, and accuracy is the one most often misread on imbalanced data. This chapter covers the working vocabulary: regression error measures, the confusion matrix and the rates derived from it, and probabilistic scores that grade the predicted confidence as well as the answer.

LEVEL INTRO · READING TIME ≈ 24 MIN · BUILDS ON MLOPS 01, STATS 04 · INSTRUMENTS CONFUSION, MAE vs RMSE, MAPE TRAP
3.1

Regression metrics: MSE, RMSE, MAE, MAPE, R²

A regression model predicts a number \(\hat{y}_i\) for each row whose truth is \(y_i\). The single object every regression metric chews on is the vector of residuals \(e_i = y_i - \hat{y}_i\). The metrics differ only in how they punish a residual — and that choice of punishment is the choice of what the model will try hardest to avoid.

EQ V3.1 — MEAN SQUARED ERROR & ITS ROOT $$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}} $$
Squaring makes large residuals dominate: a single error of 10 contributes as much as one hundred errors of 1. So MSE/RMSE is the metric of choice when big misses are disproportionately bad (a forecast that is occasionally catastrophic). RMSE takes the square root to return to the original units — predict dollars, read dollars — and is the value almost always reported. The estimator that minimizes MSE is the conditional mean \(\mathbb{E}[y\mid x]\).
EQ V3.2 — MEAN ABSOLUTE ERROR $$ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert\, y_i - \hat{y}_i \,\rvert $$
MAE punishes every dollar of error equally, with no squaring. It is in the original units already and is far more robust to outliers than RMSE — one wild residual moves it linearly, not quadratically. The estimator that minimizes MAE is the conditional median, which is why MAE-trained models lean toward the typical case and ignore rare extremes. The gap between RMSE and MAE is itself a diagnostic: RMSE \(\ge\) MAE always, and a large ratio signals a heavy tail of big errors.
EQ V3.3 — MEAN ABSOLUTE PERCENTAGE ERROR $$ \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert $$
MAPE rescales every residual by the truth, giving a unit-free percentage that lets you compare error across series of wildly different magnitude. That convenience hides two traps it is famous for: it explodes when any \(y_i\) is near zero (the demo in §3.2), and it is asymmetric — it penalizes over-prediction more harshly than under-prediction, quietly biasing a MAPE-tuned forecast low. Use it for reporting across scales; never as your sole training objective.
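The claims about which forecast each metric prefers can be checked numerically. The following is a minimal sketch, not one of the chapter's instruments: the sample in y is made up for illustration, and the helper best_constant is a hypothetical name. It grid-searches the single constant forecast that minimizes MSE, MAE, and MAPE in turn.

```python
import numpy as np

# Hypothetical, right-skewed sample of strictly positive targets.
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Candidate constant forecasts on a 0.01 grid from 0.5 to 10.
grid = np.linspace(0.5, 10.0, 951)

def best_constant(loss):
    """Return the grid value c that minimizes loss(y, c)."""
    scores = [loss(y, c) for c in grid]
    return float(grid[int(np.argmin(scores))])

c_mse  = best_constant(lambda y, c: np.mean((y - c) ** 2))         # EQ V3.1
c_mae  = best_constant(lambda y, c: np.mean(np.abs(y - c)))        # EQ V3.2
c_mape = best_constant(lambda y, c: np.mean(np.abs((y - c) / y)))  # EQ V3.3

print(f"sample mean   = {y.mean():.2f}, sample median = {np.median(y):.2f}")
print(f"MSE-optimal constant  = {c_mse:.2f}")
print(f"MAE-optimal constant  = {c_mae:.2f}")
print(f"MAPE-optimal constant = {c_mape:.2f}")
```

On this sample the MSE-optimal constant lands on the mean (4), the MAE-optimal constant on the median (3), and the MAPE-optimal constant lower still (2), which is the low bias the paragraph above warns about. A constant forecast stands in for a full model here only to keep the sketch short.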
EQ V3.4 — COEFFICIENT OF DETERMINATION (R²) $$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\mathrm{SS}_{\text{res}}}{\mathrm{SS}_{\text{tot}}} $$
\(R^2\) is the fraction of the target's variance the model explains, measured against the dumbest honest baseline: always predicting the mean \(\bar{y}\). \(R^2 = 1\) is perfect; \(R^2 = 0\) means you matched the mean and learned nothing; \(R^2\) can go negative when the model is worse than that constant, a fact that surprises people who assume it lives in \([0,1]\). Because it is unit-free, \(R^2\) is often compared across datasets, but it is normalized by each dataset's own target variance: the same absolute error yields a different \(R^2\) when the spread of \(y\) changes, so cross-dataset comparisons need care.
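The three landmark values of \(R^2\) are easy to reproduce by hand. The sketch below uses five made-up targets and a hypothetical helper named r2_score that follows EQ V3.4 directly; it is an illustration under those assumptions, not a library implementation.

```python
import numpy as np

def r2_score(y, yhat):
    """Coefficient of determination per EQ V3.4: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])               # hypothetical targets

r2_perfect  = r2_score(y, y)                           # predictions equal the truth
r2_mean     = r2_score(y, np.full_like(y, y.mean()))   # always predict the mean
r2_reversed = r2_score(y, y[::-1])                     # trend predicted backwards

print(f"perfect predictions : R2 = {r2_perfect:.2f}")
print(f"always the mean     : R2 = {r2_mean:.2f}")
print(f"reversed predictions: R2 = {r2_reversed:.2f}")
```

The reversed predictions give SS_res = 40 against SS_tot = 10, so \(R^2 = 1 - 4 = -3\): a model that is worse than the constant baseline scores below zero.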
A model makes two predictions with errors \( e = (3,\ 4) \). Using EQ V3.1, what is the RMSE of these errors, \( \sqrt{\tfrac{1}{2}(3^2 + 4^2)} \)?
Square the errors: \( 3^2 = 9 \) and \( 4^2 = 16 \). The mean squared error is \( (9 + 16)/2 = 25/2 = 12.5 \). Taking the root: \( \sqrt{12.5} = \) 3.54. (Note the MAE of the same errors is \( (3+4)/2 = 3.5 \) — RMSE sits above it because squaring inflates the larger residual.)
PYTHON · RUNNABLE IN-BROWSER
# Every regression metric from scratch on a toy fit (EQ V3.1-V3.4).
import numpy as np
rng = np.random.default_rng(0)

n = 60
x = np.linspace(0, 10, n)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, n)     # truth: a noisy line
# Fit y = a*x + b by least squares (this is what MSE training would find).
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
yhat = a * x + b

e = y - yhat                                   # residuals: the raw material
mse  = np.mean(e**2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(e))
mape = 100 * np.mean(np.abs(e / y))            # y is safely far from 0 here
r2   = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

print(f"fitted line : y = {a:.2f}*x + {b:.2f}")
print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}   (same units as y)")
print(f"MAE  = {mae:.3f}    (<= RMSE, always)")
print(f"MAPE = {mape:.2f} %")
print(f"R2   = {r2:.3f}     (1 = perfect, 0 = no better than the mean)")
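# plot_scatter is assumed to be a helper supplied by the page's runtime; skip it elsewhere.
if "plot_scatter" in globals():
    plot_scatter(x, y)                          # the data the line was fit to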
3.2

When each regression metric misleads

None of these numbers is neutral. Each encodes an opinion about which errors matter, and each has a regime where it quietly tells you the wrong thing. The professional habit is to report at least two — usually RMSE and MAE — and to read the gap between them.

| Metric | What it rewards | Where it misleads |
| --- | --- | --- |
| RMSE / MSE | getting the big cases right | one outlier can dominate the score; over-sensitive to a single bad row |
| MAE | getting the typical case right | indifferent to whether a miss is huge or merely large; ignores the tail |
| MAPE | comparable error across scales | undefined / explodes near \(y = 0\); asymmetric, biases forecasts low |
| \(R^2\) | beating the mean baseline | inflates with more features; can go negative; meaningless on tiny test sets |

The cleanest demonstration is the outlier sensitivity of RMSE versus MAE. Take ten residuals of size 1. MAE = 1 and RMSE = 1. Now turn one of those into a residual of 10, a single bad day. MAE crawls up to \((9\cdot 1 + 10)/10 = 1.9\). RMSE leaps to \(\sqrt{(9\cdot 1 + 100)/10} = \sqrt{10.9} \approx 3.30\). The same data: MAE did not quite double, while RMSE more than tripled. Which reaction you want is a domain decision, but you must know the metric is making it for you.

THE MAPE TRAP

MAPE divides by the truth, so a single true value near zero detonates it. If one row has \(y_i = 0.01\) and you predict \(0.5\), that term alone is \(|{-0.49}/0.01| = 49 = 4900\%\) — and the average is now hostage to one near-zero label, regardless of how good the other thousand predictions are. The standard escapes are sMAPE (symmetric MAPE), WAPE / weighted MAPE (divide by the sum of actuals, not row by row), or simply MAE when the targets can be small. Never compute MAPE on data with zeros in it.
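The escapes named above can be compared side by side on the same four rows. This is a hedged sketch: the function names are hypothetical, and the sMAPE shown is one common variant (bounded between 0% and 200%); other definitions exist in the literature.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, EQ V3.3 (undefined if any y is 0)."""
    return float(100.0 * np.mean(np.abs((y - yhat) / y)))

def smape(y, yhat):
    """One common sMAPE variant; each row is bounded by 200 %."""
    return float(100.0 * np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat))))

def wape(y, yhat):
    """Weighted APE: total absolute error divided by total absolute actuals."""
    return float(100.0 * np.sum(np.abs(y - yhat)) / np.sum(np.abs(y)))

# Three ordinary rows plus one row whose truth sits near zero.
y    = np.array([100.0, 100.0, 100.0, 0.01])
yhat = np.array([101.0, 101.0, 101.0, 0.50])

mape_v  = mape(y, yhat)
smape_v = smape(y, yhat)
wape_v  = wape(y, yhat)
mae_v   = float(np.mean(np.abs(y - yhat)))

print(f"MAPE  = {mape_v:8.1f} %   (hijacked by the near-zero row)")
print(f"sMAPE = {smape_v:8.1f} %   (bounded, but still pulled up by that row)")
print(f"WAPE  = {wape_v:8.2f} %   (dominated by the large actuals)")
print(f"MAE   = {mae_v:8.4f}     (original units)")
```

Note the design trade-off: WAPE and MAE stay stable because they no longer treat the tiny row as equally important in relative terms, which may or may not be what the application wants.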

A model has residual sum of squares \( \mathrm{SS}_{\text{res}} = 4 \) against a total sum of squares \( \mathrm{SS}_{\text{tot}} = 20 \). Using EQ V3.4, what is \( R^2 = 1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} \)?
\( \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}} = 4/20 = 0.20 \), so \( R^2 = 1 - 0.20 = \) 0.80. The model explains 80% of the variance that always-predict-the-mean would leave unexplained.
INSTRUMENT V3.1 — MAE vs RMSE DIVERGENCE · ADD AN OUTLIER · EQ V3.1 / V3.2
Readouts: MAE · RMSE · RMSE / MAE RATIO
Each bar is one residual: grey are the typical errors of size 1, the red bar is the single outlier you control. Drag OUTLIER RESIDUAL SIZE up and watch MAE rise gently while RMSE — and the RMSE/MAE ratio — climbs far faster, because squaring lets one bad row dominate. With no outlier (size 1) the two metrics agree exactly; the ratio is the tell-tale of a heavy error tail.
PYTHON · RUNNABLE IN-BROWSER
# One outlier: MAE shrugs, RMSE leaps. And MAPE near zero detonates.
import numpy as np

e = np.ones(10)                                # ten residuals of size 1
print("clean residuals:", e.astype(int))
print(f"  MAE  = {np.mean(np.abs(e)):.3f}   RMSE = {np.sqrt(np.mean(e**2)):.3f}")

e_out = e.copy(); e_out[0] = 10                # turn ONE into a size-10 miss
print("\nwith one size-10 outlier:")
print(f"  MAE  = {np.mean(np.abs(e_out)):.3f}   RMSE = {np.sqrt(np.mean(e_out**2)):.3f}")
print(f"  RMSE rose {np.sqrt(np.mean(e_out**2))/np.sqrt(np.mean(e**2)):.2f}x; "
      f"MAE rose only {np.mean(np.abs(e_out))/np.mean(np.abs(e)):.2f}x")

# The MAPE trap: every absolute error is small (<= 1), but one true value sits near zero.
y     = np.array([100., 100., 100.,   0.01])
yhat  = np.array([101., 101., 101.,   0.50])
print("\nMAPE per row (%):", np.round(100*np.abs((y-yhat)/y), 1))
print(f"overall MAPE = {100*np.mean(np.abs((y-yhat)/y)):.0f} %  "
      "<- one near-zero label hijacks the whole average")
INSTRUMENT V3.2 — THE MAPE NEAR-ZERO PITFALL · ONE SMALL TRUTH BREAKS THE AVERAGE · EQ V3.3
Readouts: MAE (units, stable) · MAPE (%, fragile) · y₄'s SHARE OF MAPE
Three well-behaved rows sit near \(y = 100\) with tiny errors; a fourth row's truth \(y_4\) is the slider, plotted on a log-distance scale. As you drag \(y_4\) toward zero, MAE barely twitches — the absolute error is unchanged — but MAPE blows up and that one row's share of the total MAPE races toward 100%. Slide \(y_4\) past zero and the percentage error becomes nonsense entirely: this is why MAPE is banned on data containing zeros.
3.3

The confusion matrix

Classification swaps a continuous truth for a discrete one, and the entire grammar of classification metrics is built from a single 2×2 table. A binary classifier converts a score into a label by comparing it to a threshold; every prediction then lands in one of four cells:

EQ V3.5 — THE CONFUSION MATRIX $$ \begin{array}{c|cc} & \hat{y}=1 & \hat{y}=0 \\ \hline y=1 & \mathrm{TP} & \mathrm{FN} \\ y=0 & \mathrm{FP} & \mathrm{TN} \end{array} $$
TP (true positive): predicted positive, was positive. FP (false positive, "false alarm"): predicted positive, was negative. FN (false negative, "miss"): predicted negative, was positive. TN (true negative). Every classification metric in §3.4 is just a ratio of these four counts. The deep point: FP and FN have different costs — a missed tumor is not the same as a false alarm — so no single number can serve every problem, and the threshold that balances them is a business decision, not a statistical one.

The threshold is the dial that moves counts between cells. Lower it and you call more things positive: TP and FP both rise, FN falls. Raise it and you become conservative: FP and TP both fall, FN rises. You cannot lower false alarms and misses at the same time by moving the threshold — you can only trade one for the other. That trade-off is the single most important intuition in classification, and Instrument V3.3 below exists to make you feel it in your hands.
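The trade-off can be seen by sweeping the threshold over simulated scores and counting the cells at each setting. The sketch below is illustrative only: the score distribution mirrors the style of this chapter's demos, and COST_FP and COST_FN are hypothetical costs standing in for the business decision described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                    # true labels
score = rng.normal(0, 1, n) + 1.3 * y        # positives score higher on average

# Hypothetical costs: here a miss is assumed to be five times worse than a false alarm.
COST_FP = 1.0
COST_FN = 5.0

rows = []
for thr in np.linspace(-2.0, 3.0, 11):       # thresholds in increasing order
    pred = score > thr
    tp = int(np.sum(pred & (y == 1)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    cost = COST_FP * fp + COST_FN * fn
    rows.append((float(thr), tp, fp, fn, cost))
    print(f"thr={thr:+.1f}  TP={tp:4d}  FP={fp:4d}  FN={fn:4d}  cost={cost:7.1f}")

tps = [r[1] for r in rows]
fps = [r[2] for r in rows]
fns = [r[3] for r in rows]
best_thr = min(rows, key=lambda r: r[4])[0]  # threshold with the lowest total cost
print(f"lowest-cost threshold under these costs: {best_thr:+.1f}")
```

As the threshold rises, FP can only fall and FN can only rise, so the sweep never finds a setting that lowers both. Changing the two cost constants moves the chosen threshold, which is the sense in which the operating point is set by the application rather than by the statistics.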

A confusion matrix has \( \mathrm{TP}=60 \), \( \mathrm{FP}=40 \), \( \mathrm{FN}=40 \), \( \mathrm{TN}=60 \). What is the accuracy, \( (\mathrm{TP}+\mathrm{TN}) / (\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}) \)?
Correct predictions are \( \mathrm{TP}+\mathrm{TN} = 60 + 60 = 120 \); the total is \( 60+40+40+60 = 200 \). Accuracy \( = 120/200 = \) 0.6. Note that this "balanced-looking" matrix still gets two in five wrong — and on an imbalanced set the same accuracy could come from a model that never finds a single positive.
INSTRUMENT V3.3 — CONFUSION-MATRIX EXPLORER · MOVE THE THRESHOLD · PRECISION ↔ RECALL · EQ V3.5
Readouts: PRECISION · RECALL · F1 · ACCURACY
Two overlapping bell curves are the score distributions of the negative and positive classes; the vertical line is your threshold. Everything to its right is called positive. Slide the threshold left and recall climbs while precision falls (you catch more positives but raise false alarms); slide it right and the trade reverses. The four counts and all four metrics update live. Then drag CLASS SEPARATION up: a genuinely better model is the only thing that improves precision and recall together.
PYTHON · RUNNABLE IN-BROWSER
# Confusion matrix -> precision, recall, F1, accuracy, all from scratch.
import numpy as np
rng = np.random.default_rng(2)

n = 1000
y = rng.integers(0, 2, n)                          # true labels
# scores: positives score higher on average, but the classes overlap
score = rng.normal(0, 1, n) + 1.3 * y
thr = 0.5
pred = (score > thr).astype(int)

TP = int(np.sum((pred == 1) & (y == 1)))
FP = int(np.sum((pred == 1) & (y == 0)))
FN = int(np.sum((pred == 0) & (y == 1)))
TN = int(np.sum((pred == 0) & (y == 0)))
print(f"confusion: TP={TP}  FP={FP}  FN={FN}  TN={TN}")

precision = TP / (TP + FP)                         # of predicted positives, how many real
recall    = TP / (TP + FN)                         # of real positives, how many caught
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / n
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F1        = {f1:.3f}   (harmonic mean of the two)")
print(f"accuracy  = {accuracy:.3f}")
3.4

Precision, recall, F1, accuracy

Four ratios of the four counts. They look interchangeable; they are not, and choosing the wrong one is the most common way a model ships looking great and fails in production.

EQ V3.6 — PRECISION & RECALL $$ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
Precision answers: of everything I flagged positive, what fraction really was? It is the metric you care about when a false alarm is expensive — a spam filter that quarantines a real invoice, a fraud system that freezes an honest card. Recall (sensitivity, true-positive rate) answers: of everything that really was positive, what fraction did I catch? It is what you care about when a miss is expensive — a cancer screen, a security threat. The two pull in opposite directions along the threshold (§3.3): you buy recall with precision and vice versa.
EQ V3.7 — F1: THE HARMONIC MEAN $$ F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad F_\beta = (1+\beta^2)\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision}+\mathrm{Recall}} $$
\(F_1\) is the harmonic mean of precision and recall, not the arithmetic one, and the choice is deliberate. The harmonic mean is dragged toward the smaller of the two, so \(F_1\) is high only when precision and recall are both high; a model with precision 1.0 and recall 0.01 scores \(F_1 \approx 0.02\), not roughly 0.5. \(F_\beta\) generalizes it: \(\beta > 1\) weights recall more (use when misses hurt), \(\beta < 1\) weights precision more. \(F_1\) is a common summary on imbalanced data where accuracy is uninformative, though it ignores true negatives entirely and is best read alongside precision and recall themselves.
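EQ V3.7 translates into a few lines of code. The sketch below assumes illustrative counts (TP = 40, FP = 10, FN = 20) and a hypothetical helper named f_beta; returning 0.0 when the denominator is zero is a convention chosen here, not something the formula dictates.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta per EQ V3.7; returns 0.0 when the denominator is zero (a convention)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Illustrative counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20
p = tp / (tp + fp)                 # precision = 0.8
r = tp / (tp + fn)                 # recall    = 0.667

f1     = f_beta(p, r)              # balanced
f2     = f_beta(p, r, beta=2.0)    # leans toward recall (the weaker one here)
f_half = f_beta(p, r, beta=0.5)    # leans toward precision (the stronger one here)
print(f"F0.5 = {f_half:.3f}   F1 = {f1:.3f}   F2 = {f2:.3f}")

# Harmonic vs arithmetic mean on a lopsided model.
lopsided_f1   = f_beta(1.0, 0.01)
lopsided_mean = (1.0 + 0.01) / 2
print(f"precision 1.0, recall 0.01: F1 = {lopsided_f1:.3f}, arithmetic mean = {lopsided_mean:.3f}")
```

Because recall is the weaker number in this example, weighting it more (\(\beta = 2\)) lowers the score and weighting precision more (\(\beta = 0.5\)) raises it.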
EQ V3.8 — ACCURACY (AND WHY IT LIES) $$ \mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}} $$
Accuracy is the fraction of predictions that are correct — intuitive, and the default everyone reaches for first. It is also the metric that lies most often, because it collapses the confusion matrix into one number and so is blind to class imbalance. On a dataset that is 99% negative, the model that predicts "negative" for everything scores 99% accuracy while catching exactly zero positives — useless, yet by accuracy alone it looks excellent. This is the accuracy paradox, and it is why the lede of this chapter singles accuracy out. Under imbalance, report precision, recall, F1, or balanced accuracy instead.
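Two quick checks make the always-negative failure visible: the accuracy of the majority-class baseline, and balanced accuracy, taken here as the mean of the recall on each class. Both helper names are hypothetical, the labels are made up, and the sketch assumes both classes are present in y.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return float(counts.max() / y.size)

def balanced_accuracy(y, pred):
    """Mean of the recall on positives (TPR) and the recall on negatives (TNR)."""
    y, pred = np.asarray(y), np.asarray(pred)
    tpr = np.mean(pred[y == 1] == 1)
    tnr = np.mean(pred[y == 0] == 0)
    return float(0.5 * (tpr + tnr))

y    = np.array([0] * 97 + [1] * 3)    # hypothetical labels, 3% positive
pred = np.zeros_like(y)                # a "model" that always says negative

acc  = float(np.mean(pred == y))
base = majority_baseline_accuracy(y)
bal  = balanced_accuracy(y, pred)

print(f"accuracy           = {acc:.2f}")
print(f"majority baseline  = {base:.2f}   (accuracy adds nothing over this)")
print(f"balanced accuracy  = {bal:.2f}   (no better than chance)")
```

When accuracy equals the majority baseline, the model has shown no skill; balanced accuracy exposes that by scoring the constant predictor at 0.5 regardless of the class mix.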
A COMMON ERROR

"The model is 97% accurate, ship it." Always ask the base rate first. If 97% of the rows are negative, a constant "no" predictor already scores 97% — your model may have learned nothing. The diagnostic reflex: compute the accuracy of the majority-class baseline, and never report accuracy on imbalanced data without precision and recall beside it. Accuracy is a fine metric only when the classes are roughly balanced and false positives and false negatives cost about the same.

A classifier produces \( \mathrm{TP} = 40 \) true positives and \( \mathrm{FP} = 10 \) false positives. Using EQ V3.6, what is its precision, \( \mathrm{TP}/(\mathrm{TP}+\mathrm{FP}) \)?
Precision \( = \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \dfrac{40}{40+10} = \dfrac{40}{50} = \) 0.8. Four out of every five items the model flagged as positive truly were — but precision alone says nothing about how many positives it missed; that is recall's job.
That same classifier also has \( \mathrm{FN} = 20 \) false negatives (positives it missed). Using EQ V3.6, what is its recall, \( \mathrm{TP}/(\mathrm{TP}+\mathrm{FN}) \)?
Recall \( = \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = \dfrac{40}{40+20} = \dfrac{40}{60} = \) 0.667. So the model is precise (0.8) but leaky on recall (0.67): its \( F_1 = \frac{2\cdot 0.8\cdot 0.667}{0.8+0.667} = 0.727 \), pulled below the average toward the weaker of the two.
PYTHON · RUNNABLE IN-BROWSER
# The accuracy paradox: 99% accurate and completely useless.
import numpy as np
rng = np.random.default_rng(5)

n = 10000
y = (rng.random(n) < 0.01).astype(int)            # 1% positive (rare disease)
print(f"true positives in data: {y.sum()} / {n}  ({100*y.mean():.1f}%)\n")

# "Model" A: predict the majority class (negative) for everyone.
predA = np.zeros(n, dtype=int)
accA = (predA == y).mean()
TP_A = int(np.sum((predA==1)&(y==1)))
print(f"ALWAYS-NEGATIVE:  accuracy = {accA:.3f}  but caught {TP_A} of {y.sum()} positives")
print("  recall = 0 and precision is 0/0 (undefined) -> the metric that 'works' is a mirage.\n")

# "Model" B: a real but imperfect detector.
score = rng.normal(0, 1, n) + 2.5 * y
predB = (score > 1.5).astype(int)
TP=int(np.sum((predB==1)&(y==1))); FP=int(np.sum((predB==1)&(y==0)))
FN=int(np.sum((predB==0)&(y==1))); TN=int(np.sum((predB==0)&(y==0)))
prec = TP/(TP+FP) if TP+FP else 0
rec  = TP/(TP+FN) if TP+FN else 0
f1   = 2*prec*rec/(prec+rec) if prec+rec else 0
print(f"REAL DETECTOR:    accuracy = {(TP+TN)/n:.3f}")
print(f"  precision = {prec:.3f}  recall = {rec:.3f}  F1 = {f1:.3f}")
print("Accuracy favors the useless model here; only F1/precision/recall reveal which one works.")
3.5

Log loss & probabilistic scoring

Everything so far grades a decision — the label after thresholding. But most classifiers output a probability, and throwing it away to compute accuracy discards information: a model that says "90% sure" and is right is better than one that says "51% sure" and is right. Log loss (binary cross-entropy) grades the probability itself, rewarding confidence only when it is earned.

EQ V3.9 — BINARY CROSS-ENTROPY (LOG LOSS) $$ \mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\ln \hat{p}_i + (1-y_i)\ln(1-\hat{p}_i) \,\Big] $$
For each row, only one term survives: if \(y_i = 1\) the penalty is \(-\ln\hat{p}_i\), if \(y_i = 0\) it is \(-\ln(1-\hat{p}_i)\). Predict the truth with probability 1 and the penalty is \(-\ln 1 = 0\); predict it with probability \(0.5\) and you pay \(\ln 2 \approx 0.693\), the cost of a coin flip and the score an uninformed model reaches on balanced classes. (On imbalanced data, always predicting the base rate scores lower than \(\ln 2\), so that base-rate log loss is the fair reference point.) The penalty is unbounded: a confident wrong answer (\(\hat{p}\to 0\) when \(y=1\)) costs \(-\ln(0)\to\infty\). Log loss is the loss most classifiers are actually trained on, and it is the proper scoring rule that calibration (Chapter 04) exists to keep honest.

Two sibling scores are worth knowing. The Brier score is the mean squared error of the probabilities, \(\frac{1}{n}\sum(\hat{p}_i - y_i)^2\). It is also a proper scoring rule, but bounded (a confident wrong answer maxes out at 1 rather than infinity), so it is gentler on outliers and easier to read. And cross-entropy generalizes immediately to \(K\) classes as \(-\frac{1}{n}\sum_i\sum_{c} y_{ic}\ln\hat{p}_{ic}\), the multiclass loss behind virtually every neural classifier. The honest caveat: log loss is sensitive to calibration as well as ranking, so a model can have great ranking (AUC) yet terrible log loss if its probabilities are systematically over- or under-confident. That is exactly the gap the next chapter closes.
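The gap between ranking and probability quality can be shown with a monotone distortion. In the sketch below, the labels, probabilities, and the sharpen helper are all made up for illustration: sharpen pushes each probability toward 0 or 1 without changing the order, so any rank-based score such as AUC is unchanged while log loss moves.

```python
import numpy as np

def log_loss(y, p):
    """Binary cross-entropy per EQ V3.9, with clipping to avoid ln(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier(y, p):
    """Mean squared error of the predicted probabilities."""
    return float(np.mean((p - y) ** 2))

def sharpen(p, k):
    """Monotone map that exaggerates confidence for k > 1 (hypothetical helper)."""
    return p ** k / (p ** k + (1 - p) ** k)

y = np.array([1, 1, 0, 1, 0, 0])                 # hypothetical labels
p = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])     # moderate probabilities, two rows misordered
q = sharpen(p, 4)                                # same order, far more confident

same_ranking = bool(np.array_equal(np.argsort(p), np.argsort(q)))
ll_p, ll_q = log_loss(y, p), log_loss(y, q)
br_p, br_q = brier(y, p), brier(y, q)

print("ranking unchanged:", same_ranking)
print(f"original : log loss = {ll_p:.3f}   Brier = {br_p:.3f}")
print(f"sharpened: log loss = {ll_q:.3f}   Brier = {br_q:.3f}")
```

The sharpened probabilities rank the rows identically yet score a worse log loss, because the two misordered rows are now confidently wrong and those penalties outweigh the gains on the rows that were right.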

A model predicts probability \( \hat{p} = 0.9 \) for a row whose true label is \( y = 1 \). Using EQ V3.9, what is the log-loss penalty for this single row, \( -\ln(\hat{p}) \)? (Use \( \ln 0.9 = -0.105 \).)
With \( y = 1 \) only the first term survives: penalty \( = -\ln(\hat{p}) = -\ln(0.9) = -(-0.105) = \) 0.105. Compare a hedge at \( \hat{p}=0.5 \) (cost \(0.693\)) and a confident error at \( \hat{p}=0.1 \) (cost \(2.303\)): log loss rewards confidence only when it is right.
PYTHON · RUNNABLE IN-BROWSER
# Log loss vs Brier: confidence is rewarded only when it's right (EQ V3.9).
import numpy as np

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)              # guard ln(0) = -inf
    return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))

def brier(y, p):
    return np.mean((p - y)**2)

y = np.array([1, 1, 0, 0])                        # two positives, two negatives
confident_right = np.array([0.95, 0.90, 0.05, 0.10])
hedging         = np.array([0.55, 0.55, 0.45, 0.45])
confident_wrong = np.array([0.05, 0.10, 0.95, 0.90])

for name, p in [("confident & right", confident_right),
                ("hedging (~0.5)   ", hedging),
                ("confident & WRONG", confident_wrong)]:
    print(f"{name}:  log loss = {log_loss(y,p):.3f}   Brier = {brier(y,p):.3f}")

print("\nlog loss of a single confident-correct 0.9  :", round(-np.log(0.9), 3))
print("log loss of a single coin-flip      0.5     :", round(-np.log(0.5), 3))
print("log loss explodes for confident errors; Brier stays bounded by 1.")
NEXT

Every metric here graded a fixed threshold or assumed the probabilities were trustworthy — two assumptions the next chapter refuses to make. Chapter 04 sweeps the threshold to draw the ROC and precision–recall curves (and the AUC that summarizes them), then asks the harder question log loss only hinted at: when the model says 70%, does it happen 70% of the time? — calibration, reliability diagrams, and the fixes that make probabilities mean what they say.

3.R

References

  1. Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Machine Learning Technologies 2(1) — the definitive survey of confusion-matrix metrics, their biases, and what each one really measures (§3.3–3.4).
  2. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — the standard reference for loss functions, R², and the bias/variance view of regression error (§3.1–3.2).
  3. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1) — the original proper scoring rule for probabilistic forecasts (§3.5).
  4. Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. J. American Statistical Association 102(477) — the theory of why log loss and Brier reward honest probabilities (§3.5).
  5. Hyndman, R. J. & Koehler, A. B. (2006). Another Look at Measures of Forecast Accuracy. International J. Forecasting 22(4) — the canonical critique of MAPE and the case for scaled error measures (§3.2).
  6. Chicco, D. & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21:6 — a modern argument for why accuracy and F1 mislead on imbalanced data (§3.4).