Regression metrics: MSE, RMSE, MAE, MAPE, R²
A regression model predicts a number \(\hat{y}_i\) for each row whose truth is \(y_i\). The single object every regression metric chews on is the vector of residuals \(e_i = y_i - \hat{y}_i\). The metrics differ only in how they punish a residual — and that choice of punishment is the choice of what the model will try hardest to avoid.
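For reference, the five metrics in the title can be written in terms of the residuals as below. This is a sketch of the standard textbook definitions, not a reproduction of the chapter's numbered equations; \(\bar{y}\) (the mean of the true values) is notation introduced here.

\[
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n} e_i^2, &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i|, &
\mathrm{MAPE} &= \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|, \\
R^2 &= 1 - \frac{\sum_{i} e_i^2}{\sum_{i} (y_i - \bar{y})^2}.
\end{aligned}
\]

Squaring, taking absolute values, and dividing by \(y_i\) are the three different "punishments" applied to the same residual vector.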
# Every regression metric from scratch on a toy fit (EQ V3.1-V3.4).
import numpy as np
rng = np.random.default_rng(0)
n = 60
x = np.linspace(0, 10, n)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, n) # truth: a noisy line
# Fit y = a*x + b by least squares (this is what MSE training would find).
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
yhat = a * x + b
e = y - yhat # residuals: the raw material
mse = np.mean(e**2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(e))
mape = 100 * np.mean(np.abs(e / y)) # y is safely far from 0 here
r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
print(f"fitted line : y = {a:.2f}*x + {b:.2f}")
print(f"MSE = {mse:.3f}")
print(f"RMSE = {rmse:.3f} (same units as y)")
print(f"MAE = {mae:.3f} (<= RMSE, always)")
print(f"MAPE = {mape:.2f} %")
print(f"R2 = {r2:.3f} (1 = perfect, 0 = no better than the mean)")
# plot_scatter(x, y)  # the data the line was fit to; helper is not defined in this snippet, so it is left commented out
When each regression metric misleads
None of these numbers is neutral. Each encodes an opinion about which errors matter, and each has a regime where it quietly tells you the wrong thing. The professional habit is to report at least two — usually RMSE and MAE — and to read the gap between them.
| Metric | What it rewards | Where it misleads |
|---|---|---|
| RMSE / MSE | getting the big cases right | one outlier can dominate the score; over-sensitive to a single bad row |
| MAE | getting the typical case right | grows only linearly with the size of a miss, so a rare catastrophic error barely moves it; under-weights the tail |
| MAPE | comparable error across scales | undefined / explodes near \(y = 0\); asymmetric, biases forecasts low |
| R² | beating the mean baseline | training-set R² never decreases as features are added; can go negative on held-out data; unreliable on tiny test sets |
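The two R² pitfalls in the table can be checked numerically. The sketch below is a minimal illustration on synthetic data: the helper name `r2_score`, the sample size, and the pure-noise "junk" features are all assumptions made for the example, not part of the original text.

```python
import numpy as np

def r2_score(y, yhat):
    """R^2 = 1 - SS_res / SS_tot, with the mean of y as the baseline."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)            # one real feature plus noise

# Training R^2 as pure-noise columns are appended to the design matrix.
noise = rng.normal(size=(n, 20))
train_r2 = []
for k in range(0, 21, 5):
    A = np.column_stack([np.ones(n), x, noise[:, :k]])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    train_r2.append(r2_score(y, A @ coef))
    print(f"{k:2d} junk features: training R2 = {train_r2[-1]:.3f}")

# A prediction that is worse than the mean baseline gives a negative R^2.
bad_pred = np.full(n, np.mean(y) + 5.0)     # constant, offset from the mean
r2_bad = r2_score(y, bad_pred)
print(f"constant prediction offset from the mean: R2 = {r2_bad:.3f}")
```

Because each larger design matrix contains the smaller one, least squares can only match or reduce the training residual, which is why the training R² cannot go down. The score says nothing about whether the extra columns help on new data.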
The cleanest demonstration is the outlier sensitivity of RMSE versus MAE. Take ten residuals of size 1. MAE = 1 and RMSE = 1. Now turn one of those into a residual of 10: a single bad day. MAE crawls up to \((9\cdot 1 + 10)/10 = 1.9\). RMSE leaps to \(\sqrt{(9\cdot 1 + 100)/10} = \sqrt{10.9} \approx 3.30\). The same data; one metric did not even double, the other more than tripled. Which reaction you want is a domain decision, but you must know the metric is making it for you.
MAPE divides by the truth, so a single true value near zero detonates it. If one row has \(y_i = 0.01\) and you predict \(0.5\), that term alone is \(|{-0.49}/0.01| = 49 = 4900\%\) — and the average is now hostage to one near-zero label, regardless of how good the other thousand predictions are. The standard escapes are sMAPE (symmetric MAPE), WAPE / weighted MAPE (divide by the sum of actuals, not row by row), or simply MAE when the targets can be small. Never compute MAPE on data with zeros in it.
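The escapes named above can be sketched in a few lines. The function names `mape`, `smape`, and `wape` are chosen for this example, and the sMAPE shown is only one of several variants in use (here the absolute error is divided by the mean of \(|y_i|\) and \(|\hat{y}_i|\)); treat it as an illustration under those assumptions rather than a canonical definition.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, in percent (breaks down near y = 0)."""
    return 100 * np.mean(np.abs((y - yhat) / y))

def smape(y, yhat):
    """One common sMAPE variant: each row is bounded by 200 %."""
    return 100 * np.mean(np.abs(y - yhat) / ((np.abs(y) + np.abs(yhat)) / 2))

def wape(y, yhat):
    """Total absolute error divided by total absolute actuals, in percent."""
    return 100 * np.sum(np.abs(y - yhat)) / np.sum(np.abs(y))

# Three ordinary rows and one row whose true value sits near zero.
y = np.array([100., 100., 100., 0.01])
yhat = np.array([101., 101., 101., 0.50])

mape_v, smape_v, wape_v = mape(y, yhat), smape(y, yhat), wape(y, yhat)
print(f"MAPE  = {mape_v:8.2f} %")
print(f"sMAPE = {smape_v:8.2f} %")
print(f"WAPE  = {wape_v:8.2f} %")
```

On this toy input WAPE stays close to 1 %, because the near-zero row contributes almost nothing to the denominator's total. sMAPE caps the damage per row but is still pulled up by it, and it remains undefined when a true value and its prediction are both exactly zero.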
# One outlier: MAE shrugs, RMSE leaps. And MAPE near zero detonates.
import numpy as np
e = np.ones(10) # ten residuals of size 1
print("clean residuals:", e.astype(int))
print(f" MAE = {np.mean(np.abs(e)):.3f} RMSE = {np.sqrt(np.mean(e**2)):.3f}")
e_out = e.copy(); e_out[0] = 10 # turn ONE into a size-10 miss
print("\nwith one size-10 outlier:")
print(f" MAE = {np.mean(np.abs(e_out)):.3f} RMSE = {np.sqrt(np.mean(e_out**2)):.3f}")
print(f" RMSE rose {np.sqrt(np.mean(e_out**2))/np.sqrt(np.mean(e**2)):.2f}x; "
f"MAE rose only {np.mean(np.abs(e_out))/np.mean(np.abs(e)):.2f}x")
# The MAPE trap: every absolute error is at most 1, but one true value sits near zero.
y = np.array([100., 100., 100., 0.01])
yhat = np.array([101., 101., 101., 0.50])
print("\nMAPE per row (%):", np.round(100*np.abs((y-yhat)/y), 1))
print(f"overall MAPE = {100*np.mean(np.abs((y-yhat)/y)):.0f} % "
"<- one near-zero label hijacks the whole average")
The confusion matrix
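Classification swaps a continuous truth for a discrete one, and the entire grammar of classification metrics is built from a single 2×2 table. A binary classifier converts a score into a label by comparing it to a threshold; every prediction then lands in one of four cells:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | true positive (TP) | false negative (FN) |
| Actually negative | false positive (FP) | true negative (TN) |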
The threshold is the dial that moves counts between cells. Lower it and you call more things positive: TP and FP both rise, FN falls. Raise it and you become conservative: FP and TP both fall, FN rises. You cannot lower false alarms and misses at the same time by moving the threshold — you can only trade one for the other. That trade-off is the single most important intuition in classification, and Instrument V3.3 below exists to make you feel it in your hands.
# Confusion matrix -> precision, recall, F1, accuracy, all from scratch.
import numpy as np
rng = np.random.default_rng(2)
n = 1000
y = rng.integers(0, 2, n) # true labels
# scores: positives score higher on average, but the classes overlap
score = rng.normal(0, 1, n) + 1.3 * y
thr = 0.5
pred = (score > thr).astype(int)
TP = int(np.sum((pred == 1) & (y == 1)))
FP = int(np.sum((pred == 1) & (y == 0)))
FN = int(np.sum((pred == 0) & (y == 1)))
TN = int(np.sum((pred == 0) & (y == 0)))
print(f"confusion: TP={TP} FP={FP} FN={FN} TN={TN}")
precision = TP / (TP + FP) # of predicted positives, how many real
recall = TP / (TP + FN) # of real positives, how many caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / n
print(f"precision = {precision:.3f}")
print(f"recall = {recall:.3f}")
print(f"F1 = {f1:.3f} (harmonic mean of the two)")
print(f"accuracy = {accuracy:.3f}")
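The threshold trade-off described earlier can be made concrete by sweeping the threshold over the same kind of synthetic scores. This is a minimal sketch: the helper `confusion_counts` and the particular grid of thresholds are assumptions made for the illustration.

```python
import numpy as np

def confusion_counts(y, score, thr):
    """Return (TP, FP, FN, TN) for the rule 'predict positive if score > thr'."""
    pred = score > thr
    pos = (y == 1)
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    fn = int(np.sum(~pred & pos))
    tn = int(np.sum(~pred & ~pos))
    return tp, fp, fn, tn

rng = np.random.default_rng(2)
n = 1000
y = rng.integers(0, 2, n)                    # true labels
score = rng.normal(0, 1, n) + 1.3 * y        # overlapping class scores

thresholds = np.linspace(-1.0, 2.0, 7)       # low -> high (more conservative)
counts = [confusion_counts(y, score, t) for t in thresholds]

print("  thr   TP   FP   FN   TN")
for t, (tp, fp, fn, tn) in zip(thresholds, counts):
    print(f"{t:5.2f} {tp:4d} {fp:4d} {fn:4d} {tn:4d}")
```

Reading down the table, TP and FP shrink together while FN grows: raising the threshold buys fewer false alarms only at the price of more misses.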
Precision, recall, F1, accuracy
Four ratios of the four counts. They look interchangeable; they are not, and choosing the wrong one is the most common way a model ships looking great and fails in production.
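Written out in terms of the four confusion-matrix counts, the ratios are as follows. This is a sketch of the standard definitions, using the same TP, FP, FN, TN names as the code in this chapter.

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN},
\]
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}.
\]

Note that TN appears only in accuracy. When negatives vastly outnumber positives, that one term can dominate the score, which is the root of the problem discussed next.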
"The model is 97% accurate, ship it." Always ask the base rate first. If 97% of the rows are negative, a constant "no" predictor already scores 97% — your model may have learned nothing. The diagnostic reflex: compute the accuracy of the majority-class baseline, and never report accuracy on imbalanced data without precision and recall beside it. Accuracy is a fine metric only when the classes are roughly balanced and false positives and false negatives cost about the same.
# The accuracy paradox: 99% accurate and completely useless.
import numpy as np
rng = np.random.default_rng(5)
n = 10000
y = (rng.random(n) < 0.01).astype(int) # 1% positive (rare disease)
print(f"true positives in data: {y.sum()} / {n} ({100*y.mean():.1f}%)\n")
# "Model" A: predict the majority class (negative) for everyone.
predA = np.zeros(n, dtype=int)
accA = (predA == y).mean()
TP_A = int(np.sum((predA==1)&(y==1)))
print(f"ALWAYS-NEGATIVE: accuracy = {accA:.3f} but caught {TP_A} of {y.sum()} positives")
print(" recall and F1 are 0, precision is undefined (0/0) -> the metric that 'works' is a mirage.\n")
# "Model" B: a real but imperfect detector.
score = rng.normal(0, 1, n) + 2.5 * y
predB = (score > 1.5).astype(int)
TP=int(np.sum((predB==1)&(y==1))); FP=int(np.sum((predB==1)&(y==0)))
FN=int(np.sum((predB==0)&(y==1))); TN=int(np.sum((predB==0)&(y==0)))
prec = TP/(TP+FP) if TP+FP else 0
rec = TP/(TP+FN) if TP+FN else 0
f1 = 2*prec*rec/(prec+rec) if prec+rec else 0
print(f"REAL DETECTOR: accuracy = {(TP+TN)/n:.3f}")
print(f" precision = {prec:.3f} recall = {rec:.3f} F1 = {f1:.3f}")
print("The detector has the LOWER accuracy, yet only precision/recall/F1 reveal that it is the model that works.")
Log loss & probabilistic scoring
Everything so far grades a decision — the label after thresholding. But most classifiers output a probability, and throwing it away to compute accuracy discards information: a model that says "90% sure" and is right is better than one that says "51% sure" and is right. Log loss (binary cross-entropy) grades the probability itself, rewarding confidence only when it is earned.
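For a binary label \(y_i \in \{0, 1\}\) and a predicted probability \(\hat{p}_i\) of the positive class, the usual form of log loss is sketched below (the standard definition, stated here because it makes the "confidence must be earned" behaviour easy to see):

\[
\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i) \,\Big].
\]

Only the probability assigned to the true outcome enters each term. A correct positive scored at 0.9 costs \(-\ln 0.9 \approx 0.105\), the same positive scored at 0.51 costs \(-\ln 0.51 \approx 0.673\), and a positive scored at 0.01 costs \(-\ln 0.01 \approx 4.605\); the penalty grows without bound as the probability given to the truth approaches zero.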
Two sibling scores are worth knowing. The Brier score is the mean squared error of the probabilities, \(\frac{1}{n}\sum_i(\hat{p}_i - y_i)^2\). It is also a proper scoring rule, but bounded (a confident wrong answer maxes out at 1 rather than infinity), so it is gentler on outliers and easier to read. And cross-entropy generalizes immediately to \(K\) classes as \(-\frac{1}{n}\sum_i\sum_{c} y_{ic}\ln\hat{p}_{ic}\), the multiclass loss behind virtually every neural classifier. The honest caveat: log loss is only as trustworthy as the model's calibration. A model can have great ranking (AUC) yet terrible log loss if its probabilities are systematically over- or under-confident, which is exactly the gap the next chapter closes.
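The \(K\)-class formula above can be sketched directly in NumPy. Since \(y_{ic}\) is one-hot, the double sum simply picks out the probability given to the true class of each row. The function name `cross_entropy` and the toy probabilities are assumptions made for this example.

```python
import numpy as np

def cross_entropy(labels, probs, eps=1e-12):
    """Mean multiclass cross-entropy.

    labels: integer class ids, shape (n,)
    probs:  predicted probabilities, shape (n, K), rows summing to 1
    """
    probs = np.clip(probs, eps, 1.0)         # guard ln(0) = -inf
    n = labels.shape[0]
    # One-hot y_ic keeps only the probability assigned to the true class.
    return -np.mean(np.log(probs[np.arange(n), labels]))

labels = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.6, 0.2]])
ce = cross_entropy(labels, probs)
print(f"3-class cross-entropy = {ce:.4f}")

# With K = 2 the same function reproduces binary log loss.
y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.90, 0.05, 0.10])
two_col = np.column_stack([1 - p, p])        # columns: P(class 0), P(class 1)
binary = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
ce_two = cross_entropy(y, two_col)
print(f"binary log loss = {binary:.4f}, 2-class cross-entropy = {ce_two:.4f}")
```

The second check is the design point: binary log loss is not a separate metric, only the \(K = 2\) case of the same sum.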
# Log loss vs Brier: confidence is rewarded only when it's right (EQ V3.9).
import numpy as np
def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12) # guard ln(0) = -inf
    return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))
def brier(y, p):
    return np.mean((p - y)**2)
y = np.array([1, 1, 0, 0]) # two positives, two negatives
confident_right = np.array([0.95, 0.90, 0.05, 0.10])
hedging = np.array([0.55, 0.55, 0.45, 0.45])
confident_wrong = np.array([0.05, 0.10, 0.95, 0.90])
for name, p in [("confident & right", confident_right),
                ("hedging (~0.5) ", hedging),
                ("confident & WRONG", confident_wrong)]:
    print(f"{name}: log loss = {log_loss(y,p):.3f} Brier = {brier(y,p):.3f}")
print("\nlog loss of a single confident-correct 0.9 :", round(-np.log(0.9), 3))
print("log loss of a single coin-flip 0.5 :", round(-np.log(0.5), 3))
print("log loss explodes for confident errors; Brier stays bounded by 1.")
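The caveat about ranking versus calibration can also be checked on synthetic data. In the sketch below, outcomes are drawn from known probabilities, and a hypothetical overconfident model is built by pushing those probabilities toward 0 and 1 with a monotone transform. The transform, the exponent `a`, and the rank-based `auc` helper are all assumptions made for the illustration.

```python
import numpy as np

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)               # guard ln(0) = -inf
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def auc(y, score):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    ranks = np.argsort(np.argsort(score)) + 1      # 1 = lowest score
    n_pos = int(np.sum(y == 1))
    n_neg = y.size - n_pos
    return (np.sum(ranks[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n = 5000
p_true = rng.uniform(0.02, 0.98, n)                # honest probabilities
y = (rng.random(n) < p_true).astype(int)           # outcomes drawn from them

# Hypothetical overconfident model: same ordering, sharper probabilities.
a = 3.0
p_over = p_true**a / (p_true**a + (1 - p_true)**a)

auc_cal, auc_over = auc(y, p_true), auc(y, p_over)
ll_cal, ll_over = log_loss(y, p_true), log_loss(y, p_over)
print(f"calibrated   : AUC = {auc_cal:.3f}  log loss = {ll_cal:.3f}")
print(f"overconfident: AUC = {auc_over:.3f}  log loss = {ll_over:.3f}")
```

Because the transform preserves the order of the scores, the ranking metric cannot tell the two models apart, while log loss penalizes the sharpened probabilities. A metric that only ranks says nothing about whether the stated probabilities can be believed.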
Every metric here graded a fixed threshold or assumed the probabilities were trustworthy — two assumptions the next chapter refuses to make. Chapter 04 sweeps the threshold to draw the ROC and precision–recall curves (and the AUC that summarizes them), then asks the harder question log loss only hinted at: when the model says 70%, does it happen 70% of the time? — calibration, reliability diagrams, and the fixes that make probabilities mean what they say.