Why imbalance breaks training & metrics
A dataset is imbalanced when one class vastly outnumbers another. The ratio is not a curiosity — it is the whole problem. Credit-card fraud runs near 1 transaction in 1,000; a screening test for a rare cancer might see 1 case in 10,000; a churn flag fires for a few percent of users. In every case the class you actually care about is the rare one, and the loss function — left to its own devices — barely notices it exists.
Start with the metric everyone reaches for. Accuracy is the fraction of predictions that are correct, and on imbalanced data it is worse than useless — it is actively misleading. Consider the majority-class baseline: a "model" that ignores its input and always predicts the common class.
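To make the baseline concrete, here is a minimal sketch that scores an always-predict-majority rule at two of the prevalences quoted above. The function name `majority_baseline` and the sample size of 100,000 are illustrative choices, not part of the original text.

```python
import numpy as np

def majority_baseline(n, n_pos):
    """Accuracy and minority recall of a rule that always predicts class 0."""
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n - n_pos, dtype=int)]
    yhat = np.zeros(n, dtype=int)                        # ignores the input entirely
    acc = float((yhat == y).mean())                      # equals 1 - prevalence
    rec = float(((yhat == 1) & (y == 1)).sum() / n_pos)  # no positive is ever flagged
    return acc, rec

for name, n, n_pos in [("fraud  1:1,000", 100_000, 100),
                       ("cancer 1:10,000", 100_000, 10)]:
    acc, rec = majority_baseline(n, n_pos)
    print(f"{name:16s} accuracy={acc:.4f}  minority recall={rec:.1f}")
```

In general the baseline's accuracy is one minus the minority share, so the rarer the class you care about, the better this do-nothing rule looks.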
The damage runs deeper than the scorecard. Most classifiers are trained by minimizing an average loss over examples (cross-entropy, Vol I · EQ M3.3). With 999 majority examples for every minority one, the gradient is dominated by the easy majority: the model can drive total loss down by becoming an excellent detector of the common class and a blind one for the rare class. The decision boundary is pushed into the minority region — the cheapest way to shave the average loss is to misclassify the few. Imbalance is therefore not just an evaluation headache; it is an optimization bias baked into the objective.
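A back-of-the-envelope version of this argument, assuming the simplest possible model: a logistic classifier with only a bias term \(b\), so every example receives the same prediction \(p = \sigma(b)\). The symbols \(N_{\text{maj}}\), \(N_{\text{min}}\) and \(N\) are introduced here for illustration and are not the notation of EQ M3.3.

```latex
\[
\mathcal{L}(b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log p + (1-y_i)\log(1-p)\Big],
\qquad p = \sigma(b)
\]
\[
\frac{\partial \mathcal{L}}{\partial b}
= \frac{1}{N}\sum_{i=1}^{N}\,(p - y_i)
= \underbrace{\frac{N_{\text{maj}}}{N}\,p}_{\text{majority pull}}
\;-\;
\underbrace{\frac{N_{\text{min}}}{N}\,(1-p)}_{\text{minority pull}}
\]
```

At initialization (\(p = 0.5\)) the two pulls differ by the factor \(N_{\text{maj}}/N_{\text{min}}\), which is 999 in the example above. Setting the gradient to zero gives \(p = N_{\text{min}}/N\): with nothing else to go on, the optimizer settles at the base rate.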
The effect is easy to reproduce: dial the minority ratio down and accuracy marches toward 100% while recall on the rare class collapses. The model has stopped learning the thing you built it for.
Resampling — random over- and under-sampling
The simplest cure operates on the data, before any model sees it: change the class proportions so the loss can no longer ignore the minority. Two opposite moves achieve the same balanced ratio.
- Random over-sampling (ROS). Duplicate minority examples (sampling with replacement) until the classes match. Keeps all majority information, but the copies are exact — the model can memorize them, inflating training scores and inviting overfitting to the few real minority points.
- Random under-sampling (RUS). Discard majority examples until the classes match. Fast, light, and a strong baseline — but it throws away potentially useful majority data, which hurts when the majority class is itself varied or the dataset is small.
To reach a target minority share \(\rho\) (with \(\rho = 0.5\) meaning a balanced 1:1 set) by over-sampling, the minority class must be grown to match. The arithmetic is worth internalizing because every resampling library is doing exactly this under the hood:
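One way to write the arithmetic, as a sketch: if the majority count \(n_{\text{maj}}\) is left untouched, a minority share of \(\rho\) requires \(n'_{\text{min}} = \frac{\rho}{1-\rho}\, n_{\text{maj}}\) minority rows; if instead the minority count \(n_{\text{min}}\) is left untouched, under-sampling keeps \(n'_{\text{maj}} = \frac{1-\rho}{\rho}\, n_{\text{min}}\) majority rows. The function names below are hypothetical helpers, not a library API.

```python
def oversample_counts(n_maj, n_min, rho):
    """Minority rows needed for a minority share of rho, majority untouched.

    Returns (target minority count, number of extra rows to draw).
    """
    target_min = round(rho * n_maj / (1.0 - rho))
    return target_min, max(target_min - n_min, 0)

def undersample_count(n_min, rho):
    """Majority rows to keep for a minority share of rho, minority untouched."""
    return round(n_min * (1.0 - rho) / rho)

target, extra = oversample_counts(950, 50, rho=0.5)
print(target, extra)                 # 950 900
print(undersample_count(50, 0.5))    # 50
```

With \(\rho = 0.5\) both formulas collapse to "match the other class", which is the 1:1 case.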
Resampling does not add information. Duplicating a point tells the model nothing it did not already know; it only reweights how loudly that point speaks in the loss — which is mathematically close to the class-weighting of §5.4. The honest framing: resampling and reweighting both move the effective prevalence the optimizer sees, nudging the decision threshold without changing the underlying separability of the classes. That realization is what motivates SMOTE — a way to add genuinely new minority points instead of mere copies.
```python
# EQ D5.2: random over- vs under-sampling to a 1:1 balance
import numpy as np

rng = np.random.default_rng(0)

# a 95:5 training fold: 950 majority (label 0), 50 minority (label 1)
maj = rng.normal(0.0, 1.0, (950, 2))
mn = rng.normal(2.4, 1.0, (50, 2))
X = np.vstack([maj, mn])
y = np.array([0] * 950 + [1] * 50)
print(f"before      : maj={np.sum(y==0):4d} min={np.sum(y==1):4d} "
      f"min-share={np.mean(y==1):.3f}")

# random OVER-sampling: duplicate minority (with replacement) up to majority count
idx_min = np.where(y == 1)[0]
extra = rng.choice(idx_min, size=950 - 50, replace=True)  # 900 duplicates
Xo, yo = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
print(f"oversampled : maj={np.sum(yo==0):4d} min={np.sum(yo==1):4d} "
      f"min-share={np.mean(yo==1):.3f} (rho=0.5, EQ D5.2)")

# random UNDER-sampling: keep all 50 minority, randomly keep 50 majority
keep_maj = rng.choice(np.where(y == 0)[0], size=50, replace=False)
Xu = np.vstack([X[keep_maj], X[idx_min]])
yu = np.array([0] * 50 + [1] * 50)
print(f"undersampled: maj={np.sum(yu==0):4d} min={np.sum(yu==1):4d} "
      f"min-share={np.mean(yu==1):.3f} (kept only 100 of 1000 rows)")
```
SMOTE & variants
Random over-sampling copies points; SMOTE — Synthetic Minority Over-sampling Technique (Chawla et al., 2002) — invents them. Instead of duplicating a minority example, it draws a brand-new point along the line segment connecting that example to one of its minority near neighbors. The result is a denser, smoother minority region rather than a stack of identical copies, which forces the classifier to carve out broader minority territory instead of memorizing isolated dots.
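Written out, the interpolation step looks like this. The symbols \(x_i\) (a minority example), \(x_{nn}\) (one of its \(k\) nearest minority neighbors, chosen at random) and \(\lambda\) are labels chosen here for the sketch.

```latex
\[
x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i),
\qquad \lambda \sim \mathcal{U}(0, 1)
\]
```

With \(\lambda = 0\) the new point coincides with \(x_i\), with \(\lambda = 1\) it coincides with the neighbor, and anything in between lies on the segment joining them.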
Plain SMOTE treats every minority point equally. Its two most-used descendants spend their synthetic budget where it helps most — near the decision boundary, where errors actually happen:
| Variant | Where it synthesizes | Intuition |
|---|---|---|
| SMOTE | uniformly across all minority points | Densifies the whole minority region; simple, strong default. |
| Borderline-SMOTE | only from minority points near the boundary | A point is "in danger" if most of its neighbors are majority; reinforce exactly those frontier cases. |
| ADASYN | more for minority points that are harder to learn | Generate inversely to local density — pour synthetic mass where the minority is most outnumbered. |
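As a rough sketch of the "in danger" test in the Borderline-SMOTE row, the snippet below labels each minority point from the classes of its \(m\) nearest neighbors, following the safe / danger / noise split described by Han et al. (2005): all-majority neighborhoods are treated as noise, neighborhoods that are at least half majority as danger, and the rest as safe. The function name, the choice \(m = 3\) and the toy data are illustrative assumptions.

```python
import numpy as np

def borderline_categories(X, y, m=3, minority=1):
    """Label each minority point safe / danger / noise from its m nearest neighbours."""
    labels = []
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nn = np.argsort(d)[:m]               # m nearest neighbours, any class
        n_maj = int(np.sum(y[nn] != minority))
        if n_maj == m:
            labels.append("noise")           # entirely surrounded by majority
        elif n_maj >= m / 2:
            labels.append("danger")          # frontier point: synthesize from here
        else:
            labels.append("safe")
    return labels

# hypothetical toy data laid out on a line (second coordinate is zero)
x_min = [0.0, 0.5, 1.0, 1.5, 4.0, 4.4, 6.25]
x_maj = [5.0, 5.5, 6.0, 6.5, 10.0]
X = np.array([[v, 0.0] for v in x_min + x_maj])
y = np.array([1] * len(x_min) + [0] * len(x_maj))
print(borderline_categories(X, y, m=3))
# ['safe', 'safe', 'safe', 'safe', 'danger', 'danger', 'noise']
```

Only the points labelled danger would seed synthetic examples; the isolated point at 6.25 is skipped as probable noise.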
Honest caveats. SMOTE assumes the space between two minority neighbors is itself minority — true for smooth, continuous features, false for categorical ones (use SMOTE-NC) and shaky in high dimensions, where "near neighbor" loses meaning and interpolation can land in nonsense regions. It can amplify noise (a mislabeled minority point spawns a cluster of synthetic noise) and, by design, blurs the boundary in overlapping classes. Modern practice often pairs it with a cleaning step — SMOTE-Tomek or SMOTE-ENN remove the majority points SMOTE's new neighbors now contradict. And on large deep-learning problems, loss-level fixes (§5.4) frequently beat resampling outright. SMOTE is a sharp tool, not a magic wand.
```python
# SMOTE in pure numpy: interpolate between minority k-NN (EQ D5.3)
import numpy as np

rng = np.random.default_rng(1)

# a 90:10 fold: 90 majority, 10 minority, 2 features
maj = rng.normal(0.0, 1.0, (90, 2))
mn = rng.normal(2.6, 0.7, (10, 2))
X, y = np.vstack([maj, mn]), np.array([0] * 90 + [1] * 10)
P = X[y == 1]  # minority points only

def smote(P, n_new, k=5):
    out = []
    D = np.sqrt(((P[:, None] - P[None]) ** 2).sum(-1))  # pairwise distances
    for _ in range(n_new):
        i = rng.integers(len(P))                 # a random minority point
        nn = np.argsort(D[i])[1:k + 1]           # its k nearest minority neighbors
        j = nn[rng.integers(len(nn))]            # pick one neighbor
        lam = rng.random()                       # lambda ~ U(0,1)
        out.append(P[i] + lam * (P[j] - P[i]))   # the interpolated synthetic point
    return np.array(out)

S = smote(P, n_new=80, k=5)
before = y.mean()
after = (y.sum() + len(S)) / (len(y) + len(S))
print(f"minority before SMOTE : {y.sum():2d} / {len(y)} = {before:.3f}")
print(f"synthetic generated   : {len(S)}")
print(f"minority after SMOTE  : {y.sum()+len(S):2d} / {len(y)+len(S)} = {after:.3f}")

# interpolated points cannot leave the bounding box of the real minority points
inside = bool((S.min(0) >= P.min(0)).all() and (S.max(0) <= P.max(0)).all())
print("every synthetic point sits inside the real-minority box:", inside)

# plot_scatter is not defined in this listing; it is assumed to be a plotting
# helper supplied by the host environment, so the call is skipped when absent.
if "plot_scatter" in globals():
    plot_scatter(np.r_[X[:, 0], S[:, 0]], np.r_[X[:, 1], S[:, 1]],
                 np.r_[y, np.full(len(S), 2)])  # 0 maj, 1 real-min, 2 synthetic
```
Algorithm-level fixes — class weights, focal loss, threshold moving
Resampling rewrites the data; the alternative is to leave the data alone and rewrite the objective. Three loss- and decision-level levers do this without touching a single row.
Class weights (cost-sensitive learning)
Scale each example's contribution to the loss by a class-dependent weight, so a minority mistake costs more than a majority one. The standard inverse-frequency weighting gives each class influence proportional to its rarity:
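A common form of inverse-frequency weighting, assuming \(N\) training examples, \(K\) classes and \(n_c\) examples in class \(c\) (symbols introduced here for the sketch), is \(w_c = N / (K \cdot n_c)\). This matches the formula scikit-learn documents for its balanced mode, `n_samples / (n_classes * np.bincount(y))`. The helper name below is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """w_c = N / (K * n_c): inverse-frequency weights, one per class."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 950 + [1] * 50)           # a 95:5 fold
w = balanced_class_weights(y)
print({c: round(v, 3) for c, v in w.items()})

# per-example weights: the two classes now carry equal total weight
sample_w = np.array([w[c] for c in y.tolist()])
total_maj = sample_w[y == 0].sum()
total_min = sample_w[y == 1].sum()
print(f"total weight  majority={total_maj:.1f}  minority={total_min:.1f}")
```

The per-example weights would then multiply each example's loss term before averaging.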
This weighting (exposed in scikit-learn as class_weight="balanced") makes each class contribute equally to the total loss in expectation. A class \(10\times\) rarer gets \(\sim\!10\times\) the per-example weight. This is the loss-level twin of over-sampling: both inflate the minority's voice in the gradient, but weighting adds no rows and no duplicates, so it is cheaper and tends to overfit less. It moves the effective decision threshold toward the minority class, trading precision for recall.
Focal loss
Class weights up-weight a whole class; focal loss (Lin et al., 2017, for dense object detection) up-weights the hard examples within it — the ones the model still gets wrong — and lets the easy, already-correct majority examples fade from the gradient automatically:
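In the binary form given by Lin et al. (2017), the loss is \(\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t\), where \(p_t\) is the predicted probability of the true class. The NumPy sketch below is an illustrative re-implementation, not the authors' code; the defaults \(\gamma = 2\) and \(\alpha = 0.25\) are the values that paper reports working best, and `eps` is a numerical guard added here.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-example binary focal loss; p is the predicted P(y=1)."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing factor
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# three positives: easy, middling, hard (alpha=1 isolates the modulating factor)
p = np.array([0.95, 0.60, 0.10])
y = np.array([1, 1, 1])
ce = -np.log(p)
fl = focal_loss(p, y, gamma=2.0, alpha=1.0)
for p_i, ce_i, fl_i in zip(p, ce, fl):
    print(f"p={p_i:.2f}  cross-entropy={ce_i:.4f}  focal={fl_i:.4f}  ratio={fl_i/ce_i:.4f}")
```

The ratio column is just \((1 - p_t)^{\gamma}\): the easy example keeps 0.25% of its cross-entropy, the hard one keeps 81%. Setting \(\gamma = 0\) recovers \(\alpha\)-weighted cross-entropy.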
Threshold moving
The cheapest fix of all changes nothing about training. A probabilistic classifier outputs \(p = P(y=1 \mid x)\); the default rule "predict positive if \(p > 0.5\)" is a convention, not a law. Under imbalance — or under asymmetric costs, where a missed fraud dwarfs a false alarm — the optimal cut sits elsewhere. Sweep the threshold \(\tau\) and you trace the entire precision/recall trade-off from a single trained model:
```python
# Accuracy lies; recall/precision trade off as the threshold moves (99:1)
import numpy as np

rng = np.random.default_rng(3)
n = 10000
n_pos = 100  # 1% prevalence -> 99:1

# simulate classifier scores: positives skew high, negatives skew low
s_pos = np.clip(rng.beta(5, 2, n_pos), 0, 1)      # actual positives, scores skew high
s_neg = np.clip(rng.beta(2, 6, n - n_pos), 0, 1)  # actual negatives, scores skew low
score = np.r_[s_pos, s_neg]
y = np.r_[np.ones(n_pos), np.zeros(n - n_pos)].astype(int)

def report(tau):
    """Confusion-matrix metrics when predicting positive for score > tau."""
    yhat = (score > tau).astype(int)
    tp = int(((yhat == 1) & (y == 1)).sum())
    fp = int(((yhat == 1) & (y == 0)).sum())
    fn = int(((yhat == 0) & (y == 1)).sum())
    tn = int(((yhat == 0) & (y == 0)).sum())
    acc = (tp + tn) / n
    prec = tp / (tp + fp) if tp + fp else float('nan')
    rec = tp / (tp + fn) if tp + fn else 0.0
    return acc, prec, rec, tp, fp, fn

print(f"{'tau':>4} {'acc':>6} {'prec':>5} {'rec':>5} {'TP':>4} {'FP':>4} {'FN':>4}")
for tau in (0.5, 0.3, 0.1):
    acc, prec, rec, tp, fp, fn = report(tau)
    print(f"{tau:.2f} {acc:.4f} {prec:.3f} {rec:.3f} {tp:4d} {fp:4d} {fn:4d}")

print("\nalways-predict-negative: acc =", round((n - n_pos) / n, 4),
      " recall = 0.0 (caught nothing)")
print("dropping tau 0.5 -> 0.1 trades precision for the recall you actually need.")
```
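When the scores are calibrated probabilities and the two error costs are known, the cost-minimizing cut has a closed form: flag an example when \(p \cdot c_{FN} > (1 - p) \cdot c_{FP}\), which gives \(\tau^{*} = c_{FP} / (c_{FP} + c_{FN})\). The sketch below assumes correct decisions cost nothing; `cost_fp` and `cost_fn` are hypothetical parameter names.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold on a calibrated P(y=1|x) that minimises expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, predict_positive, cost_fp, cost_fn):
    """Expected cost of one decision when the true P(y=1|x) is p."""
    return (1 - p) * cost_fp if predict_positive else p * cost_fn

for cost_fp, cost_fn in [(1, 1), (1, 9), (1, 99)]:
    tau = cost_optimal_threshold(cost_fp, cost_fn)
    print(f"cost_fp={cost_fp}  cost_fn={cost_fn:2d}  ->  tau*={tau:.2f}")

# sanity check at p = 0.2 with a miss costing 9x a false alarm
print(expected_cost(0.2, True, 1, 9), expected_cost(0.2, False, 1, 9))
```

Symmetric costs recover the familiar 0.5; a miss that costs nine times a false alarm pulls the cut down to 0.1. If the scores are not calibrated, the formula no longer applies and the threshold has to be tuned on held-out data instead.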
Evaluating under imbalance — PR curves, the right metric
Every prediction lands in one of four cells of the confusion matrix, and every honest metric is built from them:
| CONFUSION MATRIX | PREDICTED + (alarm) | PREDICTED − (clear) |
|---|---|---|
| ACTUAL + (rare) | TP · caught it | FN · a miss |
| ACTUAL − (common) | FP · false alarm | TN · correct all-clear |
From these, two questions — and they are genuinely different questions:
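In terms of the four cells, the standard definitions are as follows. Precision asks "of the alarms raised, how many were real?" and recall asks "of the real cases, how many were caught?".

```latex
\[
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
```

Precision is hurt by false alarms and ignores misses; recall is hurt by misses and ignores false alarms. Neither involves TN, which is why both stay informative when the negatives vastly outnumber the positives.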
Sweeping the threshold turns these point metrics into curves. Two summaries dominate, and the choice between them is the single most important evaluation decision under imbalance:
- ROC curve (TPR vs. FPR) and its area, ROC-AUC. Because FPR = FP/(FP+TN) has the huge TN count in its denominator, ROC is insensitive to prevalence — which sounds like a virtue but is the opposite here. On a 99:1 problem a model can post a flattering 0.95 ROC-AUC while its precision is dismal, because thousands of false positives barely dent the FPR.
- Precision–Recall curve and its area, PR-AUC (a.k.a. average precision). Precision does feel every false positive directly, so the PR curve exposes exactly the failure ROC hides. On imbalanced problems, prefer PR-AUC.
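The contrast between the two summaries can be sketched in plain NumPy. The code below holds the score distributions fixed and changes only the prevalence, then computes ROC-AUC (via the rank-sum identity) and average precision (mean precision at the rank of each positive). The Beta score distributions and the function names are assumptions chosen for illustration, and the rank-based AUC assumes no tied scores.

```python
import numpy as np

def roc_auc(y, s):
    """ROC-AUC via the rank-sum (Mann-Whitney) identity; assumes no tied scores."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y, s):
    """Mean of precision@k taken at the rank k of every positive."""
    hits = y[np.argsort(-s)]
    prec_at_k = np.cumsum(hits) / np.arange(1, len(y) + 1)
    return prec_at_k[hits == 1].mean()

def simulate(n_pos, n_neg, rng):
    s = np.r_[rng.beta(5, 2, n_pos), rng.beta(2, 6, n_neg)]
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
    return y, s

rng = np.random.default_rng(0)
y_bal, s_bal = simulate(5000, 5000, rng)     # 50% prevalence
y_rare, s_rare = simulate(100, 9900, rng)    # 1% prevalence, same score model
auc_bal, ap_bal = roc_auc(y_bal, s_bal), average_precision(y_bal, s_bal)
auc_rare, ap_rare = roc_auc(y_rare, s_rare), average_precision(y_rare, s_rare)
print(f"balanced: ROC-AUC={auc_bal:.3f}  PR-AUC={ap_bal:.3f}")
print(f"99:1    : ROC-AUC={auc_rare:.3f}  PR-AUC={ap_rare:.3f}")
```

ROC-AUC should barely move between the two runs, because it depends only on how positives rank against negatives, while average precision drops once the negatives outnumber the positives 99 to 1.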
The base-rate ambush, in numbers. Screen 10,000 people for a condition with 1% prevalence (100 positives). A genuinely good test — 90% recall, 8% false-positive rate — catches 90 of the 100 cases but also flags 8% of 9,900 healthy people = 792 false alarms. Precision is \(90 / (90 + 792) \approx 10.2\%\): nine of every ten alarms are wrong, even though the test is "90% accurate" by recall. No amount of resampling fixes this — it is the prevalence speaking. The defenses are honest metrics (PR-AUC, precision at fixed recall), explicit cost modeling, and a calibrated threshold.
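The arithmetic above is easy to check, and to rerun at other prevalences. The helper `precision_from_rates` is a hypothetical name; it applies Bayes' rule to a recall and a false-positive rate.

```python
def precision_from_rates(prevalence, recall, fpr):
    """Precision implied by recall, false-positive rate and prevalence."""
    tp = recall * prevalence          # true-positive mass
    fp = fpr * (1.0 - prevalence)     # false-positive mass
    return tp / (tp + fp)

n, prevalence, recall, fpr = 10_000, 0.01, 0.90, 0.08
pos = round(n * prevalence)           # 100 actual cases
neg = n - pos                         # 9,900 healthy people
tp = round(recall * pos)              # 90 caught
fp = round(fpr * neg)                 # 792 false alarms
print(f"TP={tp}  FP={fp}  precision={tp / (tp + fp):.3f}")

# same test, different prevalence: only the base rate changes
for prev in (0.5, 0.1, 0.01, 0.001):
    print(f"prevalence={prev:<6} precision={precision_from_rates(prev, recall, fpr):.3f}")
```

The same test that is right about nine alarms in ten on a balanced population is wrong about nine in ten at 1% prevalence.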
Beyond curves, two more metrics earn their place: balanced accuracy (the mean of recall on each class — the right "accuracy" when you must report one number) and Matthews correlation coefficient (MCC), a single value in \([-1, 1]\) that uses all four confusion cells and stays honest across any prevalence. Whatever you choose, the iron rule from §5.2 holds: measure on data at the real prevalence. Resample to train; never resample to evaluate.
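Both metrics follow directly from the four confusion cells. The sketch below compares the screening test from the base-rate example with the always-negative baseline; returning 0 when the MCC denominator vanishes is a common convention and is an assumption here, as are the function names.

```python
import math

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of recall on the positive class and recall on the negative class."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 by convention if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

cases = {
    "screening test ": dict(tp=90, fp=792, fn=10, tn=9108),
    "always negative": dict(tp=0, fp=0, fn=100, tn=9900),
}
for name, c in cases.items():
    acc = (c["tp"] + c["tn"]) / sum(c.values())
    print(f"{name}  accuracy={acc:.3f}  "
          f"balanced acc={balanced_accuracy(**c):.3f}  MCC={mcc(**c):.3f}")
```

Plain accuracy ranks the do-nothing baseline above the screening test; balanced accuracy and MCC both reverse that ordering.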
The four classic imbalance mistakes: (1) reporting accuracy — it grades the base rate, not the model; (2) resampling before the train/test split, leaking synthetic minority points into the test fold and inventing scores; (3) trusting ROC-AUC on a 99:1 problem while precision quietly collapses; (4) shipping the default \(\tau = 0.5\) when your costs are asymmetric — the threshold is a free dial you forgot to turn.
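Mistake (2) is the easiest to make silently, so here is a minimal sketch of the leak using random over-sampling on row indices. The `oversample` helper, the 80/20 split and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

def oversample(idx, y, rng):
    """Row indices with minority rows duplicated up to the majority count."""
    maj, mn = idx[y[idx] == 0], idx[y[idx] == 1]
    extra = rng.choice(mn, size=len(maj) - len(mn), replace=True)
    return np.concatenate([idx, extra])

# WRONG: resample first, then split, so copies of one row land on both sides
resampled = rng.permutation(oversample(np.arange(len(y)), y, rng))
cut = int(0.8 * len(resampled))
train_bad, test_bad = resampled[:cut], resampled[cut:]
leaked = np.intersect1d(train_bad, test_bad)

# RIGHT: split first, then resample the training fold only
perm = rng.permutation(len(y))
cut = int(0.8 * len(y))
train_idx, test_idx = perm[:cut], perm[cut:]
train_ok = oversample(train_idx, y, rng)
clean = np.intersect1d(train_ok, test_idx)

print(f"resample-then-split: {len(leaked)} original rows appear in both folds")
print(f"split-then-resample: {len(clean)} rows shared; "
      f"test minority share = {y[test_idx].mean():.3f}")
```

In the correct ordering the test fold keeps the real prevalence and shares no rows with training; `X[train_ok]` and `y[train_ok]` are what the model would be fit on.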
You now know how to prepare and weigh data so a model learns what matters. The Machine Learning volume opens by stepping back to first principles — what it even means to learn from data, the bias–variance decomposition, and why every technique in this volume is ultimately a bet about generalization. Volume I · Chapter 01: Learning from Data.
References
- Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique.
- He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data.
- Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection.
- Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning.
- He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning.
- Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves.
- Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy.