Imbalanced Data — Resampling & SMOTE

5.1

Why imbalance breaks training & metrics

A dataset is imbalanced when one class vastly outnumbers another. The ratio is not a curiosity — it is the whole problem. Credit-card fraud runs near 1 transaction in 1,000; a screening test for a rare cancer might see 1 case in 10,000; a churn flag fires for a few percent of users. In every case the class you actually care about is the rare one, and the loss function — left to its own devices — barely notices it exists.

Start with the metric everyone reaches for. Accuracy is the fraction of predictions that are correct, and on imbalanced data it is worse than useless — it is actively misleading. Consider the majority-class baseline: a "model" that ignores its input and always predicts the common class.

EQ D5.1 — THE ACCURACY TRAP $$ \text{Acc}_{\text{majority}} \;=\; \frac{N_{\text{maj}}}{N_{\text{maj}} + N_{\text{min}}} \;=\; 1 - \pi, \qquad \pi \;=\; \frac{N_{\text{min}}}{N} $$

$\pi$ is the minority prevalence — the base rate of the positive class. A constant predictor that always says "majority" scores $1-\pi$ accuracy while detecting nothing. At $\pi = 0.001$ it reads 99.9% accurate; at $\pi = 0.05$, 95%. The number is real and the model is useless — accuracy measures the imbalance, not the model. The honest signals are recall (of the real positives, how many did you catch?) and precision (of your alarms, how many were real?), defined in §5.5.

A dataset has a 95:5 class split (95% negative, 5% positive). A model that always predicts the majority (negative) class achieves what accuracy? (Give a decimal.)

By EQ D5.1, accuracy $= N_{\text{maj}}/N = 95/100 = $ 0.95. The constant predictor scores 95% while catching zero positives — which is exactly why accuracy cannot be trusted under imbalance.
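The worked answer can be checked in a few lines. This is a minimal sketch on a made-up 95:5 label vector (the array names are illustrative, not from the text); it scores a constant "always majority" predictor on both accuracy and recall.

```python
import numpy as np

# Hypothetical 95:5 label vector: 950 negatives (0), 50 positives (1).
y_true = np.array([0] * 950 + [1] * 50)

# The "model": ignore the input and always predict the majority class (0).
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_pred == y_true))
# Recall = caught positives / all real positives.
recall = float(np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1))
pi = float(np.mean(y_true == 1))  # minority prevalence

print(f"accuracy = {accuracy:.3f}   1 - pi = {1 - pi:.3f}   recall = {recall:.3f}")
```

Accuracy lands on $1 - \pi$ exactly, while recall is zero: the score reflects the label distribution and nothing about the predictor.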

The damage runs deeper than the scorecard. Most classifiers are trained by minimizing an average loss over examples (cross-entropy, Vol I · EQ M3.3). With 999 majority examples for every minority one, the gradient is dominated by the easy majority: the model can drive total loss down by becoming an excellent detector of the common class and a blind one for the rare class. The decision boundary is pushed into the minority region — the cheapest way to shave the average loss is to misclassify the few. Imbalance is therefore not just an evaluation headache; it is an optimization bias baked into the objective.
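One way to see the boundary shift without training anything is a simplified stand-in: two 1-D Gaussian classes with equal variance. Under that assumption the true posterior is exactly logistic, so the boundary that minimizes expected cross-entropy is the point where $P(\text{minority} \mid x) = 0.5$, which has the closed form $x^{\star} = \tfrac{\mu_{\text{maj}} + \mu_{\text{min}}}{2} + \tfrac{\sigma^2}{\mu_{\text{min}} - \mu_{\text{maj}}} \ln\tfrac{1-\pi}{\pi}$. The means, variance, and function name below are illustrative choices, not values from the text.

```python
import math

def boundary_and_recall(pi, mu_maj=0.0, mu_min=2.0, sigma=1.0):
    """Bayes boundary (where P(minority | x) = 0.5) and the minority recall it yields."""
    midpoint = 0.5 * (mu_maj + mu_min)
    shift = sigma**2 / (mu_min - mu_maj) * math.log((1 - pi) / pi)
    x_star = midpoint + shift                      # predict minority when x > x_star
    z = (x_star - mu_min) / sigma
    recall = 0.5 * math.erfc(z / math.sqrt(2))     # P(x > x_star | minority)
    return x_star, recall

for pi in (0.5, 0.05, 0.001):
    x_star, recall = boundary_and_recall(pi)
    print(f"pi={pi:<6} boundary x*={x_star:6.3f}  minority recall={recall:.3f}")
```

At $\pi = 0.5$ the boundary sits midway between the class means. As $\pi$ shrinks, the log-odds term pushes it past the minority mean itself, and recall falls accordingly.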

The instrument below makes this concrete. Dial the minority ratio down and watch accuracy march toward 100% while recall on the rare class collapses — the model has stopped learning the thing you built it for.

INSTRUMENT D5.1 · IMBALANCE PLAYGROUND · TWO GAUSSIAN CLOUDS · LOGISTIC FIT · EQ D5.1

[Interactive figure: training-data scatter with the fitted boundary. Control: minority share π (default 5.0%). Readouts: accuracy, recall (minority), majority-baseline accuracy.]

Mint = minority (the class that matters), blue = majority; the white line is the fitted boundary. Drag π toward 0.5%: accuracy climbs past 99% as the boundary swallows the minority cloud and recall craters — the model is acing the wrong test. Now switch to OVERSAMPLE or UNDERSAMPLE and watch the boundary swing back to bisect the clouds: accuracy dips, recall jumps. Rebalancing trades a meaningless metric for a meaningful one.

5.2

Resampling — random over- and under-sampling

The simplest cure operates on the data, before any model sees it: change the class proportions so the loss can no longer ignore the minority. Two opposite moves achieve the same balanced ratio.

Random over-sampling (ROS). Duplicate minority examples (sampling with replacement) until the classes match. Keeps all majority information, but the copies are exact — the model can memorize them, inflating training scores and inviting overfitting to the few real minority points.
Random under-sampling (RUS). Discard majority examples until the classes match. Fast, light, and a strong baseline — but it throws away potentially useful majority data, which hurts when the majority class is itself varied or the dataset is small.

To reach a target minority share $\rho$ (with $\rho = 0.5$ meaning a balanced 1:1 set) by over-sampling, the minority class must be grown to match. The arithmetic is worth internalizing because every resampling library is doing exactly this under the hood:

EQ D5.2 — RESAMPLING TO A TARGET RATIO $$ N_{\text{min}}^{\text{target}} \;=\; \frac{\rho}{1-\rho}\, N_{\text{maj}}, \qquad \text{(1:1 balance)} \;\;\rho = \tfrac12 \;\Rightarrow\; N_{\text{min}}^{\text{target}} = N_{\text{maj}} $$

$\rho$ is the desired minority fraction of the resampled set. Over-sampling duplicates the minority up to $N_{\text{min}}^{\text{target}}$ (total grows); under-sampling instead cuts the majority down to $\frac{1-\rho}{\rho} N_{\text{min}}$ (total shrinks). Cardinal rule: resample the training fold only. Touching validation or test data — or resampling before the train/test split — leaks information and manufactures fictional scores. The held-out set must keep the real-world prevalence $\pi$, because that is the distribution your model will actually face.
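The ordering matters more than the resampling method, so it is worth seeing once in code. The sketch below uses synthetic data and a plain shuffled 80/20 split (both are assumptions for illustration; a stratified split is a common refinement). It splits first and then over-samples only the training indices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 rows, 5% minority, 2 features.
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 2)) + 2.4 * y[:, None]

# 1) Split FIRST, before any resampling.
perm = rng.permutation(len(y))
test_idx, train_idx = perm[:200], perm[200:]

# 2) Resample the TRAINING indices only: duplicate minority rows up to the majority count.
train_min = train_idx[y[train_idx] == 1]
train_maj = train_idx[y[train_idx] == 0]
extra = rng.choice(train_min, size=len(train_maj) - len(train_min), replace=True)
train_bal = np.concatenate([train_idx, extra])

X_train, y_train = X[train_bal], y[train_bal]
X_test, y_test = X[test_idx], y[test_idx]      # untouched: keeps the real prevalence

print(f"train minority share: {y_train.mean():.3f}")
print(f"test  minority share: {y_test.mean():.3f}")
```

Because the duplicates are drawn from `train_idx` only, no test row (or copy of one) can appear in the training fold, and the test fold keeps roughly the original prevalence.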

A training fold has 50 minority and 950 majority examples. You over-sample the minority up to 950 (a 1:1 balance). What is the minority's share of the resampled set? (Give a decimal.)

After over-sampling, minority $= 950$ and majority $= 950$, so the new total is $950 + 950 = 1900$. Minority share $= 950 / 1900 = $ 0.5. Note the original prevalence was $50/1000 = 0.05$; over-sampling to 1:1 has moved it to exactly one half — by construction (EQ D5.2 with $\rho = \tfrac12$).
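For targets other than 1:1, EQ D5.2 is easy to wrap in a helper. The function name and the rounding to whole rows are assumptions made for this sketch; libraries may round differently.

```python
def resample_targets(n_min, n_maj, rho):
    """Counts needed to reach minority share rho (EQ D5.2).

    Returns (minority count after over-sampling, majority count after under-sampling).
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    n_min_target = round(rho / (1.0 - rho) * n_maj)    # grow the minority
    n_maj_target = round((1.0 - rho) / rho * n_min)    # or shrink the majority
    return n_min_target, n_maj_target

over, under = resample_targets(n_min=50, n_maj=950, rho=0.5)
print(f"over-sample minority to {over}, or under-sample majority to {under}")
```

With $\rho = 0.5$ this reproduces the worked example: grow the minority to 950, or cut the majority to 50.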

Resampling does not add information. Duplicating a point tells the model nothing it did not already know; it only reweights how loudly that point speaks in the loss — which is mathematically close to the class-weighting of §5.4. The honest framing: resampling and reweighting both move the effective prevalence the optimizer sees, nudging the decision threshold without changing the underlying separability of the classes. That realization is what motivates SMOTE — a way to add genuinely new minority points instead of mere copies.
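The equivalence between duplicating and reweighting can be verified numerically. The sketch below uses arbitrary stand-in predictions and a deterministic duplication (every minority row repeated 19 times, so 50 becomes 950). Random over-sampling matches this only in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 950 + [1] * 50)
p = rng.uniform(0.05, 0.95, size=len(y))      # hypothetical predicted P(y=1)

# Per-example cross-entropy.
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# (a) Over-sample: repeat every minority row 19 times (50 * 19 = 950).
reps = np.where(y == 1, 19, 1)
loss_oversampled = np.repeat(loss, reps).mean()

# (b) Reweight: give minority rows weight 19 and normalize by the total weight.
w = reps.astype(float)
loss_weighted = np.sum(w * loss) / np.sum(w)

print(f"oversampled mean loss: {loss_oversampled:.6f}")
print(f"weighted mean loss   : {loss_weighted:.6f}")
```

The two objectives are the same number, which is why neither approach can add information the original rows did not contain.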

PYTHON · RUNNABLE IN-BROWSER

# EQ D5.2: random over- vs under-sampling to a 1:1 balance
import numpy as np
rng = np.random.default_rng(0)

# a 95:5 training fold: 950 majority (label 0), 50 minority (label 1)
maj = rng.normal(0.0, 1.0, (950, 2))
mn  = rng.normal(2.4, 1.0, (50,  2))
X = np.vstack([maj, mn]); y = np.array([0]*950 + [1]*50)
print(f"before      : maj={np.sum(y==0):4d}  min={np.sum(y==1):4d}  "
      f"min-share={np.mean(y==1):.3f}")

# random OVER-sampling: duplicate minority (with replacement) up to majority count
idx_min = np.where(y == 1)[0]
extra   = rng.choice(idx_min, size=950 - 50, replace=True)   # 900 duplicates
Xo, yo  = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
print(f"oversampled : maj={np.sum(yo==0):4d}  min={np.sum(yo==1):4d}  "
      f"min-share={np.mean(yo==1):.3f}  (rho=0.5, EQ D5.2)")

# random UNDER-sampling: keep all 50 minority, randomly keep 50 majority
keep_maj = rng.choice(np.where(y == 0)[0], size=50, replace=False)
Xu = np.vstack([X[keep_maj], X[idx_min]]); yu = np.array([0]*50 + [1]*50)
print(f"undersampled: maj={np.sum(yu==0):4d}  min={np.sum(yu==1):4d}  "
      f"min-share={np.mean(yu==1):.3f}  (kept only 100 of 1000 rows)")

Try a different rho by changing the target counts.

5.3

SMOTE & variants

Random over-sampling copies points; SMOTE, the Synthetic Minority Over-sampling Technique (Chawla et al., 2002), invents them. Instead of duplicating a minority example, it draws a brand-new point along the line segment connecting that example to one of its nearest minority neighbors. The result is a denser, smoother minority region rather than a stack of identical copies, which pushes the classifier to carve out broader minority territory instead of memorizing isolated dots.

EQ D5.3 — SMOTE INTERPOLATION $$ x_{\text{new}} \;=\; x_i \;+\; \lambda \,\bigl(x_{nn} - x_i\bigr), \qquad \lambda \sim \mathcal{U}(0,1), \qquad x_{nn} \in \text{kNN}_{\text{min}}(x_i) $$

$x_i$ is a minority example; $x_{nn}$ is one of its $k$ nearest minority neighbors (typically $k = 5$), chosen at random; $\lambda$ is a uniform random step along the segment between them. $\lambda = 0$ returns $x_i$; $\lambda = 1$ lands on the neighbor; in between you get a convex blend — a new, plausible minority point. The synthetic point lives inside the convex hull of the minority class, never extrapolating outside it. Caveat: in regions where minority and majority overlap, SMOTE happily interpolates across the gap and plants synthetic points in majority territory — which is exactly what the Borderline and ADASYN variants try to fix.

SMOTE picks a minority point $x_i = 2$, a neighbor $x_{nn} = 6$, and draws $\lambda = 0.25$. By EQ D5.3, what is the synthetic point $x_{\text{new}}$?

$x_{\text{new}} = x_i + \lambda(x_{nn} - x_i) = 2 + 0.25\,(6 - 2) = 2 + 0.25\times 4 = 2 + 1 = $ 3. The new point lies a quarter of the way from $x_i$ toward its neighbor — inside the segment, never beyond it.
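The same step works unchanged on feature vectors. A minimal sketch of the interpolation alone follows (the helper name is illustrative; neighbor search and the random draw of $\lambda$ are left out).

```python
import numpy as np

def smote_point(x_i, x_nn, lam):
    """One SMOTE interpolation step (EQ D5.3) for scalars or feature vectors."""
    x_i = np.asarray(x_i, dtype=float)
    x_nn = np.asarray(x_nn, dtype=float)
    return x_i + lam * (x_nn - x_i)

print(smote_point(2.0, 6.0, 0.25))                 # the worked 1-D example
print(smote_point([1.0, 1.0], [3.0, 5.0], 0.5))    # midpoint of a 2-D pair
```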

Plain SMOTE treats every minority point equally. Its two most-used descendants spend their synthetic budget where it helps most — near the decision boundary, where errors actually happen:

Variant	Where it synthesizes	Intuition
SMOTE	uniformly across all minority points	Densifies the whole minority region; simple, strong default.
Borderline-SMOTE	only from minority points near the boundary	A point is "in danger" if at least half, but not all, of its nearest neighbors are majority; reinforce exactly those frontier cases.
ADASYN	more for minority points that are harder to learn	Generate inversely to local density — pour synthetic mass where the minority is most outnumbered.
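The "where it synthesizes" column comes down to counting majority neighbors around each minority point. Below is a sketch of the Borderline-SMOTE labeling step as described by Han et al. (2005): with $m$ nearest neighbors taken over all classes, a minority point is noise if all $m$ are majority, in danger if at least half are, and safe otherwise. The function name and the toy 1-D data are assumptions for illustration. Only the labeling is shown, not the synthesis that follows it.

```python
import numpy as np

def borderline_labels(X, y, m=5):
    """Label each minority point 'safe', 'danger', or 'noise' from its m nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = {}
    for i in np.where(y == 1)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # exclude the point itself
        nn = np.argsort(d)[:m]                 # m nearest neighbors, any class
        n_maj = int(np.sum(y[nn] == 0))
        if n_maj == m:
            labels[int(i)] = "noise"           # entirely surrounded by majority
        elif n_maj >= m / 2:
            labels[int(i)] = "danger"          # frontier: synthesize from these
        else:
            labels[int(i)] = "safe"
    return labels

# Toy 1-D layout: majority on the left, minority on the right, one stray minority point.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [1.5], [4.4], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(borderline_labels(X, y, m=3))
```

ADASYN uses the same neighbor count differently: rather than a hard label, each minority point's share of the synthetic budget grows with its fraction of majority neighbors.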

Honest caveats. SMOTE assumes the space between two minority neighbors is itself minority — true for smooth, continuous features, false for categorical ones (use SMOTE-NC) and shaky in high dimensions, where "near neighbor" loses meaning and interpolation can land in nonsense regions. It can amplify noise (a mislabeled minority point spawns a cluster of synthetic noise) and, by design, blurs the boundary in overlapping classes. Modern practice often pairs it with a cleaning step — SMOTE-Tomek or SMOTE-ENN remove the majority points SMOTE's new neighbors now contradict. And on large deep-learning problems, loss-level fixes (§5.4) frequently beat resampling outright. SMOTE is a sharp tool, not a magic wand.

INSTRUMENT D5.2 · SMOTE VISUALIZER · EQ D5.3 · k-NN INTERPOLATION · SEEDED MINORITY

[Interactive figure. Controls: neighbors k (default 5), synthetic points (default 60). Readouts: real minority count, synthetic (SMOTE) count, effective minority share.]

Solid mint dots are the 14 real minority examples; faint blue dots are majority; hollow mint dots are synthetic points, each drawn on a segment between a real minority point and one of its k neighbors (the thin connecting line shows the parent pair). Raise k and the synthetic cloud reaches farther between sub-clusters; raise the count and the minority region fills in. Watch the effective minority share climb toward balance — without a single duplicated point.

PYTHON · RUNNABLE IN-BROWSER

# SMOTE in pure numpy: interpolate between minority k-NN (EQ D5.3)
import numpy as np
rng = np.random.default_rng(1)

# a 90:10 fold: 90 majority, 10 minority, 2 features
maj = rng.normal(0.0, 1.0, (90, 2))
mn  = rng.normal(2.6, 0.7, (10, 2))
X, y = np.vstack([maj, mn]), np.array([0]*90 + [1]*10)
P = X[y == 1]                                   # minority points only

def smote(P, n_new, k=5):
    out = []
    D = np.sqrt(((P[:, None] - P[None]) ** 2).sum(-1))   # pairwise distances
    for _ in range(n_new):
        i  = rng.integers(len(P))                # a random minority point
        nn = np.argsort(D[i])[1:k+1]             # its k nearest minority neighbors
        j  = nn[rng.integers(len(nn))]           # pick one neighbor
        lam = rng.random()                       # lambda ~ U(0,1)
        out.append(P[i] + lam * (P[j] - P[i]))   # the interpolated synthetic point
    return np.array(out)

S = smote(P, n_new=80, k=5)
before = y.mean()
after  = (y.sum() + len(S)) / (len(y) + len(S))
print(f"minority before SMOTE : {y.sum():2d} / {len(y)}  = {before:.3f}")
print(f"synthetic generated   : {len(S)}")
print(f"minority after SMOTE  : {y.sum()+len(S):2d} / {len(y)+len(S)} = {after:.3f}")
inside = bool((S.min(0) >= P.min(0)).all() and (S.max(0) <= P.max(0)).all())
print("every synthetic point sits inside the real-minority box:", inside)
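# plot_scatter is assumed to be supplied by the in-browser runtime; skip the plot elsewhere
if "plot_scatter" in globals():
    plot_scatter(np.r_[X[:,0], S[:,0]], np.r_[X[:,1], S[:,1]],
                 np.r_[y, np.full(len(S), 2)])  # 0 maj, 1 real-min, 2 synthetic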

Try setting k=1 (nearest only) or pushing n_new to 200.

5.4

Algorithm-level fixes — class weights, focal loss, threshold moving

Resampling rewrites the data; the alternative is to leave the data alone and rewrite the objective. Three loss- and decision-level levers do this without touching a single row.

Class weights (cost-sensitive learning)

Scale each example's contribution to the loss by a class-dependent weight, so a minority mistake costs more than a majority one. The standard inverse-frequency weighting sets each class's per-example weight inversely proportional to its count, so the rarer the class, the heavier each of its examples:

EQ D5.4 — WEIGHTED CROSS-ENTROPY $$ \mathcal{L} \;=\; -\frac{1}{N}\sum_{i=1}^{N} w_{y_i}\,\log p_{i,\,y_i}, \qquad w_c \;=\; \frac{N}{C\,N_c} $$

$w_c$ is the weight for class $c$, $N_c$ its count, $C$ the number of classes; the formula (scikit-learn's class_weight="balanced") makes each class contribute equally to the total loss in expectation. A class $10\times$ rarer gets $\sim\!10\times$ the per-example weight. This is the loss-level twin of over-sampling — both inflate the minority's voice in the gradient — but it adds no rows and no duplicates, so it is cheaper and overfits less. It moves the effective decision threshold toward the minority class, trading precision for recall.

A binary problem has $N = 1000$ examples: 950 majority and 50 minority ($C = 2$ classes). Using balanced weighting $w_c = N/(C\,N_c)$, what weight does the minority class receive?

$w_{\text{min}} = \dfrac{N}{C\,N_{\text{min}}} = \dfrac{1000}{2 \times 50} = \dfrac{1000}{100} = $ 10. Each minority example counts ten times as heavily in the loss as it would unweighted — and the majority gets $1000/(2\times 950) \approx 0.53$, so the two classes contribute equally overall.
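The balanced weights are quick to compute from label counts. This sketch assumes integer labels `0..C-1` with every class present at least once (the helper name is illustrative).

```python
import numpy as np

def balanced_class_weights(y):
    """w_c = N / (C * N_c) for integer labels 0..C-1 (EQ D5.4)."""
    counts = np.bincount(y)                      # N_c for each class
    return len(y) / (len(counts) * counts)

y = np.array([0] * 950 + [1] * 50)
w = balanced_class_weights(y)
print("weight per class       :", np.round(w, 3))
print("total weight per class :", w * np.bincount(y))
```

Multiplying each weight by its class count gives the same total for both classes, which is the "equal contribution" property in concrete form.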

Focal loss

Class weights up-weight a whole class; focal loss (Lin et al., 2017, for dense object detection) up-weights the hard examples within it — the ones the model still gets wrong — and lets the easy, already-correct majority examples fade from the gradient automatically:

EQ D5.5 — FOCAL LOSS $$ \mathrm{FL}(p_t) \;=\; -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t, \qquad p_t \;=\; \begin{cases} p & y = 1 \\ 1 - p & y = 0 \end{cases} $$

$p_t$ is the probability assigned to the true class; $\alpha_t$ is an optional class weight as in EQ D5.4; $\gamma \ge 0$ is the focusing parameter. The modulating factor $(1-p_t)^{\gamma}$ is the whole idea: for a well-classified example ($p_t \to 1$) it $\to 0$, nearly deleting that example's gradient; for a hard one ($p_t$ small) it stays near 1. At $\gamma = 0$ focal loss reduces to ($\alpha_t$-weighted) cross-entropy; the paper's reported best setting was $\gamma = 2$ with $\alpha = 0.25$. The effect: a flood of easy majority examples no longer drowns out the rare, hard minority ones, so imbalance is handled inside the loss, with no resampling required.

An easy example is classified with $p_t = 0.9$. Using focal loss with $\gamma = 2$, what is the modulating factor $(1 - p_t)^{\gamma}$ that scales its loss?

$(1 - p_t)^{\gamma} = (1 - 0.9)^2 = (0.1)^2 = $ 0.01. This easy example's contribution to the loss is cut to 1% of its cross-entropy value — so the gradient budget flows to the hard cases instead. A hard example at $p_t = 0.1$ keeps a factor of $(0.9)^2 = 0.81$, almost untouched.
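Both modulating factors from the worked example fall out of a direct implementation of EQ D5.5. This is a plain NumPy sketch for the binary case with `alpha_t` fixed to 1 by default. It does not clip probabilities, so it assumes $0 < p < 1$.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha_t=1.0):
    """Per-example binary focal loss (EQ D5.5); alpha_t=1 disables class weighting."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)           # probability given to the true class
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.1])      # one easy positive, one hard positive
y = np.array([1, 1])
ce = focal_loss(p, y, gamma=0.0)                 # gamma = 0 is plain cross-entropy
fl = focal_loss(p, y, gamma=2.0)
print("cross-entropy:", np.round(ce, 4))
print("focal (g=2)  :", np.round(fl, 4))
print("ratio fl/ce  :", np.round(fl / ce, 2))
```

The ratio row is the modulating factor itself: the easy example keeps 1% of its cross-entropy loss, the hard one keeps 81%.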

Threshold moving

The cheapest fix of all changes nothing about training. A probabilistic classifier outputs $p = P(y=1 \mid x)$; the default rule "predict positive if $p > 0.5$" is a convention, not a law. Under imbalance — or under asymmetric costs, where a missed fraud dwarfs a false alarm — the optimal cut sits elsewhere. Sweep the threshold $\tau$ and you trace the entire precision/recall trade-off from a single trained model:

EQ D5.6 — COST-OPTIMAL THRESHOLD $$ \hat{y} = \mathbb{1}\!\left[\,p > \tau\,\right], \qquad \tau^{\star} \;=\; \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}} \quad \text{(Bayes-optimal cut for costs } C_{\text{FP}},\, C_{\text{FN}}) $$

Lower $\tau$ below 0.5 to catch more positives (recall ↑, precision ↓); raise it to flag only the confident ones (precision ↑, recall ↓). Provided $p$ is a calibrated probability, the Bayes-optimal $\tau^{\star}$ depends only on the relative cost of a false positive versus a false negative: if missing a fraud is 9× costlier than a false alarm ($C_{\text{FN}} = 9, C_{\text{FP}} = 1$), then $\tau^{\star} = 1/(1+9) = 0.1$, so flag anything over 10% probability. The derivation is one line: flagging costs $(1-p)\,C_{\text{FP}}$ in expectation, staying silent costs $p\,C_{\text{FN}}$, and the two are equal at $p = \tau^{\star}$. Threshold moving and proper probability calibration together often recover most of what resampling promised, with none of its risks.
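EQ D5.6 and the expected-cost comparison behind it can be put side by side. The function names below are illustrative, and the sketch assumes `p` is a calibrated probability of the positive class.

```python
def cost_optimal_threshold(c_fp, c_fn):
    """tau* = C_FP / (C_FP + C_FN) from EQ D5.6."""
    return c_fp / (c_fp + c_fn)

def cheaper_to_flag(p, c_fp, c_fn):
    """True when flagging has lower expected cost than staying silent."""
    cost_if_flag = (1.0 - p) * c_fp      # pay C_FP when the case is actually negative
    cost_if_clear = p * c_fn             # pay C_FN when the case is actually positive
    return cost_if_flag < cost_if_clear

tau = cost_optimal_threshold(c_fp=1.0, c_fn=9.0)
print(f"tau* = {tau:.2f}")
for p in (0.05, 0.15, 0.50):
    print(f"p={p:.2f}  p > tau*: {p > tau}  cheaper to flag: {cheaper_to_flag(p, 1.0, 9.0)}")
```

The two columns agree at every probability, since the threshold rule is just the expected-cost comparison solved for $p$.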

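INSTRUMENT D5.3 · THRESHOLD & COST EXPLORER · EQ D5.6 · 1000 SCORED CASES · 95:5 PREVALENCE

[Interactive figure: precision and recall curves against the threshold. Controls: threshold τ (default 0.50), cost of a miss, FN (default 10×). Readouts: precision, recall, total cost (FP + c·FN).]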

The two curves are precision (mint) and recall (blue) as the threshold sweeps; the white line is your current τ. The dashed mint marker is the cost-optimal cut $\tau^{\star} = 1/(1 + c)$ from EQ D5.6 (with $C_{\text{FP}} = 1$ and $C_{\text{FN}} = c$). Slide τ left and recall rises while precision falls; raise the miss-cost c and watch $\tau^{\star}$ march left: when a miss costs 10× a false alarm, the optimal threshold drops to $1/11 \approx 0.09$. The "TOTAL COST" readout is minimized near that marker, not at 0.5.

PYTHON · RUNNABLE IN-BROWSER

# Accuracy lies; recall/precision trade off as the threshold moves (99:1)
import numpy as np
rng = np.random.default_rng(3)

n = 10000; n_pos = 100                            # 1% prevalence -> 99:1
# simulate classifier scores in [0, 1]: positives skew high, negatives skew low
# (illustrative scores only; they are not calibrated probabilities)
s_pos = rng.beta(5, 2, n_pos)       # scores of the actual positives
s_neg = rng.beta(2, 6, n - n_pos)   # scores of the actual negatives
score = np.r_[s_pos, s_neg]
y     = np.r_[np.ones(n_pos), np.zeros(n-n_pos)].astype(int)

def report(tau):
    yhat = (score > tau).astype(int)
    tp = int(((yhat==1)&(y==1)).sum()); fp = int(((yhat==1)&(y==0)).sum())
    fn = int(((yhat==0)&(y==1)).sum()); tn = int(((yhat==0)&(y==0)).sum())
    acc  = (tp+tn)/n
    prec = tp/(tp+fp) if tp+fp else float('nan')
    rec  = tp/(tp+fn) if tp+fn else 0.0
    return acc, prec, rec, tp, fp, fn

print(" tau    acc    prec   recall   TP   FP   FN")
for tau in (0.5, 0.3, 0.1):
    acc, prec, rec, tp, fp, fn = report(tau)
    print(f"{tau:.2f}  {acc:.4f}  {prec:.3f}  {rec:.3f}  {tp:4d} {fp:4d} {fn:4d}")
print("\nalways-predict-negative: acc =", round((n-n_pos)/n, 4),
      "  recall = 0.0  (caught nothing)")
print("dropping tau 0.5 -> 0.1 trades precision for the recall you actually need.")

Try adding tau=0.05, or changing n_pos to make it 99.9:0.1.

5.5

Evaluating under imbalance — PR curves, the right metric

Every prediction lands in one of four cells of the confusion matrix, and every honest metric is built from them:

CONFUSION MATRIX	PREDICTED + (alarm)	PREDICTED − (clear)
ACTUAL + (rare)	TP · caught it	FN · a miss
ACTUAL − (common)	FP · false alarm	TN · correct all-clear

From these, two questions — and they are genuinely different questions:

EQ D5.7 — PRECISION, RECALL, F1 $$ \text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad F_1 = \frac{2\,\text{P}\,\text{R}}{\text{P}+\text{R}} $$

Precision = of everything you flagged, how much was real (the false-alarm tax). Recall = of everything real, how much you caught (the miss rate's complement). Crucially, neither uses TN — so the giant pile of easy true negatives that inflates accuracy simply cannot rig these numbers. $F_1$ is their harmonic mean, harsh on any large gap between the two. When false-negative and false-positive costs differ, use the weighted $F_\beta$ (β > 1 favors recall) instead of $F_1$.

On a 1000-row test set with 10 true positives, a model flags all 10 ($\mathrm{TP}=10$, $\mathrm{FN}=0$) plus 90 negatives by mistake ($\mathrm{FP}=90$). What is its precision?

Precision $= \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} = \dfrac{10}{10 + 90} = \dfrac{10}{100} = $ 0.1. Recall is a perfect $10/10 = 1.0$, yet 9 of every 10 alarms are false — the classic rare-event ambush, and exactly the trade-off the threshold of §5.4 controls.
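EQ D5.7 translates directly into a small helper. Returning 0.0 when a denominator is empty is a convention chosen for this sketch; some libraries warn or return NaN instead.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion counts (EQ D5.7)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

prec, rec, f1 = precision_recall_f1(tp=10, fp=90, fn=0)
print(f"precision={prec:.3f}  recall={rec:.3f}  F1={f1:.3f}")
```

Note how the harmonic mean sides with the weaker number: perfect recall cannot lift $F_1$ far above the 0.1 precision.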

Sweeping the threshold turns these point metrics into curves. Two summaries dominate, and the choice between them is the single most important evaluation decision under imbalance:

ROC curve (TPR vs. FPR) and its area, ROC-AUC. Because FPR = FP/(FP+TN) has the huge TN count in its denominator, ROC is insensitive to prevalence — which sounds like a virtue but is the opposite here. On a 99:1 problem a model can post a flattering 0.95 ROC-AUC while its precision is dismal, because thousands of false positives barely dent the FPR.
Precision–Recall curve and its area, PR-AUC (a.k.a. average precision). Precision does feel every false positive directly, so the PR curve exposes exactly the failure ROC hides. On imbalanced problems, prefer PR-AUC.
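The contrast between the two summaries shows up when the same score distributions are evaluated at two prevalences. The sketch below implements both areas in NumPy under stated assumptions: scores are simulated from two Beta distributions (an arbitrary choice), ROC-AUC uses the rank-sum identity and assumes no tied scores, and PR-AUC is computed as average precision.

```python
import numpy as np

def roc_auc(y, score):
    """ROC-AUC via the rank-sum (Mann-Whitney) identity; assumes no tied scores."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y, score):
    """Mean of precision@k, taken at the rank k of each positive."""
    order = np.argsort(-score)                   # highest score first
    hits = y[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(y) + 1)
    return float(precision_at_k[hits == 1].mean())

rng = np.random.default_rng(0)
results = {}
for name, n_pos, n_neg in (("balanced", 5000, 5000), ("imbalanced", 100, 9900)):
    score = np.r_[rng.beta(5, 2, n_pos), rng.beta(2, 6, n_neg)]
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
    results[name] = (roc_auc(y, score), average_precision(y, score))
    print(f"{name:10s} ROC-AUC={results[name][0]:.3f}  AP={results[name][1]:.3f}")
```

The score distributions per class are identical in both runs, so ROC-AUC barely moves, while average precision drops sharply once positives become rare.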

The base-rate ambush, in numbers. Screen 10,000 people for a condition with 1% prevalence (100 positives). A genuinely good test (90% recall, 8% false-positive rate) catches 90 of the 100 cases but also flags 8% of the 9,900 healthy people, which is 792 false alarms. Precision is $90 / (90 + 792) \approx 10.2\%$: roughly nine of every ten alarms are wrong, even though the test catches 90% of the real cases. No amount of resampling fixes this; it is the prevalence speaking. The defenses are honest metrics (PR-AUC, precision at fixed recall), explicit cost modeling, and a calibrated threshold.

QUANTITY	VALUE	NOTE
Screened	10,000	prevalence 1% → 100 actually positive
Recall 90%	90 TP	10 real cases slip through (FN)
FP rate 8%	792 FP	8% of 9,900 healthy people flagged
Precision	10.2%	90 of 882 alarms are real
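The screening arithmetic generalizes to any prevalence, recall, and false-positive rate. A small sketch (the helper name is illustrative) reproduces the numbers above and shows how strongly precision depends on the base rate.

```python
def precision_from_rates(prevalence, recall, fpr, n=10_000):
    """Expected TP, FP and precision when screening n people."""
    positives = prevalence * n
    negatives = n - positives
    tp = recall * positives          # real cases caught
    fp = fpr * negatives             # healthy people flagged
    return tp, fp, tp / (tp + fp)

tp, fp, precision = precision_from_rates(prevalence=0.01, recall=0.90, fpr=0.08)
print(f"prevalence  1%: TP={tp:.0f}  FP={fp:.0f}  precision={precision:.3f}")

tp10, fp10, precision10 = precision_from_rates(prevalence=0.10, recall=0.90, fpr=0.08)
print(f"prevalence 10%: TP={tp10:.0f}  FP={fp10:.0f}  precision={precision10:.3f}")
```

The test itself is unchanged between the two lines; only the prevalence differs, and precision moves from about 10% to over 50%.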

Beyond curves, two more metrics earn their place: balanced accuracy (the mean of recall on each class — the right "accuracy" when you must report one number) and Matthews correlation coefficient (MCC), a single value in $[-1, 1]$ that uses all four confusion cells and stays honest across any prevalence. Whatever you choose, the iron rule from §5.2 holds: measure on data at the real prevalence. Resample to train; never resample to evaluate.
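Both metrics are short formulas over the four confusion cells. The sketch below evaluates them on two illustrative 1000-row cases: a model that flags all 10 positives plus 90 negatives, and the always-negative baseline. Returning 0.0 for MCC when a row or column of the matrix is empty is a common convention, adopted here as an assumption.

```python
import math

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of recall on the positive class and recall on the negative class."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (tpr + tnr)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Over-eager model: TP=10, FP=90, FN=0, TN=900.
print(f"accuracy          = {(10 + 900) / 1000:.3f}")
print(f"balanced accuracy = {balanced_accuracy(10, 90, 0, 900):.3f}")
print(f"MCC               = {mcc(10, 90, 0, 900):.3f}")

# Always-negative baseline: TP=0, FP=0, FN=10, TN=990.
print(f"baseline accuracy = {990 / 1000:.3f}")
print(f"baseline bal. acc = {balanced_accuracy(0, 0, 10, 990):.3f}")
print(f"baseline MCC      = {mcc(0, 0, 10, 990):.3f}")
```

The baseline's 0.99 accuracy collapses to 0.5 balanced accuracy and 0 MCC, while the over-eager model's MCC stays modest because most of its alarms are false.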

PITFALLS

The four classic imbalance mistakes: (1) reporting accuracy — it grades the base rate, not the model; (2) resampling before the train/test split, leaking synthetic minority points into the test fold and inventing scores; (3) trusting ROC-AUC on a 99:1 problem while precision quietly collapses; (4) shipping the default $\tau = 0.5$ when your costs are asymmetric — the threshold is a free dial you forgot to turn.

You now know how to prepare and weigh data so a model learns what matters. The Machine Learning volume opens by stepping back to first principles — what it even means to learn from data, the bias–variance decomposition, and why every technique in this volume is ultimately a bet about generalization. Volume I · Chapter 01: Learning from Data.

5.R

References

Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 — the interpolation method of EQ D5.3.
He, H. & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9) — the canonical survey of resampling, cost-sensitive learning, and evaluation.
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017 (RetinaNet) — focal loss, EQ D5.5.
Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. ICIC 2005 — synthesizing only near the decision boundary (§5.3).
He, H., Bai, Y., Garcia, E. A. & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IJCNN 2008 — density-adaptive synthetic generation (§5.3).
Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. ICML 2006 — why PR-AUC, not ROC-AUC, is the metric to trust under imbalance (§5.5).
Chicco, D. & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 and Accuracy. BMC Genomics 21 — the case for MCC on imbalanced binary problems (§5.5).