04 · Trees, Forests & Neighbors

4.1

k-NN: memory as a model

The k-nearest-neighbors classifier has no training step, no parameters, and no loss function. The "model" is the training set itself, a distance function, and one integer. To classify a new point $x$: find the $k$ training points closest to it, and let them vote.

EQ M4.1 — MAJORITY VOTE OF THE k NEAREST $$ \hat{y}(x) \;=\; \operatorname*{arg\,max}_{c}\; \sum_{i \,\in\, N_k(x)} \mathbf{1}\!\left[\, y_i = c \,\right] $$

$N_k(x)$ is the set of the $k$ training points nearest to $x$ — usually under Euclidean distance — and $\mathbf{1}[\cdot]$ counts a vote when neighbor $i$ carries label $c$. All the cost moves to query time: $O(nd)$ distance computations per prediction against $n$ stored examples in $d$ dimensions. Every modeling assumption hides inside the distance function — if one feature is measured in millimeters and another in kilometers, the millimeters decide everything, which is why k-NN demands scaled features while trees (§4.2) do not.

Classify a query with $k = 7$. Its seven nearest training points carry labels $1, 0, 1, 1, 0, 1, 0$. By EQ M4.1, which class does k-NN predict (answer $0$ or $1$)?

Count the votes: class $1$ appears $4$ times, class $0$ appears $3$ times. The $\arg\max$ picks the majority, so the prediction is class 1. With odd $k$ in a two-class problem, ties are impossible.

For something this naive, k-NN is theoretically respectable: a classic result (Cover & Hart, 1967) shows that with unlimited data, the humble 1-NN rule's error is at most twice the best achievable error of any classifier. Memory is a legitimate model. And it scaled further than anyone guessed: vector search over learned embeddings — the retrieval step inside every RAG system and vector database — is k-NN run at billion-document scale. The algorithm never died; it just changed feature spaces.

CURSE

The curse of dimensionality. "Nearest" degrades as dimensions grow. In a 100-dimensional unit cube, a sub-cube that wants to capture just 1% of uniformly spread points needs edge length $0.01^{1/100} \approx 0.955$ — it must span 95% of every axis to hold 1% of the data. Worse, pairwise distances concentrate: the gap between the nearest and farthest neighbor shrinks toward nothing, and the vote becomes noise. The honest footnote: real high-dimensional data (images, text embeddings) usually lies near much lower-dimensional structure, which is why k-NN on a good embedding still works — a thread picked up in Chapter 05.

Run it from scratch. The cell below builds two overlapping Gaussian blobs, classifies with brute-force distance + vote, and scores $k=1$ against $k=15$. Note the signature of memorization: $k=1$ is perfect on the training set (every point is its own nearest neighbor) and the worst of the pair on the test set.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(7)

# two overlapping Gaussian blobs — 240 points, 160 train / 80 test
n = 120
X0 = rng.normal([-0.9, -0.6], 1.1, (n, 2))
X1 = rng.normal([ 0.9,  0.7], 1.1, (n, 2))
X = np.vstack([X0, X1]); y = np.r_[np.zeros(n, int), np.ones(n, int)]
idx = rng.permutation(2 * n)
Xtr, ytr = X[idx[:160]], y[idx[:160]]
Xte, yte = X[idx[160:]], y[idx[160:]]

def knn_predict(Xq, k):
    d = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)  # all pairwise dist²
    votes = ytr[np.argsort(d, axis=1)[:, :k]]              # labels of k nearest
    return (votes.mean(axis=1) > 0.5).astype(int)          # majority (k is odd)

for k in (1, 15):
    print(f"k={k:2d}   train acc = {(knn_predict(Xtr, k) == ytr).mean():.3f}"
          f"   test acc = {(knn_predict(Xte, k) == yte).mean():.3f}")
plot_scatter(X[:, 0], X[:, 1], y)

edits are live — try k = 75, or shrink the blob spread to 0.5

4.2

Decision trees: greedy questions, rectangular answers

A decision tree classifies by interrogation: is feature 2 below 0.21? yes → left, no → right, repeated until a leaf, where the majority class of the training points that landed there becomes the prediction. Geometrically, every internal node slices the space with an axis-aligned cut, so every leaf is a rectangle (a box, in higher dimensions). The model is a partition.

Which question to ask first? Finding the globally optimal tree is NP-complete (Hyafil & Rivest, 1976), so CART — the algorithm running live in Instrument M4.1 — is greedy: at each node, try every feature and every threshold, keep the single split that most purifies the two children, and recurse. Purity is measured by Gini impurity or entropy:

EQ M4.2 — GINI IMPURITY AND SPLIT GAIN $$ G(S) \;=\; 1 - \sum_{c} p_c^{\,2}, \qquad \Delta \;=\; G(S) \;-\; \frac{|S_L|}{|S|}\,G(S_L) \;-\; \frac{|S_R|}{|S|}\,G(S_R) $$

$p_c$ is the fraction of class $c$ among the samples $S$ at the node; $G$ is the probability that two random draws from the node disagree — 0 when pure, 0.5 at a two-class coin flip. The split $s$ sends samples to children $S_L, S_R$, and CART picks the feature–threshold pair maximizing the gain $\Delta$. The entropy alternative $H(S) = -\sum_c p_c \log_2 p_c$ ("information gain") almost never produces a different tree — Gini wins on speed, not principle.

A tree node holds 8 samples: 6 of class ● and 2 of class ○. What is its Gini impurity $G = 1 - \sum_c p_c^2$ (EQ M4.2)?

Fractions: $p_● = 6/8 = 0.75$, $p_○ = 2/8 = 0.25$. $G = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 1 - 0.625 = $ 0.375. Less impure than the 0.5 of a 50/50 node, but not yet pure.

A parent node of 8 samples (4 ●, 4 ○, so $G = 0.5$) is split into a left child of 4 (3 ●, 1 ○) and a right child of 4 (1 ●, 3 ○). Each child has Gini $0.375$. What is the split gain $\Delta$ (EQ M4.2)?

$\Delta = G(S) - \tfrac{|S_L|}{|S|}G(S_L) - \tfrac{|S_R|}{|S|}G(S_R) = 0.5 - \tfrac{4}{8}(0.375) - \tfrac{4}{8}(0.375) = 0.5 - 0.1875 - 0.1875 = $ 0.125. A modest, balanced improvement — CART would still prefer this over any split that purifies less.

FIG M4.AA DEPTH-2 TREE IS A PARTITION INTO THREE RECTANGLES

Same object, two views. The tree (left) and the partition (right) are identical. Note that class ● already owns a non-convex region — two splits suffice — and that the boundaries are axis-parallel by construction: a tree can only approximate a diagonal frontier with a staircase.

Trees buy three practical superpowers that gradient-trained models lack. They are invariant to any monotone rescaling of a feature (only the ordering of values matters, so no normalization, ever); they ingest mixed numeric and categorical columns without ceremony; and a small tree is genuinely readable — you can print it as a flowchart and hand it to an auditor. Their weakness is the same as their strength: predictions are piecewise-constant, so a lone tree is jagged, unstable, and cannot extrapolate a trend beyond the data it saw.

The atomic unit is the decision stump — a depth-1 tree, one question, two answers. The cell below performs the exact inner-loop search that CART runs at every node: scan every threshold, score every split by Gini gain, keep the best. Boosting (§4.5) will assemble hundreds of barely-better-than-chance learners like this one into a precision instrument.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(3)

# 1-D labels with a true step at x = 0.35, plus 10% label noise
n = 200
x = rng.uniform(0, 1, n)
y = (x > 0.35).astype(int)
flip = rng.random(n) < 0.10
y[flip] = 1 - y[flip]

def gini(labels):
    if len(labels) == 0: return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)        # = 1 - p² - (1-p)² for two classes

parent = gini(y)
best_gain, best_t = -1.0, None
for t in np.sort(x)[1:]:          # candidate threshold between each pair
    L, R = y[x < t], y[x >= t]
    w = len(L) / n
    gain = parent - (w * gini(L) + (1 - w) * gini(R))
    if gain > best_gain:
        best_gain, best_t = gain, t

pred = (x >= best_t).astype(int)
print(f"parent Gini    : {parent:.4f}")
print(f"best threshold : x = {best_t:.4f}   (true step at 0.35)")
print(f"Gini gain      : {best_gain:.4f}")
print(f"stump accuracy : {(pred == y).mean():.3f}   (noise ceiling ≈ 0.90)")

raise the label noise to 0.3 and watch the gain — and the recovered threshold — degrade

4.3

Overfitting a tree: depth is capacity

Nothing in CART tells it when to stop. Left alone, it splits until every leaf is pure — and a tree of depth $d$ can carve up to $2^d$ rectangles, so by depth 20 it has a private box for every training point, noise included. Train accuracy climbs monotonically with depth; test accuracy rises, peaks where the tree has captured the signal, then falls as additional splits start chiseling the noise. That divergence is the whole concept of overfitting, and you can watch it happen below: the instrument fits a real CART (the exact greedy Gini search of EQ M4.2) on 160 seeded "two-moons" points, paints its decision regions, and scores it against 80 held-out test points.

INSTRUMENT M4.1 — DEPTH DIALREAL CART · REFIT LIVE IN JS · EQ M4.2

MAX DEPTH 3

ACCURACY VS DEPTH — TRAIN · TEST

TRAIN ACC

—

TEST ACC

—

GENERALIZATION GAP

—

LEAVES

—

Solid dots are training points, hollow rings are the held-out test set; the painted regions are the tree's verdict at every pixel block. Depth 1–2 underfits — one axis cut cannot bend around a moon. Depth 3–4 is honest. By depth 9 the tree is perfect on train and busy fencing off single noisy points; on this seed, test accuracy slides from 85% back to 80% while train hits 100%. The curves below the map plot the full sweep — the widening mint–blue scissors after depth ~3 is overfitting, drawn live.

Production practice caps capacity directly — maximum depth, minimum samples per leaf, minimum gain to split, or post-hoc pruning — and Chapter 06 treats this trade-off in its full generality as bias versus variance. But the idea is bigger than trees, and k-NN makes the point beautifully because its capacity dial runs backwards: small $k$ means high capacity. At $k=1$ the model memorizes — every training point wins its own private island of space — while large $k$ averages over wide neighborhoods and the boundary relaxes into something smooth. Same data below; watch the islands dissolve.

INSTRUMENT M4.2 — k-NN k-SLIDERBRUTE-FORCE VOTE · SAME DATA · EQ M4.1

NEIGHBORS k 7

TRAIN ACC (SELF-VOTE)

—

TEST ACC

—

REGIME

—

Every painted block is a genuine brute-force vote over all 160 training points. At k = 1 train accuracy reads 100% — each point votes for itself, which is exactly how memorization flatters itself — yet test accuracy is the worst on the dial (81% on this seed vs 89% at k = 7). Slide right and the speckled islands vanish; push to k = 31 and watch the thin tails of each moon get annexed by the opposing majority — global accuracy holds steady here, but the local geometry is visibly wrong.

4.4

Ensembles I: bagging and random forests

A deep tree is a low-bias, high-variance learner: it can represent almost anything, but refit it on a slightly different sample and you get a visibly different partition. Bagging (bootstrap aggregating, Breiman 1996) exploits that instability instead of fighting it: draw $B$ bootstrap samples (sample $n$ points with replacement — each bag contains about $1 - 1/e \approx 63.2\%$ of the unique points), grow a deep, deliberately unpruned tree on each, and average their votes. Why averaging helps is one line of statistics:

EQ M4.3 — VARIANCE OF AN AVERAGE $$ \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \hat{f}_i(x) \right) \;=\; \frac{\sigma^2}{n} \qquad \text{(uncorrelated predictors)} $$

Each tree's prediction errs with variance $\sigma^2$; averaging $n$ independent errors divides the variance by $n$. The catch: trees grown on overlapping bootstrap samples are correlated, and with pairwise correlation $\rho$ the variance is $\rho\sigma^2 + \tfrac{1-\rho}{n}\sigma^2$. Averaging annihilates the second term but the $\rho\sigma^2$ floor survives no matter how many trees you add — so the entire engineering problem becomes: decorrelate the trees.

Four uncorrelated trees each predict with error variance $\sigma^2 = 0.8$. By EQ M4.3, what is the variance of their averaged prediction?

$\dfrac{\sigma^2}{n} = \dfrac{0.8}{4} = $ 0.2. Averaging four independent learners quarters the variance — the entire reason bagging works, and the reason random forests fight so hard to keep their trees uncorrelated.

A random forest is bagging plus one decorrelation trick: at every split, the tree may only consider a random subset of the features (classically $\sqrt{p}$ of $p$ for classification). Strong features stop dominating every tree, $\rho$ drops, and the variance floor drops with it. Two free gifts follow. Because each tree never saw ~37% of the data, scoring each point with only the trees that missed it yields out-of-bag error — an honest validation estimate with no held-out split. And since trees are independent, training parallelizes perfectly.

The honest ledger: random forests are astonishingly hard to break — near-default hyperparameters land within a few percent of optimal on most tabular tasks — but they pay in memory and latency (hundreds of deep trees), their predictions remain step functions, and like all trees they cannot extrapolate beyond the convex hull of what they saw: a forest trained on 2019 prices will never predict a 2026 price above the 2019 maximum.

4.5

Ensembles II: gradient boosting

Bagging builds strong learners in parallel and averages away variance. Boosting does the opposite: build weak learners — shallow trees, depth 4–8 — in sequence, each one trained on what the ensemble so far still gets wrong, attacking bias instead. After $m-1$ rounds the model is $F_{m-1}$; the next tree fits its mistakes:

EQ M4.4 — THE BOOSTING UPDATE $$ F_m(x) \;=\; F_{m-1}(x) \;+\; \eta\, h_m(x), \qquad h_m \;\text{ fit to the residuals }\; r_i = y_i - F_{m-1}(x_i) $$

For squared loss, the residuals $r_i$ literally are the errors left over. The general recipe — fit $h_m$ to the negative gradient of any differentiable loss, evaluated at the current predictions — is what makes it gradient boosting: the same descent idea as Chapter 02, except each "step" is an entire tree, taken in function space rather than parameter space. The shrinkage $\eta$ (typically 0.01–0.3) deliberately under-commits to each tree; small $\eta$ plus more rounds almost always generalizes better. Unlike bagging, boosting will overfit as rounds accumulate — early stopping on a validation set is part of the algorithm, not an optional extra.

For one input: target $y = 10$, current prediction $F_{m-1} = 6$, and the new tree fits the residual exactly ($h_m = 4$). With shrinkage $\eta = 0.3$, what residual $y - F_m$ remains after this boosting round (EQ M4.4)?

Update: $F_m = F_{m-1} + \eta\,h_m = 6 + 0.3\cdot 4 = 7.2$. New residual: $y - F_m = 10 - 7.2 = $ 2.8 — exactly $(1 - \eta) = 0.7$ times the old residual of $4$. Shrinkage means each round deliberately leaves most of the error for the next tree.

The modern implementations are the tabular kings. XGBoost (2016) added second-order gradients and explicit regularization on leaf weights; LightGBM made training fast at scale with histogram-binned splits and leaf-wise growth; CatBoost specializes in categorical columns via ordered target statistics. A decade of Kaggle tabular leaderboards is, to a first approximation, a history of these three libraries.

Property	Random forest	Gradient boosting
Trees built	in parallel, independent	in sequence, each fixing the last
Error attacked	variance (EQ M4.3)	bias (EQ M4.4)
Tree shape	deep, unpruned	shallow (depth 4–8 / 31–255 leaves)
Tuning sensitivity	low — defaults nearly optimal	moderate — η, rounds, depth interact
Overfits with more trees?	no (variance floor, never worse)	yes — early stopping required
Typical accuracy on tables	strong	strongest — the default to beat

4.6

When trees beat deep learning

This encyclopedia spends three volumes on neural networks, so honesty demands this section. On medium-sized tables — the ~1K-to-1M-row, heterogeneous-column datasets that constitute most of applied machine learning in industry — tuned gradient-boosted trees have beaten tuned deep models for most of the past decade. The careful benchmark of Grinsztajn et al. (NeurIPS 2022) compared XGBoost and random forests against MLPs, ResNets, and tabular transformers across ~45 datasets with equal tuning budgets, and the trees won outright on most of them. Three reasons survive scrutiny:

Tabular targets are irregular. Neural networks carry a smoothness prior; real-world table columns (thresholded business rules, saturation effects, encoded categories) produce jagged, discontinuous target functions that piecewise-constant trees fit natively.
Uninformative features are everywhere. A tree simply never splits on a useless column. An MLP must spend capacity learning to ignore it, and in low-data regimes it often fails to.
Axis alignment is the right prior. Table columns are individually meaningful — "age", "income" — and the correct decision boundaries really do tend to run parallel to them. The rotation invariance that serves vision models is exactly the wrong inductive bias here.

The frontier is genuinely moving, and the honest 2026 picture is contested at the edges: TabPFN-style models — transformers pre-trained on millions of synthetic tables that "fit" a new dataset by in-context learning, no gradient steps at all — now beat tuned GBDTs on many small tables (roughly ≤10K rows), as published in Nature in 2025. Deep models also win wherever the table stops being a table: text or image columns, massive row counts, transfer from pretrained representations, or the need for embeddings that feed a larger system. But the default has not changed: faced with a fresh tabular problem, the strong move is still a LightGBM or XGBoost baseline, trained in minutes on a CPU. Anything more exotic must beat that number to earn its complexity — and most of the time, it doesn't.

Every method so far was handed labels. Chapter 05 takes them away: k-means clustering stepped one Lloyd iteration at a time, PCA as organized variance-hunting, and the idea that data can describe itself — the first step toward the embeddings that power everything in Volumes II and beyond.

§