AI // ENCYCLOPEDIA / VOL I / ML FOUNDATIONS / 04 / TREES, FORESTS & NEIGHBORS INDEX NEXT: CLUSTERING & PCA →
VOLUME I — FOUNDATIONS OF ML · CHAPTER 04 / 08

Trees, Forests & Neighbors

Not every model is a curve bent by gradient descent. This chapter covers methods that keep the training data instead of compressing it away: k-NN, which is its training set plus a voting rule, and decision trees, which carve the input space into rectangles by asking greedy yes/no questions. Ensembled into random forests and gradient-boosted stacks, these remain, in 2026, the methods to beat on tabular data.

LEVELINTRO READING TIME≈ 25 MIN BUILDS ONVOL I · CH 01–03 INSTRUMENTSDEPTH DIAL · k-NN k-SLIDER
4.1

k-NN: memory as a model

The k-nearest-neighbors classifier has no training step, no parameters, and no loss function. The "model" is the training set itself, a distance function, and one integer. To classify a new point \(x\): find the \(k\) training points closest to it, and let them vote.

EQ M4.1 — MAJORITY VOTE OF THE k NEAREST $$ \hat{y}(x) \;=\; \operatorname*{arg\,max}_{c}\; \sum_{i \,\in\, N_k(x)} \mathbf{1}\!\left[\, y_i = c \,\right] $$
\(N_k(x)\) is the set of the \(k\) training points nearest to \(x\) — usually under Euclidean distance — and \(\mathbf{1}[\cdot]\) counts a vote when neighbor \(i\) carries label \(c\). All the cost moves to query time: \(O(nd)\) distance computations per prediction against \(n\) stored examples in \(d\) dimensions. Every modeling assumption hides inside the distance function — if one feature is measured in millimeters and another in kilometers, the millimeters decide everything, which is why k-NN demands scaled features while trees (§4.2) do not.
Classify a query with \(k = 7\). Its seven nearest training points carry labels \(1, 0, 1, 1, 0, 1, 0\). By EQ M4.1, which class does k-NN predict (answer \(0\) or \(1\))?
Count the votes: class \(1\) appears \(4\) times, class \(0\) appears \(3\) times. The \(\arg\max\) picks the majority, so the prediction is class 1. With odd \(k\) in a two-class problem, ties are impossible.

For something this naive, k-NN is theoretically respectable: a classic result (Cover & Hart, 1967) shows that with unlimited data, the humble 1-NN rule's error is at most twice the best achievable error of any classifier. Memory is a legitimate model. And it scaled further than anyone guessed: vector search over learned embeddings — the retrieval step inside every RAG system and vector database — is k-NN run at billion-document scale. The algorithm never died; it just changed feature spaces.

CURSE

The curse of dimensionality. "Nearest" degrades as dimensions grow. In a 100-dimensional unit cube, a sub-cube that wants to capture just 1% of uniformly spread points needs edge length \(0.01^{1/100} \approx 0.955\) — it must span 95% of every axis to hold 1% of the data. Worse, pairwise distances concentrate: the gap between the nearest and farthest neighbor shrinks toward nothing, and the vote becomes noise. The honest footnote: real high-dimensional data (images, text embeddings) usually lies near much lower-dimensional structure, which is why k-NN on a good embedding still works — a thread picked up in Chapter 05.

Run it from scratch. The cell below builds two overlapping Gaussian blobs, classifies with brute-force distance + vote, and scores \(k=1\) against \(k=15\). Note the signature of memorization: \(k=1\) is perfect on the training set (every point is its own nearest neighbor) and the worst of the pair on the test set.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(7)

# two overlapping Gaussian blobs — 240 points, 160 train / 80 test
n = 120
X0 = rng.normal([-0.9, -0.6], 1.1, (n, 2))
X1 = rng.normal([ 0.9,  0.7], 1.1, (n, 2))
X = np.vstack([X0, X1]); y = np.r_[np.zeros(n, int), np.ones(n, int)]
idx = rng.permutation(2 * n)
Xtr, ytr = X[idx[:160]], y[idx[:160]]
Xte, yte = X[idx[160:]], y[idx[160:]]

def knn_predict(Xq, k):
    d = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)  # all pairwise dist²
    votes = ytr[np.argsort(d, axis=1)[:, :k]]              # labels of k nearest
    return (votes.mean(axis=1) > 0.5).astype(int)          # majority (k is odd)

for k in (1, 15):
    print(f"k={k:2d}   train acc = {(knn_predict(Xtr, k) == ytr).mean():.3f}"
          f"   test acc = {(knn_predict(Xte, k) == yte).mean():.3f}")
plot_scatter(X[:, 0], X[:, 1], y)
edits are live — try k = 75, or shrink the blob spread to 0.5
4.2

Decision trees: greedy questions, rectangular answers

A decision tree classifies by interrogation: is feature 2 below 0.21? yes → left, no → right, repeated until a leaf, where the majority class of the training points that landed there becomes the prediction. Geometrically, every internal node slices the space with an axis-aligned cut, so every leaf is a rectangle (a box, in higher dimensions). The model is a partition.

Which question to ask first? Finding the globally optimal tree is NP-complete (Hyafil & Rivest, 1976), so CART — the algorithm running live in Instrument M4.1 — is greedy: at each node, try every feature and every threshold, keep the single split that most purifies the two children, and recurse. Purity is measured by Gini impurity or entropy:

EQ M4.2 — GINI IMPURITY AND SPLIT GAIN $$ G(S) \;=\; 1 - \sum_{c} p_c^{\,2}, \qquad \Delta \;=\; G(S) \;-\; \frac{|S_L|}{|S|}\,G(S_L) \;-\; \frac{|S_R|}{|S|}\,G(S_R) $$
\(p_c\) is the fraction of class \(c\) among the samples \(S\) at the node; \(G\) is the probability that two random draws from the node disagree — 0 when pure, 0.5 at a two-class coin flip. The split \(s\) sends samples to children \(S_L, S_R\), and CART picks the feature–threshold pair maximizing the gain \(\Delta\). The entropy alternative \(H(S) = -\sum_c p_c \log_2 p_c\) ("information gain") almost never produces a different tree — Gini wins on speed, not principle.
A tree node holds 8 samples: 6 of class ● and 2 of class ○. What is its Gini impurity \(G = 1 - \sum_c p_c^2\) (EQ M4.2)?
Fractions: \(p_● = 6/8 = 0.75\), \(p_○ = 2/8 = 0.25\). \(G = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 1 - 0.625 = \) 0.375. Less impure than the 0.5 of a 50/50 node, but not yet pure.
A parent node of 8 samples (4 ●, 4 ○, so \(G = 0.5\)) is split into a left child of 4 (3 ●, 1 ○) and a right child of 4 (1 ●, 3 ○). Each child has Gini \(0.375\). What is the split gain \(\Delta\) (EQ M4.2)?
\(\Delta = G(S) - \tfrac{|S_L|}{|S|}G(S_L) - \tfrac{|S_R|}{|S|}G(S_R) = 0.5 - \tfrac{4}{8}(0.375) - \tfrac{4}{8}(0.375) = 0.5 - 0.1875 - 0.1875 = \) 0.125. A modest, balanced improvement — CART would still prefer this over any split that purifies less.
FIG M4.AA DEPTH-2 TREE IS A PARTITION INTO THREE RECTANGLES
x₂ < 0.21 ? T F R1 · PREDICT ● x₁ < 1.04 ? T F R2 · PREDICT ○ R3 · PREDICT ● x₂ = 0.21 x₁ = 1.04 R1 · ● R2 · ○ R3 · ● x₁ → x₂
Same object, two views. The tree (left) and the partition (right) are identical. Note that class ● already owns a non-convex region — two splits suffice — and that the boundaries are axis-parallel by construction: a tree can only approximate a diagonal frontier with a staircase.

Trees buy three practical superpowers that gradient-trained models lack. They are invariant to any monotone rescaling of a feature (only the ordering of values matters, so no normalization, ever); they ingest mixed numeric and categorical columns without ceremony; and a small tree is genuinely readable — you can print it as a flowchart and hand it to an auditor. Their weakness is the same as their strength: predictions are piecewise-constant, so a lone tree is jagged, unstable, and cannot extrapolate a trend beyond the data it saw.

The atomic unit is the decision stump — a depth-1 tree, one question, two answers. The cell below performs the exact inner-loop search that CART runs at every node: scan every threshold, score every split by Gini gain, keep the best. Boosting (§4.5) will assemble hundreds of barely-better-than-chance learners like this one into a precision instrument.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(3)

# 1-D labels with a true step at x = 0.35, plus 10% label noise
n = 200
x = rng.uniform(0, 1, n)
y = (x > 0.35).astype(int)
flip = rng.random(n) < 0.10
y[flip] = 1 - y[flip]

def gini(labels):
    if len(labels) == 0: return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)        # = 1 - p² - (1-p)² for two classes

parent = gini(y)
best_gain, best_t = -1.0, None
for t in np.sort(x)[1:]:          # candidate threshold between each pair
    L, R = y[x < t], y[x >= t]
    w = len(L) / n
    gain = parent - (w * gini(L) + (1 - w) * gini(R))
    if gain > best_gain:
        best_gain, best_t = gain, t

pred = (x >= best_t).astype(int)
print(f"parent Gini    : {parent:.4f}")
print(f"best threshold : x = {best_t:.4f}   (true step at 0.35)")
print(f"Gini gain      : {best_gain:.4f}")
print(f"stump accuracy : {(pred == y).mean():.3f}   (noise ceiling ≈ 0.90)")
raise the label noise to 0.3 and watch the gain — and the recovered threshold — degrade
4.3

Overfitting a tree: depth is capacity

Nothing in CART tells it when to stop. Left alone, it splits until every leaf is pure — and a tree of depth \(d\) can carve up to \(2^d\) rectangles, so by depth 20 it has a private box for every training point, noise included. Train accuracy climbs monotonically with depth; test accuracy rises, peaks where the tree has captured the signal, then falls as additional splits start chiseling the noise. That divergence is the whole concept of overfitting, and you can watch it happen below: the instrument fits a real CART (the exact greedy Gini search of EQ M4.2) on 160 seeded "two-moons" points, paints its decision regions, and scores it against 80 held-out test points.

INSTRUMENT M4.1 — DEPTH DIALREAL CART · REFIT LIVE IN JS · EQ M4.2
ACCURACY VS DEPTH — TRAIN · TEST
TRAIN ACC
TEST ACC
GENERALIZATION GAP
LEAVES
Solid dots are training points, hollow rings are the held-out test set; the painted regions are the tree's verdict at every pixel block. Depth 1–2 underfits — one axis cut cannot bend around a moon. Depth 3–4 is honest. By depth 9 the tree is perfect on train and busy fencing off single noisy points; on this seed, test accuracy slides from 85% back to 80% while train hits 100%. The curves below the map plot the full sweep — the widening mint–blue scissors after depth ~3 is overfitting, drawn live.

Production practice caps capacity directly — maximum depth, minimum samples per leaf, minimum gain to split, or post-hoc pruning — and Chapter 06 treats this trade-off in its full generality as bias versus variance. But the idea is bigger than trees, and k-NN makes the point beautifully because its capacity dial runs backwards: small \(k\) means high capacity. At \(k=1\) the model memorizes — every training point wins its own private island of space — while large \(k\) averages over wide neighborhoods and the boundary relaxes into something smooth. Same data below; watch the islands dissolve.

INSTRUMENT M4.2 — k-NN k-SLIDERBRUTE-FORCE VOTE · SAME DATA · EQ M4.1
TRAIN ACC (SELF-VOTE)
TEST ACC
REGIME
Every painted block is a genuine brute-force vote over all 160 training points. At k = 1 train accuracy reads 100% — each point votes for itself, which is exactly how memorization flatters itself — yet test accuracy is the worst on the dial (81% on this seed vs 89% at k = 7). Slide right and the speckled islands vanish; push to k = 31 and watch the thin tails of each moon get annexed by the opposing majority — global accuracy holds steady here, but the local geometry is visibly wrong.
4.4

Ensembles I: bagging and random forests

A deep tree is a low-bias, high-variance learner: it can represent almost anything, but refit it on a slightly different sample and you get a visibly different partition. Bagging (bootstrap aggregating, Breiman 1996) exploits that instability instead of fighting it: draw \(B\) bootstrap samples (sample \(n\) points with replacement — each bag contains about \(1 - 1/e \approx 63.2\%\) of the unique points), grow a deep, deliberately unpruned tree on each, and average their votes. Why averaging helps is one line of statistics:

EQ M4.3 — VARIANCE OF AN AVERAGE $$ \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \hat{f}_i(x) \right) \;=\; \frac{\sigma^2}{n} \qquad \text{(uncorrelated predictors)} $$
Each tree's prediction errs with variance \(\sigma^2\); averaging \(n\) independent errors divides the variance by \(n\). The catch: trees grown on overlapping bootstrap samples are correlated, and with pairwise correlation \(\rho\) the variance is \(\rho\sigma^2 + \tfrac{1-\rho}{n}\sigma^2\). Averaging annihilates the second term but the \(\rho\sigma^2\) floor survives no matter how many trees you add — so the entire engineering problem becomes: decorrelate the trees.
Four uncorrelated trees each predict with error variance \(\sigma^2 = 0.8\). By EQ M4.3, what is the variance of their averaged prediction?
\(\dfrac{\sigma^2}{n} = \dfrac{0.8}{4} = \) 0.2. Averaging four independent learners quarters the variance — the entire reason bagging works, and the reason random forests fight so hard to keep their trees uncorrelated.

A random forest is bagging plus one decorrelation trick: at every split, the tree may only consider a random subset of the features (classically \(\sqrt{p}\) of \(p\) for classification). Strong features stop dominating every tree, \(\rho\) drops, and the variance floor drops with it. Two free gifts follow. Because each tree never saw ~37% of the data, scoring each point with only the trees that missed it yields out-of-bag error — an honest validation estimate with no held-out split. And since trees are independent, training parallelizes perfectly.

The honest ledger: random forests are astonishingly hard to break — near-default hyperparameters land within a few percent of optimal on most tabular tasks — but they pay in memory and latency (hundreds of deep trees), their predictions remain step functions, and like all trees they cannot extrapolate beyond the convex hull of what they saw: a forest trained on 2019 prices will never predict a 2026 price above the 2019 maximum.

4.5

Ensembles II: gradient boosting

Bagging builds strong learners in parallel and averages away variance. Boosting does the opposite: build weak learners — shallow trees, depth 4–8 — in sequence, each one trained on what the ensemble so far still gets wrong, attacking bias instead. After \(m-1\) rounds the model is \(F_{m-1}\); the next tree fits its mistakes:

EQ M4.4 — THE BOOSTING UPDATE $$ F_m(x) \;=\; F_{m-1}(x) \;+\; \eta\, h_m(x), \qquad h_m \;\text{ fit to the residuals }\; r_i = y_i - F_{m-1}(x_i) $$
For squared loss, the residuals \(r_i\) literally are the errors left over. The general recipe — fit \(h_m\) to the negative gradient of any differentiable loss, evaluated at the current predictions — is what makes it gradient boosting: the same descent idea as Chapter 02, except each "step" is an entire tree, taken in function space rather than parameter space. The shrinkage \(\eta\) (typically 0.01–0.3) deliberately under-commits to each tree; small \(\eta\) plus more rounds almost always generalizes better. Unlike bagging, boosting will overfit as rounds accumulate — early stopping on a validation set is part of the algorithm, not an optional extra.
For one input: target \(y = 10\), current prediction \(F_{m-1} = 6\), and the new tree fits the residual exactly (\(h_m = 4\)). With shrinkage \(\eta = 0.3\), what residual \(y - F_m\) remains after this boosting round (EQ M4.4)?
Update: \(F_m = F_{m-1} + \eta\,h_m = 6 + 0.3\cdot 4 = 7.2\). New residual: \(y - F_m = 10 - 7.2 = \) 2.8 — exactly \((1 - \eta) = 0.7\) times the old residual of \(4\). Shrinkage means each round deliberately leaves most of the error for the next tree.

The modern implementations are the tabular kings. XGBoost (2016) added second-order gradients and explicit regularization on leaf weights; LightGBM made training fast at scale with histogram-binned splits and leaf-wise growth; CatBoost specializes in categorical columns via ordered target statistics. A decade of Kaggle tabular leaderboards is, to a first approximation, a history of these three libraries.

PropertyRandom forestGradient boosting
Trees builtin parallel, independentin sequence, each fixing the last
Error attackedvariance (EQ M4.3)bias (EQ M4.4)
Tree shapedeep, unprunedshallow (depth 4–8 / 31–255 leaves)
Tuning sensitivitylow — defaults nearly optimalmoderate — η, rounds, depth interact
Overfits with more trees?no (variance floor, never worse)yes — early stopping required
Typical accuracy on tablesstrongstrongest — the default to beat
4.6

When trees beat deep learning

This encyclopedia spends three volumes on neural networks, so honesty demands this section. On medium-sized tables — the ~1K-to-1M-row, heterogeneous-column datasets that constitute most of applied machine learning in industry — tuned gradient-boosted trees have beaten tuned deep models for most of the past decade. The careful benchmark of Grinsztajn et al. (NeurIPS 2022) compared XGBoost and random forests against MLPs, ResNets, and tabular transformers across ~45 datasets with equal tuning budgets, and the trees won outright on most of them. Three reasons survive scrutiny:

  • Tabular targets are irregular. Neural networks carry a smoothness prior; real-world table columns (thresholded business rules, saturation effects, encoded categories) produce jagged, discontinuous target functions that piecewise-constant trees fit natively.
  • Uninformative features are everywhere. A tree simply never splits on a useless column. An MLP must spend capacity learning to ignore it, and in low-data regimes it often fails to.
  • Axis alignment is the right prior. Table columns are individually meaningful — "age", "income" — and the correct decision boundaries really do tend to run parallel to them. The rotation invariance that serves vision models is exactly the wrong inductive bias here.

The frontier is genuinely moving, and the honest 2026 picture is contested at the edges: TabPFN-style models — transformers pre-trained on millions of synthetic tables that "fit" a new dataset by in-context learning, no gradient steps at all — now beat tuned GBDTs on many small tables (roughly ≤10K rows), as published in Nature in 2025. Deep models also win wherever the table stops being a table: text or image columns, massive row counts, transfer from pretrained representations, or the need for embeddings that feed a larger system. But the default has not changed: faced with a fresh tabular problem, the strong move is still a LightGBM or XGBoost baseline, trained in minutes on a CPU. Anything more exotic must beat that number to earn its complexity — and most of the time, it doesn't.

NEXT

Every method so far was handed labels. Chapter 05 takes them away: k-means clustering stepped one Lloyd iteration at a time, PCA as organized variance-hunting, and the idea that data can describe itself — the first step toward the embeddings that power everything in Volumes II and beyond.

§

Further reading

  • Cover, T. & Hart, P. (1967). Nearest Neighbor Pattern Classification. — the founding analysis of k-NN, including the bound that 1-NN error is at most twice the Bayes error.
  • Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. — the CART monograph that defines the greedy splitting, impurity, and pruning used by every modern tree.
  • Breiman, L. (2001). Random Forests. — introduces bagging plus random feature subsets and explains why averaging decorrelated trees lowers variance.
  • Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. — the paper that frames boosting as gradient descent in function space.
  • Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. — the engineering and regularization advances that made gradient boosting the default for tabular data.
  • Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? — the empirical case for trees over neural nets on heterogeneous tables.