k-NN: memory as a model
The k-nearest-neighbors classifier has no training step, no parameters, and no loss function. The "model" is the training set itself, a distance function, and one integer. To classify a new point \(x\): find the \(k\) training points closest to it, and let them vote.
For something this naive, k-NN is theoretically respectable: a classic result (Cover & Hart, 1967) shows that with unlimited data, the humble 1-NN rule's error is at most twice the best achievable error of any classifier. Memory is a legitimate model. And it scaled further than anyone guessed: vector search over learned embeddings — the retrieval step inside every RAG system and vector database — is k-NN run at billion-document scale. The algorithm never died; it just changed feature spaces.
The curse of dimensionality. "Nearest" degrades as dimensions grow. In a 100-dimensional unit cube, a sub-cube that wants to capture just 1% of uniformly spread points needs edge length \(0.01^{1/100} \approx 0.955\) — it must span 95% of every axis to hold 1% of the data. Worse, pairwise distances concentrate: the gap between the nearest and farthest neighbor shrinks toward nothing, and the vote becomes noise. The honest footnote: real high-dimensional data (images, text embeddings) usually lies near much lower-dimensional structure, which is why k-NN on a good embedding still works — a thread picked up in Chapter 05.
Run it from scratch. The cell below builds two overlapping Gaussian blobs, classifies with brute-force distance + vote, and scores \(k=1\) against \(k=15\). Note the signature of memorization: \(k=1\) is perfect on the training set (every point is its own nearest neighbor) and the worst of the pair on the test set.
import numpy as np
rng = np.random.default_rng(7)
# two overlapping Gaussian blobs — 240 points, 160 train / 80 test
n = 120
X0 = rng.normal([-0.9, -0.6], 1.1, (n, 2))
X1 = rng.normal([ 0.9, 0.7], 1.1, (n, 2))
X = np.vstack([X0, X1]); y = np.r_[np.zeros(n, int), np.ones(n, int)]
idx = rng.permutation(2 * n)
Xtr, ytr = X[idx[:160]], y[idx[:160]]
Xte, yte = X[idx[160:]], y[idx[160:]]
def knn_predict(Xq, k):
d = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1) # all pairwise dist²
votes = ytr[np.argsort(d, axis=1)[:, :k]] # labels of k nearest
return (votes.mean(axis=1) > 0.5).astype(int) # majority (k is odd)
for k in (1, 15):
print(f"k={k:2d} train acc = {(knn_predict(Xtr, k) == ytr).mean():.3f}"
f" test acc = {(knn_predict(Xte, k) == yte).mean():.3f}")
plot_scatter(X[:, 0], X[:, 1], y)
Decision trees: greedy questions, rectangular answers
A decision tree classifies by interrogation: is feature 2 below 0.21? yes → left, no → right, repeated until a leaf, where the majority class of the training points that landed there becomes the prediction. Geometrically, every internal node slices the space with an axis-aligned cut, so every leaf is a rectangle (a box, in higher dimensions). The model is a partition.
Which question to ask first? Finding the globally optimal tree is NP-complete (Hyafil & Rivest, 1976), so CART — the algorithm running live in Instrument M4.1 — is greedy: at each node, try every feature and every threshold, keep the single split that most purifies the two children, and recurse. Purity is measured by Gini impurity or entropy:
Trees buy three practical superpowers that gradient-trained models lack. They are invariant to any monotone rescaling of a feature (only the ordering of values matters, so no normalization, ever); they ingest mixed numeric and categorical columns without ceremony; and a small tree is genuinely readable — you can print it as a flowchart and hand it to an auditor. Their weakness is the same as their strength: predictions are piecewise-constant, so a lone tree is jagged, unstable, and cannot extrapolate a trend beyond the data it saw.
The atomic unit is the decision stump — a depth-1 tree, one question, two answers. The cell below performs the exact inner-loop search that CART runs at every node: scan every threshold, score every split by Gini gain, keep the best. Boosting (§4.5) will assemble hundreds of barely-better-than-chance learners like this one into a precision instrument.
import numpy as np
rng = np.random.default_rng(3)
# 1-D labels with a true step at x = 0.35, plus 10% label noise
n = 200
x = rng.uniform(0, 1, n)
y = (x > 0.35).astype(int)
flip = rng.random(n) < 0.10
y[flip] = 1 - y[flip]
def gini(labels):
if len(labels) == 0: return 0.0
p = labels.mean()
return 2 * p * (1 - p) # = 1 - p² - (1-p)² for two classes
parent = gini(y)
best_gain, best_t = -1.0, None
for t in np.sort(x)[1:]: # candidate threshold between each pair
L, R = y[x < t], y[x >= t]
w = len(L) / n
gain = parent - (w * gini(L) + (1 - w) * gini(R))
if gain > best_gain:
best_gain, best_t = gain, t
pred = (x >= best_t).astype(int)
print(f"parent Gini : {parent:.4f}")
print(f"best threshold : x = {best_t:.4f} (true step at 0.35)")
print(f"Gini gain : {best_gain:.4f}")
print(f"stump accuracy : {(pred == y).mean():.3f} (noise ceiling ≈ 0.90)")
Overfitting a tree: depth is capacity
Nothing in CART tells it when to stop. Left alone, it splits until every leaf is pure — and a tree of depth \(d\) can carve up to \(2^d\) rectangles, so by depth 20 it has a private box for every training point, noise included. Train accuracy climbs monotonically with depth; test accuracy rises, peaks where the tree has captured the signal, then falls as additional splits start chiseling the noise. That divergence is the whole concept of overfitting, and you can watch it happen below: the instrument fits a real CART (the exact greedy Gini search of EQ M4.2) on 160 seeded "two-moons" points, paints its decision regions, and scores it against 80 held-out test points.
Production practice caps capacity directly — maximum depth, minimum samples per leaf, minimum gain to split, or post-hoc pruning — and Chapter 06 treats this trade-off in its full generality as bias versus variance. But the idea is bigger than trees, and k-NN makes the point beautifully because its capacity dial runs backwards: small \(k\) means high capacity. At \(k=1\) the model memorizes — every training point wins its own private island of space — while large \(k\) averages over wide neighborhoods and the boundary relaxes into something smooth. Same data below; watch the islands dissolve.
Ensembles I: bagging and random forests
A deep tree is a low-bias, high-variance learner: it can represent almost anything, but refit it on a slightly different sample and you get a visibly different partition. Bagging (bootstrap aggregating, Breiman 1996) exploits that instability instead of fighting it: draw \(B\) bootstrap samples (sample \(n\) points with replacement — each bag contains about \(1 - 1/e \approx 63.2\%\) of the unique points), grow a deep, deliberately unpruned tree on each, and average their votes. Why averaging helps is one line of statistics:
A random forest is bagging plus one decorrelation trick: at every split, the tree may only consider a random subset of the features (classically \(\sqrt{p}\) of \(p\) for classification). Strong features stop dominating every tree, \(\rho\) drops, and the variance floor drops with it. Two free gifts follow. Because each tree never saw ~37% of the data, scoring each point with only the trees that missed it yields out-of-bag error — an honest validation estimate with no held-out split. And since trees are independent, training parallelizes perfectly.
The honest ledger: random forests are astonishingly hard to break — near-default hyperparameters land within a few percent of optimal on most tabular tasks — but they pay in memory and latency (hundreds of deep trees), their predictions remain step functions, and like all trees they cannot extrapolate beyond the convex hull of what they saw: a forest trained on 2019 prices will never predict a 2026 price above the 2019 maximum.
Ensembles II: gradient boosting
Bagging builds strong learners in parallel and averages away variance. Boosting does the opposite: build weak learners — shallow trees, depth 4–8 — in sequence, each one trained on what the ensemble so far still gets wrong, attacking bias instead. After \(m-1\) rounds the model is \(F_{m-1}\); the next tree fits its mistakes:
The modern implementations are the tabular kings. XGBoost (2016) added second-order gradients and explicit regularization on leaf weights; LightGBM made training fast at scale with histogram-binned splits and leaf-wise growth; CatBoost specializes in categorical columns via ordered target statistics. A decade of Kaggle tabular leaderboards is, to a first approximation, a history of these three libraries.
| Property | Random forest | Gradient boosting |
|---|---|---|
| Trees built | in parallel, independent | in sequence, each fixing the last |
| Error attacked | variance (EQ M4.3) | bias (EQ M4.4) |
| Tree shape | deep, unpruned | shallow (depth 4–8 / 31–255 leaves) |
| Tuning sensitivity | low — defaults nearly optimal | moderate — η, rounds, depth interact |
| Overfits with more trees? | no (variance floor, never worse) | yes — early stopping required |
| Typical accuracy on tables | strong | strongest — the default to beat |
When trees beat deep learning
This encyclopedia spends three volumes on neural networks, so honesty demands this section. On medium-sized tables — the ~1K-to-1M-row, heterogeneous-column datasets that constitute most of applied machine learning in industry — tuned gradient-boosted trees have beaten tuned deep models for most of the past decade. The careful benchmark of Grinsztajn et al. (NeurIPS 2022) compared XGBoost and random forests against MLPs, ResNets, and tabular transformers across ~45 datasets with equal tuning budgets, and the trees won outright on most of them. Three reasons survive scrutiny:
- Tabular targets are irregular. Neural networks carry a smoothness prior; real-world table columns (thresholded business rules, saturation effects, encoded categories) produce jagged, discontinuous target functions that piecewise-constant trees fit natively.
- Uninformative features are everywhere. A tree simply never splits on a useless column. An MLP must spend capacity learning to ignore it, and in low-data regimes it often fails to.
- Axis alignment is the right prior. Table columns are individually meaningful — "age", "income" — and the correct decision boundaries really do tend to run parallel to them. The rotation invariance that serves vision models is exactly the wrong inductive bias here.
The frontier is genuinely moving, and the honest 2026 picture is contested at the edges: TabPFN-style models — transformers pre-trained on millions of synthetic tables that "fit" a new dataset by in-context learning, no gradient steps at all — now beat tuned GBDTs on many small tables (roughly ≤10K rows), as published in Nature in 2025. Deep models also win wherever the table stops being a table: text or image columns, massive row counts, transfer from pretrained representations, or the need for embeddings that feed a larger system. But the default has not changed: faced with a fresh tabular problem, the strong move is still a LightGBM or XGBoost baseline, trained in minutes on a CPU. Anything more exotic must beat that number to earn its complexity — and most of the time, it doesn't.
Every method so far was handed labels. Chapter 05 takes them away: k-means clustering stepped one Lloyd iteration at a time, PCA as organized variance-hunting, and the idea that data can describe itself — the first step toward the embeddings that power everything in Volumes II and beyond.
Further reading
- Cover, T. & Hart, P. (1967). Nearest Neighbor Pattern Classification. — the founding analysis of k-NN, including the bound that 1-NN error is at most twice the Bayes error.
- Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. — the CART monograph that defines the greedy splitting, impurity, and pruning used by every modern tree.
- Breiman, L. (2001). Random Forests. — introduces bagging plus random feature subsets and explains why averaging decorrelated trees lowers variance.
- Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. — the paper that frames boosting as gradient descent in function space.
- Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. — the engineering and regularization advances that made gradient boosting the default for tabular data.
- Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data? — the empirical case for trees over neural nets on heterogeneous tables.