AI // ENCYCLOPEDIA / VOL I / ML FOUNDATIONS / 05 / CLUSTERING & DIMENSIONALITY INDEX NEXT: GENERALIZATION →
VOLUME I — FOUNDATIONS OF ML · CHAPTER 05 / 08

Clustering & Dimensionality

Every method so far was handed the right answers. This chapter takes them away. With no labels, the model must find structure the data carries on its own: groups that belong together, directions that matter, coordinates worth keeping. Data can describe itself, and a model that compresses data well has, in a measurable sense, understood it. Scaled up, that is how every modern language model learns.

LEVELCORE READING TIME≈ 22 MIN BUILDS ONCH 01 · 02 · 04 INSTRUMENTSLLOYD STEPPER · VARIANCE HUNT
5.1

Taking the labels away

Supervised learning runs on a luxury: someone already did the job correctly, two million times, and wrote the answers down. That luxury is expensive — labels cost human hours, expert hours, sometimes biopsy results — and it is scarce in exactly the places data is abundant. Server logs, transaction streams, sensor traces, the text of the internet: almost everything ever recorded arrives unlabeled. Unsupervised learning is the discipline of extracting structure from \(x\) alone, with no \(y\) anywhere in the problem.

What can structure mean, when nobody defines success for you? Three recurring answers, two of which this chapter builds from scratch:

TaskThe question it asksCanonical methodYou have already met it as…
ClusteringWhich points belong together?k-means (§5.2)k-NN's neighborhoods (Ch 04), minus the labels
Dimensionality reductionWhich directions carry the signal?PCA (§5.4)feature scaling's smarter sibling (Ch 02)
Density estimationWhat does typical look like — and what doesn't?histograms, mixtures, …the distribution \(\mathcal{D}\) behind EQ M1.3

The honest difficulty is not algorithmic — both algorithms below fit in a dozen lines. It is that without labels there is no ground truth to be scored against. A supervised model is wrong when it disagrees with \(y\); a clustering is "wrong" only relative to a purpose nobody encoded in the data. Every unsupervised method therefore optimizes a proxy — compactness, variance, reconstruction — and the practitioner's job is judging whether the proxy matches the purpose. Keep that skepticism switched on for the rest of the chapter; both instruments below are built to reward it.

One taxonomy note before we start. The regime that ate the world is a hybrid: self-supervised learning manufactures labels out of the raw data itself — mask a word and predict it, crop an image and match it, take a prefix and predict the next token. The loss is supervised machinery (cross-entropy, Ch 03); the labels cost nothing, exactly like the methods here. Section 5.5 traces the line from this chapter to that one.

5.2

k-means: Lloyd's loop

The oldest serious answer to "which points belong together" is geometric: choose \(k\) cluster centers, assign every point to its nearest center, and call a clustering good when points sit close to their assigned centers. That sentence is already the objective — the unsupervised stand-in for a loss function, with no label anywhere in it:

EQ M5.1 — THE K-MEANS OBJECTIVE (INERTIA) $$ J\big(c_{1..n},\, \mu_{1..k}\big) \;=\; \sum_{i=1}^{n} \big\lVert x_i - \mu_{c_i} \big\rVert^{2}, \qquad c_i \in \{1, \dots, k\} $$
\(\mu_j\) are the \(k\) centroids; \(c_i\) names the cluster point \(i\) is assigned to; \(J\) — called inertia, or within-cluster sum of squares — totals every point's squared distance to its own centroid. Minimizing \(J\) jointly over assignments and centroids is NP-hard even for \(k = 2\). Nobody minimizes it exactly; everybody uses the 1957 heuristic below.
Four 1-D points: 0, 2, 10, 12, with two centroids fixed at \(\mu_1 = 1\) and \(\mu_2 = 11\). Each point joins its nearest centroid. What is the inertia \(J = \sum_i (x_i - \mu_{c_i})^2\)?
Assign by nearness: 0 and 2 go to \(\mu_1 = 1\); 10 and 12 go to \(\mu_2 = 11\). Squared distances: \((0-1)^2 = 1\), \((2-1)^2 = 1\), \((10-11)^2 = 1\), \((12-11)^2 = 1\). Each centroid sits exactly one unit from both of its points, so \(J = 1+1+1+1 = \) 4.

Lloyd's algorithm attacks \(J\) the way you would untangle any two-variable problem: freeze one variable, optimize the other, alternate. Both half-steps have closed forms, and both can only push \(J\) down:

EQ M5.2 — LLOYD'S TWO STEPS $$ \textbf{ASSIGN:}\;\; c_i \,\leftarrow\, \arg\min_{j} \,\lVert x_i - \mu_j \rVert^{2} \qquad\qquad \textbf{UPDATE:}\;\; \mu_j \,\leftarrow\, \frac{1}{\lvert S_j \rvert} \sum_{i \in S_j} x_i $$
ASSIGN is optimal for the current centroids by definition of "nearest"; UPDATE is optimal for the current assignments because the mean is the point minimizing total squared distance to a set — the same fact that made squared error pick the average in Chapter 01. Each half-step lowers \(J\) or leaves it fixed, \(J \ge 0\), and only finitely many assignments exist, so the loop must converge — in remarkably few sweeps, in practice. The fine print: it converges to a local minimum that depends entirely on where the centroids started.
After ASSIGN, one cluster holds these four points (x-coordinate only): 3, 4, 6, 7. The UPDATE step moves the centroid to \(\mu = \frac{1}{|S|}\sum_{i \in S} x_i\). What is the new centroid's x-coordinate?
UPDATE replaces a centroid with the mean of its assigned points: \(\mu = (3 + 4 + 6 + 7)/4 = 20/4 = \) 5. This is the point minimizing total squared distance to the set — the closed form that makes UPDATE optimal for fixed assignments.

Run the loop by hand. The instrument seeds three well-separated blobs, drops \(k\) centroids on random data points, and gives you the two half-steps as buttons. Watch \(J\) after every press — it never rises, which is the convergence proof happening in front of you.

INSTRUMENT M5.1 — LLOYD STEPPER180 SEEDED POINTS · EQ M5.2 BY HAND · ✕ = CENTROID
NEXT HALF-STEP
ASSIGN
FULL SWEEPS
0
INERTIA J — EQ M5.1
Step until the readout says CONVERGED — with k = 3 it takes only a handful of sweeps, and most starts land at J ≈ 47, the (lucky) global optimum. Now press RE-INIT a few times and re-run: some draws put two centroids in the same blob, and Lloyd converges — fully, honestly converged — at J ≈ 480, one centroid straddling two blobs forever. Convergence is not correctness. Then sweep k: at k = 2 a blob pair is forcibly merged; at k = 6 real blobs get carved into fragments — and J still goes down, because adding centroids always lowers inertia. J can compare runs at the same k; it cannot choose k for you (§5.3).

The same loop in numpy — eight lines of algorithm, fully vectorized. The printed inertia drops hard, then freezes: that plateau is convergence (no assignment changed, so nothing can move again).

PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(5)

# three blobs, then k-means from scratch — Lloyd's two steps, verbatim
centers = np.array([[-2.0, -1.0], [2.0, -1.2], [0.2, 1.8]])
X = np.vstack([rng.normal(c, 0.55, (60, 2)) for c in centers])

k = 3
mu = X[rng.choice(len(X), k, replace=False)]          # init: k random points
for it in range(6):
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # n x k distances
    c = d.argmin(1)                                   # ASSIGN  (EQ M5.2)
    J = d[np.arange(len(X)), c].sum()
    mu = np.array([X[c == j].mean(0) for j in range(k)])  # UPDATE (EQ M5.2)
    print(f"sweep {it}:  inertia J = {J:8.2f}")

print("\nrecovered centroids (true: -2,-1 / 2,-1.2 / 0.2,1.8):")
print(np.round(mu, 2))
plot_scatter(X[:, 0], X[:, 1], c)
edits are live — try k = 5, or rng seed 12 for a different init

Production reality, in three habits. (1) Never one run: restart from many random inits and keep the lowest \(J\) — or use k-means++ seeding (spread the initial centroids out proportionally to squared distance), which is the default in every serious library and carries an \(O(\log k)\) approximation guarantee. (2) Scale features first — squared Euclidean distance inherits every pathology Chapter 02 warned about, and an unscaled column silently owns the clustering. (3) At web scale, use minibatch k-means: same two steps, estimated on samples, for the same reason SGD exists.

5.3

What k-means cannot see

k-means is fast, simple, and everywhere — and it is opinionated in ways the output never confesses. The objective is built from squared Euclidean distance to a single center, so the method implicitly assumes every cluster is a compact, roughly spherical, roughly equal-sized ball. Hand it anything else and it will still return \(k\) tidy clusters, with total confidence, and they will be wrong:

Baked-in assumptionHow real data breaks itWhat to reach for
Clusters are sphericalelongated or curved groups — Chapter 04's two moons get sliced crosswise, not tracedspectral clustering; DBSCAN
Similar size & densityone dense core next to a sparse halo: the boundary lands where the variances balance, not where humans see itGaussian mixtures (EM) — k-means with ellipses and soft assignments
Every point belongs somewhereoutliers drag centroids; a single rogue point can claim a centroid at large kDBSCAN (has a noise label); trim or robustify first
k is knownit almost never iselbow on J, silhouette score — both heuristics, neither decisive

On choosing \(k\): inertia alone cannot do it — you proved in Instrument M5.1 that \(J\) falls monotonically as \(k\) grows, all the way to the absurd optimum of one centroid per point (\(J = 0\), the lookup table of Chapter 01 wearing a new disguise). The elbow method plots \(J\) against \(k\) and looks for the bend where added centroids stop paying; the silhouette score compares each point's distance to its own cluster against the nearest foreign one. Both are judgment calls dressed as numbers. When a downstream task exists — clusters feeding a classifier, segments feeding a campaign — let its metric choose \(k\); that converts an unanswerable unsupervised question back into a measurable supervised one, and it is the most honest trick in this chapter.

5.4

PCA: organized variance-hunting

Clustering compresses \(n\) points into \(k\) prototypes. The other great compression runs crosswise: keep every point, but shrink the number of coordinates used to describe it. Real datasets are wildly redundant — square footage and room count move together; pixel 2,001 mostly agrees with pixel 2,002. Redundancy means the data hugs a lower-dimensional sheet inside its nominal space, and principal component analysis finds the best flat sheet by a beautifully blunt criterion: keep the directions along which the data varies most. Center the data (always — PCA is blind without it), then ask for the unit direction that maximizes the variance of the projections:

EQ M5.3 — PCA AS VARIANCE MAXIMIZATION $$ u_1 \;=\; \arg\max_{\lVert u \rVert = 1} \; u^{\top} \Sigma\, u, \qquad \Sigma \;=\; \frac{1}{n} \sum_{i=1}^{n} \big(x_i - \bar{x}\big)\big(x_i - \bar{x}\big)^{\top} $$
\(\Sigma\) is the covariance matrix; \(u^\top \Sigma u\) is exactly the variance of the data projected onto \(u\). The maximizer is the top eigenvector of \(\Sigma\), and the variance it captures is its eigenvalue \(\lambda_1\); the second component is the best direction orthogonal to the first, and so on down the spectrum. The twin reading matters just as much: by Pythagoras, variance kept + variance lost = total, so the direction that captures the most variance is the same direction that minimizes squared perpendicular reconstruction error. Maximal information and minimal distortion are one criterion, not two.
A diagonal covariance \(\Sigma = \begin{psmallmatrix}4 & 0\\ 0 & 1\end{psmallmatrix}\) and a unit direction \(u = (0.6,\, 0.8)\) (note \(0.6^2 + 0.8^2 = 1\)). How much variance does \(u\) capture, \(u^\top \Sigma u\)?
For a diagonal \(\Sigma\) the cross-terms vanish, so \(u^\top \Sigma u = u_1^2\,\Sigma_{11} + u_2^2\,\Sigma_{22} = 0.6^2 \cdot 4 + 0.8^2 \cdot 1 = 0.36 \cdot 4 + 0.64 \cdot 1 = 1.44 + 0.64 = \) 2.08. (The top eigenvector here is \((1,0)\) with eigenvalue 4; this off-axis \(u\) captures less.)

Hunt the direction yourself before the eigensolver does. Below is a centered, correlated cloud; the slider aims a candidate axis, the red stalks are what projection onto that axis throws away, and the lower chart sweeps EQ M5.3's objective across every angle:

INSTRUMENT M5.2 — VARIANCE HUNT200 SEEDED POINTS · uTΣu LIVE · RED = WHAT PROJECTION DISCARDS
VARIANCE CAPTURED vs ANGLE — EQ M5.3 SWEPT OVER ALL θ
VARIANCE CAPTURED uTΣu
SHARE OF TOTAL
DISCARDED (RED STALKS)
PC1 — EIGENVECTOR ANSWER
Drag θ and watch the two readouts trade against each other — captured + discarded is constant, the Pythagoras identity of EQ M5.3. At θ = 0° (plain "keep the x-axis") you keep 66% of the variance; the dashed mint axis, at 32.5°, keeps 87.9% — and no angle does better, because that is the top eigenvector of this sample's Σ. One honest wrinkle: the cloud was generated along 34°. The 1.5° gap is sampling noise — PCA recovers the truth of your sample, which is never quite the truth of the world (Chapter 06 is about exactly this gap).

How many components to keep? The eigenvalue spectrum is the budget sheet — and the dropped eigenvalues are not vaguely "lost information", they are exactly the reconstruction error:

EQ M5.4 — THE BUDGET SHEET $$ \text{variance kept by } d \text{ of } D \text{ components} \;=\; \frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{D} \lambda_j}, \qquad \frac{1}{n}\sum_{i=1}^{n} \big\lVert x_i - \hat{x}_i \big\rVert^{2} \;=\; \sum_{j=d+1}^{D} \lambda_j $$
\(\hat{x}_i\) is the reconstruction from the kept \(d\) components. The right-hand identity is the Eckart–Young theorem in its friendliest costume: among all linear projections to \(d\) dimensions, PCA's is the one with the smallest possible reconstruction error, and that error equals the sum of the eigenvalues you dropped. Practitioners keep enough components for 90–99% of variance, or cut where the spectrum visibly cliffs.
An eigenvalue spectrum \(\lambda = (6,\, 3,\, 1)\). You keep the top \(d = 2\) components. What percentage of the total variance is retained?
Total variance \(= 6 + 3 + 1 = 10\). Kept by the top 2: \(6 + 3 = 9\). Fraction \(= 9/10 = 0.90\), i.e. 90%. The discarded \(\lambda_3 = 1\) is exactly the reconstruction error (the Eckart–Young identity of EQ M5.4).
PYTHON · RUNNABLE IN-BROWSER
import numpy as np
rng = np.random.default_rng(11)

# a correlated 2-D cloud: most variance lives along one hidden direction
n = 300
t = rng.normal(0, 1.9, n)                 # signal along the hidden axis
s = rng.normal(0, 0.6, n)                 # noise across it
ang = np.deg2rad(34)
X = np.column_stack([t*np.cos(ang) - s*np.sin(ang),
                     t*np.sin(ang) + s*np.cos(ang)])

Xc = X - X.mean(0)                        # center first — always
C = Xc.T @ Xc / n                         # covariance matrix (EQ M5.3)
lam, U = np.linalg.eigh(C)                # eigh: ascending order...
lam, U = lam[::-1], U[:, ::-1]            # ...so flip to largest-first

print("eigenvalues          :", np.round(lam, 3))
print("explained variance % :", np.round(100*lam/lam.sum(), 1))
print("PC1 angle (true 34deg):", round(np.degrees(np.arctan2(U[1,0], U[0,0])) % 180, 1))

Z  = Xc @ U[:, :1]                        # project to 1-D: the embedding
Xr = Z @ U[:, :1].T                       # reconstruct from 1 component
print("reconstruction MSE   :", round(float(((Xc - Xr)**2).mean()), 4))
print("dropped eigenvalue/2 :", round(lam[1]/2, 4), "  <- EQ M5.4, per entry")
shrink the noise to 0.1 — watch PC1 snap to 34° and the MSE collapse
FINE PRINT

Three ways PCA lies to the unwary. (1) It is scale-covariant: measure one feature in millimeters instead of meters and it manufactures a fake principal direction — standardize first unless the units are genuinely shared. (2) Variance is not importance. PCA never saw your labels; the discriminating signal for a downstream task can sit in a low-variance direction PCA throws away first. (3) It is strictly linear: a sheet, never a curve. Data on Chapter 04's moons or a spiral has structure no single flat projection can keep — the cue for everything in the next section.

5.5

From compression to embeddings

Look again at what the PCA cell printed: it took each point and re-expressed it as coordinates \(z\) in a learned system where the axes are ordered by how much they matter. That object has a modern name — an embedding: a compact vector representation in which geometry encodes meaning, so that nearby vectors are similar things. PCA is the simplest embedding machine ever built, and the lineage from here to the frontier is unusually direct:

  • Nonlinear map-making: t-SNE and UMAP. For looking at high-dimensional data, these bend the sheet, preserving local neighborhoods at the cost of global honesty. Use them for eyes, never for arithmetic: cluster sizes, inter-cluster distances, and density in such plots are artifacts of the optimizer as much as of the data, and both methods will draw confident-looking islands in pure noise.
  • Autoencoders. Replace PCA's linear projection with Chapter 07's networks: an encoder squeezes \(x\) into a low-dimensional bottleneck, a decoder reconstructs, and the loss is reconstruction error — EQ M5.4's right-hand side, made trainable by gradient descent. A linear autoencoder provably recovers the PCA subspace; nonlinear ones learn curved sheets PCA cannot.
  • Self-supervision: data labels itself. Delete part of the data and train a supervised model to restore it — next-token prediction is precisely this (Vol II · Ch 04). The "labels" are free, the scale is the internet, and the representations that fall out are the embeddings inside every LLM.
  • Embeddings meet Chapter 04. Today's retrieval stack is this chapter's two ideas reassembled: a learned embedding (this section) plus nearest-neighbor search over it (Ch 04's k-NN) is vector search — the machinery under RAG and semantic search. Even the curse of dimensionality resolves on cue: embeddings work because real data concentrates near low-dimensional structure, which is the bet PCA placed first.

The through-line deserves saying plainly. k-means compresses a dataset into \(k\) prototypes; PCA compresses it into \(d\) directions; an autoencoder into a bottleneck; a language model into weights that can regenerate the statistics of its corpus. In every case the test is the same — how much of the data can the summary give back? — and in every case better compression has meant something uncomfortably like better understanding. That is not a metaphor the field decorates itself with; it is the actual objective the largest training runs in history are minimizing.

NEXT

Your toolbox is now complete enough to be dangerous — supervised and unsupervised, parametric and memory-based. Chapter 06 supplies the discipline that decides whether any of it survives contact with new data: bias against variance, the capacity U-curve and its modern double-descent twist, regularization, and the validation hygiene that separates a result from an artifact.

§

Further reading

  • Lloyd, S. (1982). Least Squares Quantization in PCM. — the iterative assign-then-recenter algorithm now universally known as k-means (written 1957, published 1982).
  • Arthur, D. & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. — the smart initialization that fixes Lloyd's sensitivity to starting centroids.
  • Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. — the geometric origin of principal component analysis as best-fit subspaces.
  • Hotelling, H. (1933). Analysis of a Complex of Statistical Variables into Principal Components. — the variance-maximization formulation of PCA used today.
  • van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. — the canonical nonlinear method for embedding high-dimensional data into 2-D maps.
  • McInnes, L., Healy, J. & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. — a faster, structure-preserving alternative to t-SNE for embeddings.