Taking the labels away
Supervised learning runs on a luxury: someone already did the job correctly, two million times, and wrote the answers down. That luxury is expensive — labels cost human hours, expert hours, sometimes biopsy results — and it is scarce in exactly the places data is abundant. Server logs, transaction streams, sensor traces, the text of the internet: almost everything ever recorded arrives unlabeled. Unsupervised learning is the discipline of extracting structure from \(x\) alone, with no \(y\) anywhere in the problem.
What can structure mean, when nobody defines success for you? Three recurring answers, two of which this chapter builds from scratch:
| Task | The question it asks | Canonical method | You have already met it as… |
|---|---|---|---|
| Clustering | Which points belong together? | k-means (§5.2) | k-NN's neighborhoods (Ch 04), minus the labels |
| Dimensionality reduction | Which directions carry the signal? | PCA (§5.4) | feature scaling's smarter sibling (Ch 02) |
| Density estimation | What does typical look like — and what doesn't? | histograms, mixtures, … | the distribution \(\mathcal{D}\) behind EQ M1.3 |
The honest difficulty is not algorithmic — both algorithms below fit in a dozen lines. It is that without labels there is no ground truth to be scored against. A supervised model is wrong when it disagrees with \(y\); a clustering is "wrong" only relative to a purpose nobody encoded in the data. Every unsupervised method therefore optimizes a proxy — compactness, variance, reconstruction — and the practitioner's job is judging whether the proxy matches the purpose. Keep that skepticism switched on for the rest of the chapter; both instruments below are built to reward it.
One taxonomy note before we start. The regime that ate the world is a hybrid: self-supervised learning manufactures labels out of the raw data itself — mask a word and predict it, crop an image and match it, take a prefix and predict the next token. The loss is supervised machinery (cross-entropy, Ch 03); the labels cost nothing, exactly like the methods here. Section 5.5 traces the line from this chapter to that one.
k-means: Lloyd's loop
The oldest serious answer to "which points belong together" is geometric: choose \(k\) cluster centers, assign every point to its nearest center, and call a clustering good when points sit close to their assigned centers. That sentence is already the objective — the unsupervised stand-in for a loss function, with no label anywhere in it:
Lloyd's algorithm attacks \(J\) the way you would untangle any two-variable problem: freeze one variable, optimize the other, alternate. Both half-steps have closed forms, and both can only push \(J\) down:
Run the loop by hand. The instrument seeds three well-separated blobs, drops \(k\) centroids on random data points, and gives you the two half-steps as buttons. Watch \(J\) after every press — it never rises, which is the convergence proof happening in front of you.
The same loop in numpy — eight lines of algorithm, fully vectorized. The printed inertia drops hard, then freezes: that plateau is convergence (no assignment changed, so nothing can move again).
import numpy as np
rng = np.random.default_rng(5)
# three blobs, then k-means from scratch — Lloyd's two steps, verbatim
centers = np.array([[-2.0, -1.0], [2.0, -1.2], [0.2, 1.8]])
X = np.vstack([rng.normal(c, 0.55, (60, 2)) for c in centers])
k = 3
mu = X[rng.choice(len(X), k, replace=False)] # init: k random points
for it in range(6):
d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1) # n x k distances
c = d.argmin(1) # ASSIGN (EQ M5.2)
J = d[np.arange(len(X)), c].sum()
mu = np.array([X[c == j].mean(0) for j in range(k)]) # UPDATE (EQ M5.2)
print(f"sweep {it}: inertia J = {J:8.2f}")
print("\nrecovered centroids (true: -2,-1 / 2,-1.2 / 0.2,1.8):")
print(np.round(mu, 2))
plot_scatter(X[:, 0], X[:, 1], c)
Production reality, in three habits. (1) Never one run: restart from many random inits and keep the lowest \(J\) — or use k-means++ seeding (spread the initial centroids out proportionally to squared distance), which is the default in every serious library and carries an \(O(\log k)\) approximation guarantee. (2) Scale features first — squared Euclidean distance inherits every pathology Chapter 02 warned about, and an unscaled column silently owns the clustering. (3) At web scale, use minibatch k-means: same two steps, estimated on samples, for the same reason SGD exists.
What k-means cannot see
k-means is fast, simple, and everywhere — and it is opinionated in ways the output never confesses. The objective is built from squared Euclidean distance to a single center, so the method implicitly assumes every cluster is a compact, roughly spherical, roughly equal-sized ball. Hand it anything else and it will still return \(k\) tidy clusters, with total confidence, and they will be wrong:
| Baked-in assumption | How real data breaks it | What to reach for |
|---|---|---|
| Clusters are spherical | elongated or curved groups — Chapter 04's two moons get sliced crosswise, not traced | spectral clustering; DBSCAN |
| Similar size & density | one dense core next to a sparse halo: the boundary lands where the variances balance, not where humans see it | Gaussian mixtures (EM) — k-means with ellipses and soft assignments |
| Every point belongs somewhere | outliers drag centroids; a single rogue point can claim a centroid at large k | DBSCAN (has a noise label); trim or robustify first |
| k is known | it almost never is | elbow on J, silhouette score — both heuristics, neither decisive |
On choosing \(k\): inertia alone cannot do it — you proved in Instrument M5.1 that \(J\) falls monotonically as \(k\) grows, all the way to the absurd optimum of one centroid per point (\(J = 0\), the lookup table of Chapter 01 wearing a new disguise). The elbow method plots \(J\) against \(k\) and looks for the bend where added centroids stop paying; the silhouette score compares each point's distance to its own cluster against the nearest foreign one. Both are judgment calls dressed as numbers. When a downstream task exists — clusters feeding a classifier, segments feeding a campaign — let its metric choose \(k\); that converts an unanswerable unsupervised question back into a measurable supervised one, and it is the most honest trick in this chapter.
PCA: organized variance-hunting
Clustering compresses \(n\) points into \(k\) prototypes. The other great compression runs crosswise: keep every point, but shrink the number of coordinates used to describe it. Real datasets are wildly redundant — square footage and room count move together; pixel 2,001 mostly agrees with pixel 2,002. Redundancy means the data hugs a lower-dimensional sheet inside its nominal space, and principal component analysis finds the best flat sheet by a beautifully blunt criterion: keep the directions along which the data varies most. Center the data (always — PCA is blind without it), then ask for the unit direction that maximizes the variance of the projections:
Hunt the direction yourself before the eigensolver does. Below is a centered, correlated cloud; the slider aims a candidate axis, the red stalks are what projection onto that axis throws away, and the lower chart sweeps EQ M5.3's objective across every angle:
How many components to keep? The eigenvalue spectrum is the budget sheet — and the dropped eigenvalues are not vaguely "lost information", they are exactly the reconstruction error:
import numpy as np
rng = np.random.default_rng(11)
# a correlated 2-D cloud: most variance lives along one hidden direction
n = 300
t = rng.normal(0, 1.9, n) # signal along the hidden axis
s = rng.normal(0, 0.6, n) # noise across it
ang = np.deg2rad(34)
X = np.column_stack([t*np.cos(ang) - s*np.sin(ang),
t*np.sin(ang) + s*np.cos(ang)])
Xc = X - X.mean(0) # center first — always
C = Xc.T @ Xc / n # covariance matrix (EQ M5.3)
lam, U = np.linalg.eigh(C) # eigh: ascending order...
lam, U = lam[::-1], U[:, ::-1] # ...so flip to largest-first
print("eigenvalues :", np.round(lam, 3))
print("explained variance % :", np.round(100*lam/lam.sum(), 1))
print("PC1 angle (true 34deg):", round(np.degrees(np.arctan2(U[1,0], U[0,0])) % 180, 1))
Z = Xc @ U[:, :1] # project to 1-D: the embedding
Xr = Z @ U[:, :1].T # reconstruct from 1 component
print("reconstruction MSE :", round(float(((Xc - Xr)**2).mean()), 4))
print("dropped eigenvalue/2 :", round(lam[1]/2, 4), " <- EQ M5.4, per entry")
Three ways PCA lies to the unwary. (1) It is scale-covariant: measure one feature in millimeters instead of meters and it manufactures a fake principal direction — standardize first unless the units are genuinely shared. (2) Variance is not importance. PCA never saw your labels; the discriminating signal for a downstream task can sit in a low-variance direction PCA throws away first. (3) It is strictly linear: a sheet, never a curve. Data on Chapter 04's moons or a spiral has structure no single flat projection can keep — the cue for everything in the next section.
From compression to embeddings
Look again at what the PCA cell printed: it took each point and re-expressed it as coordinates \(z\) in a learned system where the axes are ordered by how much they matter. That object has a modern name — an embedding: a compact vector representation in which geometry encodes meaning, so that nearby vectors are similar things. PCA is the simplest embedding machine ever built, and the lineage from here to the frontier is unusually direct:
- Nonlinear map-making: t-SNE and UMAP. For looking at high-dimensional data, these bend the sheet, preserving local neighborhoods at the cost of global honesty. Use them for eyes, never for arithmetic: cluster sizes, inter-cluster distances, and density in such plots are artifacts of the optimizer as much as of the data, and both methods will draw confident-looking islands in pure noise.
- Autoencoders. Replace PCA's linear projection with Chapter 07's networks: an encoder squeezes \(x\) into a low-dimensional bottleneck, a decoder reconstructs, and the loss is reconstruction error — EQ M5.4's right-hand side, made trainable by gradient descent. A linear autoencoder provably recovers the PCA subspace; nonlinear ones learn curved sheets PCA cannot.
- Self-supervision: data labels itself. Delete part of the data and train a supervised model to restore it — next-token prediction is precisely this (Vol II · Ch 04). The "labels" are free, the scale is the internet, and the representations that fall out are the embeddings inside every LLM.
- Embeddings meet Chapter 04. Today's retrieval stack is this chapter's two ideas reassembled: a learned embedding (this section) plus nearest-neighbor search over it (Ch 04's k-NN) is vector search — the machinery under RAG and semantic search. Even the curse of dimensionality resolves on cue: embeddings work because real data concentrates near low-dimensional structure, which is the bet PCA placed first.
The through-line deserves saying plainly. k-means compresses a dataset into \(k\) prototypes; PCA compresses it into \(d\) directions; an autoencoder into a bottleneck; a language model into weights that can regenerate the statistics of its corpus. In every case the test is the same — how much of the data can the summary give back? — and in every case better compression has meant something uncomfortably like better understanding. That is not a metaphor the field decorates itself with; it is the actual objective the largest training runs in history are minimizing.
Your toolbox is now complete enough to be dangerous — supervised and unsupervised, parametric and memory-based. Chapter 06 supplies the discipline that decides whether any of it survives contact with new data: bias against variance, the capacity U-curve and its modern double-descent twist, regularization, and the validation hygiene that separates a result from an artifact.
Further reading
- Lloyd, S. (1982). Least Squares Quantization in PCM. — the iterative assign-then-recenter algorithm now universally known as k-means (written 1957, published 1982).
- Arthur, D. & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. — the smart initialization that fixes Lloyd's sensitivity to starting centroids.
- Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. — the geometric origin of principal component analysis as best-fit subspaces.
- Hotelling, H. (1933). Analysis of a Complex of Statistical Variables into Principal Components. — the variance-maximization formulation of PCA used today.
- van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. — the canonical nonlinear method for embedding high-dimensional data into 2-D maps.
- McInnes, L., Healy, J. & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. — a faster, structure-preserving alternative to t-SNE for embeddings.