What k-means cannot see (recap)
Chapter 05 built k-means from scratch: drop \(k\) centroids, assign each point to its nearest one, recompute each centroid as the mean of its members, repeat. The whole method optimizes a single objective — the within-cluster sum of squared distances, or inertia — and that one choice of objective bakes in three assumptions the output never confesses to.
The recap, then, is a list of failure modes — and a map of which animal in the zoo fixes each one:
| k-means assumes… | How data breaks it | Reach for… |
|---|---|---|
| Clusters are convex, round balls | two interleaving crescents (the two-moons set of Ch 04) get sliced crosswise, never traced | DBSCAN · spectral clustering |
| Clusters are equal-sized & isotropic | a stretched ellipse, or a dense core beside a sparse halo: the boundary lands where variances balance, not where you would draw it | Gaussian mixtures (full covariance, soft assignment) |
| Every point belongs somewhere | a handful of outliers each capture, or badly drag, a centroid | DBSCAN (has a literal noise label) |
| \(k\) is known in advance | it almost never is; inertia falls monotonically with \(k\) and cannot choose it for you | dendrogram cuts · DBSCAN (no \(k\)) · silhouette / BIC (§12.5) |
One honest framing before the algorithms arrive. There is no universally "correct" clustering — clustering is the search for structure no label ever defined, so every method optimizes a proxy (compactness, connectivity, density, likelihood) and the right method is the one whose proxy matches your purpose. The zoo is not a ladder from worse to better; it is a set of differently-shaped lenses. The instrument below is the whole chapter in one frame: one dataset, three lenses, three verdicts.
Hierarchical clustering & dendrograms
k-means demands you commit to \(k\) before you have seen any structure. Agglomerative hierarchical clustering refuses to commit: it builds the entire family of clusterings at once, from \(n\) singletons up to one all-encompassing cluster, and lets you read off whichever level you like afterward. The algorithm is almost embarrassingly simple — start with every point as its own cluster, then repeatedly merge the two closest clusters until only one remains:
The output is not a flat partition but a dendrogram: a binary tree whose leaves are the \(n\) points and whose every internal node is a merge, drawn at a height equal to the linkage distance at which that merge happened. The tree records the complete order in which structure assembled. To extract clusters you draw a horizontal line at some height \(h\) and cut: every branch the line crosses becomes one cluster. Cut low and you get many tight clusters; cut high and they fuse into a few loose ones — and the big vertical gaps in the tree (long stretches where no merge happens) mark the most natural, most stable places to cut.
The arithmetic of the tree itself is fixed and worth internalizing. Each merge reduces the cluster count by exactly one, starting from \(n\) and ending at \(1\), so a dataset of \(n\) points always produces exactly \(n-1\) merges — and therefore \(n-1\) internal nodes in the dendrogram.
Explore the cut directly. The instrument below builds a Ward dendrogram over a small seeded set; drag the cut height and watch clusters split and fuse, with the resulting partition coloured on the scatter beside it.
Cost and when to use it. The naive algorithm is \(O(n^3)\) time and \(O(n^2)\) memory (it holds the full pairwise-distance matrix), so vanilla agglomerative clustering tops out around tens of thousands of points — SLINK/CLINK bring single and complete linkage down to \(O(n^2)\), and for larger \(n\) you sub-sample or switch tools. Its real strengths are the nested structure (taxonomies, gene-expression heatmaps, document hierarchies) and not having to pick \(k\) up front. The honest caveat: agglomerative merges are greedy and irrevocable — a bad early merge can never be undone, which is precisely the weakness density-based methods sidestep next.
DBSCAN — density-based clustering
Centre-based methods ask "which prototype is this point near?". DBSCAN (Density-Based Spatial Clustering of Applications with Noise) asks a different and often better question: "is this point in a crowded neighbourhood, and is that crowd connected to other crowds?". A cluster, in this view, is simply a connected region of high point density, of any shape, surrounded by regions of low density. That single reframing buys three things k-means cannot offer: arbitrary cluster shapes, no need to specify \(k\), and a built-in notion of noise.
DBSCAN has exactly two parameters and three kinds of point. The parameters are \(\varepsilon\) (the neighbourhood radius) and minPts (how many neighbours make a point "dense"). The point types follow:
minPts neighbours in that ball — it sits in a crowd. A border point is not itself dense but falls within \(\varepsilon\) of some core point — it is on the fringe of a crowd. A noise point is neither: too few neighbours, and not close enough to any core. Clusters grow by density-reachability: start at any unvisited core point and absorb everything reachable through a chain of overlapping \(\varepsilon\)-balls of core points. Because the cluster follows the chain of dense neighbours rather than a distance-to-centre, it can bend into any shape — crescents, spirals, rings — that k-means would shatter.The noise label is the quiet superpower. Every other method in this chapter forces every point into some cluster; DBSCAN explicitly refuses, tagging genuinely isolated points as noise rather than letting one outlier drag a centroid across the plane. This is a feature, not a bug — but it is also the source of the most common exam confusion, so state it plainly: a point in a sparse, low-density region (too few neighbours within \(\varepsilon\), and not within \(\varepsilon\) of any core point) is labelled noise and left out of every cluster.
minPts neighbours) nor border (not within \( \varepsilon \) of a core point). Unlike k-means, hierarchical clustering, or GMM — all of which assign every point to some cluster — DBSCAN has a dedicated noise label for low-density points. The statement is true.Choosing the two knobs is the whole craft of DBSCAN. minPts is usually set to roughly \(2 \times \text{dimensionality}\) (so \(\approx 4\) for 2-D data; larger for noisier or higher-dimensional data), and \(\varepsilon\) is read off a k-distance plot: sort every point's distance to its \(k\)-th nearest neighbour and look for the "elbow" where the curve turns sharply upward — that knee is the density boundary between cluster and noise. The build-it-yourself cell below implements the whole algorithm in numpy and runs it on the two-moons set k-means famously fails on.
# DBSCAN from scratch (EQ M12.3) on two interleaving moons.
# k-means slices the crescents; density-reachability traces them.
import numpy as np
rng = np.random.default_rng(7)
def two_moons(n=150, noise=0.07):
t = np.linspace(0, np.pi, n)
a = np.c_[np.cos(t), np.sin(t)] # upper crescent
b = np.c_[1 - np.cos(t), 1 - np.sin(t) - 0.5] # lower, offset crescent
X = np.vstack([a, b]) + rng.normal(0, noise, (2*n, 2))
return X
def dbscan(X, eps, min_pts):
n = len(X); labels = np.full(n, -1); cid = -1 # -1 = noise/unset
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1)) # pairwise distances
neigh = [np.where(D[i] <= eps)[0] for i in range(n)] # eps-balls (EQ M12.3)
visited = np.zeros(n, bool)
for i in range(n):
if visited[i]: continue
visited[i] = True
if len(neigh[i]) < min_pts: continue # not core -> leave as noise
cid += 1; labels[i] = cid; seeds = list(neigh[i]) # start a new cluster
k = 0
while k < len(seeds): # grow by reachability
j = seeds[k]; k += 1
if labels[j] == -1: labels[j] = cid # absorb a border point
if not visited[j]:
visited[j] = True
if len(neigh[j]) >= min_pts: # j is core -> expand
seeds += [q for q in neigh[j] if q not in seeds]
return labels
X = two_moons()
labels = dbscan(X, eps=0.22, min_pts=5)
n_clusters = labels.max() + 1
n_noise = int((labels == -1).sum())
print(f"clusters found : {n_clusters} (truth = 2 crescents)")
print(f"noise points : {n_noise} (k-means would force ALL of these into a cluster)")
plot_scatter(X[:, 0], X[:, 1], labels)
The honest limitations — and the modern fix. DBSCAN uses a single global \(\varepsilon\), so it struggles when clusters have very different densities (the right \(\varepsilon\) for the dense core is wrong for the sparse halo — you watched this in Instrument M12.1's CORE + HALO set), and like all Euclidean-distance methods it degrades in high dimensions as distances concentrate. Its near-universal successor, HDBSCAN (2017), removes the \(\varepsilon\) knob entirely: it builds a hierarchy across all density levels and extracts the most stable clusters, handling variable density gracefully — it is the default density clusterer in practice today. Worth knowing the lineage: DBSCAN is the idea, HDBSCAN is what you usually run.
Gaussian Mixture Models & EM
k-means makes two hard commitments per point: a hard assignment (you belong to cluster 3, full stop) and a single isotropic radius (every cluster is a circle). The Gaussian mixture model softens both. It models the data as having been generated by \(k\) Gaussians blended together: to draw a point, first pick a component \(j\) with probability \(\pi_j\), then sample from that component's Gaussian \(\mathcal{N}(\mu_j, \Sigma_j)\). The density of any point is the weighted sum over all components:
Because each component is a full Gaussian, a GMM does not say "point \(i\) is in cluster \(j\)". It computes a responsibility \(\gamma_{ij}\) — the posterior probability that component \(j\) generated point \(i\) — a soft, fractional membership that can be 70% one cluster and 30% another. That softness is the whole point: it honestly represents the ambiguity of points sitting between clusters, which hard k-means simply discards.
The parameter count is where the freedom shows its price. Each component carries a mean (one number per dimension), a covariance, and a weight. For a 2-D, full-covariance model the means alone are \(2\) numbers per component — \(2k\) in total — and that is the figure the exercise below pins down, because it is the cheapest way to feel how parameters scale with \(k\).
Watch EM converge one step at a time. The stepper seeds two Gaussians badly, then alternates E and M: the E-step recolours every point by its responsibility (a blend, never a hard pick), and the M-step slides and reshapes the two ellipses to fit. The log-likelihood climbs monotonically in the readout — that climb is the convergence guarantee of EQ M12.5.
And the 1-D version in numpy, stripped to its skeleton: EM for a two-component mixture on a line. Watch the two means separate to the true generating centres and the weights settle near their true mixing proportions as the log-likelihood plateaus.
# EM for a 1-D two-component GMM (EQ M12.5). Recover means & weights.
import numpy as np
rng = np.random.default_rng(0)
# true mixture: 40% N(0,1), 60% N(5, 1.5^2)
x = np.concatenate([rng.normal(0.0, 1.0, 200),
rng.normal(5.0, 1.5, 300)])
n = len(x)
def normal(x, mu, var): # 1-D Gaussian density
return np.exp(-(x - mu)**2 / (2*var)) / np.sqrt(2*np.pi*var)
mu = np.array([-1.0, 1.0]) # deliberately bad init
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for it in range(40):
# E-step: responsibilities gamma[i, j]
comp = np.stack([pi[j]*normal(x, mu[j], var[j]) for j in range(2)], axis=1)
ll = np.log(comp.sum(1)).sum() # data log-likelihood (climbs each iter)
g = comp / comp.sum(1, keepdims=True)
# M-step: weighted means, variances, weights
Nj = g.sum(0)
pi = Nj / n
mu = (g * x[:, None]).sum(0) / Nj
var = (g * (x[:, None] - mu)**2).sum(0) / Nj
if it in (0, 4, 39):
print(f"iter {it:2d}: loglik={ll:8.1f} means={np.round(mu,2)} weights={np.round(pi,2)}")
print("\nconverged means :", np.round(np.sort(mu), 2), " (true: 0.0, 5.0)")
print("converged weights:", np.round(pi[np.argsort(mu)], 2), " (true: 0.4, 0.6)")
Choosing k & validating clusters
Every method here either needs \(k\) (k-means, GMM, a dendrogram cut) or trades it for density knobs (DBSCAN's \(\varepsilon\), minPts). Without labels there is no accuracy to optimize, so validation splits into two honest questions: internal validation (is the clustering geometrically good on its own terms?) and the far better external answer (does a downstream task care?). The internal tools each have a failure mode, and knowing the failure mode is the skill.
The elbow method plots inertia \(J\) against \(k\) and looks for the bend where adding centres stops paying. It is the weakest tool here, because — as Chapter 05 proved — \(J\) falls monotonically with \(k\) all the way to the absurd \(J = 0\) at one centre per point, so there is often no clean elbow at all. The silhouette score is sharper: for each point it compares the mean distance to its own cluster against the mean distance to the nearest other cluster.
For probabilistic models there is a principled alternative that penalizes complexity directly. A GMM can always raise its likelihood by adding components, so you cannot pick \(k\) by likelihood alone — but the Bayesian Information Criterion subtracts a penalty for every parameter, trading fit against parsimony:
The genuinely decisive answer is external. When the clusters feed something downstream — segments feeding a campaign, codes feeding a classifier, groups a domain expert will inspect — let that task's metric choose \(k\) and the algorithm. This converts an unanswerable unsupervised question back into a measurable one, and it is the most honest move in the whole chapter. Where you have at least some ground-truth labels, the Adjusted Rand Index compares a clustering to the truth while correcting for chance agreement (0 = random, 1 = perfect), and is the standard external score.
Four traps that quietly invalidate a clustering. (1) Scale. Every Euclidean method here inherits Chapter 02's pathology — an unscaled feature in different units silently owns the result; standardize first unless the units are genuinely shared. (2) Metric chases shape. Silhouette and inertia reward convex blobs, so they will rank a wrong k-means clustering above a correct DBSCAN one — never validate a density method with a centre-based score. (3) The curse of dimensionality. In high dimensions all pairwise distances concentrate toward equality, so distance-based clustering loses its grip; cluster in a learned low-dimensional embedding (next chapter) instead. (4) Clusters always appear. Every algorithm here returns clusters even on pure noise — before trusting any partition, ask whether structure exists at all (e.g. the Hopkins statistic), because a confident clustering of structureless data is the most seductive artifact in the field.
Clustering compresses \(n\) points into a handful of groups; the next chapter compresses a whole matrix into a handful of factors. Chapter 13 — Matrix Factorization — turns the user-item rating tables behind every recommender into low-rank products of latent factors, connects SVD, NMF and ALS, and shows how the same algebra that powers collaborative filtering reappears as the embedding machinery throughout modern AI.
References
- Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
- Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm.
- Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.
- Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function.
- Schwarz, G. (1978). Estimating the Dimension of a Model.
- Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.
- Hubert, L. & Arabie, P. (1985). Comparing Partitions.