AI // ENCYCLOPEDIA / MACHINE LEARNING / 12 / CLUSTERING ZOO INDEX NEXT: MATRIX FACTORIZATION →
MACHINE LEARNING · CHAPTER 12 / 15

The Clustering Zoo — Hierarchical, DBSCAN & GMM

k-means is fast and simple, but its squared-distance-to-a-centre objective can only carve the plane into round, equal, convex blobs. Hand it two interleaving crescents, a dense core inside a sparse halo, or a stretched ellipse, and it returns a confident but wrong partition. The rest of the clustering zoo exists to see the shapes k-means cannot: linking points into trees, chasing density instead of distance, and replacing hard balls with soft, full-covariance Gaussians.

LEVELCORE READING TIME≈ 28 MIN BUILDS ONML 05 · 11 INSTRUMENTSCOMPARATOR · DENDROGRAM · EM STEPPER
12.1

What k-means cannot see (recap)

Chapter 05 built k-means from scratch: drop \(k\) centroids, assign each point to its nearest one, recompute each centroid as the mean of its members, repeat. The whole method optimizes a single objective — the within-cluster sum of squared distances, or inertia — and that one choice of objective bakes in three assumptions the output never confesses to.

EQ M12.1 — WHAT THE INERTIA OBJECTIVE ASSUMES $$ J \;=\; \sum_{i=1}^{n}\big\lVert x_i - \mu_{c_i}\big\rVert^{2} \;=\; \sum_{j=1}^{k}\sum_{i \in S_j}\big\lVert x_i - \mu_j \big\rVert^{2} $$
Squared Euclidean distance to a single centre \(\mu_j\) is the only shape information k-means has. Three consequences fall straight out of that formula: clusters are implicitly spherical (every direction is penalized equally — a circle, never an ellipse), equal-sized (a point joins whichever centre is nearer in raw distance, so a big loose cluster cannibalizes a small tight one), and exhaustive (every point must join some cluster — there is no term for "this is an outlier"). Real data violates all three routinely; k-means violates them silently.

The recap, then, is a list of failure modes — and a map of which animal in the zoo fixes each one:

k-means assumes…How data breaks itReach for…
Clusters are convex, round ballstwo interleaving crescents (the two-moons set of Ch 04) get sliced crosswise, never tracedDBSCAN · spectral clustering
Clusters are equal-sized & isotropica stretched ellipse, or a dense core beside a sparse halo: the boundary lands where variances balance, not where you would draw itGaussian mixtures (full covariance, soft assignment)
Every point belongs somewherea handful of outliers each capture, or badly drag, a centroidDBSCAN (has a literal noise label)
\(k\) is known in advanceit almost never is; inertia falls monotonically with \(k\) and cannot choose it for youdendrogram cuts · DBSCAN (no \(k\)) · silhouette / BIC (§12.5)

One honest framing before the algorithms arrive. There is no universally "correct" clustering — clustering is the search for structure no label ever defined, so every method optimizes a proxy (compactness, connectivity, density, likelihood) and the right method is the one whose proxy matches your purpose. The zoo is not a ladder from worse to better; it is a set of differently-shaped lenses. The instrument below is the whole chapter in one frame: one dataset, three lenses, three verdicts.

INSTRUMENT M12.1 — ALGORITHM COMPARATORONE DATASET · k-MEANS vs DBSCAN vs GMM
CLUSTERS FOUND
NOISE / UNASSIGNED
VERDICT vs TRUTH
Each dataset has a "right" answer your eye can see. Start on TWO MOONS and step through the algorithms: k-means slices both crescents straight across (the centres land between the moons, not along them); the Gaussian mixture, still centre-based, does little better; only DBSCAN traces each crescent by following the chain of dense neighbours. Switch to ELLIPSES and the story inverts — there the soft, full-covariance GMM wins and DBSCAN, with one global density threshold, struggles. CORE + HALO punishes the single threshold for everyone. No animal wins every dataset; matching the lens to the shape is the entire skill.
12.2

Hierarchical clustering & dendrograms

k-means demands you commit to \(k\) before you have seen any structure. Agglomerative hierarchical clustering refuses to commit: it builds the entire family of clusterings at once, from \(n\) singletons up to one all-encompassing cluster, and lets you read off whichever level you like afterward. The algorithm is almost embarrassingly simple — start with every point as its own cluster, then repeatedly merge the two closest clusters until only one remains:

EQ M12.2 — AGGLOMERATIVE MERGE & LINKAGE $$ (A^\star, B^\star) \;=\; \arg\min_{A \neq B}\; D(A, B), \qquad D_{\text{single}} = \min_{a \in A,\, b \in B}\lVert a - b\rVert, \quad D_{\text{complete}} = \max_{a \in A,\, b \in B}\lVert a - b\rVert, \quad D_{\text{ward}} = \Delta\,\text{SSE} $$
Everything turns on \(D(A,B)\), the linkage — how you measure the distance between two clusters, not two points. Single linkage uses the nearest pair, so it chains along thin filaments and can trace non-convex shapes (but suffers "chaining", merging distinct groups joined by a single bridge of points). Complete linkage uses the farthest pair, forcing compact, ball-like clusters. Ward linkage merges the pair that increases total within-cluster squared error least — it is, in effect, hierarchical k-means and the most common default. Different linkages produce genuinely different trees from identical data; the choice is a modelling decision, not a detail.

The output is not a flat partition but a dendrogram: a binary tree whose leaves are the \(n\) points and whose every internal node is a merge, drawn at a height equal to the linkage distance at which that merge happened. The tree records the complete order in which structure assembled. To extract clusters you draw a horizontal line at some height \(h\) and cut: every branch the line crosses becomes one cluster. Cut low and you get many tight clusters; cut high and they fuse into a few loose ones — and the big vertical gaps in the tree (long stretches where no merge happens) mark the most natural, most stable places to cut.

The arithmetic of the tree itself is fixed and worth internalizing. Each merge reduces the cluster count by exactly one, starting from \(n\) and ending at \(1\), so a dataset of \(n\) points always produces exactly \(n-1\) merges — and therefore \(n-1\) internal nodes in the dendrogram.

You run agglomerative hierarchical clustering on a dataset of \( n = 50 \) points, all the way up to a single root cluster. How many merge operations does the algorithm perform in total?
Each merge takes two clusters and makes one, so the cluster count drops by exactly 1 per merge. Starting at \( n = 50 \) singletons and ending at 1 cluster requires \( 50 - 1 = \) 49 merges — and the dendrogram therefore has 49 internal nodes, one per merge.
A dendrogram's six leaves are joined by five merges at heights \( 0.4,\ 0.9,\ 1.1,\ 1.2,\ 2.8 \). You cut the tree with a horizontal line at height \( h = 1.0 \). How many clusters does the cut produce?
A cut undoes every merge that happened above \( h \) and keeps every merge below it intact. The merges at \( 0.4 \) and \( 0.9 \) are below \( 1.0 \) and survive; the three at \( 1.1, 1.2, 2.8 \) are above the line and are undone. Each surviving merge fuses two groups into one, so the count is \( 6 \text{ leaves} - 2 \text{ surviving merges} = \) 4 clusters. Equivalently, clusters \( = 1 + (\text{merges above the cut}) = 1 + 3 = 4 \).

Explore the cut directly. The instrument below builds a Ward dendrogram over a small seeded set; drag the cut height and watch clusters split and fuse, with the resulting partition coloured on the scatter beside it.

INSTRUMENT M12.2 — DENDROGRAM EXPLORERWARD LINKAGE · DRAG THE CUT · EQ M12.2
CLUSTERS AT THIS CUT
CUT HEIGHT
LARGEST GAP (NATURAL CUT)
The dashed white line is your cut; everything it crosses is a cluster, coloured live on the scatter at right. Slide it down through a long vertical gap and the cluster count is stable across the whole gap — that flatness is exactly why the gap marks a "natural" number of clusters. Now switch linkage: SINGLE chains the points into long straggly groups (watch one bridge merge two visually-separate blobs early), while COMPLETE and WARD insist on compact balls. Same points, three different trees — linkage is a real choice.

Cost and when to use it. The naive algorithm is \(O(n^3)\) time and \(O(n^2)\) memory (it holds the full pairwise-distance matrix), so vanilla agglomerative clustering tops out around tens of thousands of points — SLINK/CLINK bring single and complete linkage down to \(O(n^2)\), and for larger \(n\) you sub-sample or switch tools. Its real strengths are the nested structure (taxonomies, gene-expression heatmaps, document hierarchies) and not having to pick \(k\) up front. The honest caveat: agglomerative merges are greedy and irrevocable — a bad early merge can never be undone, which is precisely the weakness density-based methods sidestep next.

12.3

DBSCAN — density-based clustering

Centre-based methods ask "which prototype is this point near?". DBSCAN (Density-Based Spatial Clustering of Applications with Noise) asks a different and often better question: "is this point in a crowded neighbourhood, and is that crowd connected to other crowds?". A cluster, in this view, is simply a connected region of high point density, of any shape, surrounded by regions of low density. That single reframing buys three things k-means cannot offer: arbitrary cluster shapes, no need to specify \(k\), and a built-in notion of noise.

DBSCAN has exactly two parameters and three kinds of point. The parameters are \(\varepsilon\) (the neighbourhood radius) and minPts (how many neighbours make a point "dense"). The point types follow:

EQ M12.3 — CORE, BORDER, NOISE $$ N_\varepsilon(p) = \{\, q : \lVert p - q \rVert \le \varepsilon \,\}, \qquad p \text{ is a } \textbf{core point} \iff \lvert N_\varepsilon(p) \rvert \ge \texttt{minPts} $$
\(N_\varepsilon(p)\) is the set of points within radius \(\varepsilon\) of \(p\), including \(p\) itself. A core point has at least minPts neighbours in that ball — it sits in a crowd. A border point is not itself dense but falls within \(\varepsilon\) of some core point — it is on the fringe of a crowd. A noise point is neither: too few neighbours, and not close enough to any core. Clusters grow by density-reachability: start at any unvisited core point and absorb everything reachable through a chain of overlapping \(\varepsilon\)-balls of core points. Because the cluster follows the chain of dense neighbours rather than a distance-to-centre, it can bend into any shape — crescents, spirals, rings — that k-means would shatter.

The noise label is the quiet superpower. Every other method in this chapter forces every point into some cluster; DBSCAN explicitly refuses, tagging genuinely isolated points as noise rather than letting one outlier drag a centroid across the plane. This is a feature, not a bug — but it is also the source of the most common exam confusion, so state it plainly: a point in a sparse, low-density region (too few neighbours within \(\varepsilon\), and not within \(\varepsilon\) of any core point) is labelled noise and left out of every cluster.

True or false: DBSCAN labels points that lie in low-density regions — too few neighbours within \( \varepsilon \), and not within \( \varepsilon \) of any core point — as noise, leaving them out of every cluster. (Enter true or false.)
This is exactly the definition of a noise point in EQ M12.3: neither core (fewer than minPts neighbours) nor border (not within \( \varepsilon \) of a core point). Unlike k-means, hierarchical clustering, or GMM — all of which assign every point to some cluster — DBSCAN has a dedicated noise label for low-density points. The statement is true.

Choosing the two knobs is the whole craft of DBSCAN. minPts is usually set to roughly \(2 \times \text{dimensionality}\) (so \(\approx 4\) for 2-D data; larger for noisier or higher-dimensional data), and \(\varepsilon\) is read off a k-distance plot: sort every point's distance to its \(k\)-th nearest neighbour and look for the "elbow" where the curve turns sharply upward — that knee is the density boundary between cluster and noise. The build-it-yourself cell below implements the whole algorithm in numpy and runs it on the two-moons set k-means famously fails on.

PYTHON · RUNNABLE IN-BROWSER
# DBSCAN from scratch (EQ M12.3) on two interleaving moons.
# k-means slices the crescents; density-reachability traces them.
import numpy as np
rng = np.random.default_rng(7)

def two_moons(n=150, noise=0.07):
    t = np.linspace(0, np.pi, n)
    a = np.c_[np.cos(t),        np.sin(t)]          # upper crescent
    b = np.c_[1 - np.cos(t), 1 - np.sin(t) - 0.5]   # lower, offset crescent
    X = np.vstack([a, b]) + rng.normal(0, noise, (2*n, 2))
    return X

def dbscan(X, eps, min_pts):
    n = len(X); labels = np.full(n, -1); cid = -1            # -1 = noise/unset
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))       # pairwise distances
    neigh = [np.where(D[i] <= eps)[0] for i in range(n)]     # eps-balls (EQ M12.3)
    visited = np.zeros(n, bool)
    for i in range(n):
        if visited[i]: continue
        visited[i] = True
        if len(neigh[i]) < min_pts: continue                # not core -> leave as noise
        cid += 1; labels[i] = cid; seeds = list(neigh[i])   # start a new cluster
        k = 0
        while k < len(seeds):                               # grow by reachability
            j = seeds[k]; k += 1
            if labels[j] == -1: labels[j] = cid             # absorb a border point
            if not visited[j]:
                visited[j] = True
                if len(neigh[j]) >= min_pts:                # j is core -> expand
                    seeds += [q for q in neigh[j] if q not in seeds]
    return labels

X = two_moons()
labels = dbscan(X, eps=0.22, min_pts=5)
n_clusters = labels.max() + 1
n_noise = int((labels == -1).sum())
print(f"clusters found : {n_clusters}   (truth = 2 crescents)")
print(f"noise points   : {n_noise}   (k-means would force ALL of these into a cluster)")
plot_scatter(X[:, 0], X[:, 1], labels)
edits are live — drop eps to 0.1 and watch the moons shatter into noise

The honest limitations — and the modern fix. DBSCAN uses a single global \(\varepsilon\), so it struggles when clusters have very different densities (the right \(\varepsilon\) for the dense core is wrong for the sparse halo — you watched this in Instrument M12.1's CORE + HALO set), and like all Euclidean-distance methods it degrades in high dimensions as distances concentrate. Its near-universal successor, HDBSCAN (2017), removes the \(\varepsilon\) knob entirely: it builds a hierarchy across all density levels and extracts the most stable clusters, handling variable density gracefully — it is the default density clusterer in practice today. Worth knowing the lineage: DBSCAN is the idea, HDBSCAN is what you usually run.

12.4

Gaussian Mixture Models & EM

k-means makes two hard commitments per point: a hard assignment (you belong to cluster 3, full stop) and a single isotropic radius (every cluster is a circle). The Gaussian mixture model softens both. It models the data as having been generated by \(k\) Gaussians blended together: to draw a point, first pick a component \(j\) with probability \(\pi_j\), then sample from that component's Gaussian \(\mathcal{N}(\mu_j, \Sigma_j)\). The density of any point is the weighted sum over all components:

EQ M12.4 — THE MIXTURE DENSITY $$ p(x) \;=\; \sum_{j=1}^{k} \pi_j\, \mathcal{N}\!\big(x \mid \mu_j,\, \Sigma_j\big), \qquad \pi_j \ge 0,\quad \sum_{j=1}^{k}\pi_j = 1 $$
\(\pi_j\) is the mixing weight (the prior probability of component \(j\)); \(\mu_j\) its mean; \(\Sigma_j\) its covariance matrix, which is what frees the model from k-means' circles. A diagonal \(\Sigma_j\) gives axis-aligned ellipses; a full \(\Sigma_j\) gives ellipses tilted at any angle and of any aspect ratio — so a GMM can fit the stretched, rotated clusters k-means mangles. The constraint \(\sum_j \pi_j = 1\) makes \(p(x)\) a proper probability density. In fact k-means is the limiting case of a GMM with equal weights, shared spherical covariance \(\Sigma_j = \sigma^2 I\), and \(\sigma \to 0\) forcing assignments hard.

Because each component is a full Gaussian, a GMM does not say "point \(i\) is in cluster \(j\)". It computes a responsibility \(\gamma_{ij}\) — the posterior probability that component \(j\) generated point \(i\) — a soft, fractional membership that can be 70% one cluster and 30% another. That softness is the whole point: it honestly represents the ambiguity of points sitting between clusters, which hard k-means simply discards.

EQ M12.5 — EM: THE E-STEP AND M-STEP $$ \textbf{E:}\;\; \gamma_{ij} = \frac{\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k}\pi_l\,\mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \qquad\qquad \textbf{M:}\;\; \pi_j = \frac{N_j}{n},\;\; \mu_j = \frac{1}{N_j}\sum_i \gamma_{ij} x_i,\;\; \Sigma_j = \frac{1}{N_j}\sum_i \gamma_{ij}(x_i - \mu_j)(x_i - \mu_j)^{\top} $$
where \(N_j = \sum_i \gamma_{ij}\) is the effective number of points in component \(j\). The Expectation–Maximization algorithm alternates exactly as Lloyd's loop does, but soft: the E-step fixes the parameters and computes every responsibility (a soft "assign"); the M-step fixes the responsibilities and re-estimates each \(\pi_j, \mu_j, \Sigma_j\) as responsibility-weighted statistics (a soft "update"). Each full iteration is guaranteed to increase the data log-likelihood \(\sum_i \log p(x_i)\) — or leave it unchanged — and it converges to a local maximum that depends on the initialization, so multiple restarts (often k-means++ seeding) are standard. Replace the soft \(\gamma_{ij}\) with a hard 0/1 assignment and EM becomes k-means.

The parameter count is where the freedom shows its price. Each component carries a mean (one number per dimension), a covariance, and a weight. For a 2-D, full-covariance model the means alone are \(2\) numbers per component — \(2k\) in total — and that is the figure the exercise below pins down, because it is the cheapest way to feel how parameters scale with \(k\).

A Gaussian mixture in 2-D has \( k = 5 \) full-covariance components. Counting only the mean parameters (each component's mean is a point in 2-D), how many mean parameters does the model have in total?
Each mean \( \mu_j \) lives in \( \mathbb{R}^2 \), so it contributes 2 numbers; with \( k \) components the means contribute \( 2k \) parameters in total. For \( k = 5 \): \( 2 \times 5 = \) 10 mean parameters. (For the full picture each full covariance adds \( d(d+1)/2 = 3 \) more per component, plus \( k-1 \) free weights — but the means alone scale as \( 2k \).)

Watch EM converge one step at a time. The stepper seeds two Gaussians badly, then alternates E and M: the E-step recolours every point by its responsibility (a blend, never a hard pick), and the M-step slides and reshapes the two ellipses to fit. The log-likelihood climbs monotonically in the readout — that climb is the convergence guarantee of EQ M12.5.

INSTRUMENT M12.3 — GMM / EM STEPPER2 COMPONENTS · RESPONSIBILITIES + FITTED GAUSSIANS · EQ M12.5
NEXT HALF-STEP
E-STEP
ITERATIONS
0
LOG-LIKELIHOOD
The two ellipses are the fitted Gaussians at \( \pm 2\sigma \); point colour blends mint and blue by responsibility, so a purple point is one EM is genuinely unsure about. Press E / M STEP alternately: the E-step only recolours (parameters frozen), the M-step only moves and reshapes the ellipses (colours frozen). The log-likelihood never falls — that monotone climb is EQ M12.5's guarantee made visible. Now RE-INIT a few times: most starts find the two true clusters, but a bad seed can collapse a component onto a few points (likelihood spikes toward \( +\infty \)) — the singularity pathology that real implementations regularize \( \Sigma_j \) to avoid.

And the 1-D version in numpy, stripped to its skeleton: EM for a two-component mixture on a line. Watch the two means separate to the true generating centres and the weights settle near their true mixing proportions as the log-likelihood plateaus.

PYTHON · RUNNABLE IN-BROWSER
# EM for a 1-D two-component GMM (EQ M12.5). Recover means & weights.
import numpy as np
rng = np.random.default_rng(0)

# true mixture: 40% N(0,1), 60% N(5, 1.5^2)
x = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(5.0, 1.5, 300)])
n = len(x)

def normal(x, mu, var):                       # 1-D Gaussian density
    return np.exp(-(x - mu)**2 / (2*var)) / np.sqrt(2*np.pi*var)

mu  = np.array([-1.0, 1.0])                    # deliberately bad init
var = np.array([1.0, 1.0])
pi  = np.array([0.5, 0.5])

for it in range(40):
    # E-step: responsibilities gamma[i, j]
    comp = np.stack([pi[j]*normal(x, mu[j], var[j]) for j in range(2)], axis=1)
    ll = np.log(comp.sum(1)).sum()            # data log-likelihood (climbs each iter)
    g = comp / comp.sum(1, keepdims=True)
    # M-step: weighted means, variances, weights
    Nj = g.sum(0)
    pi  = Nj / n
    mu  = (g * x[:, None]).sum(0) / Nj
    var = (g * (x[:, None] - mu)**2).sum(0) / Nj
    if it in (0, 4, 39):
        print(f"iter {it:2d}: loglik={ll:8.1f}  means={np.round(mu,2)}  weights={np.round(pi,2)}")

print("\nconverged means  :", np.round(np.sort(mu), 2), " (true: 0.0, 5.0)")
print("converged weights:", np.round(pi[np.argsort(mu)], 2), " (true: 0.4, 0.6)")
edits are live — set both init means to 2.0 and watch a bad start stall
12.5

Choosing k & validating clusters

Every method here either needs \(k\) (k-means, GMM, a dendrogram cut) or trades it for density knobs (DBSCAN's \(\varepsilon\), minPts). Without labels there is no accuracy to optimize, so validation splits into two honest questions: internal validation (is the clustering geometrically good on its own terms?) and the far better external answer (does a downstream task care?). The internal tools each have a failure mode, and knowing the failure mode is the skill.

The elbow method plots inertia \(J\) against \(k\) and looks for the bend where adding centres stops paying. It is the weakest tool here, because — as Chapter 05 proved — \(J\) falls monotonically with \(k\) all the way to the absurd \(J = 0\) at one centre per point, so there is often no clean elbow at all. The silhouette score is sharper: for each point it compares the mean distance to its own cluster against the mean distance to the nearest other cluster.

EQ M12.6 — SILHOUETTE OF A POINT $$ s(i) \;=\; \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)} \;\in\; [-1,\, 1], \qquad a(i) = \text{mean dist to own cluster}, \quad b(i) = \text{mean dist to nearest other cluster} $$
\(a(i)\) measures cohesion (how tight your own cluster is), \(b(i)\) measures separation (how far the nearest rival cluster is). \(s(i) \to 1\) means the point is deep inside a well-separated cluster; \(s(i) \approx 0\) means it sits on a boundary; \(s(i) < 0\) means it is closer to a neighbouring cluster than its own — a likely misassignment. Average \(s(i)\) over all points and you get a single score in \([-1,1]\); the \(k\) that maximizes it is a defensible choice. The catch: silhouette is built from distance-to-centre logic, so it favours convex, k-means-shaped clusters and will under-rate a correct DBSCAN clustering of crescents. An internal metric can only reward the shape it was designed around.

For probabilistic models there is a principled alternative that penalizes complexity directly. A GMM can always raise its likelihood by adding components, so you cannot pick \(k\) by likelihood alone — but the Bayesian Information Criterion subtracts a penalty for every parameter, trading fit against parsimony:

EQ M12.7 — BIC FOR MODEL SELECTION $$ \text{BIC} \;=\; m\,\ln(n) \;-\; 2\,\ln \hat{L}, \qquad m = \#\text{free parameters},\quad n = \#\text{points},\quad \hat{L} = \text{maximized likelihood} $$
Lower BIC is better. The \(-2\ln\hat L\) term rewards fit; the \(m\ln(n)\) term punishes every extra parameter, so a component that barely improves the likelihood is rejected for the parameters it costs. For a \(d\)-dimensional, full-covariance GMM each component carries \(d\) mean parameters \(+\;d(d+1)/2\) covariance parameters \(+\;1\) weight, with one weight constrained away — so \(m = k\,[\,d + d(d{+}1)/2\,] + (k-1)\). Sweep \(k\), keep the minimum-BIC model. BIC is the most principled \(k\)-selector in this chapter — but it is only valid when the data really is a mixture of Gaussians; on crescents it will confidently choose the wrong \(k\) because the model is wrong, not the criterion.

The genuinely decisive answer is external. When the clusters feed something downstream — segments feeding a campaign, codes feeding a classifier, groups a domain expert will inspect — let that task's metric choose \(k\) and the algorithm. This converts an unanswerable unsupervised question back into a measurable one, and it is the most honest move in the whole chapter. Where you have at least some ground-truth labels, the Adjusted Rand Index compares a clustering to the truth while correcting for chance agreement (0 = random, 1 = perfect), and is the standard external score.

FINE PRINT

Four traps that quietly invalidate a clustering. (1) Scale. Every Euclidean method here inherits Chapter 02's pathology — an unscaled feature in different units silently owns the result; standardize first unless the units are genuinely shared. (2) Metric chases shape. Silhouette and inertia reward convex blobs, so they will rank a wrong k-means clustering above a correct DBSCAN one — never validate a density method with a centre-based score. (3) The curse of dimensionality. In high dimensions all pairwise distances concentrate toward equality, so distance-based clustering loses its grip; cluster in a learned low-dimensional embedding (next chapter) instead. (4) Clusters always appear. Every algorithm here returns clusters even on pure noise — before trusting any partition, ask whether structure exists at all (e.g. the Hopkins statistic), because a confident clustering of structureless data is the most seductive artifact in the field.

NEXT

Clustering compresses \(n\) points into a handful of groups; the next chapter compresses a whole matrix into a handful of factors. Chapter 13 — Matrix Factorization — turns the user-item rating tables behind every recommender into low-rank products of latent factors, connects SVD, NMF and ALS, and shows how the same algebra that powers collaborative filtering reappears as the embedding machinery throughout modern AI.

12.R

References

  1. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. KDD-96 — the original DBSCAN: core/border/noise points and density-reachability (§12.3, EQ M12.3).
  2. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39(1) — the Expectation–Maximization algorithm behind GMM fitting (§12.4, EQ M12.5).
  3. Rousseeuw, P. J. (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20 — the silhouette score for choosing and validating k (§12.5, EQ M12.6).
  4. Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58(301) — Ward's minimum-variance linkage for agglomerative clustering (§12.2, EQ M12.2).
  5. Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics 6(2) — the Bayesian Information Criterion for model (and component-count) selection (§12.5, EQ M12.7).
  6. Campello, R. J. G. B., Moulavi, D., Zimek, A. & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM TKDD 10(1) — HDBSCAN, the variable-density successor that removes DBSCAN's ε knob (§12.3 note).
  7. Hubert, L. & Arabie, P. (1985). Comparing Partitions. Journal of Classification 2(1) — the Adjusted Rand Index for chance-corrected external validation (§12.5).