Distance & Similarity Metrics

11.1

Why the distance defines the model

k-NN (Chapter 04) has no parameters, no training step, no loss function. Strip it down and what remains is a stored dataset and one function that decides which points count as neighbors. The same is true of k-means, of DBSCAN, of hierarchical clustering, of the approximate-nearest-neighbor index inside every vector database. None of them is really an algorithm over data; each is an algorithm over a distance. Change the distance and you have changed the model — usually more than any hyperparameter could.

Mathematicians demand four properties before they will call a function $d(\mathbf{x},\mathbf{y})$ a metric. They are worth stating because the moment you violate one, geometric intuition stops being trustworthy:

EQ M11.1 — THE METRIC AXIOMS $$ d(\mathbf{x},\mathbf{y}) \ge 0,\quad d(\mathbf{x},\mathbf{y}) = 0 \iff \mathbf{x}=\mathbf{y},\quad d(\mathbf{x},\mathbf{y}) = d(\mathbf{y},\mathbf{x}),\quad d(\mathbf{x},\mathbf{z}) \le d(\mathbf{x},\mathbf{y}) + d(\mathbf{y},\mathbf{z}) $$

In order: non-negativity, identity of indiscernibles (zero distance means the same point), symmetry, and the triangle inequality — a detour through $\mathbf{y}$ can never be shorter than going straight. Euclidean, Manhattan, and Mahalanobis distance satisfy all four. Cosine and squared-Euclidean distance do not — both break the triangle inequality, which is exactly why some fast indexes that assume a true metric cannot be used with them unmodified (§11.4).

The deepest consequence hides inside the most innocent-looking choice. A distance combines coordinates, so it implicitly weights them. If one feature is measured in millimetres and another in kilometres, the kilometres dominate every Euclidean comparison and the millimetres become invisible — the model decides almost everything on a single axis without anyone choosing that. This is why distance-based methods demand standardized features, why a covariance correction exists at all (§11.3), and why "what distance?" is never a footnote. The distance is the inductive bias.

SCALE FIRST

An unscaled feature is a silent vote. Before any distance is computed, the standard moves are z-scoring each column to mean 0, variance 1, or min–max scaling to $[0,1]$. Skip it and the column with the largest raw spread quietly becomes the model. Trees (Chapter 04) are the exception that proves the rule — they care only about the order of values within a column, so they are invariant to rescaling and never need it.

11.2

The Minkowski family — Euclidean & Manhattan

The workhorse distances are a single formula with one dial. The Minkowski distance of order $p$ raises each coordinate gap to the power $p$, sums them, and takes the $p$-th root:

EQ M11.2 — THE MINKOWSKI DISTANCE $$ d_p(\mathbf{x},\mathbf{y}) \;=\; \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{\,p} \right)^{1/p}, \qquad p \ge 1 $$

Turning the single knob $p$ gives every distance you use daily. $p=1$ is Manhattan (taxicab) distance, the sum of absolute coordinate gaps — how far you walk on a city grid. $p=2$ is Euclidean distance, the straight-line length from Pythagoras. As $p \to \infty$ the largest single coordinate gap swallows the sum and you get the Chebyshev distance $\max_i \lvert x_i - y_i\rvert$ — the king's move on a chessboard. For $p \ge 1$ it is a true metric; for $0 < p < 1$ the triangle inequality fails and it is only a "fractional distance."

The order $p$ is not a cosmetic choice — it reshapes the geometry of "near." The set of all points at distance exactly 1 from the origin, the unit ball, changes form completely: a diamond for Manhattan, a circle for Euclidean, a square for Chebyshev. Two points that are equidistant under one $p$ need not be under another, so the nearest neighbor itself can change with $p$. The instrument below lets you drag both points and read all three off at once.

Using EQ M11.2 with $p=1$, what is the Manhattan distance between $(1, 2)$ and $(4, 6)$?

Sum the absolute coordinate gaps: $\lvert 4-1\rvert + \lvert 6-2\rvert = 3 + 4 = $ 7. Manhattan ignores the diagonal shortcut — it counts only axis-aligned travel, as if walking city blocks.

Using EQ M11.2 with $p=2$, what is the Euclidean distance between $(0, 0)$ and $(3, 4)$?

$\sqrt{(3-0)^2 + (4-0)^2} = \sqrt{9 + 16} = \sqrt{25} = $ 5 — the hypotenuse of the classic 3-4-5 right triangle. The straight-line route is shorter than the Manhattan route of $3+4=7$, as the triangle inequality guarantees.

PYTHON · RUNNABLE IN-BROWSER

# EQ M11.2: the Minkowski family, plus cosine and Mahalanobis, between two vectors
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([4.0, 5.0])
diff = x - y

manhattan = np.abs(diff).sum()                      # p = 1
euclidean = np.sqrt((diff ** 2).sum())              # p = 2
chebyshev = np.abs(diff).max()                      # p -> infinity
print(f"Manhattan (p=1) : {manhattan:.4f}")
print(f"Euclidean (p=2) : {euclidean:.4f}")
print(f"Chebyshev (inf) : {chebyshev:.4f}")
print("ordering p1 >= p2 >= pinf :", manhattan >= euclidean >= chebyshev)

cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"\ncosine similarity : {cos_sim:.4f}   (1 = same direction)")

cov = np.array([[1.0, 0.8], [0.8, 1.0]])            # correlated features
m2 = diff @ np.linalg.inv(cov) @ diff               # squared Mahalanobis (EQ M11.3)
print(f"Mahalanobis dist  : {np.sqrt(m2):.4f}   (vs Euclidean {euclidean:.4f})")

edits are live — try p between 1 and 2, or set the off-diagonal covariance to 0

INSTRUMENT M11.1 — DISTANCE EXPLORERTWO POINTS · EUCLIDEAN · MANHATTAN · CHEBYSHEV · EQ M11.2

POINT P — x 1.0

POINT P — y 1.0

POINT Q — x 4.0

POINT Q — y 5.0

EUCLIDEAN (p=2)

—

MANHATTAN (p=1)

—

CHEBYSHEV (p=∞)

—

The faint coloured outlines around P are the three unit balls — the diamond is Manhattan, the circle Euclidean, the square Chebyshev — every shape a locus of "distance 1 from P." The mint line is the straight Euclidean route; the blue staircase is one Manhattan path of equal taxicab length. Default P = (1,1), Q = (4,5) gives Euclidean 5, Manhattan 7, Chebyshev 4. Drag Q onto a diagonal of P (equal x- and y-gaps) and watch Manhattan stretch furthest while Chebyshev stays small — the same two points, three different verdicts on "how far."

11.3

Mahalanobis distance — correcting for covariance

Euclidean distance treats every direction as equal and every feature as independent. Real data rarely obliges. Suppose height and weight are strongly correlated: a point that is tall-and-light sits far from the data cloud even if its raw Euclidean distance to the centre is modest, because it violates the correlation the rest of the data obeys. The Mahalanobis distance (Mahalanobis, 1936) fixes this by measuring distance in units of the data's own spread, deflating directions of high variance and inflating directions of low variance:

EQ M11.3 — MAHALANOBIS DISTANCE $$ d_M(\mathbf{x},\boldsymbol{\mu}) \;=\; \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\top}\, \Sigma^{-1}\, (\mathbf{x}-\boldsymbol{\mu})} $$

$\boldsymbol{\mu}$ is the data mean and $\Sigma$ its covariance matrix; $\Sigma^{-1}$ is the inverse covariance (the "precision" matrix). The quadratic form rotates into the eigen-axes of $\Sigma$ and rescales each by $1/\sqrt{\lambda_i}$ — exactly the principal directions of Stats 06. When $\Sigma = I$ (uncorrelated, unit-variance features), Mahalanobis collapses back to plain Euclidean distance. Curves of constant $d_M$ are the ellipses aligned with the data cloud, not circles — which is why a point along the cloud's long axis can be "closer" than a nearer point lying across it.

There is a clean way to see what $\Sigma^{-1}$ is doing: Mahalanobis distance is just Euclidean distance computed after whitening the data — applying the linear transform $\Sigma^{-1/2}$ that decorrelates the features and rescales each to unit variance. Whiten first, then measure with an ordinary ruler. This also explains its starring role in outlier and anomaly detection: for multivariate-Gaussian data, $d_M^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom, giving a principled threshold for "too far to be one of us."

Features are uncorrelated with variances 9 and 16, so $\Sigma = \begin{bmatrix} 9 & 0 \\ 0 & 16 \end{bmatrix}$ ($\sigma_x = 3$, $\sigma_y = 4$). A point sits a gap of $(9, 16)$ from the mean. What is its Mahalanobis distance $d_M$ (EQ M11.3)?

For diagonal $\Sigma$, $\Sigma^{-1} = \begin{bmatrix} 1/9 & 0 \\ 0 & 1/16 \end{bmatrix}$, so each gap is measured in standard deviations. The squared distance is $\dfrac{9^2}{9} + \dfrac{16^2}{16} = 9 + 16 = 25$. So $d_M = \sqrt{25} = $ 5. The point sits $3\sigma$ out along $x$ and $4\sigma$ out along $y$ — a 3-4-5 right triangle in σ-units, even though its raw Euclidean gap of $\sqrt{9^2+16^2}\approx 18.4$ is far larger.

INSTRUMENT M11.2 — MAHALANOBIS vs EUCLIDEANCORRELATED CLOUD · COVARIANCE ELLIPSE · EQ M11.3

VARIANCE σ²ₓ 3.0

VARIANCE σ²ᵧ 0.6

CORRELATION ρ 0.80

TEST POINT — x 2.4

TEST POINT — y 2.0

EUCLIDEAN TO MEAN

—

MAHALANOBIS TO MEAN

—

VERDICT (χ² ≈ 2.45σ)

—

The mint dots are a seeded Gaussian cloud with the covariance you dial in; the mint ellipses are the contours of constant Mahalanobis distance (1, 2, 3 σ). The white dot is your test point, the white line its straight Euclidean reach to the mean. Crank $\rho$ toward 0.8 and slide the test point along the cloud's long axis: Euclidean distance stays large while Mahalanobis shrinks — the point is unremarkable given the correlation. Now push it across the short axis and the verdict flips to OUTLIER even at a small Euclidean distance. That divergence is the entire reason Mahalanobis exists.

PYTHON · RUNNABLE IN-BROWSER

# Mahalanobis = Euclidean after whitening — two points, same Euclidean distance,
# but very different Mahalanobis distance on a correlated cloud (EQ M11.3)
import numpy as np

mu  = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])          # strongly correlated features
Sinv = np.linalg.inv(cov)

# two test points the SAME Euclidean distance (sqrt 2) from the mean...
along  = np.array([1.0,  1.0])        # lies ALONG the correlation axis
across = np.array([1.0, -1.0])        # lies ACROSS it

def mahal(p):
    d = p - mu
    return np.sqrt(d @ Sinv @ d)

for name, p in [("along ", along), ("across", across)]:
    print(f"{name}: Euclidean = {np.linalg.norm(p - mu):.3f}"
          f"   Mahalanobis = {mahal(p):.3f}")

print("\nIdentical Euclidean distance, ~6x difference in Mahalanobis:")
print("the across-the-grain point violates the correlation, so it reads as far.")

edits are live — set the off-diagonal to 0 and watch the two distances become equal

11.4

Cosine & Jaccard similarity

Sometimes magnitude is noise and only direction carries meaning. A document that uses the word "neural" twice as often as another, in the same proportions across every other word, is about the same thing — yet its longer count vector sits far away under Euclidean distance. Cosine similarity throws magnitude away by normalizing both vectors to unit length, then taking their dot product — the cosine of the angle between them:

EQ M11.4 — COSINE SIMILARITY $$ \cos\theta \;=\; \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert} \;=\; \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \;\in\; [-1, 1] $$

$1$ means the vectors point the same way, $0$ means orthogonal (unrelated), $-1$ means opposite. The companion cosine distance is $1 - \cos\theta$; it is not a true metric (it violates the triangle inequality), which is why some metric-tree indexes refuse it. The standard trick: on L2-normalized vectors, squared Euclidean distance is $2(1 - \cos\theta)$ — a strictly increasing function of cosine distance — so cosine ranking equals Euclidean ranking, and you can use any Euclidean index after normalizing. This is why embedding pipelines normalize before they store. Cosine is the default for sparse text vectors and dense embeddings precisely because it ignores document length and activation scale.

When the data is not counts but sets — the tags on a photo, the words in a tweet, the products in a basket — direction is the wrong picture entirely. Jaccard similarity measures overlap: the size of the intersection over the size of the union.

EQ M11.5 — JACCARD SIMILARITY $$ J(A,B) \;=\; \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}, \qquad d_J(A,B) \;=\; 1 - J(A,B) $$

Pure set overlap: $1$ when the sets are identical, $0$ when they are disjoint. The Jaccard distance $1 - J$ is a true metric. On binary vectors it counts only shared presences and ignores the vast sea of shared absences — the right instinct for sparse data, where two short documents agreeing that they both lack 49,998 of 50,000 vocabulary words tells you nothing. At web scale Jaccard is estimated, not computed, via MinHash + locality-sensitive hashing — the engine behind near-duplicate detection and the data deduplication that cleans LLM training corpora (Vol II · CH 04).

Two sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$. What is their Jaccard similarity $J(A,B)$ (EQ M11.5)?

Intersection $A \cap B = \{2, 3\}$, size 2. Union $A \cup B = \{1, 2, 3, 4\}$, size 4. So $J = \dfrac{2}{4} = $ 0.5 — the sets share half of their combined elements. The Jaccard distance is $1 - 0.5 = 0.5$.

Measure	Sees	True metric?	Reach for it when…
Euclidean	straight-line gap	yes	features are scaled & roughly independent; the default for k-means
Manhattan	axis-aligned gap	yes	grid-like or high-dimensional data; more robust to outliers than L2
Mahalanobis	gap in σ-units	yes	features are correlated; outlier/anomaly detection
Cosine	angle only	no	text TF-IDF, dense embeddings — when length is irrelevant
Jaccard	set overlap	yes (1−J)	sparse binary / set data; near-duplicate detection

PYTHON · RUNNABLE IN-BROWSER

# Cosine ignores magnitude; Jaccard measures set overlap (EQ M11.4-M11.5)
import numpy as np

# a short doc and the SAME doc with every count doubled
a = np.array([3.0, 1.0, 0.0, 2.0])
b = np.array([6.0, 2.0, 0.0, 4.0])          # b = 2*a : same direction

cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
euc = np.linalg.norm(a - b)
print(f"cosine(a, b)     : {cos:.4f}   (1.0 = identical direction)")
print(f"euclidean(a, b)  : {euc:.4f}   (large! magnitude differs)")
print("-> cosine calls them identical; Euclidean is fooled by length\n")

# Jaccard on two sets, computed via binary membership vectors
A = {1, 2, 3}
B = {2, 3, 4}
inter = len(A & B)
union = len(A | B)
print(f"A & B = {A & B}, A | B = {A | B}")
print(f"Jaccard(A, B)    : {inter/union:.4f}")
print(f"Jaccard distance : {1 - inter/union:.4f}")

edits are live — add an element to B and watch Jaccard fall

11.5

The curse of dimensionality

Everything above is built on a quiet assumption: that "nearest" is meaningfully different from "farthest." In low dimensions it obviously is. In high dimensions it quietly stops being true — and this is the single most important, least intuitive fact in this chapter. As the number of dimensions grows, the distances between random points concentrate: the nearest neighbor and the farthest neighbor of a query end up at almost the same distance, and the very notion of a closest point loses its bite.

EQ M11.6 — DISTANCE CONCENTRATION $$ \lim_{n \to \infty} \;\mathbb{E}\!\left[ \frac{\mathrm{dist}_{\max}(n) - \mathrm{dist}_{\min}(n)}{\mathrm{dist}_{\min}(n)} \right] \;\to\; 0 \qquad (\text{i.i.d. features}) $$

For data with independent coordinates, the relative contrast between the farthest and nearest points of a query vanishes as the dimension $n$ grows (Beyer et al., 1999). Each new coordinate adds roughly equal, independent "noise" to every pairwise distance, so by a law-of-large-numbers effect all distances pile up around the same mean and their spread relative to that mean collapses. The practical reading is brutal: in enough dimensions, every point is almost equidistant from every other, k-NN's votes become coin flips, and the index gains nothing over a linear scan. Aggarwal et al. (2001) add a twist — lower $p$ (Manhattan over Euclidean, even fractional norms) concentrates more slowly, so Manhattan is often the better high-dimensional choice.

A second face of the same curse is geometric. To capture a fixed fraction of uniformly spread points, a neighborhood must grow until it is no longer "local" at all. In a $d$-dimensional unit cube, a sub-cube holding just 1% of the volume needs edge length $0.01^{1/d}$: at $d=2$ that is 0.10, but at $d=100$ it is $\approx 0.955$ — to grab 1% of the data your "neighborhood" must span 95% of every single axis. Locality evaporates.

THE ESCAPE HATCH

Why does any of this still work? Because real high-dimensional data is almost never i.i.d. across its coordinates. Images, text embeddings, and audio live near a much lower-dimensional manifold inside the ambient space — the manifold hypothesis. The concentration result describes the worst case of structureless noise; genuine data has structure, so its intrinsic dimension is small even when its nominal dimension is thousands. This is precisely why k-NN on a good learned embedding still works, while k-NN on raw high-dimensional pixels does not. Dimensionality reduction (PCA, UMAP — next chapters) and learned representations are, at bottom, attempts to recover that low intrinsic dimension before you ever compute a distance.

INSTRUMENT M11.3 — DISTANCE CONCENTRATIONRANDOM POINTS · CONTRAST vs DIMENSION · EQ M11.6

DIMENSION n 2

NORM

NEAREST / FARTHEST DIST

—

RELATIVE CONTRAST

—

REGIME

—

200 points are drawn uniformly in the $n$-cube; the histogram is their distances to one query, and the contrast $(d_{\max}-d_{\min})/d_{\min}$ is EQ M11.6 made live. At $n=2$ the distances are spread wide and "nearest" is meaningful. Slide $n$ toward 512 and watch the histogram tighten into a spike — the contrast plunges from ~10 toward a fraction, and a nearest neighbor stops being distinguishable from the rest. Flip to Manhattan and the collapse is gentler at every dimension: the Aggarwal result, drawn from random data.

PYTHON · RUNNABLE IN-BROWSER

# EQ M11.6: the max/min pairwise-distance ratio collapsing as dimension rises
import numpy as np
rng = np.random.default_rng(0)

print(" dim    mean dist   min      max     contrast (max-min)/min")
dims, contrasts = [], []
for n in (2, 4, 8, 16, 64, 256, 1024):
    X = rng.random((200, n))                  # 200 uniform points in the n-cube
    q = rng.random(n)                         # one query point
    d = np.sqrt(((X - q) ** 2).sum(axis=1))   # Euclidean distance to each
    contrast = (d.max() - d.min()) / d.min()
    dims.append(n); contrasts.append(contrast)
    print(f"{n:5d} {d.mean():10.3f} {d.min():8.3f} {d.max():8.3f} {contrast:14.3f}")

print("\nIn 2-D the farthest point is ~10x the nearest; by 1024-D the gap is tiny.")
print("Nearest-neighbor 'distance' loses its meaning -- the curse, quantified.")
plot_xy(dims, contrasts)                       # contrast vs dimension

edits are live — switch to Manhattan (np.abs(X-q).sum) and watch it decay slower

You now own the ruler; next you point it at unlabeled data. Every clustering algorithm is a distance plus a rule for grouping by it — k-means minimizes squared Euclidean distance to centroids, DBSCAN grows clusters by a distance radius, hierarchical methods merge by inter-cluster distance. Chapter 12 tours the clustering zoo and shows how the choice you made here decides the shapes each one can — and cannot — find.

11.R

References

Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7. the original covariance-corrected distance (EQ M11.3), reprinted with commentary.
Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434. shows lower-order (fractional, Manhattan) norms concentrate more slowly than Euclidean (§11.5).
Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235. the foundational distance-concentration result behind EQ M11.6.
Cover, T. & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13(1), 21–27. why the distance choice is the model: the founding analysis of k-NN.
Broder, A. Z. (1997/1998). On the resemblance and containment of documents & Min-wise independent permutations. SEQUENCES / STOC. MinHash estimation of Jaccard similarity at scale (§11.4).