Why the distance defines the model
k-NN (Chapter 04) has no parameters, no training step, no loss function. Strip it down and what remains is a stored dataset and one function that decides which points count as neighbors. The same is true of k-means, of DBSCAN, of hierarchical clustering, of the approximate-nearest-neighbor index inside every vector database. None of them is really an algorithm over data; each is an algorithm over a distance. Change the distance and you have changed the model — usually more than any hyperparameter could.
Mathematicians demand four properties before they will call a function \(d(\mathbf{x},\mathbf{y})\) a metric. They are worth stating because the moment you violate one, geometric intuition stops being trustworthy:
The deepest consequence hides inside the most innocent-looking choice. A distance combines coordinates, so it implicitly weights them. If one feature is measured in millimetres and another in kilometres, the kilometres dominate every Euclidean comparison and the millimetres become invisible — the model decides almost everything on a single axis without anyone choosing that. This is why distance-based methods demand standardized features, why a covariance correction exists at all (§11.3), and why "what distance?" is never a footnote. The distance is the inductive bias.
An unscaled feature is a silent vote. Before any distance is computed, the standard moves are z-scoring each column to mean 0, variance 1, or min–max scaling to \([0,1]\). Skip it and the column with the largest raw spread quietly becomes the model. Trees (Chapter 04) are the exception that proves the rule — they care only about the order of values within a column, so they are invariant to rescaling and never need it.
The Minkowski family — Euclidean & Manhattan
The workhorse distances are a single formula with one dial. The Minkowski distance of order \(p\) raises each coordinate gap to the power \(p\), sums them, and takes the \(p\)-th root:
The order \(p\) is not a cosmetic choice — it reshapes the geometry of "near." The set of all points at distance exactly 1 from the origin, the unit ball, changes form completely: a diamond for Manhattan, a circle for Euclidean, a square for Chebyshev. Two points that are equidistant under one \(p\) need not be under another, so the nearest neighbor itself can change with \(p\). The instrument below lets you drag both points and read all three off at once.
# EQ M11.2: the Minkowski family, plus cosine and Mahalanobis, between two vectors
import numpy as np
x = np.array([1.0, 1.0])
y = np.array([4.0, 5.0])
diff = x - y
manhattan = np.abs(diff).sum() # p = 1
euclidean = np.sqrt((diff ** 2).sum()) # p = 2
chebyshev = np.abs(diff).max() # p -> infinity
print(f"Manhattan (p=1) : {manhattan:.4f}")
print(f"Euclidean (p=2) : {euclidean:.4f}")
print(f"Chebyshev (inf) : {chebyshev:.4f}")
print("ordering p1 >= p2 >= pinf :", manhattan >= euclidean >= chebyshev)
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(f"\ncosine similarity : {cos_sim:.4f} (1 = same direction)")
cov = np.array([[1.0, 0.8], [0.8, 1.0]]) # correlated features
m2 = diff @ np.linalg.inv(cov) @ diff # squared Mahalanobis (EQ M11.3)
print(f"Mahalanobis dist : {np.sqrt(m2):.4f} (vs Euclidean {euclidean:.4f})")
Mahalanobis distance — correcting for covariance
Euclidean distance treats every direction as equal and every feature as independent. Real data rarely obliges. Suppose height and weight are strongly correlated: a point that is tall-and-light sits far from the data cloud even if its raw Euclidean distance to the centre is modest, because it violates the correlation the rest of the data obeys. The Mahalanobis distance (Mahalanobis, 1936) fixes this by measuring distance in units of the data's own spread, deflating directions of high variance and inflating directions of low variance:
There is a clean way to see what \(\Sigma^{-1}\) is doing: Mahalanobis distance is just Euclidean distance computed after whitening the data — applying the linear transform \(\Sigma^{-1/2}\) that decorrelates the features and rescales each to unit variance. Whiten first, then measure with an ordinary ruler. This also explains its starring role in outlier and anomaly detection: for multivariate-Gaussian data, \(d_M^2\) follows a \(\chi^2\) distribution with \(n\) degrees of freedom, giving a principled threshold for "too far to be one of us."
# Mahalanobis = Euclidean after whitening — two points, same Euclidean distance,
# but very different Mahalanobis distance on a correlated cloud (EQ M11.3)
import numpy as np
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
[0.9, 1.0]]) # strongly correlated features
Sinv = np.linalg.inv(cov)
# two test points the SAME Euclidean distance (sqrt 2) from the mean...
along = np.array([1.0, 1.0]) # lies ALONG the correlation axis
across = np.array([1.0, -1.0]) # lies ACROSS it
def mahal(p):
d = p - mu
return np.sqrt(d @ Sinv @ d)
for name, p in [("along ", along), ("across", across)]:
print(f"{name}: Euclidean = {np.linalg.norm(p - mu):.3f}"
f" Mahalanobis = {mahal(p):.3f}")
print("\nIdentical Euclidean distance, ~6x difference in Mahalanobis:")
print("the across-the-grain point violates the correlation, so it reads as far.")
Cosine & Jaccard similarity
Sometimes magnitude is noise and only direction carries meaning. A document that uses the word "neural" twice as often as another, in the same proportions across every other word, is about the same thing — yet its longer count vector sits far away under Euclidean distance. Cosine similarity throws magnitude away by normalizing both vectors to unit length, then taking their dot product — the cosine of the angle between them:
When the data is not counts but sets — the tags on a photo, the words in a tweet, the products in a basket — direction is the wrong picture entirely. Jaccard similarity measures overlap: the size of the intersection over the size of the union.
| Measure | Sees | True metric? | Reach for it when… |
|---|---|---|---|
| Euclidean | straight-line gap | yes | features are scaled & roughly independent; the default for k-means |
| Manhattan | axis-aligned gap | yes | grid-like or high-dimensional data; more robust to outliers than L2 |
| Mahalanobis | gap in σ-units | yes | features are correlated; outlier/anomaly detection |
| Cosine | angle only | no | text TF-IDF, dense embeddings — when length is irrelevant |
| Jaccard | set overlap | yes (1−J) | sparse binary / set data; near-duplicate detection |
# Cosine ignores magnitude; Jaccard measures set overlap (EQ M11.4-M11.5)
import numpy as np
# a short doc and the SAME doc with every count doubled
a = np.array([3.0, 1.0, 0.0, 2.0])
b = np.array([6.0, 2.0, 0.0, 4.0]) # b = 2*a : same direction
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
euc = np.linalg.norm(a - b)
print(f"cosine(a, b) : {cos:.4f} (1.0 = identical direction)")
print(f"euclidean(a, b) : {euc:.4f} (large! magnitude differs)")
print("-> cosine calls them identical; Euclidean is fooled by length\n")
# Jaccard on two sets, computed via binary membership vectors
A = {1, 2, 3}
B = {2, 3, 4}
inter = len(A & B)
union = len(A | B)
print(f"A & B = {A & B}, A | B = {A | B}")
print(f"Jaccard(A, B) : {inter/union:.4f}")
print(f"Jaccard distance : {1 - inter/union:.4f}")
The curse of dimensionality
Everything above is built on a quiet assumption: that "nearest" is meaningfully different from "farthest." In low dimensions it obviously is. In high dimensions it quietly stops being true — and this is the single most important, least intuitive fact in this chapter. As the number of dimensions grows, the distances between random points concentrate: the nearest neighbor and the farthest neighbor of a query end up at almost the same distance, and the very notion of a closest point loses its bite.
A second face of the same curse is geometric. To capture a fixed fraction of uniformly spread points, a neighborhood must grow until it is no longer "local" at all. In a \(d\)-dimensional unit cube, a sub-cube holding just 1% of the volume needs edge length \(0.01^{1/d}\): at \(d=2\) that is 0.10, but at \(d=100\) it is \(\approx 0.955\) — to grab 1% of the data your "neighborhood" must span 95% of every single axis. Locality evaporates.
Why does any of this still work? Because real high-dimensional data is almost never i.i.d. across its coordinates. Images, text embeddings, and audio live near a much lower-dimensional manifold inside the ambient space — the manifold hypothesis. The concentration result describes the worst case of structureless noise; genuine data has structure, so its intrinsic dimension is small even when its nominal dimension is thousands. This is precisely why k-NN on a good learned embedding still works, while k-NN on raw high-dimensional pixels does not. Dimensionality reduction (PCA, UMAP — next chapters) and learned representations are, at bottom, attempts to recover that low intrinsic dimension before you ever compute a distance.
# EQ M11.6: the max/min pairwise-distance ratio collapsing as dimension rises
import numpy as np
rng = np.random.default_rng(0)
print(" dim mean dist min max contrast (max-min)/min")
dims, contrasts = [], []
for n in (2, 4, 8, 16, 64, 256, 1024):
X = rng.random((200, n)) # 200 uniform points in the n-cube
q = rng.random(n) # one query point
d = np.sqrt(((X - q) ** 2).sum(axis=1)) # Euclidean distance to each
contrast = (d.max() - d.min()) / d.min()
dims.append(n); contrasts.append(contrast)
print(f"{n:5d} {d.mean():10.3f} {d.min():8.3f} {d.max():8.3f} {contrast:14.3f}")
print("\nIn 2-D the farthest point is ~10x the nearest; by 1024-D the gap is tiny.")
print("Nearest-neighbor 'distance' loses its meaning -- the curse, quantified.")
plot_xy(dims, contrasts) # contrast vs dimension
You now own the ruler; next you point it at unlabeled data. Every clustering algorithm is a distance plus a rule for grouping by it — k-means minimizes squared Euclidean distance to centroids, DBSCAN grows clusters by a distance radius, hierarchical methods merge by inter-cluster distance. Chapter 12 tours the clustering zoo and shows how the choice you made here decides the shapes each one can — and cannot — find.
References
- Mahalanobis, P. C. (1936, repr. 2018). On the generalised distance in statistics. Sankhyā A, 80(S1), 1–7.
- Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT 2001, LNCS 1973, 420–434.
- Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). When Is "Nearest Neighbor" Meaningful? ICDT 1999, LNCS 1540, 217–235.
- Cover, T. & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13(1), 21–27.
- Broder, A. Z. (1997/1998). On the resemblance and containment of documents & Min-wise independent permutations. SEQUENCES / STOC.