Low-rank structure in real data
An \(m \times n\) matrix has up to \(mn\) free numbers, but its rank — the number of linearly independent rows (equivalently, columns) — is often far smaller than \(\min(m,n)\). A matrix of rank \(r\) can be written exactly as a product of an \(m \times r\) and an \(r \times n\) matrix, so it really only carries \(r(m+n)\) degrees of freedom. When \(r \ll \min(m,n)\), that is a colossal saving.
Why should real matrices be low-rank? Because they are generated by a small number of latent causes. A ratings table is driven by a handful of taste dimensions, not by a million independent whims. A term–document matrix is driven by a few dozen topics. A grayscale photo's columns are nearly redundant because neighboring columns look almost identical. The rank measures how many independent patterns are actually present; everything else is a combination of them.
Real data is rarely exactly low-rank — there is noise — but it is very often approximately low-rank: a sharp drop in the singular value spectrum (§13.2) followed by a long tail of small values. The art is choosing where to cut the tail: keep enough rank to capture the signal, drop enough to discard the noise. The rest of the chapter is variations on that single decision.
SVD recap & truncation
The singular value decomposition is the factorization that always exists, for every real matrix, and is in a precise sense the best one. It writes \(A\) as a rotation, a non-negative scaling, and another rotation:
Now truncate. Keep only the top \(k\) layers and you get the rank-\(k\) matrix \(A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^{\top}\). The Eckart–Young theorem — one of the load-bearing results of applied linear algebra — says this is not just a good rank-\(k\) approximation, it is the best possible one:
How big should \(k\) be? Plot the singular values (a scree plot) or, better, the cumulative energy retained, \(\sum_{i\le k}\sigma_i^2 \big/ \sum_i \sigma_i^2\). A common rule is to keep enough components to retain 90–99% of the energy — the elbow where the curve flattens marks the boundary between signal and noise.
# Truncated-SVD recommender: held-out RMSE as a function of rank (EQ M13.2-3)
import numpy as np
rng = np.random.default_rng(1)
m, n, true_r = 40, 25, 2 # data is REALLY rank 2
A = rng.normal(0, 1, (m, true_r)) @ rng.normal(0, 1, (true_r, n))
A = 1 + 4 * (A - A.min()) / (A.max() - A.min()) # squash into 1..5 stars
A += rng.normal(0, 0.15, A.shape) # + observation noise
test = rng.random(A.shape) < 0.2 # hide 20% of cells as test
mu = A[~test].mean() # global mean fills the holes
F = np.where(test, mu, A) # mean-imputed train matrix
ranks, rmses = [1, 2, 3, 5, 10], []
for r in ranks:
Uu, s, Vt = np.linalg.svd(F - mu, full_matrices=False)
Ahat = mu + (Uu[:, :r] * s[:r]) @ Vt[:r] # best rank-r fit (EQ M13.3)
rmse = np.sqrt(np.mean((A[test] - Ahat[test]) ** 2))
rmses.append(rmse)
print(f"rank {r:2d}: held-out RMSE {rmse:.3f}")
print("\nRMSE bottoms out near the TRUE rank (2); higher rank just refits noise.")
plot_xy(ranks, rmses)
Recommender systems — latent factors
The canonical application, made famous by the 2006–2009 Netflix Prize, is collaborative filtering. You have a sparse \(m\times n\) ratings matrix \(R\): users by items, with the vast majority of entries missing (a typical user has rated 0.1% of the catalog). The task is to fill in the blanks. Matrix factorization's answer: assign every user a latent vector \(p_u \in \mathbb{R}^{k}\) and every item a latent vector \(q_i \in \mathbb{R}^{k}\), and predict the rating as their dot product.
You cannot run a plain SVD here, because SVD needs a complete matrix and \(R\) is mostly holes — imputing the holes with a constant first then running SVD biases the result toward that constant. The fix is to fit only the observed entries, minimizing regularized squared error by gradient descent:
The cold-start caveat, honestly. A new user or item with no ratings has no factors to estimate — the model defaults to biases alone, which is to say it guesses the average. Pure collaborative filtering is also blind to content (genre, text, image) and prone to popularity bias. Production systems since the mid-2010s blend factorization with content features and, increasingly, deep retrieval models — but a two-tower neural recommender is still computing a dot product of learned user and item embeddings. The latent-factor idea did not go away; it got an encoder in front of it.
# Funk SVD: matrix factorization by gradient descent on OBSERVED entries (EQ M13.5)
import numpy as np
rng = np.random.default_rng(0)
R = np.array([ # 6 users x 5 movies; NaN = unrated
[5, 3, np.nan, 1, np.nan],
[4, np.nan, np.nan, 1, 2],
[1, 1, np.nan, 5, np.nan],
[1, np.nan, np.nan, 4, 5],
[np.nan, 1, 5, 4, np.nan],
[2, 1, 4, np.nan, 5]], float)
mask = ~np.isnan(R) # True where a rating is known
k, lr, lam = 2, 0.02, 0.1
U = rng.normal(0, .1, (R.shape[0], k)) # user factors P
V = rng.normal(0, .1, (R.shape[1], k)) # item factors Q
for step in range(4000):
E = np.where(mask, R - U @ V.T, 0.0) # error on observed cells only
U += lr * (E @ V - lam * U) # gradient step (EQ M13.5)
V += lr * (E.T @ U - lam * V)
P = U @ V.T
print(f"train RMSE on observed entries: {np.sqrt(np.nanmean((R - P)[mask] ** 2)):.3f}")
print("reconstructed matrix (blanks are now PREDICTIONS):")
print(np.round(P, 1))
print("\nuser 0's two unrated movies (cols 2, 4) predicted:", np.round(P[0, [2, 4]], 2))
Non-negative Matrix Factorization
SVD's singular vectors mix positive and negative entries freely, so its factors add and subtract. That makes them mathematically optimal but often uninterpretable: a "component" of a face might be a ghostly blend that only makes sense once you cancel it against another. Non-negative Matrix Factorization (NMF) imposes one extra constraint — every entry of both factors must be \(\ge 0\) — and that constraint changes everything about what the parts look like.
NMF is fit by multiplicative updates, a pair of element-wise rules that preserve non-negativity automatically (no projection step, no learning rate to tune) and monotonically decrease the reconstruction error:
SVD vs NMF, the honest trade. SVD gives the provably best low-rank fit (Eckart–Young) with orthogonal, ordered, unique factors — ideal when you want compression or principal directions. NMF gives a usually-worse fit with non-orthogonal, unordered, non-unique factors — but ones a human can read as additive parts. Choose SVD/PCA when you want the optimal subspace; choose NMF when you want interpretable, parts-based components and the data is naturally non-negative (counts, intensities, spectra).
PCA as SVD; the connections
You met Principal Component Analysis as variance-hunting in Chapter 05. Here is the secret it shares with everything above: PCA is just the SVD of the centered data matrix. Center each column to mean zero (call the result \(X_c\)); then the principal components are the right singular vectors, and the variance along each is the squared singular value, scaled by the sample count.
That equivalence is the unifying thread of this chapter. The same decomposition, read three ways, becomes three tools:
| You want… | The factorization | What the factors mean |
|---|---|---|
| Best low-rank fit / compression | truncated SVD \(A_k\) | orthogonal, ordered, optimal (Eckart–Young) |
| Principal directions / decorrelation | SVD of centered \(X_c\) | PCA components = right singular vectors |
| Fill missing entries (recommend) | Funk SVD on observed \(\mathcal{K}\) | user/item latent factors \(p_u, q_i\) |
| Interpretable additive parts | NMF \(WH\), \(W,H\ge 0\) | topics, strokes, spectra — parts you can name |
Where this reaches. Latent Semantic Analysis is truncated SVD of a term–document matrix. Classical word embeddings (GloVe, and the implicit factorization behind word2vec) are matrix factorizations of co-occurrence statistics. Spectral clustering factorizes a graph Laplacian. Image and video codecs lean on related transforms. The low-rank prior — "a few hidden factors explain most of the data" — is one of the most reusable assumptions in all of machine learning, and factorization is how you cash it in.
Factorization compresses one matrix into a few smart parts; ensembles compress many weak models into one strong vote. Chapter 14: bagging, boosting, and stacking — why a forest of mediocre trees beats one clever one, and how the bias–variance trade-off plays out when you combine predictors instead of features.
References
- Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems.
- Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization.
- Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank.
- Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization.
- Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis.
- Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization.