AI // ENCYCLOPEDIA / MACHINE LEARNING / 13 / MATRIX FACTORIZATION INDEX NEXT: ENSEMBLES →
MACHINE LEARNING · CHAPTER 13 / 15

Matrix Factorization & SVD in Practice

A ratings table, a term-document count, a pixel grid, an adjacency matrix: most large matrices that show up in practice are not full-rank. They are low-rank. A few hidden factors explain almost everything, and writing the matrix as a product of two thin ones recovers those factors. Factorization is the shared engine under recommender systems, word and item embeddings, image compression, and the PCA from Chapter 05. This chapter builds it from the singular value decomposition outward, then covers the three variants you will deploy.

LEVELCORE READING TIME≈ 24 MIN BUILDS ONML 02 · 05 · STATS 06 INSTRUMENTSRATINGS · SCREE · NMF
13.1

Low-rank structure in real data

An \(m \times n\) matrix has up to \(mn\) free numbers, but its rank — the number of linearly independent rows (equivalently, columns) — is often far smaller than \(\min(m,n)\). A matrix of rank \(r\) can be written exactly as a product of an \(m \times r\) and an \(r \times n\) matrix, so it really only carries \(r(m+n)\) degrees of freedom. When \(r \ll \min(m,n)\), that is a colossal saving.

Why should real matrices be low-rank? Because they are generated by a small number of latent causes. A ratings table is driven by a handful of taste dimensions, not by a million independent whims. A term–document matrix is driven by a few dozen topics. A grayscale photo's columns are nearly redundant because neighboring columns look almost identical. The rank measures how many independent patterns are actually present; everything else is a combination of them.

EQ M13.1 — RANK-r FACTORIZATION $$ \underbrace{A}_{m\times n} \;=\; \underbrace{U}_{m\times r}\, \underbrace{V^{\top}}_{r\times n}, \qquad \operatorname{rank}(A) = r \;\le\; \min(m,n) $$
Row \(i\) of \(A\) is \(U_{i,:}V^{\top}\): a weighted mix of the \(r\) rows of \(V^{\top}\). Those \(r\) rows are the shared patterns; the row of \(U\) is the recipe that reconstructs entity \(i\) from them. Parameter count drops from \(mn\) to \(r(m+n)\). For a \(10{,}000 \times 10{,}000\) matrix of rank \(20\) that is \(10^8\) numbers collapsing to \(4\times10^5\) — a 250× compression with zero error if the matrix truly has that rank.

Real data is rarely exactly low-rank — there is noise — but it is very often approximately low-rank: a sharp drop in the singular value spectrum (§13.2) followed by a long tail of small values. The art is choosing where to cut the tail: keep enough rank to capture the signal, drop enough to discard the noise. The rest of the chapter is variations on that single decision.

You factor a \(10 \times 10\) matrix as \(A = UV^{\top}\) with rank \(r = 2\) (so \(U\) is \(10\times 2\) and \(V\) is \(10\times 2\)). How many free parameters does this factorization use in total?
A rank-\(r\) factorization of an \(m\times n\) matrix uses \(r(m+n)\) parameters. Here \(r=2,\ m=n=10\): \(2\,(10+10) = 2 \times 20 = \) 40. The dense matrix has \(100\) entries, so even a rank-2 factorization more than halves the storage — and the gap widens fast as the matrix grows.
INSTRUMENT M13.1 — LOW-RANK RATINGSFACTORIZE · PREDICT MISSING · EQ M13.1
OBSERVED CELLS
TRAIN RMSE
PARAMS k(m+n)
Left grid: the observed ratings (blank = unrated). Right grid: the model's reconstruction \(UV^{\top}\) — the blank cells are now predictions, the whole point of a recommender. Drag training steps up to watch gradient descent (EQ M13.2) fit the observed cells and, in doing so, fill the gaps. Bump \(k\) past the true structure and the train RMSE keeps dropping while the predictions get noisier — the overfitting you will fight in §13.3.
13.2

SVD recap & truncation

The singular value decomposition is the factorization that always exists, for every real matrix, and is in a precise sense the best one. It writes \(A\) as a rotation, a non-negative scaling, and another rotation:

EQ M13.2 — THE SVD $$ A \;=\; U \Sigma V^{\top} \;=\; \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0 $$
\(U\) (\(m\times m\)) and \(V\) (\(n\times n\)) are orthogonal (\(U^{\top}U = I\)); their columns \(u_i, v_i\) are the left/right singular vectors. \(\Sigma\) is diagonal with the singular values \(\sigma_i\), which are always non-negative — they are the lengths the unit sphere is stretched to, and a length cannot be negative. The right form sums \(r\) rank-1 layers, each a singular value times an outer product, ordered most-important-first. The squared singular values \(\sigma_i^2\) are the energy (variance) each layer carries.

Now truncate. Keep only the top \(k\) layers and you get the rank-\(k\) matrix \(A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^{\top}\). The Eckart–Young theorem — one of the load-bearing results of applied linear algebra — says this is not just a good rank-\(k\) approximation, it is the best possible one:

EQ M13.3 — ECKART–YOUNG (BEST RANK-k FIT) $$ \min_{\operatorname{rank}(B)\le k}\; \lVert A - B\rVert_F \;=\; \lVert A - A_k\rVert_F \;=\; \sqrt{\sum_{i=k+1}^{r}\sigma_i^{2}} $$
Among all matrices of rank \(\le k\), the truncated SVD \(A_k\) minimizes the Frobenius (and spectral) error, and the leftover error is exactly the energy of the singular values you threw away. This is why "keep the top \(k\) singular values" is the right thing to do, not a heuristic. The same theorem justifies LoRA (Vol II · EQ 6.1): if a weight update is low-rank, its best compression is its truncated SVD.

How big should \(k\) be? Plot the singular values (a scree plot) or, better, the cumulative energy retained, \(\sum_{i\le k}\sigma_i^2 \big/ \sum_i \sigma_i^2\). A common rule is to keep enough components to retain 90–99% of the energy — the elbow where the curve flattens marks the boundary between signal and noise.

True or false: the singular values \(\sigma_i\) produced by the SVD of a real matrix are always non-negative. (Answer true or false.)
The singular values are the square roots of the eigenvalues of \(A^{\top}A\), a positive-semidefinite matrix, so each \(\sigma_i = \sqrt{\lambda_i} \ge 0\). Geometrically they are the factors by which the unit sphere is stretched along the principal axes, and a stretch length is never negative. The answer is true. (Their signs are absorbed into the singular vectors instead.)
A matrix has singular values \(\sigma = (6,\, 3,\, 2,\, 1)\). What percent of the total energy (sum of squares) is retained by the best rank-2 approximation? Enter the percent, e.g. 90 for 90%.
Energy is \(\sigma_i^2\). Total \(= 36 + 9 + 4 + 1 = 50\). The top two layers carry \(36 + 9 = 45\). Retained fraction \(= 45/50 = 0.90\), so \(\times 100 = \) 90%. Rank-2 already captures nine-tenths of this matrix; the last two layers — energy \(4 + 1 = 5\) — are nearly noise.
PYTHON · RUNNABLE IN-BROWSER
# Truncated-SVD recommender: held-out RMSE as a function of rank (EQ M13.2-3)
import numpy as np
rng = np.random.default_rng(1)
m, n, true_r = 40, 25, 2                              # data is REALLY rank 2
A = rng.normal(0, 1, (m, true_r)) @ rng.normal(0, 1, (true_r, n))
A = 1 + 4 * (A - A.min()) / (A.max() - A.min())      # squash into 1..5 stars
A += rng.normal(0, 0.15, A.shape)                    # + observation noise
test = rng.random(A.shape) < 0.2                     # hide 20% of cells as test
mu = A[~test].mean()                                 # global mean fills the holes
F = np.where(test, mu, A)                            # mean-imputed train matrix

ranks, rmses = [1, 2, 3, 5, 10], []
for r in ranks:
    Uu, s, Vt = np.linalg.svd(F - mu, full_matrices=False)
    Ahat = mu + (Uu[:, :r] * s[:r]) @ Vt[:r]         # best rank-r fit (EQ M13.3)
    rmse = np.sqrt(np.mean((A[test] - Ahat[test]) ** 2))
    rmses.append(rmse)
    print(f"rank {r:2d}: held-out RMSE {rmse:.3f}")
print("\nRMSE bottoms out near the TRUE rank (2); higher rank just refits noise.")
plot_xy(ranks, rmses)
edits are live — break it on purpose
INSTRUMENT M13.2 — SCREE & ENERGYPICK k · VARIANCE RETAINED · EQ M13.3
ENERGY RETAINED
ERROR ‖A−A_k‖_F / ‖A‖_F
COMPRESSION (k=N→k)
Bars are squared singular values (energy); the mint line is cumulative energy retained as you keep more components. Slide \(k\) to the elbow — where the bars go flat — and you keep nearly all the energy for a fraction of the rank. Switch the decay from fast (sharp spectrum, a few components suffice) to slow (flat spectrum, no good low-rank fit exists) to feel when factorization helps and when it does not.
13.3

Recommender systems — latent factors

The canonical application, made famous by the 2006–2009 Netflix Prize, is collaborative filtering. You have a sparse \(m\times n\) ratings matrix \(R\): users by items, with the vast majority of entries missing (a typical user has rated 0.1% of the catalog). The task is to fill in the blanks. Matrix factorization's answer: assign every user a latent vector \(p_u \in \mathbb{R}^{k}\) and every item a latent vector \(q_i \in \mathbb{R}^{k}\), and predict the rating as their dot product.

EQ M13.4 — LATENT-FACTOR MODEL (WITH BIASES) $$ \hat r_{ui} \;=\; \mu + b_u + b_i + p_u^{\top} q_i, \qquad p_u, q_i \in \mathbb{R}^{k} $$
\(\mu\) is the global mean, \(b_u\) and \(b_i\) the user/item biases (a generous rater, a beloved film), and \(p_u^{\top}q_i\) is the interaction: how much user \(u\)'s tastes align with item \(i\)'s attributes along \(k\) hidden axes the model discovers for itself (one axis might turn out to be "arthouse ↔ blockbuster", another "comedy ↔ drama"). The latent vectors are exactly the rows of \(U\) and \(V\) in EQ M13.1 — collaborative filtering is matrix factorization, only on the observed entries.

You cannot run a plain SVD here, because SVD needs a complete matrix and \(R\) is mostly holes — imputing the holes with a constant first then running SVD biases the result toward that constant. The fix is to fit only the observed entries, minimizing regularized squared error by gradient descent:

EQ M13.5 — OBSERVED-ENTRY OBJECTIVE $$ \min_{P,Q,b}\; \sum_{(u,i)\in\mathcal{K}} \Big(r_{ui} - \hat r_{ui}\Big)^2 \;+\; \lambda\Big(\lVert p_u\rVert^2 + \lVert q_i\rVert^2 + b_u^2 + b_i^2\Big) $$
\(\mathcal{K}\) is the set of known ratings — the sum skips every missing cell, which is the whole trick. \(\lambda\) is the regularization strength that stops the model from memorizing the sparse observations; without it, a high \(k\) overfits instantly. The gradient for \(p_u\) is \(-2\,e_{ui}\,q_i + 2\lambda p_u\) with error \(e_{ui} = r_{ui} - \hat r_{ui}\) — the update each known rating sends to its user and item vectors. This is "Funk SVD", Simon Funk's Netflix-Prize method, and it is still the textbook recommender.

The cold-start caveat, honestly. A new user or item with no ratings has no factors to estimate — the model defaults to biases alone, which is to say it guesses the average. Pure collaborative filtering is also blind to content (genre, text, image) and prone to popularity bias. Production systems since the mid-2010s blend factorization with content features and, increasingly, deep retrieval models — but a two-tower neural recommender is still computing a dot product of learned user and item embeddings. The latent-factor idea did not go away; it got an encoder in front of it.

With \(\mu = 3.5\), user bias \(b_u = -0.2\), item bias \(b_i = 0.1\), and latent vectors \(p_u = (1,\, 0.5)\), \(q_i = (0.2,\, 0.6)\), what rating does EQ M13.4 predict for \(\hat r_{ui}\)?
Dot product \(p_u^{\top}q_i = (1)(0.2) + (0.5)(0.6) = 0.2 + 0.3 = 0.5\). Then \(\hat r_{ui} = \mu + b_u + b_i + p_u^{\top}q_i = 3.5 - 0.2 + 0.1 + 0.5\): step by step \(3.5 - 0.2 = 3.3\), \(+0.1 = 3.4\), \(+0.5 = \) 3.9. The biases nudge a baseline 3.5 down for a harsh rater and up for a well-liked film; the interaction term adds the personalized lift.
PYTHON · RUNNABLE IN-BROWSER
# Funk SVD: matrix factorization by gradient descent on OBSERVED entries (EQ M13.5)
import numpy as np
rng = np.random.default_rng(0)
R = np.array([                                   # 6 users x 5 movies; NaN = unrated
 [5, 3, np.nan, 1, np.nan],
 [4, np.nan, np.nan, 1, 2],
 [1, 1, np.nan, 5, np.nan],
 [1, np.nan, np.nan, 4, 5],
 [np.nan, 1, 5, 4, np.nan],
 [2, 1, 4, np.nan, 5]], float)
mask = ~np.isnan(R)                              # True where a rating is known
k, lr, lam = 2, 0.02, 0.1
U = rng.normal(0, .1, (R.shape[0], k))          # user factors P
V = rng.normal(0, .1, (R.shape[1], k))          # item factors Q
for step in range(4000):
    E = np.where(mask, R - U @ V.T, 0.0)         # error on observed cells only
    U += lr * (E @ V - lam * U)                  # gradient step (EQ M13.5)
    V += lr * (E.T @ U - lam * V)
P = U @ V.T
print(f"train RMSE on observed entries: {np.sqrt(np.nanmean((R - P)[mask] ** 2)):.3f}")
print("reconstructed matrix (blanks are now PREDICTIONS):")
print(np.round(P, 1))
print("\nuser 0's two unrated movies (cols 2, 4) predicted:", np.round(P[0, [2, 4]], 2))
edits are live — break it on purpose
13.4

Non-negative Matrix Factorization

SVD's singular vectors mix positive and negative entries freely, so its factors add and subtract. That makes them mathematically optimal but often uninterpretable: a "component" of a face might be a ghostly blend that only makes sense once you cancel it against another. Non-negative Matrix Factorization (NMF) imposes one extra constraint — every entry of both factors must be \(\ge 0\) — and that constraint changes everything about what the parts look like.

EQ M13.6 — NMF $$ A \approx W H, \qquad A \in \mathbb{R}_{\ge 0}^{m\times n},\; W \in \mathbb{R}_{\ge 0}^{m\times k},\; H \in \mathbb{R}_{\ge 0}^{k\times n} $$
With no subtraction allowed, the only way to build the data is to add up parts. Lee & Seung's 1999 result: applied to a set of face images, NMF discovers localized features — a nose, an eyebrow, a mouth — because a face is literally a sum of its parts, never a part minus another. On text, the \(k\) columns of \(W\) become interpretable topics (clusters of co-occurring words) — NMF is one of the classic topic models. The price: no closed form, and the factorization is not unique.

NMF is fit by multiplicative updates, a pair of element-wise rules that preserve non-negativity automatically (no projection step, no learning rate to tune) and monotonically decrease the reconstruction error:

EQ M13.7 — MULTIPLICATIVE UPDATE RULES $$ H \leftarrow H \odot \frac{W^{\top}A}{W^{\top}WH}, \qquad W \leftarrow W \odot \frac{A H^{\top}}{W H H^{\top}} \qquad (\odot,\, / \text{ element-wise}) $$
Each entry is multiplied by a ratio of "what the data wants" over "what the current model gives". Because every term is non-negative, a non-negative factor stays non-negative — the constraint is enforced by the algebra itself, not bolted on. A tiny \(\varepsilon\) in the denominator avoids division by zero. These converge to a local (not global) minimum, so initialization matters; NNDSVD initialization is the common fix.

SVD vs NMF, the honest trade. SVD gives the provably best low-rank fit (Eckart–Young) with orthogonal, ordered, unique factors — ideal when you want compression or principal directions. NMF gives a usually-worse fit with non-orthogonal, unordered, non-unique factors — but ones a human can read as additive parts. Choose SVD/PCA when you want the optimal subspace; choose NMF when you want interpretable, parts-based components and the data is naturally non-negative (counts, intensities, spectra).

INSTRUMENT M13.3 — NMF PARTS DECOMPOSITIONADD-ONLY PARTS · EQ M13.6-7
RECONSTRUCTION ERR
ALL ENTRIES ≥ 0
PARTS DISCOVERED
The data (left) is a set of glyphs built from a few overlapping strokes. NMF with \(k\) parts finds those strokes (middle, the columns of \(W\)) and reconstructs each glyph as a non-negative sum of them (right). Set \(k\) to the true number of strokes and the parts snap into clean, localized pieces; set it too low and parts get smeared together. Drag iterations from 0 to watch the multiplicative updates (EQ M13.7) drive the error down while every entry stays non-negative.
13.5

PCA as SVD; the connections

You met Principal Component Analysis as variance-hunting in Chapter 05. Here is the secret it shares with everything above: PCA is just the SVD of the centered data matrix. Center each column to mean zero (call the result \(X_c\)); then the principal components are the right singular vectors, and the variance along each is the squared singular value, scaled by the sample count.

EQ M13.8 — PCA = SVD OF CENTERED DATA $$ X_c = U\Sigma V^{\top} \;\Longrightarrow\; \underbrace{\tfrac{1}{m-1}X_c^{\top}X_c}_{\text{covariance } C} = V\,\frac{\Sigma^2}{m-1}\,V^{\top}, \qquad \lambda_i = \frac{\sigma_i^2}{m-1} $$
The eigenvectors of the covariance \(C\) are the right singular vectors \(V\); the eigenvalues are \(\lambda_i = \sigma_i^2/(m-1)\) — the variance captured by component \(i\). So "directions of maximum variance" (PCA) and "best low-rank approximation" (truncated SVD) are the same computation. The energy-retained curve from §13.2 is identical to PCA's explained-variance ratio. In practice you run the SVD directly on \(X_c\): it is more numerically stable than forming \(C\) and eigendecomposing it.

That equivalence is the unifying thread of this chapter. The same decomposition, read three ways, becomes three tools:

You want…The factorizationWhat the factors mean
Best low-rank fit / compressiontruncated SVD \(A_k\)orthogonal, ordered, optimal (Eckart–Young)
Principal directions / decorrelationSVD of centered \(X_c\)PCA components = right singular vectors
Fill missing entries (recommend)Funk SVD on observed \(\mathcal{K}\)user/item latent factors \(p_u, q_i\)
Interpretable additive partsNMF \(WH\), \(W,H\ge 0\)topics, strokes, spectra — parts you can name

Where this reaches. Latent Semantic Analysis is truncated SVD of a term–document matrix. Classical word embeddings (GloVe, and the implicit factorization behind word2vec) are matrix factorizations of co-occurrence statistics. Spectral clustering factorizes a graph Laplacian. Image and video codecs lean on related transforms. The low-rank prior — "a few hidden factors explain most of the data" — is one of the most reusable assumptions in all of machine learning, and factorization is how you cash it in.

NEXT

Factorization compresses one matrix into a few smart parts; ensembles compress many weak models into one strong vote. Chapter 14: bagging, boosting, and stacking — why a forest of mediocre trees beats one clever one, and how the bias–variance trade-off plays out when you combine predictors instead of features.

13.R

References

  1. Koren, Y., Bell, R. & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8) — the canonical write-up of the latent-factor model and biases (EQ M13.4–M13.5), from the Netflix-Prize winners.
  2. Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 — NMF and the parts-based decomposition of §13.4 (EQ M13.6).
  3. Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika 1(3) — the theorem that makes truncated SVD the best low-rank fit (EQ M13.3).
  4. Lee, D. D. & Seung, H. S. (2001). Algorithms for Non-negative Matrix Factorization. NIPS 13 — the multiplicative update rules of EQ M13.7 and their convergence guarantee.
  5. Halko, N., Martinsson, P.-G. & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review 53(2) — randomized SVD, how truncated factorizations are actually computed at scale.
  6. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990). Indexing by Latent Semantic Analysis. JASIS 41(6) — LSA: truncated SVD of a term–document matrix, the §13.5 connection to embeddings.
  7. Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NIPS 27 — proof that word2vec's skip-gram is implicitly factorizing a shifted PMI matrix.