Vectors, matrices & the operations that matter
A vector is an ordered list of numbers, and you should hold two pictures of it at once: an arrow from the origin in \(\mathbb{R}^n\), and a single data point — one row of your dataset, one word embedding, one image flattened. A matrix \(A \in \mathbb{R}^{m \times n}\) is a grid of numbers, and it too wears two hats: a table of data (rows are examples, columns are features), and — the view this chapter cares about — a function that takes a vector in and returns a vector out.
Three operations carry essentially all the weight. Scaling and addition let you form linear combinations \(\alpha\mathbf{u} + \beta\mathbf{v}\); the set of all such combinations of some vectors is their span, a flat subspace through the origin. Matrix–vector multiplication applies the map. And matrix–matrix multiplication composes two maps — its only subtlety is that inner dimensions must agree and order matters: \(AB \neq BA\) in general.
Matrix multiplication is associative and distributive but, crucially, not commutative: rotating then stretching is not the same as stretching then rotating. The transpose \(A^\top\) flips rows and columns and obeys \((AB)^\top = B^\top A^\top\). The identity \(I\) leaves vectors untouched, and an inverse \(A^{-1}\) (when it exists — only for square, full-rank \(A\)) undoes the map. Most matrices in machine learning are not invertible, which is the whole reason §6.5 exists.
Half of all numerical bugs are shape bugs. For \(AB\) to be defined, \(A\) must be \(m\times k\) and \(B\) must be \(k\times n\) — the inner \(k\) must match, and the result is \(m\times n\). When a model crashes at 3 a.m., the first thing a practitioner prints is .shape. Reading every product right-to-left as "a map applied to the output of another map" makes the dimensions self-checking.
Norms, dot products & projections
To do geometry you need length and angle. The dot product supplies both: it is the engine behind similarity scores, attention logits, least-squares, and the kernel trick. The L2 (Euclidean) norm is a vector's length; the dot product of two unit vectors is the cosine of the angle between them.
The geometric companion to the dot product is the projection. To project \(\mathbf{x}\) onto the direction of \(\mathbf{a}\) is to find the point on the line through \(\mathbf{a}\) closest to \(\mathbf{x}\) — the "shadow" \(\mathbf{x}\) casts on that line. This single idea, generalized to projecting onto the column space of a matrix, is least-squares regression.
# Norms, dot products, cosine similarity, and projection (EQ S6.2-S6.3)
import numpy as np
u = np.array([3.0, 4.0])
v = np.array([4.0, -3.0])
x = np.array([2.0, 1.0])
print("||u||_2 (sqrt of sum of squares):", np.linalg.norm(u)) # -> 5.0
print("||u||_1 (sum of abs) :", np.linalg.norm(u, 1)) # -> 7.0
print("||u||_inf (max abs) :", np.linalg.norm(u, np.inf))
dot = u @ v
print("\nu . v :", dot, " -> orthogonal" if abs(dot) < 1e-9 else "")
print("cosine(u, v) :", (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
# project x onto the direction of u
coef = (u @ x) / (u @ u)
proj = coef * u
resid = x - proj
print("\nproj_u(x) :", proj.round(4))
print("residual :", resid.round(4))
print("residual . u :", round(float(resid @ u), 12), " <- exactly orthogonal")
A matrix as a linear map; rank & null space
Here is the conceptual hinge of the chapter. A matrix \(A\) is a linear map \(\mathbf{x}\mapsto A\mathbf{x}\): it sends lines to lines, keeps the origin fixed, and respects linear combinations, \(A(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha A\mathbf{u} + \beta A\mathbf{v}\). Geometrically a \(2\times 2\) matrix can rotate, scale, shear, reflect, or flatten the plane — and nothing else. The instrument below lets you grab the four entries and watch the unit grid deform.
Two subspaces describe everything a map does. The column space (range) is the set of all outputs \(A\mathbf{x}\) — by EQ S6.1 it is precisely the span of the columns. The null space (kernel) is the set of inputs that get crushed to zero, \(A\mathbf{x}=\mathbf{0}\). The dimension of the column space is the rank — the number of genuinely independent directions the map can produce.
For a \(2\times 2\) map the determinant \(\det A = ad - bc\) (for \(A = \left[\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right]\)) is the signed area-scaling factor: a unit square of area 1 becomes a parallelogram of area \(|\det A|\). A determinant of zero means the map flattens the plane onto a line (or a point) — it loses a dimension, drops rank, and cannot be inverted. A negative determinant means the map also flips orientation (a reflection).
Eigenvalues & eigenvectors — the invariant directions
Most directions get rotated when you apply a matrix. A precious few do not: the map only stretches them, leaving their line untouched. Those special directions are the eigenvectors, and the stretch factors are the eigenvalues. They are the natural coordinate system of the map — the axes along which its behavior is pure scaling.
Eigen-decomposition is everywhere once you know its face. PCA diagonalizes the covariance matrix; its eigenvectors are the principal axes of the data cloud and the eigenvalues are the variance along each. PageRank is the dominant eigenvector of the web's link matrix. The spectral theorem guarantees that any symmetric matrix \(A = A^\top\) — covariance matrices, graph Laplacians, Gram matrices, Hessians — has real eigenvalues and a full set of orthogonal eigenvectors, which is what makes those objects so well-behaved. A symmetric matrix is positive-definite exactly when all its eigenvalues are positive: the condition for a strictly convex quadratic and a unique loss minimum (Stats 02).
Not every matrix is so tidy. A real matrix can have complex eigenvalues — a pure rotation \(\left[\begin{smallmatrix}0&-1\\1&0\end{smallmatrix}\right]\) has eigenvalues \(\pm i\) and no real eigenvector, because it turns every direction. Non-symmetric matrices may also be defective: they lack a full set of independent eigenvectors and cannot be diagonalized at all. The SVD (§6.5) sidesteps every one of these pathologies — it exists for any matrix, square or not, real or rank-deficient — which is why it, not eigen-decomposition, is the workhorse of applied linear algebra.
# Power iteration: find the dominant eigenvector, compare to numpy.linalg.eig
import numpy as np
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
[1.0, 2.0]]) # symmetric: eigenvalues 3 and 1
v = rng.normal(size=2) # random start
v /= np.linalg.norm(v)
for it in range(50): # repeatedly apply A and renormalize
w = A @ v
v = w / np.linalg.norm(w)
lam = v @ (A @ v) # Rayleigh quotient -> the eigenvalue
# numpy's reference decomposition
vals, vecs = np.linalg.eig(A)
top = np.argmax(np.abs(vals))
print("power-iteration eigenvalue :", round(float(lam), 6))
print("numpy top eigenvalue :", round(float(vals[top].real), 6))
print("power-iteration eigenvector:", np.round(np.abs(v), 4))
print("numpy top eigenvector :", np.round(np.abs(vecs[:, top]), 4))
print("A v - lambda v (~0) :", np.round(A @ v - lam * v, 6))
The Singular Value Decomposition — the master factorization
Eigen-decomposition is fussy: it wants square, ideally symmetric, non-defective matrices. The Singular Value Decomposition has no such demands. Every matrix — rectangular, rank-deficient, whatever — factors as a rotation, a pure axis-aligned scaling, and another rotation:
The SVD's superpower is optimal compression. Keep only the largest \(k\) singular values — zero out the rest — and you get the best possible rank-\(k\) approximation of \(A\), in a precise sense. This is the Eckart–Young theorem, one of the most useful results in all of applied mathematics:
The connection to PCA is exact: center your data matrix, take its SVD, and the right singular vectors \(\mathbf{v}_i\) are the principal components, with variance \(\sigma_i^2/(N-1)\) along each. Computing PCA via the SVD rather than by forming \(X^\top X\) is also numerically preferable — squaring the matrix squares its condition number and throws away precision.
# SVD a small matrix, reconstruct with the top-k singular values, print the error
import numpy as np
rng = np.random.default_rng(1)
# a 6x5 matrix with a deliberately low-rank core plus a little noise
core = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5)) # true rank 2
A = core + 0.05 * rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", np.round(s, 3))
normA = np.linalg.norm(A) # Frobenius norm
for k in range(1, len(s) + 1):
Ak = (U[:, :k] * s[:k]) @ Vt[:k] # rank-k reconstruction
err_measured = np.linalg.norm(A - Ak) / normA
err_formula = np.sqrt((s[k:] ** 2).sum()) / normA # Eckart-Young, EQ S6.7
print(f"k={k}: rel.error {err_measured:.4f} "
f"(formula sqrt(sum sigma_i>k^2): {err_formula:.4f})")
print("\nerror collapses after k=2 -- the matrix is essentially rank 2,")
print("and the measured error matches the Eckart-Young formula exactly.")
plot_xy(list(range(1, len(s) + 1)), list(s)) # the singular-value spectrum
You now have the static geometry; next comes the dynamics. A linear map applied over and over — a stochastic transition matrix stepping a probability distribution forward — is a Markov chain, and its long-run behavior is decided by exactly the eigenstructure you just met: the dominant eigenvalue is 1, and its eigenvector is the stationary distribution. Chapter 07 turns the matrix loose in time.
References
- Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press.
- Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
- Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218.
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.
- Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566.