Linear Algebra for Machine Learning

6.1

Vectors, matrices & the operations that matter

A vector is an ordered list of numbers, and you should hold two pictures of it at once: an arrow from the origin in $\mathbb{R}^n$, and a single data point — one row of your dataset, one word embedding, one image flattened. A matrix $A \in \mathbb{R}^{m \times n}$ is a grid of numbers, and it too wears two hats: a table of data (rows are examples, columns are features), and — the view this chapter cares about — a function that takes a vector in and returns a vector out.

Three operations carry essentially all the weight. Scaling and addition let you form linear combinations $\alpha\mathbf{u} + \beta\mathbf{v}$; the set of all such combinations of some vectors is their span, a flat subspace through the origin. Matrix–vector multiplication applies the map. And matrix–matrix multiplication composes two maps — its only subtlety is that inner dimensions must agree and order matters: $AB \neq BA$ in general.

EQ S6.1 — MATRIX–VECTOR PRODUCT (TWO READINGS) $$ (A\mathbf{x})_i \;=\; \sum_{j=1}^{n} A_{ij}\, x_j \qquad\Longleftrightarrow\qquad A\mathbf{x} \;=\; \sum_{j=1}^{n} x_j\, \mathbf{a}_{:,j} $$

The left form is the textbook one: output entry $i$ is the dot product of row $i$ of $A$ with $\mathbf{x}$. The right form is the one that builds intuition: $A\mathbf{x}$ is a linear combination of the columns of $A$, weighted by the entries of $\mathbf{x}$. That single re-reading explains rank, span, and the column space (§6.3) in one move — and it is exactly what a fully-connected layer $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ computes a few billion times a second.

Matrix multiplication is associative and distributive but, crucially, not commutative: rotating then stretching is not the same as stretching then rotating. The transpose $A^\top$ flips rows and columns and obeys $(AB)^\top = B^\top A^\top$. The identity $I$ leaves vectors untouched, and an inverse $A^{-1}$ (when it exists — only for square, full-rank $A$) undoes the map. Most matrices in machine learning are not invertible, which is the whole reason §6.5 exists.

SHAPE DISCIPLINE

Half of all numerical bugs are shape bugs. For $AB$ to be defined, $A$ must be $m\times k$ and $B$ must be $k\times n$ — the inner $k$ must match, and the result is $m\times n$. When a model crashes at 3 a.m., the first thing a practitioner prints is .shape. Reading every product right-to-left as "a map applied to the output of another map" makes the dimensions self-checking.

Let $A$ be $2\times 3$ and $B$ be $3\times 4$. The inner dimensions match (both $3$), so $AB$ is defined. How many entries does the resulting matrix $AB$ have?

An $m\times k$ times a $k\times n$ matrix gives an $m\times n$ result. Here $AB$ is $2\times 4$, so it has $2 \times 4 = $ 8 entries. The shared inner dimension $k=3$ is summed over and does not appear in the output shape.

6.2

Norms, dot products & projections

To do geometry you need length and angle. The dot product supplies both: it is the engine behind similarity scores, attention logits, least-squares, and the kernel trick. The L2 (Euclidean) norm is a vector's length; the dot product of two unit vectors is the cosine of the angle between them.

EQ S6.2 — DOT PRODUCT, NORM & COSINE $$ \mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^{n} u_i v_i = \|\mathbf{u}\|\,\|\mathbf{v}\|\cos\theta, \qquad \|\mathbf{v}\|_2 = \sqrt{\mathbf{v}\cdot\mathbf{v}} = \sqrt{\textstyle\sum_i v_i^2} $$

Two vectors are orthogonal exactly when $\mathbf{u}\cdot\mathbf{v}=0$ ($\cos\theta = 0$). Cosine similarity $\dfrac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$ is the dot product after normalizing length — the default way to compare embeddings, because it cares about direction, not magnitude. The L1 norm $\|\mathbf{v}\|_1 = \sum_i|v_i|$ and the max norm $\|\mathbf{v}\|_\infty = \max_i|v_i|$ are the other two you meet daily; L1 regularization owes its sparsity to the diamond shape of its unit ball.

The geometric companion to the dot product is the projection. To project $\mathbf{x}$ onto the direction of $\mathbf{a}$ is to find the point on the line through $\mathbf{a}$ closest to $\mathbf{x}$ — the "shadow" $\mathbf{x}$ casts on that line. This single idea, generalized to projecting onto the column space of a matrix, is least-squares regression.

EQ S6.3 — PROJECTION ONTO A DIRECTION $$ \mathrm{proj}_{\mathbf{a}}(\mathbf{x}) \;=\; \frac{\mathbf{a}\cdot\mathbf{x}}{\mathbf{a}\cdot\mathbf{a}}\;\mathbf{a}, \qquad\text{and the residual } \mathbf{x} - \mathrm{proj}_{\mathbf{a}}(\mathbf{x}) \perp \mathbf{a} $$

The scalar $\dfrac{\mathbf{a}\cdot\mathbf{x}}{\mathbf{a}\cdot\mathbf{a}}$ is "how many copies of $\mathbf{a}$" you need; multiplying by $\mathbf{a}$ places the shadow on the line. The leftover, the residual, is orthogonal to $\mathbf{a}$ by construction — the defining property that makes projection the unique closest point. Ordinary least squares is exactly this with $\mathbf{a}$ replaced by a whole subspace: $\hat{\mathbf{y}} = X(X^\top X)^{-1}X^\top\mathbf{y}$ projects the targets onto the column space of the design matrix.

What is the L2 (Euclidean) norm of the vector $ \mathbf{v} = (3, 4) $? Compute $ \|\mathbf{v}\|_2 = \sqrt{3^2 + 4^2} $.

$ \|\mathbf{v}\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = $ 5. This is the length of the hypotenuse of a 3-4-5 right triangle.

PYTHON · RUNNABLE IN-BROWSER

# Norms, dot products, cosine similarity, and projection (EQ S6.2-S6.3)
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([4.0, -3.0])
x = np.array([2.0, 1.0])

print("||u||_2 (sqrt of sum of squares):", np.linalg.norm(u))      # -> 5.0
print("||u||_1 (sum of abs)            :", np.linalg.norm(u, 1))    # -> 7.0
print("||u||_inf (max abs)             :", np.linalg.norm(u, np.inf))

dot = u @ v
print("\nu . v        :", dot, " -> orthogonal" if abs(dot) < 1e-9 else "")
print("cosine(u, v) :", (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# project x onto the direction of u
coef = (u @ x) / (u @ u)
proj = coef * u
resid = x - proj
print("\nproj_u(x)    :", proj.round(4))
print("residual     :", resid.round(4))
print("residual . u :", round(float(resid @ u), 12), " <- exactly orthogonal")

edits are live — break it on purpose

6.3

A matrix as a linear map; rank & null space

Here is the conceptual hinge of the chapter. A matrix $A$ is a linear map $\mathbf{x}\mapsto A\mathbf{x}$: it sends lines to lines, keeps the origin fixed, and respects linear combinations, $A(\alpha\mathbf{u}+\beta\mathbf{v}) = \alpha A\mathbf{u} + \beta A\mathbf{v}$. Geometrically a $2\times 2$ matrix can rotate, scale, shear, reflect, or flatten the plane — and nothing else. The instrument below lets you grab the four entries and watch the unit grid deform.

Two subspaces describe everything a map does. The column space (range) is the set of all outputs $A\mathbf{x}$ — by EQ S6.1 it is precisely the span of the columns. The null space (kernel) is the set of inputs that get crushed to zero, $A\mathbf{x}=\mathbf{0}$. The dimension of the column space is the rank — the number of genuinely independent directions the map can produce.

EQ S6.4 — RANK–NULLITY THEOREM $$ \underbrace{\operatorname{rank}(A)}_{\dim(\text{column space})} \;+\; \underbrace{\operatorname{nullity}(A)}_{\dim(\text{null space})} \;=\; n \quad (\text{number of columns of } A \in \mathbb{R}^{m\times n}) $$

Every input dimension is accounted for: it either survives into the output (contributing to rank) or is annihilated (contributing to nullity). A square matrix is invertible iff it is full rank, i.e. nullity $=0$, i.e. $\det A \neq 0$. Row rank always equals column rank — a small miracle worth pausing on. Real data matrices are nearly low-rank: a thousand columns of survey answers might have effective rank 20, because the columns are correlated. That redundancy is exactly what §6.5 exploits.

For a $2\times 2$ map the determinant $\det A = ad - bc$ (for $A = \left[\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right]$) is the signed area-scaling factor: a unit square of area 1 becomes a parallelogram of area $|\det A|$. A determinant of zero means the map flattens the plane onto a line (or a point) — it loses a dimension, drops rank, and cannot be inverted. A negative determinant means the map also flips orientation (a reflection).

INSTRUMENT S6.1 — 2×2 LINEAR-MAP VISUALIZERDRAG ENTRIES · UNIT GRID · EIGENVECTORS

a (col 1, row 1) 2.0

b (col 2, row 1) 1.0

c (col 1, row 2) 1.0

d (col 2, row 2) 2.0

det A (AREA × ORIENT.)

—

RANK

—

REAL EIGENVALUES λ

—

The faint grid is the input plane; the mint grid is its image under $A$. The blue arrows are the images of the standard basis — they are the columns of $A$. When real eigenvalues exist, their eigenvectors are drawn as red rays: directions the map only stretches, never turns. Pull the determinant to zero (try a=2, b=1, c=2, d=1) and the whole plane collapses onto a single line — rank drops to 1, the map is no longer invertible. Default $A = \left[\begin{smallmatrix}2&1\\1&2\end{smallmatrix}\right]$ has eigenvalues 3 and 1 along the diagonals $(1,1)$ and $(1,-1)$.

6.4

Eigenvalues & eigenvectors — the invariant directions

Most directions get rotated when you apply a matrix. A precious few do not: the map only stretches them, leaving their line untouched. Those special directions are the eigenvectors, and the stretch factors are the eigenvalues. They are the natural coordinate system of the map — the axes along which its behavior is pure scaling.

EQ S6.5 — THE EIGENVALUE EQUATION $$ A\mathbf{v} = \lambda\mathbf{v}, \quad \mathbf{v}\neq\mathbf{0} \qquad\Longleftrightarrow\qquad \det(A - \lambda I) = 0 $$

$A\mathbf{v}=\lambda\mathbf{v}$ says: applying $A$ to $\mathbf{v}$ just rescales it by $\lambda$. For a nonzero $\mathbf{v}$ to exist, $A-\lambda I$ must be singular, giving the characteristic polynomial $\det(A-\lambda I)=0$. For a $2\times 2$ matrix this is the tidy quadratic $\lambda^2 - (\operatorname{tr}A)\,\lambda + \det A = 0$, where $\operatorname{tr}A = a+d$. Two facts fall straight out: the eigenvalues sum to the trace and multiply to the determinant.

Eigen-decomposition is everywhere once you know its face. PCA diagonalizes the covariance matrix; its eigenvectors are the principal axes of the data cloud and the eigenvalues are the variance along each. PageRank is the dominant eigenvector of the web's link matrix. The spectral theorem guarantees that any symmetric matrix $A = A^\top$ — covariance matrices, graph Laplacians, Gram matrices, Hessians — has real eigenvalues and a full set of orthogonal eigenvectors, which is what makes those objects so well-behaved. A symmetric matrix is positive-definite exactly when all its eigenvalues are positive: the condition for a strictly convex quadratic and a unique loss minimum (Stats 02).

CAVEAT

Not every matrix is so tidy. A real matrix can have complex eigenvalues — a pure rotation $\left[\begin{smallmatrix}0&-1\\1&0\end{smallmatrix}\right]$ has eigenvalues $\pm i$ and no real eigenvector, because it turns every direction. Non-symmetric matrices may also be defective: they lack a full set of independent eigenvectors and cannot be diagonalized at all. The SVD (§6.5) sidesteps every one of these pathologies — it exists for any matrix, square or not, real or rank-deficient — which is why it, not eigen-decomposition, is the workhorse of applied linear algebra.

INSTRUMENT S6.2 — 2×2 EIGEN-DECOMPOSITION STEPPERtrace · det · CHARACTERISTIC ROOTS · EQ S6.5

a 2.0

b 1.0

c 1.0

d 2.0

trace = λ₁ + λ₂

—

det = λ₁ · λ₂

—

EIGENVALUES

—

Reads off the characteristic equation step by step: trace, determinant, discriminant $\tau^2 - 4\delta$, then the roots $\lambda = \tfrac{1}{2}\big(\tau \pm \sqrt{\tau^2 - 4\delta}\big)$. When the discriminant goes negative the roots turn complex (a rotational map — no real eigenvector). The default $\left[\begin{smallmatrix}2&1\\1&2\end{smallmatrix}\right]$ gives $\tau=4$, $\delta=3$, discriminant $4$, and eigenvalues 3 and 1. Notice trace and determinant always equal the sum and product of the eigenvalues — a free correctness check.

What is the largest eigenvalue of the diagonal matrix $ \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} $?

For a diagonal (or triangular) matrix the eigenvalues are exactly the diagonal entries: here $ \{2, 3\} $. The largest is 3. (Check: trace $= 2+3 = 5 = \lambda_1 + \lambda_2$; determinant $= 2\cdot 3 = 6 = \lambda_1\lambda_2$. ✓)

PYTHON · RUNNABLE IN-BROWSER

# Power iteration: find the dominant eigenvector, compare to numpy.linalg.eig
import numpy as np
rng = np.random.default_rng(0)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric: eigenvalues 3 and 1

v = rng.normal(size=2)               # random start
v /= np.linalg.norm(v)
for it in range(50):                 # repeatedly apply A and renormalize
    w = A @ v
    v = w / np.linalg.norm(w)
lam = v @ (A @ v)                    # Rayleigh quotient -> the eigenvalue

# numpy's reference decomposition
vals, vecs = np.linalg.eig(A)
top = np.argmax(np.abs(vals))

print("power-iteration eigenvalue :", round(float(lam), 6))
print("numpy top eigenvalue       :", round(float(vals[top].real), 6))
print("power-iteration eigenvector:", np.round(np.abs(v), 4))
print("numpy top eigenvector      :", np.round(np.abs(vecs[:, top]), 4))
print("A v - lambda v (~0)        :", np.round(A @ v - lam * v, 6))

edits are live — break it on purpose

6.5

The Singular Value Decomposition — the master factorization

Eigen-decomposition is fussy: it wants square, ideally symmetric, non-defective matrices. The Singular Value Decomposition has no such demands. Every matrix — rectangular, rank-deficient, whatever — factors as a rotation, a pure axis-aligned scaling, and another rotation:

EQ S6.6 — THE SVD $$ A = U\Sigma V^\top, \qquad A \in \mathbb{R}^{m\times n},\; U^\top U = I,\; V^\top V = I,\; \Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \cdots \ge 0) $$

$V$ (right singular vectors) is an orthonormal basis of the input space, $U$ (left singular vectors) of the output space, and the singular values $\sigma_i$ on $\Sigma$ say how much each direction is stretched. Read it as a recipe: $A$ rotates by $V^\top$, scales each axis by $\sigma_i$, then rotates by $U$. The rank of $A$ is just the count of nonzero $\sigma_i$. The singular values are the square roots of the eigenvalues of $A^\top A$, connecting the SVD back to §6.4. Every matrix has one; it is the closest thing linear algebra has to a universal tool.

The SVD's superpower is optimal compression. Keep only the largest $k$ singular values — zero out the rest — and you get the best possible rank-$k$ approximation of $A$, in a precise sense. This is the Eckart–Young theorem, one of the most useful results in all of applied mathematics:

EQ S6.7 — ECKART–YOUNG: BEST LOW-RANK APPROXIMATION $$ A_k = \sum_{i=1}^{k} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^\top \;=\; \arg\min_{\operatorname{rank}(B)\le k} \|A - B\|, \qquad \|A - A_k\|_F = \sqrt{\textstyle\sum_{i>k}\sigma_i^2} $$

No rank-$k$ matrix approximates $A$ better than truncating its SVD — true in both the spectral norm (error $=\sigma_{k+1}$) and the Frobenius norm (error $=\sqrt{\sum_{i>k}\sigma_i^2}$). The reconstruction error is governed entirely by the singular values you threw away. If the spectrum decays fast — as it does for almost all real data — a tiny $k$ captures nearly everything. This single theorem underlies PCA, image compression, latent-semantic indexing, collaborative-filtering recommenders, and the low-rank update at the heart of LoRA fine-tuning (Vol II · CH 06).

The connection to PCA is exact: center your data matrix, take its SVD, and the right singular vectors $\mathbf{v}_i$ are the principal components, with variance $\sigma_i^2/(N-1)$ along each. Computing PCA via the SVD rather than by forming $X^\top X$ is also numerically preferable — squaring the matrix squares its condition number and throws away precision.

INSTRUMENT S6.3 — SVD LOW-RANK APPROXIMATIONRANK-k RECONSTRUCTION · ECKART–YOUNG · EQ S6.7

TARGET

KEPT RANK k 3

FULL RANK

16

RELATIVE ERROR ‖A−Aₖ‖/‖A‖

—

STORAGE vs FULL

—

Left panel is the original $16\times16$ matrix as a heatmap; right panel is its rank-$k$ reconstruction $A_k = \sum_{i\le k}\sigma_i\mathbf{u}_i\mathbf{v}_i^\top$. Drag $k$ and watch the error fall. The smooth ramp is essentially rank 2 — by $k=2$ the error is already near zero. The ring has a longer tail of singular values, so its error decays gradually. The blocky checkerboard hides a surprise: despite looking high-frequency it has only two nonzero singular values (both $\approx 8$), so $k=2$ reconstructs it perfectly — a reminder that visual complexity and algebraic rank are different things. A rank-$k$ factorization stores $k(m+n)$ numbers instead of $mn$: at $k=3$ on a $16\times16$ grid that is 96 vs 256 numbers, and the gap widens enormously at scale. The error reported is exactly $\sqrt{\sum_{i>k}\sigma_i^2}/\|A\|_F$, as Eckart–Young promises.

PYTHON · RUNNABLE IN-BROWSER

# SVD a small matrix, reconstruct with the top-k singular values, print the error
import numpy as np
rng = np.random.default_rng(1)

# a 6x5 matrix with a deliberately low-rank core plus a little noise
core = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))   # true rank 2
A = core + 0.05 * rng.normal(size=(6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", np.round(s, 3))

normA = np.linalg.norm(A)             # Frobenius norm
for k in range(1, len(s) + 1):
    Ak = (U[:, :k] * s[:k]) @ Vt[:k]                  # rank-k reconstruction
    err_measured = np.linalg.norm(A - Ak) / normA
    err_formula  = np.sqrt((s[k:] ** 2).sum()) / normA   # Eckart-Young, EQ S6.7
    print(f"k={k}: rel.error {err_measured:.4f}  "
          f"(formula sqrt(sum sigma_i>k^2): {err_formula:.4f})")

print("\nerror collapses after k=2 -- the matrix is essentially rank 2,")
print("and the measured error matches the Eckart-Young formula exactly.")
plot_xy(list(range(1, len(s) + 1)), list(s))   # the singular-value spectrum

edits are live — break it on purpose

You now have the static geometry; next comes the dynamics. A linear map applied over and over — a stochastic transition matrix stepping a probability distribution forward — is a Markov chain, and its long-run behavior is decided by exactly the eigenstructure you just met: the dominant eigenvalue is 1, and its eigenvector is the stationary distribution. Chapter 07 turns the matrix loose in time.

6.R

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. the standard intuition-first text for the column-space, rank, eigenvalue, and SVD material here.
Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. the canonical numerical reference for the SVD, power iteration, and conditioning.
Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218. the optimal low-rank approximation theorem (EQ S6.7).
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572. the origin of principal-component analysis as projection onto a best-fit subspace (§6.2, §6.5).
Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35(4), 551–566. historical context tracing the SVD from Beltrami and Jordan to its modern role.