The model and the loss surface
A linear model predicts by taking a weighted sum of the input features: each feature gets one learned number saying how much it matters and in which direction. Stack all \(n\) training examples as rows of a matrix \(X\), and the whole dataset's predictions become a single matrix–vector product. One bookkeeping trick makes the intercept disappear as a special case: append a constant feature of 1 to every example, and the bias \(b\) becomes just another weight.
How wrong is a given \(w\)? Chapter 01's answer was a loss function; here the natural one is mean squared error — average the squared miss over the dataset. Squaring does three jobs at once: misses in both directions count, large misses count disproportionately, and the result is smooth everywhere, so it has a well-defined slope we can follow.
Convexity, in words. A loss surface is convex when the straight line between any two points on it never dips below the surface — no hidden dimples for an optimizer to fall into. The practical guarantee is the one that matters: every downhill path leads to the same lowest point. Wherever you start and however clumsily you descend, if you keep going down, you arrive. Linear regression with MSE is convex; deep networks are emphatically not — which makes this chapter the one place you can watch optimization work with the guarantee switched on, before later chapters take it away.
One honest wrinkle: if two features are exact copies of each other (or one is a linear combination of others), the bowl's floor becomes a flat trench — infinitely many \(w\) achieve the same minimum loss. Convexity still holds; uniqueness doesn't. Real pipelines hit this with duplicated columns and one-hot encodings more often than you'd think.
The closed form — and why we don't use it at scale
A smooth bowl has exactly one flat point: the bottom. Setting the gradient of EQ M2.2 to zero and solving gives the minimizer outright — no iteration, no hyperparameters, no luck. This is the normal equation, and linear regression is nearly alone among learning algorithms in having one:
np.linalg.solve or QR-based lstsq do the same job with far better numerical behavior.So why does the rest of machine learning iterate instead? Three reasons, in increasing order of importance:
| Concern | Normal equation | Gradient descent |
|---|---|---|
| Compute | O(nd² + d³) — the d³ solve is fatal once d hits 10⁵+ | O(nd) per pass; scales to billions of parameters |
| Memory | must hold the d×d matrix XᵀX | only w and one gradient |
| Numerics | conditioning of XᵀX is the square of X's — correlated features amplify rounding error | tolerant; error shrinks geometrically |
| Data access | needs all data at once | works on streams and mini-batches |
| Generality | exists only because h is linear and the loss quadratic | works for any differentiable h — including a transformer |
The last row is the real verdict. The normal equation is a one-off gift of linear algebra: change the model to anything nonlinear — a logistic curve, a two-layer network — and the closed form vanishes forever. Gradient descent asks only that the loss have a slope. We learn the closed form not to use it, but to keep it as an oracle: in §2.5 we'll let it grade gradient descent's answer.
Gradient descent: feel the slope, step downhill
Imagine standing on the bowl blindfolded. You can't see the bottom, but at your feet you can feel which way is steepest. The gradient \(\nabla \mathcal{L}\) is that feeling, made precise: the vector of partial derivatives, pointing in the direction of steepest increase. So walk the other way. For MSE the gradient comes out of the chain rule in one line, and it has a shape worth memorizing:
That's the whole algorithm: predict, measure residuals, push the error back through the features, step, repeat. Watch it run on a real (toy) regression below — the left panel shows the bowl from above as loss contours in \((w, b)\) space; the right panel shows what each position means: a candidate line through the data.
The same loop in real code — numpy, no libraries, nothing hidden. Run it, then break it: try eta = 0.9, or delete the 2.0 / n and watch the effective step size change.
import numpy as np
rng = np.random.default_rng(0)
n = 80
x = rng.uniform(-2, 2, n)
y = 1.7 * x - 0.4 + rng.normal(0, 0.5, n) # truth: slope 1.7, intercept -0.4
X = np.column_stack([x, np.ones(n)]) # bias trick: w = [slope, intercept]
w = np.zeros(2)
eta = 0.1
steps, losses = [], []
for t in range(201):
r = X @ w - y # residuals
mse = (r @ r) / n
steps.append(t); losses.append(mse)
if t % 20 == 0:
print(f"step {t:3d} mse {mse:7.4f} w = {np.round(w, 3)}")
grad = 2.0 / n * (X.T @ r) # EQ M2.4
w = w - eta * grad # EQ M2.5
plot_xy(steps, losses)
The loss curve you just plotted has the signature shape of healthy gradient descent on a convex problem: steep early progress, then a long geometric glide toward the noise floor — it never reaches zero, because the data contains genuine noise no line can explain. A training curve that does hit zero should make you suspicious (Chapter 01: memorization).
The learning rate: the most important hyperparameter
Everything about gradient descent's behavior is decided by one number. Too small, and you inch downhill for geological time. Too large, and each step overshoots the bottom, lands on the far wall higher than it started, and the loss explodes exponentially. The boundary between those fates is sharp, and on a quadratic bowl you can compute it exactly. Take the one-dimensional case: a parabola with curvature \(\lambda\) (its second derivative). One algebra step shows each update multiplies the remaining error by a fixed factor:
Four learning rates, one problem, eighty steps each — the canonical picture every practitioner carries in their head:
Beyond the bowl. On a deep network's non-convex surface no single \(\lambda\) exists — curvature varies wildly across the landscape and across training time. That's why real recipes use learning-rate schedules (a gentle warmup so early chaotic gradients don't launch the weights, then a long decay) and per-coordinate adaptive optimizers like Adam, which effectively give every weight its own η. The full machinery appears with pre-training at scale in Vol II · Chapter 04. But every schedule and every optimizer is still negotiating with EQ M2.6's constraint — they never escape it.
Features and preprocessing: why scale decides convergence
Here is the trap every beginner falls into once. Suppose one feature is "number of rooms" (range 1–10) and another is "square footage" (range 500–5,000). The loss bowl in those two weight directions has wildly different curvatures — the square-footage direction is roughly \((500)^2\) times steeper, because curvature scales with the square of the feature's magnitude. EQ M2.6 says η must stay below \(2/\lambda\) for the steepest direction; at that η, the shallow direction's error shrinks by a factor so close to 1 that convergence takes millions of steps. One η, shared by all weights, can only be as brave as the steepest direction allows. Geometrically: the bowl is a canyon, and gradient descent ping-pongs between its walls while drifting imperceptibly along its floor.
The fix is almost embarrassingly simple — standardization: shift each feature to zero mean and rescale to unit variance, \(x' = (x - \mu)/\sigma\). All directions of the bowl now have comparable curvature, the contours become near-circles, and a single η serves every weight. For gradient descent this is not a nicety; it is frequently the difference between converging in a hundred steps and not converging at all. (Its descendants — BatchNorm, LayerNorm — apply the same idea inside deep networks, and Volume II leans on them constantly.)
The leakage rule, again. Compute \(\mu\) and \(\sigma\) on the training set only, then apply those frozen values to validation and test data. Estimating them on the full dataset leaks test-set statistics into training — Chapter 01's cardinal sin, in its most common disguise.
Proof, in code: the normal equation and gradient descent — run on standardized features, then un-standardized back — agree to four decimals. The oracle approves of the iterator. Then sabotage it: set eta = 0.1 on the raw features instead of the standardized ones, and watch the explosion EQ M2.6 predicts (the raw feature's curvature puts critical η near 0.03).
import numpy as np
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n) # raw, unscaled feature
y = 3.0 * x + 7.0 + rng.normal(0, 2.0, n) # truth: slope 3, intercept 7
X = np.column_stack([x, np.ones(n)])
# --- oracle: normal equation (EQ M2.3), via solve, never the inverse ---
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
# --- iterator: GD on standardized x, then map weights back ---
mu, sd = x.mean(), x.std()
Xs = np.column_stack([(x - mu) / sd, np.ones(n)])
w = np.zeros(2)
for t in range(500):
grad = 2.0 / n * (Xs.T @ (Xs @ w - y))
w = w - 0.1 * grad
w_gd = np.array([w[0] / sd, w[1] - w[0] * mu / sd]) # un-standardize
print("normal equation :", np.round(w_exact, 4))
print("gradient descent :", np.round(w_gd, 4))
print("max difference :", float(np.abs(w_exact - w_gd).max()))
# now try GD with eta=0.1 directly on raw X — it diverges. that gap is this section.
You now own the loop: predict with \(h\), measure the loss, follow the gradient, repeat. Every model for the rest of this encyclopedia — logistic regression next chapter, neural networks after that, GPT in Volume II — is exactly this loop with a fancier \(h\) and a loss to match. Chapter 03 makes the first swap: when the target is a category instead of a number, the line becomes a sigmoid, squared error becomes cross-entropy, and the gradient — remarkably — keeps the same residual-times-features shape you memorized in EQ M2.4.
Further reading
- Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. — the first published statement of the method of least squares, the loss this whole chapter minimizes.
- Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. — ties least squares to the normal distribution and the maximum-likelihood justification for squared-error loss.
- Cauchy, A.-L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. — the origin of gradient descent: follow the slope downhill to a minimum.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, Ch. 3. — the modern reference for the normal equations and why the closed form is rarely used at scale.
- Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. — rigorous treatment of why a convex loss surface has one minimum and how step size governs convergence.
- Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 3. — connects linear models, basis functions, and feature scaling to the probabilistic view.