AI // ENCYCLOPEDIA / VOL I / ML FOUNDATIONS / 02 / LINEAR REGRESSION INDEX NEXT: CLASSIFICATION →
VOLUME I — FOUNDATIONS OF ML · CHAPTER 02 / 08

Linear Regression & Gradient Descent

One model, a weighted sum. One loss, squared error. One algorithm, step downhill. Linear regression is the smallest setting in which the full machinery of modern machine learning runs end to end. The training loop you learn here is, line for line, the loop that trains GPT; every later model swaps in a richer function and a different error.

LEVELINTRO READING TIME≈ 24 MIN BUILDS ONCH 01 INSTRUMENTSDESCENT STEPPER · LR SWEEP
2.1

The model and the loss surface

A linear model predicts by taking a weighted sum of the input features: each feature gets one learned number saying how much it matters and in which direction. Stack all \(n\) training examples as rows of a matrix \(X\), and the whole dataset's predictions become a single matrix–vector product. One bookkeeping trick makes the intercept disappear as a special case: append a constant feature of 1 to every example, and the bias \(b\) becomes just another weight.

EQ M2.1 — THE MODEL, VECTORIZED $$ \hat{y} \;=\; X w, \qquad X \in \mathbb{R}^{n \times (d+1)}, \quad w \in \mathbb{R}^{d+1} $$
Row \(i\) of \(X\) is example \(i\)'s features plus the appended 1; \(\hat{y}_i = x_i^\top w\) is its prediction. The entire model is \(d+1\) numbers. Everything this volume builds — and everything Volume II builds — replaces this product with a richer function \(h(x; w)\), but the surrounding machinery never changes.

How wrong is a given \(w\)? Chapter 01's answer was a loss function; here the natural one is mean squared error — average the squared miss over the dataset. Squaring does three jobs at once: misses in both directions count, large misses count disproportionately, and the result is smooth everywhere, so it has a well-defined slope we can follow.

EQ M2.2 — MSE: THE LOSS IS A BOWL $$ \mathcal{L}(w) \;=\; \frac{1}{n}\,\lVert Xw - y \rVert^2 \;=\; \frac{1}{n}\sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2 $$
Because \(\hat{y}\) is linear in \(w\) and the error is squared, \(\mathcal{L}\) is a quadratic bowl in weight space: slice it at any height and you get nested ellipses. The bowl may be stretched, squashed, or tilted — that shape decides everything in §2.4 and §2.5 — but it has no ripples and no false valleys.
A linear model produces predictions \(Xw = (2, 5, 7)\) for three examples whose targets are \(y = (3, 3, 8)\). What is the MSE \(\tfrac{1}{n}\lVert Xw - y\rVert^2\) (EQ M2.2)?
Residuals \(Xw - y = (2-3,\ 5-3,\ 7-8) = (-1,\ 2,\ -1)\). Squared: \(1, 4, 1\). Sum \(= 6\); divide by \(n = 3\): \(6/3 = \) 2.

Convexity, in words. A loss surface is convex when the straight line between any two points on it never dips below the surface — no hidden dimples for an optimizer to fall into. The practical guarantee is the one that matters: every downhill path leads to the same lowest point. Wherever you start and however clumsily you descend, if you keep going down, you arrive. Linear regression with MSE is convex; deep networks are emphatically not — which makes this chapter the one place you can watch optimization work with the guarantee switched on, before later chapters take it away.

One honest wrinkle: if two features are exact copies of each other (or one is a linear combination of others), the bowl's floor becomes a flat trench — infinitely many \(w\) achieve the same minimum loss. Convexity still holds; uniqueness doesn't. Real pipelines hit this with duplicated columns and one-hot encodings more often than you'd think.

2.2

The closed form — and why we don't use it at scale

A smooth bowl has exactly one flat point: the bottom. Setting the gradient of EQ M2.2 to zero and solving gives the minimizer outright — no iteration, no hyperparameters, no luck. This is the normal equation, and linear regression is nearly alone among learning algorithms in having one:

EQ M2.3 — THE NORMAL EQUATION $$ \nabla \mathcal{L}(w^\star) = 0 \quad\Longrightarrow\quad w^\star \;=\; \left( X^\top X \right)^{-1} X^\top y $$
\(X^\top X\) is the \((d{+}1)\times(d{+}1)\) matrix of feature co-occurrences; \(X^\top y\) measures how each feature co-varies with the target. In one matrix solve, the exact bottom of the bowl. In practice you never form the inverse — np.linalg.solve or QR-based lstsq do the same job with far better numerical behavior.

So why does the rest of machine learning iterate instead? Three reasons, in increasing order of importance:

ConcernNormal equationGradient descent
ComputeO(nd² + d³) — the d³ solve is fatal once d hits 10⁵+O(nd) per pass; scales to billions of parameters
Memorymust hold the d×d matrix XᵀXonly w and one gradient
Numericsconditioning of XᵀX is the square of X's — correlated features amplify rounding errortolerant; error shrinks geometrically
Data accessneeds all data at onceworks on streams and mini-batches
Generalityexists only because h is linear and the loss quadraticworks for any differentiable h — including a transformer

The last row is the real verdict. The normal equation is a one-off gift of linear algebra: change the model to anything nonlinear — a logistic curve, a two-layer network — and the closed form vanishes forever. Gradient descent asks only that the loss have a slope. We learn the closed form not to use it, but to keep it as an oracle: in §2.5 we'll let it grade gradient descent's answer.

2.3

Gradient descent: feel the slope, step downhill

Imagine standing on the bowl blindfolded. You can't see the bottom, but at your feet you can feel which way is steepest. The gradient \(\nabla \mathcal{L}\) is that feeling, made precise: the vector of partial derivatives, pointing in the direction of steepest increase. So walk the other way. For MSE the gradient comes out of the chain rule in one line, and it has a shape worth memorizing:

EQ M2.4 — THE GRADIENT OF MSE $$ \nabla_w \mathcal{L} \;=\; \frac{2}{n}\, X^\top \left( Xw - y \right) $$
Read it from the inside out: \(Xw - y\) is the vector of residuals — current prediction minus truth, one per example. \(X^\top(\cdot)\) then credits each feature with the residuals of the examples where it was active. The gradient is the data's errors, projected back onto the features that caused them. That sentence, generalized through many layers by backpropagation, is all of deep learning's credit assignment.
Two examples with the bias trick: rows \(x_1 = (1, 1)\), \(x_2 = (3, 1)\); targets \(y = (4, 6)\); current weights \(w = (1, 1)\). Using EQ M2.4, what is the first component of the gradient \(\nabla_w\mathcal{L}\) (the feature weight)?
Predictions \(Xw = (1\cdot1 + 1\cdot1,\ 1\cdot3 + 1\cdot1) = (2, 4)\). Residuals \(Xw - y = (2-4,\ 4-6) = (-2, -2)\). Project onto the feature column \((1, 3)\): \(1\cdot(-2) + 3\cdot(-2) = -8\). Scale by \(2/n = 2/2 = 1\): the first gradient component is −8. Negative, so the update rule will push this weight up.
EQ M2.5 — THE UPDATE RULE $$ w \;\leftarrow\; w \;-\; \eta\, \nabla_w \mathcal{L}(w) $$
\(\eta\) (eta) is the learning rate: how far to step in the downhill direction. This single line — compute the slope, take a step — is the update inside every optimizer in this encyclopedia. Adam, momentum, and friends decorate it; none replace it.
Take \(w = (1, 0)\), gradient \(\nabla\mathcal{L} = (-3, -2)\), and learning rate \(\eta = 0.2\). After one gradient-descent step, what is the new first weight \(w_1'\)?
\(w_1' = w_1 - \eta\,\nabla_1 = 1 - 0.2\cdot(-3) = 1 + 0.6 = \) 1.6. Subtracting a negative gradient pushes the weight up — exactly what a too-low prediction should do.

That's the whole algorithm: predict, measure residuals, push the error back through the features, step, repeat. Watch it run on a real (toy) regression below — the left panel shows the bowl from above as loss contours in \((w, b)\) space; the right panel shows what each position means: a candidate line through the data.

INSTRUMENT M2.1 — DESCENT STEPPEREQ M2.4 + M2.5 · LIVE ON 60 POINTS
STEP
0
MSE
η STATUS
η CRITICAL (THIS DATA)
STEP applies EQ M2.5 once; AUTO loops it. At the default η the path makes an L: it drops fast down the steep wall, then crawls along the valley floor. Raise η toward the critical value and the path starts ricocheting across the valley; push past it and every step lands higher than the last — divergence, live. The critical value isn't a constant of nature: it is computed from this dataset's curvature (§2.4).

The same loop in real code — numpy, no libraries, nothing hidden. Run it, then break it: try eta = 0.9, or delete the 2.0 / n and watch the effective step size change.

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(0)
n = 80
x = rng.uniform(-2, 2, n)
y = 1.7 * x - 0.4 + rng.normal(0, 0.5, n)     # truth: slope 1.7, intercept -0.4

X = np.column_stack([x, np.ones(n)])          # bias trick: w = [slope, intercept]
w = np.zeros(2)
eta = 0.1

steps, losses = [], []
for t in range(201):
    r = X @ w - y                             # residuals
    mse = (r @ r) / n
    steps.append(t); losses.append(mse)
    if t % 20 == 0:
        print(f"step {t:3d}   mse {mse:7.4f}   w = {np.round(w, 3)}")
    grad = 2.0 / n * (X.T @ r)                # EQ M2.4
    w = w - eta * grad                        # EQ M2.5

plot_xy(steps, losses)
edits are live — break it on purpose

The loss curve you just plotted has the signature shape of healthy gradient descent on a convex problem: steep early progress, then a long geometric glide toward the noise floor — it never reaches zero, because the data contains genuine noise no line can explain. A training curve that does hit zero should make you suspicious (Chapter 01: memorization).

2.4

The learning rate: the most important hyperparameter

Everything about gradient descent's behavior is decided by one number. Too small, and you inch downhill for geological time. Too large, and each step overshoots the bottom, lands on the far wall higher than it started, and the loss explodes exponentially. The boundary between those fates is sharp, and on a quadratic bowl you can compute it exactly. Take the one-dimensional case: a parabola with curvature \(\lambda\) (its second derivative). One algebra step shows each update multiplies the remaining error by a fixed factor:

EQ M2.6 — THE CONVERGENCE BAND $$ w_{t+1} - w^\star \;=\; \left( 1 - \eta\lambda \right) \left( w_t - w^\star \right) \qquad\Longrightarrow\qquad \text{stable} \iff 0 < \eta < \frac{2}{\lambda} $$
If \(|1-\eta\lambda| < 1\) the error shrinks every step; the moment \(\eta\) exceeds \(2/\lambda\), the factor passes \(-1\) and the error flips sign and grows — the overshooting spiral you saw in Instrument M2.1. Between \(1/\lambda\) and \(2/\lambda\) the factor is negative with magnitude below 1: the iterate converges while hopping from side to side of the minimum. With many dimensions, each direction of the bowl has its own curvature, and η must respect the steepest one — while progress is paced by the shallowest. That tension is the whole story of §2.5.
A one-dimensional loss bowl has curvature \(\lambda = 5\). By EQ M2.6, what is the critical learning rate — the largest \(\eta\) for which gradient descent still converges?
Stability requires \(\eta < 2/\lambda\). The boundary is \(2/\lambda = 2/5 = \) 0.4. Any \(\eta\) above this multiplies the error by a factor below \(-1\) every step, and the loss explodes.
With learning rate \(\eta = 0.3\) and curvature \(\lambda = 2\), what is the convergence factor \(1 - \eta\lambda\) by which each step shrinks the remaining error?
\(1 - \eta\lambda = 1 - 0.3\cdot 2 = 1 - 0.6 = \) 0.4. Its magnitude is below 1, so the error shrinks 60% per step — smooth, monotone convergence.

Four learning rates, one problem, eighty steps each — the canonical picture every practitioner carries in their head:

INSTRUMENT M2.2 — LR SWEEPSAME DATA AS M2.1 · 80 GD STEPS PER η · LOG SCALE
Each curve is gradient descent actually run in your browser on the M2.1 dataset, from the same starting point. η = 0.01 descends — but after 80 steps it still sits an order of magnitude above the noise floor. η = 0.1 glides down and settles on it. η = 0.5 lives in EQ M2.6's zigzag band: on this perfectly quadratic bowl its loss still falls every step (fast, even), but its parameter path in M2.1 ricochets wall to wall — and on real, non-convex surfaces that ricochet turns into loss spikes. η = 1.1 is past critical: every step multiplies the error, and the curve exits the chart within ten steps.

Beyond the bowl. On a deep network's non-convex surface no single \(\lambda\) exists — curvature varies wildly across the landscape and across training time. That's why real recipes use learning-rate schedules (a gentle warmup so early chaotic gradients don't launch the weights, then a long decay) and per-coordinate adaptive optimizers like Adam, which effectively give every weight its own η. The full machinery appears with pre-training at scale in Vol II · Chapter 04. But every schedule and every optimizer is still negotiating with EQ M2.6's constraint — they never escape it.

2.5

Features and preprocessing: why scale decides convergence

Here is the trap every beginner falls into once. Suppose one feature is "number of rooms" (range 1–10) and another is "square footage" (range 500–5,000). The loss bowl in those two weight directions has wildly different curvatures — the square-footage direction is roughly \((500)^2\) times steeper, because curvature scales with the square of the feature's magnitude. EQ M2.6 says η must stay below \(2/\lambda\) for the steepest direction; at that η, the shallow direction's error shrinks by a factor so close to 1 that convergence takes millions of steps. One η, shared by all weights, can only be as brave as the steepest direction allows. Geometrically: the bowl is a canyon, and gradient descent ping-pongs between its walls while drifting imperceptibly along its floor.

The fix is almost embarrassingly simple — standardization: shift each feature to zero mean and rescale to unit variance, \(x' = (x - \mu)/\sigma\). All directions of the bowl now have comparable curvature, the contours become near-circles, and a single η serves every weight. For gradient descent this is not a nicety; it is frequently the difference between converging in a hundred steps and not converging at all. (Its descendants — BatchNorm, LayerNorm — apply the same idea inside deep networks, and Volume II leans on them constantly.)

The leakage rule, again. Compute \(\mu\) and \(\sigma\) on the training set only, then apply those frozen values to validation and test data. Estimating them on the full dataset leaks test-set statistics into training — Chapter 01's cardinal sin, in its most common disguise.

Proof, in code: the normal equation and gradient descent — run on standardized features, then un-standardized back — agree to four decimals. The oracle approves of the iterator. Then sabotage it: set eta = 0.1 on the raw features instead of the standardized ones, and watch the explosion EQ M2.6 predicts (the raw feature's curvature puts critical η near 0.03).

PYTHON · RUNNABLE IN-BROWSER
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n)                     # raw, unscaled feature
y = 3.0 * x + 7.0 + rng.normal(0, 2.0, n)     # truth: slope 3, intercept 7

X = np.column_stack([x, np.ones(n)])

# --- oracle: normal equation (EQ M2.3), via solve, never the inverse ---
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# --- iterator: GD on standardized x, then map weights back ---
mu, sd = x.mean(), x.std()
Xs = np.column_stack([(x - mu) / sd, np.ones(n)])
w = np.zeros(2)
for t in range(500):
    grad = 2.0 / n * (Xs.T @ (Xs @ w - y))
    w = w - 0.1 * grad
w_gd = np.array([w[0] / sd, w[1] - w[0] * mu / sd])   # un-standardize

print("normal equation  :", np.round(w_exact, 4))
print("gradient descent :", np.round(w_gd, 4))
print("max difference   :", float(np.abs(w_exact - w_gd).max()))
# now try GD with eta=0.1 directly on raw X — it diverges. that gap is this section.
edits are live — break it on purpose
NEXT

You now own the loop: predict with \(h\), measure the loss, follow the gradient, repeat. Every model for the rest of this encyclopedia — logistic regression next chapter, neural networks after that, GPT in Volume II — is exactly this loop with a fancier \(h\) and a loss to match. Chapter 03 makes the first swap: when the target is a category instead of a number, the line becomes a sigmoid, squared error becomes cross-entropy, and the gradient — remarkably — keeps the same residual-times-features shape you memorized in EQ M2.4.

§

Further reading

  • Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. — the first published statement of the method of least squares, the loss this whole chapter minimizes.
  • Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. — ties least squares to the normal distribution and the maximum-likelihood justification for squared-error loss.
  • Cauchy, A.-L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. — the origin of gradient descent: follow the slope downhill to a minimum.
  • Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, Ch. 3. — the modern reference for the normal equations and why the closed form is rarely used at scale.
  • Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. — rigorous treatment of why a convex loss surface has one minimum and how step size governs convergence.
  • Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 3. — connects linear models, basis functions, and feature scaling to the probabilistic view.