08 · Backpropagation & Optimization

8.1

The credit-assignment problem

A network maps inputs through millions of weights to a single scalar loss. When that loss is bad, which weights are at fault, and by how much? That is credit assignment, and it is the whole problem of learning. The output layer's culpability is easy to see — it touched the answer directly. But a weight three layers deep influenced the loss only through everything stacked above it; its blame arrives diluted, rerouted, and mixed with everyone else's.

The brute-force answer exists and is worth respecting: nudge one weight by $\varepsilon$, re-run the network, and watch the loss move — $\partial L / \partial \theta_i \approx (L(\theta_i + \varepsilon) - L(\theta_i - \varepsilon)) / 2\varepsilon$. It is exactly correct in the limit and catastrophically expensive: two full forward passes per parameter, per step. For GPT-class models that is trillions of forward passes to compute what backpropagation delivers in roughly the cost of one. The brute-force method survives in one honorable role — as the referee that checks backprop implementations (§8.4) — and nowhere else.

Backpropagation was applied to neural networks and popularized by Rumelhart, Hinton and Williams in 1986 (the underlying reverse-mode differentiation is older — Linnainmaa, 1970). It ended the seventeen-year winter that Chapter 07's perceptron proof began. The insight is not deep math; it is deep bookkeeping: the chain rule, applied once per node of a graph, in the right order, sharing every intermediate result.

8.2

The chain rule on a computational graph

Stop thinking of a network as a formula and start thinking of it as a computational graph: a directed graph in which every node is a primitive operation (multiply, add, $\sigma$, square) and every edge carries a value forward. The crucial move is to label each edge with its local derivative — not a symbolic expression but a number, evaluated at the values that just flowed through. The edge from $z$ into $a = \sigma(z)$ is labeled $\partial a / \partial z = a(1-a)$: one number, known the moment the forward pass computes $a$.

EQ M8.1 — THE CHAIN RULE, GRAPH FORM $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x} \qquad\Longrightarrow\qquad \frac{\partial L}{\partial v} \;=\; \sum_{c \,\in\, \mathrm{children}(v)} \frac{\partial L}{\partial c}\,\frac{\partial c}{\partial v} $$

Left: the one-step rule — blame flowing into $x$ is blame at $y$ times the local edge derivative. Right: the same rule on a graph — a node's gradient is the sum over its outgoing edges of (downstream gradient × local derivative). Multiply along paths, add across paths. Process nodes from the loss backward and every downstream gradient is already in hand when you need it: each edge is touched exactly once. That single scheduling decision is the entire difference between exponential and linear cost.

On the graph $L = \tfrac12(a-y)^2$ with $a = \sigma(z)$: at this step $a = 0.5$ and target $y = 1$. The two edge derivatives are $\partial L/\partial a = a - y$ and $\partial a/\partial z = a(1-a)$. What is $\partial L/\partial z$?

Chain-rule multiply along the path: $\partial L/\partial a = 0.5 - 1 = -0.5$; $\partial a/\partial z = 0.5(1 - 0.5) = 0.25$. So $\partial L/\partial z = (-0.5)(0.25) = $ −0.125. The negative sign means raising $z$ lowers the loss — the update pushes $a$ up toward $y = 1$.

Walk it on the smallest model that exercises every move — one weight, one bias, a sigmoid, a squared loss: $ L = \tfrac{1}{2}(\sigma(w x + b) - y)^2 $. The forward pass fills node values left to right and records each edge's local derivative as it goes. The backward pass then starts from $\partial L / \partial L = 1$ and multiplies its way left, one edge at a time. Step both directions yourself:

INSTRUMENT M8.1 — GRAPH STEPPERL = ½(σ(w·x + b) − y)² · EVERY NUMBER COMPUTED LIVE

PRESET

PASS

LOSS L

—

∂L/∂w

—

∂L/∂b

—

LOSS AFTER 1 SGD STEP (η = 1)

—

Press FORWARD: values fill left to right, and each edge's local derivative (blue) is recorded the moment its node computes — exactly what an autodiff tape stores. Press BACKWARD: gradients flow right to left in mint, each one the product of the downstream gradient and one blue edge label. On preset A you should land on ∂L/∂w = −0.1774; apply the η = 1 update to w and b and the last readout shows the loss genuinely drops. Preset B flips the target to 0 — watch ∂L/∂b change sign and every gradient shrink (the unit is already nearly right).

Notice what the instrument makes obvious: the backward pass never recomputes anything. Local derivatives were priced during the forward pass; backward just multiplies and adds them in reverse topological order. And the node $z$ — with one input from $u$ and one from $b$ — shows EQ M8.1's sum degenerating to single terms, while the inputs $w$ and $x$ each receive their gradient through one path. In a real network a hidden unit feeds many downstream nodes, and the sum over children is doing real work: that is the backward product of §8.3.

8.3

Backprop through a two-layer net, worked

Now the classic: Chapter 07's MLP, $ h = \varphi(W_1 x + b_1) $, $ \hat{y} = \sigma(W_2 h + b_2) $, binary cross-entropy loss. Define $\delta_\ell$ as the gradient of the loss with respect to layer $\ell$'s pre-activation — the quantity backprop actually ferries between layers. At the output, sigmoid and cross-entropy collapse into the cleanest result in the field:

EQ M8.2 — OUTPUT-LAYER GRADIENT $$ \delta_2 \;\equiv\; \frac{\partial L}{\partial z_2} \;=\; \hat{y} - y, \qquad\quad \frac{\partial L}{\partial W_2} = \delta_2\, h^{\top}, \qquad \frac{\partial L}{\partial b_2} = \delta_2 $$

The $\sigma'$ from the activation and the $1/\hat{y}(1-\hat{y})$ from the cross-entropy cancel exactly — the same cancellation that made logistic regression's gradient clean in Chapter 03, and the reason this loss–activation pairing is universal. Prediction minus target: the error itself is the gradient signal. A weight's gradient is then (its layer's δ) × (the activation it multiplied) — a weight that fed on a zero activation gets zero blame, which is exactly fair.

Output $\hat y = 0.8$, target $y = 0.5$, so $\delta_2 = \hat y - y$. One hidden unit fed activation $h = 2.0$ into the output. By EQ M8.2, what is $\partial L/\partial W_2$ for that weight ($= \delta_2 \cdot h$)?

First the output-layer signal: $\delta_2 = \hat y - y = 0.8 - 0.5 = 0.3$ (sigmoid and cross-entropy cancel — the error itself is the gradient). Then $\partial L/\partial W_2 = \delta_2 \cdot h = 0.3 \times 2.0 = $ 0.6. Blame is proportional to participation: a loud unit earns a large gradient.

EQ M8.3 — HIDDEN-LAYER GRADIENT: THE BACKWARD PRODUCT $$ \delta_1 \;=\; \big( W_2^{\top}\, \delta_2 \big) \,\odot\, \varphi'(z_1), \qquad\quad \frac{\partial L}{\partial W_1} = \delta_1\, x^{\top}, \qquad \frac{\partial L}{\partial b_1} = \delta_1 $$

Read $W_2^{\top} \delta_2$ as EQ M8.1's sum-over-children done for every hidden unit at once: the same weights that carried activations forward carry blame backward, transposed. Then $\odot\, \varphi'(z_1)$ gates the blame through each unit's local slope — a saturated unit ($\varphi' \approx 0$) absorbs no gradient. Deeper nets just iterate this line: $\delta_{\ell} = (W_{\ell+1}^{\top} \delta_{\ell+1}) \odot \varphi'(z_{\ell})$. One matrix multiply per layer, backward — the mirror image of the forward pass, at roughly twice its FLOPs.

That is the entire algorithm. Forward to get values, EQ M8.2 to start the blame, EQ M8.3 once per hidden layer to pass it down, then step every parameter against its gradient. The cell below is the complete loop — a 2-4-1 network learning XOR from eight points, every gradient written by hand. Two hundred epochs, loss printed and plotted; the predictions at the end are the proof Chapter 07 promised:

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(3)

# 8 points on the XOR pattern -- the dataset Ch07 proved no linear model can fit
X = np.array([[0,0],[0,1],[1,0],[1,1],[.1,.1],[.1,.9],[.9,.1],[.9,.9]], float)
y = np.array([0,1,1,0,0,1,1,0], float).reshape(-1,1)

W1 = rng.normal(0, 1.0, (4,2)); b1 = np.zeros(4)        # a 2-4-1 net, tanh hidden
W2 = rng.normal(0, 1.0, (1,4)); b2 = np.zeros(1)
lr, losses = 2.0, []

for epoch in range(201):
    H = np.tanh(X @ W1.T + b1)                          # forward
    p = 1/(1 + np.exp(-(H @ W2.T + b2)))
    L = -np.mean(y*np.log(p+1e-9) + (1-y)*np.log(1-p+1e-9))
    losses.append(L)
    dZ2 = (p - y)/len(X)                                # EQ M8.2: error IS the gradient
    dW2 = dZ2.T @ H;  db2 = dZ2.sum(0)
    dZ1 = (dZ2 @ W2) * (1 - H**2)                       # EQ M8.3: backward product, tanh gate
    dW1 = dZ1.T @ X;  db1 = dZ1.sum(0)
    W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2   # gradient step
    if epoch % 25 == 0: print(f"epoch {epoch:3d}   loss {L:.4f}")

print("\npredictions:", np.round(p.ravel(), 2))
print("targets:    ", y.ravel())
plot_xy(list(range(len(losses))), losses)

change lr to 8.0 or the seed to 1 — watch the curve, not the code

Sixteen lines of algorithm, and they are the same sixteen lines that train a frontier model — more layers in the loop, attention and normalization among the ops, ~10¹³ parameters instead of 17, but EQ M8.2 and M8.3 are doing all the work either way.

8.4

Autodiff: you never write gradients

You just hand-derived a network's gradients for the last time. Every framework implements automatic differentiation: as your code runs the forward pass, each primitive operation appends a node to a tape (in PyTorch, the grad_fn chain) recording which tensors fed it and how to compute its local derivative — precisely the blue edge labels of Instrument M8.1. Calling loss.backward() walks that tape in reverse topological order, applying EQ M8.1 at every node. This is reverse-mode autodiff, and its defining property is the one that built deep learning:

Mode	One pass computes	Cost for n params → 1 scalar loss	Right when
Forward-mode	∂(everything)/∂(one input)	n passes	Few inputs, many outputs
Reverse-mode	∂(loss)/∂(everything)	1 backward pass ≈ 2× forward FLOPs	Many params, one loss — i.e. all of ML
Numerical	one ∂L/∂θᵢ, approximately	2n forward passes	Testing the other two

Reverse mode's bill arrives in memory, not time: every activation must be kept alive from the forward pass until the backward pass consumes it. That is why training a model needs several times the memory of running it, and why activation checkpointing (recompute instead of store — Vol II · CH 04) exists. Three practical PyTorch facts complete the picture: gradients accumulate into .grad (hence zero_grad() every step — forgetting it is the classic silent bug); the tape is rebuilt every forward pass, so Python control flow is differentiated for free; and anything inside torch.no_grad() records nothing, which is what makes inference cheap.

And the referee from §8.1 gets its one honorable job. Whenever a gradient is written by hand — a custom op, a new layer, a paper reimplementation — it is checked against central differences. The contract: analytic and numerical agree to ~10⁻⁷ in float64, or the backward pass is wrong. Run the audit on §8.3's network:

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(0)

X = rng.normal(size=(8,2)); y = rng.integers(0,2,(8,1)).astype(float)
theta = rng.normal(0, 0.6, size=12)        # 2-4-1 net, MSE loss, params flattened

def unpack(t): return t[:8].reshape(4,2), t[8:].reshape(1,4)
def loss(t):
    W1, W2 = unpack(t)
    p = 1/(1 + np.exp(-(np.tanh(X @ W1.T) @ W2.T)))
    return np.mean((p - y)**2)

def grad_backprop(t):                      # analytic: one forward + one backward
    W1, W2 = unpack(t)
    H = np.tanh(X @ W1.T); p = 1/(1 + np.exp(-(H @ W2.T)))
    dZ2 = 2*(p - y)/y.size * p*(1 - p)     # chain: MSE then sigmoid
    dW1 = ((dZ2 @ W2)*(1 - H**2)).T @ X    # EQ M8.3 again
    return np.concatenate([dW1.ravel(), (dZ2.T @ H).ravel()])

eps  = 1e-5                                # numerical: 2 forwards PER parameter
g_bp = grad_backprop(theta)
g_num = np.array([(loss(theta + eps*np.eye(12)[i]) -
                   loss(theta - eps*np.eye(12)[i]))/(2*eps) for i in range(12)])

print("max |analytic - numerical| =", f"{np.abs(g_bp - g_num).max():.2e}")
print("np.allclose verdict:", np.allclose(g_bp, g_num, rtol=1e-5, atol=1e-7))
print(f"\ncost: backprop = 2 passes total; numerical = {2*12} passes for 12 params")

sabotage grad_backprop — drop the (1 − H**2) — and watch the verdict flip

8.5

SGD and minibatches

The loss that matters is an average over the whole dataset, so the true gradient is too — and on a trillion tokens, computing it once would cost more than most entire training runs. Stochastic gradient descent declines to pay: sample a minibatch $\mathcal{B}$, average its per-example gradients, and step on that estimate:

EQ M8.4 — THE NOISY GRADIENT ESTIMATOR $$ \hat{g} \;=\; \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta}\, \ell_i(\theta), \qquad \mathbb{E}\big[\hat{g}\big] = \nabla_{\theta} L(\theta), \qquad \mathrm{Var}\big[\hat{g}\big] \;\propto\; \frac{1}{|\mathcal{B}|} $$

The estimator is unbiased — on average it points exactly downhill — and its noise shrinks only as $1/|\mathcal{B}|$ in variance ($1/\sqrt{|\mathcal{B}|}$ in magnitude): quadrupling the batch halves the noise. The decisive property: the cost of a step is independent of the dataset size. That one fact is why training on internet-scale data is possible at all.

You grow the minibatch from $|\mathcal{B}| = 25$ to $|\mathcal{B}| = 100$. Since the gradient-noise magnitude scales as $1/\sqrt{|\mathcal{B}|}$, by what factor does the noise magnitude change (new ÷ old)?

Magnitude $\propto 1/\sqrt{|\mathcal{B}|}$, so the ratio is $\sqrt{25}/\sqrt{100} = 5/10 = $ 0.5. The variance (the square) drops by $25/100 = 1/4$, but the magnitude only halves — quadrupling the batch buys a 2× cleaner gradient, the diminishing return behind the critical batch size.

Batch size is then an engineering trade, not a statistical one. Larger batches use accelerators efficiently and parallelize across devices (Vol II · CH 04), but past a critical batch size the extra averaging buys almost nothing — the noise is no longer the limiting factor, and you are spending more compute per unit of progress. A common heuristic when scaling the batch is to scale the learning rate with it, which works until it abruptly doesn't. And the noise itself is not purely a tax: it helps escape saddle points, and there is evidence — genuinely contested — that it biases training toward flatter minima that generalize better (Chapter 06's themes). What is not contested: the learning rate $\eta$ is the single most important hyperparameter in deep learning. Too low wastes compute; ~3× too high diverges; the usable window is often well under one order of magnitude, which is why Vol II · EQ 4.4's warmup-and-decay schedules exist.

8.6

Momentum and Adam

Real loss surfaces are ravines: curvature differs wildly by direction (recall Chapter 02's elongated bowls). Plain SGD must keep $\eta$ small enough not to explode along the steepest direction — and at that $\eta$ it crawls along the shallow one, zigzagging across the valley while barely advancing down it. Momentum fixes this with one extra vector — an exponential moving average of gradients:

EQ M8.5 — MOMENTUM (HEAVY BALL) $$ v_t \;=\; \beta\, v_{t-1} + \hat{g}_t, \qquad\quad \theta_{t+1} \;=\; \theta_t - \eta\, v_t $$

$v$ remembers roughly the last $1/(1-\beta)$ gradients — ten, at the standard $\beta = 0.9$. Across the ravine, gradients alternate sign and cancel in the average; along the valley floor they agree and accumulate, up to a $1/(1-\beta)$ ≈ 10× effective speedup. The physics name is honest: $v$ is velocity, $\beta$ is friction, and a rolling ball coasts through small bumps and minor noise that stop a memoryless walker cold.

Momentum with friction $\beta = 0.8$, starting from $v_0 = 0$, on a valley floor where every gradient is $g = +1$. Using $v_t = \beta\,v_{t-1} + g$, what is the velocity $v_3$ after three steps?

$v_1 = 0.8\cdot 0 + 1 = 1$; $v_2 = 0.8\cdot 1 + 1 = 1.8$; $v_3 = 0.8\cdot 1.8 + 1 = 1.44 + 1 = $ 2.44. Agreement compounds toward the limit $1/(1-\beta) = 1/0.2 = 5$ — momentum accelerates along a consistent direction.

Adam keeps momentum's first moment and adds a second: a running average of each coordinate's squared gradient, used to divide the step — so every parameter gets an automatically calibrated per-coordinate learning rate, and rarely-updated or small-gradient directions are not starved. The full update, bias corrections and the decoupled weight decay that makes it AdamW, is Vol II · EQ 4.3 — we will not re-derive it here. The standings in practice:

Optimizer	State per param	Character	Where it rules
SGD	none	Honest, noisy, ravine-bound	Theory; small problems
SGD + momentum	+1 (v)	Coasts valleys, rolls over bumps, overshoots	The CNN/vision era; still competitive there
Adam / AdamW	+2 (m, v)	Per-coordinate scaling; robust to bad conditioning	Every LLM you have heard of

The state column is real money at scale: weights + gradients + fp32 master copy + Adam's two moments is where Vol II · CH 06's "≈16 bytes per parameter to train" comes from — a 70B model wants ~1.1 TB of optimizer-laden memory before a single activation is stored.

INSTRUMENT M8.2 — OPTIMIZER RACEELONGATED VALLEY + BUMP · 3 LIVE UPDATE RULES · SEEDED NOISE

CONTROL

STEP

0

SGD LOSS (η=1.6)

—

MOMENTUM LOSS (η=.22 β=.9)

—

ADAM LOSS (η=.30)

—

A synthetic two-parameter loss — 10:1 curvature ratio plus a Gaussian bump squarely in the path — but every trajectory is its real update rule run live, all three fed identical seeded gradient noise. Run AUTO: SGD zigzags across the valley, then parks — the bump carves a shallow local minimum in front of itself, and a memoryless stepper has no way out. Momentum's stored velocity carries it straight over (and then past — watch the red overshoot swing back). Adam's per-coordinate normalization takes the bump as a detour and settles cleanly. Around step 120 the scoreboard reads ≈1.34 / 0.03 / 0.02 — same surface, same noise, three different fates.

8.7

Vanishing, exploding, and the fixes that built deep learning

Iterate EQ M8.3 through $L$ layers and the gradient reaching layer 1 is a product of $L$ matrices and $L$ activation slopes. Products compound geometrically: if each factor shrinks the signal by 0.9, a hundred layers leave $0.9^{100} \approx 3 \times 10^{-5}$ of it; if each grows it by 1.1, the same hundred layers amplify ~14,000×. Vanishing gradients mean the early layers stop learning while the late ones overfit; exploding gradients mean a single step flings the weights to infinity. For two decades this product was the practical wall — "deep networks don't train" — and the modern stack is, to a first approximation, the list of fixes:

Initialization that respects the product. Xavier/Glorot and He initialization choose weight variance (≈ $2/d_{\text{in}}$ for ReLU) so each layer's factor starts with norm ≈ 1 — the product begins neutral instead of doomed. One line of code; it is the difference between training and not.
Activations that pass gradient. EQ M7.2's argument: sigmoid contributes at most 0.25 per layer; ReLU contributes exactly 1 wherever active. The 2012 switch to ReLU is most of why "deep" stopped being a euphemism for "broken".
Residual connections. Reformulate each layer as $ h_{\ell+1} = h_\ell + F(h_\ell) $. The backward Jacobian becomes $ I + \partial F / \partial h_\ell $: the identity term gives gradients a multiplication-free expressway from the loss to every layer, and the product of $(I + \text{small})$ terms stays tame where a product of raw matrices would not. He et al. (2015) used it to train 152 layers the year 20 was hard.
Normalization and clipping. LayerNorm/RMSNorm re-standardize activations so the slopes stay in their responsive range; gradient clipping caps the global gradient norm as a circuit breaker against the exploding side — still standard in every LLM pretraining run (Vol II · CH 04).

Carry the third fix with you across the volume boundary: a transformer is a residual network through and through, and what Volume II calls the residual stream (Vol II · §2.2) is exactly this gradient expressway, promoted from a training trick to the architecture's central data structure — every attention head and MLP reads from it and adds back into it, and gradients ride it undamped through a hundred layers.

You now know everything GPT knows about learning — Volume II shows what happens at a trillion times the scale. The forward pass of Chapter 07, this chapter's backward pass, AdamW on EQ M8.4's noisy gradients: that is, literally and completely, the training loop of a frontier model. Volume II begins where the loop meets reality — tokens, embeddings, attention, and the engineering of running it across tens of thousands of GPUs.

§