The credit-assignment problem
A network maps inputs through millions of weights to a single scalar loss. When that loss is bad, which weights are at fault, and by how much? That is credit assignment, and it is the whole problem of learning. The output layer's culpability is easy to see — it touched the answer directly. But a weight three layers deep influenced the loss only through everything stacked above it; its blame arrives diluted, rerouted, and mixed with everyone else's.
The brute-force answer exists and is worth respecting: nudge one weight by \(\varepsilon\), re-run the network, and watch the loss move — \(\partial L / \partial \theta_i \approx (L(\theta_i + \varepsilon) - L(\theta_i - \varepsilon)) / 2\varepsilon\). It is exactly correct in the limit and catastrophically expensive: two full forward passes per parameter, per step. For GPT-class models that is trillions of forward passes to compute what backpropagation delivers in roughly the cost of one. The brute-force method survives in one honorable role — as the referee that checks backprop implementations (§8.4) — and nowhere else.
Backpropagation was applied to neural networks and popularized by Rumelhart, Hinton and Williams in 1986 (the underlying reverse-mode differentiation is older — Linnainmaa, 1970). It ended the seventeen-year winter that Chapter 07's perceptron proof began. The insight is not deep math; it is deep bookkeeping: the chain rule, applied once per node of a graph, in the right order, sharing every intermediate result.
The chain rule on a computational graph
Stop thinking of a network as a formula and start thinking of it as a computational graph: a directed graph in which every node is a primitive operation (multiply, add, \(\sigma\), square) and every edge carries a value forward. The crucial move is to label each edge with its local derivative — not a symbolic expression but a number, evaluated at the values that just flowed through. The edge from \(z\) into \(a = \sigma(z)\) is labeled \(\partial a / \partial z = a(1-a)\): one number, known the moment the forward pass computes \(a\).
Walk it on the smallest model that exercises every move — one weight, one bias, a sigmoid, a squared loss: \( L = \tfrac{1}{2}(\sigma(w x + b) - y)^2 \). The forward pass fills node values left to right and records each edge's local derivative as it goes. The backward pass then starts from \(\partial L / \partial L = 1\) and multiplies its way left, one edge at a time. Step both directions yourself:
Notice what the instrument makes obvious: the backward pass never recomputes anything. Local derivatives were priced during the forward pass; backward just multiplies and adds them in reverse topological order. And the node \(z\) — with one input from \(u\) and one from \(b\) — shows EQ M8.1's sum degenerating to single terms, while the inputs \(w\) and \(x\) each receive their gradient through one path. In a real network a hidden unit feeds many downstream nodes, and the sum over children is doing real work: that is the backward product of §8.3.
Backprop through a two-layer net, worked
Now the classic: Chapter 07's MLP, \( h = \varphi(W_1 x + b_1) \), \( \hat{y} = \sigma(W_2 h + b_2) \), binary cross-entropy loss. Define \(\delta_\ell\) as the gradient of the loss with respect to layer \(\ell\)'s pre-activation — the quantity backprop actually ferries between layers. At the output, sigmoid and cross-entropy collapse into the cleanest result in the field:
That is the entire algorithm. Forward to get values, EQ M8.2 to start the blame, EQ M8.3 once per hidden layer to pass it down, then step every parameter against its gradient. The cell below is the complete loop — a 2-4-1 network learning XOR from eight points, every gradient written by hand. Two hundred epochs, loss printed and plotted; the predictions at the end are the proof Chapter 07 promised:
import numpy as np
rng = np.random.default_rng(3)
# 8 points on the XOR pattern -- the dataset Ch07 proved no linear model can fit
X = np.array([[0,0],[0,1],[1,0],[1,1],[.1,.1],[.1,.9],[.9,.1],[.9,.9]], float)
y = np.array([0,1,1,0,0,1,1,0], float).reshape(-1,1)
W1 = rng.normal(0, 1.0, (4,2)); b1 = np.zeros(4) # a 2-4-1 net, tanh hidden
W2 = rng.normal(0, 1.0, (1,4)); b2 = np.zeros(1)
lr, losses = 2.0, []
for epoch in range(201):
H = np.tanh(X @ W1.T + b1) # forward
p = 1/(1 + np.exp(-(H @ W2.T + b2)))
L = -np.mean(y*np.log(p+1e-9) + (1-y)*np.log(1-p+1e-9))
losses.append(L)
dZ2 = (p - y)/len(X) # EQ M8.2: error IS the gradient
dW2 = dZ2.T @ H; db2 = dZ2.sum(0)
dZ1 = (dZ2 @ W2) * (1 - H**2) # EQ M8.3: backward product, tanh gate
dW1 = dZ1.T @ X; db1 = dZ1.sum(0)
W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2 # gradient step
if epoch % 25 == 0: print(f"epoch {epoch:3d} loss {L:.4f}")
print("\npredictions:", np.round(p.ravel(), 2))
print("targets: ", y.ravel())
plot_xy(list(range(len(losses))), losses)
Sixteen lines of algorithm, and they are the same sixteen lines that train a frontier model — more layers in the loop, attention and normalization among the ops, ~10¹³ parameters instead of 17, but EQ M8.2 and M8.3 are doing all the work either way.
Autodiff: you never write gradients
You just hand-derived a network's gradients for the last time. Every framework implements automatic differentiation: as your code runs the forward pass, each primitive operation appends a node to a tape (in PyTorch, the grad_fn chain) recording which tensors fed it and how to compute its local derivative — precisely the blue edge labels of Instrument M8.1. Calling loss.backward() walks that tape in reverse topological order, applying EQ M8.1 at every node. This is reverse-mode autodiff, and its defining property is the one that built deep learning:
| Mode | One pass computes | Cost for n params → 1 scalar loss | Right when |
|---|---|---|---|
| Forward-mode | ∂(everything)/∂(one input) | n passes | Few inputs, many outputs |
| Reverse-mode | ∂(loss)/∂(everything) | 1 backward pass ≈ 2× forward FLOPs | Many params, one loss — i.e. all of ML |
| Numerical | one ∂L/∂θᵢ, approximately | 2n forward passes | Testing the other two |
Reverse mode's bill arrives in memory, not time: every activation must be kept alive from the forward pass until the backward pass consumes it. That is why training a model needs several times the memory of running it, and why activation checkpointing (recompute instead of store — Vol II · CH 04) exists. Three practical PyTorch facts complete the picture: gradients accumulate into .grad (hence zero_grad() every step — forgetting it is the classic silent bug); the tape is rebuilt every forward pass, so Python control flow is differentiated for free; and anything inside torch.no_grad() records nothing, which is what makes inference cheap.
And the referee from §8.1 gets its one honorable job. Whenever a gradient is written by hand — a custom op, a new layer, a paper reimplementation — it is checked against central differences. The contract: analytic and numerical agree to ~10⁻⁷ in float64, or the backward pass is wrong. Run the audit on §8.3's network:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(8,2)); y = rng.integers(0,2,(8,1)).astype(float)
theta = rng.normal(0, 0.6, size=12) # 2-4-1 net, MSE loss, params flattened
def unpack(t): return t[:8].reshape(4,2), t[8:].reshape(1,4)
def loss(t):
W1, W2 = unpack(t)
p = 1/(1 + np.exp(-(np.tanh(X @ W1.T) @ W2.T)))
return np.mean((p - y)**2)
def grad_backprop(t): # analytic: one forward + one backward
W1, W2 = unpack(t)
H = np.tanh(X @ W1.T); p = 1/(1 + np.exp(-(H @ W2.T)))
dZ2 = 2*(p - y)/y.size * p*(1 - p) # chain: MSE then sigmoid
dW1 = ((dZ2 @ W2)*(1 - H**2)).T @ X # EQ M8.3 again
return np.concatenate([dW1.ravel(), (dZ2.T @ H).ravel()])
eps = 1e-5 # numerical: 2 forwards PER parameter
g_bp = grad_backprop(theta)
g_num = np.array([(loss(theta + eps*np.eye(12)[i]) -
loss(theta - eps*np.eye(12)[i]))/(2*eps) for i in range(12)])
print("max |analytic - numerical| =", f"{np.abs(g_bp - g_num).max():.2e}")
print("np.allclose verdict:", np.allclose(g_bp, g_num, rtol=1e-5, atol=1e-7))
print(f"\ncost: backprop = 2 passes total; numerical = {2*12} passes for 12 params")
SGD and minibatches
The loss that matters is an average over the whole dataset, so the true gradient is too — and on a trillion tokens, computing it once would cost more than most entire training runs. Stochastic gradient descent declines to pay: sample a minibatch \(\mathcal{B}\), average its per-example gradients, and step on that estimate:
Batch size is then an engineering trade, not a statistical one. Larger batches use accelerators efficiently and parallelize across devices (Vol II · CH 04), but past a critical batch size the extra averaging buys almost nothing — the noise is no longer the limiting factor, and you are spending more compute per unit of progress. A common heuristic when scaling the batch is to scale the learning rate with it, which works until it abruptly doesn't. And the noise itself is not purely a tax: it helps escape saddle points, and there is evidence — genuinely contested — that it biases training toward flatter minima that generalize better (Chapter 06's themes). What is not contested: the learning rate \(\eta\) is the single most important hyperparameter in deep learning. Too low wastes compute; ~3× too high diverges; the usable window is often well under one order of magnitude, which is why Vol II · EQ 4.4's warmup-and-decay schedules exist.
Momentum and Adam
Real loss surfaces are ravines: curvature differs wildly by direction (recall Chapter 02's elongated bowls). Plain SGD must keep \(\eta\) small enough not to explode along the steepest direction — and at that \(\eta\) it crawls along the shallow one, zigzagging across the valley while barely advancing down it. Momentum fixes this with one extra vector — an exponential moving average of gradients:
Adam keeps momentum's first moment and adds a second: a running average of each coordinate's squared gradient, used to divide the step — so every parameter gets an automatically calibrated per-coordinate learning rate, and rarely-updated or small-gradient directions are not starved. The full update, bias corrections and the decoupled weight decay that makes it AdamW, is Vol II · EQ 4.3 — we will not re-derive it here. The standings in practice:
| Optimizer | State per param | Character | Where it rules |
|---|---|---|---|
| SGD | none | Honest, noisy, ravine-bound | Theory; small problems |
| SGD + momentum | +1 (v) | Coasts valleys, rolls over bumps, overshoots | The CNN/vision era; still competitive there |
| Adam / AdamW | +2 (m, v) | Per-coordinate scaling; robust to bad conditioning | Every LLM you have heard of |
The state column is real money at scale: weights + gradients + fp32 master copy + Adam's two moments is where Vol II · CH 06's "≈16 bytes per parameter to train" comes from — a 70B model wants ~1.1 TB of optimizer-laden memory before a single activation is stored.
Vanishing, exploding, and the fixes that built deep learning
Iterate EQ M8.3 through \(L\) layers and the gradient reaching layer 1 is a product of \(L\) matrices and \(L\) activation slopes. Products compound geometrically: if each factor shrinks the signal by 0.9, a hundred layers leave \(0.9^{100} \approx 3 \times 10^{-5}\) of it; if each grows it by 1.1, the same hundred layers amplify ~14,000×. Vanishing gradients mean the early layers stop learning while the late ones overfit; exploding gradients mean a single step flings the weights to infinity. For two decades this product was the practical wall — "deep networks don't train" — and the modern stack is, to a first approximation, the list of fixes:
- Initialization that respects the product. Xavier/Glorot and He initialization choose weight variance (≈ \(2/d_{\text{in}}\) for ReLU) so each layer's factor starts with norm ≈ 1 — the product begins neutral instead of doomed. One line of code; it is the difference between training and not.
- Activations that pass gradient. EQ M7.2's argument: sigmoid contributes at most 0.25 per layer; ReLU contributes exactly 1 wherever active. The 2012 switch to ReLU is most of why "deep" stopped being a euphemism for "broken".
- Residual connections. Reformulate each layer as \( h_{\ell+1} = h_\ell + F(h_\ell) \). The backward Jacobian becomes \( I + \partial F / \partial h_\ell \): the identity term gives gradients a multiplication-free expressway from the loss to every layer, and the product of \((I + \text{small})\) terms stays tame where a product of raw matrices would not. He et al. (2015) used it to train 152 layers the year 20 was hard.
- Normalization and clipping. LayerNorm/RMSNorm re-standardize activations so the slopes stay in their responsive range; gradient clipping caps the global gradient norm as a circuit breaker against the exploding side — still standard in every LLM pretraining run (Vol II · CH 04).
Carry the third fix with you across the volume boundary: a transformer is a residual network through and through, and what Volume II calls the residual stream (Vol II · §2.2) is exactly this gradient expressway, promoted from a training trick to the architecture's central data structure — every attention head and MLP reads from it and adds back into it, and gradients ride it undamped through a hundred layers.
You now know everything GPT knows about learning — Volume II shows what happens at a trillion times the scale. The forward pass of Chapter 07, this chapter's backward pass, AdamW on EQ M8.4's noisy gradients: that is, literally and completely, the training loop of a frontier model. Volume II begins where the loop meets reality — tokens, embeddings, attention, and the engineering of running it across tens of thousands of GPUs.
Further reading
- Rumelhart, D., Hinton, G. & Williams, R. (1986). Learning Representations by Back-Propagating Errors. — the paper that popularized backpropagation for training multilayer networks.
- Linnainmaa, S. (1970). The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion. — the earliest description of reverse-mode automatic differentiation, backprop's mathematical core.
- Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method. — the foundation of stochastic gradient descent and its convergence conditions.
- Kingma, D. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. — the adaptive optimizer combining momentum and per-parameter scaling that is now the default.
- Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. — the diploma thesis that first diagnosed the vanishing-gradient problem in deep networks.
- Baydin, A., Pearlmutter, B., Radul, A. & Siskind, J. (2018). Automatic Differentiation in Machine Learning: A Survey. — the modern reference tying backprop to general-purpose autodiff.