07 · Neural Networks: The MLP

7.1

The perceptron and its limit

Rosenblatt's 1958 perceptron is a thresholded weighted sum: $ \hat{y} = \mathbf{1}[\, w^\top x + b > 0 \,] $. Geometrically it is a single hyperplane — everything on one side is class 1, everything on the other is class 0. The logistic regression of Chapter 03 softens the threshold into a sigmoid, but the geometry is identical: one flat cut through input space. For sixty years of statistics that was usually enough, because humans hand-engineered features until the classes became linearly separable.

Then consider the smallest dataset in this encyclopedia — exclusive-or:

x₁	x₂	x₁ XOR x₂	corner
0	0	0	bottom-left
0	1	1	top-left
1	0	1	bottom-right
1	1	0	top-right

The 1s sit on one diagonal, the 0s on the other. No line separates them, and the proof takes four inequalities. Suppose weights $w_1, w_2, b$ existed. The four points demand:

$(0,0)\to 0$: $ b \le 0 $
$(1,0)\to 1$: $ w_1 + b > 0 $
$(0,1)\to 1$: $ w_2 + b > 0 $
$(1,1)\to 0$: $ w_1 + w_2 + b \le 0 $

Add the middle two: $ w_1 + w_2 + 2b > 0 $, so $ w_1 + w_2 + b > -b \ge 0 $ — directly contradicting the fourth. No linear model, however trained, can represent XOR. Minsky and Papert published this in 1969, funding for neural networks evaporated, and the result stood as the field's cautionary tale until people accepted the obvious-but-then-untrainable fix: more layers. Note the honest framing — the limitation was about single-layer machines; multi-layer ones were known to be more powerful but nobody could train them until backpropagation spread in 1986 (Chapter 08).

7.2

The MLP: linear, bend, linear

The multi-layer perceptron inserts a hidden layer of $d_h$ units between input and output, with an elementwise nonlinearity $\varphi$ — the activation function — applied in between:

EQ M7.1 — THE TWO-LAYER MLP $$ h = \varphi\!\big( W_1 x + b_1 \big), \qquad \hat{y} = W_2\, h + b_2, \qquad W_1 \in \mathbb{R}^{d_h \times d_{\text{in}}},\; W_2 \in \mathbb{R}^{d_{\text{out}} \times d_h} $$

Each row of $W_1$ defines one hyperplane; hidden unit $h_j = \varphi(w_j^\top x + b_j)$ is a squashed signed distance to it — a soft linear feature detector. The output layer is then an ordinary linear model in the feature space the network chose for itself. The bend is load-bearing: without $\varphi$, the stack collapses — $W_2(W_1 x + b_1) + b_2$ is just another linear model, no matter how many layers you pile up. We write a network's sizes as d_in-d_h-d_out: the instrument below trains a 2-8-1.

One hidden unit with ReLU activation: weights $w = (1, 2)$, bias $b = -1$, input $x = (2, 1)$. Its output is $h = \mathrm{ReLU}(w^\top x + b)$. What is $h$?

Pre-activation: $w^\top x + b = 1\cdot 2 + 2\cdot 1 + (-1) = 2 + 2 - 1 = 3$. Since $3 > 0$, $\mathrm{ReLU}(3) = \max(0, 3) = $ 3. The unit is on its active half, so its gradient passes through undamped ($\varphi' = 1$).

FIG 7.1A 2-4-1 MLP — TWO MATRICES AND ONE BEND

Every edge is one learned number. The first layer's rows are four hyperplanes; φ turns signed distances into soft features; the second layer is a plain linear model over those features. The network's only new trick is choosing its features itself.

What does training do with this freedom? Watch it happen. The instrument below is a complete neural network — forward pass, backpropagation, gradient updates — implemented in this page with no library. It trains a 2-H-1 MLP (tanh hidden units, sigmoid output, cross-entropy loss, full-batch gradient descent) on seeded datasets a linear model cannot touch.

INSTRUMENT M7.1 — XOR PLAYGROUNDREAL 2-H-1 MLP · FULL BACKPROP IN-PAGE · SEEDED

DATASET

CONTROL

LEARNING RATE η (LOG) 0.50

HIDDEN UNITS H 8

EPOCH

0

LOSS (BCE)

—

TRAIN ACCURACY

—

PARAMETERS (4H+1)

33

Press TRAIN and watch a straight prejudice bend into the right shape — with H = 8 and η = 0.5, XOR falls inside a few hundred epochs. Switch to TWO CIRCLES and the net closes a loop around the inner cluster. Now drop H to 2 on the circles: enclosing a region needs at least three half-planes, so it provably cannot — and H = 2 on XOR can represent the answer yet gradient descent often fails to find it (representable ≠ learnable, the theme of §7.4). Then push η toward 100: the loss curve goes ragged as each step overshoots the valley it is aiming for, and the boundary thrashes — full-batch descent on this bounded loss rarely explodes outright, but it stops descending. The loss curve tells you before the picture does.

Before training finds weights, it helps to see that good weights exist. The cell below hard-codes a 2-8-1 ReLU network in which only two of the eight hidden units do any work — and XOR falls exactly. This is the existence proof; the instrument above is the search; Chapter 08 is the algebra of the search.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np

def relu(z): return np.maximum(0, z)

# 2-8-1 MLP, weights set by hand: 2 of the 8 hidden units solve XOR, 6 idle
W1 = np.zeros((8, 2)); b1 = np.zeros(8)
W1[0] = [1, 1]; b1[0] =  0.0     # h0 = ReLU(x1 + x2)
W1[1] = [1, 1]; b1[1] = -1.0     # h1 = ReLU(x1 + x2 - 1)
W2 = np.zeros((1, 8)); W2[0, 0] = 1.0; W2[0, 1] = -2.0
b2 = np.zeros(1)                  # y = h0 - 2*h1

X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
H = relu(X @ W1.T + b1)           # (4, 8)
Y = H @ W2.T + b2                 # (4, 1)
for x, y in zip(X, Y):
    print(f"x = {x.astype(int)}  ->  yhat = {y[0]:.1f}   XOR = {int(x[0]) ^ int(x[1])}")
print("\nhidden layer H (4 inputs x 8 units):")
print(H)

edits are live — break it on purpose

7.3

Activation functions

The choice of $\varphi$ looks cosmetic and decided a decade of history. An activation must be nonlinear (or the stack collapses), nearly free to compute (it runs once per unit per example), and — the part nobody appreciated until networks got deep — it must pass gradients. The learning signal reaching layer 1 is a product of one factor per layer crossed:

EQ M7.2 — WHY GRADIENTS VANISH $$ \frac{\partial \mathcal{L}}{\partial h^{(1)}} \;=\; \Bigg( \prod_{\ell=2}^{L} W_{\ell}^{\top}\, \mathrm{diag}\!\big( \varphi'(z^{(\ell)}) \big) \Bigg) \frac{\partial \mathcal{L}}{\partial h^{(L)}} $$

Every layer multiplies the backward signal by its weight matrix and by $\varphi'$ evaluated where each unit currently sits. Sigmoid's derivative peaks at $1/4$ — so through $L$ sigmoid layers the signal shrinks like $(1/4)^L$ at best, and far faster once units saturate. Ten layers: $\sim 10^{-6}$ of the gradient survives. ReLU's derivative is exactly 1 on the entire active half — gradients pass through unshrunk, which is most of why deep networks became trainable in 2012.

A sigmoid unit outputs $a = \sigma(z) = 0.9$. The sigmoid derivative is $\varphi'(z) = a(1 - a)$. What is this unit's local slope $\varphi'$?

$\varphi' = a(1 - a) = 0.9 \times (1 - 0.9) = 0.9 \times 0.1 = $ 0.09. Far below sigmoid's peak of 0.25 — this near-saturated unit barely passes gradient, and stacking such factors is what makes deep sigmoid networks untrainable.

Activation	φ(z)	Range	max φ′	Verdict
Sigmoid σ	1 / (1 + e⁻ᶻ)	(0, 1)	0.25	Saturates on both sides; killed deep stacks. Survives as the output for probabilities and inside gates.
Tanh	2σ(2z) − 1	(−1, 1)	1.00	Zero-centered sigmoid; the RNN-era default; still saturates at both ends.
ReLU	max(0, z)	[0, ∞)	1.00	Derivative is 1 everywhere active; cheap; sparse. Risk: dead units stuck at φ′ = 0 forever.
GELU	z · Φ(z)	≈ (−0.17, ∞)	≈ 1.08	Smooth ReLU weighted by the Gaussian CDF; default of the GPT-2/BERT era of transformers.

INSTRUMENT M7.2 — ACTIVATION GALLERYf AND f′ ON [−5, 5] · SATURATION SHADED

ACTIVATION

MAX f′ ON [−5, 5]

—

SATURATED SHARE (f′ < 0.05)

—

f′(2.5)¹⁰ — 10-LAYER SIGNAL

—

The red bands are where the derivative is effectively zero — a unit parked there learns nothing. Click through the four and watch the last readout: it is the gradient surviving ten stacked layers for a unit sitting at z = 2.5. Sigmoid: ~10⁻¹². Tanh: ~10⁻¹⁶. ReLU: exactly 1. That single number is the argument that ended the sigmoid era. Note ReLU's own red zone — the entire negative half — which is the dead-unit risk in the table above.

Where the lineage goes next. Modern LLMs use a gated refinement: SwiGLU (Vol II · EQ 2.3) multiplies a SiLU-squashed gate elementwise against a linear up-projection, letting the MLP modulate its own features. It is the direct descendant of the choices in this table — same constraint set, one more multiplicative trick.

7.4

Universal approximation — and its fine print

The classical justification for all of this is the universal approximation theorem, in words: a feed-forward network with a single hidden layer and any non-polynomial activation can approximate any continuous function on a bounded region to any accuracy you name — provided you may make the hidden layer wide enough (Cybenko 1989 for sigmoids; Hornik 1991 in general). One layer of bends, in principle, suffices for everything.

FINE PRINT

Existence is not learnability. The theorem is non-constructive on every axis that matters: (1) it does not say how many units — worst-case width grows exponentially in the input dimension, the curse of dimensionality again; (2) it does not say the weights are findable — gradient descent on a non-convex loss carries no guarantee of reaching them (you watched H = 2 fail on a representable problem in Instrument M7.1); (3) it says nothing about generalization from finite data. In Chapter 06's language, it bounds approximation error only — estimation and optimization error are untouched.

Read it, then, as a license rather than an explanation: MLPs are a hypothesis class with no permanent blind spots, unlike the perceptron. Why gradient-trained deep networks work as well as they do on real data remains a partially open research question in 2026 — be suspicious of anyone who cites this theorem as the answer.

7.5

Width, depth, and why depth wins

If one wide layer suffices in principle, why is every serious network deep? Because depth buys composition. A second hidden layer computes features of features: layer 1 finds edges in pixels, layer 2 assembles edges into textures and parts, layer 3 into objects. In text models the same hierarchy runs characters → morphemes → syntax → semantics. A wide-shallow network must build every high-level feature from raw inputs in one step; a deep one builds a vocabulary at each level and reuses it everywhere above — the same economy that makes subroutines beat straight-line code.

	Wider (same depth)	Deeper (same width)
What you get	More parallel features at one level of abstraction	A hierarchy — features of features
Expressive power	grows polynomially	some functions need exponentially fewer units
Trainability	Benign; gradients stay healthy	EQ M7.2's product bites — needs ReLU-family φ, careful init, later residuals & normalization (Vol II · CH 02)

The middle row has real theorems behind it: deep ReLU networks fold input space repeatedly, producing a number of linear regions that grows exponentially with depth but only polynomially with width (Montúfar 2014), and there exist functions computable by a deep network that no shallow network of sub-exponential width can match (Telgarsky 2016). Honesty requires the caveat: those are worst-case constructions, not descriptions of your dataset. The practical reasons depth wins are that real data is compositional, and that a decade of engineering — initialization, normalization, residual connections — removed depth's optimization penalty. On flat tabular data, Chapter 04's gradient-boosted trees still routinely beat both wide and deep.

7.6

The forward pass is matrix multiplication

Everything above was written for one input vector. Real computation is batched: stack $B$ examples as rows of $X$, and the entire forward pass becomes two matrix multiplications — which is the only operation GPUs are truly built for, and the reason the whole field runs on them:

EQ M7.3 — THE BATCHED FORWARD PASS $$ \underset{B \times d_h}{H} \;=\; \varphi\!\Big( \underset{B \times d_{\text{in}}}{X} \;\; \underset{d_{\text{in}} \times d_h}{W_1^{\top}} \;+\; \underset{1 \times d_h}{b_1} \Big), \qquad \underset{B \times d_{\text{out}}}{\hat{Y}} \;=\; H\, W_2^{\top} + b_2 $$

The single rule of shape discipline: inner dimensions must agree, and the batch dimension rides along untouched in the leftmost slot. The bias $b_1$ is a single row, broadcast down all $B$ rows. Frameworks store weights as (out, in) — hence the transposes. In Volume II the same ledger gains one axis, (B, T, d_model), and otherwise nothing changes.

A two-layer MLP with sizes 3-5-2 ($d_{\text{in}}=3$, $d_h=5$, $d_{\text{out}}=2$). Counting every weight and bias, how many trainable parameters does it have?

$W_1$ is $5 \times 3 = 15$, $b_1$ is 5, $W_2$ is $2 \times 5 = 10$, $b_2$ is 2. Total $= 15 + 5 + 10 + 2 = $ 32. Every edge in the network diagram is one of these numbers.

Professionals debug networks by reciting shapes, not by reading values. Build the habit now — run the drill, then change d_h or batch and predict every line before re-running. If you can write the shape ledger of a network from memory, you understand its forward pass; there is nothing else in it.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(7)

B, d_in, d_h, d_out = 32, 2, 8, 1     # batch 32 through a 2-8-1
X  = rng.normal(size=(B, d_in))
W1 = rng.normal(size=(d_h, d_in)) * 0.5   # (out, in) convention
b1 = rng.normal(size=d_h) * 0.1
W2 = rng.normal(size=(d_out, d_h)) * 0.5
b2 = np.zeros(d_out)

Z1 = X @ W1.T + b1            # inner dims: (B,d_in)(d_in,d_h)
H  = np.maximum(0, Z1)        # ReLU, elementwise: shape unchanged
Y  = H @ W2.T + b2

ledger = [("X", X), ("W1", W1), ("Z1 = X @ W1.T + b1", Z1),
          ("H = ReLU(Z1)", H), ("W2", W2), ("Y = H @ W2.T + b2", Y)]
for name, A in ledger:
    print(f"{name:24s} {A.shape}")
print(f"\nReLU zeroed {(H == 0).mean():.0%} of H — sparsity is the default")

predict each shape, then run

The forward pass is two matmuls; learning is the question of how blame flows backward through them. Chapter 08: the chain rule organized on a computational graph — backpropagation — plus momentum and Adam, the optimizers that turn raw gradients into progress. Instrument M7.1 already ran every line of it; next chapter you read its algebra.

§