The perceptron and its limit
Rosenblatt's 1958 perceptron is a thresholded weighted sum: \( \hat{y} = \mathbf{1}[\, w^\top x + b > 0 \,] \). Geometrically it is a single hyperplane — everything on one side is class 1, everything on the other is class 0. The logistic regression of Chapter 03 softens the threshold into a sigmoid, but the geometry is identical: one flat cut through input space. For sixty years of statistics that was usually enough, because humans hand-engineered features until the classes became linearly separable.
Then consider the smallest dataset in this encyclopedia — exclusive-or:
| x₁ | x₂ | x₁ XOR x₂ | corner |
|---|---|---|---|
| 0 | 0 | 0 | bottom-left |
| 0 | 1 | 1 | top-left |
| 1 | 0 | 1 | bottom-right |
| 1 | 1 | 0 | top-right |
The 1s sit on one diagonal, the 0s on the other. No line separates them, and the proof takes four inequalities. Suppose weights \(w_1, w_2, b\) existed. The four points demand:
- \((0,0)\to 0\): \( b \le 0 \)
- \((1,0)\to 1\): \( w_1 + b > 0 \)
- \((0,1)\to 1\): \( w_2 + b > 0 \)
- \((1,1)\to 0\): \( w_1 + w_2 + b \le 0 \)
Add the middle two: \( w_1 + w_2 + 2b > 0 \), so \( w_1 + w_2 + b > -b \ge 0 \) — directly contradicting the fourth. No linear model, however trained, can represent XOR. Minsky and Papert published this in 1969, funding for neural networks evaporated, and the result stood as the field's cautionary tale until people accepted the obvious-but-then-untrainable fix: more layers. Note the honest framing — the limitation was about single-layer machines; multi-layer ones were known to be more powerful but nobody could train them until backpropagation spread in 1986 (Chapter 08).
The MLP: linear, bend, linear
The multi-layer perceptron inserts a hidden layer of \(d_h\) units between input and output, with an elementwise nonlinearity \(\varphi\) — the activation function — applied in between:
What does training do with this freedom? Watch it happen. The instrument below is a complete neural network — forward pass, backpropagation, gradient updates — implemented in this page with no library. It trains a 2-H-1 MLP (tanh hidden units, sigmoid output, cross-entropy loss, full-batch gradient descent) on seeded datasets a linear model cannot touch.
Before training finds weights, it helps to see that good weights exist. The cell below hard-codes a 2-8-1 ReLU network in which only two of the eight hidden units do any work — and XOR falls exactly. This is the existence proof; the instrument above is the search; Chapter 08 is the algebra of the search.
import numpy as np
def relu(z): return np.maximum(0, z)
# 2-8-1 MLP, weights set by hand: 2 of the 8 hidden units solve XOR, 6 idle
W1 = np.zeros((8, 2)); b1 = np.zeros(8)
W1[0] = [1, 1]; b1[0] = 0.0 # h0 = ReLU(x1 + x2)
W1[1] = [1, 1]; b1[1] = -1.0 # h1 = ReLU(x1 + x2 - 1)
W2 = np.zeros((1, 8)); W2[0, 0] = 1.0; W2[0, 1] = -2.0
b2 = np.zeros(1) # y = h0 - 2*h1
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
H = relu(X @ W1.T + b1) # (4, 8)
Y = H @ W2.T + b2 # (4, 1)
for x, y in zip(X, Y):
print(f"x = {x.astype(int)} -> yhat = {y[0]:.1f} XOR = {int(x[0]) ^ int(x[1])}")
print("\nhidden layer H (4 inputs x 8 units):")
print(H)
Activation functions
The choice of \(\varphi\) looks cosmetic and decided a decade of history. An activation must be nonlinear (or the stack collapses), nearly free to compute (it runs once per unit per example), and — the part nobody appreciated until networks got deep — it must pass gradients. The learning signal reaching layer 1 is a product of one factor per layer crossed:
| Activation | φ(z) | Range | max φ′ | Verdict |
|---|---|---|---|---|
| Sigmoid σ | 1 / (1 + e⁻ᶻ) | (0, 1) | 0.25 | Saturates on both sides; killed deep stacks. Survives as the output for probabilities and inside gates. |
| Tanh | 2σ(2z) − 1 | (−1, 1) | 1.00 | Zero-centered sigmoid; the RNN-era default; still saturates at both ends. |
| ReLU | max(0, z) | [0, ∞) | 1.00 | Derivative is 1 everywhere active; cheap; sparse. Risk: dead units stuck at φ′ = 0 forever. |
| GELU | z · Φ(z) | ≈ (−0.17, ∞) | ≈ 1.08 | Smooth ReLU weighted by the Gaussian CDF; default of the GPT-2/BERT era of transformers. |
Where the lineage goes next. Modern LLMs use a gated refinement: SwiGLU (Vol II · EQ 2.3) multiplies a SiLU-squashed gate elementwise against a linear up-projection, letting the MLP modulate its own features. It is the direct descendant of the choices in this table — same constraint set, one more multiplicative trick.
Universal approximation — and its fine print
The classical justification for all of this is the universal approximation theorem, in words: a feed-forward network with a single hidden layer and any non-polynomial activation can approximate any continuous function on a bounded region to any accuracy you name — provided you may make the hidden layer wide enough (Cybenko 1989 for sigmoids; Hornik 1991 in general). One layer of bends, in principle, suffices for everything.
Existence is not learnability. The theorem is non-constructive on every axis that matters: (1) it does not say how many units — worst-case width grows exponentially in the input dimension, the curse of dimensionality again; (2) it does not say the weights are findable — gradient descent on a non-convex loss carries no guarantee of reaching them (you watched H = 2 fail on a representable problem in Instrument M7.1); (3) it says nothing about generalization from finite data. In Chapter 06's language, it bounds approximation error only — estimation and optimization error are untouched.
Read it, then, as a license rather than an explanation: MLPs are a hypothesis class with no permanent blind spots, unlike the perceptron. Why gradient-trained deep networks work as well as they do on real data remains a partially open research question in 2026 — be suspicious of anyone who cites this theorem as the answer.
Width, depth, and why depth wins
If one wide layer suffices in principle, why is every serious network deep? Because depth buys composition. A second hidden layer computes features of features: layer 1 finds edges in pixels, layer 2 assembles edges into textures and parts, layer 3 into objects. In text models the same hierarchy runs characters → morphemes → syntax → semantics. A wide-shallow network must build every high-level feature from raw inputs in one step; a deep one builds a vocabulary at each level and reuses it everywhere above — the same economy that makes subroutines beat straight-line code.
| Wider (same depth) | Deeper (same width) | |
|---|---|---|
| What you get | More parallel features at one level of abstraction | A hierarchy — features of features |
| Expressive power | grows polynomially | some functions need exponentially fewer units |
| Trainability | Benign; gradients stay healthy | EQ M7.2's product bites — needs ReLU-family φ, careful init, later residuals & normalization (Vol II · CH 02) |
The middle row has real theorems behind it: deep ReLU networks fold input space repeatedly, producing a number of linear regions that grows exponentially with depth but only polynomially with width (Montúfar 2014), and there exist functions computable by a deep network that no shallow network of sub-exponential width can match (Telgarsky 2016). Honesty requires the caveat: those are worst-case constructions, not descriptions of your dataset. The practical reasons depth wins are that real data is compositional, and that a decade of engineering — initialization, normalization, residual connections — removed depth's optimization penalty. On flat tabular data, Chapter 04's gradient-boosted trees still routinely beat both wide and deep.
The forward pass is matrix multiplication
Everything above was written for one input vector. Real computation is batched: stack \(B\) examples as rows of \(X\), and the entire forward pass becomes two matrix multiplications — which is the only operation GPUs are truly built for, and the reason the whole field runs on them:
Professionals debug networks by reciting shapes, not by reading values. Build the habit now — run the drill, then change d_h or batch and predict every line before re-running. If you can write the shape ledger of a network from memory, you understand its forward pass; there is nothing else in it.
import numpy as np
rng = np.random.default_rng(7)
B, d_in, d_h, d_out = 32, 2, 8, 1 # batch 32 through a 2-8-1
X = rng.normal(size=(B, d_in))
W1 = rng.normal(size=(d_h, d_in)) * 0.5 # (out, in) convention
b1 = rng.normal(size=d_h) * 0.1
W2 = rng.normal(size=(d_out, d_h)) * 0.5
b2 = np.zeros(d_out)
Z1 = X @ W1.T + b1 # inner dims: (B,d_in)(d_in,d_h)
H = np.maximum(0, Z1) # ReLU, elementwise: shape unchanged
Y = H @ W2.T + b2
ledger = [("X", X), ("W1", W1), ("Z1 = X @ W1.T + b1", Z1),
("H = ReLU(Z1)", H), ("W2", W2), ("Y = H @ W2.T + b2", Y)]
for name, A in ledger:
print(f"{name:24s} {A.shape}")
print(f"\nReLU zeroed {(H == 0).mean():.0%} of H — sparsity is the default")
The forward pass is two matmuls; learning is the question of how blame flows backward through them. Chapter 08: the chain rule organized on a computational graph — backpropagation — plus momentum and Adam, the optimizers that turn raw gradients into progress. Instrument M7.1 already ran every line of it; next chapter you read its algebra.
Further reading
- Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. — the founding single-layer model whose limits motivate the MLP.
- Minsky, M. & Papert, S. (1969). Perceptrons. — the formal proof that a single perceptron cannot solve XOR, the limit Section 7.1 turns on.
- Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. — the universal approximation theorem: one hidden layer can approximate any continuous function.
- Hornik, K., Stinchcombe, M. & White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. — generalizes universality beyond sigmoids to broad activation classes.
- Glorot, X. & Bengio, Y. (2010). Understanding the Difficulty of Training Deep Feedforward Networks. — the Xavier initialization and activation analysis that make deep MLPs trainable.
- Nair, V. & Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. — the case for ReLU, now the default hidden activation.