From MLP to deep networks
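A multilayer perceptron (MLP) is an alternating stack of affine maps and pointwise nonlinearities. Each layer takes the previous activation \(h^{(\ell-1)}\), applies a learned weight matrix \(W^{(\ell)}\) and bias \(b^{(\ell)}\), then a nonlinearity \(\phi\):
\[ h^{(\ell)} = \phi\!\left(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\right) \tag{EQ N1.1} \]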
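The promise of depth is compositional structure: early layers learn edges, later layers learn objects; early layers learn phonemes, later layers learn meaning. The obstacle is that the same composition that builds rich features also compounds the scale of whatever flows through it. Consider the backward pass. Backpropagation (ML 08) sends the loss gradient through the chain rule, so the gradient at layer \(\ell\) is a product of Jacobians from the output back to \(\ell\):
\[ \frac{\partial \mathcal{L}}{\partial h^{(\ell)}} = \left(J^{(L)} J^{(L-1)} \cdots J^{(\ell+1)}\right)^{\top} \frac{\partial \mathcal{L}}{\partial h^{(L)}}, \qquad J^{(k)} = \frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \mathrm{diag}\!\left(\phi'(z^{(k)})\right) W^{(k)} \tag{EQ N1.2} \]
where \(z^{(k)} = W^{(k)} h^{(k-1)} + b^{(k)}\) is the pre-activation of layer \(k\). Each Jacobian has two factors: a diagonal factor from the nonlinearity and the weight matrix itself.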
The same compounding hits the forward pass: an activation passing through many layers is repeatedly scaled, so its variance can balloon or collapse before it ever reaches the output. The first historical fix, switching from saturating sigmoids to the non-saturating ReLU \(\phi(z) = \max(0, z)\), removed the worst of the diagonal saturation. But ReLU alone does not control the weight factor \(W^{(k)}\), and that is where the next section begins.
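The saturation argument can be checked numerically. The sketch below (helper names are illustrative, not from the text) evaluates the derivative of the sigmoid and of ReLU at a few pre-activations: the sigmoid derivative never exceeds 0.25 and collapses for large \(|z|\), while the ReLU derivative is exactly 1 on every active unit.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid, s * (1 - s); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}   sigmoid' = {sigmoid_grad(z):.4f}   relu' = {relu_grad(z):.0f}")

# Best case for the diagonal factor of a 50-layer sigmoid chain: every unit at z = 0.
print(f"0.25 ** 50 = {0.25 ** 50:.3e}")
```

Even in the best case the sigmoid's diagonal factor shrinks the gradient by a factor of four per layer; ReLU removes that shrinkage for active units but, as the text notes, says nothing about the weight factor.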
Think of a deep network as a chain of amplifiers. If each amplifier has gain 0.9, then 50 of them in series have gain \(0.9^{50}\approx 0.005\) — the signal is gone. Gain 1.1 gives \(1.1^{50}\approx 117\) — it saturates. Only a chain tuned to gain \(\approx 1\) passes signal cleanly through depth. Init, normalization, and residuals are three ways to lock that gain near one.
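The amplifier arithmetic is easy to reproduce. This is a minimal sketch; `chain_gain` is a hypothetical helper name introduced only for the illustration.

```python
def chain_gain(gain: float, depth: int) -> float:
    """Overall gain of `depth` identical stages connected in series."""
    return gain ** depth

depth = 50
for gain in (0.9, 1.0, 1.1):
    print(f"per-layer gain {gain:.1f} over {depth} layers -> {chain_gain(gain, depth):.4g}")
```

A 10% deviation per layer in either direction changes the end-to-end gain by more than two orders of magnitude at depth 50, which is why the per-layer gain has to sit very close to one.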
Weight initialization — Xavier & He
Before a single gradient step, the random weights you start from already decide whether signal survives the forward pass. The goal is a variance-preserving initialization: each layer should pass activations forward without systematically growing or shrinking their variance. Treat the weights as independent zero-mean random variables, treat the inputs as zero-mean and independent of the weights, and propagate variance through EQ N1.1. For a layer with \(n_{\text{in}}\) inputs, the pre-activation variance is the sum of \(n_{\text{in}}\) independent terms:
\[ \mathrm{Var}(z_j) = \sum_{i=1}^{n_{\text{in}}} \mathrm{Var}(W_{ji}\, h_i) = n_{\text{in}}\,\mathrm{Var}(W)\,\mathrm{Var}(h) \tag{EQ N1.3} \]
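The variance bookkeeping can be sanity-checked by simulation. The sketch below draws many pre-activations \(z = \sum_i W_i h_i\) with independent zero-mean Gaussian weights and inputs and compares the sample variance with the prediction \(n_{\text{in}}\,\mathrm{Var}(W)\,\mathrm{Var}(h)\). The fan-in, the two standard deviations, and the trial count are arbitrary illustrative values.

```python
import random

random.seed(0)
n_in, trials = 64, 5000
w_std, h_std = 0.2, 1.5          # illustrative values, not taken from the text

def sample_preactivation() -> float:
    """One draw of z = sum_i W_i * h_i with independent zero-mean W_i and h_i."""
    return sum(random.gauss(0.0, w_std) * random.gauss(0.0, h_std) for _ in range(n_in))

zs = [sample_preactivation() for _ in range(trials)]
mean_z = sum(zs) / trials
empirical_var = sum((z - mean_z) ** 2 for z in zs) / trials
predicted_var = n_in * w_std ** 2 * h_std ** 2   # n_in * Var(W) * Var(h)

print(f"empirical Var(z) = {empirical_var:.3f}")
print(f"predicted Var(z) = {predicted_var:.3f}")
```

The two numbers should agree to within sampling noise; the factor \(n_{\text{in}}\,\mathrm{Var}(W)\) is the per-layer gain that initialization has to pin near one.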
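Setting the forward factor \(n_{\text{in}}\,\mathrm{Var}(W)\) to 1 gives \(\mathrm{Var}(W) = 1/n_{\text{in}}\). The backward pass wants \(\mathrm{Var}(W) = 1/n_{\text{out}}\) for the same reason (gradients propagate through \(W^\top\)). You cannot satisfy both unless the layer is square, so Glorot (Xavier) initialization compromises by replacing the fan count with the average of the two, \((n_{\text{in}} + n_{\text{out}})/2\), which makes \(\mathrm{Var}(W)\) the harmonic mean of the two targets:
\[ \mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \tag{EQ N1.4} \]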
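As a sketch of the Glorot compromise, the helpers below (hypothetical names, standard library only) compute the standard deviation \(\sqrt{2/(n_{\text{in}} + n_{\text{out}})}\), sample a non-square weight matrix with it, and print the resulting forward and backward factors. The fan counts 300 and 100 are arbitrary.

```python
import math
import random

def xavier_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation for Glorot/Xavier normal init: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def xavier_normal(fan_in: int, fan_out: int, rng: random.Random) -> list:
    """Sample a fan_out x fan_in matrix of zero-mean Gaussians with Xavier scaling."""
    std = xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 300, 100                      # illustrative, deliberately non-square
W = xavier_normal(fan_in, fan_out, rng)
flat = [w for row in W for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)   # weights are zero-mean
target_var = xavier_std(fan_in, fan_out) ** 2

print(f"target Var(W)    = {target_var:.5f}")
print(f"empirical Var(W) = {empirical_var:.5f}")
print(f"forward factor  n_in  * Var(W) = {fan_in * target_var:.2f}")
print(f"backward factor n_out * Var(W) = {fan_out * target_var:.2f}")
```

For a non-square layer neither factor equals one; they straddle it and average to one, which is the sense in which the scheme is a compromise.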
ReLU breaks the linear assumption: it zeros out the negative half of its inputs, so for a zero-mean, symmetric pre-activation it halves the second moment \(\mathbb{E}[h^2]\) that the next layer sees. He (Kaiming) initialization compensates by doubling the weight variance, keying off \(n_{\text{in}}\) alone since the rectifier is the dominant correction:
\[ \mathrm{Var}(W) = \frac{2}{n_{\text{in}}} \tag{EQ N1.5} \]
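The factor of two can be verified directly. The sketch below samples zero-mean Gaussian pre-activations, applies ReLU, and compares second moments before and after; `he_std` is a hypothetical helper for \(\sqrt{2/n_{\text{in}}}\), and the value of `sigma` is arbitrary.

```python
import math
import random

def he_std(fan_in: int) -> float:
    """Standard deviation for He/Kaiming normal init: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

rng = random.Random(0)
sigma, samples = 1.7, 200_000                   # illustrative values
z = [rng.gauss(0.0, sigma) for _ in range(samples)]
relu_z = [max(v, 0.0) for v in z]

second_moment_in = sum(v * v for v in z) / samples
second_moment_out = sum(v * v for v in relu_z) / samples
ratio = second_moment_out / second_moment_in

print(f"E[z^2]        = {second_moment_in:.4f}")
print(f"E[relu(z)^2]  = {second_moment_out:.4f}")
print(f"ratio         = {ratio:.4f}")

# Per-layer gain under He init: n_in * Var(W) * (ReLU factor of 1/2).
fan_in = 256
print(f"n_in * Var(W) * 0.5 = {fan_in * he_std(fan_in) ** 2 * 0.5:.3f}")
```

The ratio comes out close to one half regardless of `sigma`, so doubling the weight variance restores a per-layer gain of one.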
kaiming_normal_ in PyTorch). The lesson is general: the right init depends on the nonlinearity, because what you must preserve is the signal scale after the activation, not before it.
# Activation scale (mean square) across depth: naive vs Xavier vs He init (ReLU net)
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 256, 25, 1024
h0 = rng.standard_normal((batch, n))          # unit-variance input

def run(std_fn, relu=True):
    # Track E[h^2]: ReLU outputs are not zero-mean, and it is the mean square
    # (not the centered variance) that sets the next layer's pre-activation variance.
    h, msq = h0.copy(), [np.mean(h0**2)]
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std_fn(n)
        h = h @ W                             # EQ N1.1 (no bias)
        if relu:
            h = np.maximum(h, 0.0)            # ReLU halves the mean square
        msq.append(np.mean(h**2))
    return msq

naive = run(lambda n: 1.0)                    # std = 1: explodes
xavier = run(lambda n: np.sqrt(1.0 / n))      # tuned for linear/tanh
he = run(lambda n: np.sqrt(2.0 / n))          # tuned for ReLU

print(" layer       naive       xavier        he")
for L in (0, 5, 12, 25):
    print(f" {L:5d} {naive[L]:11.2e} {xavier[L]:12.4f} {he[L]:9.4f}")
print("\nnaive blows up; xavier (1/n) decays under ReLU; he (2/n) stays of order 1.")

# plot_xy is assumed to be a plotting helper supplied by the host page; skip it elsewhere.
if "plot_xy" in globals():
    plot_xy(list(range(depth + 1)), [min(v, 1e3) for v in he])  # He stays roughly flat
Batch normalization
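A good initialization keeps variance under control at step zero, but weights move during training, and the distribution of each layer's inputs drifts as the layers below it update. Ioffe & Szegedy named this drift internal covariate shift and proposed fixing it directly: standardize each layer's pre-activations to zero mean and unit variance, using statistics computed over the current mini-batch:
\[ \hat{z} = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{z} + \beta \tag{EQ N1.6} \]
Here \(\mu_B\) and \(\sigma_B^2\) are the per-feature mean and variance over the mini-batch, \(\epsilon\) is a small constant for numerical stability, and \(\gamma\), \(\beta\) are a learned scale and shift.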
The payoff is large and somewhat over-determined. BatchNorm lets you use higher learning rates without divergence, makes training far less sensitive to the choice of initialization, and acts as a mild regularizer because each example's normalization depends on the random composition of its mini-batch. The original paper credited the reduction of internal covariate shift; later work (Santurkar et al., 2018) argued the real mechanism is a smoother loss landscape — BatchNorm bounds how fast the loss and its gradients can change, so optimization steps behave more predictably. The mechanism is still debated; the empirical win is not.
Train vs. inference — the classic footgun. At training time BatchNorm uses the live mini-batch statistics. At inference you have no batch (or want determinism), so it switches to a running average of mean and variance accumulated during training. Forgetting to put the model in eval mode — so it normalizes a single test example by its own degenerate statistics — produces the most common BatchNorm bug. BatchNorm also couples examples within a batch and degrades at very small batch sizes; that weakness is exactly why LayerNorm (normalize across features of one example, batch-independent) won in Transformers, where it sits inside every block (Vol II · Ch 02).
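The train/eval distinction can be sketched for a single scalar feature. Everything below is an illustrative assumption rather than a library API: the class name, the `momentum` value, and the exponential-moving-average update are choices made for the sketch, and \(\gamma\), \(\beta\) are fixed at 1 and 0.

```python
import random

class BatchNormOneFeature:
    """BatchNorm for one scalar feature with separate train and eval behavior."""

    def __init__(self, momentum: float = 0.1, eps: float = 1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Live statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving average, used later at inference.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]

rng = random.Random(0)
bn = BatchNormOneFeature()
for _ in range(200):                            # "training" on a feature ~ N(5, 2^2)
    bn([rng.gauss(5.0, 2.0) for _ in range(64)])

x_test = [7.0]                                  # a single test example
bn.training = False
eval_out = bn(x_test)[0]                        # normalized by the running statistics
bn.training = True
train_out = bn(x_test)[0]                       # the bug: normalized by its own statistics

print(f"eval mode  : {eval_out:+.3f}")
print(f"train mode : {train_out:+.3f}")
```

In eval mode the test value lands about one standard deviation above the mean, as it should. In train mode a batch of one has mean equal to itself and zero variance, so every input maps to zero and the information is destroyed.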
# Forward pass with & without BatchNorm; print per-layer activation stats
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 128, 12, 512
x = rng.standard_normal((batch, n))

def bn(z, eps=1e-5):                          # EQ N1.6, gamma=1, beta=0
    mu = z.mean(0)                            # per-feature mean over the mini-batch
    var = z.var(0)                            # per-feature variance over the mini-batch
    return (z - mu) / np.sqrt(var + eps)

def forward(use_bn):
    h = x.copy()
    stats = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # He init
        z = h @ W
        if use_bn:
            z = bn(z)                         # re-center & re-scale each layer
        h = np.maximum(z, 0.0)                # ReLU
        stats.append((h.mean(), h.std()))
    return stats

print(" layer   no-BN mean / std    with-BN mean / std")
plain, normed = forward(False), forward(True)
for L in (0, 5, 11):
    p, q = plain[L], normed[L]
    print(f" {L:5d}   {p[0]:+.3f} / {p[1]:6.3f}     {q[0]:+.3f} / {q[1]:6.3f}")
print("\nBatchNorm pins each layer's distribution; without it the std drifts.")
Residual connections
Init and normalization keep variance in line, but they do not remove the fundamental fragility of EQ N1.2: a gradient still has to survive a product of \(L\) Jacobians. By 2015, even well-initialized, batch-normalized networks showed a degradation problem: adding more layers made training accuracy worse, not just test accuracy. The deeper net could in principle copy the shallower one by setting extra layers to identity, yet optimization could not find that solution. He et al.'s answer was to make the identity the default, by adding a skip connection around each block:
\[ h^{(\ell)} = h^{(\ell-1)} + F^{(\ell)}\!\left(h^{(\ell-1)}\right) \tag{EQ N1.7} \]
where \(F^{(\ell)}\) is the block's learned transformation, so the block only has to learn the residual on top of the identity.
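The "identity by default" property can be seen in a few lines. In this sketch the residual branch is assumed to be a two-layer ReLU map \(F(h) = W_2\,\mathrm{relu}(W_1 h)\), an illustrative choice rather than the architecture from the paper; zeroing the last weight matrix turns the residual block into an exact identity, while the same weights in a plain block wipe out the signal.

```python
import random

def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def branch(h, W1, W2):
    """Hypothetical residual branch F(h) = W2 @ relu(W1 @ h)."""
    return matvec(W2, relu(matvec(W1, h)))

def plain_block(h, W1, W2):
    """No skip connection: the output is F(h) alone."""
    return branch(h, W1, W2)

def residual_block(h, W1, W2):
    """Skip connection: the output is h + F(h)."""
    return [hi + fi for hi, fi in zip(h, branch(h, W1, W2))]

rng = random.Random(0)
n = 4
h = [rng.gauss(0.0, 1.0) for _ in range(n)]
W1 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
W2_zero = [[0.0] * n for _ in range(n)]         # branch switched off

identity_out = residual_block(h, W1, W2_zero)
plain_out = plain_block(h, W1, W2_zero)
print("input          :", [round(x, 3) for x in h])
print("residual block :", [round(x, 3) for x in identity_out])
print("plain block    :", [round(x, 3) for x in plain_out])
```

With the skip in place, "do nothing" corresponds to weights at zero, which is where small-weight initialization and weight decay already push; without it, the block would have to learn an identity map through a nonlinearity.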
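Why does the skip rescue the gradient? Differentiate EQ N1.7. The Jacobian of a residual block is the identity plus the block's own Jacobian, so the backward product gains an additive shortcut at every layer:
\[ \frac{\partial h^{(\ell)}}{\partial h^{(\ell-1)}} = I + \frac{\partial F^{(\ell)}}{\partial h^{(\ell-1)}} \tag{EQ N1.8} \]
where \(F^{(\ell)}\) denotes the transformation computed inside the block. Even if every \(\partial F^{(\ell)} / \partial h^{(\ell-1)}\) is small, the product of these factors still contains the identity path.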
true or false.)
# Gradient flow: plain stack vanishes, residual stack survives (EQ N1.8)
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50
g = rng.standard_normal((depth, n, n)) * np.sqrt(0.7 / n)   # block Jacobians, gain<1
grad = rng.standard_normal((1, n))            # incoming gradient at the output
plain, res = grad.copy(), grad.copy()
plain_norm, res_norm = [np.linalg.norm(plain)], [np.linalg.norm(res)]

for k in reversed(range(depth)):
    plain = plain @ g[k].T                    # plain: multiply by J each layer
    res = res @ (np.eye(n) + g[k]).T          # residual: I + J (the skip)
    plain_norm.append(np.linalg.norm(plain))
    res_norm.append(np.linalg.norm(res))

print(f"plain gradient norm at input layer:    {plain_norm[-1]:.3e}")
print(f"residual gradient norm at input layer: {res_norm[-1]:.3e}")
print(f"residual / plain ratio:                {res_norm[-1]/plain_norm[-1]:.1f}x")
print("\nthe +I shortcut keeps the early-layer gradient alive across 50 layers.")

# plot_xy is assumed to be a plotting helper supplied by the host page; skip it elsewhere.
if "plot_xy" in globals():
    plot_xy(list(range(depth + 1)), [np.log10(v + 1e-30) for v in plain_norm])
Regularization — dropout & weight decay
The first three fixes make a deep net trainable; the last makes it generalize. A network with millions of parameters can memorize its training set outright, so we add pressure toward simpler solutions. Two techniques dominate, and they attack overfitting from opposite directions.
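Dropout randomly zeros each activation with probability \(p\) on every training step, then rescales the survivors so the expected activation is unchanged:
\[ \tilde{h}_i = \frac{m_i}{1-p}\,h_i, \qquad m_i \sim \mathrm{Bernoulli}(1-p) \]
so that \(\mathbb{E}[\tilde{h}_i] = h_i\). At inference the mask is dropped and activations pass through unchanged.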
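A minimal sketch of the rescaled ("inverted") form of dropout follows, using only the standard library. The function name, signature, and the choice \(p = 0.3\) are illustrative assumptions.

```python
import random

def dropout(h, p, rng, training=True):
    """Zero each entry with probability p and scale survivors by 1 / (1 - p).

    At inference (training=False) the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)
h = [1.0] * 100_000
out = dropout(h, p=0.3, rng=rng)

mean_out = sum(out) / len(out)
frac_zero = out.count(0.0) / len(out)
print(f"fraction zeroed      = {frac_zero:.3f}")
print(f"mean after dropout   = {mean_out:.3f}")
print(f"unchanged at inference: {dropout(h, 0.3, rng, training=False) == h}")
```

Because the survivors are scaled up during training, no correction is needed at inference: the layer simply becomes the identity.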
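Weight decay instead penalizes large weights, adding an \(L_2\) term to the loss that pulls every weight toward zero. With the common \(\tfrac{\lambda}{2}\) scaling (a convention assumed here), the penalized loss is:
\[ \mathcal{L}_{\text{reg}}(W) = \mathcal{L}(W) + \frac{\lambda}{2}\,\lVert W \rVert_2^2 \]
so the penalty contributes \(\lambda W\) to the gradient.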
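The pull toward zero is easiest to see in the update rule. The sketch below assumes a penalty of \(\tfrac{\lambda}{2}\lVert w \rVert^2\) and plain SGD; the function names and the values of `lr` and `lam` are illustrative. It compares adding \(\lambda w\) to the gradient with shrinking the weights by \((1 - \eta\lambda)\) before the gradient step.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * ||w||^2: the penalty adds lam * w to the gradient."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_decoupled(w, grad, lr, lam):
    """Shrink the weights by (1 - lr * lam), then take the plain gradient step."""
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]

lr, lam = 0.1, 0.5
w_l2, w_dec = [1.0, -2.0], [1.0, -2.0]
zero_grad = [0.0, 0.0]                          # no data gradient: only the decay acts

for _ in range(100):
    w_l2 = sgd_step_l2(w_l2, zero_grad, lr, lam)
    w_dec = sgd_step_decoupled(w_dec, zero_grad, lr, lam)

shrink = (1.0 - lr * lam) ** 100
print(f"after 100 steps (L2 penalty): {w_l2}")
print(f"after 100 steps (decoupled) : {w_dec}")
print(f"predicted shrink factor     : {shrink:.5f}")
```

For plain SGD the two forms are algebraically identical. They stop being equivalent once the optimizer rescales gradients per parameter, as adaptive methods do, which is the distinction Loshchilov & Hutter (listed in the references) draw.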
The honest picture. The four fixes overlap and partly substitute for one another. BatchNorm already regularizes, which is why dropout and BN are often redundant together. Good initialization reduces — but does not eliminate — the need for normalization. Residual connections plus normalization are now so reliable that very deep training is routine, and the field's frontier has moved from can we train it to can we afford it. None of these is a law of nature; each is an engineering fix to the same underlying disease — signal that compounds geometrically through depth — and each will be revisited, sharpened, or replaced as architectures evolve.
Now that signal can flow through depth, the question is what structure to give the layers. Chapter 02 specializes the dense layer for images: convolutions share weights across space, pooling builds translation tolerance, and the same init/norm/residual toolkit you just met powers the ResNets that dominated computer vision.
References
- Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
- Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
- Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization.
- Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization?