Deep Learning Foundations — Init, Norm & Residuals

1.1

From MLP to deep networks

A multilayer perceptron (MLP) is an alternating stack of affine maps and pointwise nonlinearities. Each layer takes the previous activation $h^{(\ell-1)}$, applies a learned weight matrix and bias, then a nonlinearity $\phi$:

EQ N1.1 — A FORWARD LAYER $$ z^{(\ell)} = W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}, \qquad h^{(\ell)} = \phi\!\big(z^{(\ell)}\big), \qquad \ell = 1, \ldots, L $$

$z^{(\ell)}$ is the pre-activation, $h^{(\ell)}$ the activation. Stack $L$ of these and the network composes $L$ nonlinear maps. The universal approximation theorem says even a single sufficiently wide hidden layer can approximate any continuous function on a compact set, but it is silent on how wide the layer must be and gives no recipe for finding the weights. Depth is the practical answer: deep networks build features hierarchically, and for some function families a shallow network needs exponentially more units to match a deep one.
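As a minimal sketch of EQ N1.1, assuming NumPy, $\phi = \tanh$, and layer widths chosen purely for illustration (the names `mlp_forward`, `sizes`, and `params` are hypothetical, not from the text), the forward pass is a short loop:

```python
import numpy as np

def mlp_forward(x, params, phi=np.tanh):
    """Apply EQ N1.1 layer by layer; x has shape (batch, n_0)."""
    h = x
    for W, b in params:
        z = h @ W.T + b      # pre-activation z = W h + b (rows of h are examples)
        h = phi(z)           # activation h = phi(z)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]         # illustrative widths: input, two hidden layers, output
params = [
    (rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.standard_normal((5, sizes[0]))
out = mlp_forward(x, params)
print(out.shape)
```

The $1/\sqrt{n_{\text{in}}}$ weight scale used here is only a placeholder so the example runs sensibly; choosing that scale properly is the subject of §1.2.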

The promise of depth is compositional structure: early layers learn edges, later layers learn objects; early layers learn phonemes, later layers learn meaning. The obstacle is that the same composition that builds rich features also compounds the scale of whatever flows through it. Consider the backward pass. Backpropagation (ML 08) sends the loss gradient through the chain rule, so the gradient at layer $\ell$ is a product of Jacobians from the output back to $\ell$:

EQ N1.2 — WHY DEPTH IS HARD: THE JACOBIAN PRODUCT $$ \frac{\partial \mathcal{L}}{\partial h^{(\ell)}} = \left(\prod_{k=\ell+1}^{L} J^{(k)}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial h^{(L)}}, \qquad J^{(k)} = \frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \mathrm{diag}\!\big(\phi'(z^{(k)})\big)\, W^{(k)} $$

The gradient is multiplied by one Jacobian per layer. If the typical singular value of these Jacobians is below 1, the product shrinks geometrically toward zero — the vanishing-gradient problem, which leaves early layers learning nothing. If it is above 1, the product blows up — the exploding-gradient problem, which makes training diverge. A network with sigmoid/tanh units is doubly cursed: $\phi'$ saturates to near zero in the tails, so the diagonal factor alone kills the signal. The whole chapter is a campaign to keep that product near 1.
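The geometric shrink-or-blow-up can be seen numerically. The sketch below is an illustration under simplifying assumptions, not the chapter's own experiment: the layers are linear ($\phi' = 1$), each Jacobian is a random Gaussian matrix whose root-mean-square singular value is about `gain`, and `backprop_norm_ratio` is a hypothetical helper name.

```python
import numpy as np

def backprop_norm_ratio(gain, n=256, depth=50, seed=0):
    """Push a gradient back through `depth` random linear layers (phi' = 1).

    Each Jacobian is W with i.i.d. N(0, gain^2 / n) entries, so its
    root-mean-square singular value is about `gain`.
    Returns |gradient at the first layer| / |gradient at the output|.
    """
    rng = np.random.default_rng(seed)
    grad = rng.standard_normal(n)
    start = np.linalg.norm(grad)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * gain / np.sqrt(n)
        grad = W.T @ grad            # one transposed Jacobian factor of EQ N1.2
    return np.linalg.norm(grad) / start

ratios = {gain: backprop_norm_ratio(gain) for gain in (0.9, 1.0, 1.1)}
for gain, ratio in ratios.items():
    print(f"gain {gain:.1f}: gradient norm ratio after 50 layers = {ratio:.3e}")
```

Because the same seed is reused, the three runs differ only by the factor `gain ** 50`, which isolates the effect of the per-layer gain.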

The same compounding hits the forward pass: an activation passing through many layers is repeatedly scaled, so its variance can balloon or collapse before it ever reaches the output. The first historical fix, switching from saturating sigmoids to the non-saturating ReLU $\phi(z) = \max(0, z)$, removed the worst of the diagonal saturation. But ReLU alone does not control the weight factor $W^{(k)}$, and that is where the next section begins.

INTUITION

Think of a deep network as a chain of amplifiers. If each amplifier has gain 0.9, then 50 of them in series have gain $0.9^{50}\approx 0.005$: the signal is gone. Gain 1.1 gives $1.1^{50}\approx 117$: the signal blows up. Only a chain tuned to gain $\approx 1$ passes signal cleanly through depth. Init, normalization, and residuals are three ways to lock that gain near one.

1.2

Weight initialization — Xavier & He

Before a single gradient step, the random weights you start from already decide whether signal survives the forward pass. The goal is a variance-preserving initialization: each layer should pass activations forward without systematically growing or shrinking their variance. Treat the weights as independent zero-mean random variables and propagate variance through EQ N1.1. For a layer with $n_{\text{in}}$ inputs, the pre-activation variance is the sum of $n_{\text{in}}$ independent terms:

EQ N1.3 — VARIANCE PROPAGATION (LINEAR REGIME) $$ \mathrm{Var}\big(z^{(\ell)}\big) = n_{\text{in}}\,\mathrm{Var}\big(W^{(\ell)}\big)\,\mathrm{Var}\big(h^{(\ell-1)}\big) $$

If $n_{\text{in}}\,\mathrm{Var}(W) > 1$, variance grows layer by layer and activations explode; if it is below 1, they vanish. The fix is to choose $\mathrm{Var}(W)$ so the factor is exactly 1. The naive default — $W \sim \mathcal{N}(0, 1)$, variance 1 — multiplies variance by $n_{\text{in}}$ at every layer, which for a width-256 network is a factor of 256 per layer. That single bad constant is enough to make a deep net untrainable.
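EQ N1.3 is easy to check empirically. The following sketch assumes zero-mean Gaussian inputs and weights and a single linear layer; the sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 256, 256, 4096
h = rng.standard_normal((batch, n_in))            # zero-mean input, Var(h) close to 1

results = []
for var_w in (1.0, 1.0 / n_in):                   # naive N(0, 1) vs variance-preserving
    W = rng.standard_normal((n_in, n_out)) * np.sqrt(var_w)
    z = h @ W                                     # pre-activations, no bias
    predicted = n_in * var_w * h.var()            # right-hand side of EQ N1.3
    results.append((z.var(), predicted))
    print(f"Var(W)={var_w:.5f}  measured Var(z)={z.var():9.3f}  predicted={predicted:9.3f}")
```

With $\mathrm{Var}(W) = 1$ the measured variance lands near 256, the per-layer blow-up described above; with $\mathrm{Var}(W) = 1/n_{\text{in}}$ it stays near 1.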

Setting the forward factor to 1 gives $\mathrm{Var}(W) = 1/n_{\text{in}}$. The backward pass wants $\mathrm{Var}(W) = 1/n_{\text{out}}$ for the same reason (gradients propagate through $W^\top$). You cannot satisfy both unless the layer is square, so Glorot (Xavier) initialization compromises by putting the average of the two fan counts in the denominator, which makes $\mathrm{Var}(W)$ the harmonic mean of the two targets:

EQ N1.4 — XAVIER / GLOROT INITIALIZATION $$ \mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \qquad\Longrightarrow\qquad W \sim \mathcal{U}\!\left[-\sqrt{\tfrac{6}{n_{\text{in}}+n_{\text{out}}}},\; \sqrt{\tfrac{6}{n_{\text{in}}+n_{\text{out}}}}\right] $$

The uniform bound comes from the fact that a uniform distribution on $[-a, a]$ has variance $a^2/3$; setting $a^2/3 = 2/(n_{\text{in}}+n_{\text{out}})$ gives $a = \sqrt{6/(n_{\text{in}}+n_{\text{out}})}$. Glorot & Bengio derived this assuming a roughly linear activation around zero — true for $\tanh$, whose slope at the origin is 1. It is the right default for symmetric, zero-centered nonlinearities.
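A short sampling check of EQ N1.4, assuming NumPy's `Generator.uniform`; the helper name `xavier_uniform` and the fan sizes are illustrative choices, not an API from the text.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_out, n_in) matrix from U[-a, a] with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 300, 100
W = xavier_uniform(n_in, n_out, rng)

bound = np.sqrt(6.0 / (n_in + n_out))
target = 2.0 / (n_in + n_out)                     # Var(W) from EQ N1.4
print(f"bound a       = {bound:.4f}")
print(f"sample Var(W) = {W.var():.6f}")
print(f"target        = {target:.6f}  (= a^2 / 3)")
```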

ReLU breaks the linear assumption: it zeros out the negative half of its inputs, so for zero-mean, symmetric pre-activations it halves the second moment $\mathbb{E}[h^2]$ that drives the next layer's variance. He (Kaiming) initialization compensates by doubling the weight variance, keying off $n_{\text{in}}$ alone since the rectifier is the dominant correction:

EQ N1.5 — HE / KAIMING INITIALIZATION (FOR ReLU) $$ \mathrm{Var}(W) = \frac{2}{n_{\text{in}}} \qquad\Longrightarrow\qquad \mathrm{std}(W) = \sqrt{\frac{2}{n_{\text{in}}}} $$

The extra factor of 2 over the naive $1/n_{\text{in}}$ exactly cancels ReLU's halving. It is the standard choice for ReLU-family networks and ships with every major framework (kaiming_normal_ in PyTorch), although a framework's built-in layer defaults may differ. The lesson is general: the right init depends on the nonlinearity, because what you must preserve is the scale of the signal after the activation, not before it.
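The factor of 2 can be checked directly. This is a small sketch assuming Gaussian, zero-mean pre-activations; the scale 3.0 and the fan-in 512 are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 3.0 * rng.standard_normal(1_000_000)      # zero-mean pre-activations, Var(z) = 9
h = np.maximum(z, 0.0)                        # ReLU

# Fraction of the mean square that survives the rectifier.
ratio = np.mean(h**2) / np.mean(z**2)
print(f"E[relu(z)^2] / E[z^2] = {ratio:.3f}")

n_in = 512
he_std = np.sqrt(2.0 / n_in)                  # EQ N1.5
print(f"He std for n_in = {n_in}: {he_std:.4f}")
```

The surviving fraction comes out close to one half, which is the loss that the 2 in $2/n_{\text{in}}$ restores.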

A ReLU layer has $n_{\text{in}} = 128$ inputs. Using He initialization (EQ N1.5), what standard deviation should its weights be drawn with? (Hint: $\mathrm{std}(W) = \sqrt{2/n_{\text{in}}}$.)

$\sqrt{2/n_{\text{in}}} = \sqrt{2/128} = \sqrt{0.015625} = $ 0.125. A width-128 ReLU layer should start with weights of standard deviation 0.125 — far below the naive $1.0$ that would explode the forward pass.

PYTHON · RUNNABLE IN-BROWSER

# Activation variance across depth: naive vs Xavier vs He init (ReLU net)
import numpy as np
rng = np.random.default_rng(0)

n, depth, batch = 256, 25, 1024
h0 = rng.standard_normal((batch, n))          # unit-variance input

def run(std_fn, relu=True):
    h, var = h0.copy(), [h0.var()]
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std_fn(n)
        h = h @ W                              # EQ N1.1 (no bias)
        if relu:
            h = np.maximum(h, 0.0)             # ReLU zeros negatives, halving E[h^2]
        var.append(h.var())
    return var

naive  = run(lambda n: 1.0)                    # std = 1: explodes
xavier = run(lambda n: np.sqrt(1.0/n))         # tuned for linear/tanh
he     = run(lambda n: np.sqrt(2.0/n))         # tuned for ReLU

print(" layer    naive        xavier        he")
for L in (0, 5, 12, 25):
    print(f" {L:5d} {naive[L]:11.2e} {xavier[L]:12.4f} {he[L]:9.4f}")
print("\nnaive blows up; xavier (1/n) decays under ReLU; he (2/n) stays flat, of order 1.")
# plot_xy is assumed to be the plotting helper supplied by the in-browser runtime.
plot_xy(list(range(depth + 1)), [min(v, 1e3) for v in he])  # He stays flat


INSTRUMENT N1.1 — INIT EXPLORER · ACTIVATION VARIANCE ACROSS DEPTH · EQ N1.3–N1.5

[Interactive figure. Controls: width n (256 shown), depth L (30 shown), init scheme. Readouts: variance at layer 1, variance at final layer, verdict.]

A ReLU network of the chosen width and depth is run forward on unit-variance input; the curve is $\log_{10}$ of the activation variance at each layer (the dashed line is variance = 1, the target). NAIVE shoots off the top of the chart within a few layers — the $256\times$ blow-up of EQ N1.3. XAVIER decays toward zero because under ReLU its $1/n$ is a factor of 2 too small. HE tracks the dashed line: variance preserved through arbitrary depth. Drop to NAIVE and watch the verdict flip to EXPLODES.

1.3

Batch normalization

A good initialization keeps variance under control at step zero — but weights move during training, and the distribution of each layer's inputs drifts as the layers below it update. Ioffe & Szegedy named this drift internal covariate shift and proposed fixing it directly: standardize each layer's pre-activations to zero mean and unit variance, using statistics computed over the current mini-batch:

EQ N1.6 — BATCH NORMALIZATION $$ \hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{z}_i + \beta, \qquad \mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} z_i,\;\; \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (z_i - \mu_{\mathcal{B}})^2 $$

For each feature channel, subtract the batch mean and divide by the batch standard deviation (with $\epsilon \approx 10^{-5}$ for numerical safety), then re-scale and re-shift with learned parameters $\gamma, \beta$. Those two learnable parameters are crucial: normalization alone would force every layer into the same fixed distribution, but $\gamma, \beta$ let the network recover any mean and variance it actually needs — including, if $\gamma=\sigma_{\mathcal B}$ and $\beta=\mu_{\mathcal B}$, the identity. Normalization is a default the network can override, not a straitjacket.
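A minimal NumPy sketch of EQ N1.6 in training mode, including the observation that a suitable $\gamma, \beta$ undoes the normalization. The function name `batchnorm_train` and the data scale are illustrative assumptions; with $\epsilon > 0$ the exact identity needs $\gamma = \sqrt{\sigma_{\mathcal B}^2 + \epsilon}$.

```python
import numpy as np

def batchnorm_train(z, gamma, beta, eps=1e-5):
    """EQ N1.6 on a (batch, features) array; statistics are taken per feature."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)                       # biased (1/m) variance, as in EQ N1.6
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, mu, var

rng = np.random.default_rng(0)
z = 3.0 * rng.standard_normal((256, 4)) + 7.0   # features with mean near 7, std near 3

# gamma = 1, beta = 0: every feature is standardized.
y, mu, var = batchnorm_train(z, gamma=np.ones(4), beta=np.zeros(4))
print("mean:", y.mean(axis=0).round(3), " std:", y.std(axis=0).round(3))

# gamma = sqrt(var + eps), beta = mu: the layer reproduces its input.
y_id, _, _ = batchnorm_train(z, gamma=np.sqrt(var + 1e-5), beta=mu)
print("identity recovered:", np.allclose(y_id, z))
```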

The payoff is large and somewhat over-determined. BatchNorm lets you use higher learning rates without divergence, makes training far less sensitive to the choice of initialization, and acts as a mild regularizer because each example's normalization depends on the random composition of its mini-batch. The original paper credited the reduction of internal covariate shift; later work (Santurkar et al., 2018) argued the real mechanism is a smoother loss landscape — BatchNorm bounds how fast the loss and its gradients can change, so optimization steps behave more predictably. The mechanism is still debated; the empirical win is not.

Train vs. inference — the classic footgun. At training time BatchNorm uses the live mini-batch statistics. At inference you have no batch (or want determinism), so it switches to a running average of mean and variance accumulated during training. Forgetting to put the model in eval mode — so it normalizes a single test example by its own degenerate statistics — produces the most common BatchNorm bug. BatchNorm also couples examples within a batch and degrades at very small batch sizes; that weakness is exactly why LayerNorm (normalize across features of one example, batch-independent) won in Transformers, where it sits inside every block (Vol II · Ch 02).
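The train/eval switch can be sketched with a toy layer. This is an illustration under stated assumptions, not any framework's implementation: `BatchNorm1D` is a hypothetical class, $\gamma = 1$ and $\beta = 0$ are fixed, and the moving-average momentum of 0.1 is an arbitrary choice.

```python
import numpy as np

class BatchNorm1D:
    """Toy BatchNorm with running statistics (gamma = 1, beta = 0 for brevity)."""

    def __init__(self, n_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, z):
        if self.training:
            mu, var = z.mean(axis=0), z.var(axis=0)
            # Exponential moving average of the batch statistics.
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var
        return (z - mu) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
for _ in range(200):                            # "training": accumulate statistics
    bn(5.0 + 2.0 * rng.standard_normal((64, 3)))

x = np.array([[9.0, 5.0, 1.0]])                 # a single test example

bn.training = False
eval_out = bn(x)                                # uses running mean near 5, variance near 4
bn.training = True
train_out = bn(x)                               # batch of one: z equals its own mean

print("eval mode :", eval_out.round(2))
print("train mode:", train_out.round(2))
```

In train mode the single example is normalized by its own statistics and collapses to zero, which is the failure described above; the stray call also contaminates the running averages.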

A BatchNorm layer sees the mini-batch of pre-activations $\{2, 2, 6, 6\}$ for one channel. Take $\epsilon = 0$. Using these batch statistics, what normalized value $\hat{z}$ (EQ N1.6, before the $\gamma,\beta$ re-scale) would a pre-activation $z = 5$ receive?

Mean $\mu_{\mathcal{B}} = (2+2+6+6)/4 = 4$. Variance $\sigma_{\mathcal{B}}^2 = \tfrac14[(2{-}4)^2+(2{-}4)^2+(6{-}4)^2+(6{-}4)^2] = \tfrac14(4+4+4+4) = 4$, so $\sigma_{\mathcal{B}} = 2$. Then $\hat{z} = (5-4)/2 = $ 0.5: a value of 5 sits half a standard deviation above the batch mean.

PYTHON · RUNNABLE IN-BROWSER

# Forward pass with & without BatchNorm; print per-layer activation stats
import numpy as np
rng = np.random.default_rng(0)

n, depth, batch = 128, 12, 512
x = rng.standard_normal((batch, n))

def bn(z, eps=1e-5):                            # EQ N1.6, gamma=1, beta=0
    mu = z.mean(0); var = z.var(0)
    return (z - mu) / np.sqrt(var + eps)

def forward(use_bn):
    h = x.copy()
    stats = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # He init
        z = h @ W
        if use_bn:
            z = bn(z)                            # re-center & re-scale each layer
        h = np.maximum(z, 0.0)                   # ReLU
        stats.append((h.mean(), h.std()))
    return stats

print(" layer   no-BN mean / std        with-BN mean / std")
plain, normed = forward(False), forward(True)
for L in (0, 5, 11):
    p, q = plain[L], normed[L]
    print(f" {L:5d}   {p[0]:+.3f} / {p[1]:6.3f}        {q[0]:+.3f} / {q[1]:6.3f}")
print("\nBatchNorm pins each layer's distribution; without it the std drifts.")


INSTRUMENT N1.2 — BATCHNORM & TRAINING STABILITY · LOSS CURVES · ON vs OFF · EQ N1.6

[Interactive figure. Controls: learning rate η (0.30 shown), depth L (12 shown). Readouts: final loss without BN, final loss with BN, stable η ceiling without BN.]

A toy deep net is trained for 60 steps at the chosen learning rate; the mint curve normalizes activations each layer, the muted grey curve does not. Push η up: the no-BN curve diverges (spikes off the top, loss explodes), while the BatchNorm curve keeps descending — the higher-learning-rate tolerance that made BN famous. Increase depth and the gap widens, since the un-normalized net compounds instability over more layers.

1.4

Residual connections

Init and normalization keep variance in line, but they do not remove the fundamental fragility of EQ N1.2: a gradient still has to survive a product of $L$ Jacobians. By 2015, even well-initialized, batch-normalized networks showed a degradation problem — adding more layers made training accuracy worse, not just test accuracy. The deeper net could in principle copy the shallower one by setting extra layers to identity, yet optimization could not find that solution. He et al.'s answer was to make the identity the default, by adding a skip connection around each block:

EQ N1.7 — THE RESIDUAL BLOCK $$ h_{\ell+1} = h_\ell + F\big(h_\ell; \theta_\ell\big) $$

The block learns a residual $F$ — the correction to add to its input — rather than a fresh representation. If the optimal map is close to identity, the network just drives $F \to 0$, which is far easier than learning identity from scratch. $F$ is typically two or three weight layers with normalization and a nonlinearity. The skip is the load-bearing idea: it is what lets networks go from tens of layers to hundreds (and ResNets to over a thousand) and is structurally identical to the residual stream that runs through every Transformer block (Vol II · Ch 02).
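A sketch of EQ N1.7 with a deliberately simple $F$ (two weight layers and a ReLU, no normalization). The function name `residual_block` and the sizes are illustrative assumptions; setting the last layer of $F$ to zero is used here only to show the $F = 0$ case.

```python
import numpy as np

def residual_block(h, W1, W2):
    """h + F(h) with F(h) = W2 relu(W1 h); rows of h are examples."""
    return h + np.maximum(h @ W1.T, 0.0) @ W2.T

rng = np.random.default_rng(0)
n = 16
h = rng.standard_normal((4, n))
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)

# F = 0: the block is exactly the identity map.
W2_zero = np.zeros((n, n))
is_identity = np.array_equal(residual_block(h, W1, W2_zero), h)
print("identity when F = 0:", is_identity)

# Small F: a small correction on top of the carried-through input.
W2_small = 0.01 * rng.standard_normal((n, n))
delta = residual_block(h, W1, W2_small) - h
rel = np.linalg.norm(delta) / np.linalg.norm(h)
print(f"relative size of the correction: {rel:.3f}")
```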

Why does the skip rescue the gradient? Differentiate EQ N1.7. The Jacobian of a residual block is the identity plus the block's own Jacobian, so the backward product gains an additive shortcut at every layer:

EQ N1.8 — GRADIENT FLOW THROUGH A SKIP $$ \frac{\partial h_{\ell+1}}{\partial h_\ell} = I + \frac{\partial F}{\partial h_\ell} \qquad\Longrightarrow\qquad \frac{\partial \mathcal{L}}{\partial h_\ell} = \frac{\partial \mathcal{L}}{\partial h_L}\prod_{k=\ell}^{L-1}\!\Big(I + \tfrac{\partial F}{\partial h_k}\Big) $$

Expand the product and one term is the bare identity $I$: the gradient at the output reaches layer $\ell$ undiminished, no matter how many layers lie between, plus higher-order corrections through the $F$ paths. Where a plain net multiplies the gradient by something $<1$ at every layer (vanishing), the residual net always keeps a clean copy of the downstream gradient. Depth stops being a multiplicative tax on the gradient and becomes additive. The standard practice puts BatchNorm (or LayerNorm) inside $F$, so the two fixes compose.
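The first half of EQ N1.8 can be verified by finite differences. The sketch assumes an illustrative branch $F(h) = \tanh(Wh)$, whose Jacobian is $\mathrm{diag}(1 - \tanh^2(Wh))\,W$; the names and sizes are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = 0.5 * rng.standard_normal((n, n))

def F(h):
    return np.tanh(W @ h)                      # illustrative residual branch

def block(h):
    return h + F(h)                            # EQ N1.7

h = rng.standard_normal(n)

# Analytic Jacobian of F at h: diag(1 - tanh^2(W h)) W.
J_F = (1.0 - np.tanh(W @ h) ** 2)[:, None] * W

# Central finite differences for the Jacobian of the whole block.
eps = 1e-6
J_block = np.empty((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J_block[:, j] = (block(h + e) - block(h - e)) / (2 * eps)

matches = np.allclose(J_block, np.eye(n) + J_F, atol=1e-6)
print("block Jacobian equals I + dF/dh:", matches)
```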

A residual block computes its output as the block input $h$ plus the transformation $F$ applied to that input — that is, $h + F(h)$, the skip connection of EQ N1.7. True or false? (Answer true or false.)

The defining equation of a residual block is exactly $h_{\ell+1} = h_\ell + F(h_\ell)$: the input is carried forward unchanged and the block only learns the correction $F$ to add. The statement is true.

INSTRUMENT N1.3 — RESIDUAL vs PLAIN: GRADIENT FLOW · ‖∂L/∂h‖ BY LAYER · EQ N1.8

[Interactive figure. Controls: depth L (40 shown), block gain ‖∂F/∂h‖ (0.80 shown). Readouts: gradient at layer 1 for the plain net, gradient at layer 1 for the residual net, ratio residual / plain.]

The gradient norm starts at 1 at the output and is propagated back to layer 1. The grey plain net multiplies by the block gain at every layer (EQ N1.2), so for a gain below 1 it collapses geometrically: by layer 1 the early weights see almost no signal. The mint residual net follows EQ N1.8: the $+I$ shortcut keeps a path of magnitude 1 alive all the way down, so the gradient does not collapse. Set the gain above 1 and the plain net explodes instead. Note that the skip only guards against vanishing; it does not by itself cap growth, which is one reason normalization sits inside $F$.

PYTHON · RUNNABLE IN-BROWSER

# Gradient flow: plain stack vanishes, residual stack survives (EQ N1.8)
import numpy as np
rng = np.random.default_rng(0)

n, depth = 64, 50
g = rng.standard_normal((depth, n, n)) * np.sqrt(0.7 / n)   # block Jacobians, gain<1

grad = rng.standard_normal((1, n))            # incoming gradient at the output
plain, res = grad.copy(), grad.copy()
plain_norm, res_norm = [np.linalg.norm(plain)], [np.linalg.norm(res)]

for k in reversed(range(depth)):
    plain = plain @ g[k].T                     # plain: multiply by J each layer
    res   = res @ (np.eye(n) + g[k]).T         # residual: I + J  (the skip)
    plain_norm.append(np.linalg.norm(plain))
    res_norm.append(np.linalg.norm(res))

print(f"plain    gradient norm at input layer: {plain_norm[-1]:.3e}")
print(f"residual gradient norm at input layer: {res_norm[-1]:.3e}")
print(f"residual / plain ratio                : {res_norm[-1]/plain_norm[-1]:.1f}x")
print("\nthe +I shortcut keeps the early-layer gradient alive across 50 layers.")
# plot_xy is assumed to be the plotting helper supplied by the in-browser runtime.
plot_xy(list(range(depth + 1)), [np.log10(v + 1e-30) for v in plain_norm])


1.5

Regularization — dropout & weight decay

The first three fixes make a deep net trainable; the last makes it generalize. A network with millions of parameters can memorize its training set outright, so we add pressure toward simpler solutions. Two techniques dominate, and they attack overfitting from opposite directions.

Dropout randomly zeros each activation with probability $p$ on every training step, then rescales the survivors so the expected activation is unchanged:

EQ N1.9 — DROPOUT (INVERTED, TRAINING TIME) $$ \tilde{h}_i = \frac{m_i}{1-p}\,h_i, \qquad m_i \sim \mathrm{Bernoulli}(1-p) $$

Each forward pass trains a different random sub-network; at test time dropout is off and the full network acts as an implicit ensemble of all those sub-networks. The $1/(1-p)$ factor (inverted dropout) keeps the expected activation constant, so no scaling is needed at inference. By preventing units from co-adapting — relying on a specific partner always being present — dropout forces redundant, robust features. Typical $p$: 0.1–0.5 for dense layers. It is largely absent from large Transformers, where data scale and other regularizers do the work.
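A minimal sketch of inverted dropout (EQ N1.9), assuming NumPy; the function name `dropout` and the all-ones input are illustrative choices that make the expected-value claim easy to read off.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each entry with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h                               # inference: identity, no rescaling needed
    mask = rng.random(h.shape) >= p            # keep each entry with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 1000))
p = 0.3

h_train = dropout(h, p, rng)
frac_zeroed = (h_train == 0).mean()
print(f"fraction zeroed : {frac_zeroed:.3f}")
print(f"mean activation : {h_train.mean():.3f}")

h_eval = dropout(h, p, rng, training=False)
print("unchanged at inference:", np.array_equal(h_eval, h))
```

The zeroed fraction comes out close to $p$ while the mean activation stays close to its original value of 1, which is why inference needs no compensating scale.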

Weight decay instead penalizes large weights, adding an $L_2$ term to the loss that pulls every weight toward zero:

EQ N1.10 — L2 / WEIGHT DECAY $$ \mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2}\sum_j w_j^2 \qquad\Longrightarrow\qquad w_j \leftarrow w_j - \eta\Big(\frac{\partial \mathcal{L}}{\partial w_j} + \lambda\,w_j\Big) $$

The penalty's gradient is just $\lambda w_j$, so each step multiplies every weight by the constant factor $(1 - \eta\lambda)$ before the data-driven update, hence "decay". Smaller weights mean a smoother, lower-variance function that is harder to overfit. A subtlety that matters in practice: with adaptive optimizers like Adam, classical $L_2$ and true weight decay are not the same, because Adam rescales the gradient, penalty term included; AdamW (Loshchilov & Hutter, 2019) decouples the decay from the gradient step and is the modern default. $\lambda$ typically sits in $10^{-4}$ to $10^{-1}$.
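The difference between $L_2$ and decoupled decay shows up in a single update. The sketch below is a hand-written first Adam step (zero-initialized moments, $t = 1$) rather than any library's optimizer; `adam_step`, the weights, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def adam_step(w, g, lr=0.1, lam=0.1, decoupled=False,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step (t = 1, zero-initialized moments) with either form of decay."""
    if not decoupled:
        g = g + lam * w                        # classical L2: penalty gradient enters Adam
    m = (1 - beta1) * g                        # first-moment estimate after one step
    v = (1 - beta2) * g**2                     # second-moment estimate after one step
    m_hat, v_hat = m / (1 - beta1), v / (1 - beta2)   # bias correction at t = 1
    step = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        step = step + lam * w                  # AdamW: decay applied outside the rescaling
    return w - lr * step

w = np.array([10.0, 0.1])                      # one large weight, one small weight
g = np.array([1.0, 1.0])                       # identical data gradients

# Plain SGD: the L2 gradient and multiplicative decay are the same update (EQ N1.10).
lr, lam = 0.1, 0.1
sgd_l2 = w - lr * (g + lam * w)
sgd_decay = (1 - lr * lam) * w - lr * g
print("SGD forms agree:", np.allclose(sgd_l2, sgd_decay))

adam_l2 = adam_step(w, g, decoupled=False)
adamw = adam_step(w, g, decoupled=True)
print("Adam + L2:", adam_l2)
print("AdamW    :", adamw)
```

In this toy step Adam with $L_2$ moves both weights by the same amount, because the penalty gradient is normalized along with the data gradient; the decoupled variant shrinks the large weight further, in proportion to its size.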

The honest picture. The four fixes overlap and partly substitute for one another. BatchNorm already regularizes, which is why dropout and BN are often redundant together. Good initialization reduces — but does not eliminate — the need for normalization. Residual connections plus normalization are now so reliable that very deep training is routine, and the field's frontier has moved from can we train it to can we afford it. None of these is a law of nature; each is an engineering fix to the same underlying disease — signal that compounds geometrically through depth — and each will be revisited, sharpened, or replaced as architectures evolve.

Now that signal can flow through depth, the question is what structure to give the layers. Chapter 02 specializes the dense layer for images: convolutions share weights across space, pooling builds translation tolerance, and the same init/norm/residual toolkit you just met powers the ResNets that dominated computer vision.

1.R

References

Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS — the variance-preserving (Xavier/Glorot) initialization of §1.2.
He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV — He/Kaiming initialization for ReLU networks (EQ N1.5).
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML — batch normalization (§1.3, EQ N1.6).
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR — residual connections / ResNet (§1.4, EQ N1.7–N1.8).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 — dropout regularization (EQ N1.9).
Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR — AdamW, decoupling weight decay from the adaptive step (EQ N1.10).
Santurkar, S., Tsipras, D., Ilyas, A. & Mądry, A. (2018). How Does Batch Normalization Help Optimization? NeurIPS — the loss-smoothing reinterpretation of BatchNorm cited in §1.3.