Convolutional Neural Networks

2.1

The convolution operation

A fully-connected layer treats an image as a flat vector: a $224\times 224$ RGB picture becomes $150{,}528$ inputs, and a single hidden unit reading all of them owns that many weights. That is wasteful on two counts. It ignores locality — the pixels that matter for detecting an edge are right next to each other, not scattered across the frame — and it ignores repetition — an edge in the top-left corner is the same visual pattern as an edge in the bottom-right, yet a dense layer must relearn it for every position. A convolutional layer fixes both by sliding one small set of weights, a kernel (or filter), across every location and reusing it everywhere.

The arithmetic is a sum of element-wise products between the kernel and the patch of input it currently overlaps. (Deep-learning libraries implement cross-correlation — no kernel flip — and call it convolution; since the kernel is learned, the flip is irrelevant.) For a 2D input $I$ and a $k\times k$ kernel $W$, the output at position $(i,j)$ is:

EQ N2.1 — DISCRETE 2D CONVOLUTION (CROSS-CORRELATION) $$ S(i,j) \;=\; (I * W)(i,j) \;=\; \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I(i+m,\, j+n)\, W(m,n) \;+\; b $$

The kernel $W$ is a tiny learnable stencil — $3\times 3$ is the modern default — and $b$ a scalar bias. Sliding it across the whole image produces a feature map: a 2D record of where the kernel's pattern occurs. The two structural commitments are local connectivity (each output sees only a $k\times k$ window, not the whole image) and weight sharing (the same $k^2{+}1$ parameters are reused at every position). Weight sharing is what gives convolution its defining property — translation equivariance: shift the input and the feature map shifts identically.
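
The equivariance claim is easy to check numerically. The sketch below is a minimal illustration, assuming a hypothetical helper `conv2d_valid` (written here with plain loops, not taken from any library) that implements EQ N2.1 without padding. It shifts a random input one pixel to the right and compares the two feature maps away from the left border, where the zero fill would otherwise interfere.

```python
import numpy as np

def conv2d_valid(img, kernel, bias=0.0):
    """Cross-correlation of EQ N2.1 with no padding (hypothetical helper)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel) + bias
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

# shift the input one pixel to the right, filling the vacated column with zeros
shifted = np.zeros_like(img)
shifted[:, 1:] = img[:, :-1]

out = conv2d_valid(img, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# away from the left border, the shifted map equals the original map moved right by one
equivariant = np.allclose(out_shifted[:, 1:], out[:, :-1])
print("feature map shifts with the input:", equivariant)
```

The comparison only holds because the same kernel is applied at every position; a layer with position-specific weights would have no reason to pass it.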

The parameter savings are dramatic. A dense layer mapping a $32\times 32$ single-channel image to a same-sized output needs $1024 \times 1024 \approx 10^6$ weights; a $3\times 3$ convolution producing the same map needs nine (plus a bias), regardless of image size. That economy is not just cheaper — it is a prior. By forcing every location to share weights, convolution declares in advance that visual patterns are position-independent, and that prior is close enough to true that CNNs generalize from far less data than an unconstrained network would need.
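
The arithmetic behind that comparison can be spelled out in a few lines. This is only a counting sketch; the function names are illustrative, and biases are left out of both counts to keep the comparison like-for-like.

```python
def dense_weights(h, w):
    """Weights in a dense layer mapping an h*w image to an h*w output (biases excluded)."""
    n = h * w
    return n * n

def conv_weights(k):
    """Weights in one k x k single-channel kernel (bias excluded)."""
    return k * k

for side in (32, 224):
    print(f"{side}x{side}: dense = {dense_weights(side, side):,}   3x3 conv = {conv_weights(3)}")
```

The dense count grows with the fourth power of the image side, while the convolutional count does not depend on it at all.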

The boundary deserves a word. With no padding, a $k\times k$ kernel cannot center on the outermost pixels, so the feature map shrinks: an $N\times N$ input yields an $(N-k+1)\times(N-k+1)$ output. Section 2.3 makes this precise and shows how padding and stride control it.

A $32\times 32$ image is convolved with a $5\times 5$ kernel using stride $1$ and no padding. What is the side length of the output feature map?

The output side is $\left\lfloor \dfrac{N - k + 2P}{S}\right\rfloor + 1 = \dfrac{32 - 5 + 0}{1} + 1 = 27 + 1 = 28$, where $N$ is the input side, $k$ the kernel side, $P$ the padding and $S$ the stride. Each $5\times 5$ window must fit entirely inside the image, so the kernel's top-left corner can sit in only $28$ positions per axis, giving a $28\times 28$ map.
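
The formula in the answer above fits in a one-line helper. The name `conv_output_side` and its keyword arguments are illustrative choices, not part of any library.

```python
def conv_output_side(n, k, padding=0, stride=1):
    """Side length of the feature map: floor((n - k + 2*padding) / stride) + 1."""
    return (n - k + 2 * padding) // stride + 1

print(conv_output_side(32, 5))                        # the exercise above: 28
print(conv_output_side(32, 3, padding=1))             # 3x3 with a 1-pixel border: 32
print(conv_output_side(224, 7, padding=3, stride=2))  # 112
```

Integer floor division does the work of the floor brackets, which only matters when the stride does not divide the numerator evenly.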

PYTHON

# EQ N2.1: 2D convolution (cross-correlation) from scratch in numpy
import numpy as np

# a tiny 6x6 "image": a bright vertical bar down the middle
img = np.zeros((6, 6))
img[:, 2:4] = 1.0

# a vertical-edge detector (Sobel-x): fires where left/right brightness differ
K = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]], dtype=float)

kh, kw = K.shape
H, W = img.shape
out = np.zeros((H - kh + 1, W - kw + 1))          # valid padding -> shrinks
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = img[i:i+kh, j:j+kw]               # the window under the kernel
        out[i, j] = np.sum(patch * K)             # element-wise product, summed

np.set_printoptions(precision=1, suppress=True)
print("input image (the bar):\n", img)
print("\nfeature map (vertical edges):\n", out)
print("\nleft edge of bar -> +4, right edge -> -4. every 3-wide window here straddles an edge, so no 0s appear.")


INSTRUMENT N2.1 — CONVOLUTION KERNEL EXPLORER · 9×9 INPUT · 3×3 KERNEL · VALID PADDING · EQ N2.1

Readouts: kernel (3×3), output size 7 × 7, params reused 49×.

Left grid is the input (a hand-drawn "7"); right grid is the feature map after applying the chosen kernel everywhere. EDGE highlights boundaries, BLUR averages neighbors, SHARPEN exaggerates contrast, IDENTITY copies the center pixel. The same nine weights produce every output cell — that single fact is weight sharing, and the source of the 49× reuse count.

2.2

Pooling & translation invariance

Convolution is equivariant: move the cat one pixel right and its feature map moves one pixel right too. For classification we usually want something stronger — invariance: the answer "cat" should not change at all when the cat shifts. Pooling is the classic mechanism that converts a little equivariance into a little invariance. It slides a window over the feature map and summarizes each window with a single number — usually the maximum (max-pooling) or the average:

EQ N2.2 — MAX-POOLING $$ P(i,j) \;=\; \max_{\substack{0\le m<p \\ 0\le n<p}} \; S\big(i\cdot s + m,\; j\cdot s + n\big) $$

A $p\times p$ window with stride $s$ (almost always $p=s=2$) collapses each region to its strongest activation. The effect is twofold. Downsampling: a stride-2 pool halves each spatial dimension, cutting the feature map to a quarter of its area and so cheapening every layer above it. Local invariance: because the max ignores where in the window the peak occurred, a small shift of the input that keeps the peak inside the same window leaves the pooled output unchanged. Stack a few pool layers and small translations become invisible to the classifier.

Pooling buys robustness at a price, and it is worth being honest about it: by discarding precise position it can hurt tasks that need position — segmentation, keypoint localization, anything that must answer "where," not just "what." It is also not the only way to downsample. Since the mid-2010s many architectures drop pooling entirely and use strided convolutions (a conv that itself moves in steps of 2) to shrink the map while keeping the operation learnable. Pooling survives mostly as a cheap, parameter-free downsampler and, in the form of global average pooling, as the standard bridge from the final feature map to a classifier — averaging each channel to one number and feeding that vector to a linear head, which removes the giant dense layers that dominated early CNNs.
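
Global average pooling is simple enough to write out directly. The sketch below assumes a hypothetical final feature map of 512 channels on a $7\times 7$ grid in (channels, height, width) layout and a hypothetical 10-class head with random weights; it compares the head size against what flattening the same map would require.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical final feature map: 512 channels on a 7x7 grid, (C, H, W) layout
fmap = rng.normal(size=(512, 7, 7))

# global average pooling: one number per channel
pooled = fmap.mean(axis=(1, 2))                 # shape (512,)

# linear head for a hypothetical 10-class problem
num_classes = 10
W_head = rng.normal(scale=0.01, size=(num_classes, 512))
b_head = np.zeros(num_classes)
logits = W_head @ pooled + b_head               # shape (10,)

gap_head_params = W_head.size + b_head.size                   # 512*10 + 10
flatten_head_params = fmap.size * num_classes + num_classes   # 512*7*7*10 + 10

print("pooled vector:", pooled.shape, " logits:", logits.shape)
print("head params with global average pooling:", gap_head_params)
print("head params if the map were flattened  :", flatten_head_params)
```

The pooled head is smaller by a factor of roughly the spatial area, and it accepts any input resolution because the average does not care how many positions it covers.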

True or false: max-pooling adds a measure of translation invariance, because a small shift of the input that keeps the peak activation inside the same pooling window leaves the pooled output unchanged. (Answer true or false.)

Max-pooling reports only the largest activation in each window and discards where in the window it occurred. Shift the input by a pixel or two and, as long as the peak stays inside the same window, the maximum — and therefore the pooled value — does not move. That insensitivity to small positional change is exactly local translation invariance. The statement is true.

PYTHON

# EQ N2.2: 2x2 max-pooling, and how it absorbs a 1-pixel shift
import numpy as np

def maxpool2x2(x):                                 # stride 2, no overlap
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()
    return out

fmap = np.array([[0, 1, 0, 0],                     # the peak (9) sits in the LEFT column of its window
                 [9, 2, 4, 0],
                 [0, 0, 3, 0],
                 [0, 0, 1, 0]], dtype=float)
shifted = np.roll(fmap, 1, axis=1)                 # shift right by 1px (last column is all 0, so the wrap-around is harmless)

print("original feature map:\n", fmap)
print("2x2 max-pooled       :\n", maxpool2x2(fmap))
print("\nshifted-by-1 pooled  :\n", maxpool2x2(shifted))
print("\nthe 9 stays the top-left pooled value in BOTH -> shift absorbed.")
print("a peak in the RIGHT column of its window would cross into the next window instead.")
print("output went from 4x4 to 2x2: a quarter of the area, for free.")


2.3

Channels, stride & padding

Real images are not flat grids — a color photo is $H\times W\times 3$, three channels (red, green, blue) stacked in depth. Convolution generalizes by giving each kernel the same depth as its input: a kernel applied to a 3-channel image is itself $k\times k\times 3$, it dots over all input channels at once, and it produces a single output map. To get a richer representation you simply run $C_{\text{out}}$ such kernels in parallel — one per output channel — so a conv layer is parameterized by a 4D weight tensor:

EQ N2.3 — A CONV LAYER'S PARAMETERS & OUTPUT GEOMETRY $$ \#\text{params} = (k \cdot k \cdot C_{\text{in}} + 1)\, C_{\text{out}}, \qquad H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} - k + 2P}{S} \right\rfloor + 1 $$

The weight tensor has shape $C_{\text{out}}\times C_{\text{in}}\times k\times k$, plus one bias per output channel. Padding $P$ adds a border of zeros so the kernel can center on edge pixels; "same" padding ($P=\lfloor k/2\rfloor$ for odd $k$, stride 1) keeps the spatial size unchanged. Stride $S$ is how far the kernel hops between applications; $S=2$ downsamples like a pool but with learned weights. The height formula applies to width in exactly the same way. The depth dimension is where representational capacity lives: early layers carry tens of channels of low-level features (edges, blobs), deep layers carry hundreds or thousands of channels of abstract parts (eyes, wheels, text).
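
To make EQ N2.3 concrete, here is a loop-based sketch of a multi-channel layer with zero padding and stride. The function `conv_layer` and its argument names are illustrative, the layout is assumed to be (channels, height, width), and the loops are written for clarity rather than speed.

```python
import numpy as np

def conv_layer(x, weight, bias, padding=0, stride=1):
    """Multi-channel cross-correlation.
    x: (C_in, H, W), weight: (C_out, C_in, k, k), bias: (C_out,)."""
    c_out, c_in, k, _ = weight.shape
    x = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))  # zero border
    h_out = (x.shape[1] - k) // stride + 1     # padded height already includes 2P
    w_out = (x.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
                out[o, i, j] = np.sum(patch * weight[o]) + bias[o]  # dot over all input channels
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 32, 32))               # a 3-channel 32x32 input
weight = rng.normal(size=(8, 3, 3, 3))         # C_out=8, C_in=3, k=3
bias = np.zeros(8)

same = conv_layer(x, weight, bias, padding=1, stride=1)
strided = conv_layer(x, weight, bias, padding=1, stride=2)
n_params = weight.size + bias.size             # (k*k*C_in + 1) * C_out

print("same padding   :", same.shape)
print("stride 2       :", strided.shape)
print("parameter count:", n_params)
```

Each output channel consumes every input channel at once, which is why the kernel for one output map is a $k\times k\times C_{\text{in}}$ block rather than a flat $k\times k$ stencil.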

Notice what the layer trades away: a conv with $C_{\text{in}}=C_{\text{out}}=256$ and a $3\times 3$ kernel has $(9\cdot 256 + 1)\cdot 256 \approx 590{,}000$ parameters — independent of image size, because the same kernels run at every location. The familiar CNN rhythm follows directly: as pooling and strided convs shrink the spatial grid, channel counts rise to compensate, trading "where" for "what" as you go deeper. A typical backbone might run $224^2\times 3 \to 56^2\times 64 \to 28^2\times 128 \to 14^2\times 256 \to 7^2\times 512$: the map gets small and deep until a global pool and a linear layer read off the answer.

A practical refinement worth naming: the $1\times 1$ convolution. With $k=1$ it does no spatial mixing at all — it is a per-pixel linear layer across channels, used to cheaply change channel depth (a "bottleneck") before an expensive $3\times 3$. Depthwise-separable convolutions (each channel convolved on its own, then $1\times 1$ mixed) push the same idea to mobile-scale efficiency, and are the engine of the MobileNet/EfficientNet family.
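
Both claims in that paragraph can be checked with a few lines of NumPy. The sketch below first confirms that a $1\times 1$ convolution is the same computation as a linear layer applied to each pixel's channel vector, then counts weights for a standard versus a depthwise-separable $3\times 3$ layer. Shapes and function names are illustrative, and biases are omitted from the counts.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, h, w = 64, 16, 8, 8
x = rng.normal(size=(c_in, h, w))
W = rng.normal(size=(c_out, c_in))      # a 1x1 kernel is just a C_out x C_in matrix

# 1x1 convolution: mix channels at every pixel, no spatial mixing
y = np.einsum("oc,chw->ohw", W, x)

# the same thing written as a linear layer on each pixel's channel vector
y_linear = (W @ x.reshape(c_in, h * w)).reshape(c_out, h, w)
print("1x1 conv equals per-pixel linear layer:", np.allclose(y, y_linear))

def standard_weights(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    # one k x k kernel per input channel, then a 1x1 conv to mix channels
    return k * k * c_in + c_in * c_out

print("3x3 standard, 256 -> 256           :", standard_weights(3, 256, 256))
print("3x3 depthwise-separable, 256 -> 256:", depthwise_separable_weights(3, 256, 256))
```

The separable version splits spatial filtering from channel mixing, so the two costs add instead of multiply.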

INSTRUMENT N2.2 — FEATURE-MAP VISUALIZER · SPATIAL SIZE ↓ · CHANNELS ↑ · EQ N2.3

Settings: input size 224, stages (conv+pool) 4, base channels 64. Readouts: final map 14 × 14, final channels 512, activations per stage.

Each stage is one "same" conv (size-preserving) followed by a stride-2 pool (size-halving) that also doubles the channel count. Watch the volumes flip from wide-and-shallow to small-and-deep — the canonical CNN shape. The activation count per stage (height × width × channels) shows that early layers, despite few channels, hold the most numbers; this is why feature-map memory, not parameters, often dominates training.
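
The stage-by-stage bookkeeping can be reproduced in a short loop. This is a sketch under one assumption that the caption does not state outright: the first stage outputs the base channel count and each later stage doubles it, which is the reading consistent with a 14 × 14 × 512 final volume from a 224 input, four stages and 64 base channels.

```python
def stage_shapes(input_size=224, stages=4, base_channels=64):
    """(side, channels, activations) after each conv+pool stage.
    Assumes stage 1 outputs base_channels and every later stage doubles them."""
    shapes = []
    side, channels = input_size, base_channels
    for _ in range(stages):
        side //= 2                        # "same" conv keeps the size, stride-2 pool halves it
        shapes.append((side, channels, side * side * channels))
        channels *= 2
    return shapes

for n, (side, channels, acts) in enumerate(stage_shapes(), start=1):
    print(f"stage {n}: {side} x {side} x {channels}  ->  {acts:,} activations")
```

Halving both spatial sides divides the area by four while the channels only double, so the activation count halves at every stage and the earliest stage holds the most numbers.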

2.4

Classic architectures — LeNet to ResNet

The CNN's history is a tight sequence of ideas, each fixing the previous generation's ceiling. Reading it in order is the fastest way to understand what every modern backbone is made of.

Architecture	Year	Depth	The idea it introduced
LeNet-5	1998	7	The template itself: conv → pool → conv → pool → dense, trained by backprop to read handwritten digits (MNIST/checks).
AlexNet	2012	8	The same idea at GPU scale on ImageNet: ReLU, dropout, data augmentation. Cut ImageNet top-5 error from roughly 26% to about 15% and started the deep-learning era.
VGG	2014	16–19	Depth from uniformity: stacks of small $3\times 3$ convs only. Two $3\times 3$s see a $5\times 5$ region with fewer params and more nonlinearity.
GoogLeNet / Inception	2014	22	Multi-scale "Inception" blocks ($1\times 1$, $3\times 3$, $5\times 5$ in parallel) and $1\times 1$ bottlenecks to stay cheap.
ResNet	2015	50–152	The residual / skip connection — the breakthrough that let networks go past ~20 layers without degrading.

The decisive jump is ResNet. Through 2014 the field believed deeper was better, yet past about twenty layers accuracy got worse — not from overfitting (training error rose too) but from an optimization failure: gradients had to thread through too many transformations to reach early layers, and very deep plain stacks could not even learn the identity function reliably. He et al. solved it with a one-line change to the building block — add the input back to the output:

EQ N2.4 — THE RESIDUAL BLOCK $$ \mathbf{y} \;=\; \mathcal{F}(\mathbf{x}; \{W_i\}) \;+\; \mathbf{x} $$

Instead of asking a block to learn the desired mapping $H(\mathbf{x})$ outright, ask it to learn only the residual $\mathcal{F}(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$; the original input is carried forward by the skip connection and added at the end. Two consequences follow. If a layer is unneeded, driving $\mathcal{F}\to 0$ recovers the identity for free, so in principle a deeper network can always do at least as well as its shallower counterpart. And the additive shortcut gives the gradient a direct path back to early layers (the $+\mathbf{x}$ contributes an identity term to the Jacobian $\partial\mathbf{y}/\partial\mathbf{x}$), which eases the vanishing-gradient barrier. This single trick enabled 152-layer networks and won ImageNet 2015, and the residual stream it created is now the backbone of essentially every deep architecture, Transformers included (Vol II · EQ 2.x).
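
Both consequences can be seen in a toy version of EQ N2.4. The sketch below assumes a deliberately simplified residual function, $\mathcal{F}(\mathbf{x}) = W_2\,\mathrm{relu}(W_1\mathbf{x})$ on a plain vector, rather than the convolutional block of the original paper. It checks that zeroing the last layer gives the identity, and that the Jacobian of the block is the Jacobian of $\mathcal{F}$ plus the identity matrix.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x with F(x) = W2 @ relu(W1 @ x); a simplified, hypothetical F."""
    return W2 @ relu(W1 @ x) + x

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d))

# with the last layer of F at zero, the block is exactly the identity
y_identity = residual_block(x, W1, np.zeros((d, d)))
print("F = 0 gives identity:", np.allclose(y_identity, x))

# numerical Jacobian dy/dx by central differences
W2 = rng.normal(scale=0.01, size=(d, d))
eps = 1e-6
jac = np.zeros((d, d))
for i in range(d):
    step = np.zeros(d)
    step[i] = eps
    jac[:, i] = (residual_block(x + step, W1, W2) - residual_block(x - step, W1, W2)) / (2 * eps)

# analytic Jacobian: dF/dx + I, where the mask marks the active ReLU units
mask = (W1 @ x > 0).astype(float)
jac_expected = W2 @ (mask[:, None] * W1) + np.eye(d)
print("dy/dx = dF/dx + I:", np.allclose(jac, jac_expected, atol=1e-6))
```

However small the weights inside $\mathcal{F}$ are, the identity term keeps the block's Jacobian away from zero, which is the sense in which the shortcut gives gradients a direct route backward.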

Where does this story stand in 2026? CNNs no longer hold the absolute accuracy crown on large-scale benchmarks: Vision Transformers (ViT) match or exceed them given enough data, by replacing the convolutional prior with attention over image patches. But the contest is closer than headlines suggest. Convnets modernized with the same training recipes (the "ConvNeXt" line) remain competitive, and CNNs still dominate where data is limited or where latency and edge deployment matter, precisely because their built-in inductive bias substitutes for data in a way a ViT's weaker prior cannot. Convolution did not lose; it became one well-understood tool among several.

WHY 3×3

VGG's quiet lesson: two stacked $3\times 3$ convolutions have the same $5\times 5$ receptive field as one $5\times 5$ conv, but use $2\cdot(3^2)=18$ weights per channel instead of $25$, and insert an extra nonlinearity between them. Three $3\times 3$s match a $7\times 7$ (27 vs 49 weights). Deeper-but-thinner won, and $3\times 3$ became the field's default kernel — the receptive-field calculator below shows exactly how that field grows with depth.
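
The same comparison for any stack depth, as a small counting sketch. The function name is illustrative; the counts are per input-output channel pair, with stride 1 and no biases.

```python
def stacked_3x3(n):
    """Receptive field and weights per channel for n stacked 3x3 convs (stride 1)."""
    receptive_field = 2 * n + 1           # each extra 3x3 layer adds 2 pixels
    weights = 9 * n
    return receptive_field, weights

for n in (1, 2, 3):
    rf, w_stack = stacked_3x3(n)
    w_single = rf * rf                    # one rf x rf kernel covering the same window
    print(f"{n} x (3x3): field {rf}x{rf}, {w_stack} weights vs {w_single} for a single {rf}x{rf} kernel")
```

The stack's cost grows linearly with the field's side while a single large kernel's cost grows quadratically, so the gap widens with every added layer.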

2.5

Transfer learning

The single most practically important fact about CNNs is that the features they learn transfer. A network trained on ImageNet's 1.2 million labelled photos learns, in its early layers, a near-universal visual vocabulary: oriented edges, color contrasts, textures, then corners, then object parts. Those low-level detectors are not specific to "is this a Labrador" — they are what any natural-image task needs. So rather than train a CNN from random weights on your few thousand images (which would overfit badly), you start from a pre-trained backbone and adapt it. Two regimes:

Feature extraction (frozen backbone). Freeze all convolutional weights, discard the original 1000-class head, and train only a fresh classifier on top of the final feature vector. The CNN becomes a fixed feature function; you are fitting a small linear model on excellent features. This is the right move when your dataset is small and/or similar to ImageNet — it cannot overfit the backbone because the backbone does not move.
Fine-tuning. Unfreeze some or all of the backbone and continue training on your data at a small learning rate (often 10–100× lower than from-scratch), so the pre-trained weights are nudged, not erased. Best when you have more data or a domain that drifts from natural photos (medical scans, satellite imagery). A common recipe trains the new head first, then unfreezes the top blocks; the lowest layers — those universal edge detectors — are usually left frozen or barely touched.

The empirical pattern that justifies the whole approach: feature transferability decreases with depth. Early layers transfer almost perfectly across tasks; the last layers are the most task-specific and benefit most from adaptation. That is why "freeze the bottom, retrain the top" is the default, and why transfer learning routinely reaches strong accuracy with hundreds of examples instead of millions — the expensive representation learning was already paid for, once, by whoever trained the backbone.
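
The mechanics of the two regimes can be shown on a toy problem. Everything in the sketch below is a stand-in and should be read as an assumption: a fixed random projection with a ReLU plays the role of the pre-trained backbone, a logistic-regression layer plays the head, and the data are synthetic. It is meant only to show which parameters move in each regime, not to reproduce real transfer-learning accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not a real pre-trained CNN)
n, d_in, d_feat = 200, 20, 32
X = rng.normal(size=(n, d_in))
y = (X[:, 0] + X[:, 1] > 0).astype(float)                       # synthetic binary labels
backbone = rng.normal(scale=d_in ** -0.5, size=(d_in, d_feat))  # pretend these are pre-trained

def train(X, y, backbone, head_lr, backbone_lr, steps=300):
    """Gradient descent on the head; the backbone moves only if backbone_lr > 0."""
    B = backbone.copy()
    w, b = np.zeros(B.shape[1]), 0.0
    for _ in range(steps):
        pre = X @ B
        feats = np.maximum(pre, 0.0)                  # backbone features
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))    # head prediction
        err = (p - y) / len(y)                        # d(loss)/d(logit) for cross-entropy
        grad_B = X.T @ (err[:, None] * w[None, :] * (pre > 0))  # backprop into the backbone
        w -= head_lr * (feats.T @ err)
        b -= head_lr * err.sum()
        B -= backbone_lr * grad_B
    acc = np.mean(((np.maximum(X @ B, 0.0) @ w + b) > 0) == (y > 0.5))
    return B, acc

# feature extraction: backbone learning rate is zero
B_frozen, acc_frozen = train(X, y, backbone, head_lr=0.5, backbone_lr=0.0)
# fine-tuning: backbone learning rate 100x smaller than the head's
B_tuned, acc_tuned = train(X, y, backbone, head_lr=0.5, backbone_lr=0.005)

print("feature extraction: backbone unchanged =", np.array_equal(B_frozen, backbone), " train acc =", acc_frozen)
print("fine-tuning       : max backbone drift =", np.abs(B_tuned - backbone).max(), " train acc =", acc_tuned)
```

In a real framework the same split is usually expressed by marking backbone parameters as non-trainable or by giving them their own, smaller learning rate; the toy above just makes the two update rules explicit.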

INSTRUMENT N2.3 — RECEPTIVE-FIELD CALCULATOR · 3×3 CONVS + 2×2 POOLS · CUMULATIVE RF

Settings: 3×3 conv layers 3, 2×2 pools (after every Nth conv) 2. Readouts: receptive field 18 px, feature stride (jump), total layers.

Each unit in a deep feature map "sees" a window of the original image — its receptive field. With only $3\times 3$ convs the field grows by 2 px per layer (linearly); insert a stride-2 pool and the jump doubles, so every later conv now reaches twice as far — the field grows much faster. Slide the pool count up and watch a stack of tiny $3\times 3$ kernels come to cover the whole image. The RF recursion: $r_\ell = r_{\ell-1} + (k-1)\,j_{\ell-1}$, with jump $j_\ell = j_{\ell-1}\cdot s$.
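
The recursion is short enough to run by hand or in code. The sketch below takes a list of (kernel, stride) pairs; the interleaved ordering conv, pool, conv, pool, conv is an assumption about how the calculator arranges its layers, chosen because it reproduces the 18 px readout for three convs and two pools.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, first layer first.
    Applies r <- r + (k - 1) * j, then j <- j * s."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

conv, pool = (3, 1), (2, 2)

print(receptive_field([conv, conv, conv]))              # convs only: (7, 1)
print(receptive_field([conv, pool, conv, pool, conv]))  # two pools interleaved: (18, 4)
```

The second result shows the effect described above: the last conv alone adds 8 pixels to the field, because by then the jump is 4.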

Two practical caveats apply when reusing a backbone. Its effective receptive field is often smaller than the theoretical one, because activations near the window's center dominate. And the input statistics it expects (resolution, normalization, channel order) must match what it was trained on. Feed a medical grayscale scan to an RGB-ImageNet backbone without reconciling these and the transfer quietly underperforms. Match the preprocessing first; a mismatch there is the most common silent failure.
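
A minimal sketch of that reconciliation for a grayscale input follows. The mean and standard deviation are the commonly used ImageNet channel statistics for RGB inputs scaled to [0, 1]; they are an assumption here, and the values a particular backbone was trained with should be checked against its documentation. The helper name and the random stand-in scan are hypothetical, and the image is assumed to be at the expected resolution already.

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB order, inputs scaled to [0, 1]).
# Assumption: verify against the preprocessing your backbone was trained with.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def prepare_grayscale(scan):
    """Turn an (H, W) uint8 grayscale image into a normalized (3, H, W) float array."""
    x = scan.astype(np.float64) / 255.0               # match the [0, 1] input range
    x = np.repeat(x[None, :, :], 3, axis=0)           # replicate to three channels
    return (x - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

rng = np.random.default_rng(0)
scan = rng.integers(0, 256, size=(224, 224), dtype=np.uint8)   # hypothetical scan
x = prepare_grayscale(scan)
print(x.shape, x.dtype)
```

Replicating the single channel three times is the simplest way to satisfy an RGB backbone's input shape; whether it is the best choice for a given modality is a separate question that fine-tuning can partly absorb.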

Convolution shares weights across space; the next idea shares them across time. Chapter 03 turns to sequence models — RNNs, LSTMs, and the gating that lets a network carry information across hundreds of steps — the line of work that the Transformer would eventually overtake.

2.R

References

LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11) — LeNet-5; the conv → pool → dense template trained end-to-end by backprop.
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; ReLU, dropout, and GPU training that ignited the deep-learning era.
Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 — VGG; depth via stacks of $3\times 3$ convolutions.
Szegedy, C. et al. (2015). Going Deeper with Convolutions. CVPR 2015 — GoogLeNet / Inception; multi-scale blocks and $1\times 1$ bottlenecks.
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016 — ResNet (EQ N2.4); the skip connection that unlocked very deep networks.
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks? NeurIPS 27 — the empirical basis for transfer learning; transferability falls with depth.
Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer, the chief modern challenger to the convolutional prior.