The convolution operation
A fully-connected layer treats an image as a flat vector: a \(224\times 224\) RGB picture becomes \(150{,}528\) inputs, and a single hidden unit reading all of them owns that many weights. That is wasteful on two counts. It ignores locality — the pixels that matter for detecting an edge are right next to each other, not scattered across the frame — and it ignores repetition — an edge in the top-left corner is the same visual pattern as an edge in the bottom-right, yet a dense layer must relearn it for every position. A convolutional layer fixes both by sliding one small set of weights, a kernel (or filter), across every location and reusing it everywhere.
The arithmetic is a sum of element-wise products between the kernel and the patch of input it currently overlaps. (Deep-learning libraries implement cross-correlation — no kernel flip — and call it convolution; since the kernel is learned, the flip is irrelevant.) For a 2D input \(I\) and a \(k\times k\) kernel \(W\), the output at position \((i,j)\) is:
The parameter savings are dramatic. A dense layer mapping a \(32\times 32\) single-channel image to a same-sized output needs \(1024 \times 1024 \approx 10^6\) weights; a \(3\times 3\) convolution producing the same map needs nine (plus a bias), regardless of image size. That economy is not just cheaper — it is a prior. By forcing every location to share weights, convolution declares in advance that visual patterns are position-independent, and that prior is close enough to true that CNNs generalize from far less data than an unconstrained network would need.
The boundary deserves a word. With no padding, a \(k\times k\) kernel cannot center on the outermost pixels, so the feature map shrinks: an \(N\times N\) input yields an \((N-k+1)\times(N-k+1)\) output. Section 2.3 makes this precise and shows how padding and stride control it.
# EQ N2.1: 2D convolution (cross-correlation) from scratch in numpy
import numpy as np
# a tiny 6x6 "image": a bright vertical bar down the middle
img = np.zeros((6, 6))
img[:, 2:4] = 1.0
# a vertical-edge detector (Sobel-x): fires where left/right brightness differ
K = np.array([[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]], dtype=float)
kh, kw = K.shape
H, W = img.shape
out = np.zeros((H - kh + 1, W - kw + 1)) # valid padding -> shrinks
for i in range(out.shape[0]):
for j in range(out.shape[1]):
patch = img[i:i+kh, j:j+kw] # the window under the kernel
out[i, j] = np.sum(patch * K) # element-wise product, summed
np.set_printoptions(precision=1, suppress=True)
print("input image (the bar):\n", img)
print("\nfeature map (vertical edges):\n", out)
print("\nleft edge of bar -> +, right edge -> -. flat regions stay 0.")
Pooling & translation invariance
Convolution is equivariant: move the cat one pixel right and its feature map moves one pixel right too. For classification we usually want something stronger — invariance: the answer "cat" should not change at all when the cat shifts. Pooling is the classic mechanism that converts a little equivariance into a little invariance. It slides a window over the feature map and summarizes each window with a single number — usually the maximum (max-pooling) or the average:
Pooling buys robustness at a price, and it is worth being honest about it: by discarding precise position it can hurt tasks that need position — segmentation, keypoint localization, anything that must answer "where," not just "what." It is also not the only way to downsample. Since the mid-2010s many architectures drop pooling entirely and use strided convolutions (a conv that itself moves in steps of 2) to shrink the map while keeping the operation learnable. Pooling survives mostly as a cheap, parameter-free downsampler and, in the form of global average pooling, as the standard bridge from the final feature map to a classifier — averaging each channel to one number and feeding that vector to a linear head, which removes the giant dense layers that dominated early CNNs.
# EQ N2.2: 2x2 max-pooling, and how it absorbs a 1-pixel shift
import numpy as np
def maxpool2x2(x): # stride 2, no overlap
H, W = x.shape
out = np.zeros((H // 2, W // 2))
for i in range(out.shape[0]):
for j in range(out.shape[1]):
out[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()
return out
fmap = np.array([[0, 1, 0, 0], # a feature map with one peak
[2, 9, 1, 0],
[0, 0, 3, 1],
[0, 0, 1, 0]], dtype=float)
shifted = np.roll(fmap, 1, axis=1) # shift everything right by 1px
print("original feature map:\n", fmap)
print("2x2 max-pooled :\n", maxpool2x2(fmap))
print("\nshifted-by-1 pooled :\n", maxpool2x2(shifted))
print("\nthe 9 stays the top-left pooled value in BOTH -> shift absorbed.")
print("output went from 4x4 to 2x2: a quarter of the area, for free.")
Channels, stride & padding
Real images are not flat grids — a color photo is \(H\times W\times 3\), three channels (red, green, blue) stacked in depth. Convolution generalizes by giving each kernel the same depth as its input: a kernel applied to a 3-channel image is itself \(k\times k\times 3\), it dots over all input channels at once, and it produces a single output map. To get a richer representation you simply run \(C_{\text{out}}\) such kernels in parallel — one per output channel — so a conv layer is parameterized by a 4D weight tensor:
Notice what the layer trades away: a conv with \(C_{\text{in}}=C_{\text{out}}=256\) and a \(3\times 3\) kernel has \((9\cdot 256 + 1)\cdot 256 \approx 590{,}000\) parameters — independent of image size, because the same kernels run at every location. The familiar CNN rhythm follows directly: as pooling and strided convs shrink the spatial grid, channel counts rise to compensate, trading "where" for "what" as you go deeper. A typical backbone might run \(224^2\times 3 \to 56^2\times 64 \to 28^2\times 128 \to 14^2\times 256 \to 7^2\times 512\): the map gets small and deep until a global pool and a linear layer read off the answer.
A practical refinement worth naming: the \(1\times 1\) convolution. With \(k=1\) it does no spatial mixing at all — it is a per-pixel linear layer across channels, used to cheaply change channel depth (a "bottleneck") before an expensive \(3\times 3\). Depthwise-separable convolutions (each channel convolved on its own, then \(1\times 1\) mixed) push the same idea to mobile-scale efficiency, and are the engine of the MobileNet/EfficientNet family.
Classic architectures — LeNet to ResNet
The CNN's history is a tight sequence of ideas, each fixing the previous generation's ceiling. Reading it in order is the fastest way to understand what every modern backbone is made of.
| Architecture | Year | Depth | The idea it introduced |
|---|---|---|---|
| LeNet-5 | 1998 | 7 | The template itself: conv → pool → conv → pool → dense, trained by backprop to read handwritten digits (MNIST/checks). |
| AlexNet | 2012 | 8 | The same idea at GPU scale on ImageNet: ReLU, dropout, data augmentation. Halved the error and started the deep-learning era. |
| VGG | 2014 | 16–19 | Depth from uniformity: stacks of small \(3\times 3\) convs only. Two \(3\times 3\)s see a \(5\times 5\) region with fewer params and more nonlinearity. |
| GoogLeNet / Inception | 2014 | 22 | Multi-scale "Inception" blocks (\(1\times 1\), \(3\times 3\), \(5\times 5\) in parallel) and \(1\times 1\) bottlenecks to stay cheap. |
| ResNet | 2015 | 50–152 | The residual / skip connection — the breakthrough that let networks go past ~20 layers without degrading. |
The decisive jump is ResNet. Through 2014 the field believed deeper was better, yet past about twenty layers accuracy got worse — not from overfitting (training error rose too) but from an optimization failure: gradients had to thread through too many transformations to reach early layers, and very deep plain stacks could not even learn the identity function reliably. He et al. solved it with a one-line change to the building block — add the input back to the output:
It is worth noting where this story stands in 2026, honestly. CNNs no longer hold the absolute accuracy crown on large-scale benchmarks — Vision Transformers (ViT) match or exceed them given enough data, by replacing the convolutional prior with attention over image patches. But the contest is closer than headlines suggest: convnets modernized with the same training recipes (the "ConvNeXt" line) remain competitive, and CNNs still dominate where data is limited or latency and edge deployment matter, precisely because their built-in inductive bias substitutes for data the way a ViT cannot. Convolution did not lose; it became one well-understood tool among several.
VGG's quiet lesson: two stacked \(3\times 3\) convolutions have the same \(5\times 5\) receptive field as one \(5\times 5\) conv, but use \(2\cdot(3^2)=18\) weights per channel instead of \(25\), and insert an extra nonlinearity between them. Three \(3\times 3\)s match a \(7\times 7\) (27 vs 49 weights). Deeper-but-thinner won, and \(3\times 3\) became the field's default kernel — the receptive-field calculator below shows exactly how that field grows with depth.
Transfer learning
The single most practically important fact about CNNs is that the features they learn transfer. A network trained on ImageNet's 1.2 million labelled photos learns, in its early layers, a near-universal visual vocabulary: oriented edges, color contrasts, textures, then corners, then object parts. Those low-level detectors are not specific to "is this a Labrador" — they are what any natural-image task needs. So rather than train a CNN from random weights on your few thousand images (which would overfit badly), you start from a pre-trained backbone and adapt it. Two regimes:
- Feature extraction (frozen backbone). Freeze all convolutional weights, discard the original 1000-class head, and train only a fresh classifier on top of the final feature vector. The CNN becomes a fixed feature function; you are fitting a small linear model on excellent features. This is the right move when your dataset is small and/or similar to ImageNet — it cannot overfit the backbone because the backbone does not move.
- Fine-tuning. Unfreeze some or all of the backbone and continue training on your data at a small learning rate (often 10–100× lower than from-scratch), so the pre-trained weights are nudged, not erased. Best when you have more data or a domain that drifts from natural photos (medical scans, satellite imagery). A common recipe trains the new head first, then unfreezes the top blocks; the lowest layers — those universal edge detectors — are usually left frozen or barely touched.
The empirical pattern that justifies the whole approach: feature transferability decreases with depth. Early layers transfer almost perfectly across tasks; the last layers are the most task-specific and benefit most from adaptation. That is why "freeze the bottom, retrain the top" is the default, and why transfer learning routinely reaches strong accuracy with hundreds of examples instead of millions — the expensive representation learning was already paid for, once, by whoever trained the backbone.
A subtle gotcha transfer learning shares with the receptive field: a backbone's effective receptive field is often smaller than its theoretical one (activations near the window's center dominate), and the input statistics it expects — resolution, normalization, channel order — must match what it was trained on. Feed a medical grayscale scan to an RGB-ImageNet backbone without reconciling these and the transfer quietly underperforms. Match the preprocessing first; it is the most common silent failure.
Convolution shares weights across space; the next idea shares them across time. Chapter 03 turns to sequence models — RNNs, LSTMs, and the gating that lets a network carry information across hundreds of steps — the line of work that the Transformer would eventually overtake.
References
- LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition.
- Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
- Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition.
- Szegedy, C. et al. (2015). Going Deeper with Convolutions.
- He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition.
- Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. (2014). How Transferable Are Features in Deep Neural Networks?.
- Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale.