AI // ENCYCLOPEDIA / MULTIMODAL / 01 / VISION INDEX NEXT: 02 MULTIMODAL LLMs →
MULTIMODAL & WORLD MODELS · CHAPTER 01 / 06

Computer Vision with Deep Nets

Convolutional networks learned to read pixels, ImageNet made accuracy a shared benchmark, and recognition became the first task deep learning largely solved. The Vision Transformer later replaced the convolutional prior with the same attention blocks used in language models, putting vision and text on one architecture. This chapter covers how images become features, why ViT works, and how CLIP compares a photograph against a sentence.

LEVELCORE READING TIME≈ 26 MIN BUILDS ONDEEP LEARNING 02 · VOL II 03 INSTRUMENTSPATCH EMBED · RECEPTIVE FIELD · CLIP SIMILARITY
1.1

From pixels to features — the CV problem

A digital image is a tensor of numbers: an \(H\times W\times 3\) grid where each cell holds the red, green and blue intensity of one pixel. A modest \(224\times 224\) RGB photo is therefore \(150{,}528\) values — and nothing in those raw numbers tells you there is a cat in the frame. The defining difficulty of computer vision is the semantic gap: the pixels are low-level and the label is high-level, and between them sit every nuisance a real scene throws at you — lighting, viewpoint, occlusion, scale, background clutter, intra-class variation. Two photos of the same cat can share almost no pixel values; two photos of different things can be nearly identical at the pixel level.

For decades the answer was hand-engineered features: SIFT, HOG, SURF — operators a human designed to be robust to some of those nuisances, computing histograms of gradient orientations or scale-invariant keypoints, then feeding the result to a classifier like an SVM. They worked, within limits, and they encoded real insight about what makes images comparable. Their ceiling was that a person had to guess in advance which features mattered, and that guess never generalized far beyond the task it was tuned for.

The deep-learning bet was to stop guessing and learn the feature hierarchy from data. A trained vision network discovers, layer by layer, a representation that climbs the semantic gap on its own: the first layers respond to oriented edges and color contrasts, the middle layers to textures and motifs, the deep layers to object parts and finally whole objects. Crucially this hierarchy is not specified by hand — it falls out of fitting one differentiable function end-to-end against labels. The same compositional idea underlies everything that follows in this chapter; the only question that changes is what architecture does the composing.

EQ MM1.1 — AN IMAGE AS A FUNCTION TO A LABEL $$ I \in \mathbb{R}^{H\times W\times 3}, \qquad \hat{y} \;=\; f_\theta(I), \qquad \theta^\star = \arg\min_\theta \; \mathbb{E}_{(I,y)}\big[\, \mathcal{L}\big(f_\theta(I),\, y\big)\,\big] $$
Vision is supervised learning where the input is a pixel grid. The classical pipeline fixed a hand-designed feature map \(\phi(I)\) and learned only a shallow classifier on top of \(\phi\). Deep learning makes the entire map \(f_\theta\) — features and classifier together — learnable, so the representation is optimized for the task instead of guessed in advance. Everything in this chapter is a different parameterization of \(f_\theta\): a convolutional stack (§1.2), a patch-transformer (§1.3), or a dual image/text encoder (§1.5).

A subtlety worth stating up front: a pixel value is not a meaningful coordinate. Doubling the brightness of a photo is, semantically, almost a no-op, yet it moves every input number. This is exactly why raw-pixel nearest-neighbour search fails and why learned features — which are trained to be invariant to such nuisances — are the entire game. The features below are what make two images comparable at all.

PYTHON · RUNNABLE IN-BROWSER
# Why raw pixels are a bad space: brightness shift moves every number,
# yet the picture is "the same". Learned features fix this; pixels don't.
import numpy as np
rng = np.random.default_rng(0)

# a tiny 8x8 grayscale "image": a bright square in the corner
img = np.zeros((8, 8))
img[1:4, 1:4] = 0.8

same_scene = np.clip(img + 0.15, 0, 1)        # same scene, brighter lighting
diff_scene = np.zeros((8, 8)); diff_scene[4:7, 4:7] = 0.8   # different object

def pix_dist(a, b):                            # raw-pixel L2 distance
    return float(np.sqrt(((a - b) ** 2).sum()))

print(f"pixel dist  same scene (brighter) : {pix_dist(img, same_scene):.3f}")
print(f"pixel dist  different object      : {pix_dist(img, diff_scene):.3f}")
print("\nin pixel space a brightness change can look as 'far' as a new object.")
print("learning a feature map that ignores lighting is the whole point of §1.2-1.5.")
edits are live — break it on purpose
1.2

Convolutional backbones & ImageNet

The architecture that closed the semantic gap was the convolutional neural network (covered in full in Deep Learning · Ch 02). The single idea: rather than a dense layer that wires every pixel to every unit, slide one small learnable kernel across the whole image and reuse it everywhere. That bakes in two priors that happen to be true of natural images — locality (the pixels relevant to an edge are adjacent) and translation equivariance (a pattern means the same thing wherever it appears) — and it is exactly this built-in inductive bias that lets a CNN generalize from far fewer images than an unconstrained network would need.

A backbone is the stack of convolution + normalization + pooling blocks that turns an image into a feature volume; a small head on top reads that volume for the task at hand. The canonical rhythm is to shrink the spatial grid while growing channel depth — trading where for what — until a global pool collapses the map to one vector per channel that a linear classifier can read.

What made this a field-wide movement rather than a clever trick was a benchmark. ImageNet — roughly 1.2 million labelled training images across 1000 categories, organized for the ILSVRC competition — gave everyone a shared, hard, large-scale yardstick. The inflection point was 2012: AlexNet cut the top-5 error from ~26% to ~16% in one stroke, the moment the modern deep-learning era is usually dated to. The years after were a tight relay of ideas, each fixing the previous ceiling:

ModelYearTop-5 err.The idea it added
AlexNet2012~16.4%CNNs at GPU scale on ImageNet: ReLU, dropout, heavy augmentation. Started the era.
VGG-16/192014~7.3%Depth from uniformity — stacks of \(3\times 3\) convs only; two \(3\times3\)s match a \(5\times5\) field with fewer weights.
GoogLeNet2014~6.7%Multi-scale Inception blocks with \(1\times1\) bottlenecks to stay cheap.
ResNet-1522015~3.6%The residual / skip connection — \(\mathbf{y}=\mathcal{F}(\mathbf{x})+\mathbf{x}\) — that let networks go past ~20 layers and below human-level error.

The decisive jump was ResNet. Before it, the field believed deeper was simply better, yet past about twenty layers accuracy got worse — and not from overfitting, since training error rose too. The cause was an optimization failure: gradients had to thread through too many transformations to reach the early layers. He et al. fixed it by adding the block's input back to its output, so each block learns only a residual correction and the gradient gets a direct \(+1\) path home. That one line enabled 152-layer networks, won ImageNet 2015, and created the residual stream that is now the backbone of essentially every deep architecture — Transformers included (Vol II · EQ 2.x). Hold onto this; the receptive-field instrument below makes the depth story tangible, and §1.3 reuses the very same residual blocks with attention instead of convolution.

Honest status in 2026: convnets no longer hold the absolute accuracy crown on large-scale benchmarks — given enough data, ViT (§1.3) matches or exceeds them. But the gap is narrower than headlines imply. CNNs modernized with the same training recipes (the "ConvNeXt" line) remain competitive, and convolution still wins where data is scarce or latency and edge deployment matter, because its inductive bias substitutes for data a transformer cannot. Convolution did not lose; it became one well-understood tool among several.

INSTRUMENT MM1.1 — CONV vs ViT RECEPTIVE FIELD3×3 CONV STACK (LOCAL, GROWS) vs ViT (GLOBAL FROM LAYER 1)
CONV RECEPTIVE FIELD
11 px
ViT RECEPTIVE FIELD
FULL
CONV FEATURE STRIDE
2
A convolutional unit only "sees" a window of the input — its receptive field — which grows slowly with depth: \(r_\ell = r_{\ell-1} + (k-1)\,j_{\ell-1}\), and only a stride-2 downsample doubles the jump \(j\) so later layers reach further. A ViT self-attention layer, by contrast, lets every patch attend to every other patch, so its receptive field is the entire image at layer 1 (flat line at the top). That single difference — local-and-growing vs global-and-immediate — is the whole architectural argument of §1.3. Add downsamples and watch the conv curve climb to meet the global line.
1.3

Vision Transformers (ViT)

The Transformer was built for sequences of tokens (Vol II · Ch 02–03). The Vision Transformer's contribution — "An Image is Worth 16×16 Words" — was the disarmingly simple observation that you can turn an image into a sequence of tokens and then change almost nothing else. Cut the image into a grid of non-overlapping square patches, flatten each patch into a vector, project it linearly to the model dimension, and you have a sequence of "visual words" that a standard Transformer encoder can chew on with self-attention.

Concretely, take an \(H\times W\) image and a patch size \(P\). You get a grid of \(\tfrac{H}{P}\times\tfrac{W}{P}\) patches; the canonical recipe is a \(224\times 224\) image with \(P=16\), giving a \(14\times 14\) grid — 196 patches. Each patch of \(P\times P\times 3 = 768\) raw values is flattened and multiplied by one learned matrix \(E\) to produce a \(D\)-dimensional patch embedding. This is the only image-specific operation in the entire model:

EQ MM1.2 — PATCH EMBEDDING & SEQUENCE LENGTH $$ N \;=\; \frac{H}{P}\cdot\frac{W}{P}, \qquad z_0 \;=\; \big[\, x_{\text{cls}};\; x_p^{1}E;\; x_p^{2}E;\; \cdots;\; x_p^{N}E \,\big] \;+\; E_{\text{pos}}, \qquad E \in \mathbb{R}^{(P^2\cdot 3)\times D} $$
Each raw patch \(x_p^{i}\in\mathbb{R}^{P^2\cdot 3}\) is linearly projected by the shared matrix \(E\) to a \(D\)-vector — a single learned conv with kernel and stride both equal to \(P\). A prepended learnable [CLS] token \(x_{\text{cls}}\) aggregates information for classification, and a learned positional embedding \(E_{\text{pos}}\) is added because attention is permutation-invariant — without it the model could not tell top-left from bottom-right. From here the sequence \(z_0\) enters an ordinary Transformer encoder; there is no convolutional prior left. The cost: self-attention is \(O(N^2)\) in the patch count, so halving \(P\) quadruples \(N\) and roughly 16×'s the attention compute.

What you gain is a global receptive field from layer one (the instrument in §1.2 shows exactly this): any patch can attend to any other in a single step, where a CNN needs many layers to relate distant regions. What you give up is the convolutional inductive bias — locality and translation equivariance no longer come for free, the model must learn them. That trade has a sharp consequence the original paper made famous: ViT is data-hungry. Trained on ImageNet-1k alone it underperforms a comparable ResNet; pre-trained on a far larger corpus (the paper used the 303M-image JFT-300M) it overtakes the best CNNs. The slogan is fair: convolution trades data for prior knowledge; ViT trades prior knowledge for data.

Since 2020 the picture has filled in. Strong data augmentation and regularization recipes (DeiT) made ViT trainable on ImageNet-1k alone; hierarchical, windowed variants (Swin) reintroduced a multi-scale, locality-aware structure that made transformers practical as general backbones for detection and segmentation; and self-supervised pre-training (DINO, MAE) learns ViT features without labels at all. The CLS-token / patch-sequence formulation, though, is the through-line — and it is what lets the same model family ingest images and text in one stack, the subject of §1.5 and of the next chapter.

A Vision Transformer processes a \(224\times 224\) image using non-overlapping \(16\times 16\) patches. How many patches does it produce (excluding the [CLS] token)?
Patches per side \(= 224/16 = 14\). The grid is \(14\times 14\), so the patch count is \(14^2 = \) 196. (With the [CLS] token the sequence length the encoder sees is \(196+1 = 197\).)
PYTHON · RUNNABLE IN-BROWSER
# EQ MM1.2: patchify an image into NxN patches and flatten — the ONE
# image-specific step in a Vision Transformer. We use a toy 8x8 RGB image.
import numpy as np
rng = np.random.default_rng(0)

H = W = 8          # image side (toy; ViT uses 224)
C = 3              # RGB channels
P = 2              # patch side (toy; ViT uses 16)
img = rng.integers(0, 256, size=(H, W, C)).astype(float)

n_side = H // P                                   # patches per row/col
patches = (img.reshape(n_side, P, n_side, P, C)   # block the grid
              .transpose(0, 2, 1, 3, 4)           # group each PxP block
              .reshape(n_side * n_side, P * P * C))  # flatten each patch

print(f"image shape          : {img.shape}")
print(f"grid of patches      : {n_side} x {n_side} = {n_side*n_side} patches")
print(f"patch sequence shape : {patches.shape}   (N, P*P*C)")
print(f"one patch is a vector of length P*P*C = {P*P*C}")

# the same arithmetic for the real ViT-Base/16 setting:
H2, P2 = 224, 16
print(f"\nViT-B/16 on 224x224  : ({H2}//{P2})^2 = {(H2//P2)**2} patches (+1 CLS)")
edits are live — break it on purpose
INSTRUMENT MM1.2 — PATCH-EMBEDDING EXPLORERIMAGE → GRID OF PATCHES → TOKEN SEQUENCE · EQ MM1.2
PATCHES (N)
196
SEQUENCE (N + CLS)
197
ATTENTION COST ∝ N²
38.4K
A synthetic scene is sliced into the \(\tfrac{H}{P}\times\tfrac{H}{P}\) patch grid that a ViT actually sees; each cell becomes one token in the sequence on the right. Switch to \(P=16\), \(H=224\) for the canonical 196 patches. Drop \(P\) to 8 and watch the patch count quadruple and the \(N^2\) attention cost explode — the central efficiency knob of every vision transformer.
1.4

Detection & segmentation in brief

Classification answers "what is in this image?" with a single label. Most real vision tasks demand "what, and where?" — and the answer's shape is what distinguishes the major task families. It is worth fixing the vocabulary, because the rest of the field is built on it:

  • Object detection — predict a bounding box plus a class for every object instance. Output: a variable-length list of (box, label, score) tuples.
  • Semantic segmentation — assign a class to every pixel, with no notion of instances. Two adjacent cars become one undifferentiated "car" region. Output: an \(H\times W\) label map.
  • Instance segmentation — a per-pixel mask for each individual object, separating those two cars. The union of detection and semantic segmentation.
  • Panoptic segmentation — label every pixel, distinguishing countable things (cars, people) as instances while treating uncountable stuff (sky, road) as regions.

Architecturally these are heads bolted onto the backbones of §1.2–1.3. The historical arc on detection ran from two-stage region proposers (R-CNN → Fast → Faster R-CNN, which propose candidate boxes then classify them) to single-shot real-time detectors (the YOLO and SSD families, which predict boxes directly in one pass), and most recently to set-prediction transformers (DETR), which cast detection as directly emitting a fixed set of objects and match predictions to ground truth with the Hungarian algorithm — no hand-designed anchor boxes or non-max suppression. Segmentation followed a parallel path: fully-convolutional encoder–decoders (FCN, U-Net) that upsample features back to pixel resolution, then Mask R-CNN adding a mask branch to a detector, and now transformer-based unifiers (Mask2Former) that treat every segmentation task as mask classification.

How do you score a predicted box against the truth? The universal currency is Intersection over Union — the overlap of the two boxes divided by their combined area. It is the threshold that decides whether a detection "counts," and it composes into mean Average Precision (mAP), the standard detection metric.

EQ MM1.3 — INTERSECTION OVER UNION (IoU) $$ \mathrm{IoU}(A, B) \;=\; \frac{|A \cap B|}{|A \cup B|} \;=\; \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \;\in\; [0, 1] $$
\(A\) is the predicted box, \(B\) the ground-truth box; \(|\cdot|\) is area. IoU \(= 1\) is a perfect overlap, \(0\) is disjoint. The union is written as \(|A|+|B|-|A\cap B|\) so the shared area is not double-counted. A detection is conventionally accepted when \(\mathrm{IoU}\ge 0.5\) (stricter benchmarks average over thresholds up to 0.95). The same measure, applied per-pixel, scores segmentation masks. IoU is scale-free — it cares about relative overlap, which is exactly why it survives across box sizes.
Two \(1\times 1\) boxes are offset so they overlap in a region of area \(0.5\). What is their IoU? (Use \(\mathrm{IoU}=\dfrac{|A\cap B|}{|A|+|B|-|A\cap B|}\).)
Intersection \(=0.5\); each box has area \(1\), so the union \(=1+1-0.5=1.5\). Then \(\mathrm{IoU}=\dfrac{0.5}{1.5}=\dfrac{1}{3}=\) 0.333. Below the usual \(0.5\) acceptance threshold, so this prediction would not count as a hit.

The frontier in 2026 is open-vocabulary and promptable perception: the Segment Anything Model (SAM/SAM 2) segments arbitrary objects from a point or box prompt without per-class training, and grounding models like Grounding DINO detect objects named by free text. Both lean on the image–text alignment of §1.5 — perception is increasingly steered by language rather than a fixed list of classes.

1.5

CLIP — connecting images and text

Every model so far maps an image to a label from a fixed list. CLIP — Contrastive Language–Image Pre-training — broke that ceiling by learning from the open web instead. The training signal is not class labels but the roughly 400 million (image, caption) pairs people already wrote on the internet. The architecture is two encoders that never share weights: an image encoder (a ViT or ResNet) and a text encoder (a Transformer). Each maps its input to a vector in one shared embedding space, and training pulls matching image–text pairs together while pushing mismatched ones apart.

The geometry that makes this work is cosine similarity: once both encoders' outputs are L2-normalized to unit length, the dot product of an image embedding and a text embedding measures the cosine of the angle between them — high when they describe the same thing, low when they do not. "Aligning images and text in a shared space" means precisely this: a photo of a dog and the string "a photo of a dog" land near each other on the unit sphere.

EQ MM1.4 — COSINE SIMILARITY & THE CONTRASTIVE OBJECTIVE $$ \mathrm{sim}(I, T) \;=\; \frac{\mathbf{u}_I \cdot \mathbf{v}_T}{\lVert \mathbf{u}_I\rVert\,\lVert \mathbf{v}_T\rVert}, \qquad \mathcal{L} \;=\; -\frac{1}{2}\Big[\, \log\frac{e^{\,\mathrm{sim}(I_i,T_i)/\tau}}{\sum_{j} e^{\,\mathrm{sim}(I_i,T_j)/\tau}} \;+\; \log\frac{e^{\,\mathrm{sim}(I_i,T_i)/\tau}}{\sum_{j} e^{\,\mathrm{sim}(I_j,T_i)/\tau}}\,\Big] $$
\(\mathbf{u}_I\) is the image embedding, \(\mathbf{v}_T\) the text embedding. The loss is a symmetric InfoNCE: over a batch of \(n\) pairs it forms an \(n\times n\) similarity matrix and applies cross-entropy so each image picks its true caption out of all \(n\) (and each caption picks its true image), with a learned temperature \(\tau\). The diagonal should be hot, everything off-diagonal cold. Because the supervision is "which caption matches," CLIP needs no curated label set and can therefore recognize concepts it was never explicitly trained to name.

The payoff is zero-shot classification. To classify an image into arbitrary categories you never trained on, write each candidate label as a sentence — "a photo of a {label}" — embed all of them with the text encoder, embed the image with the image encoder, and take the label whose embedding is most similar. No fine-tuning, no task-specific head; the classifier is built on the fly from words. CLIP matched a fully-supervised ResNet-50 on ImageNet without seeing a single ImageNet training label, and it generalizes far more robustly across distribution shifts than models trained on a fixed label set.

This shared space is the hinge of modern multimodal AI. CLIP's image encoder is the visual front-end of countless vision-language models (the next chapter); its text-conditioning is what lets diffusion image generators follow a prompt; and its similarity score is the retrieval engine behind "find me the photo that matches this description." The honest caveats: CLIP inherits the biases and noise of uncurated web data, it is weak at fine-grained counting and spatial relations, and its zero-shot accuracy is sensitive to the exact wording of the prompt ("prompt engineering" for images). It is a representation, not an oracle — but it is the representation that fused vision with the language stack.

True or false: CLIP trains an image encoder and a text encoder so that an image and its matching caption are mapped to nearby vectors in a single shared embedding space, compared by cosine similarity. (Answer true or false.)
This is exactly CLIP's design (EQ MM1.4): both encoders emit L2-normalized vectors into one common space, and the contrastive objective pulls each (image, caption) pair together while pushing mismatched pairs apart. Matching pairs end up with high cosine similarity, which is what enables zero-shot classification by comparing an image against text-described labels. The statement is true.
PYTHON · RUNNABLE IN-BROWSER
# EQ MM1.4: cosine similarity + CLIP-style zero-shot pick. Toy 4-dim
# embeddings stand in for a real image/text encoder's output vectors.
import numpy as np

def unit(v):                                   # L2-normalize to the unit sphere
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(unit(a) @ unit(b))

# one image embedding, three candidate-caption embeddings
img        = np.array([0.9, 0.2, 0.1, 0.0])    # "a photo of a dog"
captions = {
    "a photo of a dog": np.array([0.8, 0.3, 0.0, 0.1]),
    "a photo of a cat": np.array([0.1, 0.9, 0.2, 0.0]),
    "a city skyline"  : np.array([0.0, 0.1, 0.2, 0.9]),
}

print("cosine similarity image vs each caption:")
scores = {t: cosine(img, v) for t, v in captions.items()}
for t, s in scores.items():
    print(f"  {s:+.3f}   {t}")

best = max(scores, key=scores.get)
print(f"\nzero-shot prediction (argmax cosine): \"{best}\"")
print("the dog caption wins — no ImageNet label, just words vs pixels.")
edits are live — break it on purpose
INSTRUMENT MM1.3 — CLIP SIMILARITY DEMOONE IMAGE EMBEDDING vs TEXT EMBEDDINGS · COSINE · EQ MM1.4
TOP MATCH
a photo of a dog
TOP COSINE
0.96
ZERO-SHOT P(TOP)
Pick an image; the bars show its cosine similarity to five text prompts, and the softmax over those similarities (sharpened by \(1/\tau\)) is the zero-shot class probability. The matching caption stays brightest no matter which image you choose — that consistency is the shared embedding space. Crank the temperature up and the distribution collapses toward a confident one-hot pick; turn it down and the model hedges across captions.
NEXT

CLIP gave vision and language a common coordinate system; the next step is putting them in one model that can talk back. Chapter 02 covers multimodal LLMs — how a CLIP-style vision encoder is stitched to a language model through a projection bridge, what visual instruction tuning trains, and how a single network learns to caption, answer questions about, and reason over images.

1.R

References

  1. Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021 — the Vision Transformer; patch embeddings and the [CLS] token (EQ MM1.2).
  2. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021 — CLIP; contrastive image–text pre-training and zero-shot transfer (EQ MM1.4).
  3. He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016 — ResNet; the skip connection that unlocked very deep convolutional backbones.
  4. Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 25 — AlexNet; the result that started the deep-learning era of vision.
  5. Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV 115(3) — the ImageNet/ILSVRC benchmark that drove the whole progression.
  6. Carion, N. et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020 — detection as set prediction; no anchors or non-max suppression.
  7. Kirillov, A. et al. (2023). Segment Anything. ICCV 2023 — SAM; promptable, open-vocabulary segmentation (§1.4 frontier).