From pixels to features — the CV problem
A digital image is a tensor of numbers: an \(H\times W\times 3\) grid where each cell holds the red, green and blue intensity of one pixel. A modest \(224\times 224\) RGB photo is therefore \(150{,}528\) values — and nothing in those raw numbers tells you there is a cat in the frame. The defining difficulty of computer vision is the semantic gap: the pixels are low-level and the label is high-level, and between them sit every nuisance a real scene throws at you — lighting, viewpoint, occlusion, scale, background clutter, intra-class variation. Two photos of the same cat can share almost no pixel values; two photos of different things can be nearly identical at the pixel level.
For decades the answer was hand-engineered features: SIFT, HOG, SURF — operators a human designed to be robust to some of those nuisances, computing histograms of gradient orientations or scale-invariant keypoints, then feeding the result to a classifier like an SVM. They worked, within limits, and they encoded real insight about what makes images comparable. Their ceiling was that a person had to guess in advance which features mattered, and that guess never generalized far beyond the task it was tuned for.
The deep-learning bet was to stop guessing and learn the feature hierarchy from data. A trained vision network discovers, layer by layer, a representation that climbs the semantic gap on its own: the first layers respond to oriented edges and color contrasts, the middle layers to textures and motifs, the deep layers to object parts and finally whole objects. Crucially this hierarchy is not specified by hand — it falls out of fitting one differentiable function end-to-end against labels. The same compositional idea underlies everything that follows in this chapter; the only question that changes is what architecture does the composing.
A subtlety worth stating up front: a pixel value is not a meaningful coordinate. Doubling the brightness of a photo is, semantically, almost a no-op, yet it moves every input number. This is exactly why raw-pixel nearest-neighbour search fails and why learned features — which are trained to be invariant to such nuisances — are the entire game. The features below are what make two images comparable at all.
# Why raw pixels are a bad space: brightness shift moves every number,
# yet the picture is "the same". Learned features fix this; pixels don't.
import numpy as np
rng = np.random.default_rng(0)
# a tiny 8x8 grayscale "image": a bright square in the corner
img = np.zeros((8, 8))
img[1:4, 1:4] = 0.8
same_scene = np.clip(img + 0.15, 0, 1) # same scene, brighter lighting
diff_scene = np.zeros((8, 8)); diff_scene[4:7, 4:7] = 0.8 # different object
def pix_dist(a, b): # raw-pixel L2 distance
return float(np.sqrt(((a - b) ** 2).sum()))
print(f"pixel dist same scene (brighter) : {pix_dist(img, same_scene):.3f}")
print(f"pixel dist different object : {pix_dist(img, diff_scene):.3f}")
print("\nin pixel space a brightness change can look as 'far' as a new object.")
print("learning a feature map that ignores lighting is the whole point of §1.2-1.5.")
Convolutional backbones & ImageNet
The architecture that closed the semantic gap was the convolutional neural network (covered in full in Deep Learning · Ch 02). The single idea: rather than a dense layer that wires every pixel to every unit, slide one small learnable kernel across the whole image and reuse it everywhere. That bakes in two priors that happen to be true of natural images — locality (the pixels relevant to an edge are adjacent) and translation equivariance (a pattern means the same thing wherever it appears) — and it is exactly this built-in inductive bias that lets a CNN generalize from far fewer images than an unconstrained network would need.
A backbone is the stack of convolution + normalization + pooling blocks that turns an image into a feature volume; a small head on top reads that volume for the task at hand. The canonical rhythm is to shrink the spatial grid while growing channel depth — trading where for what — until a global pool collapses the map to one vector per channel that a linear classifier can read.
What made this a field-wide movement rather than a clever trick was a benchmark. ImageNet — roughly 1.2 million labelled training images across 1000 categories, organized for the ILSVRC competition — gave everyone a shared, hard, large-scale yardstick. The inflection point was 2012: AlexNet cut the top-5 error from ~26% to ~16% in one stroke, the moment the modern deep-learning era is usually dated to. The years after were a tight relay of ideas, each fixing the previous ceiling:
| Model | Year | Top-5 err. | The idea it added |
|---|---|---|---|
| AlexNet | 2012 | ~16.4% | CNNs at GPU scale on ImageNet: ReLU, dropout, heavy augmentation. Started the era. |
| VGG-16/19 | 2014 | ~7.3% | Depth from uniformity — stacks of \(3\times 3\) convs only; two \(3\times3\)s match a \(5\times5\) field with fewer weights. |
| GoogLeNet | 2014 | ~6.7% | Multi-scale Inception blocks with \(1\times1\) bottlenecks to stay cheap. |
| ResNet-152 | 2015 | ~3.6% | The residual / skip connection — \(\mathbf{y}=\mathcal{F}(\mathbf{x})+\mathbf{x}\) — that let networks go past ~20 layers and below human-level error. |
The decisive jump was ResNet. Before it, the field believed deeper was simply better, yet past about twenty layers accuracy got worse — and not from overfitting, since training error rose too. The cause was an optimization failure: gradients had to thread through too many transformations to reach the early layers. He et al. fixed it by adding the block's input back to its output, so each block learns only a residual correction and the gradient gets a direct \(+1\) path home. That one line enabled 152-layer networks, won ImageNet 2015, and created the residual stream that is now the backbone of essentially every deep architecture — Transformers included (Vol II · EQ 2.x). Hold onto this; the receptive-field instrument below makes the depth story tangible, and §1.3 reuses the very same residual blocks with attention instead of convolution.
Honest status in 2026: convnets no longer hold the absolute accuracy crown on large-scale benchmarks — given enough data, ViT (§1.3) matches or exceeds them. But the gap is narrower than headlines imply. CNNs modernized with the same training recipes (the "ConvNeXt" line) remain competitive, and convolution still wins where data is scarce or latency and edge deployment matter, because its inductive bias substitutes for data a transformer cannot. Convolution did not lose; it became one well-understood tool among several.
Vision Transformers (ViT)
The Transformer was built for sequences of tokens (Vol II · Ch 02–03). The Vision Transformer's contribution — "An Image is Worth 16×16 Words" — was the disarmingly simple observation that you can turn an image into a sequence of tokens and then change almost nothing else. Cut the image into a grid of non-overlapping square patches, flatten each patch into a vector, project it linearly to the model dimension, and you have a sequence of "visual words" that a standard Transformer encoder can chew on with self-attention.
Concretely, take an \(H\times W\) image and a patch size \(P\). You get a grid of \(\tfrac{H}{P}\times\tfrac{W}{P}\) patches; the canonical recipe is a \(224\times 224\) image with \(P=16\), giving a \(14\times 14\) grid — 196 patches. Each patch of \(P\times P\times 3 = 768\) raw values is flattened and multiplied by one learned matrix \(E\) to produce a \(D\)-dimensional patch embedding. This is the only image-specific operation in the entire model:
What you gain is a global receptive field from layer one (the instrument in §1.2 shows exactly this): any patch can attend to any other in a single step, where a CNN needs many layers to relate distant regions. What you give up is the convolutional inductive bias — locality and translation equivariance no longer come for free, the model must learn them. That trade has a sharp consequence the original paper made famous: ViT is data-hungry. Trained on ImageNet-1k alone it underperforms a comparable ResNet; pre-trained on a far larger corpus (the paper used the 303M-image JFT-300M) it overtakes the best CNNs. The slogan is fair: convolution trades data for prior knowledge; ViT trades prior knowledge for data.
Since 2020 the picture has filled in. Strong data augmentation and regularization recipes (DeiT) made ViT trainable on ImageNet-1k alone; hierarchical, windowed variants (Swin) reintroduced a multi-scale, locality-aware structure that made transformers practical as general backbones for detection and segmentation; and self-supervised pre-training (DINO, MAE) learns ViT features without labels at all. The CLS-token / patch-sequence formulation, though, is the through-line — and it is what lets the same model family ingest images and text in one stack, the subject of §1.5 and of the next chapter.
# EQ MM1.2: patchify an image into NxN patches and flatten — the ONE
# image-specific step in a Vision Transformer. We use a toy 8x8 RGB image.
import numpy as np
rng = np.random.default_rng(0)
H = W = 8 # image side (toy; ViT uses 224)
C = 3 # RGB channels
P = 2 # patch side (toy; ViT uses 16)
img = rng.integers(0, 256, size=(H, W, C)).astype(float)
n_side = H // P # patches per row/col
patches = (img.reshape(n_side, P, n_side, P, C) # block the grid
.transpose(0, 2, 1, 3, 4) # group each PxP block
.reshape(n_side * n_side, P * P * C)) # flatten each patch
print(f"image shape : {img.shape}")
print(f"grid of patches : {n_side} x {n_side} = {n_side*n_side} patches")
print(f"patch sequence shape : {patches.shape} (N, P*P*C)")
print(f"one patch is a vector of length P*P*C = {P*P*C}")
# the same arithmetic for the real ViT-Base/16 setting:
H2, P2 = 224, 16
print(f"\nViT-B/16 on 224x224 : ({H2}//{P2})^2 = {(H2//P2)**2} patches (+1 CLS)")
Detection & segmentation in brief
Classification answers "what is in this image?" with a single label. Most real vision tasks demand "what, and where?" — and the answer's shape is what distinguishes the major task families. It is worth fixing the vocabulary, because the rest of the field is built on it:
- Object detection — predict a bounding box plus a class for every object instance. Output: a variable-length list of (box, label, score) tuples.
- Semantic segmentation — assign a class to every pixel, with no notion of instances. Two adjacent cars become one undifferentiated "car" region. Output: an \(H\times W\) label map.
- Instance segmentation — a per-pixel mask for each individual object, separating those two cars. The union of detection and semantic segmentation.
- Panoptic segmentation — label every pixel, distinguishing countable things (cars, people) as instances while treating uncountable stuff (sky, road) as regions.
Architecturally these are heads bolted onto the backbones of §1.2–1.3. The historical arc on detection ran from two-stage region proposers (R-CNN → Fast → Faster R-CNN, which propose candidate boxes then classify them) to single-shot real-time detectors (the YOLO and SSD families, which predict boxes directly in one pass), and most recently to set-prediction transformers (DETR), which cast detection as directly emitting a fixed set of objects and match predictions to ground truth with the Hungarian algorithm — no hand-designed anchor boxes or non-max suppression. Segmentation followed a parallel path: fully-convolutional encoder–decoders (FCN, U-Net) that upsample features back to pixel resolution, then Mask R-CNN adding a mask branch to a detector, and now transformer-based unifiers (Mask2Former) that treat every segmentation task as mask classification.
How do you score a predicted box against the truth? The universal currency is Intersection over Union — the overlap of the two boxes divided by their combined area. It is the threshold that decides whether a detection "counts," and it composes into mean Average Precision (mAP), the standard detection metric.
The frontier in 2026 is open-vocabulary and promptable perception: the Segment Anything Model (SAM/SAM 2) segments arbitrary objects from a point or box prompt without per-class training, and grounding models like Grounding DINO detect objects named by free text. Both lean on the image–text alignment of §1.5 — perception is increasingly steered by language rather than a fixed list of classes.
CLIP — connecting images and text
Every model so far maps an image to a label from a fixed list. CLIP — Contrastive Language–Image Pre-training — broke that ceiling by learning from the open web instead. The training signal is not class labels but the roughly 400 million (image, caption) pairs people already wrote on the internet. The architecture is two encoders that never share weights: an image encoder (a ViT or ResNet) and a text encoder (a Transformer). Each maps its input to a vector in one shared embedding space, and training pulls matching image–text pairs together while pushing mismatched ones apart.
The geometry that makes this work is cosine similarity: once both encoders' outputs are L2-normalized to unit length, the dot product of an image embedding and a text embedding measures the cosine of the angle between them — high when they describe the same thing, low when they do not. "Aligning images and text in a shared space" means precisely this: a photo of a dog and the string "a photo of a dog" land near each other on the unit sphere.
The payoff is zero-shot classification. To classify an image into arbitrary categories you never trained on, write each candidate label as a sentence — "a photo of a {label}" — embed all of them with the text encoder, embed the image with the image encoder, and take the label whose embedding is most similar. No fine-tuning, no task-specific head; the classifier is built on the fly from words. CLIP matched a fully-supervised ResNet-50 on ImageNet without seeing a single ImageNet training label, and it generalizes far more robustly across distribution shifts than models trained on a fixed label set.
This shared space is the hinge of modern multimodal AI. CLIP's image encoder is the visual front-end of countless vision-language models (the next chapter); its text-conditioning is what lets diffusion image generators follow a prompt; and its similarity score is the retrieval engine behind "find me the photo that matches this description." The honest caveats: CLIP inherits the biases and noise of uncurated web data, it is weak at fine-grained counting and spatial relations, and its zero-shot accuracy is sensitive to the exact wording of the prompt ("prompt engineering" for images). It is a representation, not an oracle — but it is the representation that fused vision with the language stack.
# EQ MM1.4: cosine similarity + CLIP-style zero-shot pick. Toy 4-dim
# embeddings stand in for a real image/text encoder's output vectors.
import numpy as np
def unit(v): # L2-normalize to the unit sphere
return v / np.linalg.norm(v)
def cosine(a, b):
return float(unit(a) @ unit(b))
# one image embedding, three candidate-caption embeddings
img = np.array([0.9, 0.2, 0.1, 0.0]) # "a photo of a dog"
captions = {
"a photo of a dog": np.array([0.8, 0.3, 0.0, 0.1]),
"a photo of a cat": np.array([0.1, 0.9, 0.2, 0.0]),
"a city skyline" : np.array([0.0, 0.1, 0.2, 0.9]),
}
print("cosine similarity image vs each caption:")
scores = {t: cosine(img, v) for t, v in captions.items()}
for t, s in scores.items():
print(f" {s:+.3f} {t}")
best = max(scores, key=scores.get)
print(f"\nzero-shot prediction (argmax cosine): \"{best}\"")
print("the dog caption wins — no ImageNet label, just words vs pixels.")
CLIP gave vision and language a common coordinate system; the next step is putting them in one model that can talk back. Chapter 02 covers multimodal LLMs — how a CLIP-style vision encoder is stitched to a language model through a projection bridge, what visual instruction tuning trains, and how a single network learns to caption, answer questions about, and reason over images.
References
- Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale.
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision.
- He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition.
- Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
- Russakovsky, O. et al. (2015). ImageNet Large Scale Visual Recognition Challenge.
- Carion, N. et al. (2020). End-to-End Object Detection with Transformers (DETR).
- Kirillov, A. et al. (2023). Segment Anything.