Why one model for many modalities
A language model is a function from a sequence of token embeddings to a sequence of token embeddings. It never sees characters or words directly — only vectors in \(\mathbb{R}^{d}\), the model width. Self-attention (Vol II · EQ 3.1) mixes those vectors according to how relevant they are to one another; it has no built-in notion of "text." That indifference is the whole opportunity. If you can turn an image into a handful of \(d\)-dimensional vectors, the transformer will attend to them exactly as it attends to words — no new mechanism required, just new tokens.
This is why the dominant design for vision-language models (VLMs) is not a separate vision network bolted to a separate text network with a translation layer between them. It is a single transformer whose context window holds image tokens and text tokens side by side. Asking "what is in this photo?" becomes one autoregressive generation over a sequence that begins with image tokens and continues with the question — the same next-token objective that trained the language model in the first place.
The alternative histories are instructive. Before this convergence, multimodal systems were pipelines: an object detector emitted labels, a caption model turned labels into a sentence, and a separate language model reasoned over the sentence. Every stage threw away information the next stage might have needed, and errors compounded. The transformer's contribution was to collapse the pipeline into one differentiable model where gradients flow from the final answer all the way back to the pixels. The cost is that you must commit, early, to a way of encoding pixels as tokens — and that single choice (covered next) determines almost everything about how the system behaves.
"Multimodal" usually means vision-language, but the recipe is general. Anything you can chop into a sequence and embed into \(\mathbb{R}^{d}\) becomes attendable: audio via spectrogram patches or a learned codec (Whisper-style), video via space-time patches, even depth maps or robot-sensor streams. The transformer is modality-agnostic; the engineering is all in the tokenizer for each new sense. This chapter uses images as the worked example because they are where the field matured first.
Tokenizing images — patches & projection
Text tokenization splits a string into discrete units and looks each up in an embedding table. Images have no natural discrete units, so the Vision Transformer (ViT) recipe manufactures them: cut the image into a grid of fixed-size square patches, flatten each patch into a vector, and project that vector into the model's embedding space with a single learned linear map. A \(14\times 14\) RGB patch is \(14\cdot 14\cdot 3 = 588\) raw numbers; the projection turns it into one \(d\)-dimensional patch token, the visual analogue of a word embedding.
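The cut-and-flatten step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the ViT reference implementation: the `patchify` helper, the 28×28 toy image, and the row-major patch order are all assumptions made for the example.

```python
import numpy as np

def patchify(image: np.ndarray, P: int) -> np.ndarray:
    """Cut an (H, W, C) image into non-overlapping P x P patches and flatten each.

    Hypothetical helper for illustration; assumes H and W are multiples of P.
    Returns shape (num_patches, P * P * C), patches in row-major grid order.
    """
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image sides must be multiples of P"
    # Split each spatial axis into (grid index, offset inside the patch).
    grid = image.reshape(H // P, P, W // P, P, C)
    # Move the two grid axes to the front: (rows, cols, P, P, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to P*P*C numbers.
    return grid.reshape(-1, P * P * C)

image = np.arange(28 * 28 * 3, dtype=np.float32).reshape(28, 28, 3)  # toy RGB image
patches = patchify(image, P=14)
print(patches.shape)  # a 2x2 grid of 14x14x3 patches
```

Each row of `patches` is one flattened patch of length 588, ready for the learned linear projection described above.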
Two design knobs dominate, and they trade off against each other. Patch size sets granularity: smaller patches mean more tokens and finer visual detail, but the token count scales as \(1/P^2\) and self-attention cost grows with the square of the token count, so halving \(P\) quadruples the tokens and multiplies the attention cost by roughly sixteen. Image resolution sets how much the model can read at all: fine print, small objects, and dense charts demand high resolution, which is why modern VLMs (Qwen-VL, the "AnyRes" line in LLaVA-1.6, native-resolution ViTs) tile large images into many crops or encode them at native resolution, feeding hundreds or thousands of patch tokens per image. The honest tension: every patch token competes with text tokens for the same finite context budget, so "see more" and "read more text" are in direct conflict.
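The arithmetic behind that trade-off is easy to check. The sketch below counts patch tokens and pairwise self-attention scores for a few illustrative resolutions and patch sizes; the `token_budget` helper is hypothetical, and the count ignores text tokens, heads, and layers.

```python
def token_budget(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (patch tokens, pairwise self-attention scores among them)."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens ** 2

# Illustrative settings only: three patch sizes at 224x224, then a larger image.
for side, patch in [(224, 32), (224, 16), (224, 8), (448, 14)]:
    n, scores = token_budget(side, side, patch)
    print(f"{side}x{side} image, P={patch:2d}: {n:5d} tokens, {scores:,} scores")
```

At a fixed resolution, each halving of the patch size multiplies the token count by four and the score count by sixteen.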
```python
# EQ MM2.1: project image patch vectors into the LLM token space (linear)
import numpy as np

rng = np.random.default_rng(0)
P, C, d = 2, 3, 8          # 2x2 patches, RGB, language-model width d = 8
patch_dim = P * P * C      # flattened patch length = 4 * 3 = 12
N = 4                      # this image was cut into 4 patches

patches = rng.normal(0, 1, (N, patch_dim))   # N flattened patches
W = rng.normal(0, 0.05, (d, patch_dim))      # shared projection, EQ MM2.1
b = np.zeros(d)
tokens = patches @ W.T + b                   # (N, patch_dim) @ (patch_dim, d)

print("flattened patch length P*P*C :", patch_dim)
print("patches in :", patches.shape, "(N patches x patch_dim)")
print("image tokens out :", tokens.shape, "(N tokens x d) <- speakable")
print("\none token (the 1st patch, width d):")
print(np.round(tokens[0], 3))
print("\nN patches -> N tokens: the projection never changes the COUNT.")
```
Architectures — CLIP, Flamingo, LLaVA
Three landmark systems define the design space, and almost every production VLM is a descendant of one of them.
CLIP (2021) is not a generative VLM at all; it is the vision encoder that most of them are built on. CLIP trains two towers, an image encoder and a text encoder, on 400M image–caption pairs with a contrastive objective: pull the embedding of an image toward the embedding of its true caption and push it away from every other caption in the batch. The result is a vision encoder whose features are already aligned with language: a photo of a dog lands near the text "a dog." That alignment is why CLIP features are the standard input to the LLM-based VLMs that followed.
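The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a NumPy illustration under stated assumptions, not CLIP's training code: the function name is hypothetical, the embeddings are random stand-ins, and the fixed temperature of 0.07 is an illustrative value (CLIP learns its temperature).

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    img_emb, txt_emb: (B, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B); diagonal = true pairs

    def cross_entropy_diag(l):
        # Row-wise log-softmax, then average the diagonal (correct-pair) entries.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Image-to-text and text-to-image directions, averaged.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 16))
aligned = txt + 0.01 * rng.normal(size=(4, 16))  # images close to their captions
shuffled = aligned[[1, 2, 3, 0]]                 # every image paired with a wrong caption

loss_aligned = clip_contrastive_loss(aligned, txt)
loss_shuffled = clip_contrastive_loss(shuffled, txt)
print("aligned  pairs loss:", loss_aligned)
print("shuffled pairs loss:", loss_shuffled)
```

Matched pairs give a much lower loss than mismatched ones, which is the pressure that pulls image and caption embeddings together during training.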
Flamingo (2022) showed how to graft vision onto a frozen language model without retraining it. A frozen vision encoder feeds a small "Perceiver Resampler" that compresses a variable number of patch features into a fixed set of visual tokens; these are injected into a frozen LLM through newly inserted gated cross-attention layers. The LLM's own weights never move; only the resampler and the new cross-attention layers train. This is the canonical cross-attention design (§2.4), and it gave the first strong few-shot, interleaved image-and-text behavior.
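A gated cross-attention layer can be sketched as ordinary cross-attention whose output is scaled by a tanh gate before the residual add. The single-head NumPy version below is an assumption-laden simplification (no multi-head split, no feed-forward sublayer, random weights), and every name in it is hypothetical. It shows the property that matters: with the gate at zero, the frozen LLM's hidden states pass through untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, visual, Wq, Wk, Wv, gate):
    """Single-head cross-attention with a tanh-gated residual.

    text:   (T, d) hidden states from the frozen LLM (queries)
    visual: (K, d) visual tokens from the resampler (keys and values)
    gate:   scalar; tanh(gate) scales how much visual signal is added
    """
    q, k, v = text @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, K) attention weights
    return text + np.tanh(gate) * (attn @ v)        # gated residual connection

rng = np.random.default_rng(0)
T, K, d = 5, 3, 8
text, visual = rng.normal(size=(T, d)), rng.normal(size=(K, d))
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))

at_init = gated_cross_attention(text, visual, Wq, Wk, Wv, gate=0.0)
opened = gated_cross_attention(text, visual, Wq, Wk, Wv, gate=1.0)
print("gate = 0 leaves the text states unchanged:", np.allclose(at_init, text))
print("gate = 1 changes them:", not np.allclose(opened, text))
```

Starting the gate at zero means the adapted model initially behaves exactly like the original language model, and visual information is blended in only as the gate is learned.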
LLaVA (2023) is the design that won on simplicity. It takes a frozen CLIP vision encoder, runs its patch features through a tiny trainable projection (originally one linear layer, later a small MLP) into the LLM's embedding space — exactly EQ MM2.1 — and feeds those projected patches in as extra input tokens, prepended to the text tokens. No new attention layers, no architectural surgery: the LLM simply finds image tokens at the front of its context and attends to them with the self-attention it already has. This is the canonical early-fusion design, and its data recipe — GPT-4-generated visual instruction-following conversations — is what made it work.
| System | Year | How vision enters the LLM | Legacy |
|---|---|---|---|
| CLIP | 2021 | n/a — a contrastively trained encoder, not a chat model | the vision encoder everyone reuses |
| Flamingo | 2022 | resampled visual tokens via gated cross-attention into a frozen LLM | cross-attention VLMs (IDEFICS, Llama 3.2 Vision) |
| LLaVA | 2023 | projected patches prepended as input tokens (early fusion) | the dominant open-VLM recipe; visual instruction tuning |
| BLIP-2 | 2023 | a "Q-Former" learns query tokens that pull info from frozen vision | query-based bridging; efficient adapters |
Cross-attention vs early fusion
Everything above reduces to one architectural fork: where do image tokens meet text tokens?
- Early fusion (LLaVA-style). Image tokens are concatenated with text tokens into one sequence; ordinary self-attention lets every text token attend to every image token and vice versa, in every layer. Maximum interaction, minimal new code — but the image tokens occupy real context-window slots and inflate the self-attention cost, which is quadratic in total sequence length.
- Cross-attention (Flamingo-style). Text tokens stay the only entries in the main sequence; image features are kept in a separate memory that the text attends into through dedicated cross-attention layers inserted between the LLM's blocks. The text sequence length is unchanged, so the language model's self-attention cost is untouched and a frozen LLM's weights can be preserved. The price is new parameters and a less symmetric flow of information.
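The cost difference between the two designs is a matter of counting attention scores. The sketch below does that for a single layer and a single head, ignoring causal masking; the helper names and the token counts are illustrative assumptions.

```python
def early_fusion_scores(n_img: int, n_txt: int) -> int:
    """Self-attention over one fused sequence of length N + T."""
    return (n_img + n_txt) ** 2

def cross_attention_scores(n_img: int, n_txt: int) -> int:
    """Text self-attention (T x T) plus text-to-image cross-attention (T x N)."""
    return n_txt ** 2 + n_txt * n_img

N, T = 576, 64  # illustrative: 576 image tokens, 64 text tokens
fused_cost = early_fusion_scores(N, T)
cross_cost = cross_attention_scores(N, T)
print("early fusion    :", fused_cost)
print("cross-attention :", cross_cost)
```

With these numbers the fused sequence computes ten times as many scores, because image tokens also attend to one another and to the text. That extra interaction is what early fusion pays for.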
A third family sits between them: query-based resamplers (Flamingo's Perceiver Resampler, BLIP-2's Q-Former). These first compress hundreds of patch features into a small fixed number of learned query tokens, then feed that handful into the LLM — by cross-attention or as input tokens. The point is decoupling: the visual token count the LLM sees no longer scales with image resolution, which is the cleanest answer to the context-budget tension from §2.2. The contested part is quality — heavy compression can drop fine detail, and several 2024–2025 models reverted to feeding many raw patches because reading text-in-images demanded it.
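The decoupling can be seen in a stripped-down resampler: a fixed set of query vectors attends over however many patch features arrive and always returns the same number of tokens. This is a single-head sketch with no learned projections and random queries standing in for learned ones; it is not the Perceiver Resampler or Q-Former architecture, and `resample` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(patch_feats, queries):
    """Compress (N, d) patch features into (K, d) tokens using K query vectors."""
    scores = queries @ patch_feats.T / np.sqrt(queries.shape[-1])  # (K, N)
    return softmax(scores) @ patch_feats                           # (K, d)

rng = np.random.default_rng(0)
d, K = 8, 4
queries = rng.normal(size=(K, d))  # would be learned; random here

for N in (64, 256, 1024):
    out = resample(rng.normal(size=(N, d)), queries)
    print(f"{N:4d} patch features -> {out.shape[0]} visual tokens")
```

Each output token is a weighted average of the patch features, which is also where the quality concern comes from: detail that no query attends to is averaged away.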
Training & evaluating VLMs
Whatever the wiring, the modern LLaVA-style recipe trains in two stages, almost always on top of a pre-trained language model and a pre-trained (usually CLIP) vision encoder — so the expensive learning is already paid for:
- Stage 1: alignment / pre-training. Freeze both the vision encoder and the LLM; train only the projection (EQ MM2.1) on a large pile of image–caption pairs, with the next-token objective on the caption. This teaches the projection to place visual tokens where the LLM expects related words to live. It is cheap, fast, and stabilizing.
- Stage 2: visual instruction tuning. Keep training the projection and (usually) unfreeze the LLM; in the original LLaVA the vision encoder stays frozen. Fine-tune on multimodal instruction-following data: image + question → answer, multi-turn visual chat, OCR, charts, grounding. This is where the model learns to follow instructions about images, not merely caption them. LLaVA's key insight was that you can bootstrap this data by prompting a strong text LLM with image annotations to write the conversations.
The mechanics — autoregressive cross-entropy over the answer tokens only, image tokens masked out of the loss — are identical to text fine-tuning (Vol II · CH 06). The image tokens are context, not targets; you never ask the model to "predict the next patch."
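The masking can be sketched as a per-token cross-entropy multiplied by a 0/1 mask. This NumPy version assumes logits and targets are already aligned (row i predicts `targets[i]`); the function name, the toy sizes, and the random values are all hypothetical.

```python
import numpy as np

def masked_next_token_loss(logits, targets, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1.

    logits:    (L, V) scores; row i is the prediction for targets[i]
    targets:   (L,) token ids (any placeholder id at masked positions)
    loss_mask: (L,) 1 for answer tokens, 0 for image and prompt positions
    """
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
n_img, n_prompt, n_answer, V = 4, 3, 2, 10
L = n_img + n_prompt + n_answer
logits = rng.normal(size=(L, V))
targets = rng.integers(0, V, size=L)
loss_mask = np.array([0] * (n_img + n_prompt) + [1] * n_answer)

loss = masked_next_token_loss(logits, targets, loss_mask)

# Changing the predictions at masked positions must not change the loss.
perturbed = logits.copy()
perturbed[: n_img + n_prompt] += rng.normal(size=(n_img + n_prompt, V))
loss_perturbed = masked_next_token_loss(perturbed, targets, loss_mask)
print("loss unchanged by masked positions:", np.isclose(loss, loss_perturbed))
```

In PyTorch the same effect is commonly obtained by setting masked targets to the `ignore_index` value of the cross-entropy loss instead of multiplying by an explicit mask.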
Evaluation is the genuinely hard part, and the field is openly uncomfortable with the state of it. The standard suites probe different skills: VQAv2 and GQA (visual question answering), TextVQA and DocVQA (reading text in images), ChartQA (chart reasoning), MMMU (college-level multimodal reasoning), MME and MMBench (broad capability batteries), and POPE (object-hallucination probing). Three caveats that experts will always raise: (1) many benchmarks are contaminated or leak into web-scale training data, inflating scores; (2) answer-matching is brittle — a correct free-form answer can be marked wrong for phrasing, so LLM-graded "judges" are increasingly used, with their own biases; and (3) the most consequential failure mode, hallucination — confidently describing objects that are not in the image — is precisely what the headline accuracy numbers hide, which is why POPE and similar adversarial probes exist. A VLM that scores well on VQA can still invent a clock on an empty wall.
```python
# Fuse image tokens + text tokens into ONE sequence; print the shapes
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # shared model width
N, T = 32, 10                           # 32 image patch tokens, 10 text tokens

img_tokens = rng.normal(0, 1, (N, d))   # projected patches (EQ MM2.1 output)
text_tokens = rng.normal(0, 1, (T, d))  # word embeddings, same width d

# early fusion = concatenate along the sequence axis (axis 0)
fused = np.concatenate([img_tokens, text_tokens], axis=0)

print("image tokens :", img_tokens.shape, "(N x d)")
print("text tokens :", text_tokens.shape, "(T x d)")
print("fused seq :", fused.shape, "( (N+T) x d ) <- one sequence")
print("sequence length N + T :", N + T)
print("self-attention matrix :", (N + T), "x", (N + T),
      "=", (N + T) ** 2, "scores")
print("\nthe LLM now attends over all", N + T,
      "tokens with the SAME self-attention it uses for text.")
```
A high benchmark score is not the absence of hallucination. The cross-entropy objective rewards fluent, plausible answers; nothing in it grounds claims to actually-present pixels. Object-hallucination probes (POPE), grounding metrics, and human review of free-form outputs catch failures that VQA accuracy launders away. Treat any single multimodal number with suspicion — and never report one without a hallucination probe beside it.
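A hallucination probe of the yes/no kind reduces to a handful of classification metrics. The sketch below scores answers to questions of the form "Is there a <object> in the image?" in a POPE-like style; the helper name and the toy data are hypothetical, and the metric set is an assumption modelled on that style of probe, not a reimplementation of the benchmark.

```python
def yes_no_metrics(predictions, labels):
    """Accuracy, precision, recall, F1, and yes-ratio for yes/no object probes.

    predictions, labels: lists of "yes"/"no" strings; "yes" is the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }

# Hypothetical model that answers "yes" to everything, on a balanced probe set.
labels = ["yes", "no"] * 4
predictions = ["yes"] * 8
metrics = yes_no_metrics(predictions, labels)
print(metrics)
```

Perfect recall next to a yes-ratio of 1.0 is the signature of a model that affirms every object it is asked about, which is why these numbers belong beside any headline accuracy.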
So far the image only flowed in; next it flows out. Chapter 03 turns to generation — diffusion and autoregressive image/video models — where the same token-and-attention machinery is run in reverse to produce pixels rather than read them.
References
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP).
- Alayrac, J.-B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning.
- Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA).
- Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT).
- Li, J., Li, D., Savarese, S. & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs.
- Li, Y. et al. (2023). Evaluating Object Hallucination in Large Vision-Language Models (POPE).
- Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark.