MULTIMODAL & WORLD MODELS · CHAPTER 02 / 06

Multimodal LLMs

A transformer treats its tokens as vectors to attend over, regardless of what they encode. An image becomes attendable once it is sliced into patches and each patch is projected into the model's embedding space, after which self-attention mixes pixels and words in the same residual stream. This chapter covers the projection, the CLIP, Flamingo, and LLaVA lineage, the early-fusion versus cross-attention split, and how vision-language models are trained and evaluated.

LEVEL: CORE
READING TIME: ≈ 26 MIN
BUILDS ON: MULTIMODAL 01 · VOL II ATTENTION
INSTRUMENTS: PATCH PROJECTION · FUSION TOGGLE · PATCH ATTENTION
2.1

Why one model for many modalities

A language model is a function from a sequence of token embeddings to a sequence of token embeddings. It never sees characters or words directly — only vectors in \(\mathbb{R}^{d}\), the model width. Self-attention (Vol II · EQ 3.1) mixes those vectors according to how relevant they are to one another; it has no built-in notion of "text." That indifference is the whole opportunity. If you can turn an image into a handful of \(d\)-dimensional vectors, the transformer will attend to them exactly as it attends to words — no new mechanism required, just new tokens.

This is why the dominant design for vision-language models (VLMs) is not a separate vision network bolted to a separate text network with a translation layer between them. It is a single transformer whose context window holds image tokens and text tokens side by side. Asking "what is in this photo?" becomes one autoregressive generation over a sequence that begins with image tokens and continues with the question — the same next-token objective that trained the language model in the first place.

The alternative histories are instructive. Before this convergence, multimodal systems were pipelines: an object detector emitted labels, a caption model turned labels into a sentence, and a separate language model reasoned over the sentence. Every stage threw away information the next stage might have needed, and errors compounded. The transformer's contribution was to collapse the pipeline into one differentiable model where gradients flow from the final answer all the way back to the pixels. The cost is that you must commit, early, to a way of encoding pixels as tokens — and that single choice (covered next) determines almost everything about how the system behaves.

CLAIM

"Multimodal" usually means vision-language, but the recipe is general. Anything you can chop into a sequence and embed into \(\mathbb{R}^{d}\) becomes attendable: audio via spectrogram patches or a learned codec (Whisper-style), video via space-time patches, even depth maps or robot-sensor streams. The transformer is modality-agnostic; the engineering is all in the tokenizer for each new sense. This chapter uses images as the worked example because they are where the field matured first.

2.2

Tokenizing images — patches & projection

Text tokenization splits a string into discrete units and looks each up in an embedding table. Images have no natural discrete units, so the Vision Transformer (ViT) recipe manufactures them: cut the image into a grid of fixed-size square patches, flatten each patch into a vector, and project that vector into the model's embedding space with a single learned linear map. A \(14\times 14\) RGB patch is \(14\cdot 14\cdot 3 = 588\) raw numbers; the projection turns it into one \(d\)-dimensional patch token, the visual analogue of a word embedding.

EQ MM2.1 — PATCH TOKENS BY LINEAR PROJECTION $$ z_p \;=\; W\, \mathrm{flatten}(x_p) \;+\; b \;\in\; \mathbb{R}^{d}, \qquad W \in \mathbb{R}^{d \times (P^2 C)}, \quad p = 1,\ldots,N, \quad N = \frac{HW}{P^2} $$
\(x_p\) is one \(P\times P\) patch with \(C\) color channels; flattening gives a \(P^2 C\) vector. The same shared projection \(W\) maps every patch to a \(d\)-dimensional token — exactly the weight-sharing trick that makes convolution efficient, and indeed this projection is a stride-\(P\) convolution in disguise. An \(H\times W\) image yields \(N = HW/P^2\) tokens: a \(224\times 224\) image at \(P=14\) gives \((224/14)^2 = 16^2 = 256\) patch tokens. Patches carry no inherent order, so a learned positional embedding is added to each \(z_p\) — without it the model could not tell top-left from bottom-right.
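The patchify, flatten, and project steps of EQ MM2.1 can be sketched with a reshape and one matrix multiply. This is a minimal illustration under stated assumptions, not any particular model's code: the helper name `patchify`, the toy width `d = 8`, the random image, and the random stand-in for the learned positional table are all hypothetical.

```python
import numpy as np

def patchify(image: np.ndarray, P: int) -> np.ndarray:
    """Cut an (H, W, C) image into non-overlapping P x P patches.

    Returns an (N, P*P*C) array with N = H*W / P**2, patches in row-major grid order.
    """
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    gh, gw = H // P, W // P                      # patches per column, per row
    x = image.reshape(gh, P, gw, P, C)           # split each spatial axis into (grid, within-patch)
    x = x.transpose(0, 2, 1, 3, 4)               # -> (gh, gw, P, P, C)
    return x.reshape(gh * gw, P * P * C)         # flatten each patch to P*P*C numbers

rng = np.random.default_rng(0)
side, P, C, d = 224, 14, 3, 8                    # d = 8 is a toy width, not a real model's
image = rng.random((side, side, C))              # stand-in for a real RGB image

patches = patchify(image, P)                     # shape (256, 588)
W_proj = rng.normal(0, 0.02, (d, P * P * C))     # shared projection W of EQ MM2.1
b = np.zeros(d)
pos_emb = rng.normal(0, 0.02, (patches.shape[0], d))  # random stand-in for a learned table

tokens = patches @ W_proj.T + b + pos_emb        # one d-dimensional token per patch

print("patches:", patches.shape)                 # (256, 588)
print("tokens :", tokens.shape)                  # (256, 8)
```

Because the same `W_proj` is applied to every patch, the matrix multiply computes the same thing as a convolution with kernel size and stride both equal to `P`.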

Two design knobs dominate, and they trade off against each other. Patch size sets resolution: smaller patches mean more tokens and finer visual detail, but the token count scales as \(1/P^2\) and self-attention cost is quadratic in that count, so halving \(P\) quadruples the tokens and multiplies the attention cost by roughly sixteen. Image resolution sets how much the model can read at all: fine print, small objects, and dense charts demand high resolution, which is why modern VLMs (Qwen-VL, the "AnyRes" line in LLaVA-1.6, native-resolution ViTs) tile large images into many crops or process them at native resolution, feeding hundreds or thousands of patch tokens per image. The honest tension: every patch token competes with text tokens for the same finite context budget, so "see more" and "read more text" are in direct conflict.
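To make the budget tension concrete, the arithmetic below tabulates \(N = HW/P^2\) for a few resolution and patch-size pairs. The 4096-token context window is a hypothetical figure chosen only for illustration, and `patch_tokens` is a helper name introduced here.

```python
def patch_tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens N = HW / P^2 (assumes exact divisibility)."""
    return (height // patch) * (width // patch)

CONTEXT = 4096   # hypothetical context window, in tokens

for side, patch in [(224, 16), (224, 14), (336, 14), (672, 14)]:
    n = patch_tokens(side, side, patch)
    print(f"{side}x{side} @ P={patch}: N={n:5d}  "
          f"attention pairs N^2={n * n:9d}  "
          f"share of context={n / CONTEXT:.1%}")
```

Tripling the side length from 224 to 672 at fixed \(P = 14\) multiplies the token count by nine, which is the scaling that pushes high-resolution inputs toward tiling or token compression.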

A vision encoder produces \(32\) image patches, and each patch is projected by EQ MM2.1 into the language model's token width. How many extra tokens do these patches add to the context sequence?
The projection is one-to-one: every patch becomes exactly one token of width \(d\), regardless of the value of \(d\). So \(32\) patches add 32 tokens. (The width \(d\) changes each token's size, never the token count.)
Using \(N = HW/P^2\): a \(224\times 224\) image is split into \(14\times 14\) patches. How many patch tokens \(N\) does the encoder emit?
Patches per side \(= 224/14 = 16\). Total \(N = 16 \times 16 = \dfrac{224 \times 224}{14^2} = \dfrac{50176}{196} = \) 256. This is exactly the patch-token count of CLIP's ViT-L/14 at \(224^2\) (the encoder also prepends one class token, for 257 in total).
PYTHON · RUNNABLE IN-BROWSER
# EQ MM2.1: project image patch vectors into the LLM token space (linear)
import numpy as np
rng = np.random.default_rng(0)

P, C, d = 2, 3, 8            # 2x2 patches, RGB, language-model width d = 8
patch_dim = P * P * C        # flattened patch length = 4 * 3 = 12
N = 4                        # this image was cut into 4 patches

patches = rng.normal(0, 1, (N, patch_dim))   # N flattened patches
W = rng.normal(0, 0.05, (d, patch_dim))      # shared projection, EQ MM2.1
b = np.zeros(d)

tokens = patches @ W.T + b                    # (N, patch_dim) @ (patch_dim, d)

print("flattened patch length P*P*C :", patch_dim)
print("patches in           :", patches.shape, "(N patches x patch_dim)")
print("image tokens out     :", tokens.shape, "(N tokens x d) <- speakable")
print("\none token (the 1st patch, width d):")
print(np.round(tokens[0], 3))
print("\nN patches -> N tokens: the projection never changes the COUNT.")
INSTRUMENT MM2.1: IMAGE-TO-TOKEN PROJECTION · PATCHIFY → FLATTEN → PROJECT · EQ MM2.1
Readouts: PATCH TOKENS N = HW/P² = 256 · FLATTENED PATCH P²·C = 588 · PROJECTION W (d × P²C)
Left: the image cut into a patch grid. Right: each patch collapses to one column — a single \(d\)-dimensional token. Shrink the patch size and watch the token count explode (it scales as \(1/P^2\)); every one of those tokens then competes with your text for the context window. Raising the model width \(d\) makes each token taller but never adds tokens — count is set by the patch grid alone.
2.3

Architectures — CLIP, Flamingo, LLaVA

Three landmark systems define the design space, and almost every production VLM is a descendant of one of them.

CLIP (2021) is not a generative VLM at all; it is the vision encoder nearly all of them are built on. CLIP trains two towers, an image encoder and a text encoder, on 400M image–caption pairs with a contrastive objective: pull the embedding of an image toward the embedding of its true caption and push it away from every other caption in the batch. The result is a vision encoder whose features are already aligned with language: an image that depicts a dog lands near the text "a dog." The objective aligns whole-image embeddings rather than individual patches, but the patch features that feed that embedding carry language-relevant content too, which is why CLIP features are the standard input to the LLM-based VLMs that followed.

EQ MM2.2 — CLIP CONTRASTIVE OBJECTIVE (IMAGE→TEXT HALF) $$ \mathcal{L}_{i \to t} \;=\; -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(\langle u_i, v_i\rangle / \tau\big)}{\sum_{j=1}^{B} \exp\!\big(\langle u_i, v_j\rangle / \tau\big)} $$
\(u_i\) is the L2-normalized embedding of image \(i\), \(v_j\) of caption \(j\); the score is a cosine similarity scaled by a learned temperature \(\tau\). For each image, this is just a softmax cross-entropy that treats the matching caption as the correct class among all \(B\) captions in the batch, so a bigger batch supplies more negatives per image (with a better chance that some of them are hard) and a sharper signal. The full CLIP loss symmetrizes this with the text→image half and averages the two. No class labels are needed; the captions are the supervision.
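A NumPy sketch of EQ MM2.2 and its text→image mirror follows. It is illustrative only: the temperature is fixed at 0.07 rather than learned, the embeddings are random vectors rather than encoder outputs, and the names `clip_loss`, `txt_matched`, and `txt_shifted` are hypothetical.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, tau: float = 0.07):
    """Symmetric contrastive loss over a batch of B matched (image, caption) rows."""
    u = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)   # L2-normalize images
    v = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)   # L2-normalize captions
    logits = (u @ v.T) / tau                         # (B, B); entry [i, j] = <u_i, v_j> / tau
    diag = np.arange(len(u))                         # the correct "class" for row i is column i
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()     # EQ MM2.2
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()     # text -> image half
    return loss_i2t, loss_t2i, 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
B, dim = 8, 16
img = rng.normal(size=(B, dim))
txt_matched = img + 0.1 * rng.normal(size=(B, dim))  # captions near their own images
txt_shifted = np.roll(txt_matched, 1, axis=0)        # every caption paired with the wrong image

_, _, loss_good = clip_loss(img, txt_matched)
_, _, loss_bad = clip_loss(img, txt_shifted)
print("aligned pairs   :", round(float(loss_good), 4))
print("misaligned pairs:", round(float(loss_bad), 4))
```

The aligned batch scores a much lower loss than the shifted one, which is the behavior the objective is designed to reward.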

Flamingo (2022) showed how to graft vision onto a frozen language model without retraining it. A frozen vision encoder feeds a small "Perceiver Resampler" that compresses a variable number of patch features into a fixed set of visual tokens; these are injected into a frozen LLM through newly inserted gated cross-attention layers. The LLM's own weights never move: only the resampler and the inserted cross-attention layers train. This is the canonical cross-attention design (§2.4), and it gave the first strong few-shot, interleaved image-and-text behavior.
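The gating idea can be sketched in a few lines. The version below assumes a single head, no layer norm, and no gated feed-forward sublayer, so it is a simplification rather than Flamingo's actual layer; the one detail taken from the paper is the tanh gate on a scalar initialized at zero. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, visual, Wq, Wk, Wv):
    """Queries from text (T, d); keys and values from visual tokens (M, d)."""
    Q, K, V = text @ Wq, visual @ Wk, visual @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (T, M), each row sums to 1
    return weights @ V                                   # (T, d)

def gated_cross_attention(text, visual, Wq, Wk, Wv, alpha):
    """Residual cross-attention scaled by tanh(alpha); alpha = 0 makes it a no-op."""
    return text + np.tanh(alpha) * cross_attention(text, visual, Wq, Wk, Wv)

rng = np.random.default_rng(0)
T, M, d = 10, 64, 16                                     # toy sizes
text = rng.normal(size=(T, d))                           # hidden states of the frozen LLM
visual = rng.normal(size=(M, d))                         # resampled visual tokens
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))

at_init = gated_cross_attention(text, visual, Wq, Wk, Wv, alpha=0.0)
opened = gated_cross_attention(text, visual, Wq, Wk, Wv, alpha=1.0)
print("gate closed, output equals input:", np.allclose(at_init, text))
print("gate open, output changed       :", not np.allclose(opened, text))
```

With the gate closed the inserted layer passes the text states through unchanged, so at initialization the frozen LLM behaves exactly as it did before the surgery; training then opens the gate gradually.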

LLaVA (2023) is the design that won on simplicity. It takes a frozen CLIP vision encoder, runs its patch features through a tiny trainable projection (originally one linear layer, later a small MLP) into the LLM's embedding space, and feeds those projected patches in as extra input tokens, prepended to the text tokens. The projection has the same form as EQ MM2.1, applied to encoder features rather than raw pixels. No new attention layers, no architectural surgery: the LLM simply finds image tokens at the front of its context and attends to them with the self-attention it already has. This is the canonical early-fusion design, and its data recipe, GPT-4-generated visual instruction-following conversations, is what made it work.

| System | Year | How vision enters the LLM | Legacy |
|---|---|---|---|
| CLIP | 2021 | n/a (a contrastively trained encoder, not a chat model) | the vision encoder everyone reuses |
| Flamingo | 2022 | resampled visual tokens via gated cross-attention into a frozen LLM | cross-attention VLMs (IDEFICS, Llama 3-V style) |
| LLaVA | 2023 | projected patches prepended as input tokens (early fusion) | the dominant open-VLM recipe; visual instruction tuning |
| BLIP-2 | 2023 | a "Q-Former" learns query tokens that pull info from frozen vision | query-based bridging; efficient adapters |
True or false: LLaVA feeds projected image patches into the language model as extra input tokens (prepended to the text), rather than through dedicated cross-attention layers. (Answer true or false.)
LLaVA's only new module is the small projection of EQ MM2.1; the projected patch tokens are concatenated in front of the text tokens and consumed by the LLM's existing self-attention. It adds no cross-attention layers. That is precisely the early-fusion approach, so the statement is true. (Flamingo, by contrast, uses cross-attention.)
2.4

Cross-attention vs early fusion

Everything above reduces to one architectural fork: where do image tokens meet text tokens?

  • Early fusion (LLaVA-style). Image tokens are concatenated with text tokens into one sequence; ordinary self-attention lets every text token attend to every image token and vice versa, in every layer. Maximum interaction, minimal new code — but the image tokens occupy real context-window slots and inflate the self-attention cost, which is quadratic in total sequence length.
  • Cross-attention (Flamingo-style). Text tokens stay the only entries in the main sequence; image features are kept in a separate memory that the text attends into through dedicated cross-attention layers inserted between the LLM's blocks. The text sequence length is unchanged, so the language model's self-attention cost is untouched and a frozen LLM's weights can be preserved. The price is new parameters and a less symmetric flow of information.
EQ MM2.3 — THE FORK: WHO ATTENDS TO WHOM $$ \textbf{early fusion:}\;\; \mathrm{SelfAttn}\big([\,z_{1:N}^{\text{img}};\, e_{1:T}^{\text{txt}}\,]\big), \qquad \textbf{cross-attention:}\;\; \mathrm{CrossAttn}\big(Q{=}e^{\text{txt}},\; K,V{=}z^{\text{img}}\big) $$
In early fusion a single concatenated sequence of length \(N+T\) goes through self-attention, so attention cost grows as \((N+T)^2\) and the \(N\) image tokens are spent from the context budget. In cross-attention the queries come only from the \(T\) text tokens while keys and values come from the \(N\) image tokens, so each inserted layer costs \(N\,T\) on top of the unchanged \(T^2\) text self-attention, leaving the text length \(T\) and the base LLM untouched. Early fusion trades context budget for simplicity and tighter image↔text mixing; cross-attention trades extra parameters for an unmodified, context-cheap language model. Most open models since 2023 chose early fusion for its simplicity; several large frontier systems use cross-attention to bolt vision onto an already-trained text model.
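A rough count of attention scores makes the fork tangible. The sketch below counts one self-attention layer for early fusion against one text self-attention layer plus one cross-attention layer for the alternative; it ignores causal masking, head count, and the fact that real models insert cross-attention only at some depths. The sizes \(N = 576\) and \(T = 64\) and the function names are assumptions for illustration.

```python
def early_fusion_scores(n_img: int, n_txt: int) -> int:
    """Scores in one self-attention layer over the concatenated sequence."""
    return (n_img + n_txt) ** 2

def cross_attention_scores(n_img: int, n_txt: int) -> int:
    """Scores in one text self-attention layer plus one text->image cross-attention layer."""
    return n_txt * n_txt + n_img * n_txt

N, T = 576, 64        # e.g. a 336x336 image at P=14, and a short prompt
ef = early_fusion_scores(N, T)
ca = cross_attention_scores(N, T)
print("early fusion   :", ef)          # 409600
print("cross-attention:", ca)          # 40960
print("ratio          :", ef / ca)     # 10.0
```

The gap widens as \(N\) grows relative to \(T\), because early fusion pays for image-to-image attention in every layer while the cross-attention wiring never computes it inside the LLM.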
INSTRUMENT MM2.2: EARLY FUSION vs CROSS-ATTENTION · TOGGLE THE WIRING · EQ MM2.3
Readouts (early-fusion setting): SEQUENCE INTO SELF-ATTN = N + T · ATTENTION COST = (N+T)² · BASE LLM = modified
Toggle the two wirings. Early fusion drops image tokens straight into the one sequence the LLM already self-attends over — green and grey tokens share every layer. Cross-attention keeps the text sequence pure and lets it reach into a separate image memory through inserted layers, so the base model and its context length stay untouched. Watch the cost readout flip from \((N{+}T)^2\) to \(N\,T\).

A third family sits between them: query-based resamplers (Flamingo's Perceiver Resampler, BLIP-2's Q-Former). These first compress hundreds of patch features into a small fixed number of learned query tokens, then feed that handful into the LLM — by cross-attention or as input tokens. The point is decoupling: the visual token count the LLM sees no longer scales with image resolution, which is the cleanest answer to the context-budget tension from §2.2. The contested part is quality — heavy compression can drop fine detail, and several 2024–2025 models reverted to feeding many raw patches because reading text-in-images demanded it.
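The decoupling can be shown with a one-layer, single-head stand-in for a resampler: a fixed set of learned queries cross-attends to however many patch features arrive. Real Perceiver Resamplers and Q-Formers stack several layers and add further components, so this is only a sketch of the core idea; the query count `M = 32`, the width `d = 16`, and the name `resample` are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(patch_feats, queries, Wk, Wv):
    """Compress (N, d) patch features into (M, d) tokens using M learned queries."""
    K, V = patch_feats @ Wk, patch_feats @ Wv
    weights = softmax(queries @ K.T / np.sqrt(queries.shape[-1]))   # (M, N)
    return weights @ V                                              # (M, d)

rng = np.random.default_rng(0)
d, M = 16, 32
queries = rng.normal(size=(M, d))                 # random stand-in for trained queries
Wk = rng.normal(0, 0.1, (d, d))
Wv = rng.normal(0, 0.1, (d, d))

for N in (256, 2304):                             # low-res and high-res patch counts
    out = resample(rng.normal(size=(N, d)), queries, Wk, Wv)
    print(N, "patch features ->", out.shape[0], "visual tokens")
```

The output length is set by the number of queries alone, so a ninefold increase in patch features leaves the LLM's visual token budget unchanged; each output token is a weighted average over patches, which is also where fine detail can get blurred away.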

2.5

Training & evaluating VLMs

Whatever the wiring, the modern LLaVA-style recipe trains in two stages, almost always on top of a pre-trained language model and a pre-trained (usually CLIP) vision encoder — so the expensive learning is already paid for:

  • Stage 1: alignment / pre-training. Freeze both the vision encoder and the LLM; train only the projection (EQ MM2.1) on a large pile of image–caption pairs, with the next-token objective on the caption. This teaches the projection to place visual tokens where the LLM expects related words to live. It is cheap, fast, and stabilizing.
  • Stage 2: visual instruction tuning. Keep training the projection and (usually) unfreeze the LLM, then fine-tune on multimodal instruction-following data: image + question → answer, multi-turn visual chat, OCR, charts, grounding. This is where the model learns to follow instructions about images, not merely caption them. LLaVA's key insight was that you can bootstrap this data by prompting a strong text LLM with image annotations to write the conversations.

The mechanics — autoregressive cross-entropy over the answer tokens only, image tokens masked out of the loss — are identical to text fine-tuning (Vol II · CH 06). The image tokens are context, not targets; you never ask the model to "predict the next patch."
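The masking can be sketched as follows. For brevity the example assumes `logits[t]` is already the prediction for `targets[t]` (the usual shift by one position is omitted), uses random logits in place of a model, and puts a placeholder id at image positions; the function name and toy sizes are hypothetical.

```python
import numpy as np

def masked_next_token_loss(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask == 1 (the answer tokens)."""
    logits = logits - logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]               # per-position loss
    return float((nll * loss_mask).sum() / loss_mask.sum())

rng = np.random.default_rng(0)
N, Q, A, V = 4, 3, 5, 50          # image tokens, question tokens, answer tokens, vocab size
L = N + Q + A

logits = rng.normal(size=(L, V))  # stand-in for model outputs
targets = rng.integers(0, V, size=L)
targets[:N] = 0                   # placeholder ids: image positions have no vocabulary target

loss_mask = np.zeros(L)
loss_mask[N + Q:] = 1.0           # supervise the answer span only

loss = masked_next_token_loss(logits, targets, loss_mask)

# Changing predictions at image/question positions must not move the loss.
perturbed = logits.copy()
perturbed[:N + Q] += rng.normal(size=(N + Q, V))
loss_same = masked_next_token_loss(perturbed, targets, loss_mask)
print("loss unchanged by masked positions:", np.isclose(loss, loss_same))
```

The image and question positions still shape the answer through attention in a real model; the mask only removes them as prediction targets.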

Evaluation is the genuinely hard part, and the field is openly uncomfortable with the state of it. The standard suites probe different skills: VQAv2 and GQA (visual question answering), TextVQA and DocVQA (reading text in images), ChartQA (chart reasoning), MMMU (college-level multimodal reasoning), MME and MMBench (broad capability batteries), and POPE (object-hallucination probing). Three caveats that experts will always raise: (1) many benchmarks are contaminated or leak into web-scale training data, inflating scores; (2) answer-matching is brittle — a correct free-form answer can be marked wrong for phrasing, so LLM-graded "judges" are increasingly used, with their own biases; and (3) the most consequential failure mode, hallucination — confidently describing objects that are not in the image — is precisely what the headline accuracy numbers hide, which is why POPE and similar adversarial probes exist. A VLM that scores well on VQA can still invent a clock on an empty wall.
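A POPE-style probe asks yes/no questions of the form "is there an X in the image?" and scores the answers as binary classification. The sketch below computes the usual summary numbers on a tiny invented example; the six predictions and labels are fabricated toy data for illustration, not results from any model, and `pope_style_metrics` is a hypothetical helper.

```python
def pope_style_metrics(preds, labels):
    """Binary metrics for yes/no object probes; True means 'yes, the object is present'."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    tn = sum((not p) and (not l) for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),   # well above the true rate suggests a "yes" bias
    }

# Toy data: three objects are present, three are absent; the model says "yes" five times.
labels = [True, True, True, False, False, False]
preds = [True, True, True, True, True, False]

for name, value in pope_style_metrics(preds, labels).items():
    print(f"{name:9s} {value:.3f}")
```

In this toy case recall is perfect while precision is 0.6 and the yes-ratio is 5/6 against a true rate of 1/2: the model finds every real object and also affirms two that are absent, which is the hallucination pattern a plain accuracy figure would blur.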

PYTHON · RUNNABLE IN-BROWSER
# Fuse image tokens + text tokens into ONE sequence; print the shapes
import numpy as np
rng = np.random.default_rng(0)

d = 8                                    # shared model width
N, T = 32, 10                            # 32 image patch tokens, 10 text tokens

img_tokens  = rng.normal(0, 1, (N, d))   # projected patches (EQ MM2.1 output)
text_tokens = rng.normal(0, 1, (T, d))   # word embeddings, same width d

# early fusion = concatenate along the sequence axis (axis 0)
fused = np.concatenate([img_tokens, text_tokens], axis=0)

print("image tokens :", img_tokens.shape,  "(N x d)")
print("text  tokens :", text_tokens.shape, "(T x d)")
print("fused seq    :", fused.shape, "( (N+T) x d ) <- one sequence")
print("sequence length N + T :", N + T)
print("self-attention matrix :", (N + T), "x", (N + T),
      "=", (N + T) ** 2, "scores")
print("\nthe LLM now attends over all", N + T,
      "tokens with the SAME self-attention it uses for text.")
INSTRUMENT MM2.3: VLM ATTENTION OVER IMAGE PATCHES · A TEXT TOKEN LOOKS AT A 7×7 PATCH GRID · SOFTMAX
Readouts: MASS ON TOP PATCH · PATCHES OVER 5% · ATTENTION ENTROPY
A single text token (the query word) attends over a \(7\times 7\) grid of image patches; brighter = more attention. The weights are a softmax over patch relevance, so they always sum to 1 — attention routes the text token's read across the image, it never invents pixels. Pick a different word and the bright region moves to the matching object. Drop the temperature toward 0 and the read sharpens to a near-hard lookup of one patch; raise it and attention diffuses to a uniform blur over the whole image.
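The three readouts can be reproduced offline with a temperature-scaled softmax over 49 relevance scores. The scores here are random numbers standing in for query-key dot products, and the helper name `attention_readout` is hypothetical, so only the qualitative trend matters.

```python
import numpy as np

def attention_readout(scores: np.ndarray, temperature: float):
    """Softmax over patch scores, plus the instrument's three summary numbers."""
    z = scores / temperature
    z = z - z.max()                                   # numerical stability
    w = np.exp(z) / np.exp(z).sum()                   # attention weights, sum to 1
    top_mass = float(w.max())                         # mass on the top patch
    over_5pct = int((w > 0.05).sum())                 # patches holding more than 5%
    entropy = float(-(w * np.log(w + 1e-12)).sum())   # in nats; maximum is log(49)
    return w.reshape(7, 7), top_mass, over_5pct, entropy

rng = np.random.default_rng(0)
scores = rng.normal(size=49)                          # stand-in relevance of each patch

for temp in (0.1, 1.0, 10.0):
    grid, top, count, ent = attention_readout(scores, temp)
    print(f"temperature={temp:4.1f}  top mass={top:.3f}  "
          f"patches>5%={count:2d}  entropy={ent:.3f}  (max {np.log(49):.3f})")
```

Lower temperatures concentrate the mass on one patch and drive the entropy down; higher temperatures push the weights toward uniform and the entropy toward \(\log 49\).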
HONEST CAVEAT

A high benchmark score is not the absence of hallucination. The cross-entropy objective rewards fluent, plausible answers; nothing in it grounds claims to actually-present pixels. Object-hallucination probes (POPE), grounding metrics, and human review of free-form outputs catch failures that VQA accuracy launders away. Treat any single multimodal number with suspicion — and never report one without a hallucination probe beside it.

NEXT

So far the image only flowed in; next it flows out. Chapter 03 turns to generation — diffusion and autoregressive image/video models — where the same token-and-attention machinery is run in reverse to produce pixels rather than read them.

2.R

References

  1. Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021 — the contrastive image–text encoder (EQ MM2.2) that nearly every VLM reuses as its visual front end.
  2. Alayrac, J.-B. et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022 — gated cross-attention into a frozen LLM; the canonical cross-attention design (§2.4).
  3. Liu, H., Li, C., Wu, Q. & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023 — projected patches as input tokens (EQ MM2.1) plus LLM-bootstrapped instruction data; the dominant early-fusion recipe.
  4. Dosovitskiy, A. et al. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021 — the patchify-and-project tokenization (§2.2) that makes images attendable.
  5. Li, J., Li, D., Savarese, S. & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs. ICML 2023 — the Q-Former, a query-based bridge between frozen vision and frozen language.
  6. Li, Y. et al. (2023). Evaluating Object Hallucination in Large Vision-Language Models (POPE). EMNLP 2023 — the object-hallucination probe behind the §2.5 evaluation caveats.
  7. Yue, X. et al. (2024). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark. CVPR 2024 — college-level multimodal reasoning, a current frontier evaluation.