Two phases, two physics
A request lives twice. Prefill processes the whole prompt in one parallel pass — big matmuls, compute-bound, the GPU happy. Decode then emits one token at a time — each step reads all weights and the entire KV cache to produce a single vector of logits. The diagnostic quantity is arithmetic intensity:
# Decode roofline: aggregate tok/s vs batch, 70B bf16 on H100
import numpy as np
BW, PEAK, N, BYTES = 3.35e12, 989e12, 70e9, 2 # HBM B/s, FLOP/s, params, bf16
batches = 2 ** np.arange(0, 11) # 1 ... 1024
agg = []
print("batch | intensity I | regime | aggregate tok/s")
for b in batches:
I = 2.0 * b / BYTES # ~2b FLOPs per 2 bytes moved
attained = min(PEAK, BW * I) # the roofline (EQ 8.1)
toks = attained / (2 * N) # 2N FLOPs per token
agg.append(toks)
regime = "compute-bound " if attained >= PEAK else "bandwidth-bound"
print(f"{b:5d} | {I:11.0f} | {regime} | {toks:12,.0f}")
print(f"\nridge at I* = {PEAK/BW:.0f} FLOPs/byte (batch ~295): below it each")
print("batch doubling doubles aggregate tok/s for free; above it you only")
print("trade per-user latency. This table is the economics of every API.")
plot_xy(np.log2(batches), agg)
Sampling: from distribution to token
The model hands you \(p(x_t \mid x_{<t})\); the sampler decides. Greedy argmax loops and degenerates on open-ended text; pure sampling wanders into the distribution's noisy tail. Production decoding shapes the distribution first:
# The sampler: temperature + top-p, 2000 draws vs the ideal
import numpy as np
rng = np.random.default_rng(0)
toks = ["Paris", "the", "a", "located", "Lyon", "Berlin"]
z = np.array([5.0, 2.6, 2.2, 1.4, 0.8, -1.0]) # toy logits
def shape(z, tau, top_p):
p = np.exp(z / tau); p /= p.sum()
order = np.argsort(p)[::-1]
keep = order[: np.searchsorted(np.cumsum(p[order]), top_p) + 1]
q = np.zeros_like(p); q[keep] = p[keep]
return q / q.sum() # EQ 8.2: shape, cut, renorm
for tau, top_p in [(0.5, 0.95), (1.5, 0.95)]:
q = shape(z, tau, top_p)
draws = rng.choice(len(z), 2000, p=q).astype(np.intp)
freq = np.bincount(draws, minlength=len(z)) / 2000
print(f"tau={tau} top-p={top_p}")
for t, qi, fi in zip(toks, q, freq):
print(f" {t:8s} ideal {qi:.3f} drawn {fi:.3f} {'#' * int(40*fi)}")
print("cold tau collapses onto 'Paris'; hot tau lets the tail into the")
print("lottery (Lyon survives, Berlin is cut by top-p) -- exactly how a")
print("hallucination does or does not get sampled into existence.")
PagedAttention: virtual memory for the KV cache
Early servers reserved one contiguous KV buffer per request at maximum possible length — internal fragmentation wasted 60–80% of cache memory. vLLM's PagedAttention imported the operating-system playbook: carve the cache into fixed-size blocks (~16 tokens), allocate on demand, and let a block table map each sequence's logical positions to scattered physical blocks.
- Near-zero fragmentation ⇒ 2–4× more concurrent sequences on the same GPU — the single largest throughput win in serving history.
- Copy-on-write sharing: parallel samples and beams share their common prefix physically; only divergent blocks are copied.
- Prefix caching: system prompts, few-shot preambles and conversation history persist as shared blocks across requests — long-system-prompt apps see prefill drop by 10× (this is the mechanism behind API “prompt caching” discounts).
Same idea, next level: RadixAttention (SGLang) organizes cached prefixes in a radix tree for automatic reuse across arbitrary branching conversations and agent trees.
Continuous batching
Batching is how decode escapes the bandwidth wall — weights are read once per step for the whole batch. The naïve version (static batching: wait, run all to completion) dies on variance: one 2,000-token response holds 31 finished requests hostage. Continuous (in-flight) batching schedules at the iteration level:
- Every decode step, finished sequences exit the batch immediately and queued requests join — the batch composition changes step to step.
- Chunked prefill splits long prompts into slices interleaved with ongoing decodes, so a giant document upload doesn't spike everyone's inter-token latency.
- The scheduler's whole life is the throughput–latency frontier: deeper batches raise tokens/s/GPU but stretch each user's TPOT. SLO-aware schedulers ride that curve explicitly.
Speculative decoding: guess cheap, verify exact
Decode wastes a full model read on one token — unless you verify several proposed tokens in a single pass. A small draft model (or extra prediction heads: Medusa, EAGLE; or the model's own MTP heads, as in DeepSeek-V3) proposes \(K\) tokens; the target model scores them all at once — that's a prefill-shaped, compute-cheap operation — and a rejection-sampling rule keeps the output distribution exactly the target's:
# Speculative decoding simulator -- K=4 draft, accept p=0.8
import numpy as np
rng = np.random.default_rng(0)
p, K, rounds = 0.8, 4, 1000
produced = 0
for _ in range(rounds):
accepts = rng.random(K) < p # draft tokens the target agrees with
n_acc = K if accepts.all() else int(np.argmin(accepts))
produced += n_acc + 1 # accepted run + 1 (correction/bonus)
sim = produced / rounds
formula = (1 - p**(K + 1)) / (1 - p)
print(f"simulated tokens per target pass : {sim:.3f}")
print(f"closed form (1-p^(K+1))/(1-p) : {formula:.3f}")
print(f"speedup vs one-token decode : {sim:.2f}x (minus draft overhead)")
print("\non a reject the rest of the draft is discarded, but the target's")
print("own correction still lands -- a verify pass never yields under 1.")
The serving stack, assembled
| Deployment tier | Typical engine | Model + precision | Defining constraint |
|---|---|---|---|
| Hyperscale API | proprietary / TRT-LLM | frontier MoE · FP8/FP4 | goodput per megawatt |
| Self-hosted cluster | vLLM · SGLang | open 7–700B · FP8/INT4 | data control, $/token |
| Workstation / edge | llama.cpp · Ollama · MLX | 1–70B · GGUF 4-bit | RAM + bandwidth (EQ 7.1) |
| On-device | Core ML / NNAPI runtimes | 1–3B · 4-bit + QAT | battery, thermals, privacy |
Everything so far described one dense transformer. The frontier no longer looks like that. Chapter 09: mixture-of-experts, million-token context, models that see and hear, agents that act — and what's still unsolved.
Further reading
- Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). — paged KV cache; the basis of the modern serving stack.
- Yu et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. — iteration-level (continuous) batching.
- Holtzman, Buys, Du, Forbes & Choi (2020). The Curious Case of Neural Text Degeneration. — nucleus (top-p) sampling and why greedy decoding fails.
- Leviathan, Kalman & Matias (2023). Fast Inference from Transformers via Speculative Decoding. — draft-and-verify decoding with exact output guarantees.
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. — the parallel formulation of speculative decoding.
- Pope et al. (2022). Efficiently Scaling Transformer Inference. — the prefill/decode cost model and the roofline view of serving.