08 · Inference & Deployment — LLM Field Manual

8.1

Two phases, two physics

A request lives twice. Prefill processes the whole prompt in one parallel pass — big matmuls, compute-bound, the GPU happy. Decode then emits one token at a time — each step reads all weights and the entire KV cache to produce a single vector of logits. The diagnostic quantity is arithmetic intensity:

EQ 8.1 — ARITHMETIC INTENSITY & THE ROOFLINE $$ I = \frac{\text{FLOPs}}{\text{bytes moved}}, \qquad I_{\text{prefill}} \sim O(T) \gg I^{*} \quad\text{vs}\quad I_{\text{decode}} \sim O(b) \ll I^{*} $$

An H100 needs $I^* \approx 300$ FLOPs/byte to keep its tensor cores fed (989 TFLOP/s of dense bf16 compute over 3.35 TB/s of HBM bandwidth gives ≈295). Prefill clears it easily; single-stream decode in bf16 manages ~1, because each 2-byte weight read buys only 2 FLOPs. Consequences: TTFT (time to first token) is set by prefill compute, TPOT (time per output token) by memory bandwidth, and every serving trick below is an attempt to raise decode's intensity: batching raises $b$, speculation amortizes reads over several tokens, quantization shrinks the bytes.
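To put numbers on the two regimes, here is a minimal sketch under the simplest cost model: 2 FLOPs per parameter per token, weights streamed once per forward pass, KV-cache and activation traffic ignored. The hardware figures, the 70B model size and the 2,000-token prompt are illustrative assumptions, and the function names are hypothetical.

```python
# Rough roofline estimate: weights-only traffic, 2 FLOPs per parameter per token.
PEAK_FLOPS = 989e12      # assumed H100 dense bf16 peak, FLOP/s
HBM_BW = 3.35e12         # assumed H100 HBM bandwidth, bytes/s
N_PARAMS = 70e9          # assumed model size
BYTES_PER_PARAM = 2      # bf16


def intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte when one weight read serves `tokens_per_pass` tokens."""
    flops = 2 * N_PARAMS * tokens_per_pass
    bytes_moved = N_PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved


def pass_time(tokens_per_pass: int) -> float:
    """Seconds for one forward pass: the slower of compute and weight streaming."""
    compute_s = 2 * N_PARAMS * tokens_per_pass / PEAK_FLOPS
    memory_s = N_PARAMS * BYTES_PER_PARAM / HBM_BW
    return max(compute_s, memory_s)


ridge = PEAK_FLOPS / HBM_BW
ttft_floor = pass_time(2000)   # prefill of a 2,000-token prompt
tpot_floor = pass_time(1)      # one single-stream decode step

print(f"ridge I*          : {ridge:6.0f} FLOPs/byte")
print(f"prefill I (T=2000): {intensity(2000):6.0f} FLOPs/byte")
print(f"decode  I (b=1)   : {intensity(1):6.0f} FLOPs/byte")
print(f"TTFT floor        : {ttft_floor * 1e3:6.1f} ms (compute-bound)")
print(f"TPOT floor        : {tpot_floor * 1e3:6.1f} ms (bandwidth-bound)")
```

Both times are lower bounds: real kernels reach only a fraction of peak, and attention over a long context adds traffic this model leaves out.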

At batch size $b = 4$, decode does ≈$2b$ FLOPs for every 2 bytes of weight streamed (bf16). What is the arithmetic intensity $I = \dfrac{2b}{2}$ (FLOPs/byte)?

$I = \dfrac{2b}{2} = b = $ 4 FLOPs/byte. Far below the H100 ridge of ~300, so decode at batch 4 is still firmly bandwidth-bound — every batch doubling buys throughput for free until $I$ reaches the ridge.

Single-stream decode of a $70\text{B}$ model in bf16 (2 bytes/param) on an H100 ($3.35\times10^{12}$ B/s). What is the tokens/s ceiling?

Bytes per token $= 2 \times 70\times10^9 = 1.4\times10^{11}$. Ceiling $= \dfrac{3.35\times10^{12}}{1.4\times10^{11}} = $ 23.9 tok/s — the bandwidth wall a single user hits before any batching.
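The same division shows why quantization raises the single-stream ceiling. A small sketch, assuming idealized storage widths of 2, 1 and 0.5 bytes per parameter (real 4-bit formats carry extra scale metadata, and KV-cache reads are ignored here):

```python
# Single-stream decode ceiling = bandwidth / bytes read per token (weights only).
HBM_BW = 3.35e12                      # bytes/s, the H100 figure from the example
N_PARAMS = 70e9                       # parameters

# Assumed, idealized storage widths in bytes per parameter.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def decode_ceiling(bytes_per_param: float) -> float:
    """Upper bound on tokens/s for one sequence; ignores KV-cache reads."""
    return HBM_BW / (N_PARAMS * bytes_per_param)


ceilings = {name: decode_ceiling(width) for name, width in BYTES_PER_PARAM.items()}
for name, toks in ceilings.items():
    print(f"{name:5s} {toks:6.1f} tok/s ceiling")
```

Halving the bytes per parameter doubles the ceiling, which is the sense in which quantization "shrinks the bytes".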

PYTHON · RUNNABLE IN-BROWSER

# Decode roofline: aggregate tok/s vs batch, 70B bf16 on H100
import numpy as np
BW, PEAK, N, BYTES = 3.35e12, 989e12, 70e9, 2   # HBM B/s, FLOP/s, params, bf16

batches = 2 ** np.arange(0, 11)                 # 1 ... 1024
agg = []
print("batch | intensity I | regime          | aggregate tok/s")
for b in batches:
    I = 2.0 * b / BYTES                         # ~2b FLOPs per 2 bytes moved
    attained = min(PEAK, BW * I)                # the roofline (EQ 8.1)
    toks = attained / (2 * N)                   # 2N FLOPs per token
    agg.append(toks)
    regime = "compute-bound  " if attained >= PEAK else "bandwidth-bound"
    print(f"{b:5d} | {I:11.0f} | {regime} | {toks:12,.0f}")

print(f"\nridge at I* = {PEAK/BW:.0f} FLOPs/byte (batch ~295): below it each")
print("batch doubling doubles aggregate tok/s for free; above it you only")
print("trade per-user latency. This table is the economics of every API.")
plot_xy(np.log2(batches), agg)                  # plot_xy: plotting helper assumed to be supplied by the in-browser runtime

INSTRUMENT 8.1 — RIDE THE ROOFLINE · 70B BF16 · H100 · DECODE
[Interactive widget. Control: concurrent sequences (batch, default 16). Readouts: regime, aggregate throughput, per-user TPOT.]

Each doubling of batch doubles aggregate tokens/s for free — until the operating point hits the compute ceiling near I* ≈ 295, where per-user latency starts paying for further batching. This single picture is the economics of every LLM API.

8.2

Sampling: from distribution to token

The model hands you $p(x_t \mid x_{<t})$; the sampler decides. Greedy argmax loops and degenerates on open-ended text; pure sampling wanders into the distribution's noisy tail. Production decoding shapes the distribution first:

EQ 8.2 — TEMPERATURE, TOP-K, TOP-P $$ p_i^{(\tau)} = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}, \qquad \text{keep } S = \text{smallest set with} \sum_{i \in S} p_i \ge p_{\text{top}} \;\text{ (∩ top-}k\text{)}, \;\text{ renormalize} $$

$\tau < 1$ sharpens (factual/code), $\tau > 1$ flattens (brainstorming). Top-p (“nucleus”) adapts the cutoff to the model's confidence — wide when uncertain, narrow when sure; top-k caps the candidate count outright; min-p (keep tokens above a fraction of the max probability) is the newer favorite for high-temperature creativity without nonsense. Repetition/frequency/presence penalties damp loops. Reasoning models usually want gentle settings (τ ≈ 0.6–1.0) and no aggressive truncation on thinking tokens.
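The two truncation rules that EQ 8.2 does not spell out, top-k and min-p, can be sketched in a few lines. This is a minimal illustration on made-up logits, not any particular engine's sampler; the function names and the threshold values are assumptions.

```python
import numpy as np


def softmax(z: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax (first half of EQ 8.2)."""
    e = np.exp((z - z.max()) / tau)      # subtract the max for numerical stability
    return e / e.sum()


def top_k_filter(p: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = np.argsort(p)[::-1][:k]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()


def min_p_filter(p: np.ndarray, min_p: float) -> np.ndarray:
    """Keep tokens with probability >= min_p * (max probability), renormalize."""
    q = np.where(p >= min_p * p.max(), p, 0.0)
    return q / q.sum()


z = np.array([5.0, 2.6, 2.2, 1.4, 0.8, -1.0])   # toy logits, illustrative only
for tau in (0.7, 1.5):
    p = softmax(z, tau)
    kept_k = int((top_k_filter(p, k=3) > 0).sum())
    kept_m = int((min_p_filter(p, min_p=0.1) > 0).sum())
    print(f"tau={tau}: top-k=3 keeps {kept_k} tokens, min-p=0.1 keeps {kept_m}")
```

Top-k keeps a fixed count whatever the temperature, while the min-p cutoff moves with the peak probability, so it tightens when the model is confident and loosens when it is not.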

A model's next-token probabilities, sorted, are $[0.50,\, 0.25,\, 0.15,\, 0.10]$. With top-p $= 0.90$, how many tokens fall inside the nucleus (the smallest set whose mass $\ge 0.90$)?

Cumulate from the top: $0.50$ → $0.75$ → $0.90$. The running sum first reaches $0.90$ at the third token, so the nucleus holds 3 tokens; the $0.10$ tail is dropped and the kept mass renormalized.

PYTHON · RUNNABLE IN-BROWSER

# The sampler: temperature + top-p, 2000 draws vs the ideal
import numpy as np
rng = np.random.default_rng(0)
toks = ["Paris", "the", "a", "located", "Lyon", "Berlin"]
z = np.array([5.0, 2.6, 2.2, 1.4, 0.8, -1.0])    # toy logits

def shape(z, tau, top_p):
    p = np.exp(z / tau); p /= p.sum()
    order = np.argsort(p)[::-1]
    keep = order[: np.searchsorted(np.cumsum(p[order]), top_p) + 1]
    q = np.zeros_like(p); q[keep] = p[keep]
    return q / q.sum()                            # EQ 8.2: shape, cut, renorm

for tau, top_p in [(0.5, 0.95), (1.5, 0.95)]:
    q = shape(z, tau, top_p)
    draws = rng.choice(len(z), 2000, p=q).astype(np.intp)
    freq = np.bincount(draws, minlength=len(z)) / 2000
    print(f"tau={tau}  top-p={top_p}")
    for t, qi, fi in zip(toks, q, freq):
        print(f"  {t:8s} ideal {qi:.3f}  drawn {fi:.3f}  {'#' * int(40*fi)}")

print("cold tau collapses onto 'Paris'; hot tau lets the tail into the")
print("lottery (Lyon survives, Berlin is cut by top-p) -- exactly how a")
print("hallucination does or does not get sampled into existence.")

INSTRUMENT 8.2 — SAMPLING PLAYGROUND · “The capital of France is ___”
[Interactive widget. Controls: temperature τ (default 1.00), top-p (0.95), top-k (12). Readout: entropy of the final distribution.]

Grey ghost bars: raw model probabilities. Mint: after temperature + truncation + renormalization. Drop τ to 0.1 — sampling collapses to greedy “Paris”. Raise τ to 2.5 with top-p = 1 and “Berlin” enters the lottery: that is how hallucinations get sampled into existence.

8.3

PagedAttention: virtual memory for the KV cache

Early servers reserved one contiguous KV buffer per request at maximum possible length — internal fragmentation wasted 60–80% of cache memory. vLLM's PagedAttention imported the operating-system playbook: carve the cache into fixed-size blocks (~16 tokens), allocate on demand, and let a block table map each sequence's logical positions to scattered physical blocks.

Near-zero fragmentation ⇒ 2–4× more concurrent sequences on the same GPU — the single largest throughput win in serving history.
Copy-on-write sharing: parallel samples and beams share their common prefix physically; only divergent blocks are copied.
Prefix caching: system prompts, few-shot preambles and conversation history persist as shared blocks across requests, so only the uncached suffix of each prompt is prefilled. Apps with long system prompts can see prefill work drop by around 10× (this is the mechanism behind API “prompt caching” discounts).
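The block-table idea can be sketched as a toy allocator. This is a simplified illustration, not vLLM's implementation: the class and method names are hypothetical, there is no copy-on-write or eviction, and the 2,048-token maximum length used for comparison is an assumption.

```python
BLOCK_SIZE = 16   # tokens per KV block, matching the ~16 quoted above


class BlockAllocator:
    """Toy paged KV allocator: a free list plus one block table per sequence."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # seq id -> list of physical block ids
        self.lengths = {}                     # seq id -> tokens stored so far

    def append_token(self, seq):
        """Reserve space for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq, 0)
        table = self.tables.setdefault(seq, [])
        if n % BLOCK_SIZE == 0:               # last block is full (or none yet)
            table.append(self.free.pop())
        self.lengths[seq] = n + 1

    def locate(self, seq, pos):
        """Map a logical token position to (physical block, offset in block)."""
        return self.tables[seq][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq))
        del self.lengths[seq]


alloc = BlockAllocator(num_blocks=64)
for seq, n_tokens in [("a", 40), ("b", 5), ("c", 17)]:
    for _ in range(n_tokens):
        alloc.append_token(seq)

used = sum(len(t) for t in alloc.tables.values())   # blocks actually held
print(f"paged: {used} blocks = {used * BLOCK_SIZE} slots for 62 tokens")
print(f"contiguous reservation at an assumed max_len of 2048: {3 * 2048} slots")
print("token 39 of 'a' lives at (block, offset) =", alloc.locate("a", 39))
```

Waste is bounded by one partly filled block per sequence, which is where the near-zero fragmentation comes from.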

Same idea, next level: RadixAttention (SGLang) organizes cached prefixes in a radix tree for automatic reuse across arbitrary branching conversations and agent trees.
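Prefix reuse across branching conversations can be illustrated with a plain trie over token ids. This is a deliberately simplified stand-in, not SGLang's data structure: a real radix tree merges single-child chains, tracks the KV blocks themselves, and evicts under memory pressure. The class name and the token ids are hypothetical.

```python
class PrefixTrie:
    """Toy prefix cache: one trie node per token (a radix tree would merge chains)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that the KV for every prefix of `tokens` is now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, hit = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node, hit = node[tok], hit + 1
        return hit


cache = PrefixTrie()
system = list(range(100))            # stand-in token ids for a shared system prompt
turn_a = system + [501, 502, 503]
turn_b = system + [601, 602]
branch = turn_a + [701]              # a follow-up that extends conversation A

report = []
for name, req in [("A", turn_a), ("B", turn_b), ("A+1", branch)]:
    hit = cache.match(req)           # tokens whose KV can be reused
    report.append((name, hit, len(req) - hit))
    cache.insert(req)
    print(f"{name:4s} reused {hit:3d} tokens, prefilled {len(req) - hit:3d}")
```

Only the first request pays for the system prompt; every later request prefills just the tokens past its longest cached prefix.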

8.4

Continuous batching

Batching is how decode escapes the bandwidth wall — weights are read once per step for the whole batch. The naïve version (static batching: wait, run all to completion) dies on variance: one 2,000-token response holds 31 finished requests hostage. Continuous (in-flight) batching schedules at the iteration level:

Every decode step, finished sequences exit the batch immediately and queued requests join — the batch composition changes step to step.
Chunked prefill splits long prompts into slices interleaved with ongoing decodes, so a giant document upload doesn't spike everyone's inter-token latency.
The scheduler's whole life is the throughput–latency frontier: deeper batches raise tokens/s/GPU but stretch each user's TPOT. SLO-aware schedulers ride that curve explicitly.
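The gap between static and iteration-level scheduling shows up even in a toy step counter. The sketch below is a simplification under stated assumptions: every decode step costs the same regardless of occupancy, prefill and chunked prefill are not modeled, and the workload (256 requests with geometrically distributed output lengths, mean about 150, capped at 2,000) is invented for illustration.

```python
import numpy as np


def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_steps(lengths, batch_size):
    """Iteration-level scheduling: a freed slot is refilled on the next step."""
    queue = list(lengths)
    running = []                                  # remaining tokens per active sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))          # admit queued requests
        running = [r - 1 for r in running]        # one decode step for the whole batch
        running = [r for r in running if r > 0]   # finished sequences leave at once
        steps += 1
    return steps


rng = np.random.default_rng(0)
# Assumed workload: mostly short replies with an occasional long one.
lengths = [int(n) for n in np.minimum(rng.geometric(1 / 150, size=256), 2000)]

s, c = static_steps(lengths, 32), continuous_steps(lengths, 32)
total = sum(lengths)
print(f"static     : {s:5d} steps, slot utilization {total / (s * 32):.0%}")
print(f"continuous : {c:5d} steps, slot utilization {total / (c * 32):.0%}")
```

Slot utilization is the fraction of batch slots doing useful work per step; static batching loses it to sequences that finished early and sit idle until the longest one ends.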

EQ 8.3 — THE METRICS THAT GET PAGED ON $$ \mathrm{TTFT} \approx t_{\text{queue}} + \frac{\text{prefill FLOPs}}{\text{compute}}, \qquad \mathrm{TPOT} \approx \frac{\text{bytes}_{\text{weights}} + \text{bytes}_{\text{KV}}(T)}{\text{bandwidth} \cdot b_{\text{eff}}}, \qquad \mathrm{E2E} = \mathrm{TTFT} + n \cdot \mathrm{TPOT} $$

Goodput — requests/s within SLO — is the number that matters commercially, and it's why prefill and decode are increasingly disaggregated onto separate GPU pools (compute-heavy cards prefill, bandwidth-heavy cards decode, KV shipped between them).

A request has TTFT $= 0.5$ s and TPOT $= 0.02$ s, and produces $n = 100$ output tokens. What is the end-to-end latency $\mathrm{E2E} = \mathrm{TTFT} + n\cdot\mathrm{TPOT}$?

$\mathrm{E2E} = 0.5 + 100 \times 0.02 = 0.5 + 2.0 = $ 2.5 s. For long generations the $n\cdot\mathrm{TPOT}$ term dominates, which is why decode bandwidth — not prefill — governs the felt latency of chat.
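A short sketch makes the dominance of the decode term visible as the output grows. It reuses the TTFT and TPOT values from the worked example; the three output lengths are arbitrary choices.

```python
def e2e_latency(ttft_s: float, tpot_s: float, n_tokens: int) -> float:
    """End-to-end latency from EQ 8.3: TTFT + n * TPOT."""
    return ttft_s + n_tokens * tpot_s


TTFT, TPOT = 0.5, 0.02            # seconds, the values from the worked example
rows = []
for n in (10, 100, 1000):
    total = e2e_latency(TTFT, TPOT, n)
    decode_share = n * TPOT / total          # fraction of the wait spent decoding
    rows.append((n, total, decode_share))
    print(f"n={n:5d}  E2E={total:6.2f} s  decode share={decode_share:.0%}")
```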

8.5

Speculative decoding: guess cheap, verify exact

Decode wastes a full model read on one token — unless you verify several proposed tokens in a single pass. A small draft model (or extra prediction heads: Medusa, EAGLE; or the model's own MTP heads, as in DeepSeek-V3) proposes $K$ tokens; the target model scores them all at once — that's a prefill-shaped, compute-cheap operation — and a rejection-sampling rule keeps the output distribution exactly the target's:

EQ 8.4 — ACCEPTANCE RULE (LOSSLESS) $$ \text{accept } \tilde{x}_t \text{ with probability } \min\!\left(1,\; \frac{p(\tilde{x}_t)}{q(\tilde{x}_t)}\right); \quad \text{on reject, resample } x_t \sim \mathrm{norm}\big(\max(0,\, p - q)\big) $$

$q$ = draft distribution, $p$ = target. The correction term on rejection is what makes the scheme provably distribution-preserving: speculative decoding is a pure latency win, not an approximation. Speedup rises with the acceptance rate and the draft length, less the draft model's overhead: 2–3× in practice on predictable text (code!), less on high-entropy prose.
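The lossless claim is easy to check empirically for a single token position. The sketch below applies the rule of EQ 8.4 to two made-up distributions and compares the output frequencies against the target; the numbers are toy values and the vectorized layout is just one convenient way to write it.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.60, 0.25, 0.10, 0.05])   # target distribution (toy values)
q = np.array([0.40, 0.40, 0.15, 0.05])   # draft distribution (toy values)

# Residual distribution used after a rejection: norm(max(0, p - q)).
residual = np.maximum(0.0, p - q)
residual /= residual.sum()

n = 200_000
x = rng.choice(len(q), size=n, p=q)                      # draft proposes x ~ q
accept = rng.random(n) < np.minimum(1.0, p[x] / q[x])    # accept w.p. min(1, p/q)
fallback = rng.choice(len(p), size=n, p=residual)        # resample on rejection
tokens = np.where(accept, x, fallback)

empirical = np.bincount(tokens, minlength=len(p)) / n
print("target p    :", p)
print("empirical   :", np.round(empirical, 3))
print(f"accept rate : {accept.mean():.3f}  "
      f"(theory: sum of min(p, q) = {np.minimum(p, q).sum():.2f})")
```

The draft over-proposes the second token and under-proposes the first, yet the accepted-or-corrected output tracks $p$; the acceptance rate equals the overlap $\sum_i \min(p_i, q_i)$, so a better draft wastes fewer proposals.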

A draft model proposes $K = 4$ tokens, each accepted independently with probability $p = 0.8$ (here $p$ is a scalar acceptance rate, not the target distribution of EQ 8.4). Expected tokens produced per target verify pass is $\dfrac{1 - p^{K+1}}{1 - p}$. Evaluate it.

$p^{K+1} = 0.8^5 = 0.32768$. So $\dfrac{1 - 0.32768}{1 - 0.8} = \dfrac{0.67232}{0.2} = $ 3.36 tokens per pass — a ~3.4× decode speedup before subtracting draft overhead.
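To see what "before subtracting draft overhead" costs, here is a sketch under an assumed cost model: each drafted token costs a fixed fraction of one target forward pass, and a verify pass costs the same as an ordinary decode step. The 5% draft cost is a hypothetical figure, and acceptances are treated as independent, as in the worked example.

```python
def expected_tokens(p_accept: float, k: int) -> float:
    """Expected tokens per verify pass, (1 - p^(K+1)) / (1 - p), for i.i.d. accepts."""
    return (1 - p_accept ** (k + 1)) / (1 - p_accept)


def net_speedup(p_accept: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by round cost, in units of one target pass."""
    return expected_tokens(p_accept, k) / (1 + k * draft_cost)


DRAFT_COST = 0.05   # assumed: one draft token costs 5% of a target pass
for p_accept in (0.6, 0.8, 0.9):
    # Search draft lengths 1..16 for the one with the highest net speedup.
    best_k = max(range(1, 17), key=lambda k: net_speedup(p_accept, k, DRAFT_COST))
    print(f"accept={p_accept}: best K={best_k:2d}, "
          f"net speedup {net_speedup(p_accept, best_k, DRAFT_COST):.2f}x")
```

Longer drafts have diminishing returns because later tokens survive only if every earlier one was accepted, so the best $K$ grows with the acceptance rate.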

PYTHON · RUNNABLE IN-BROWSER

# Speculative decoding simulator -- K=4 draft, accept p=0.8
import numpy as np
rng = np.random.default_rng(0)
p, K, rounds = 0.8, 4, 1000

produced = 0
for _ in range(rounds):
    accepts = rng.random(K) < p              # draft tokens the target agrees with
    n_acc = K if accepts.all() else int(np.argmin(accepts))
    produced += n_acc + 1                    # accepted run + 1 (correction/bonus)

sim = produced / rounds
formula = (1 - p**(K + 1)) / (1 - p)
print(f"simulated tokens per target pass : {sim:.3f}")
print(f"closed form (1-p^(K+1))/(1-p)    : {formula:.3f}")
print(f"speedup vs one-token decode      : {sim:.2f}x (minus draft overhead)")
print("\non a reject the rest of the draft is discarded, but the target's")
print("own correction still lands -- a verify pass never yields under 1.")

INSTRUMENT 8.3 — SPECULATIVE DECODING, SIMULATED · DRAFT K=4 · VERIFY · CORRECT
[Interactive widget. Readouts: drafted, accepted, accept rate, tokens per target pass.]

Grey = drafted by the small model · mint = verified accepted · red flash = rejected (the rest of the draft is discarded) · deep green = the target model's own correction. Without speculation this sentence would cost one full forward pass per word.

8.6

The serving stack, assembled

FIG 8.A · ANATOMY OF AN LLM SERVICE

The 2026 default stack. Open engines (vLLM, SGLang, TensorRT-LLM, llama.cpp at the edge) implement everything in this chapter off the shelf; what remains proprietary at frontier labs is mostly scheduling policy, multi-region cache routing, and silicon-specific kernels.

Deployment tier	Typical engine	Model + precision	Defining constraint
Hyperscale API	proprietary / TRT-LLM	frontier MoE · FP8/FP4	goodput per megawatt
Self-hosted cluster	vLLM · SGLang	open 7–700B · FP8/INT4	data control, $/token
Workstation / edge	llama.cpp · Ollama · MLX	1–70B · GGUF 4-bit	RAM + bandwidth (EQ 7.1)
On-device	Core ML / NNAPI runtimes	1–3B · 4-bit + QAT	battery, thermals, privacy

Everything so far described one dense transformer. The frontier no longer looks like that. Chapter 09: mixture-of-experts, million-token context, models that see and hear, agents that act — and what's still unsolved.
