AI // ENCYCLOPEDIA / VOL II / 08 / INFERENCE & DEPLOYMENT
CHAPTER 08 / 10

Inference & Deployment

Serving has its own physics: a compute-bound prefill followed by a memory-bound decode, with cost riding on cache management, batching policy, and the willingness to guess. This chapter takes the request's-eye view, from sampling parameters to the modern serving stack.

READING TIME ≈ 30 MIN · BUILDS ON CH 03, 07 · INSTRUMENTS: ROOFLINE · SAMPLING · SPEC-DECODE
8.1

Two phases, two physics

A request lives twice. Prefill processes the whole prompt in one parallel pass — big matmuls, compute-bound, the GPU happy. Decode then emits one token at a time — each step reads all weights and the entire KV cache to produce a single vector of logits. The diagnostic quantity is arithmetic intensity:

EQ 8.1 — ARITHMETIC INTENSITY & THE ROOFLINE $$ I = \frac{\text{FLOPs}}{\text{bytes moved}}, \qquad I_{\text{prefill}} \sim O(T) \gg I^{*} \quad\text{vs}\quad I_{\text{decode}} \sim O(b) \ll I^{*} $$
An H100 needs \(I^* \approx 300\) FLOPs/byte to keep its tensor cores fed. Prefill clears it easily; single-stream decode in bf16 manages ~1 (\(I = b\) at \(b = 1\)). Consequences: TTFT (time to first token) is set by prefill compute, TPOT (time per output token) by memory bandwidth, and every serving trick below is an attempt to raise decode's intensity: batching raises \(b\), speculation amortizes reads over several tokens, quantization shrinks the bytes.
At batch size \(b = 4\), decode does ≈\(2b\) FLOPs for every 2 bytes of weight streamed (bf16). What is the arithmetic intensity \(I = \dfrac{2b}{2}\) (FLOPs/byte)?
\(I = \dfrac{2b}{2} = b = \) 4 FLOPs/byte. Far below the H100 ridge of ~300, so decode at batch 4 is still firmly bandwidth-bound — every batch doubling buys throughput for free until \(I\) reaches the ridge.
Single-stream decode of a \(70\text{B}\) model in bf16 (2 bytes/param) on an H100 (\(3.35\times10^{12}\) B/s). What is the tokens/s ceiling?
Bytes per token \(= 2 \times 70\times10^9 = 1.4\times10^{11}\). Ceiling \(= \dfrac{3.35\times10^{12}}{1.4\times10^{11}} = \) 23.9 tok/s — the bandwidth wall a single user hits before any batching.
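The same bandwidth ceiling shows what "quantization shrinks the bytes" buys. The sketch below reuses the chapter's figures (70B parameters, \(3.35\times10^{12}\) B/s) and assumes, as the worked answer does, that weight reads dominate traffic and KV-cache reads can be ignored. The list of storage formats and the function name `decode_ceiling` are illustrative choices, not part of the original text.

```python
# Single-stream decode ceiling: tokens/s = HBM bandwidth / weight bytes read per token.
# Assumption: one full weight read per token, KV-cache traffic ignored.
BW = 3.35e12          # H100 HBM bandwidth in bytes/s (figure used in this chapter)
N_PARAMS = 70e9       # 70B parameters

# Bytes per parameter for a few storage formats (illustrative list).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def decode_ceiling(bytes_per_param, n_params=N_PARAMS, bandwidth=BW):
    """Upper bound on single-stream tokens/s in the bandwidth-bound regime."""
    return bandwidth / (n_params * bytes_per_param)

ceilings = {name: decode_ceiling(b) for name, b in BYTES_PER_PARAM.items()}
for name, tps in ceilings.items():
    print(f"{name:5s}: {tps:6.1f} tok/s ceiling")
```

Halving the bytes per parameter doubles the ceiling, which is why lower-precision weights help single-user latency even when the GPU has compute to spare.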
PYTHON · RUNNABLE IN-BROWSER
# Decode roofline: aggregate tok/s vs batch, 70B bf16 on H100
import numpy as np
BW, PEAK, N, BYTES = 3.35e12, 989e12, 70e9, 2   # HBM B/s, FLOP/s, params, bf16

batches = 2 ** np.arange(0, 11)                 # 1 ... 1024
agg = []
print("batch | intensity I | regime          | aggregate tok/s")
for b in batches:
    I = 2.0 * b / BYTES                         # ~2b FLOPs per 2 bytes moved
    attained = min(PEAK, BW * I)                # the roofline (EQ 8.1)
    toks = attained / (2 * N)                   # 2N FLOPs per token
    agg.append(toks)
    regime = "compute-bound  " if attained >= PEAK else "bandwidth-bound"
    print(f"{b:5d} | {I:11.0f} | {regime} | {toks:12,.0f}")

print(f"\nridge at I* = {PEAK/BW:.0f} FLOPs/byte (batch ~295): below it each")
print("batch doubling doubles aggregate tok/s for free; above it you only")
print("trade per-user latency. This table is the economics of every API.")
if "plot_xy" in globals():                      # plot_xy: plotting helper assumed to come from the page's runtime
    plot_xy(np.log2(batches), agg)
INSTRUMENT 8.1 — RIDE THE ROOFLINE · 70B BF16 · H100 · DECODE
Readouts: regime · aggregate throughput · per-user TPOT
Each doubling of batch doubles aggregate tokens/s for free — until the operating point hits the compute ceiling near I* ≈ 295, where per-user latency starts paying for further batching. This single picture is the economics of every LLM API.
8.2

Sampling: from distribution to token

The model hands you \(p(x_t \mid x_{<t})\); the sampler decides. Greedy argmax loops and degenerates on open-ended text; pure sampling wanders into the distribution's noisy tail. Production decoding shapes the distribution first:

EQ 8.2 — TEMPERATURE, TOP-K, TOP-P $$ p_i^{(\tau)} = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}, \qquad \text{keep } S = \text{smallest set with} \sum_{i \in S} p_i \ge p_{\text{top}} \;\text{ (∩ top-}k\text{)}, \;\text{ renormalize} $$
\(\tau < 1\) sharpens (factual/code), \(\tau > 1\) flattens (brainstorming). Top-p (“nucleus”) adapts the cutoff to the model's confidence — wide when uncertain, narrow when sure; top-k caps the candidate count outright; min-p (keep tokens above a fraction of the max probability) is the newer favorite for high-temperature creativity without nonsense. Repetition/frequency/presence penalties damp loops. Reasoning models usually want gentle settings (τ ≈ 0.6–1.0) and no aggressive truncation on thinking tokens.
A model's next-token probabilities, sorted, are \([0.50,\, 0.25,\, 0.15,\, 0.10]\). With top-p \(= 0.90\), how many tokens fall inside the nucleus (the smallest set whose mass \(\ge 0.90\))?
Cumulate from the top: \(0.50\) → \(0.75\) → \(0.90\). The running sum first reaches \(0.90\) at the third token, so the nucleus holds 3 tokens; the \(0.10\) tail is dropped and the kept mass renormalized.
PYTHON · RUNNABLE IN-BROWSER
# The sampler: temperature + top-p, 2000 draws vs the ideal
import numpy as np
rng = np.random.default_rng(0)
toks = ["Paris", "the", "a", "located", "Lyon", "Berlin"]
z = np.array([5.0, 2.6, 2.2, 1.4, 0.8, -1.0])    # toy logits

def shape(z, tau, top_p):
    p = np.exp(z / tau); p /= p.sum()
    order = np.argsort(p)[::-1]
    keep = order[: np.searchsorted(np.cumsum(p[order]), top_p) + 1]
    q = np.zeros_like(p); q[keep] = p[keep]
    return q / q.sum()                            # EQ 8.2: shape, cut, renorm

for tau, top_p in [(0.5, 0.95), (1.5, 0.95)]:
    q = shape(z, tau, top_p)
    draws = rng.choice(len(z), 2000, p=q).astype(np.intp)
    freq = np.bincount(draws, minlength=len(z)) / 2000
    print(f"tau={tau}  top-p={top_p}")
    for t, qi, fi in zip(toks, q, freq):
        print(f"  {t:8s} ideal {qi:.3f}  drawn {fi:.3f}  {'#' * int(40*fi)}")

print("cold tau collapses onto 'Paris'; hot tau lets the tail into the")
print("lottery (Lyon survives, Berlin is cut by top-p) -- exactly how a")
print("hallucination does or does not get sampled into existence.")
INSTRUMENT 8.2 — SAMPLING PLAYGROUND · “The capital of France is ___”
Readout: entropy of final distribution
Grey ghost bars: raw model probabilities. Mint: after temperature + truncation + renormalization. Drop τ to 0.1 — sampling collapses to greedy “Paris”. Raise τ to 2.5 with top-p = 1 and “Berlin” enters the lottery: that is how hallucinations get sampled into existence.
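Top-k and min-p, described in the prose above, are not exercised by the temperature + top-p sampler. A minimal sketch of both filters follows, using the same toy logits. The function names and the settings `k=3` and `min_p=0.05` are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_k_filter(p, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = np.argsort(p)[::-1][:k]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def min_p_filter(p, min_p):
    """Keep tokens with probability >= min_p * max(p), renormalize."""
    q = np.where(p >= min_p * p.max(), p, 0.0)
    return q / q.sum()

toks = ["Paris", "the", "a", "located", "Lyon", "Berlin"]
z = [5.0, 2.6, 2.2, 1.4, 0.8, -1.0]      # same toy logits as the sampler listing

p_hot = softmax(z, tau=1.5)              # a hot temperature flattens the distribution
q_topk = top_k_filter(p_hot, k=3)        # fixed candidate count
q_minp = min_p_filter(p_hot, min_p=0.05) # cutoff scales with the top token's probability
for t, a, b, c in zip(toks, p_hot, q_topk, q_minp):
    print(f"{t:8s} raw {a:.3f}  top-k {b:.3f}  min-p {c:.3f}")
```

The design difference is in what the cutoff tracks: top-k always keeps the same number of candidates, while min-p keeps more when the model is uncertain and fewer when one token dominates.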
8.3

PagedAttention: virtual memory for the KV cache

Early servers reserved one contiguous KV buffer per request at maximum possible length — internal fragmentation wasted 60–80% of cache memory. vLLM's PagedAttention imported the operating-system playbook: carve the cache into fixed-size blocks (~16 tokens), allocate on demand, and let a block table map each sequence's logical positions to scattered physical blocks.

  • Near-zero fragmentation ⇒ 2–4× more concurrent sequences on the same GPU, one of the largest single throughput wins in LLM serving.
  • Copy-on-write sharing: parallel samples and beams share their common prefix physically; only divergent blocks are copied.
  • Prefix caching: system prompts, few-shot preambles and conversation history persist as shared blocks across requests, so apps with long system prompts can see prefill cost drop by roughly 10× (this is the mechanism behind API “prompt caching” discounts).
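The block table and copy-on-write sharing listed above can be sketched with a toy allocator. This is not vLLM's implementation: the class names `BlockAllocator` and `Sequence` are hypothetical, blocks hold no actual KV data, and only the bookkeeping (on-demand allocation, reference counts, a private copy on a shared write) is modelled.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks, ref-counted for sharing."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = {}                    # physical block id -> number of owners

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """The block table maps logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        a = self.allocator
        if self.num_tokens % a.block_size == 0:
            self.block_table.append(a.alloc())       # last block is full: allocate on demand
        elif a.refcount[self.block_table[-1]] > 1:
            shared = self.block_table[-1]            # copy-on-write: last block is shared
            self.block_table[-1] = a.alloc()         # (a real engine would copy the KV data here)
            a.release(shared)
        self.num_tokens += 1

    def fork(self):
        """New sequence sharing every existing block, e.g. a parallel sample."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


alloc = BlockAllocator(num_blocks=8, block_size=16)
parent = Sequence(alloc)
for _ in range(40):              # 40-token prompt fills 3 blocks (16 + 16 + 8)
    parent.append_token()
child = parent.fork()            # shares all 3 blocks, no new memory
child.append_token()             # write into the shared last block triggers a copy
print("parent table:", parent.block_table)
print("child  table:", child.block_table)
print("free blocks :", len(alloc.free))
```

After the fork, the two tables agree on the full prefix blocks and differ only in the last, partially filled one, so the fork cost one extra block rather than three.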

Same idea, next level: RadixAttention (SGLang) organizes cached prefixes in a radix tree for automatic reuse across arbitrary branching conversations and agent trees.
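A simplified sketch of prefix reuse across requests follows. It is a plain trie keyed on whole token blocks rather than a compressed radix tree, and it is not SGLang's data structure. The class name `PrefixTree`, the block size of 4, and the token ids are all hypothetical.

```python
class PrefixTree:
    """Toy prefix cache: a trie whose edges are whole blocks of token ids."""
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.root = {}           # nested dicts: block (tuple of token ids) -> child node

    def _blocks(self, tokens):
        bs = self.block_size
        full = len(tokens) - len(tokens) % bs        # only whole blocks are cacheable
        return [tuple(tokens[i:i + bs]) for i in range(0, full, bs)]

    def match(self, tokens):
        """Number of leading tokens whose KV entries are already cached."""
        node, hit = self.root, 0
        for block in self._blocks(tokens):
            if block not in node:
                break
            node = node[block]
            hit += self.block_size
        return hit

    def insert(self, tokens):
        node = self.root
        for block in self._blocks(tokens):
            node = node.setdefault(block, {})


cache = PrefixTree(block_size=4)
system = list(range(100, 112))               # 12-token system prompt (made-up ids)
req_a = system + [1, 2, 3, 4, 5]
req_b = system + [1, 2, 3, 4, 9, 9]          # same first user turn, then diverges
req_c = system + [7, 7, 7, 7]                # shares only the system prompt

hits = []
for req in (req_a, req_b, req_c):
    hits.append(cache.match(req))            # tokens that can skip prefill
    cache.insert(req)
print(hits)
```

The first request finds nothing cached, the second reuses the system prompt plus the shared turn, and the third reuses the system prompt alone. Branching conversations fall out of the tree shape without any per-application configuration.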

8.4

Continuous batching

Batching is how decode escapes the bandwidth wall: weights are read once per step for the whole batch. The naïve version (static batching: wait for a full batch, then run every sequence to completion) dies on variance: in a batch of 32, one 2,000-token response holds 31 finished requests hostage. Continuous (in-flight) batching schedules at the iteration level:

  • Every decode step, finished sequences exit the batch immediately and queued requests join — the batch composition changes step to step.
  • Chunked prefill splits long prompts into slices interleaved with ongoing decodes, so a giant document upload doesn't spike everyone's inter-token latency.
  • The scheduler's whole life is the throughput–latency frontier: deeper batches raise tokens/s/GPU but stretch each user's TPOT. SLO-aware schedulers ride that curve explicitly.
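The difference between static and iteration-level scheduling can be simulated in a few lines. The sketch assumes every decode step costs the same regardless of how many slots are occupied (the bandwidth-bound regime), ignores prefill, and uses an invented workload with one long straggler. The function names are illustrative.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Fixed batches run to completion; every member returns when the longest finishes."""
    finish, clock = [], 0
    for i in range(0, len(lengths), batch_size):
        group = lengths[i:i + batch_size]
        clock += max(group)
        finish += [clock] * len(group)
    return finish

def continuous_batching(lengths, batch_size):
    """Iteration-level scheduling: refill free slots from the queue before every step."""
    queue = deque(enumerate(lengths))
    active = {}                              # request id -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while queue or active:
        while queue and len(active) < batch_size:
            rid, n = queue.popleft()         # a queued request joins the batch
            active[rid] = n
        clock += 1                           # one decode step for every active sequence
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:             # finished sequences exit immediately
                finish[rid] = clock
                del active[rid]
    return finish

# Hypothetical workload: output lengths in tokens, all queued at step 0.
lengths = [2000, 50, 50, 50, 50, 50, 50, 50]
static_finish = static_batching(lengths, batch_size=4)
cont_finish = continuous_batching(lengths, batch_size=4)
print("static     mean finish step:", sum(static_finish) / len(lengths))
print("continuous mean finish step:", sum(cont_finish) / len(lengths))
```

The total number of steps barely changes because the straggler still needs its 2,000 steps. What changes is that short requests stop waiting for it.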
EQ 8.3 — THE METRICS THAT GET PAGED ON $$ \mathrm{TTFT} \approx t_{\text{queue}} + \frac{\text{prefill FLOPs}}{\text{compute}}, \qquad \mathrm{TPOT} \approx \frac{\text{bytes}_{\text{weights}} + b_{\text{eff}} \cdot \text{bytes}_{\text{KV}}(T)}{\text{bandwidth}}, \qquad \mathrm{E2E} = \mathrm{TTFT} + n \cdot \mathrm{TPOT} $$
Here \(b_{\text{eff}}\) is the number of sequences in the running batch: each decode step streams the weights once plus every sequence's KV cache, so per-user TPOT grows with batch depth while the cost per generated token, \(\mathrm{TPOT}/b_{\text{eff}}\), falls. Goodput (requests/s within SLO) is the number that matters commercially, and it's why prefill and decode are increasingly disaggregated onto separate GPU pools (compute-heavy cards prefill, bandwidth-heavy cards decode, KV shipped between them).
A request has TTFT \(= 0.5\) s and TPOT \(= 0.02\) s, and produces \(n = 100\) output tokens. What is the end-to-end latency \(\mathrm{E2E} = \mathrm{TTFT} + n\cdot\mathrm{TPOT}\)?
\(\mathrm{E2E} = 0.5 + 100 \times 0.02 = 0.5 + 2.0 = \) 2.5 s. For long generations the \(n\cdot\mathrm{TPOT}\) term dominates, which is why decode bandwidth — not prefill — governs the felt latency of chat.
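The end-to-end formula and the goodput definition can be put side by side in a short sketch. The request measurements, the 2-second window, and the SLO thresholds are invented for illustration, and the helper names are hypothetical.

```python
def e2e_latency(ttft, tpot, n_tokens):
    """End-to-end latency in seconds: E2E = TTFT + n * TPOT (EQ 8.3)."""
    return ttft + n_tokens * tpot

def goodput(requests, window_s, ttft_slo, tpot_slo):
    """Requests per second that meet both SLOs over an observation window."""
    ok = sum(1 for r in requests if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)
    return ok / window_s

# Made-up measurements collected over a 2-second window.
requests = [
    {"ttft": 0.5, "tpot": 0.02, "n": 100},
    {"ttft": 0.3, "tpot": 0.05, "n": 400},   # misses the TPOT SLO
    {"ttft": 1.4, "tpot": 0.02, "n": 50},    # misses the TTFT SLO
    {"ttft": 0.4, "tpot": 0.03, "n": 200},
]
for r in requests:
    print(f"n={r['n']:4d}  E2E = {e2e_latency(r['ttft'], r['tpot'], r['n']):.2f} s")

gp = goodput(requests, window_s=2.0, ttft_slo=1.0, tpot_slo=0.04)
print(f"throughput = {len(requests) / 2.0:.1f} req/s, goodput = {gp:.1f} req/s")
```

Raw throughput counts every completed request. Goodput counts only those that stayed inside both latency targets, so a scheduler can raise one while lowering the other.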
8.5

Speculative decoding: guess cheap, verify exact

Decode wastes a full model read on one token — unless you verify several proposed tokens in a single pass. A small draft model (or extra prediction heads: Medusa, EAGLE; or the model's own MTP heads, as in DeepSeek-V3) proposes \(K\) tokens; the target model scores them all at once — that's a prefill-shaped, compute-cheap operation — and a rejection-sampling rule keeps the output distribution exactly the target's:

EQ 8.4 — ACCEPTANCE RULE (LOSSLESS) $$ \text{accept } \tilde{x}_t \text{ with probability } \min\!\left(1,\; \frac{p(\tilde{x}_t)}{q(\tilde{x}_t)}\right); \quad \text{on reject, resample } x_t \sim \mathrm{norm}\big(\max(0,\, p - q)\big) $$
\(q\) = draft distribution, \(p\) = target. The correction term on rejection is what makes the scheme provably distribution-preserving: speculative decoding is a pure latency win, not an approximation. The expected speedup rises with the acceptance rate and the draft length \(K\), is capped at \(K + 1\) tokens per verify pass, and is reduced by draft overhead: 2–3× in practice on predictable text (code!), less on high-entropy prose.
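The losslessness claim in EQ 8.4 can be checked empirically for a single token position. In the sketch below the two four-token distributions are toy values chosen for illustration, and `speculative_step` is a hypothetical helper name. The draft proposes from \(q\), the rule accepts or resamples, and the resulting token frequencies are compared with \(p\).

```python
import numpy as np

def speculative_step(p, q, rng):
    """One draft-then-verify draw; p = target probabilities, q = draft probabilities."""
    x = rng.choice(len(q), p=q)                        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):           # accept with probability min(1, p/q)
        return x, True
    residual = np.maximum(0.0, p - q)                  # on reject: resample from norm(max(0, p - q))
    return rng.choice(len(p), p=residual / residual.sum()), False

rng = np.random.default_rng(0)
p = np.array([0.60, 0.25, 0.10, 0.05])   # toy target distribution (assumed)
q = np.array([0.40, 0.40, 0.15, 0.05])   # toy draft distribution (assumed)

n = 50_000
counts = np.zeros(len(p))
accepted = 0
for _ in range(n):
    tok, ok = speculative_step(p, q, rng)
    counts[tok] += 1
    accepted += ok

empirical = counts / n
print("target p :", p)
print("empirical:", np.round(empirical, 3))
print(f"accept rate {accepted / n:.3f} vs sum(min(p, q)) = {np.minimum(p, q).sum():.3f}")
```

The empirical frequencies track \(p\) even though every proposal came from \(q\). The acceptance rate equals the overlap \(\sum_i \min(p_i, q_i)\), which is why a draft that closely matches the target is worth more than a merely fast one.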
A draft model proposes \(K = 4\) tokens, each accepted independently with probability \(\alpha = 0.8\) (an acceptance rate, not the target distribution \(p\) of EQ 8.4). Expected tokens produced per target verify pass is \(\dfrac{1 - \alpha^{K+1}}{1 - \alpha}\). Evaluate it.
\(\alpha^{K+1} = 0.8^5 = 0.32768\). So \(\dfrac{1 - 0.32768}{1 - 0.8} = \dfrac{0.67232}{0.2} = \) 3.36 tokens per pass, a ~3.4× decode speedup before subtracting draft overhead.
PYTHON · RUNNABLE IN-BROWSER
# Speculative decoding simulator -- K=4 draft, accept p=0.8
import numpy as np
rng = np.random.default_rng(0)
p, K, rounds = 0.8, 4, 1000                  # p: per-token acceptance probability, not the target distribution of EQ 8.4

produced = 0
for _ in range(rounds):
    accepts = rng.random(K) < p              # draft tokens the target agrees with
    n_acc = K if accepts.all() else int(np.argmin(accepts))
    produced += n_acc + 1                    # accepted run + 1 (correction/bonus)

sim = produced / rounds
formula = (1 - p**(K + 1)) / (1 - p)
print(f"simulated tokens per target pass : {sim:.3f}")
print(f"closed form (1-p^(K+1))/(1-p)    : {formula:.3f}")
print(f"speedup vs one-token decode      : {sim:.2f}x (minus draft overhead)")
print("\non a reject the rest of the draft is discarded, but the target's")
print("own correction still lands -- a verify pass never yields under 1.")
INSTRUMENT 8.3 — SPECULATIVE DECODING, SIMULATED · DRAFT K=4 · VERIFY · CORRECT
Readouts: drafted · accepted · accept rate · tokens / target pass
Grey = drafted by the small model · mint = verified accepted · red flash = rejected (the rest of the draft is discarded) · deep green = the target model's own correction. Without speculation this sentence would cost one full forward pass per word.
8.6

The serving stack, assembled

FIG 8.A · ANATOMY OF AN LLM SERVICE
  • CLIENTS: streaming API
  • GATEWAY / ROUTER: auth · quotas · model select
  • INFERENCE ENGINE: continuous batching · PagedAttention · spec decode (vLLM · SGLang · TRT-LLM)
  • PREFILL POOL (compute-bound) and DECODE POOL (bandwidth-bound), linked by KV transfer (disaggregated)
  • OBSERVABILITY: TTFT · TPOT · goodput · cache hit-rate · per-token cost; autoscaling on queue depth + SLO burn
The 2026 default stack. Open engines (vLLM, SGLang, TensorRT-LLM, llama.cpp at the edge) implement everything in this chapter off the shelf; what remains proprietary at frontier labs is mostly scheduling policy, multi-region cache routing, and silicon-specific kernels.
Deployment tier | Typical engine | Model + precision | Defining constraint
Hyperscale API | proprietary / TRT-LLM | frontier MoE · FP8/FP4 | goodput per megawatt
Self-hosted cluster | vLLM · SGLang | open 7–700B · FP8/INT4 | data control, $/token
Workstation / edge | llama.cpp · Ollama · MLX | 1–70B · GGUF 4-bit | RAM + bandwidth (EQ 7.1)
On-device | Core ML / NNAPI runtimes | 1–3B · 4-bit + QAT | battery, thermals, privacy
NEXT

Everything so far described one dense transformer. The frontier no longer looks like that. Chapter 09: mixture-of-experts, million-token context, models that see and hear, agents that act — and what's still unsolved.

§

Further reading

  • Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). — paged KV cache; the basis of the modern serving stack.
  • Yu et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. — iteration-level (continuous) batching.
  • Holtzman, Buys, Du, Forbes & Choi (2020). The Curious Case of Neural Text Degeneration. — nucleus (top-p) sampling and why greedy decoding fails.
  • Leviathan, Kalman & Matias (2023). Fast Inference from Transformers via Speculative Decoding. — draft-and-verify decoding with exact output guarantees.
  • Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. — the parallel formulation of speculative decoding.
  • Pope et al. (2022). Efficiently Scaling Transformer Inference. — the prefill/decode cost model and the roofline view of serving.