09 · The Frontier — LLM Field Manual

9.1

Mixture-of-Experts: capacity without the bill

Chapter 02 noted that MLPs hold most parameters; MoE makes them conditional. Replace each MLP with $E$ parallel expert MLPs and a tiny router that sends every token to its top-$k$ experts:

EQ 9.1 — ROUTED EXPERT LAYER $$ y = \sum_{i \,\in\, \mathrm{TopK}(g)} g_i\, \mathrm{FFN}_i(x), \qquad g = \mathrm{softmax}\big( W_r\, x \big) $$

Only $k$ of $E$ experts run per token: parameters scale with $E$; FLOPs scale with $k$. Mixtral 8×7B: 47B total, 13B active. DeepSeek-V3: 256 fine-grained experts + 1 always-on shared expert, 8 routed — 671B total, 37B active. The same economics drive the strongly-rumored MoE backbones of current closed frontier models. Decode-time win (recall EQ 7.1): only active-expert weights stream per token.
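A minimal NumPy sketch of EQ 9.1, assuming toy two-matrix ReLU experts and made-up sizes (`T`, `d`, `d_ff` are illustrative, not taken from any real model). It follows the equation as written, using the full-softmax gate values without renormalizing over the selected $k$; some released models renormalize, which this sketch does not attempt.

```python
import numpy as np

def moe_layer(x, W_r, experts, k=2):
    """Routed expert layer (EQ 9.1) for a batch of tokens x with shape (T, d)."""
    logits = x @ W_r                                     # (T, E) router logits
    g = np.exp(logits - logits.max(1, keepdims=True))
    g /= g.sum(1, keepdims=True)                         # softmax over all E experts
    topk = np.argsort(g, axis=1)[:, -k:]                 # indices of the k largest gates
    y = np.zeros_like(x)
    for i, (W1, W2) in enumerate(experts):
        rows = np.where((topk == i).any(axis=1))[0]      # tokens routed to expert i
        if rows.size == 0:
            continue                                     # idle expert: no FLOPs spent
        hidden = np.maximum(x[rows] @ W1, 0.0)           # toy ReLU FFN (assumption)
        y[rows] += g[rows, i:i + 1] * (hidden @ W2)      # gate-weighted expert output
    return y, topk

rng = np.random.default_rng(0)
T, d, d_ff, E = 32, 16, 64, 8                            # illustrative sizes
x = rng.normal(0, 1, (T, d))
W_r = rng.normal(0, 0.4, (d, E))
experts = [(rng.normal(0, 0.1, (d, d_ff)), rng.normal(0, 0.1, (d_ff, d)))
           for _ in range(E)]

y, topk = moe_layer(x, W_r, experts, k=2)
print("output shape:", y.shape)
print("experts used by token 0:", sorted(topk[0].tolist()))
```

Setting `k=E` turns the layer into a dense gate-weighted mixture, which is a handy sanity check: the routed and dense results should then agree.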

An MoE layer has $E = 8$ experts and routes each token to its top-$k = 2$. What percent of the expert parameters are active per token?

Active fraction $= \dfrac{k}{E} = \dfrac{2}{8} = 0.25 = $ 25%. The other 75% sit idle for this token — capacity stored but not streamed (the decode-time win of EQ 7.1).
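The 25% figure applies to expert parameters only; attention, embeddings, and the router run for every token. A back-of-envelope count shows how that moves the whole-model ratio. The configuration below is a Mixtral-8×7B-style one quoted from memory of the public release, so treat every number as an assumption; normalization weights are ignored.

```python
# Assumed Mixtral-8x7B-style configuration (recalled, not verified here).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
n_heads, n_kv_heads, d_head, vocab = 32, 8, 128, 32000

expert = 3 * d_model * d_ff                               # SwiGLU FFN: gate, up, down
attn = d_model * d_head * (2 * n_heads + 2 * n_kv_heads)  # Q and O full, K and V grouped
router = d_model * n_experts
embed = 2 * vocab * d_model                               # untied input + output embeddings

shared = n_layers * (attn + router) + embed               # runs for every token
total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert

print(f"total  params: {total / 1e9:.1f}B")
print(f"active params: {active / 1e9:.1f}B")
print(f"active share of experts: {top_k / n_experts:.0%}, of whole model: {active / total:.1%}")
```

Under these assumptions the count lands close to the 47B total and 13B active quoted above, with the whole-model active share slightly above $k/E$ because of the always-on components.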

EQ 9.2 — LOAD BALANCING $$ \mathcal{L}_{\text{aux}} = \lambda\, E \sum_{i=1}^{E} f_i\, P_i \qquad \big(f_i = \text{fraction of tokens routed to } i,\;\; P_i = \text{mean router prob}\big) $$

Routers left alone collapse onto a few favorite experts, stranding the rest as dead weight. The auxiliary loss penalizes the dot product of realized load and intended probability — minimized when both are uniform. DeepSeek-V3 instead tunes a per-expert bias online (“aux-loss-free” balancing) to avoid distorting the main objective. Expert parallelism (§4.5) then spreads experts across GPUs, paying all-to-all communication per layer.
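Bias-based balancing can be sketched in a few lines. This is a toy in the spirit of the aux-loss-free idea, not DeepSeek-V3's published recipe: the sign-based update, the step size `gamma`, and the synthetic skewed logits are all assumptions. The bias only changes which experts are selected; gate weights would still come from the unbiased scores.

```python
import numpy as np

def topk_load(logits, bias, k):
    """Fraction of routing slots each expert gets when selection uses logits + bias."""
    top = np.argsort(logits + bias, axis=1)[:, -k:]
    return np.bincount(top.ravel(), minlength=logits.shape[1]) / top.size

rng = np.random.default_rng(0)
E, k, T, gamma = 8, 2, 512, 0.05                      # gamma: assumed bias step size
skew = np.array([2.5, 2.0, 0, 0, 0, 0, 0, 0.0])       # two artificially favored experts
logits = rng.normal(0, 1, (T, E)) + skew              # synthetic router logits

bias = np.zeros(E)
load_before = topk_load(logits, bias, k)
for _ in range(200):
    load = topk_load(logits, bias, k)
    bias -= gamma * np.sign(load - 1.0 / E)           # overloaded: lower bias; underloaded: raise
load_after = topk_load(logits, bias, k)

before = np.abs(load_before - 1.0 / E).max()
after = np.abs(load_after - 1.0 / E).max()
print(f"max load deviation from uniform: before {before:.3f}, after {after:.3f}")
```

No gradient flows through the bias, which is why this style of balancing leaves the main training objective untouched.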

The load-balancing loss is $\mathcal{L}_{\text{aux}} = \lambda E \sum_{i=1}^{E} f_i P_i$. With $\lambda = 1$, $E = 8$, and perfectly balanced routing ($f_i = P_i = \tfrac{1}{8}$ for all experts), what is $\mathcal{L}_{\text{aux}}$?

Each term $f_i P_i = \tfrac{1}{8}\cdot\tfrac{1}{8} = \tfrac{1}{64}$; summed over 8 experts $= \tfrac{8}{64} = \tfrac18$. Then $\mathcal{L}_{\text{aux}} = 1\cdot 8 \cdot \tfrac18 = $ 1, the balanced baseline. Because realized load $f_i$ tracks router probability $P_i$, concentrating tokens on a few experts pushes it above 1.

PYTHON · RUNNABLE IN-BROWSER

# Top-2 MoE router: load balance loss, fair vs biased (EQ 9.2)
import numpy as np
rng = np.random.default_rng(0)
E, k, T = 8, 2, 64
x  = rng.normal(0, 1, (T, 16))           # 64 token hidden states
Wr = rng.normal(0, 0.4, (16, E))         # router weights

def route(bias):
    g = np.exp(x @ Wr + bias)
    g /= g.sum(1, keepdims=True)         # softmax gates (EQ 9.1)
    top2 = np.argsort(g, 1)[:, -k:]      # route each token to its top-k (k = 2 here)
    f = np.bincount(top2.ravel().astype(np.intc), minlength=E) / (T * k)  # realized load
    P = g.mean(0)                        # mean router probability
    return f, E * np.sum(f * P)          # EQ 9.2 (lambda = 1)

for name, bias in [("fair  ", np.zeros(E)),
                   ("biased", np.array([2.5, 2.0, 0, 0, 0, 0, 0, 0.]))]:
    f, L = route(bias)
    print(f"{name} router  L_aux = {L:.3f}   (uniform ideal = 1.000)")
    print("   load/expert:", " ".join(f"{v:.2f}" for v in f))

print("\nthe biased router funnels most tokens toward two favourites; the")
print("f.P product rises and EQ 9.2's gradient pushes it back to uniform.")


INSTRUMENT 9.1 — TOP-2 ROUTER · 8 EXPERTS · LOAD ACCUMULATION (interactive panels: router probabilities g(x); cumulative expert load)

Each token's hidden state produces gate logits; the top-2 experts (mint) process it, weighted by their gate values. Watch the load bars: drift toward imbalance is exactly what EQ 9.2 exists to punish. Real experts specialize by token statistics — not by clean human topics.

9.2

Long context: the million-token problem

Context windows grew roughly 1,000× in four years (2K → 1M–10M claimed). Three fronts made it possible:

Positional extension. RoPE trained at 4K collapses at 32K — unseen rotation angles. Fixes rescale the spectrum: Position Interpolation compresses all frequencies; NTK-aware scaling raises the base $b$; YaRN interpolates per-frequency (fast dims untouched, slow dims stretched) plus an attention-temperature correction. Standard recipe: pre-train short → continue briefly at long context with scaled RoPE (Llama-3.1's base-500K + 800B long-context tokens).
Attention cost. The $O(T^2)$ attention term of prefill at $T = 10^6$ costs ~10⁶× what it does for a 1K prompt. Mitigations: FlashAttention (exact; it cuts memory traffic, not FLOPs), interleaved sliding-window layers, context parallelism (ring attention across GPUs), and learned sparse patterns (NSA-style) approaching $O(T)$.
KV memory. EQ 3.5 at 1M tokens is brutal — hence GQA/MLA, KV quantization, token eviction/compression heuristics, and tiered KV offload in serving stacks.
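The first two positional fixes can be compared numerically. The sketch below assumes the standard RoPE frequency schedule $\theta_i = b^{-2i/d}$, a head dimension of 128, and the commonly cited NTK-aware rule $b' = b\, s^{d/(d-2)}$; the names `rope_freqs`, `pos_interp`, and `ntk` are illustrative, and YaRN's per-frequency ramp and temperature correction are not implemented.

```python
import numpy as np

def rope_freqs(d_head, base=10_000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/d), fastest first."""
    return base ** (-np.arange(0, d_head, 2) / d_head)

d_head, s = 128, 8.0                      # assumed head dim; s = 32K / 4K extension factor
orig = rope_freqs(d_head)
pos_interp = orig / s                     # Position Interpolation: m -> m/s, i.e. every theta_i / s
ntk = rope_freqs(d_head, base=10_000.0 * s ** (d_head / (d_head - 2)))  # raised base

print("fastest dim  orig / PI / NTK:", orig[0], pos_interp[0], ntk[0])
print("slowest dim  orig / PI / NTK:", orig[-1], pos_interp[-1], ntk[-1])
print("largest angle, PI at 32K vs original at 4K:", 32_768 * pos_interp[0], 4_096 * orig[0])
```

Position Interpolation squeezes every frequency by $s$, so angles at 32K match those seen at 4K. The raised base leaves the fastest dimension untouched and stretches the slowest by exactly $s$, with a smooth gradient in between.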

PYTHON · RUNNABLE IN-BROWSER

# The price of context: KV + attention share at 8K / 128K / 1M
import numpy as np
L, H_kv, d_k, d, N = 80, 8, 128, 8192, 70e9   # 70B-class, GQA-8, fp16 KV

print("    T   | KV cache/seq | prefill PFLOPs | attention share")
for T in [8_192, 131_072, 1_048_576]:
    kv_gb = 2 * L * H_kv * d_k * T * 2 / 1e9  # K and V, 2 bytes each (EQ 3.5)
    lin   = 2 * N * T                         # weight matmuls
    attn  = 4 * L * d * T**2                  # QK^T + AV
    share = 100 * attn / (attn + lin)
    label = f"{T//1024}K" if T < 1e6 else "1M"
    print(f"  {label:>4}  | {kv_gb:9.1f} GB | {(lin+attn)/1e15:11.1f}    | {share:8.1f} %")

print("\nat 8K the quadratic term is a minor share (~13%). at 1M the KV cache")
print("alone outweighs four H100s and attention IS the forward pass --")
print("every technique in this section exists because of this table.")


INSTRUMENT 9.2 — THE PRICE OF CONTEXT · 70B-CLASS · GQA-8 · FP16 KV (interactive: context length T slider starting at 8K; readouts: KV cache / sequence, prefill compute, attention share of FLOPs)

At 8K the quadratic term is a minor share of prefill FLOPs (about 13% in this configuration); at 1M it dominates the entire forward pass (about 95%) and the KV cache alone (~344 GB) outweighs the fp16 weights (~140 GB). Every technique in this section exists because of what this slider does past 128K. (An SSM's state, for comparison: fixed at a few hundred MB regardless of T.)

Honest caveat: needle-in-a-haystack retrieval saturated long ago, but using a full window for reasoning still degrades — the “lost in the middle” effect and context-rot benchmarks show effective context lags advertised context. Long context complements rather than kills retrieval (RAG): selection is cheaper than attention.

9.3

Multimodality: everything becomes tokens

The transformer never cared that its tokens meant text. Modern frontier models are natively multimodal: one decoder attends over interleaved sequences of text tokens, image patches, audio frames, video.

EQ 9.3 — IMAGES AS TOKENS (ViT PATCHIFY) $$ x_{\text{img}} \in \mathbb{R}^{H \times W \times 3} \;\longrightarrow\; \Big\{ W_p\, \mathrm{vec}\big(\text{patch}_{16\times16}^{(j)}\big) \Big\}_{j=1}^{HW/256} \subset \mathbb{R}^{d_{\text{model}}} $$

Slice the image into 16×16 patches, flatten, project — each patch is now just another embedding in the sequence. A 1024×1024 image ≈ 4K tokens (hence image inputs' token pricing). Architectures differ in coupling: a pre-trained vision encoder bridged by a projector (LLaVA-style, cheap), cross-attention taps (Flamingo lineage), or early-fusion single-stack training on mixed data (the frontier default).

A $1024\times1024$ image is cut into $16\times16$ patches. How many patch tokens does it become?

Per side: $1024/16 = 64$ patches. Total: $64^2 = $ 4096 tokens — which is why a single high-res image can cost as much context as several pages of text.
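EQ 9.3 is mostly a reshape. A minimal NumPy sketch, assuming a random image, an untrained projection `W_p`, and a small illustrative `d_model = 64` (real models use a learned projection and add position information, both omitted here):

```python
import numpy as np

def patchify(img, p=16):
    """Cut an (H, W, C) image into non-overlapping p x p patches, one flattened row each."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "sketch assumes sides divisible by the patch size"
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)                                     # (num_patches, p*p*C)

rng = np.random.default_rng(0)
img = rng.random((1024, 1024, 3), dtype=np.float32)    # stand-in image
d_model = 64                                           # illustrative width (assumption)
W_p = rng.normal(0, 0.02, (16 * 16 * 3, d_model)).astype(np.float32)

patches = patchify(img)          # each row is vec(patch_j)
tokens = patches @ W_p           # row-vector form of W_p vec(patch_j)
print("patches:", patches.shape, " tokens:", tokens.shape)
```

The row count is the token count the decoder sees, so doubling the image side quadruples the sequence cost.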

Generation side: discrete image/audio tokens from learned codecs (VQ-VAE/RVQ descendants) let the same autoregressive machinery emit media; diffusion heads remain common for high-fidelity images.
Speech: native audio-to-audio loops (realtime APIs) replace the ASR→LLM→TTS pipeline, cutting latency below conversational thresholds.
Why it matters beyond features: vision grounds language in geometry and physics; computer-use agents (below) are impossible without reading screens.

9.4

Agents & tool use

An LLM that can only emit text is an oracle; given tools, it becomes an actor. The mechanics are disarmingly simple — the loop is the product:

# The agent loop: everything else is engineering around it
done = False
while not done:
    response = llm(system, history, tools)        # model may emit a tool call
    if response.tool_calls:
        results = execute(response.tool_calls)    # search, code, browser, files…
        history += [response, results]            # observations feed back in
    else:
        done = True                               # no tool call: final answer
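To make the loop concrete without a real model, here is a runnable toy. Everything in it is hypothetical scaffolding: `fake_llm` is a scripted stand-in for a model call, `TOOLS` is a one-entry registry, and `Response` is an invented container, none of them any vendor's API. It also adds a step cap, which a bare `while` loop lacks.

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    text: str = ""
    tool_calls: list = field(default_factory=list)    # [(tool_name, args), ...]

TOOLS = {"add": lambda a, b: a + b}                   # hypothetical tool registry

def fake_llm(history):
    """Scripted stand-in for a model: call a tool once, then answer from the observation."""
    observations = [h for h in history if isinstance(h, dict)]
    if not observations:
        return Response(tool_calls=[("add", (2, 3))])
    return Response(text=f"The sum is {observations[-1]['result']}.")

def run_agent(max_steps=5):
    history = []
    for _ in range(max_steps):                        # step budget guards against endless loops
        response = fake_llm(history)
        history.append(response)
        if not response.tool_calls:
            return response.text, history             # final answer
        for name, args in response.tool_calls:        # execute tools, feed results back
            history.append({"tool": name, "result": TOOLS[name](*args)})
    raise RuntimeError("step budget exhausted")

answer, history = run_agent()
print(answer)
print("turns in history:", len(history))
```

Swapping `fake_llm` for a real model call and `TOOLS` for sandboxed executors is where the engineering effort goes.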

Tool calling is trained, not prompted — post-training (Chapter 05) teaches the schema-constrained emission format; RLVR-style training on long-horizon tasks (SWE-bench-like environments) is the current capability driver.
Standardization: the Model Context Protocol (MCP) turned tool integration from N×M custom adapters into a USB-like interface — servers expose tools/resources, any model client consumes them.
Reasoning × acting compounds: thinking models that plan, act, observe, and revise (the ReAct pattern, now internalized) turn test-time compute into real-world task completion — coding agents being the proof case.
The hard parts are systemic: error compounding over long horizons (0.99⁵⁰ ≈ 0.6), sandboxing and permissioning, prompt injection from hostile content, and evaluation of open-ended tasks.

An agent completes each step correctly with probability $0.99$, and a task needs $50$ sequential steps with no recovery. What is the probability the whole task succeeds? ($0.99^{50}$.)

$0.99^{50} = $ 0.605. A 99%-reliable step still leaves a ~40% chance of failure across 50 of them — why long-horizon agents need verification, retries, and checkpoints rather than raw per-step accuracy.
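The effect of retries is easy to quantify under strong assumptions: attempts are independent, and a perfect verifier detects every failed step so it can be retried. Neither holds exactly in practice, so read the numbers as an upper bound on what verification buys. `task_success` is an illustrative helper name.

```python
def task_success(p_step, n_steps, retries=0):
    """P(all steps succeed) when each failed step can be retried up to `retries` times.

    Assumes independent attempts and a verifier that catches every failure.
    """
    p_eff = 1.0 - (1.0 - p_step) ** (retries + 1)     # step succeeds on any attempt
    return p_eff ** n_steps

for r in (0, 1):
    print(f"retries={r}: P(task succeeds) = {task_success(0.99, 50, r):.4f}")
```

A single verified retry per step lifts the 50-step success rate from about 0.605 to above 0.99, which is why verification is worth more than squeezing out another fraction of a percent of per-step accuracy.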

9.5

Beyond the transformer: SSMs and hybrids

The transformer's $O(T^2)$ attention and $O(T)$ cache are taxes, and state-space models offer an alternative: compress history into a fixed-size recurrent state. Mamba's selective SSM is the breakthrough form:

EQ 9.4 — SELECTIVE STATE-SPACE RECURRENCE (MAMBA) $$ h_t = \bar{A}(x_t)\, h_{t-1} + \bar{B}(x_t)\, x_t, \qquad y_t = C(x_t)\, h_t $$

A linear RNN whose transition matrices are functions of the input. This is the “selectivity” that lets it gate what to remember and forget (where older linear RNNs, with fixed transitions, failed), while remaining parallelizable for training via scan algorithms. Decode cost: $O(1)$ per token, zero KV cache.
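The recurrence in EQ 9.4 can be sketched in its sequential (decode-time) form. This is a simplified single-channel toy with a diagonal state, not Mamba's actual implementation: the softplus step size, the linear maps `w_B` and `w_C`, and all sizes are assumptions, and the parallel scan used in training is not shown.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, w_delta, w_B, w_C):
    """Sequential EQ 9.4 for one scalar channel with an N-dim diagonal state."""
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        delta = softplus(w_delta * x_t)       # input-dependent step size, always > 0
        A_bar = np.exp(delta * A)             # entries in (0, 1) because A < 0: the forget gate
        B_bar = delta * (w_B * x_t)           # input-dependent write vector
        h = A_bar * h + B_bar * x_t           # state update
        y[t] = (w_C * x_t) @ h                # readout with C(x_t)
    return y, h

rng = np.random.default_rng(0)
N = 16                                        # state size (assumption)
A = -np.arange(1, N + 1, dtype=float)         # fixed negative diagonal
w_delta = 0.5
w_B, w_C = rng.normal(0, 1, N), rng.normal(0, 1, N)

for T in (10, 1000):
    y, h = selective_scan(rng.normal(0, 1, T), A, w_delta, w_B, w_C)
    print(f"T={T:5d}: outputs {y.shape}, state size {h.size} floats")
```

The state stays at `N` floats whether the sequence has 10 tokens or 1000, which is the whole memory argument against a growing KV cache.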

Pure SSMs lag transformers on exact recall (copying a phone number from 50K tokens back is trivial for attention; a fixed-size compressed state cannot do it reliably). The convergent answer is hybrids: mostly SSM/linear-attention layers with a sparse sprinkling of full attention (Jamba, Zamba, recent efficiency-focused releases), giving most of the speed and most of the recall. Test-time-compute economics, not just architecture, will decide this race: cheap long generation matters most for reasoning models that think in tens of thousands of tokens.
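The memory side of the hybrid trade can be estimated by reusing the 70B-class configuration from Instrument 9.2 (80 layers, GQA-8, head dim 128, fp16 KV). The 1-in-8 attention ratio below is an assumption for illustration, and the fixed-size SSM state of the remaining layers is ignored.

```python
def kv_cache_gb(n_attn_layers, T, h_kv=8, d_k=128, bytes_per=2):
    """KV cache for one sequence: K and V, per attention layer, per KV head."""
    return 2 * n_attn_layers * h_kv * d_k * T * bytes_per / 1e9

T = 1_048_576
full = kv_cache_gb(80, T)            # every layer is full attention
hybrid = kv_cache_gb(80 // 8, T)     # assumed: 1 attention layer in every 8

print(f"full attention : {full:6.1f} GB")
print(f"1-in-8 hybrid  : {hybrid:6.1f} GB  ({full / hybrid:.0f}x smaller)")
```

The cache shrinks in proportion to the number of attention layers kept, while those few layers still provide an exact-lookup path over the whole context.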

9.6

Open problems

Problem	State of play
Hallucination	Structural, not a bug: sampling + imperfect knowledge ⇒ confident fabrication. Mitigations (RAG, citations, abstention training, verification loops) manage it; nothing eliminates it. Calibrated uncertainty remains open.
Interpretability	Sparse autoencoders decompose the residual stream into millions of monosemantic features; circuit tracing maps small behaviors end-to-end. Still far from auditing a frontier model's reasoning — the gap between “we can find features” and “we can certify behavior”.
Alignment under optimization pressure	Reward hacking, sycophancy, and (in lab settings) strategic deception scale with capability. Scalable oversight — supervising models smarter than the supervisor — is the live research front.
Continual learning	Weights freeze at deployment; the world doesn't. Today's patch — context + retrieval + agentic memory files — sidesteps rather than solves weight-space updating without forgetting.
Data & energy ceilings	High-quality human text is finite; synthetic data and RL-generated experience must carry growth. Gigawatt clusters make energy, cooling and capital the binding constraints as much as algorithms.
Evaluation	Benchmarks saturate or leak within months; the field leans on held-out private evals, arena preferences, and real-task completion rates — all gameable, none sufficient.

One family of generative models has been conspicuously absent — the one that paints, speaks, and increasingly drafts text in parallel. Chapter 10: diffusion, from the noise-reversal mathematics to masked diffusion language models — including a sandbox where you run real reverse diffusion in the browser.
