Mixture-of-Experts: capacity without the bill
Chapter 02 noted that MLPs hold most parameters; MoE makes them conditional. Replace each MLP with \(E\) parallel expert MLPs and a tiny router that sends every token to its top-\(k\):
# Top-2 MoE router: load balance loss, fair vs biased (EQ 9.2)
import numpy as np
rng = np.random.default_rng(0)
E, k, T = 8, 2, 64
x = rng.normal(0, 1, (T, 16)) # 64 token hidden states
Wr = rng.normal(0, 0.4, (16, E)) # router weights
def route(bias):
g = np.exp(x @ Wr + bias)
g /= g.sum(1, keepdims=True) # softmax gates (EQ 9.1)
top2 = np.argsort(g, 1)[:, -2:] # route each token to its top-2
f = np.bincount(top2.ravel().astype(np.intc), minlength=E) / (T * k) # realized load
P = g.mean(0) # mean router probability
return f, E * np.sum(f * P) # EQ 9.2 (lambda = 1)
for name, bias in [("fair ", np.zeros(E)),
("biased", np.array([2.5, 2.0, 0, 0, 0, 0, 0, 0.]))]:
f, L = route(bias)
print(f"{name} router L_aux = {L:.3f} (uniform ideal = 1.000)")
print(" load/expert:", " ".join(f"{v:.2f}" for v in f))
print("\nthe biased router funnels every token to two favourites; the")
print("f.P product rises and EQ 9.2's gradient pushes it back to uniform.")
Long context: the million-token problem
Context windows grew 1,000× in four years (2K → 1M–10M claimed). Three fronts made it possible:
- Positional extension. RoPE trained at 4K collapses at 32K — unseen rotation angles. Fixes rescale the spectrum: Position Interpolation compresses all frequencies; NTK-aware scaling raises the base \(b\); YaRN interpolates per-frequency (fast dims untouched, slow dims stretched) plus an attention-temperature correction. Standard recipe: pre-train short → continue briefly at long context with scaled RoPE (Llama-3.1's base-500K + 800B long-context tokens).
- Attention cost. \(O(T^2)\) prefill at \(T = 10^6\) is ~10⁶× a 1K prompt. Mitigations: FlashAttention (exact), interleaved sliding-window layers, context parallelism (ring attention across GPUs), and learned sparse patterns (NSA-style) approaching \(O(T)\).
- KV memory. EQ 3.5 at 1M tokens is brutal — hence GQA/MLA, KV quantization, token eviction/compression heuristics, and tiered KV offload in serving stacks.
# The price of context: KV + attention share at 8K / 128K / 1M
import numpy as np
L, H_kv, d_k, d, N = 80, 8, 128, 8192, 70e9 # 70B-class, GQA-8, fp16 KV
print(" T | KV cache/seq | prefill PFLOPs | attention share")
for T in [8_192, 131_072, 1_048_576]:
kv_gb = 2 * L * H_kv * d_k * T * 2 / 1e9 # K and V, 2 bytes each (EQ 3.5)
lin = 2 * N * T # weight matmuls
attn = 4 * L * d * T**2 # QK^T + AV
share = 100 * attn / (attn + lin)
label = f"{T//1024}K" if T < 1e6 else "1M"
print(f" {label:>4} | {kv_gb:9.1f} GB | {(lin+attn)/1e15:11.1f} | {share:8.1f} %")
print("\nat 8K the quadratic term is a rounding error. at 1M the KV cache")
print("alone outweighs four H100s and attention IS the forward pass --")
print("every technique in this section exists because of this table.")
Honest caveat: needle-in-a-haystack retrieval saturated long ago, but using a full window for reasoning still degrades — the “lost in the middle” effect and context-rot benchmarks show effective context lags advertised context. Long context complements rather than kills retrieval (RAG): selection is cheaper than attention.
Multimodality: everything becomes tokens
The transformer never cared that its tokens meant text. Modern frontier models are natively multimodal: one decoder attends over interleaved sequences of text tokens, image patches, audio frames, video.
- Generation side: discrete image/audio tokens from learned codecs (VQ-VAE/RVQ descendants) let the same autoregressive machinery emit media; diffusion heads remain common for high-fidelity images.
- Speech: native audio-to-audio loops (realtime APIs) replace the ASR→LLM→TTS pipeline, cutting latency below conversational thresholds.
- Why it matters beyond features: vision grounds language in geometry and physics; computer-use agents (below) are impossible without reading screens.
Agents & tool use
An LLM that can only emit text is an oracle; given tools, it becomes an actor. The mechanics are disarmingly simple — the loop is the product:
# The agent loop — everything else is engineering around it
while not done:
response = llm(system, history, tools) # model may emit a tool call
if response.tool_calls:
results = execute(response.tool_calls) # search, code, browser, files…
history += [response, results] # observations feed back in
else: done = True # final answer
- Tool calling is trained, not prompted — post-training (Chapter 05) teaches the schema-constrained emission format; RLVR-style training on long-horizon tasks (SWE-bench-like environments) is the current capability driver.
- Standardization: the Model Context Protocol (MCP) turned tool integration from N×M custom adapters into a USB-like interface — servers expose tools/resources, any model client consumes them.
- Reasoning × acting compounds: thinking models that plan, act, observe, and revise (the ReAct pattern, now internalized) turn test-time compute into real-world task completion — coding agents being the proof case.
- The hard parts are systemic: error compounding over long horizons (0.99⁵⁰ ≈ 0.6), sandboxing and permissioning, prompt injection from hostile content, and evaluation of open-ended tasks.
Beyond the transformer: SSMs and hybrids
The transformer's \(O(T^2)\) attention and \(O(T)\) cache are taxes, and state-space models offer an alternative: compress history into a fixed-size recurrent state. Mamba's selective SSM is the breakthrough form:
Pure SSMs lag transformers on exact recall (copy a phone number from 50K tokens back — attention does this trivially; a compressed state cannot). The convergent answer is hybrids: mostly SSM/linear-attention layers with a sparse sprinkling of full attention (Jamba, Zamba, recent efficiency-focused releases) — most of the speed, most of the recall. Related test-time-compute economics, not just architecture, will decide this race: cheap long generation matters most for reasoning models that think in tens of thousands of tokens.
Open problems
| Problem | State of play |
|---|---|
| Hallucination | Structural, not a bug: sampling + imperfect knowledge ⇒ confident fabrication. Mitigations (RAG, citations, abstention training, verification loops) manage it; nothing eliminates it. Calibrated uncertainty remains open. |
| Interpretability | Sparse autoencoders decompose the residual stream into millions of monosemantic features; circuit tracing maps small behaviors end-to-end. Still far from auditing a frontier model's reasoning — the gap between “we can find features” and “we can certify behavior”. |
| Alignment under optimization pressure | Reward hacking, sycophancy, and (in lab settings) strategic deception scale with capability. Scalable oversight — supervising models smarter than the supervisor — is the live research front. |
| Continual learning | Weights freeze at deployment; the world doesn't. Today's patch — context + retrieval + agentic memory files — sidesteps rather than solves weight-space updating without forgetting. |
| Data & energy ceilings | High-quality human text is finite; synthetic data and RL-generated experience must carry growth. Gigawatt clusters make energy, cooling and capital the binding constraints as much as algorithms. |
| Evaluation | Benchmarks saturate or leak within months; the field leans on held-out private evals, arena preferences, and real-task completion rates — all gameable, none sufficient. |
One family of generative models has been conspicuously absent — the one that paints, speaks, and increasingly drafts text in parallel. Chapter 10: diffusion, from the noise-reversal mathematics to masked diffusion language models — including a sandbox where you run real reverse diffusion in the browser.
Further reading
- Shazeer et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. — the sparse MoE layer modern flagships are built on.
- Fedus, Zoph & Shazeer (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. — top-1 routing and the engineering of MoE at scale.
- Radford et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). — the image–text alignment behind multimodal models.
- Alayrac et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. — fusing a vision encoder into a frozen LM.
- Yao et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. — the reason–act loop underlying tool-using agents.
- Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. — the leading state-space challenger to attention.