AI // ENCYCLOPEDIA / VOL II / ⌘ / CAPSTONE INDEX FINISH ↺
CAPSTONE / END-TO-END

The Full Stack

Ten chapters compress into two instruments. First, design a frontier model, where every slider invokes an equation you have already met, from Chinchilla's optimum to the KV-cache budget of the GPU it ships on. Then ride a single token through the whole machine, from raw text to the next sampled word.

MODEHANDS ON USESEQ 1.2 · 3.5 · 4.1 · 4.2 · 4.5 · 7.1 · 8.2 PREREQUISITECH 01–10 (OR COURAGE)
C.1

The lifecycle, on one screen

FIG C.AA MODEL'S LIFE — CHAPTERS MAPPED TO PIPELINE
DATA PIPELINE CH 04.1 · 15T tokens PRE-TRAINING CH 01–04 · 10²⁵⁺ FLOPs BASE MODEL capability, no manners POST-TRAINING CH 05 · SFT → RL → RLVR ASSISTANT + evals & red team ADAPT CH 06 · LoRA / QLoRA COMPRESS CH 07 · distill · quantize SERVE CH 08 · vLLM-class engine APPLICATION CH 09 · agents · tools · RAG DIFFUSION HEADS CH 10 · images · speech telemetry · preferences · new data the loop that trains the next generation SUBSTRATE — GPUs · HBM · interconnect · parallelism (CH 4.5) · rooflines (CH 8.1) every box above is ultimately a bandwidth negotiation
Read it twice. Left to right: one model's life. The mint return path: each generation's deployment telemetry, preference data and distilled outputs become the next generation's training set — the industry's actual flywheel.
C.2

The Forge: design a model

Six decisions take a model from thesis to dossier: how much compute, dense or sparse, how far past Chinchilla to push the data, how much context, what precision, and which silicon serves it. Everything downstream is arithmetic you now know.

PYTHON · RUNNABLE IN-BROWSER
# The Forge as one function: budget -> full model dossier
import numpy as np
E, A, B, al, be = 1.82, 482.0, 2085.4, 0.3478, 0.3658  # Chinchilla refit
C, f = 1e25, 4                       # 10^25 FLOPs, 4x over-trained (default)
BW, MEM = 3.35e12, 80e9              # H100: HBM bandwidth, capacity

Nopt = ((al*A)/(be*B))**(1/(al+be)) * (C/6)**(be/(al+be))   # EQ 4.2
N = Nopt / np.sqrt(f)                # over-train: shrink N, grow D
D = C / (6 * N)                      # EQ 4.1's budget C = 6ND
loss = E + A/N**al + B/D**be         # EQ 4.1 predicted loss

weights = N * 1                      # dense, FP8 = 1 byte/param
toks    = BW / weights               # EQ 7.1 single-stream ceiling
shards  = int(np.ceil(weights / (MEM * 0.9)))
kv_user = 2*96*8*128*2 * 131072      # 96 layers, GQA-8, fp16 KV, 128K ctx
users   = int((MEM*0.9*shards - weights) // kv_user)
gpu_h   = C / (0.45 * 989e12) / 3600 # H100-hours at 45% MFU

print(f"params N         : {N/1e9:.0f} B dense   tokens D : {D/1e12:.1f} T  ({D/N:.0f} tok/param)")
print(f"predicted loss   : {loss:.3f}")
print(f"training bill    : {gpu_h/1e6:.1f}M H100-hours  ~ ${gpu_h*2/1e6:.1f}M at $2/hr")
print(f"weights, FP8     : {weights/1e9:.0f} GB  ->  shard across {shards} x H100")
print(f"decode ceiling   : {toks:.0f} tok/s single-stream (EQ 7.1)")
print(f"KV / user @ 128K : {kv_user/1e9:.1f} GB  ->  {users} concurrent user(s)/node")
print("\nset Instrument C.1 to 10^25 / dense / FP8 / H100 and watch every")
print("number above reappear on the dossier. the whole stack is one chain.")
edits are live — break it on purpose
The compute budget for a dense model is \(C = 6ND\) (params \(N\), training tokens \(D\)). For \(N = 20\text{B}\) parameters and \(D = 2\text{T}\) tokens, what is the training compute \(C\) in FLOPs?
\(C = 6 \times (20\times10^{9}) \times (2\times10^{12}) = 6 \times 4\times10^{22} = \) 2.4e23 FLOPs. (At \(D/N = 100\) tokens/param this model is ~5× over the Chinchilla optimum of ~20.)
A \(100\text{B}\) dense model served in FP8 (1 byte/param) on an H100 (\(3.35\times10^{12}\) B/s). What is the single-stream decode ceiling (EQ 7.1)?
Weight bytes \(= 1 \times 100\times10^{9} = 10^{11}\). Ceiling \(= \dfrac{3.35\times10^{12}}{10^{11}} = \) 33.5 tok/s — the speed-of-light for one user before batching.
Training takes \(C = 10^{25}\) FLOPs on H100s peaking at \(989\times10^{12}\) FLOP/s, run at \(45\%\) MFU. How many H100-hours is that? \(\;\text{hours} = \dfrac{C}{0.45 \times 989\times10^{12} \times 3600}\).
Effective rate \(= 0.45 \times 989\times10^{12} = 4.45\times10^{14}\) FLOP/s. Seconds \(= \dfrac{10^{25}}{4.45\times10^{14}} = 2.25\times10^{10}\). Hours \(= \div 3600 = \) 6.24e6 H100-hours — about $12.5M at $2/hr.
INSTRUMENT C.1 — THE FORGEEQ 4.1 · 4.2 · 4.5 · 3.5 · 7.1 CHAINED
MODEL DOSSIER —
TOTAL PARAMS
ACTIVE / TOKEN
TRAINING TOKENS
TOKENS / PARAM
PREDICTED LOSS (EQ 4.1)
COMPUTE C
FLEET FOR 90-DAY RUN
TRAINING COMPUTE COST
WEIGHTS ON DISK
SINGLE-STREAM CEILING
KV CACHE / USER @ FULL CTX
CONCURRENT USERS / NODE
Try the classics: 10²² dense at 1× ≈ Chinchilla itself. 10²⁴·³ dense, 32× over-trained ≈ Llama-3-8B economics. 10²⁵·⁵ MoE 18:1, 128K, FP8 on B200 ≈ a 2025 frontier deployment. Then build something irresponsible — 1M context on an RTX 4090 — and read why it fails.
C.3

Token journey: one step of the loop

This is the entire manual in one breath: text becomes tokens (CH 01), tokens become vectors (01), attention mixes positions (03) and MLPs transform them (02) through every layer, the unembedding produces logits (01), the sampler chooses (08), and the choice rejoins the context for the next round. A toy bigram model plays the transformer's role — the plumbing is exactly real.

A model has \(L = 96\) layers, GQA with \(H_{kv} = 8\) KV heads, head dim \(d_k = 128\), fp16 KV (2 bytes). For one sequence at \(T = 131072\) tokens, what is the KV-cache size in GB? \(\;\text{bytes} = 2\cdot L\cdot H_{kv}\cdot d_k\cdot T\cdot 2\).
Per token, per sequence: \(2\cdot 96\cdot 8\cdot 128\cdot 2 = 393{,}216\) bytes. Times \(T = 131072\): \(393216 \times 131072 \approx 5.15\times10^{10}\) bytes \(= \) 51.5 GB — one 128K-token user nearly fills an entire 80 GB card with cache alone.
PYTHON · RUNNABLE IN-BROWSER
# Token journey in code: a bigram LM and the temperature dial
import numpy as np
LM = {
 "the":      {"robot": 2.2, "cat": 1.6, "gradient": 1.0, "moon": 0.4},
 "robot":    {"picked": 2.0, "saw": 1.0, "dropped": 0.6},
 "cat":      {"saw": 1.8, "chased": 1.2},
 "gradient": {"exploded": 1.5, "vanished": 1.5},
 "moon":     {"rose": 1.5},
 "picked":   {"up": 2.5},     "up":       {"the": 2.0},
 "saw":      {"the": 2.0},    "chased":   {"the": 2.0},
 "dropped":  {"the": 2.0},    "rose":     {"and": 1.5},
 "exploded": {"and": 1.5},    "vanished": {"and": 1.5},
 "and":      {"the": 2.0},
}

def generate(tau, seed=0):
    rng = np.random.default_rng(seed)
    seq = ["the"]
    for _ in range(15):
        nxt, z = zip(*LM[seq[-1]].items())
        p = np.exp(np.array(z) / tau); p /= p.sum()   # softmax(z/tau)
        seq.append(str(rng.choice(list(nxt), p=p)))   # sample, append, repeat
    return " ".join(seq)

print("tau = 0.3 :", generate(0.3))
print("tau = 1.5 :", generate(1.5))
print("\nscore successors -> softmax(z/tau) -> sample -> append: the exact")
print("loop of Instrument C.2, and of every serving GPU on earth tonight.")
edits are live — break it on purpose
INSTRUMENT C.2 — TOKEN JOURNEYONE FORWARD PASS, ANIMATED
CONTEXT (GREY = PROMPT · MINT = GENERATED · GLOW = ATTENTION FROM CURRENT POSITION)
NEXT-TOKEN DISTRIBUTION (TOP-8)
Press STEP and watch the stage strip — that exact sequence, repeated per token, is what burns the world's GPU fleets. AUTO runs until a period. Crank temperature to 2.5 and watch the toy model hallucinate; drop to 0.1 and it turns into a determinist.
C.4

Where to go next

If you want…Read / build
The primary sourcesAttention Is All You Need (2017) · GPT-3 (2020) · Chinchilla (2022) · InstructGPT (2022) · LoRA (2021) · FlashAttention (2022) · DPO (2023) · DeepSeek-V3 / R1 reports (2024–25) · DDPM (2020)
To build oneKarpathy's Neural Networks: Zero to Hero and nanoGPT/nanochat — train a real (small) GPT end-to-end, then re-read Chapter 04 and feel it.
To serve onevLLM or SGLang on any open-weight model; watch your own TTFT/TPOT dashboards reproduce Chapter 08.
To adapt oneQLoRA via the PEFT/TRL stack or Unsloth; budget a weekend and follow the Chapter 06 recipe literally.
To look inside oneThe mechanistic-interpretability literature: induction heads, superposition, sparse autoencoders, circuit tracing.
END

You now hold the full pipeline: tokens → embeddings → attention in a residual stream (01–03), shaped by data and compute under scaling laws (04), aligned into an assistant (05), adapted (06), compressed (07), served at scale (08), pushed by the frontier (09), and flanked by diffusion's parallel world (10). Re-open any chapter from the index — the instruments don't mind being played twice.