Capstone · The Full Stack — LLM Field Manual

C.1

The lifecycle, on one screen

FIG C.A · A MODEL'S LIFE — CHAPTERS MAPPED TO PIPELINE

Read it twice. Left to right: one model's life. The mint return path: each generation's deployment telemetry, preference data and distilled outputs become the next generation's training set — the industry's actual flywheel.

C.2

The Forge: design a model

Six decisions take a model from thesis to dossier: how much compute, dense or sparse, how far past Chinchilla to push the data, how much context, what precision, and which silicon serves it. Everything downstream is arithmetic you now know.

PYTHON · RUNNABLE IN-BROWSER

# The Forge as one function: budget -> full model dossier
import numpy as np
E, A, B, al, be = 1.82, 482.0, 2085.4, 0.3478, 0.3658  # Chinchilla refit
C, f = 1e25, 4                       # 10^25 FLOPs, 4x over-trained (default)
BW, MEM = 3.35e12, 80e9              # H100: HBM bandwidth, capacity

Nopt = ((al*A)/(be*B))**(1/(al+be)) * (C/6)**(be/(al+be))   # EQ 4.2
N = Nopt / np.sqrt(f)                # over-train: shrink N, grow D
D = C / (6 * N)                      # EQ 4.1's budget C = 6ND
loss = E + A/N**al + B/D**be         # EQ 4.1 predicted loss

weights = N * 1                      # dense, FP8 = 1 byte/param
toks    = BW / weights               # EQ 7.1 single-stream ceiling
shards  = int(np.ceil(weights / (MEM * 0.9)))
kv_user = 2*96*8*128*2 * 131072      # 96 layers, GQA-8, fp16 KV, 128K ctx
users   = int((MEM*0.9*shards - weights) // kv_user)
gpu_h   = C / (0.45 * 989e12) / 3600 # H100-hours at 45% MFU

print(f"params N         : {N/1e9:.0f} B dense   tokens D : {D/1e12:.1f} T  ({D/N:.0f} tok/param)")
print(f"predicted loss   : {loss:.3f}")
print(f"training bill    : {gpu_h/1e6:.1f}M H100-hours  ~ ${gpu_h*2/1e6:.1f}M at $2/hr")
print(f"weights, FP8     : {weights/1e9:.0f} GB  ->  shard across {shards} x H100")
print(f"decode ceiling   : {toks:.0f} tok/s single-stream (EQ 7.1)")
print(f"KV / user @ 128K : {kv_user/1e9:.1f} GB  ->  {users} concurrent user(s)/node")
print("\nset Instrument C.1 to 10^25 / 4x over-train / dense / 128K / FP8 / H100 and watch every")
print("number above reappear on the dossier. the whole stack is one chain.")

The compute budget for a dense model is $C = 6ND$ (params $N$, training tokens $D$). For $N = 20\text{B}$ parameters and $D = 2\text{T}$ tokens, what is the training compute $C$ in FLOPs?

$C = 6 \times (20\times10^{9}) \times (2\times10^{12}) = 6 \times 4\times10^{22} = $ 2.4e23 FLOPs. (At $D/N = 100$ tokens/param this model is ~5× over the Chinchilla optimum of ~20.)
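The same identity can be run backwards: fix a budget and a tokens-per-parameter ratio, and $N$ and $D$ fall out. The sketch below assumes nothing beyond $C = 6ND$; the helper name `split_budget` is illustrative and does not come from the manual.

```python
import math

def split_budget(C, tokens_per_param):
    """Solve C = 6*N*D with D = r*N for N and D, where r is tokens per parameter."""
    N = math.sqrt(C / (6.0 * tokens_per_param))  # C = 6*r*N^2  ->  N = sqrt(C / (6r))
    D = tokens_per_param * N
    return N, D

# Round-trip the worked example: 2.4e23 FLOPs at 100 tokens/param
N, D = split_budget(2.4e23, 100)
print(f"N = {N/1e9:.1f} B params, D = {D/1e12:.2f} T tokens")
```

Feeding in the ~20 tokens/param ratio instead shows how much larger the compute-optimal model would be for the same budget.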

A $100\text{B}$ dense model served in FP8 (1 byte/param) on an H100 ($3.35\times10^{12}$ B/s). What is the single-stream decode ceiling (EQ 7.1)?

Weight bytes $= 1 \times 100\times10^{9} = 10^{11}$. Ceiling $= \dfrac{3.35\times10^{12}}{10^{11}} = $ 33.5 tok/s — the speed-of-light for one user before batching.
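To see how serve precision moves that ceiling, here is a minimal sketch of the same bandwidth-over-bytes ratio at three weight widths. It assumes decode is purely weight-bandwidth-bound and ignores KV-cache reads and whether the weights fit on one card; `decode_ceiling` is a hypothetical helper name.

```python
def decode_ceiling(params, bytes_per_param, bandwidth):
    """Upper bound on single-stream tokens/s: every weight byte is read once per token."""
    return bandwidth / (params * bytes_per_param)

H100_BW = 3.35e12  # bytes/s, the HBM figure used in the question above

# Byte widths per parameter: FP16 = 2, FP8 = 1, INT4 = 0.5
for name, width in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {decode_ceiling(100e9, width, H100_BW):.1f} tok/s")
```

Halving the bytes per parameter doubles the bound, which is why precision is one of the six design decisions.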

Training takes $C = 10^{25}$ FLOPs on H100s peaking at $989\times10^{12}$ FLOP/s, run at $45\%$ MFU. How many H100-hours is that? $\;\text{hours} = \dfrac{C}{0.45 \times 989\times10^{12} \times 3600}$.

Effective rate $= 0.45 \times 989\times10^{12} = 4.45\times10^{14}$ FLOP/s. Seconds $= \dfrac{10^{25}}{4.45\times10^{14}} = 2.25\times10^{10}$. Hours $= \dfrac{2.25\times10^{10}}{3600} \approx $ 6.24e6 H100-hours, or about \$12.5M at \$2/hr.
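GPU-hours convert directly into a fleet size once a wall-clock target is fixed. The sketch below assumes a 90-day run (the window named on the Forge dossier), perfect availability and no restarts; `fleet_size` is an illustrative name.

```python
import math

def fleet_size(C, days, peak_flops=989e12, mfu=0.45):
    """Return (GPU-hours, GPUs needed) to spend C FLOPs within `days` of wall-clock time."""
    gpu_hours = C / (mfu * peak_flops) / 3600   # same arithmetic as the worked answer
    gpus = math.ceil(gpu_hours / (days * 24))   # spread those hours over the window
    return gpu_hours, gpus

gpu_hours, gpus = fleet_size(1e25, days=90)
print(f"{gpu_hours/1e6:.2f}M H100-hours -> about {gpus} H100s for 90 days")
```

Real fleets need headroom for failures and checkpoint restarts, so treat the result as a floor.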

INSTRUMENT C.1 — THE FORGE · EQ 4.1 · 4.2 · 4.5 · 3.5 · 7.1 CHAINED

CONTROLS: TRAINING COMPUTE C (10^25 FLOPs) · OVER-TRAIN DIAL (1×, CHINCHILLA-OPTIMAL) · ARCHITECTURE · CONTEXT · SERVE PRECISION · SERVE HARDWARE

MODEL DOSSIER: TOTAL PARAMS · ACTIVE / TOKEN · TRAINING TOKENS · TOKENS / PARAM · PREDICTED LOSS (EQ 4.1) · COMPUTE C · FLEET FOR 90-DAY RUN · TRAINING COMPUTE COST · WEIGHTS ON DISK · SINGLE-STREAM CEILING · KV CACHE / USER @ FULL CTX · CONCURRENT USERS / NODE

Try the classics: 10²² dense at 1× ≈ Chinchilla itself. 10²⁴·³ dense, 32× over-trained ≈ Llama-3-8B economics. 10²⁵·⁵ MoE 18:1, 128K, FP8 on B200 ≈ a 2025 frontier deployment. Then build something irresponsible — 1M context on an RTX 4090 — and read why it fails.
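The 1M-context-on-a-4090 failure can be previewed with one division. This sketch assumes the same model shape as the Forge code (96 layers, GQA with 8 KV heads, head dim 128, fp16 KV) and a 24 GB card, and it generously gives the whole card to the cache with nothing reserved for weights; `max_context` is a hypothetical helper.

```python
def max_context(mem_bytes, layers=96, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Largest token count whose KV cache fits in mem_bytes (weights ignored)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V, all layers
    return int(mem_bytes // per_token), per_token

RTX_4090_MEM = 24e9   # bytes of VRAM on an RTX 4090
tokens, per_token = max_context(RTX_4090_MEM)
print(f"{per_token} bytes/token -> at most {tokens} tokens of cache in 24 GB")
print(f"1M tokens would need {per_token * 2**20 / 1e9:.0f} GB")
```

Even before loading a single weight, the card holds only a small fraction of the cache a 1M-token context would need.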

C.3

Token journey: one step of the loop

This is the entire manual in one breath: text becomes tokens (CH 01), tokens become vectors (01), attention mixes positions (03) and MLPs transform them (02) through every layer, the unembedding produces logits (01), the sampler chooses (08), and the choice rejoins the context for the next round. A toy bigram model plays the transformer's role — the plumbing is exactly real.
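For readers who want the stages spelled out with real tensor shapes, here is a minimal sketch of one pass. Everything in it is an assumption made for illustration: a five-word vocabulary, random untrained weights, a single layer with a single attention head, no layer norm, and an unembedding tied to the embedding table. The output is gibberish; only the order of operations matters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "robot", "cat", "saw", "."]      # toy vocabulary (hypothetical)
V, d = len(vocab), 8                             # vocab size, model width

E = rng.normal(size=(V, d)) * 0.5                # embedding table
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.3           # MLP up-projection
W2 = rng.normal(size=(4 * d, d)) * 0.3           # MLP down-projection

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def step(ids, tau=0.9):
    x = E[ids]                                   # tokens -> vectors, shape (T, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                # (T, T) attention scores
    scores[np.triu_indices(len(ids), k=1)] = -np.inf   # causal mask: no peeking ahead
    x = x + softmax(scores) @ v                  # attention mixes positions (residual add)
    x = x + np.maximum(x @ W1, 0) @ W2           # ReLU MLP transforms each position
    logits = x[-1] @ E.T                         # unembed the last position (tied weights)
    p = softmax(logits / tau)                    # temperature-scaled distribution
    return int(rng.choice(V, p=p)), p            # sample the next token id

ids = [0, 1]                                     # prompt: "the robot"
for _ in range(4):
    nxt, p = step(ids)
    ids.append(nxt)                              # the choice rejoins the context
print(" ".join(vocab[i] for i in ids))
```

A production model repeats the attention and MLP lines once per layer and caches `k` and `v` instead of recomputing them, but the loop around them is the same.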

A model has $L = 96$ layers, GQA with $H_{kv} = 8$ KV heads, head dim $d_k = 128$, fp16 KV (2 bytes). For one sequence at $T = 131072$ tokens, what is the KV-cache size in GB? $\;\text{bytes} = 2\cdot L\cdot H_{kv}\cdot d_k\cdot T\cdot 2$.

Per token, per sequence: $2\cdot 96\cdot 8\cdot 128\cdot 2 = 393{,}216$ bytes. Times $T = 131072$: $393216 \times 131072 \approx 5.15\times10^{10}$ bytes $= $ 51.5 GB. One 128K-token user's cache alone takes about 64% of an 80 GB card.
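Because the cache scales linearly with $H_{kv}$, the number of KV heads is the cheapest dial on that total. The sketch below reuses the formula from the question; the 64-head and 1-head rows are hypothetical comparison points (a full multi-head layout and a multi-query layout), not configurations taken from the manual.

```python
def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """K and V (the leading factor 2) for every layer, KV head and position, in GB."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

# Only the 8-head row matches the worked example; 64 and 1 are illustrative.
for kv_heads in (64, 8, 1):
    size = kv_cache_gb(96, kv_heads, 128, 131072)
    print(f"H_kv = {kv_heads:>2}: {size:7.1f} GB per 128K-token sequence")
```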

PYTHON · RUNNABLE IN-BROWSER

# Token journey in code: a bigram LM and the temperature dial
import numpy as np
LM = {
 "the":      {"robot": 2.2, "cat": 1.6, "gradient": 1.0, "moon": 0.4},
 "robot":    {"picked": 2.0, "saw": 1.0, "dropped": 0.6},
 "cat":      {"saw": 1.8, "chased": 1.2},
 "gradient": {"exploded": 1.5, "vanished": 1.5},
 "moon":     {"rose": 1.5},
 "picked":   {"up": 2.5},     "up":       {"the": 2.0},
 "saw":      {"the": 2.0},    "chased":   {"the": 2.0},
 "dropped":  {"the": 2.0},    "rose":     {"and": 1.5},
 "exploded": {"and": 1.5},    "vanished": {"and": 1.5},
 "and":      {"the": 2.0},
}

def generate(tau, seed=0):
    rng = np.random.default_rng(seed)
    seq = ["the"]
    for _ in range(15):
        nxt, z = zip(*LM[seq[-1]].items())
        p = np.exp(np.array(z) / tau); p /= p.sum()   # softmax(z/tau)
        seq.append(str(rng.choice(list(nxt), p=p)))   # sample, append, repeat
    return " ".join(seq)

print("tau = 0.3 :", generate(0.3))
print("tau = 1.5 :", generate(1.5))
print("\nscore successors -> softmax(z/tau) -> sample -> append: the exact")
print("loop of Instrument C.2, and of every serving GPU on earth tonight.")

INSTRUMENT C.2 — TOKEN JOURNEY · ONE FORWARD PASS, ANIMATED

CONTROLS: PROMPT · TEMPERATURE (0.90)

DISPLAYS: CONTEXT (GREY = PROMPT · MINT = GENERATED · GLOW = ATTENTION FROM CURRENT POSITION) · NEXT-TOKEN DISTRIBUTION (TOP-8)

Press STEP and watch the stage strip — that exact sequence, repeated per token, is what burns the world's GPU fleets. AUTO runs until a period. Crank temperature to 2.5 and watch the toy model hallucinate; drop to 0.1 and it turns into a determinist.
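The temperature effect can be put in numbers without the animation. This sketch takes the four successor scores of "the" from the bigram table in this section and prints the distribution and its entropy at the three temperatures mentioned; `temp_softmax` is an illustrative helper name.

```python
import numpy as np

# Scores for the successors of "the": robot, cat, gradient, moon
z = np.array([2.2, 1.6, 1.0, 0.4])

def temp_softmax(z, tau):
    """softmax(z / tau), with the max subtracted for numerical stability."""
    s = z / tau
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

for tau in (0.1, 0.9, 2.5):
    p = temp_softmax(z, tau)
    entropy = -(p * np.log2(p)).sum()            # bits; 2.0 would be uniform over 4
    print(f"tau = {tau}: p = {np.round(p, 3)}  entropy = {entropy:.2f} bits")
```

At low temperature nearly all probability sits on the top-scoring token; at high temperature the distribution approaches uniform, which is the behavior the instrument animates.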

C.4

Where to go next

If you want…	Read / build
The primary sources	Attention Is All You Need (2017) · GPT-3 (2020) · Chinchilla (2022) · InstructGPT (2022) · LoRA (2021) · FlashAttention (2022) · DPO (2023) · DeepSeek-V3 / R1 reports (2024–25) · DDPM (2020)
To build one	Karpathy's Neural Networks: Zero to Hero and nanoGPT/nanochat — train a real (small) GPT end-to-end, then re-read Chapter 04 and feel it.
To serve one	vLLM or SGLang on any open-weight model; watch your own TTFT/TPOT dashboards reproduce Chapter 08.
To adapt one	QLoRA via the PEFT/TRL stack or Unsloth; budget a weekend and follow the Chapter 06 recipe literally.
To look inside one	The mechanistic-interpretability literature: induction heads, superposition, sparse autoencoders, circuit tracing.

END

You now hold the full pipeline: tokens → embeddings → attention in a residual stream (01–03), shaped by data and compute under scaling laws (04), aligned into an assistant (05), adapted (06), compressed (07), served at scale (08), pushed by the frontier (09), and flanked by diffusion's parallel world (10). Re-open any chapter from the index — the instruments don't mind being played twice.