The lifecycle, on one screen
The Forge: design a model
Six decisions take a model from thesis to dossier: how much compute, dense or sparse, how far past Chinchilla to push the data, how much context, what precision, and which silicon serves it. Everything downstream is arithmetic you now know.
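For reference, the compute-optimal size that the function below labels EQ 4.2 follows from minimizing the predicted loss under the fixed budget C = 6ND. This is a short derivation sketch using the same symbols as the code (with α for `al` and β for `be`), not a quotation of the chapter's own wording:

```latex
\begin{aligned}
L(N) &= E + A\,N^{-\alpha} + B\left(\frac{C}{6N}\right)^{-\beta}
      && \text{substitute } D = C/(6N) \\
\frac{dL}{dN} &= -\alpha A\,N^{-\alpha-1}
      + \beta B\left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0 \\
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta} \\
N_{\text{opt}} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
                  \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}
\end{aligned}
```

Over-training by a factor f then moves off this optimum on purpose: N shrinks by √f and D grows by √f, so the tokens-per-parameter ratio D/N rises by f while C stays fixed.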
```python
# The Forge as one function: budget -> full model dossier
import numpy as np

E, A, B, al, be = 1.82, 482.0, 2085.4, 0.3478, 0.3658  # Chinchilla refit
C, f = 1e25, 4                   # 10^25 FLOPs, 4x over-trained (default)
BW, MEM = 3.35e12, 80e9          # H100: HBM bandwidth, capacity

Nopt = ((al*A)/(be*B))**(1/(al+be)) * (C/6)**(be/(al+be))  # EQ 4.2
N = Nopt / np.sqrt(f)            # over-train: shrink N, grow D
D = C / (6 * N)                  # EQ 4.1's budget C = 6ND
loss = E + A/N**al + B/D**be     # EQ 4.1 predicted loss

weights = N * 1                  # dense, FP8 = 1 byte/param
toks = BW / weights              # EQ 7.1 single-stream ceiling
shards = int(np.ceil(weights / (MEM * 0.9)))
kv_user = 2*96*8*128*2 * 131072  # 96 layers, GQA-8, fp16 KV, 128K ctx
users = int((MEM*0.9*shards - weights) // kv_user)
gpu_h = C / (0.45 * 989e12) / 3600  # H100-hours at 45% MFU

print(f"params N : {N/1e9:.0f} B dense tokens D : {D/1e12:.1f} T ({D/N:.0f} tok/param)")
print(f"predicted loss : {loss:.3f}")
print(f"training bill : {gpu_h/1e6:.1f}M H100-hours ~ ${gpu_h*2/1e6:.1f}M at $2/hr")
print(f"weights, FP8 : {weights/1e9:.0f} GB -> shard across {shards} x H100")
print(f"decode ceiling : {toks:.0f} tok/s single-stream (EQ 7.1)")
print(f"KV / user @ 128K : {kv_user/1e9:.1f} GB -> {users} concurrent user(s)/node")
print("\nset Instrument C.1 to 10^25 / dense / FP8 / H100 and watch every")
print("number above reappear on the dossier. the whole stack is one chain.")
```
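Of the six decisions, the over-training factor `f` is the one that most directly trades training loss against serving cost. As a rough sketch, the sweep below reuses the same refit constants and the same C = 6ND budget to show how N, D, predicted loss, and the single-stream decode ceiling move as `f` goes from 1 (compute-optimal) to 16. The helper name `forge_point` is hypothetical, and the FP8 one-byte-per-parameter and H100 bandwidth figures are assumptions carried over from the block above.

```python
import numpy as np

# Same constants as the Forge function above (assumed, not re-derived here).
E, A, B, al, be = 1.82, 482.0, 2085.4, 0.3478, 0.3658
BW = 3.35e12  # H100 HBM bandwidth, bytes/s

def forge_point(C, f):
    """Hypothetical helper: (N, D, loss, decode ceiling) for budget C, over-training f."""
    n_opt = ((al * A) / (be * B)) ** (1 / (al + be)) * (C / 6) ** (be / (al + be))
    n = n_opt / np.sqrt(f)            # shrink the model by sqrt(f)
    d = C / (6 * n)                   # spend the rest of the budget on tokens
    loss = E + A / n**al + B / d**be  # predicted loss at this (N, D)
    toks = BW / n                     # FP8 assumption: 1 byte per parameter
    return n, d, loss, toks

factors = (1, 4, 16)
rows = [forge_point(1e25, f) for f in factors]
for f, (n, d, loss, toks) in zip(factors, rows):
    print(f"f={f:>2}  N={n/1e9:6.0f} B  D={d/1e12:5.1f} T  "
          f"loss={loss:.4f}  decode<={toks:5.1f} tok/s")
```

Reading down the rows, the loss column rises slowly while the decode ceiling rises in proportion to √f, which is the usual argument for over-training a model that will be served heavily.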
Token journey: one step of the loop
This is the entire manual in one breath: text becomes tokens (CH 01), tokens become vectors (01), attention mixes positions (03) and MLPs transform them (02) through every layer, the unembedding produces logits (01), the sampler chooses (08), and the choice rejoins the context for the next round. A toy bigram model stands in for the transformer; the loop around it (score, softmax, sample, append) is the real one.
```python
# Token journey in code: a bigram LM and the temperature dial
import numpy as np

LM = {
    "the": {"robot": 2.2, "cat": 1.6, "gradient": 1.0, "moon": 0.4},
    "robot": {"picked": 2.0, "saw": 1.0, "dropped": 0.6},
    "cat": {"saw": 1.8, "chased": 1.2},
    "gradient": {"exploded": 1.5, "vanished": 1.5},
    "moon": {"rose": 1.5},
    "picked": {"up": 2.5}, "up": {"the": 2.0},
    "saw": {"the": 2.0}, "chased": {"the": 2.0},
    "dropped": {"the": 2.0}, "rose": {"and": 1.5},
    "exploded": {"and": 1.5}, "vanished": {"and": 1.5},
    "and": {"the": 2.0},
}

def generate(tau, seed=0):
    rng = np.random.default_rng(seed)
    seq = ["the"]
    for _ in range(15):
        nxt, z = zip(*LM[seq[-1]].items())
        p = np.exp(np.array(z) / tau)                # softmax(z/tau)
        p /= p.sum()
        seq.append(str(rng.choice(list(nxt), p=p)))  # sample, append, repeat
    return " ".join(seq)

print("tau = 0.3 :", generate(0.3))
print("tau = 1.5 :", generate(1.5))
print("\nscore successors -> softmax(z/tau) -> sample -> append: the exact")
print("loop of Instrument C.2, and of every serving GPU on earth tonight.")
```
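To see why the two temperatures produce different text, it helps to look at the distribution rather than the samples. The sketch below takes the successor scores for "the" from the toy model and prints the softmax probabilities and their entropy at a few values of tau. The helper names `softmax_t` and `entropy_bits` are illustrative, and subtracting the maximum before exponentiating is a standard stability trick added here, not something the loop above does.

```python
import numpy as np

# Logits for the successors of "the", copied from the toy LM above.
logits = {"robot": 2.2, "cat": 1.6, "gradient": 1.0, "moon": 0.4}

def softmax_t(z, tau):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_bits(p):
    """Shannon entropy of a probability vector, in bits."""
    return float(-(p * np.log2(p)).sum())

z = list(logits.values())
for tau in (0.3, 1.0, 1.5):
    p = softmax_t(z, tau)
    probs = "  ".join(f"{w}={q:.2f}" for w, q in zip(logits, p))
    print(f"tau={tau}: {probs}  H={entropy_bits(p):.2f} bits")
```

Low tau piles probability onto the top-scoring successor, so generation is close to greedy; high tau flattens the distribution, so rarer successors such as "moon" get sampled more often.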
Where to go next
| If you want… | Read / build |
|---|---|
| The primary sources | Attention Is All You Need (2017) · GPT-3 (2020) · Chinchilla (2022) · InstructGPT (2022) · LoRA (2021) · FlashAttention (2022) · DPO (2023) · DeepSeek-V3 / R1 reports (2024–25) · DDPM (2020) |
| To build one | Karpathy's Neural Networks: Zero to Hero and nanoGPT/nanochat — train a real (small) GPT end-to-end, then re-read Chapter 04 and feel it. |
| To serve one | vLLM or SGLang on any open-weight model; watch your own TTFT/TPOT dashboards reproduce Chapter 08. |
| To adapt one | QLoRA via the PEFT/TRL stack or Unsloth; budget a weekend and follow the Chapter 06 recipe literally. |
| To look inside one | The mechanistic-interpretability literature: induction heads, superposition, sparse autoencoders, circuit tracing. |
You now hold the full pipeline: tokens → embeddings → attention in a residual stream (01–03), shaped by data and compute under scaling laws (04), aligned into an assistant (05), adapted (06), compressed (07), served at scale (08), pushed by the frontier (09), and flanked by diffusion's parallel world (10). Re-open any chapter from the index — the instruments don't mind being played twice.