Data: the curriculum of the internet
Frontier runs consume 10–20 trillion tokens. Raw web crawl is mostly unusable; the pipeline that refines it is among the most guarded IP in the industry. The canonical stages:
- Extraction. HTML → text (boilerplate, navigation, ads stripped). Quality of this step alone moves benchmarks.
- Language ID & heuristic filters. Drop documents failing length, symbol-ratio, repetition and word-list tests (C4/Gopher rules).
- Deduplication. Exact (hashing) and near-dup (MinHash / LSH over shingles). Duplicates waste compute and amplify memorization.
- Model-based quality filtering. Classifiers trained to recognize “textbook-like” or high-utility pages now gate the majority of what survives (the FineWeb-Edu pattern).
- Mixing. The final recipe weights sources — web, code, math, papers, books, multilingual — and typically ends with a midtraining / annealing phase that up-weights the highest-quality and long-context data at low learning rate.
- Synthetic data. Increasingly, strong models generate or rewrite training text for weaker successors and specialized phases — with care, since uncurated self-training degrades distributions.
Data quality buys more than data quantity. Identical architectures separated only by corpus curation differ by the equivalent of a 2–5× compute multiplier. The “data wall” debate is really a question of how much refinable raw material and synthetic generation remain.
Scaling laws: how big, how long
Loss falls as a smooth, shockingly reliable power law in model size \(N\) and data \(D\). The Chinchilla (Hoffmann et al., 2022) parametric form:
# Chinchilla solver: closed-form N*, D* from the corrected-fit constants
import numpy as np
E, A, B, alpha, beta = 1.82, 482.0, 2085.4, 0.348, 0.366 # Epoch AI refit
a, b = beta / (alpha + beta), alpha / (alpha + beta)
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
def optimum(C): # minimize EQ 4.1 subject to C = 6ND
N = G * (C / 6) ** a
D = (C / 6) / N
return N, D, E + A / N**alpha + B / D**beta
print(" C N* D* tok/param loss L")
for C in (1e22, 5.76e23, 1e24, 1e26):
N, D, L = optimum(C)
print(f" {C:8.1e} {N:11.2e} {D:11.2e} {D/N:9.1f} {L:9.3f}")
print("\n5.76e23 = Chinchilla's own budget: ~70B / ~1.4T recovered on a napkin.")
print("note the refit bends tokens/param below 20 as C grows -- the 20:1 rule")
print("is a Chinchilla-scale snapshot, not a law.")
Cs = np.logspace(20, 27, 50)
plot_xy(np.log10(Cs), np.log10([optimum(C)[0] for C in Cs])) # slope = 0.51
Emergence and downstream scaling. Loss scales smoothly; specific capabilities can look discontinuous (“emergent”) because task metrics are step functions over smooth log-likelihood gains. Modern practice fits separate scaling curves for benchmark performance, and — since 2024 — treats post-training compute and test-time compute (Chapter 05/08) as additional scaling axes.
Optimization: AdamW and the schedule
The unchallenged default is AdamW — Adam with decoupled weight decay:
# EQ A4.1 in dollars: identical mistake probabilities, two harnesses
actions = [ # (action class, P[harmful attempt], $cost raw, $cost sandboxed)
("bad file edit", 0.050, 2_000, 5), # git reset vs lost work
("rm in the wrong dir", 0.010, 25_000, 5), # container fs vs your homedir
("curl|sh from a README",0.004, 250_000, 50), # egress allowlist blocks exfil
("prod credential use", 0.002, 1_000_000, 0), # secret never mounted: c(a)=0
]
print(f"{'action class':24s}{'P[attempt]':>11s}{'E[raw]':>9s}{'E[sandboxed]':>14s}")
raw_total = box_total = 0.0
for name, p, c_raw, c_box in actions:
raw_total += p * c_raw
box_total += p * c_box
print(f"{name:24s}{p:11.3f}{p * c_raw:9,.0f}{p * c_box:14.2f}")
print("-" * 58)
print(f"{'expected damage, one attempt of each':35s}{raw_total:9,.0f}{box_total:14.2f}")
print(f"\nsame model, same first factor — the harness cuts E[damage] by "
f"{raw_total / box_total:,.0f}x")
print("you cannot zero P[harmful attempt]; you fully control max cost c(a)")
Batch sizes are measured in tokens — frontier runs use 4M–60M tokens per step, often ramped during training. µP / “maximal update parametrization” style scaling rules let labs tune hyperparameters on small proxies and transfer them up.
Numerics: mixed precision
Nothing trains in FP32 anymore. The standard recipe is BF16 compute with FP32 master state: matmuls and activations in bfloat16 (8-bit exponent — FP32's range with less precision, hence no loss-scaling dance that FP16 required), while a master copy of weights and the Adam moments stay in FP32 for stable accumulation.
| Format | Bits (sign·exp·mantissa) | Range | Role |
|---|---|---|---|
| FP32 | 1 · 8 · 23 | ~10^±38 | Master weights, optimizer moments, softmax/norm accumulations |
| BF16 | 1 · 8 · 7 | ~10^±38 | Default training compute since A100 |
| FP16 | 1 · 5 · 10 | ~±65,504 | Legacy training (needed loss scaling); still common in inference |
| FP8 (E4M3/E5M2) | 1 · 4 · 3 / 1 · 5 · 2 | ±448 / ±57,344 | Hopper/Blackwell matmuls; DeepSeek-V3 trained largely in FP8 |
Per-step training memory ≈ 16 bytes/param under this recipe (2 BF16 weight + 4 FP32 master + 8 Adam moments + gradient) — 70B parameters ⇒ ~1.1 TB before activations. Hence: parallelism.
Parallelism: one model, twenty thousand GPUs
No single accelerator holds a frontier model and its optimizer state, let alone trains it in tolerable time. Training is decomposed along complementary axes — composed together, this is “3-D (now 4-D+) parallelism”:
- Data parallelism (DP): clone the model, split the batch, all-reduce gradients. Scales until the gradient sync saturates the network.
- ZeRO / FSDP: DP without the memory waste — optimizer state, gradients, and finally parameters are sharded across replicas and gathered transiently. Stage-3 memory per GPU falls ~linearly in replica count.
- Tensor parallelism (TP): split individual weight matrices across GPUs (column- then row-wise, Megatron-style) so each matmul runs jointly; needs all-reduce per layer — keep it inside the NVLink island.
- Pipeline parallelism (PP): split by depth into stages; micro-batches stream to keep the “bubble” (idle ramp-up/down fraction ≈ \( (p-1)/m \) for \(p\) stages, \(m\) micro-batches) small. Interleaved and zero-bubble schedules (DualPipe) push this further.
- Context/sequence parallelism: split the sequence dimension (ring attention) for very long inputs. Expert parallelism spreads MoE experts (Chapter 09).
- Activation checkpointing: store only block boundaries, recompute the inside on backward — ~30% extra compute for several-fold activation memory savings.
The bill
Plugging EQ 4.2 into real numbers grounds every strategic conversation about AI:
# the bill: days and dollars for a pre-training run (EQ 4.5)
def bill(N, D, gpus, mfu, peak=989e12, usd_hr=2.0): # H100 BF16 peak
C = 6 * N * D # total FLOPs
days = C / (gpus * peak * mfu) / 86_400
return C, days, gpus * days * 24 * usd_hr
runs = [("GPT-2 redo (2019->now)", 1.5e9, 1e10, 256, 0.35),
("Llama-3.1-405B ", 4.05e11, 1.5e13, 16_384, 0.41),
("1e26-FLOP frontier ", 1.5e12, 1.1e13, 100_000, 0.40)]
print("run FLOPs days cost")
for name, N, D, g, mfu in runs:
C, days, cost = bill(N, D, g, mfu)
print(f"{name} {C:9.2e} {days:9.2f} ${cost:>12,.0f}")
print("\n405B check: 3.65e25 FLOPs / (16,384 x 989e12 x 0.41) = 63.5 days, ~$50M.")
print("Meta reported ~54 days of actual pre-training: napkin lands within 20%.")
print("GPT-2 is now a ~quarter-hour, ~$150 run. The frontier line is why")
print("training decisions reach the board.")
What you have now is a base model — a magnificent autocomplete that will continue a question with three more questions. Chapter 05: the alignment stack that turns it into something you can actually talk to.
Further reading
- Kaplan et al. (2020). Scaling Laws for Neural Language Models. — the first power-law account of loss versus parameters, data, and compute.
- Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). — corrected the data/parameter trade-off; the compute-optimal recipe used since.
- Loshchilov & Hutter (2019). Decoupled Weight Decay Regularization (AdamW). — the optimizer this chapter's schedule is built around.
- Micikevicius et al. (2018). Mixed Precision Training. — FP16 training with loss scaling, the basis of modern numerics.
- Shoeybi et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. — tensor parallelism for splitting a model across GPUs.
- Rajbhandari, Rasley, Ruwase & He (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. — the sharded-optimizer scheme behind data-parallel scale.
- Penedo et al. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. — a transparent account of modern web-data curation.