04 · Pre-training — LLM Field Manual

4.1

Data: the curriculum of the internet

Frontier runs consume 10–20 trillion tokens. Raw web crawl is mostly unusable; the pipeline that refines it is among the most guarded IP in the industry. The canonical stages:

Extraction. HTML → text (boilerplate, navigation, ads stripped). Quality of this step alone moves benchmarks.
Language ID & heuristic filters. Drop documents failing length, symbol-ratio, repetition and word-list tests (C4/Gopher rules).
Deduplication. Exact (hashing) and near-dup (MinHash / LSH over shingles). Duplicates waste compute and amplify memorization.
Model-based quality filtering. Classifiers trained to recognize “textbook-like” or high-utility pages now gate the majority of what survives (the FineWeb-Edu pattern).
Mixing. The final recipe weights sources — web, code, math, papers, books, multilingual — and typically ends with a midtraining / annealing phase that up-weights the highest-quality and long-context data at low learning rate.
Synthetic data. Increasingly, strong models generate or rewrite training text for weaker successors and specialized phases — with care, since uncurated self-training degrades distributions.
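
The near-duplicate stage above can be sketched in a few lines. This is a toy illustration, not any lab's pipeline: the 5-word shingles, the 128 hash functions, and the helper names (`shingles`, `minhash`, `est_jaccard`) are assumptions chosen for readability, and the LSH banding that avoids all-pairs comparison at scale is omitted.

```python
import hashlib

def shingles(text, k=5):
    """Set of overlapping k-word shingles (k=5 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, shingle):
    """64-bit hash of one shingle under hash function number `seed`."""
    digest = hashlib.blake2b(shingle.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash(text, num_perm=128):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "scaling laws relate loss to parameters and tokens under a fixed compute budget"

sig_a, sig_b, sig_c = minhash(doc_a), minhash(doc_b), minhash(doc_c)
print(f"a vs b (one word changed): {est_jaccard(sig_a, sig_b):.2f}")
print(f"a vs c (unrelated):        {est_jaccard(sig_a, sig_c):.2f}")
```

A pipeline would drop one document of any pair whose estimated similarity exceeds a chosen threshold; the threshold itself is a tuning decision this sketch does not make.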

RULE

Data quality buys more than data quantity. Identical architectures separated only by corpus curation differ by the equivalent of a 2–5× compute multiplier. The “data wall” debate is really a question of how much refinable raw material and synthetic generation remain.

4.2

Scaling laws: how big, how long

Loss falls as a smooth, shockingly reliable power law in model size $N$ and data $D$. The Chinchilla (Hoffmann et al., 2022) parametric form:

EQ 4.1 — CHINCHILLA LOSS SURFACE $$ L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}} $$

$E$ is the irreducible entropy of text; the two power-law terms are the cost of finite capacity and finite data. Corrected fit (Epoch AI's 2024 replication of the paper): $E = 1.82,\ A = 482.0,\ B = 2085.4,\ \alpha = 0.348,\ \beta = 0.366$ — the values the instrument below uses, which reproduce Chinchilla-70B/1.4T at the paper's own budget.

EQ 4.2 — THE BUDGET CONSTRAINT & OPTIMUM $$ C \approx 6\,N D \qquad\Longrightarrow\qquad N^{*} \propto C^{\,0.46}, \quad D^{*} \propto C^{\,0.54}, \quad \frac{D^{*}}{N^{*}} \approx 20 \text{ tokens/param} $$

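Each parameter touched by each token costs ≈6 FLOPs (2 forward, 4 backward). Minimizing EQ 4.1 subject to the budget gives the famous rule of thumb: scale data and parameters together, ~20:1. The exponents 0.46 and 0.54 are the original paper's parametric-fit values; the refit constants above give $\beta/(\alpha+\beta) \approx 0.51$ for $N^{*}$ and $\alpha/(\alpha+\beta) \approx 0.49$ for $D^{*}$. Kaplan et al. (2020) had concluded ~1.7:1; fixing that error is why Chinchilla-70B beat Gopher-280B with the same compute.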
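
For reference, the optimum follows from EQ 4.1 by a short substitution argument. The sketch below treats $C = 6ND$ as exact and uses $K = C/6$ as shorthand (a symbol introduced here for convenience, not part of the original notation).

$$ \begin{aligned} &\text{minimize } \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \quad \text{subject to } N D = K, \qquad K = \frac{C}{6} \\[4px] &\text{substitute } D = K/N: \qquad f(N) = A\,N^{-\alpha} + B\,K^{-\beta} N^{\beta} \\[4px] &f'(N) = 0: \qquad \alpha A\, N^{-\alpha-1} = \beta B\, K^{-\beta} N^{\beta-1} \;\Longrightarrow\; N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\, K^{\beta} \\[4px] &N^{*} = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad D^{*} = \frac{C}{6\,N^{*}} \;\propto\; C^{\frac{\alpha}{\alpha+\beta}} \end{aligned} $$

The two compute exponents always sum to 1, and the tokens-per-parameter ratio is constant in $C$ only when $\alpha = \beta$.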

Llama-3-8B has $ N = 8 \times 10^{9} $ parameters and was trained on $ D = 1.5 \times 10^{13} $ tokens. What is its tokens-per-parameter ratio $ D/N $?

$ D/N = \dfrac{1.5 \times 10^{13}}{8 \times 10^{9}} = \dfrac{15{,}000}{8} = $ 1875 tokens/param — far above Chinchilla's ~20:1, deliberately overtrained to make inference cheap forever after.

Estimate the training compute $ C = 6\,N D $ for Llama-3-8B ($ N = 8 \times 10^{9} $, $ D = 1.5 \times 10^{13} $). Give your answer as the coefficient of $ 10^{23} $ FLOPs.

$ N D = 8 \times 10^{9} \times 1.5 \times 10^{13} = 1.2 \times 10^{23} $. Then $ C = 6 \times 1.2 \times 10^{23} = 7.2 \times 10^{23} $ FLOPs, i.e. coefficient 7.2.
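
The two worked answers above can be checked mechanically. A minimal sketch: the function names are illustrative, and the Chinchilla row uses the 70B / 1.4T figures quoted earlier in this section.

```python
def tokens_per_param(N, D):
    """Training tokens seen per model parameter."""
    return D / N

def train_flops(N, D):
    """C ≈ 6ND: about 2 FLOPs forward and 4 backward per parameter per token."""
    return 6 * N * D

models = {
    "Chinchilla-70B": (70e9, 1.4e12),
    "Llama-3-8B":     (8e9, 1.5e13),
}

for name, (N, D) in models.items():
    ratio = tokens_per_param(N, D)
    print(f"{name:15s} tokens/param = {ratio:7,.0f}   C = {train_flops(N, D):.2e} FLOPs"
          f"   ({ratio / 20:.1f}x the 20:1 rule)")
```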

PYTHON · RUNNABLE IN-BROWSER

# Chinchilla solver: closed-form N*, D* from the corrected-fit constants
import numpy as np
E, A, B, alpha, beta = 1.82, 482.0, 2085.4, 0.348, 0.366   # Epoch AI refit

a, b = beta / (alpha + beta), alpha / (alpha + beta)
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

def optimum(C):                       # minimize EQ 4.1 subject to C = 6ND
    N = G * (C / 6) ** a
    D = (C / 6) / N
    return N, D, E + A / N**alpha + B / D**beta

print("        C          N*          D*   tok/param   loss L")
for C in (1e22, 5.76e23, 1e24, 1e26):
    N, D, L = optimum(C)
    print(f"  {C:8.1e} {N:11.2e} {D:11.2e} {D/N:9.1f} {L:9.3f}")

print("\n5.76e23 = Chinchilla's own budget: ~70B / ~1.4T recovered on a napkin.")
print("note the refit bends tokens/param below 20 as C grows -- the 20:1 rule")
print("is a Chinchilla-scale snapshot, not a law.")

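Cs = np.logspace(20, 27, 50)
logN = np.log10([optimum(C)[0] for C in Cs])
slope = np.polyfit(np.log10(Cs), logN, 1)[0]          # equals beta / (alpha + beta)
print(f"slope of log N* vs log C: {slope:.2f}")       # ~0.51
if "plot_xy" in globals():                            # plot_xy is the page's in-browser plotting helper
    plot_xy(np.log10(Cs), logN)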


INSTRUMENT 4.1 — SPEND A COMPUTE BUDGET · EQ 4.1 + 4.2 · LIVE

[Interactive instrument: set the compute budget C (default 10^24 FLOPs); reads out optimal params N*, optimal tokens D*, tokens/param, and achievable loss.]

Each curve: loss across all ways to split the budget C between parameters (x-axis) and tokens (implied, D = C/(6N)). The valley is broad, and real labs deliberately train smaller-than-optimal models on far more tokens (Llama-3-8B: ~1,875 tokens/param), overpaying in training compute to buy cheap inference forever after.

Emergence and downstream scaling. Loss scales smoothly; specific capabilities can look discontinuous (“emergent”) because task metrics are step functions over smooth log-likelihood gains. Modern practice fits separate scaling curves for benchmark performance, and — since 2024 — treats post-training compute and test-time compute (Chapter 05/08) as additional scaling axes.
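
A toy model makes the "step function over smooth gains" point concrete. It rests on loud assumptions: answers are exactly `k = 10` tokens long, each token is correct independently with probability `p`, and the task is scored by exact match. None of these numbers come from a real benchmark.

```python
import numpy as np

k = 10                                   # answer length in tokens (illustrative)
p = np.linspace(0.50, 0.99, 8)           # per-token accuracy, improving smoothly
exact_match = p ** k                     # the task scores 1 only if all k tokens are right

for pi, em in zip(p, exact_match):
    print(f"per-token acc {pi:.2f}   log-lik/token {np.log(pi):6.3f}   exact-match {em:.3f}")
```

Per-token log-likelihood improves steadily down the table, while exact-match stays near zero for most of the range and then climbs steeply, which is the shape usually described as emergent.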

4.3

Optimization: AdamW and the schedule

The long-standing default is AdamW, Adam with decoupled weight decay:

EQ 4.3 — ADAMW UPDATE $$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\[4px] \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right) \end{aligned} $$

First moment $m$ smooths the gradient; second moment $v$ normalizes per-parameter step size; decay $\lambda$ is applied to weights directly rather than mixed into the gradient (the “W”). Typical: $\beta_1 = 0.9, \beta_2 = 0.95, \lambda = 0.1$. Cost: two extra FP32 states per parameter — the reason optimizer memory, not weights, dominates training footprints, and a target of ZeRO sharding (§4.5). Newer optimizers (Muon, second-order-flavored methods) are credibly claiming 1.3–2× efficiency in recent open runs.
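
EQ 4.3 translates almost line for line into NumPy. This is a minimal single-tensor sketch rather than a production optimizer: the learning rate `3e-4` and `eps=1e-8` are placeholder assumptions, while `beta1`, `beta2`, and `wd` use the typical values quoted above.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update (EQ 4.3). t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g              # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * g * g          # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    # decay acts on the weights directly, outside the adaptive rescaling (the "W")
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.25])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_new, m, v = adamw_step(theta, grad, m, v, t=1)
print(theta_new)
```

At step 1 the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly `lr` regardless of gradient magnitude, plus the decay term.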

AdamW bias correction: with $ \beta_1 = 0.9 $, at step $ t = 2 $ the raw first moment is $ m_2 = 0.5 $. What is the corrected $ \hat{m}_2 = \dfrac{m_2}{1 - \beta_1^{\,t}} $?

$ \beta_1^{2} = 0.9^2 = 0.81 $, so $ 1 - 0.81 = 0.19 $. Then $ \hat{m}_2 = 0.5 / 0.19 = $ 2.632 — early-step correction inflates the moment while it is still warming up from zero.

EQ 4.4 — LEARNING-RATE SCHEDULE $$ \eta(t) = \begin{cases} \eta_{\max}\, \dfrac{t}{t_w} & t < t_w \quad \text{(linear warmup)} \\[8px] \eta_{\min} + \tfrac{1}{2}\big(\eta_{\max}-\eta_{\min}\big)\Big(1 + \cos \pi \tfrac{t - t_w}{T - t_w}\Big) & t \ge t_w \quad \text{(cosine decay)} \end{cases} $$

Warmup (hundreds–thousands of steps) protects the fragile early phase; cosine decays to $\eta_{\min} \approx 0.1\, \eta_{\max}$. The WSD (warmup–stable–decay) variant holds LR flat and decays only in a final phase — convenient for checkpoint reuse and continual pre-training. Gradient-norm clipping at 1.0 is universal; loss-spike lore (skip bad batches, restart from checkpoint) remains part of the craft.
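
EQ 4.4 and the WSD variant can be written as two small functions. A sketch under stated assumptions: the WSD decay here is linear and occupies the final 20% of steps, and the step counts and rates in the usage lines are illustrative, not taken from any particular run.

```python
import math

def cosine_lr(t, T, t_w, eta_max, eta_min):
    """EQ 4.4: linear warmup for t < t_w, then cosine decay to eta_min at step T."""
    if t < t_w:
        return eta_max * t / t_w
    progress = (t - t_w) / (T - t_w)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

def wsd_lr(t, T, t_w, eta_max, eta_min, decay_frac=0.2):
    """Warmup, flat plateau, then decay over the last decay_frac of steps.
    The linear decay shape is an assumption of this sketch."""
    if t < t_w:
        return eta_max * t / t_w
    t_d = T - int(decay_frac * T)                # step at which decay begins
    if t < t_d:
        return eta_max
    return eta_max - (eta_max - eta_min) * (t - t_d) / (T - t_d)

T, t_w, eta_max, eta_min = 10_000, 300, 3e-4, 3e-5
for t in (0, 150, 300, 5_000, 8_000, 9_000, 10_000):
    print(f"step {t:6d}   cosine {cosine_lr(t, T, t_w, eta_max, eta_min):.2e}"
          f"   wsd {wsd_lr(t, T, t_w, eta_max, eta_min):.2e}")
```

Any checkpoint taken on the WSD plateau has seen only the peak rate, which is what makes it reusable as a branch point for several decay runs.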

PYTHON · RUNNABLE IN-BROWSER

# EQ A4.1 in dollars: identical mistake probabilities, two harnesses
actions = [  # (action class, P[harmful attempt], $cost raw, $cost sandboxed)
    ("bad file edit",        0.050,     2_000,  5),   # git reset vs lost work
    ("rm in the wrong dir",  0.010,    25_000,  5),   # container fs vs your homedir
    ("curl|sh from a README",0.004,   250_000, 50),   # egress allowlist blocks exfil
    ("prod credential use",  0.002, 1_000_000,  0),   # secret never mounted: c(a)=0
]

print(f"{'action class':24s}{'P[attempt]':>11s}{'E[raw]':>9s}{'E[sandboxed]':>14s}")
raw_total = box_total = 0.0
for name, p, c_raw, c_box in actions:
    raw_total += p * c_raw
    box_total += p * c_box
    print(f"{name:24s}{p:11.3f}{p * c_raw:9,.0f}{p * c_box:14.2f}")
print("-" * 58)
print(f"{'expected damage, one attempt of each':35s}{raw_total:9,.0f}{box_total:14.2f}")
print(f"\nsame model, same first factor — the harness cuts E[damage] by "
      f"{raw_total / box_total:,.0f}x")
print("you cannot zero P[harmful attempt]; you fully control max cost c(a)")


INSTRUMENT 4.2 — LR SCHEDULE DESIGNER · EQ 4.4 · LIVE

[Interactive instrument: controls for warmup (default 3%), floor η_min / η_max (default 10%), and schedule shape.]

WSD (warmup–stable–decay) holds the rate flat and decays only in the final 20% — checkpoints from the stable plateau can be branched into many decay runs, which is why continual-pre-training shops prefer it.

Batch sizes are measured in tokens — frontier runs use 4M–60M tokens per step, often ramped during training. µP / “maximal update parametrization” style scaling rules let labs tune hyperparameters on small proxies and transfer them up.

4.4

Numerics: mixed precision

Almost nothing at scale trains in pure FP32 anymore. The standard recipe is BF16 compute with FP32 master state: matmuls and activations run in bfloat16 (8-bit exponent: FP32's range with less precision, hence none of the loss-scaling dance that FP16 required), while a master copy of the weights and the Adam moments stay in FP32 for stable accumulation.

Format	Bits (sign·exp·mantissa)	Range	Role
FP32	1 · 8 · 23	~10^±38	Master weights, optimizer moments, softmax/norm accumulations
BF16	1 · 8 · 7	~10^±38	Default training compute since A100
FP16	1 · 5 · 10	~±65,504	Legacy training (needed loss scaling); still common in inference
FP8 (E4M3/E5M2)	1 · 4 · 3 / 1 · 5 · 2	±448 / ±57,344	Hopper/Blackwell matmuls; DeepSeek-V3 trained largely in FP8
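
The range-versus-precision trade in the table can be demonstrated directly. NumPy has no native bfloat16, so the helper `to_bf16` below is an emulation written for this sketch: it rounds a float32 to the nearest value with a 7-bit mantissa and ignores NaN/inf edge cases.

```python
import numpy as np

def to_bf16(x):
    """Round float32 to the nearest bfloat16 (ties to even); result is held in float32.
    Emulation only: NaN and inf inputs are not handled."""
    f32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = f32.view(np.uint32).astype(np.uint64)
    lsb = (bits >> np.uint64(16)) & np.uint64(1)             # lowest surviving mantissa bit
    rounded = (bits + np.uint64(0x7FFF) + lsb) & np.uint64(0xFFFF0000)
    return rounded.astype(np.uint32).view(np.float32)

big = 70000.0                    # above FP16's maximum of 65,504
fine = 1.0 + 2.0 ** -10          # needs a 10-bit mantissa to represent

with np.errstate(over="ignore"):
    fp16_big = np.float16(big)   # overflows to inf
bf16_big = to_bf16(big)[0]       # stays finite, but coarsely rounded

print(f"range:     fp16({big}) = {fp16_big}    bf16({big}) = {bf16_big}")
print(f"precision: fp16(1+2^-10) = {float(np.float16(fine))!r}    bf16(1+2^-10) = {float(to_bf16(fine)[0])!r}")
```

FP16 keeps the small increment but overflows on the large value; the emulated BF16 does the opposite, which is why BF16 training can skip loss scaling while still needing FP32 for accumulation.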

Per-step training memory ≈ 16 bytes/param under this recipe (2 for BF16 weights + 4 for FP32 master weights + 8 for the two FP32 Adam moments + 2 for the BF16 gradient), so 70B parameters ⇒ ~1.1 TB before activations. Hence: parallelism.
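
The byte count is easy to tabulate. In this sketch the 2-byte gradient is inferred from the 16-byte total, and the 80 GB device size is an assumption (it matches the H100's HBM capacity); activations are excluded, as in the text.

```python
import math

BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "fp32 master weights": 4,
    "fp32 Adam m": 4,
    "fp32 Adam v": 4,
    "bf16 gradient": 2,          # inferred: 16 total minus the 14 listed explicitly
}

def train_state_bytes(n_params):
    """Weights + optimizer state + gradients, before activations."""
    return n_params * sum(BYTES_PER_PARAM.values())

n_params = 70e9
total = train_state_bytes(n_params)
device_bytes = 80e9              # assumed per-accelerator memory

print(f"bytes/param: {sum(BYTES_PER_PARAM.values())}")
print(f"70B model:   {total / 1e12:.2f} TB of training state")
print(f"minimum devices just to hold it: {math.ceil(total / device_bytes)}")
```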

4.5

Parallelism: one model, twenty thousand GPUs

No single accelerator holds a frontier model and its optimizer state, let alone trains it in tolerable time. Training is decomposed along complementary axes — composed together, this is “3-D (now 4-D+) parallelism”:

FIG 4.A · PARALLELISM AXES

Composition in practice (Llama-3-405B): TP=8 inside each server (NVLink), PP=16 across servers, DP/FSDP over the remainder, plus context parallelism for 128K-token sequences — 16,384 H100s working as one optimizer.

Data parallelism (DP): clone the model, split the batch, all-reduce gradients. Scales until the gradient sync saturates the network.
ZeRO / FSDP: DP without the memory waste — optimizer state, gradients, and finally parameters are sharded across replicas and gathered transiently. Stage-3 memory per GPU falls ~linearly in replica count.
Tensor parallelism (TP): split individual weight matrices across GPUs (column- then row-wise, Megatron-style) so each matmul runs jointly; needs all-reduce per layer — keep it inside the NVLink island.
Pipeline parallelism (PP): split by depth into stages; micro-batches stream to keep the “bubble” (idle ramp-up/down fraction ≈ $ (p-1)/m $ for $p$ stages, $m$ micro-batches) small. Interleaved and zero-bubble schedules (DualPipe) push this further.
Context/sequence parallelism: split the sequence dimension (ring attention) for very long inputs. Expert parallelism spreads MoE experts (Chapter 09).
Activation checkpointing: store only block boundaries, recompute the inside on backward — ~30% extra compute for several-fold activation memory savings.
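
A few lines of arithmetic tie the axes above together. This sketch uses the TP=8, PP=16, 16,384-GPU composition quoted earlier; the micro-batch count `m = 64` and the 70B model size in the ZeRO line are illustrative assumptions, and the ZeRO estimate treats sharding as perfectly even with no transient gather buffers.

```python
def dp_degree(n_gpus, tp, pp, cp=1):
    """Data-parallel replicas left over after tensor, pipeline, and context parallelism."""
    model_parallel = tp * pp * cp
    if n_gpus % model_parallel:
        raise ValueError("GPU count must be a multiple of tp * pp * cp")
    return n_gpus // model_parallel

def bubble_fraction(p, m):
    """Idle ramp-up/down fraction for p pipeline stages and m micro-batches."""
    return (p - 1) / m

def zero3_bytes_per_gpu(n_params, replicas, bytes_per_param=16):
    """Stage-3 sharding: training state divided evenly across replicas."""
    return n_params * bytes_per_param / replicas

dp = dp_degree(16_384, tp=8, pp=16)
print(f"DP replicas:       {dp}")
print(f"pipeline bubble:   {bubble_fraction(p=16, m=64):.1%}")
print(f"ZeRO-3, 70B model: {zero3_bytes_per_gpu(70e9, dp) / 1e9:.2f} GB of state per GPU")
```

Doubling the micro-batch count halves the bubble, which is one reason large global batches and deep pipelines tend to appear together.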

4.6

The bill

Plugging EQ 4.2 into real numbers grounds every strategic conversation about AI:

EQ 4.5 — TRAINING TIME ESTIMATE $$ \text{days} \;=\; \frac{6\,N D}{n_{\text{GPU}} \times \text{FLOPs}_{\text{peak}} \times \text{MFU} \times 86{,}400} $$

MFU — model FLOPs utilization, the fraction of peak silicon throughput doing useful model math — runs 35–50% in well-tuned large runs. Example: $N = 405\text{B},\ D = 15\text{T} \Rightarrow C \approx 3.6 \times 10^{25}$ FLOPs; on 16,384 H100s (≈990 TFLOPs BF16 each) at 41% MFU ⇒ ~63 days. At ~$2/GPU-hr that's ~$50M of compute — before the salaries, the failed runs, and the post-training.

Using EQ 4.5, estimate wall-clock days for a run with $ C = 3.456 \times 10^{24} $ FLOPs on $ n_{\text{GPU}} = 1000 $ chips at $ \text{FLOPs}_{\text{peak}} = 10^{15} $/s and $ \text{MFU} = 0.4 $. (Use $ 86{,}400 $ s/day.)

Denominator $ = 1000 \times 10^{15} \times 0.4 \times 86{,}400 = 3.456 \times 10^{22} $ FLOPs/day. Days $ = \dfrac{3.456 \times 10^{24}}{3.456 \times 10^{22}} = $ 100 days.

PYTHON · RUNNABLE IN-BROWSER

# the bill: days and dollars for a pre-training run (EQ 4.5)
def bill(N, D, gpus, mfu, peak=989e12, usd_hr=2.0):     # H100 BF16 peak
    C = 6 * N * D                                       # total FLOPs
    days = C / (gpus * peak * mfu) / 86_400
    return C, days, gpus * days * 24 * usd_hr

runs = [("GPT-2 redo (2019->now)", 1.5e9,    1e10,   256,     0.35),
        ("Llama-3.1-405B        ", 4.05e11,  1.5e13, 16_384,  0.41),
        ("1e26-FLOP frontier    ", 1.5e12,   1.1e13, 100_000, 0.40)]

print("run                        FLOPs      days          cost")
for name, N, D, g, mfu in runs:
    C, days, cost = bill(N, D, g, mfu)
    print(f"{name} {C:9.2e} {days:9.2f}  ${cost:>12,.0f}")

print("\n405B check: 3.65e25 FLOPs / (16,384 x 989e12 x 0.41) = 63.5 days, ~$50M.")
print("Meta reported ~54 days of actual pre-training: napkin lands within 20%.")
print("GPT-2 is now a ~quarter-hour, ~$150 run. The frontier line is why")
print("training decisions reach the board.")


INSTRUMENT 4.3 — PRICE A TRAINING RUN · EQ 4.5 · H100 BF16 PEAK 989 TFLOPs

[Interactive instrument: inputs are params N (default 405B), tokens D (15T), GPUs (16,384), MFU (41%), and rate ($2.00/hr); reads out compute C = 6ND, wall-clock, and compute cost.]

Defaults ≈ Llama-3.1-405B. Try GPT-2 (N=1.5B, D=10B tokens) on 256 GPUs — what took OpenAI weeks in 2019 is now an afternoon. Then price a 10²⁶-FLOP frontier run and see why these decisions reach board level.

GPT-2 (2019): ~10²¹ FLOPs, reproducible today for a few hundred dollars
GPT-4 CLASS (2023): ~2×10²⁵ FLOPs, tens of millions of dollars
FRONTIER (2025–26): 10²⁶+ FLOPs, gigawatt-scale clusters, $100M–$1B+ runs

What you have now is a base model — a magnificent autocomplete that will continue a question with three more questions. Chapter 05: the alignment stack that turns it into something you can actually talk to.
