AI // ENCYCLOPEDIA / VOL II / 04 / PRE-TRAINING
CHAPTER 04 / 10

Pre-training

Pre-training spends a compute budget, months of time on tens of thousands of accelerators, to push cross-entropy as low as physics and economics allow. The decisions are few but consequential: what data to use, how many parameters versus how many tokens, which optimizer settings, and how to keep a building-sized computer numerically stable.

READING TIME ≈ 30 MIN · BUILDS ON CH 01–02 · INSTRUMENTS: SCALING · LR DESIGNER · THE BILL
4.1

Data: the curriculum of the internet

Frontier runs consume 10–20 trillion tokens. Raw web crawl is mostly unusable; the pipeline that refines it is among the most guarded IP in the industry. The canonical stages:

  • Extraction. HTML → text (boilerplate, navigation, ads stripped). Quality of this step alone moves benchmarks.
  • Language ID & heuristic filters. Drop documents failing length, symbol-ratio, repetition and word-list tests (C4/Gopher rules).
  • Deduplication. Exact (hashing) and near-dup (MinHash / LSH over shingles). Duplicates waste compute and amplify memorization.
  • Model-based quality filtering. Classifiers trained to recognize “textbook-like” or high-utility pages now gate the majority of what survives (the FineWeb-Edu pattern).
  • Mixing. The final recipe weights sources — web, code, math, papers, books, multilingual — and typically ends with a midtraining / annealing phase that up-weights the highest-quality and long-context data at low learning rate.
  • Synthetic data. Increasingly, strong models generate or rewrite training text for weaker successors and specialized phases — with care, since uncurated self-training degrades distributions.
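The deduplication stage can be sketched in a few lines. This is a minimal illustration, not any lab's pipeline: the helper names (`shingles`, `exact_key`, `minhash`, `estimated_jaccard`), the 3-word shingle size, and the 128 hash functions are all assumptions chosen for the example, and the LSH banding step that makes the comparison scale is left out.

```python
import hashlib
import re

def shingles(text, k=3):
    """Set of k-word shingles from lowercased text (k=3 is an illustrative choice)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def exact_key(text):
    """Exact-duplicate key: hash of lowercased, whitespace-normalized text."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def minhash(shingle_set, num_perm=128):
    """MinHash signature: for each salted hash function, the minimum hash over the set."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little")
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = ("the quick brown fox jumps over the lazy dog near the river bank "
         "every single morning before sunrise")
doc_b = ("the quick brown fox jumps over the sleepy dog near the river bank "
         "every single morning before sunrise")
doc_c = ("gradient clipping and learning rate warmup keep large transformer "
         "training runs numerically stable")

sig_a, sig_b, sig_c = (minhash(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(f"a vs b (one word changed): {estimated_jaccard(sig_a, sig_b):.2f}")
print(f"a vs c (unrelated):        {estimated_jaccard(sig_a, sig_c):.2f}")
```

A real pipeline would bucket signatures with LSH so that only likely matches are compared, then drop all but one document per near-duplicate cluster above a chosen similarity threshold.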
RULE

Data quality buys more than data quantity. Identical architectures separated only by corpus curation differ by the equivalent of a 2–5× compute multiplier. The “data wall” debate is really a question of how much refinable raw material and synthetic generation remain.

4.2

Scaling laws: how big, how long

Loss falls as a smooth, shockingly reliable power law in model size \(N\) and data \(D\). The Chinchilla (Hoffmann et al., 2022) parametric form:

EQ 4.1 — CHINCHILLA LOSS SURFACE $$ L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}} $$
\(E\) is the irreducible entropy of text; the two power-law terms are the cost of finite capacity and finite data. Corrected fit (Epoch AI's 2024 replication of the paper): \(E = 1.82,\ A = 482.0,\ B = 2085.4,\ \alpha = 0.348,\ \beta = 0.366\) — the values the instrument below uses, which reproduce Chinchilla-70B/1.4T at the paper's own budget.
EQ 4.2 — THE BUDGET CONSTRAINT & OPTIMUM $$ C \approx 6\,N D \qquad\Longrightarrow\qquad N^{*} \propto C^{\,0.51}, \quad D^{*} \propto C^{\,0.49}, \quad \frac{D^{*}}{N^{*}} \approx 20 \text{ tokens/param} $$
Each parameter touched by each token costs ≈6 FLOPs (2 forward, 4 backward). Minimizing EQ 4.1 subject to the budget gives the famous rule of thumb: scale data and parameters together, ~20:1. Kaplan et al. (2020) had concluded ~1.7:1 — fixing that error is why Chinchilla-70B beat Gopher-280B with the same compute.
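The Gopher versus Chinchilla comparison can be checked against EQ 4.1 directly. The sketch below assumes the publicly reported sizes (Gopher: 280B parameters on 300B tokens; Chinchilla: 70B parameters on 1.4T tokens) and reuses the refit constants quoted above; the predicted losses are outputs of the fitted formula, not measured results.

```python
# Refit constants quoted in the text (Epoch AI, 2024)
E, A, B, ALPHA, BETA = 1.82, 482.0, 2085.4, 0.348, 0.366

def loss(N, D):
    """EQ 4.1: predicted loss for N parameters trained on D tokens."""
    return E + A / N**ALPHA + B / D**BETA

def train_flops(N, D):
    """EQ 4.2: approximate training compute, 6 FLOPs per parameter per token."""
    return 6 * N * D

# (parameters, training tokens) as publicly reported
models = {"Gopher": (280e9, 300e9), "Chinchilla": (70e9, 1.4e12)}
results = {name: (train_flops(N, D), D / N, loss(N, D))
           for name, (N, D) in models.items()}

for name, (C, ratio, L) in results.items():
    print(f"{name:11s} C = {C:.2e} FLOPs   tokens/param = {ratio:5.1f}   "
          f"predicted loss = {L:.3f}")
```

The two budgets land within about 20% of each other, yet the formula assigns the smaller, longer-trained model the lower loss.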
Llama-3-8B has \( N = 8 \times 10^{9} \) parameters and was trained on \( D = 1.5 \times 10^{13} \) tokens. What is its tokens-per-parameter ratio \( D/N \)?
\( D/N = \dfrac{1.5 \times 10^{13}}{8 \times 10^{9}} = \dfrac{15{,}000}{8} = \) 1875 tokens/param — far above Chinchilla's ~20:1, deliberately overtrained to make inference cheap forever after.
Estimate the training compute \( C = 6\,N D \) for Llama-3-8B (\( N = 8 \times 10^{9} \), \( D = 1.5 \times 10^{13} \)). Give your answer as the coefficient of \( 10^{23} \) FLOPs.
\( N D = 8 \times 10^{9} \times 1.5 \times 10^{13} = 1.2 \times 10^{23} \). Then \( C = 6 \times 1.2 \times 10^{23} = 7.2 \times 10^{23} \) FLOPs, i.e. coefficient 7.2.
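The closed form used by the solver below follows from one substitution and one derivative. As a sketch, substitute \( D = C / 6N \) into EQ 4.1, set the derivative with respect to \(N\) to zero, and solve; the shorthand \(G\), \(a\), \(b\) is introduced here only for convenience.

$$ \begin{aligned} L(N) &= E + A\,N^{-\alpha} + B \left(\frac{6N}{C}\right)^{\beta} \\[4px] 0 = \frac{dL}{dN} &= -\alpha A\, N^{-\alpha-1} + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1} \quad\Longrightarrow\quad N^{\alpha+\beta} = \frac{\alpha A}{\beta B} \left(\frac{C}{6}\right)^{\beta} \\[4px] N^{*} &= G \left(\frac{C}{6}\right)^{a}, \qquad D^{*} = \frac{C}{6\,N^{*}} = G^{-1} \left(\frac{C}{6}\right)^{b} \\[4px] G &= \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}, \qquad a = \frac{\beta}{\alpha+\beta}, \qquad b = \frac{\alpha}{\alpha+\beta} \end{aligned} $$

With the refit constants, \( a \approx 0.51 \) and \( b \approx 0.49 \), so the ratio \( D^{*}/N^{*} = G^{-2} (C/6)^{\,b-a} \) drifts slowly downward as \(C\) grows.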
PYTHON · RUNNABLE IN-BROWSER
# Chinchilla solver: closed-form N*, D* from the corrected-fit constants
import numpy as np
E, A, B, alpha, beta = 1.82, 482.0, 2085.4, 0.348, 0.366   # Epoch AI refit

a, b = beta / (alpha + beta), alpha / (alpha + beta)
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

def optimum(C):                       # minimize EQ 4.1 subject to C = 6ND
    N = G * (C / 6) ** a
    D = (C / 6) / N
    return N, D, E + A / N**alpha + B / D**beta

print("        C          N*          D*   tok/param   loss L")
for C in (1e22, 5.76e23, 1e24, 1e26):
    N, D, L = optimum(C)
    print(f"  {C:8.1e} {N:11.2e} {D:11.2e} {D/N:9.1f} {L:9.3f}")

print("\n5.76e23 = Chinchilla's own budget: ~70B / ~1.4T recovered on a napkin.")
print("note the refit bends tokens/param below 20 as C grows -- the 20:1 rule")
print("is a Chinchilla-scale snapshot, not a law.")

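Cs = np.logspace(20, 27, 50)
# fit a line in log-log space; the slope is the exponent of N* in C
slope = np.polyfit(np.log10(Cs), np.log10([optimum(C)[0] for C in Cs]), 1)[0]
print(f"\nlog-log slope of N* against C: {slope:.2f}")   # beta / (alpha + beta) = 0.51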
INSTRUMENT 4.1 — SPEND A COMPUTE BUDGET (EQ 4.1 + 4.2)
Readouts: optimal params N*, optimal tokens D*, tokens/param, achievable loss.
Each curve: loss across all ways to split the budget C between parameters (x-axis) and tokens (implied, D = C/6N). The valley is broad — and real labs deliberately train smaller-than-optimal models on far more tokens (Llama-3-8B: ~1,875 tokens/param), overpaying in training compute to buy cheap inference forever after.
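How broad the valley is can be estimated with the same formula. The sketch below compares the Llama-3-8B split of its budget against the compute-optimal split of the same budget under EQ 4.1. It assumes the refit constants still hold this far from the fitted range, which is an extrapolation, so the numbers are indicative only.

```python
# Refit constants quoted in the text (Epoch AI, 2024)
E, A, B, ALPHA, BETA = 1.82, 482.0, 2085.4, 0.348, 0.366

def loss(N, D):
    """EQ 4.1: predicted loss for N parameters trained on D tokens."""
    return E + A / N**ALPHA + B / D**BETA

def compute_optimal(C):
    """Closed-form minimizer of EQ 4.1 subject to C = 6ND."""
    a = BETA / (ALPHA + BETA)
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    N = G * (C / 6) ** a
    return N, (C / 6) / N

N_over, D_over = 8e9, 1.5e13          # Llama-3-8B figures used in the text
C = 6 * N_over * D_over               # same budget for both splits
N_opt, D_opt = compute_optimal(C)

loss_over = loss(N_over, D_over)
loss_opt = loss(N_opt, D_opt)

print(f"budget C            = {C:.2e} FLOPs")
print(f"overtrained split   : N = {N_over:.2e}, D = {D_over:.2e}, loss = {loss_over:.3f}")
print(f"compute-optimal     : N = {N_opt:.2e}, D = {D_opt:.2e}, loss = {loss_opt:.3f}")
print(f"loss premium        = {loss_over - loss_opt:.3f}")
print(f"params saved        = {N_opt / N_over:.1f}x smaller model to serve")
```

Under these assumptions the overtrained model gives up a few hundredths of loss in exchange for a model roughly ten times smaller at inference time.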

Emergence and downstream scaling. Loss scales smoothly; specific capabilities can look discontinuous (“emergent”) because task metrics are step functions over smooth log-likelihood gains. Modern practice fits separate scaling curves for benchmark performance, and — since 2024 — treats post-training compute and test-time compute (Chapter 05/08) as additional scaling axes.
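A toy model shows how a smooth gain can look like a jump. Assume, purely for illustration, that an answer is scored correct only if all \(k\) of its tokens are right and that token errors are independent; then exact-match accuracy is the per-token accuracy raised to the \(k\)-th power. The function name and the chosen values are hypothetical.

```python
import numpy as np

def exact_match(p_token, k):
    """Probability that all k answer tokens are right, assuming independent errors."""
    return p_token ** k

p = np.array([0.50, 0.70, 0.80, 0.90, 0.95, 0.99])   # smoothly improving per-token accuracy

print("per-token   k=1     k=5     k=20")
for pi in p:
    print(f"   {pi:.2f}    {exact_match(pi, 1):.3f}   {exact_match(pi, 5):.3f}   "
          f"{exact_match(pi, 20):.3f}")
```

For a 20-token answer the score stays near zero until per-token accuracy is already high, then climbs steeply, even though the underlying quantity improved smoothly throughout.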

4.3

Optimization: AdamW and the schedule

The unchallenged default is AdamW — Adam with decoupled weight decay:

EQ 4.3 — ADAMW UPDATE $$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\[4px] \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right) \end{aligned} $$
First moment \(m\) smooths the gradient; second moment \(v\) normalizes per-parameter step size; decay \(\lambda\) is applied to weights directly rather than mixed into the gradient (the “W”). Typical: \(\beta_1 = 0.9, \beta_2 = 0.95, \lambda = 0.1\). Cost: two extra FP32 states per parameter — the reason optimizer memory, not weights, dominates training footprints, and a target of ZeRO sharding (§4.5). Newer optimizers (Muon, second-order-flavored methods) are credibly claiming 1.3–2× efficiency in recent open runs.
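EQ 4.3 translates almost line for line into NumPy. The sketch below is a single-tensor illustration rather than a production optimizer; the learning rate and \(\epsilon\) defaults are assumptions, while \(\beta_1\), \(\beta_2\), and \(\lambda\) take the typical values quoted above.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update (EQ 4.3). t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g            # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * g * g        # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    # decoupled weight decay: wd * theta is added outside the adaptive ratio
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.array([1.0, -2.0])
g = np.array([0.5, -0.25])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_new, m, v = adamw_step(theta, g, m, v, t=1)
print("updated weights:", theta_new)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so every weight moves by about the learning rate regardless of gradient scale, plus the decay term.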
AdamW bias correction: with \( \beta_1 = 0.9 \), at step \( t = 2 \) the raw first moment is \( m_2 = 0.5 \). What is the corrected \( \hat{m}_2 = \dfrac{m_2}{1 - \beta_1^{\,t}} \)?
\( \beta_1^{2} = 0.9^2 = 0.81 \), so \( 1 - 0.81 = 0.19 \). Then \( \hat{m}_2 = 0.5 / 0.19 = \) 2.632 — early-step correction inflates the moment while it is still warming up from zero.
EQ 4.4 — LEARNING-RATE SCHEDULE $$ \eta(t) = \begin{cases} \eta_{\max}\, \dfrac{t}{t_w} & t < t_w \quad \text{(linear warmup)} \\[8px] \eta_{\min} + \tfrac{1}{2}\big(\eta_{\max}-\eta_{\min}\big)\Big(1 + \cos \pi \tfrac{t - t_w}{T - t_w}\Big) & t \ge t_w \quad \text{(cosine decay)} \end{cases} $$
Warmup (hundreds–thousands of steps) protects the fragile early phase; cosine decays to \(\eta_{\min} \approx 0.1\, \eta_{\max}\). The WSD (warmup–stable–decay) variant holds LR flat and decays only in a final phase — convenient for checkpoint reuse and continual pre-training. Gradient-norm clipping at 1.0 is universal; loss-spike lore (skip bad batches, restart from checkpoint) remains part of the craft.
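Gradient-norm clipping is simple enough to sketch. The version below treats all gradient tensors as one long vector, measures its L2 norm, and rescales everything by a common factor when that norm exceeds the threshold. The function name and the small stabilizing constant are assumptions for the example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scales gradients up
    return [g * scale for g in grads], total_norm

big = [np.array([3.0, 4.0]), np.array([12.0])]         # joint norm = 13
small = [np.array([0.3, 0.4])]                         # joint norm = 0.5

clipped_big, norm_big = clip_by_global_norm(big)
clipped_small, norm_small = clip_by_global_norm(small)

new_norm = float(np.sqrt(sum(np.sum(g * g) for g in clipped_big)))
print(f"norm {norm_big:.1f} -> {new_norm:.3f} after clipping")
print(f"norm {norm_small:.1f} left unchanged: {clipped_small[0]}")
```

Because one scale factor is applied to every tensor, the direction of the update is preserved and only its length is capped.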
PYTHON · RUNNABLE IN-BROWSER
# EQ 4.4 in code: linear warmup + cosine decay, next to the WSD variant
# (eta_max, the step counts and the linear 20% decay tail are illustrative values)
import math

def cosine_lr(t, T, t_w, eta_max, eta_min):
    if t < t_w:                                   # linear warmup
        return eta_max * t / t_w
    frac = (t - t_w) / (T - t_w)                  # 0 -> 1 over the decay phase
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * frac))

def wsd_lr(t, T, t_w, eta_max, eta_min, decay_frac=0.2):
    t_d = T - int(decay_frac * T)                 # step where the final decay begins
    if t < t_w:                                   # linear warmup
        return eta_max * t / t_w
    if t < t_d:                                   # stable plateau
        return eta_max
    return eta_max - (eta_max - eta_min) * (t - t_d) / (T - t_d)   # linear decay

T, t_w, eta_max = 100_000, 2_000, 3e-4
eta_min = 0.1 * eta_max
print(f"{'step':>8s}{'cosine':>12s}{'WSD':>12s}")
for t in (0, 1_000, 2_000, 25_000, 50_000, 80_000, 90_000, 100_000):
    print(f"{t:8d}{cosine_lr(t, T, t_w, eta_max, eta_min):12.2e}"
          f"{wsd_lr(t, T, t_w, eta_max, eta_min):12.2e}")
INSTRUMENT 4.2 — LR SCHEDULE DESIGNER (EQ 4.4)
WSD (warmup–stable–decay) holds the rate flat and decays only in the final 20% — checkpoints from the stable plateau can be branched into many decay runs, which is why continual-pre-training shops prefer it.

Batch sizes are measured in tokens — frontier runs use 4M–60M tokens per step, often ramped during training. µP / “maximal update parametrization” style scaling rules let labs tune hyperparameters on small proxies and transfer them up.
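Measuring batches in tokens makes the step count easy to estimate. The sketch below uses a hypothetical three-phase ramp (roughly 4M, 8M, then 16M tokens per step over 15T tokens) and an assumed sequence length of 8,192; none of these values describe a specific run.

```python
SEQ_LEN = 8192                                   # assumed sequence length

# hypothetical ramp: (tokens consumed in this phase, tokens per optimizer step)
ramp = [
    (1_000_000_000_000,  4 * 2**20),
    (2_000_000_000_000,  8 * 2**20),
    (12_000_000_000_000, 16 * 2**20),
]

def steps_per_phase(ramp):
    """Optimizer steps needed in each phase (ceiling division)."""
    return [-(-tokens // per_step) for tokens, per_step in ramp]

steps = steps_per_phase(ramp)
sequences = [per_step // SEQ_LEN for _, per_step in ramp]

for (tokens, per_step), n_steps, n_seq in zip(ramp, steps, sequences):
    print(f"{tokens:.1e} tokens at {per_step:>10,d} tokens/step "
          f"= {n_seq:5d} sequences/step, {n_steps:>8,d} steps")
print(f"total optimizer steps: {sum(steps):,d}")
```

Even a 15-trillion-token run comes to only about a million optimizer steps, which is part of why each step's stability matters so much.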

4.4

Numerics: mixed precision

Nothing trains in FP32 anymore. The standard recipe is BF16 compute with FP32 master state: matmuls and activations in bfloat16 (8-bit exponent: FP32's range with less precision, so none of the loss-scaling dance that FP16 required), while a master copy of the weights and the Adam moments stay in FP32 for stable accumulation.

| Format | Bits (sign · exp · mantissa) | Range | Role |
|---|---|---|---|
| FP32 | 1 · 8 · 23 | ~10^±38 | Master weights, optimizer moments, softmax/norm accumulations |
| BF16 | 1 · 8 · 7 | ~10^±38 | Default training compute since A100 |
| FP16 | 1 · 5 · 10 | ~±65,504 | Legacy training (needed loss scaling); still common in inference |
| FP8 (E4M3 / E5M2) | 1 · 4 · 3 / 1 · 5 · 2 | ±448 / ±57,344 | Hopper/Blackwell matmuls; DeepSeek-V3 trained largely in FP8 |
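The range versus precision trade-off in the table can be seen numerically. NumPy has no native bfloat16, so the sketch below emulates it by keeping only the top 16 bits of a float32. That is truncation rather than the round-to-nearest that real hardware uses, and the helper `to_bf16` is an assumption made for this example.

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of a float32 (truncation, not rounding)."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32) & np.uint32(0xFFFF0000)   # keep sign, 8 exp, 7 mantissa bits
    return bits.view(np.float32)

# Range: 70,000 exceeds FP16's maximum (65,504) but fits easily in BF16.
with np.errstate(over="ignore"):
    fp16_big = np.float32(70000.0).astype(np.float16)
bf16_big = to_bf16(70000.0)[0]

# Precision: 1 + 2^-8 survives in FP16 (10 mantissa bits) but not in BF16 (7 bits).
fp16_fine = np.float16(1 + 2**-8)
bf16_fine = to_bf16(1 + 2**-8)[0]

print(f"70000 in FP16: {fp16_big}    in BF16: {bf16_big}")
print(f"1 + 2^-8 in FP16: {float(fp16_fine)}    in BF16: {bf16_fine}")
```

BF16 keeps the large value at reduced resolution where FP16 overflows to infinity, and loses the small increment that FP16 retains, which is why accumulations stay in FP32.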

Training memory ≈ 16 bytes/param under this recipe (2 BF16 weights + 4 FP32 master weights + 8 FP32 Adam moments + 2 BF16 gradients): 70B parameters ⇒ ~1.1 TB before activations. Hence: parallelism.
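The 16 bytes/param figure is a straightforward sum. The sketch below spells it out and converts it to a device count, assuming 80 GB accelerators (an illustrative capacity) and ignoring activations, buffers, and fragmentation.

```python
# bytes held per parameter under BF16 compute with FP32 master state
BYTES_PER_PARAM = {
    "bf16 weights":        2,
    "fp32 master weights": 4,
    "fp32 Adam m":         4,
    "fp32 Adam v":         4,
    "bf16 gradients":      2,
}

def training_state_bytes(n_params):
    """Total bytes of weights, gradients, and optimizer state (no activations)."""
    return n_params * sum(BYTES_PER_PARAM.values())

n_params = 70_000_000_000
device_bytes = 80 * 10**9                         # assumed 80 GB per accelerator

total = training_state_bytes(n_params)
min_devices = -(-total // device_bytes)           # ceiling division

print(f"bytes per parameter : {sum(BYTES_PER_PARAM.values())}")
print(f"state for 70B params: {total / 1e12:.2f} TB")
print(f"80 GB devices needed just to hold it: {min_devices}")
```

The optimizer's share (master weights plus two moments) is 12 of the 16 bytes, which is why sharding it is the first thing ZeRO does.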

4.5

Parallelism: one model, twenty thousand GPUs

No single accelerator holds a frontier model and its optimizer state, let alone trains it in tolerable time. Training is decomposed along complementary axes — composed together, this is “3-D (now 4-D+) parallelism”:

FIG 4.A · PARALLELISM AXES
  • DATA PARALLEL (DP): replica 0 takes batch shard A, replica 1 takes batch shard B; all-reduce grads.
  • TENSOR PARALLEL (TP): W[:, :d/2] is half of every matmul, W[:, d/2:] the other half; all-reduce per layer (NVLink domain).
  • PIPELINE PARALLEL (PP): layers 1–40, 41–80, 81–126; micro-batches stream through stages.
  • ZeRO / FSDP (shard the states, not the math): Stage 1 shards optimizer state, Stage 2 adds gradients, Stage 3 adds parameters (gather just-in-time per layer, then discard).
Composition in practice (Llama-3-405B): TP=8 inside each server (NVLink), PP=16 across servers, DP/FSDP over the remainder, plus context parallelism for 128K-token sequences — 16,384 H100s working as one optimizer.
  • Data parallelism (DP): clone the model, split the batch, all-reduce gradients. Scales until the gradient sync saturates the network.
  • ZeRO / FSDP: DP without the memory waste — optimizer state, gradients, and finally parameters are sharded across replicas and gathered transiently. Stage-3 memory per GPU falls ~linearly in replica count.
  • Tensor parallelism (TP): split individual weight matrices across GPUs (column- then row-wise, Megatron-style) so each matmul runs jointly; needs all-reduce per layer — keep it inside the NVLink island.
  • Pipeline parallelism (PP): split by depth into stages; micro-batches stream to keep the “bubble” (idle ramp-up/down fraction ≈ \( (p-1)/m \) for \(p\) stages, \(m\) micro-batches) small. Interleaved and zero-bubble schedules (DualPipe) push this further.
  • Context/sequence parallelism: split the sequence dimension (ring attention) for very long inputs. Expert parallelism spreads MoE experts (Chapter 09).
  • Activation checkpointing: store only block boundaries, recompute the inside on backward — ~30% extra compute for several-fold activation memory savings.
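Three of the items above reduce to short checks. The sketch below verifies numerically that a column-then-row split of a two-layer MLP reproduces the unsplit result, evaluates the pipeline bubble approximation, and computes per-GPU state under each ZeRO stage using the 2 + 2 + 12 bytes/param split (weights, gradients, optimizer state). All function names and sizes are illustrative, and the sum over shards stands in for a real all-reduce.

```python
import numpy as np

def mlp(X, W1, W2):
    """Reference two-layer MLP with ReLU, computed on one device."""
    return np.maximum(X @ W1, 0.0) @ W2

def mlp_tensor_parallel(X, W1, W2, tp=2):
    """Megatron-style split: W1 by columns, W2 by rows, then sum the partial outputs."""
    col_shards = np.split(W1, tp, axis=1)
    row_shards = np.split(W2, tp, axis=0)
    partials = [np.maximum(X @ c, 0.0) @ r for c, r in zip(col_shards, row_shards)]
    return np.sum(partials, axis=0)               # stands in for the all-reduce

def bubble_fraction(p, m):
    """Approximate idle fraction for p pipeline stages and m micro-batches."""
    return (p - 1) / m

def state_bytes_per_param(stage, dp):
    """Per-GPU bytes/param for ZeRO stage 0-3 across dp data-parallel replicas."""
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= dp
    if stage >= 2:
        grads /= dp
    if stage >= 3:
        weights /= dp
    return weights + grads + optim

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 8))

same = np.allclose(mlp(X, W1, W2), mlp_tensor_parallel(X, W1, W2, tp=4))
print("tensor-parallel output matches reference:", same)
print(f"bubble, 16 stages, 64 micro-batches: {bubble_fraction(16, 64):.1%}")
for stage in range(4):
    print(f"ZeRO stage {stage}, dp=128: {state_bytes_per_param(stage, 128):.3f} bytes/param")
```

The split works because ReLU is elementwise, so each column shard of the first matmul can be activated independently before the row-sharded second matmul; only the final sum needs communication.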
4.6

The bill

Plugging EQ 4.2 into real numbers grounds every strategic conversation about AI:

EQ 4.5 — TRAINING TIME ESTIMATE $$ \text{days} \;=\; \frac{6\,N D}{n_{\text{GPU}} \times \text{FLOPs}_{\text{peak}} \times \text{MFU} \times 86{,}400} $$
MFU (model FLOPs utilization, the fraction of peak silicon throughput doing useful model math) runs 35–50% in well-tuned large runs. Example: \(N = 405\text{B},\ D = 15\text{T} \Rightarrow C \approx 3.6 \times 10^{25}\) FLOPs; on 16,384 H100s (≈990 TFLOP/s BF16 each) at 41% MFU ⇒ ~63 days. At ~$2/GPU-hr that's ~$50M of compute, before the salaries, the failed runs, and the post-training.
Using EQ 4.5, estimate wall-clock days for a run with \( C = 3.456 \times 10^{24} \) FLOPs on \( n_{\text{GPU}} = 1000 \) chips at \( \text{FLOPs}_{\text{peak}} = 10^{15} \)/s and \( \text{MFU} = 0.4 \). (Use \( 86{,}400 \) s/day.)
Denominator \( = 1000 \times 10^{15} \times 0.4 \times 86{,}400 = 3.456 \times 10^{22} \) FLOPs/day. Days \( = \dfrac{3.456 \times 10^{24}}{3.456 \times 10^{22}} = \) 100 days.
PYTHON · RUNNABLE IN-BROWSER
# the bill: days and dollars for a pre-training run (EQ 4.5)
def bill(N, D, gpus, mfu, peak=989e12, usd_hr=2.0):     # H100 BF16 peak
    C = 6 * N * D                                       # total FLOPs
    days = C / (gpus * peak * mfu) / 86_400
    return C, days, gpus * days * 24 * usd_hr

runs = [("GPT-2 redo (2019->now)", 1.5e9,    1e10,   256,     0.35),
        ("Llama-3.1-405B        ", 4.05e11,  1.5e13, 16_384,  0.41),
        ("1e26-FLOP frontier    ", 1.5e12,   1.1e13, 100_000, 0.40)]

print("run                        FLOPs      days          cost")
for name, N, D, g, mfu in runs:
    C, days, cost = bill(N, D, g, mfu)
    print(f"{name} {C:9.2e} {days:9.2f}  ${cost:>12,.0f}")

print("\n405B check: 3.65e25 FLOPs / (16,384 x 989e12 x 0.41 x 86,400) = 63.5 days, ~$50M.")
print("Meta's paper cites a 54-day snapshot of pre-training: the napkin is in the right range.")
print("GPT-2 is now a ~quarter-hour, ~$150 run. The frontier line is why")
print("training decisions reach the board.")
INSTRUMENT 4.3 — PRICE A TRAINING RUN (EQ 4.5 · H100 BF16 PEAK 989 TFLOP/s)
Readouts: compute C = 6ND, wall-clock, compute cost.
Defaults ≈ Llama-3.1-405B. Try GPT-2 (N=1.5B, D=10B tokens) on 256 GPUs: what took OpenAI weeks in 2019 now takes about a quarter of an hour. Then price a 10²⁶-FLOP frontier run and see why these decisions reach board level.
GPT-2 (2019)
~10²¹
FLOPs — reproducible today for a few hundred dollars
GPT-4 CLASS (2023)
~2×10²⁵
FLOPs — tens of millions of dollars
FRONTIER (2025–26)
10²⁶+
FLOPs — gigawatt-scale clusters, $100M–$1B+ runs
NEXT

What you have now is a base model — a magnificent autocomplete that will continue a question with three more questions. Chapter 05: the alignment stack that turns it into something you can actually talk to.

§

Further reading

  • Kaplan et al. (2020). Scaling Laws for Neural Language Models. — the first power-law account of loss versus parameters, data, and compute.
  • Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). — corrected the data/parameter trade-off; the compute-optimal recipe used since.
  • Loshchilov & Hutter (2019). Decoupled Weight Decay Regularization (AdamW). — the optimizer this chapter's schedule is built around.
  • Micikevicius et al. (2018). Mixed Precision Training. — FP16 training with loss scaling, the basis of modern numerics.
  • Shoeybi et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. — tensor parallelism for splitting a model across GPUs.
  • Rajbhandari, Rasley, Ruwase & He (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. — the sharded-optimizer scheme behind data-parallel scale.
  • Penedo et al. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. — a transparent account of modern web-data curation.