Training Techniques in Practice

4.1

Dataset curation & tagging

The most common reason a fine-tune disappoints is not the algorithm — it is the data. A base model has already seen trillions of tokens of generic text; your few thousand examples can only nudge it, so every one of them must earn its place. The job before training is curation: assemble examples that are correct, on-distribution for what you will actually ask at inference, free of duplicates, and not contaminated by your evaluation set.

Four checks do most of the work, in this order:

Step	What it removes	Why it matters
Dedup	near-identical examples	Repeats inflate apparent dataset size and let the model memorize a handful of strings instead of learning the pattern.
Decontaminate	eval/test leakage	If your benchmark questions sit in the training data, your numbers are fiction (Vol II · §6.5).
Quality filter	wrong, toxic, off-format	One mislabeled example teaches the wrong thing far more efficiently than ten right ones correct it.
Balance	skew toward easy/common cases	An imbalanced set makes the model fluent on the majority slice and blind to the tail you care about.
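The first two checks in the table can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: it removes duplicates only after a simple text normalization (true near-duplicate detection usually needs a fuzzy method such as MinHash), and it flags contamination by shared word 5-grams. The function names, the sample strings, and the choice of `n=5` are all assumptions made for the example.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())

def dedup(examples):
    """Keep the first occurrence of each normalized text."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def ngrams(text, n):
    """Set of word n-grams of the normalized text."""
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train, eval_set, n=5):
    """Drop training examples that share any word n-gram with an eval item."""
    eval_grams = set()
    for item in eval_set:
        eval_grams |= ngrams(item, n)
    return [ex for ex in train if not (ngrams(ex, n) & eval_grams)]

# Hypothetical toy data: one case/punctuation duplicate, one leaked eval question.
train = [
    "What is the capital of France? Paris.",
    "what is the capital of france?  PARIS",
    "Explain gradient descent in one sentence.",
    "Q: How many legs does a spider have? A: Eight.",
]
eval_set = ["How many legs does a spider have?"]

clean = decontaminate(dedup(train), eval_set, n=5)
print(f"{len(train)} raw -> {len(clean)} kept")
for ex in clean:
    print("  ", ex)
```

Running dedup before decontamination keeps the n-gram pass cheaper; a smaller `n` catches more paraphrased leakage at the cost of more false positives.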

Tagging is the second half of curation, and it is what turns a flat pile of text into a steerable signal. Every example carries metadata — its source, domain, language, difficulty, quality score, license — and those tags are used in two distinct ways. Offline, they drive filtering and the mixing ratios of §4.4. Inline, special tokens written into the sequence itself let the model condition on provenance: a leading <quality:high> or domain marker that the model learns to associate with the behavior you want, and that you can then assert at inference. This is the idea behind conditional pre-training and quality-tag prefixes — keep the noisy data in the corpus for breadth, but label it so the model can be told to imitate only the good parts.

CONTROL TOKENS

The format of your tags is not free decoration: they must be reserved tokens that the tokenizer treats atomically, not strings the model could also emit as ordinary text. A domain tag spelled as plain words can be hallucinated mid-generation or typed by a user; a reserved control token is a single vocabulary entry that ordinary text does not tokenize to, so it can be stripped from untrusted input and masked out at sampling time. Reuse the base model's existing special-token slots where you can.
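A toy sketch of the idea, assuming a small fixed set of reserved tags in the `<quality:high>` spelling used above. The token list, `tag_example`, `sanitize`, and the whitespace "tokenizer" are all hypothetical stand-ins; a real tokenizer would register these strings as special tokens through its own API.

```python
import re

# Hypothetical reserved control tokens; the names are illustrative only.
CONTROL_TOKENS = ["<quality:high>", "<quality:low>", "<domain:code>", "<domain:legal>"]
_CONTROL_RE = re.compile("(" + "|".join(re.escape(t) for t in CONTROL_TOKENS) + ")")

def sanitize(text):
    """Strip control-token strings from untrusted text so they cannot be spoofed.

    Repeats until stable, because one removal can splice a new token together.
    """
    while True:
        cleaned = _CONTROL_RE.sub("", text)
        if cleaned == text:
            return cleaned
        text = cleaned

def tag_example(text, quality, domain):
    """Prefix an example with control tokens built from its metadata tags."""
    tags = [f"<quality:{quality}>", f"<domain:{domain}>"]
    for tag in tags:
        if tag not in CONTROL_TOKENS:
            raise ValueError(f"unknown control token: {tag}")
    return "".join(tags) + sanitize(text)

def toy_tokenize(sequence):
    """Each control token is one atomic piece; everything else splits on whitespace."""
    pieces = []
    for chunk in _CONTROL_RE.split(sequence):
        if chunk in CONTROL_TOKENS:
            pieces.append(chunk)
        else:
            pieces.extend(chunk.split())
    return pieces

seq = tag_example("def add(a, b): return a + b", quality="high", domain="code")
print(toy_tokenize(seq))
```

At inference the same prefix is asserted by the caller, and the sanitize step keeps user-supplied text from claiming a tag it was never assigned.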

Curation also has a quantitative side: not every example contributes equally to the loss, and a corpus's effective size after dedup is smaller than its raw count. A simple, honest way to measure the diversity you actually have is the effective number of distinct items — the exponential of the entropy of the source mixture, which collapses toward 1 as one source dominates and rises toward the source count when the mixture is uniform.

EQ OM4.1 — EFFECTIVE DATASET DIVERSITY $$ H = -\sum_{i} p_i \log p_i, \qquad N_{\text{eff}} = e^{H} = \exp\!\Big(\!-\!\sum_i p_i \log p_i\Big) $$

$p_i$ is the fraction of tokens from source $i$ after dedup. $N_{\text{eff}}$ is the perplexity of the source distribution: a two-source corpus split 90% / 10% has $H \approx 0.325$ nats and $N_{\text{eff}} \approx 1.38$, barely more diverse than a single source. The number that matters is not how many sources you collected, but how evenly the tokens are spread across them. This same quantity reappears as the lever in the data-mixing instrument of §4.4.

A deduplicated corpus draws tokens from two sources in equal proportion, $ p = (0.5,\ 0.5) $. Using EQ OM4.1 with natural logs, what is the effective number of sources $ N_{\text{eff}} = e^{H} $?

$ H = -(0.5\ln 0.5 + 0.5\ln 0.5) = -\ln 0.5 = \ln 2 $. Then $ N_{\text{eff}} = e^{\ln 2} = $ 2 — a perfectly even two-source mix is worth exactly two sources, the maximum for two parts.

PYTHON · RUNNABLE IN-BROWSER

# EQ OM4.1: effective dataset diversity = exp(entropy of the source mix)
import numpy as np

def n_eff(p):
    p = np.asarray(p, float); p = p / p.sum()      # normalize to a distribution
    p = p[p > 0]                                    # 0*log0 = 0, skip empty sources
    H = -(p * np.log(p)).sum()                      # Shannon entropy (nats)
    return np.exp(H)                                # perplexity of the mixture

mixes = {
    "uniform 4-way": [1, 1, 1, 1],
    "skewed 90/10": [0.9, 0.1],
    "near single src": [0.97, 0.01, 0.01, 0.01],
    "balanced 3-way": [1, 1, 1],
}
for name, p in mixes.items():
    # pad the label in the format spec so the columns line up
    print(f"{name:<16}: N_eff = {n_eff(p):.3f}  (raw sources = {len(p)})")

print("\nN_eff collapses toward 1 as one source dominates -- collecting more")
print("sources buys nothing if 90% of your tokens still come from one of them.")

A blunt heuristic that holds up: read fifty of your own examples by hand before you train on any of them. Tools find duplicates and contamination; only a human notices that the "answers" were scraped from a forum where half of them are wrong, or that the format drifts every few hundred rows. Quality dominates quantity in fine-tuning, and the cheapest quality filter is a pair of eyes.

4.2

Freezing & unfreezing blocks

Full fine-tuning lets every weight move. But a transformer's layers do not all do the same job: the lower blocks encode generic, broadly useful features (tokens, syntax, low-level semantics), while the upper blocks specialize toward the output distribution. Freezing a block means excluding its parameters from the optimizer — they keep their pre-trained values, receive no gradient update, and need no optimizer state. The choice of where to draw the freeze line is one of the highest-leverage knobs you have, and it trades three things at once.

You freeze more →	Trainable params	Compute & memory	Forgetting risk
Freeze nothing (full FT)	100%	highest	highest
Freeze lower blocks	fewer	lower	lower
Freeze all but the head	tiny	lowest	lowest (but least capacity)

The mechanics are simple but worth stating exactly. For a model whose backbone is $L$ identical transformer blocks of $P_b$ parameters each, plus an embedding table $P_e$ and an output head $P_h$, freezing the first $k$ blocks (and the embeddings, the usual default) leaves a trainable count and fraction of:

EQ OM4.2 — TRAINABLE FRACTION UNDER FREEZING $$ P_{\text{train}} = (L-k)\,P_b + P_h, \qquad f = \frac{P_{\text{train}}}{P_e + L\,P_b + P_h} $$

$k$ is the number of frozen lower blocks. Because optimizer state (two moments with AdamW) and weight gradients are kept only for trainable parameters, halving $P_{\text{train}}$ roughly halves that part of the training memory, and it directly limits how far the weights can drift from the pre-trained solution, which is exactly the forgetting lever of §4.5. Frozen blocks still run on the forward pass, so they still cost forward compute. What you save is the optimizer step and, as long as the frozen blocks and embeddings sit below every trainable parameter, their backward pass as well: backpropagation can stop at the lowest trainable block.

Gradual unfreezing, introduced with ULMFiT, sequences these choices over time rather than fixing one. You begin with everything frozen but the head, train for a bit, then unfreeze the topmost block, then the next, and so on toward the input — each newly thawed layer also given a smaller learning rate than the one above it (discriminative fine-tuning). The intuition: let the task-specific top adapt first on stable lower features, then carefully relax the deeper, more general representations only once the top has found its footing. This was a central recipe for transfer learning before LoRA (Vol II · §6.2) made low-rank adapters the default, and it remains the right mental model for what freezing buys you.
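The schedule described above can be written down as a small generator. This is a sketch under stated assumptions: block indices run from 0 (nearest the input) to `num_blocks - 1` (nearest the head), the head trains at `base_lr` in every stage, the top block also uses `base_lr`, and each block below it divides the rate by a constant `decay`. The default of 2.6 is the per-layer factor reported in the ULMFiT paper; the function name and the rest of the setup are hypothetical.

```python
def gradual_unfreeze_schedule(num_blocks, base_lr, decay=2.6):
    """Yield (stage, thawed_blocks, lr_by_block), thawing from the top block down.

    Stage 0 trains the head only; each later stage thaws one more block.
    """
    for stage in range(num_blocks + 1):
        thawed = list(range(num_blocks - stage, num_blocks))
        # Discriminative rates: each block gets the rate of the block above it / decay.
        lrs = {b: base_lr / decay ** (num_blocks - 1 - b) for b in thawed}
        yield stage, thawed, lrs

for stage, thawed, lrs in gradual_unfreeze_schedule(num_blocks=4, base_lr=1e-4):
    rates = ", ".join(f"b{b}: {lr:.2e}" for b, lr in lrs.items()) or "head only"
    print(f"stage {stage}: thawed blocks {thawed} | {rates}")
```

In a training loop, each stage would run for a fixed number of steps before the next block is added to the optimizer's parameter groups with its own rate.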

A model has $ L = 32 $ equal-size transformer blocks (treat the embedding and head as negligible). You freeze the first $ k = 24 $ blocks. Using EQ OM4.2, what fraction $ f $ of the blocks remains trainable?

With equal blocks and a negligible head, $ f = \dfrac{L-k}{L} = \dfrac{32-24}{32} = \dfrac{8}{32} = $ 0.25. Freezing three-quarters of the backbone leaves one quarter of the parameters to learn the task, and cuts the optimizer and gradient state to roughly a quarter of what full fine-tuning would need.

PYTHON · RUNNABLE IN-BROWSER

# EQ OM4.2: freeze the first k of L blocks, get the trainable fraction
import numpy as np

L      = 32           # transformer blocks
P_b    = 200e6        # params per block (200M, ~6.4B backbone)
P_emb  = 525e6        # embedding table (frozen with the lower blocks)
P_head = 525e6        # output head (always trainable here)
total  = P_emb + L * P_b + P_head

print(f"{'frozen k':>9} {'trainable params':>18} {'fraction':>10}")
for k in (0, 8, 16, 24, 31):
    p_train = (L - k) * P_b + P_head            # head stays trainable
    frac = p_train / total
    print(f"{k:>9} {p_train/1e9:>16.3f}B {frac:>10.3f}")

# optimizer (AdamW: 2 moments) + grads ~ 12 bytes/trainable param, fp32-ish
k = 24
p_train = (L - k) * P_b + P_head
opt_gb = p_train * 12 / 1e9
print(f"\nfreeze {k}: ~{opt_gb:.1f} GB of optimizer+grad state, vs "
      f"{(total*12/1e9):.1f} GB for full FT -- the memory you buy back.")

INSTRUMENT OM4.1 — LAYER-FREEZING EXPLORER · EQ OM4.2 · WHICH BLOCKS TRAIN

Controls: backbone blocks L (default 32), frozen lower blocks k (default 24), embeddings (frozen or trainable). Readouts: trainable blocks, trainable params, trainable fraction, optimizer + gradient memory.

Each cell is one block; mint = trainable, deep-green = frozen. The head is always trainable; toggle whether the embedding table thaws too. Drag k up from 0 (full fine-tune) toward L and watch trainable params — and the optimizer memory they demand — fall away. The bottom blocks you freeze are the generic features you most want to protect; the top blocks you leave trainable are where task-specific behavior lives.

4.3

Continued pre-training & domain adaptation

Instruction fine-tuning teaches a model how to behave; it does not, by itself, teach it a new domain. If your target is legal contracts, clinical notes, a low-resource language, or an internal codebase whose idioms never appeared at scale on the open web, the base model lacks the underlying language model of that domain — and no amount of supervised examples will install vocabulary and distributional knowledge that the pre-training never built. The fix is continued pre-training (also called domain-adaptive pre-training, or DAPT): take the base model and keep running the original self-supervised objective — next-token prediction — but now on a large corpus of in-domain raw text, before you do any task fine-tuning.

EQ OM4.3 — THE TWO-STAGE OBJECTIVE $$ \theta_0 \;\xrightarrow[\text{DAPT}]{\;\mathcal{L}_{\text{LM}}(\mathcal{D}_{\text{domain}})\;} \theta_1 \;\xrightarrow[\text{SFT}]{\;\mathcal{L}_{\text{task}}(\mathcal{D}_{\text{task}})\;} \theta_2, \qquad \mathcal{L}_{\text{LM}} = -\!\sum_t \log p_\theta(x_t \mid x_{<t}) $$

Start from the pre-trained $\theta_0$; first minimize the same language-modeling loss on domain text to reach $\theta_1$; only then fine-tune on labeled task data to reach $\theta_2$. The first arrow moves the model's distribution onto your domain and builds the foundation; the second teaches the skill that stands on it. Gururangan et al. (2020) showed that this two-stage recipe beats task fine-tuning alone across the domains they tested, and that a cheaper task-adaptive variant (TAPT, pre-training on the unlabeled task data itself) helps even when a domain corpus is unavailable.

Three engineering points separate a continued-pre-training run that helps from one that quietly damages the model:

Learning rate. Use a fraction of the original pre-training peak — too high and you overwrite general knowledge (forgetting, §4.5); too low and the domain never sinks in. A short warmup and cosine decay over one to a few epochs of domain text is the standard shape.
Tokenizer fit. If your domain uses tokens the base tokenizer shatters into fragments (chemical formulae, code, a new script), continued pre-training on top of a bad tokenization is fighting uphill. Vocabulary extension is sometimes warranted — but new embedding rows start untrained and need their own warmup.
Replay. Mix a slice of general-domain text back into the domain corpus (§4.4). Pure in-domain continued pre-training is the fastest route to a model that aces your jargon and has forgotten how to hold a normal conversation.
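The learning-rate point can be made concrete with a short schedule function. It is only a sketch of the "warmup then cosine decay at a fraction of the pre-training peak" shape: the function name, the 10% peak fraction, the 100 warmup steps, and the 10% floor are placeholder assumptions to tune, not recommended values.

```python
import math

def dapt_lr(step, total_steps, pretrain_peak_lr,
            peak_fraction=0.1, warmup_steps=100, final_fraction=0.1):
    """Linear warmup to a reduced peak, then cosine decay to a floor."""
    peak = pretrain_peak_lr * peak_fraction          # fraction of the original peak
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    floor = peak * final_fraction
    return floor + (peak - floor) * cosine           # decays from peak to floor

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {dapt_lr(step, total, pretrain_peak_lr=3e-4):.2e}")
```

Raising `peak_fraction` speeds domain uptake and raises the forgetting risk of §4.5; the replay slice from the third point is the usual counterweight.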

When to reach for it. Continued pre-training is expensive relative to LoRA fine-tuning and is the wrong tool for behavioral gaps (format, style, tool protocols) — those are §4.2 / Vol II · Ch 06 territory. It earns its cost when the gap is genuinely knowledge or language: the model would need to have read more of your world to do the task at all. A useful tell is perplexity — if the base model's perplexity on a held-out sample of your domain is high, DAPT has room to work; if it is already low, you mostly need task data, not more pre-training.
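The perplexity tell is cheap to compute once you have per-token log-probabilities from the base model. The sketch below assumes natural-log probabilities, one per token, and uses made-up numbers in place of real model output; how you obtain the log-probs depends on your inference stack and is not shown.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probs, one per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from scoring held-out text with the base model.
general_sample = [-1.9, -2.3, -1.2, -2.8, -2.0]
domain_sample = [-4.1, -3.6, -5.0, -2.9, -4.4]

ppl_general = perplexity(general_sample)
ppl_domain = perplexity(domain_sample)
print(f"general text perplexity: {ppl_general:.1f}")
print(f"domain text perplexity : {ppl_domain:.1f}")
print(f"ratio                  : {ppl_domain / ppl_general:.1f}x")
```

A large ratio between domain and general perplexity suggests DAPT has room to work; compare only numbers computed with the same tokenizer, since perplexity is per token.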

4.4

Curriculum & data mixing

Once you have curated, tagged, and possibly domain-adapted, two questions about order and proportion remain. Curriculum learning asks: in what sequence should examples be presented? Data mixing asks: in what ratio should different sources be combined? Both are about shaping the distribution the optimizer sees over the course of training, and both can move final quality more than another epoch ever would.

Curriculum

The original curriculum-learning result (Bengio et al., 2009) is that presenting examples in order of increasing difficulty — easy first, hard later — can speed convergence and reach better optima than uniform random shuffling, much as a syllabus does for a student. In LLM training the effect is real but contested: it helps most when there is a clear, reliable difficulty signal (sequence length, a grader's score, a reasoning-step count) and matters less when data is already abundant and diverse. The honest summary in 2026 is that curriculum is a useful lever for reasoning and code fine-tunes with measurable difficulty, and an over-engineered one for generic chat data where good shuffling is hard to beat.
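One simple way to act on a difficulty signal is a staged curriculum: sort by the signal, cut into stages, and shuffle inside each stage so that order within a difficulty band stays random. The sketch below uses whitespace token count as the difficulty proxy, which is an assumption for illustration; a grader score or reasoning-step count would slot into the same `difficulty` argument. All names here are hypothetical.

```python
import random

def curriculum_stages(examples, difficulty, num_stages=3, seed=0):
    """Sort by difficulty, cut into roughly equal stages, shuffle within each stage."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=difficulty)
    size = max(1, -(-len(ordered) // num_stages))    # ceiling division
    stages = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    for stage in stages:
        rng.shuffle(stage)
    return stages

def token_count(example):
    """Crude difficulty proxy: number of whitespace-separated tokens."""
    return len(example.split())

examples = [
    "2 + 2",
    "Solve 3x = 12",
    "Sum the integers from 1 to 100",
    "Differentiate x squared plus three x with respect to x",
    "7 * 6",
    "Prove that the square root of two is irrational using a contradiction argument",
    "Factor x squared minus nine",
]
for i, stage in enumerate(curriculum_stages(examples, token_count)):
    print(f"stage {i}: token counts {sorted(token_count(e) for e in stage)}")
```

Setting `num_stages=1` reduces this to a plain shuffle, which gives a like-for-like baseline for checking whether the curriculum helps at all.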

Data mixing & replay

Mixing is the more universally important of the two. When you fine-tune or continue-pre-train, you choose how to weight your sources — and crucially, how much replay (a fraction of the model's original, general-domain data) to fold back in. Replay is the single most reliable defense against forgetting: by keeping the old distribution partly present, you keep its gradients partly alive. The trade-off is direct — more replay protects old capability at the cost of slower adaptation to the new domain.

EQ OM4.4 — A MIXED OBJECTIVE WITH REPLAY $$ \mathcal{L}_{\text{mix}} = (1-r)\,\mathbb{E}_{x\sim\mathcal{D}_{\text{new}}}\!\big[\ell(x)\big] \;+\; r\,\mathbb{E}_{x\sim\mathcal{D}_{\text{old}}}\!\big[\ell(x)\big], \qquad r \in [0, 1] $$

$r$ is the replay fraction — the share of each batch drawn from the original general corpus rather than the new domain. $r = 0$ is pure adaptation (fastest learning, fastest forgetting); $r = 1$ is no adaptation at all. Empirically a small replay fraction — often 1–10% — recovers most of the retained capability for a small slowdown in adaptation, which is why "mix a little of the old data back in" is the most repeated piece of fine-tuning advice that actually works. The same effective-diversity logic of EQ OM4.1 governs how the non-replay portion is itself blended across domains.
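In a data loader, EQ OM4.4 usually shows up as a sampling rule rather than a weighted loss: each batch slot is filled from the old corpus with probability $r$ and from the new corpus otherwise, so the expected batch loss matches the mixed objective. The sketch below assumes both corpora fit in memory as lists and uses tagged tuples as stand-in examples; `mixed_batch` is a hypothetical helper, not a library function.

```python
import random

def mixed_batch(new_data, old_data, batch_size, r, rng):
    """Fill each slot from old_data with probability r, otherwise from new_data."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return [rng.choice(old_data) if rng.random() < r else rng.choice(new_data)
            for _ in range(batch_size)]

rng = random.Random(0)
new_data = [("new", i) for i in range(1000)]   # stand-in for new-domain examples
old_data = [("old", i) for i in range(1000)]   # stand-in for general replay data

batches = [mixed_batch(new_data, old_data, batch_size=32, r=0.05, rng=rng)
           for _ in range(500)]
old_share = sum(tag == "old" for batch in batches for tag, _ in batch) / (500 * 32)
print(f"observed replay share over 500 batches: {old_share:.3f}")
```

Per-slot sampling makes the replay count vary from batch to batch; fixing it at `round(r * batch_size)` items per batch is an equally reasonable design when batches are small.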

PYTHON · RUNNABLE IN-BROWSER

# Data mixing with replay: how batch composition shifts effective gradients.
# A simple two-domain model: expected per-step movement toward each domain
# is proportional to that domain's share of the batch (EQ OM4.4).
import numpy as np

new_share = lambda r: 1 - r           # fraction of batch from the new domain
old_share = lambda r: r               # replay fraction from the old domain

print(f"{'replay r':>9} {'new-domain pull':>16} {'old-domain pull':>16}")
for r in (0.0, 0.01, 0.05, 0.1, 0.25, 0.5):
    print(f"{r:>9.2f} {new_share(r):>16.2f} {old_share(r):>16.2f}")

# break-even: replay you need so old-domain pull >= a target retention budget
target = 0.05                          # want >= 5% of gradient mass on old data
need = target                          # since old_share(r) = r
print(f"\nTo keep >= {target:.0%} of the gradient on old data, set r >= {need:.2f}.")
print("Even 5% replay keeps the old distribution alive while 95% of each")
print("batch still drives adaptation -- the standard anti-forgetting trick.")

INSTRUMENT OM4.2 — DATA-MIXING RATIO SIMULATOR · EQ OM4.4 · NEW vs OLD vs DIVERSITY

Controls: replay fraction r (default 0.05), new-domain sources (default 3), skew of new mix (default 1.0). Readouts: adaptation speed, old capability retained, effective sources $N_{\text{eff}}$.

The bar splits each batch into replay (deep-green, old domain) and new-domain sources (mint, one band per source). Raise r and retained capability climbs while adaptation speed falls — the §4.5 trade-off made visible. Raise the skew and watch the new-domain mix collapse toward a single source: effective diversity $N_{\text{eff}}$ (EQ OM4.1) drops even though the source count is unchanged.

4.5

Catastrophic forgetting & mitigations

Every technique in this chapter circles one failure mode. Catastrophic forgetting is the tendency of a neural network that learned task A and is then trained on task B to overwrite the weights that encoded A, sometimes destroying a capability it had moments earlier. It is not a bug in the optimizer; it is a direct consequence of how gradient descent works. The loss on B says nothing about A, so the update is free to move in any direction that lowers B's loss, including directions that wreck A. The first careful study of it in connectionist networks is McCloskey & Cohen (1989); it is the central obstacle to continual learning, and it is exactly what a domain fine-tune risks doing to a model's general skills.

The cleanest way to feel it: fit a linear model to task A, record its error, then keep training on task B alone, and watch A's error climb as the shared weights are pulled toward B.

PYTHON · RUNNABLE IN-BROWSER

# Catastrophic forgetting in miniature: fit task A, then train on B,
# and measure how much task-A performance drops (no replay).
import numpy as np
rng = np.random.default_rng(0)

d = 8
wA = rng.normal(0, 1, d)                     # task A's true weights
wB = rng.normal(0, 1, d)                     # task B: a DIFFERENT relationship
XA = rng.normal(0, 1, (200, d)); yA = XA @ wA
XB = rng.normal(0, 1, (200, d)); yB = XB @ wB

w = np.linalg.lstsq(XA, yA, rcond=None)[0]   # learn task A
mseA_before = float(np.mean((XA @ w - yA) ** 2))

lr = 0.02                                     # now train ONLY on B (full-batch gradient descent)
for _ in range(300):
    grad = XB.T @ (XB @ w - yB) / len(XB)
    w -= lr * grad
mseA_after = float(np.mean((XA @ w - yA) ** 2))

print(f"task A MSE before training on B : {mseA_before:.4f}")
print(f"task A MSE after  training on B : {mseA_after:.4f}")
print(f"forgetting (MSE increase)       : {mseA_after - mseA_before:.4f}")
print("\nA was solved exactly; fitting B with no replay overwrites the shared")
print("weights and A's error explodes. This is catastrophic forgetting.")

Forgetting is usually reported as a drop in a held-out metric on the original capability — a forgetting probe run at every checkpoint. The quantity to track is the gap between the model's score on the old task before and after adapting to the new one:

EQ OM4.5 — FORGETTING & RETENTION $$ F = a^{\text{old}}_{\text{before}} - a^{\text{old}}_{\text{after}}, \qquad R = \frac{a^{\text{old}}_{\text{after}}}{a^{\text{old}}_{\text{before}}} $$

$a^{\text{old}}$ is accuracy (or any score, higher-is-better) on the original task; $F$ is the absolute forgetting, $R$ the retention ratio. A clean adaptation pushes the new-task score up while keeping $F$ near zero. You cannot manage what you do not measure: if your only eval is the target task, a model can ace it while silently losing half its general ability — the "silent capability regression" that the fine-tuning recipe in Vol II · §6.5 warns about.
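A forgetting probe is little more than EQ OM4.5 applied at every checkpoint. The sketch below assumes a higher-is-better score and uses invented checkpoint accuracies; the 0.05 forgetting budget is an arbitrary threshold chosen for the example, not a recommendation.

```python
def forgetting(before, after):
    """Return (F, R) from EQ OM4.5 for a higher-is-better score."""
    if before <= 0:
        raise ValueError("baseline score must be positive to define R")
    return before - after, after / before

# Hypothetical old-task accuracy measured at successive fine-tuning checkpoints.
baseline = 0.90
probe = {0: 0.90, 500: 0.84, 1000: 0.71, 1500: 0.55}
budget = 0.05                       # assumed tolerance on absolute forgetting

for step, acc in probe.items():
    F, R = forgetting(baseline, acc)
    flag = "  <- exceeds budget" if F > budget else ""
    print(f"step {step:>5}: F = {F:.2f}, R = {R:.2f}{flag}")
```

Logging both numbers matters: $F$ is easy to read on a single benchmark, while $R$ makes forgetting comparable across benchmarks with different baselines.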

The mitigations, in rough order of cost and effectiveness:

Mitigation	How it fights forgetting	Cost
Replay / rehearsal (§4.4)	keep old-distribution gradients alive in every batch	a slice of old data; small slowdown
Parameter-efficient FT (LoRA, §4.2)	freeze the base; learn a small add-on that can be removed	very low; near-zero forgetting of the frozen base
Lower LR / fewer epochs	limit how far weights drift from $\theta_0$	free; trades adaptation for safety
EWC & regularizers	penalize moving weights important to the old task	a Fisher-information pass; extra hyperparameter

Elastic Weight Consolidation (Kirkpatrick et al., 2017) is the canonical regularizer. It estimates how important each weight was to the old task — using the diagonal of the Fisher information matrix, $F_i$, as a proxy for curvature — and adds a quadratic penalty that makes important weights stiff and unimportant ones free to move:

EQ OM4.6 — ELASTIC WEIGHT CONSOLIDATION $$ \mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2}\sum_i F_i\,\big(\theta_i - \theta^{\star}_i\big)^2 $$

$\theta^\star$ are the old-task weights; $F_i$ the Fisher importance of weight $i$; $\lambda$ the consolidation strength. The penalty is an anchored spring whose stiffness is $F_i$ — weights the old task relied on are pulled hard back toward $\theta^\star$, while irrelevant weights are left free to specialize. It approximates training on both tasks at once without retaining the old data, which is its appeal when the old corpus is gone. In practice, for LLMs, plain replay plus PEFT usually matches or beats EWC at lower complexity — EWC matters most when you genuinely cannot revisit old data.
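EQ OM4.6 can be tried on a two-task linear regression toy. Under the assumption of squared-error loss with unit noise variance, the diagonal Fisher for weight $i$ reduces to the mean of $x_i^2$ over the old task's inputs, so the whole method fits in a few lines of NumPy. The data, the λ values, and the helper names are illustrative; this is a sketch of the penalty, not a recipe for LLM-scale EWC.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200
wA, wB = rng.normal(0, 1, d), rng.normal(0, 1, d)    # two different true weight vectors
XA = rng.normal(0, 1, (n, d)); yA = XA @ wA          # old task A
XB = rng.normal(0, 1, (n, d)); yB = XB @ wB          # new task B

w_star = np.linalg.lstsq(XA, yA, rcond=None)[0]      # theta*: weights after task A
# Diagonal Fisher under the unit-variance Gaussian assumption: mean of x_i^2 on task A.
fisher = np.mean(XA ** 2, axis=0)

def train_on_B(lam, steps=300, lr=0.02):
    """Gradient descent on task B's loss plus the EWC penalty of EQ OM4.6."""
    w = w_star.copy()
    for _ in range(steps):
        grad_new = XB.T @ (XB @ w - yB) / n          # gradient of (1/2n)||XB w - yB||^2
        grad_pen = lam * fisher * (w - w_star)       # gradient of (lam/2) sum F_i (w_i - w*_i)^2
        w -= lr * (grad_new + grad_pen)
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

print(f"{'lambda':>7} {'task A MSE':>11} {'task B MSE':>11}")
for lam in (0.0, 1.0, 10.0):
    w = train_on_B(lam)
    print(f"{lam:>7.1f} {mse(XA, yA, w):>11.4f} {mse(XB, yB, w):>11.4f}")
```

Larger λ holds task A's error down at the price of a worse fit to task B. In this toy the two tasks directly conflict on every weight, so the penalty can only trade one error for the other; EWC helps most when the new task can be learned using weights the old task did not rely on.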

A model scores $ a^{\text{old}}_{\text{before}} = 0.90 $ on a general benchmark. After a domain fine-tune with no replay, it scores $ a^{\text{old}}_{\text{after}} = 0.55 $. Using EQ OM4.5, what is the absolute forgetting $ F $?

$ F = a^{\text{old}}_{\text{before}} - a^{\text{old}}_{\text{after}} = 0.90 - 0.55 = $ 0.35 — the model lost 35 accuracy points on what it already knew. The retention ratio $ R = 0.55/0.90 \approx 0.61 $: nearly two-fifths of the old capability is gone.

INSTRUMENT OM4.3 — FORGETTING CURVE · EQ OM4.5 · OLD TASK vs NEW TASK OVER TRAINING

Controls: replay fraction r (default 0.05), learning rate (default 1.0×), method. Readouts: new-task accuracy, old-task accuracy, forgetting F.

The mint curve is accuracy on the new task (rising); the blue curve is the old task (falling — forgetting). With full FT at high learning rate and zero replay, the old curve collapses. Add a few percent replay, or switch to a frozen-base method, and the old curve flattens while the new one barely suffers — the whole point of the chapter in one picture. Defaults already show the safe regime: 5% replay, normal LR.

You can now train a model that learns your domain without forgetting the world — the hard half of the open-model craft. Chapter 05 turns from making models capable to making them safe: red-teaming, adversarial probing, jailbreak taxonomies, and the evaluation discipline that decides whether a fine-tuned open model is fit to ship.

4.R

References

Gururangan, S., Marasović, A., Swayamdipta, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020 — domain- and task-adaptive continued pre-training (DAPT / TAPT).
Kirkpatrick, J., Pascanu, R., Rabinowitz, N. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS 2017 — Elastic Weight Consolidation (EQ OM4.6).
Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification (ULMFiT). ACL 2018 — gradual unfreezing and discriminative fine-tuning (§4.2).
Bengio, Y., Louradour, J., Collobert, R. & Weston, J. (2009). Curriculum Learning. ICML 2009 — easy-to-hard example ordering (§4.4).
McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation — the original diagnosis of forgetting.
Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. (2019). Efficient Lifelong Learning with A-GEM. ICLR 2019 — gradient-episodic-memory replay for continual learning (§4.4).
Luo, Y., Yang, Z., Meng, F. et al. (2023). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv preprint. Measures forgetting of general ability across instruction fine-tunes (§4.5).