Dataset curation & tagging
The most common reason a fine-tune disappoints is not the algorithm — it is the data. A base model has already seen trillions of tokens of generic text; your few thousand examples can only nudge it, so every one of them must earn its place. The job before training is curation: assemble examples that are correct, on-distribution for what you will actually ask at inference, free of duplicates, and not contaminated by your evaluation set.
Four checks do most of the work, in this order:
| Step | What it removes | Why it matters |
|---|---|---|
| Dedup | near-identical examples | Repeats inflate apparent dataset size and let the model memorize a handful of strings instead of learning the pattern. |
| Decontaminate | eval/test leakage | If your benchmark questions sit in the training data, your numbers are fiction (Vol II · §6.5). |
| Quality filter | wrong, toxic, off-format | One mislabeled example teaches the wrong thing far more efficiently than ten right ones correct it. |
| Balance | skew toward easy/common cases | An imbalanced set makes the model fluent on the majority slice and blind to the tail you care about. |
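The first two checks in the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: it does exact dedup after normalization (a simpler stand-in for true near-duplicate detection), and it flags contamination by shared word n-grams. The function names, the toy data, and the choice of `n=5` are all hypothetical.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants collide."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def dedup(examples):
    """Keep the first occurrence of each normalized example."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha1(normalize(ex).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def ngrams(text, n):
    """Set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train, eval_set, n=5):
    """Drop training examples that share any word n-gram with an eval item."""
    eval_grams = set()
    for item in eval_set:
        eval_grams |= ngrams(item, n)
    return [ex for ex in train if not (ngrams(ex, n) & eval_grams)]

# Toy data, invented for illustration only.
train = [
    "What is the capital of France? Paris.",
    "what is the capital of  france?  PARIS.",
    "Explain why the sky appears blue during the day.",
    "Sort a list of integers in ascending order using Python.",
]
eval_set = ["Explain why the sky appears blue during the day, briefly."]

unique = dedup(train)                         # removes the case/whitespace variant
clean = decontaminate(unique, eval_set, n=5)  # removes the eval overlap
print(len(train), len(unique), len(clean))
```

Running dedup before decontamination keeps the n-gram pass cheaper, which matches the order given in the table.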
Tagging is the second half of curation, and it is what turns a flat pile of text into a steerable signal. Every example carries metadata — its source, domain, language, difficulty, quality score, license — and those tags are used in two distinct ways. Offline, they drive filtering and the mixing ratios of §4.4. Inline, special tokens written into the sequence itself let the model condition on provenance: a leading <quality:high> or domain marker that the model learns to associate with the behavior you want, and that you can then assert at inference. This is the idea behind conditional pre-training and quality-tag prefixes — keep the noisy data in the corpus for breadth, but label it so the model can be told to imitate only the good parts.
The format of your tags is not free decoration: they must be reserved tokens that the tokenizer treats atomically, not strings the model could also emit as ordinary text. A domain tag spelled as plain words can be hallucinated mid-generation; a reserved control token is never produced by tokenizing ordinary text, and its logit can be masked at decoding time so that it appears only where you place it. Reuse the base model's existing special-token slots where you can.
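To make "treated atomically" concrete, here is a toy sketch, assuming a hypothetical reserved-tag list and a whitespace tokenizer standing in for a real subword tokenizer. It shows only the property that matters: a reserved tag survives as one token, while the same words typed as plain text fall apart into ordinary tokens.

```python
import re

# Hypothetical reserved control tokens; the names are illustrative only.
RESERVED = ["<quality:high>", "<quality:low>", "<domain:legal>"]
SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in RESERVED) + ")")

def toy_tokenize(text):
    """Keep each reserved tag as one atomic token; split other text on whitespace."""
    tokens = []
    for piece in SPECIAL_RE.split(text):
        if piece in RESERVED:
            tokens.append(piece)
        else:
            tokens.extend(piece.split())
    return tokens

def tag_example(text, quality, domain=None):
    """Prefix an example with reserved tags, rejecting tags that are not reserved."""
    tags = [f"<quality:{quality}>"]
    if domain is not None:
        tags.append(f"<domain:{domain}>")
    for tag in tags:
        if tag not in RESERVED:
            raise ValueError(f"{tag} is not a reserved token")
    return "".join(tags) + " " + text

tagged = tag_example("The lessee shall pay rent monthly.", "high", "legal")
tagged_tokens = toy_tokenize(tagged)
plain_tokens = toy_tokenize("quality: high The lessee shall pay rent monthly.")
print(tagged_tokens)
print(plain_tokens)
```

In a real stack the same effect comes from registering the tags as special tokens with the tokenizer and resizing or reusing embedding rows; that step is library-specific and is not shown here.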
Curation also has a quantitative side: not every example contributes equally to the loss, and a corpus's effective size after dedup is smaller than its raw count. A simple, honest way to measure the diversity you actually have is the effective number of sources: the exponential of the entropy of the source mixture, which collapses toward 1 as one source dominates and rises toward the source count when the mixture is uniform.
# EQ OM4.1: effective dataset diversity = exp(entropy of the source mix)
import numpy as np

def n_eff(p):
    p = np.asarray(p, float)
    p = p / p.sum()               # normalize to a distribution
    p = p[p > 0]                  # 0*log0 = 0, skip empty sources
    H = -(p * np.log(p)).sum()    # Shannon entropy (nats)
    return np.exp(H)              # perplexity of the mixture

mixes = {
    "uniform 4-way ": [1, 1, 1, 1],
    "skewed 90/10 ": [0.9, 0.1],
    "near single src ": [0.97, 0.01, 0.01, 0.01],
    "balanced 3-way ": [1, 1, 1],
}
for name, p in mixes.items():
    print(f"{name}: N_eff = {n_eff(p):.3f} (raw sources = {len(p)})")
print("\nN_eff collapses toward 1 as one source dominates -- collecting more")
print("sources buys nothing if 90% of your tokens still come from one of them.")
A blunt heuristic that holds up: read fifty of your own examples by hand before you train on any of them. Tools find duplicates and contamination; only a human notices that the "answers" were scraped from a forum where half of them are wrong, or that the format drifts every few hundred rows. Quality dominates quantity in fine-tuning, and the cheapest quality filter is a pair of eyes.
Freezing & unfreezing blocks
Full fine-tuning lets every weight move. But a transformer's layers do not all do the same job: the lower blocks encode generic, broadly useful features (tokens, syntax, low-level semantics), while the upper blocks specialize toward the output distribution. Freezing a block means excluding its parameters from the optimizer — they keep their pre-trained values, receive no gradient update, and need no optimizer state. The choice of where to draw the freeze line is one of the highest-leverage knobs you have, and it trades three things at once.
| You freeze more → | Trainable params | Compute & memory | Forgetting risk |
|---|---|---|---|
| Freeze nothing (full FT) | 100% | highest | highest |
| Freeze lower blocks | fewer | lower | lower |
| Freeze all but the head | tiny | lowest | lowest (but least capacity) |
The mechanics are simple but worth stating exactly. For a model whose backbone is \(L\) identical transformer blocks of \(P_b\) parameters each, plus an embedding table \(P_e\) and an output head \(P_h\), freezing the first \(k\) blocks (and the embeddings, the usual default) leaves a trainable count and fraction of:
\[ P_{\text{train}}(k) = (L - k)\,P_b + P_h, \qquad f(k) = \frac{P_{\text{train}}(k)}{P_e + L\,P_b + P_h}. \]
Gradual unfreezing, introduced with ULMFiT, sequences these choices over time rather than fixing one. You begin with everything frozen but the head, train for a bit, then unfreeze the topmost block, then the next, and so on toward the input — each newly thawed layer also given a smaller learning rate than the one above it (discriminative fine-tuning). The intuition: let the task-specific top adapt first on stable lower features, then carefully relax the deeper, more general representations only once the top has found its footing. This was a central recipe for transfer learning before LoRA (Vol II · §6.2) made low-rank adapters the default, and it remains the right mental model for what freezing buys you.
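The schedule described above can be written down as plain data before any training code exists. The sketch below is an assumption-laden illustration: the function name, the fixed number of steps per stage, and the dictionary layout are hypothetical, and the per-layer divisor of 2.6 is the value suggested in the ULMFiT paper rather than something tuned for transformers.

```python
def unfreeze_schedule(num_blocks, steps_per_stage, head_lr, decay=2.6):
    """Build a gradual-unfreezing plan, top block first.

    Block num_blocks - 1 is the topmost block; the head is assumed to be
    trainable in every stage at learning rate head_lr. Each block one level
    further from the head gets its learning rate divided by `decay`.
    """
    stages = []
    for stage in range(num_blocks + 1):
        thawed = list(range(num_blocks - stage, num_blocks))  # top `stage` blocks
        block_lr = {b: head_lr / decay ** (num_blocks - b) for b in thawed}
        stages.append({
            "start_step": stage * steps_per_stage,
            "trainable_blocks": thawed,
            "block_lr": block_lr,
        })
    return stages

schedule = unfreeze_schedule(num_blocks=4, steps_per_stage=100, head_lr=1e-3)
for stage in schedule:
    lrs = ", ".join(f"b{b}={lr:.2e}" for b, lr in stage["block_lr"].items())
    print(f"step {stage['start_step']:>4}: blocks {stage['trainable_blocks']}  {lrs}")
```

Stage 0 trains the head alone; the last stage has every block thawed, with the lowest block moving most slowly.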
# EQ OM4.2: freeze the first k of L blocks, get the trainable fraction
L = 32           # transformer blocks
P_b = 200e6      # params per block (200M, ~6.4B backbone)
P_emb = 525e6    # embedding table (frozen in every row below, including k = 0)
P_head = 525e6   # output head (always trainable here)
total = P_emb + L * P_b + P_head

print(f"{'frozen k':>9} {'trainable params':>18} {'fraction':>10}")
for k in (0, 8, 16, 24, 31):
    p_train = (L - k) * P_b + P_head   # head stays trainable
    frac = p_train / total
    print(f"{k:>9} {p_train/1e9:>16.3f}B {frac:>10.3f}")

# optimizer (AdamW: 2 moments) + grads ~ 12 bytes/trainable param, fp32-ish
k = 24
p_train = (L - k) * P_b + P_head
opt_gb = p_train * 12 / 1e9
print(f"\nfreeze {k}: ~{opt_gb:.1f} GB of optimizer+grad state, vs "
      f"{(total*12/1e9):.1f} GB for full FT -- the memory you buy back.")
Continued pre-training & domain adaptation
Instruction fine-tuning teaches a model how to behave; it does not, by itself, teach it a new domain. If your target is legal contracts, clinical notes, a low-resource language, or an internal codebase whose idioms never appeared at scale on the open web, the base model lacks the underlying language model of that domain — and no amount of supervised examples will install vocabulary and distributional knowledge that the pre-training never built. The fix is continued pre-training (also called domain-adaptive pre-training, or DAPT): take the base model and keep running the original self-supervised objective — next-token prediction — but now on a large corpus of in-domain raw text, before you do any task fine-tuning.
Three engineering points separate a continued-pre-training run that helps from one that quietly damages the model:
- Learning rate. Use a fraction of the original pre-training peak — too high and you overwrite general knowledge (forgetting, §4.5); too low and the domain never sinks in. A short warmup and cosine decay over one to a few epochs of domain text is the standard shape.
- Tokenizer fit. If your domain uses tokens the base tokenizer shatters into fragments (chemical formulae, code, a new script), continued pre-training on top of a bad tokenization is fighting uphill. Vocabulary extension is sometimes warranted — but new embedding rows start untrained and need their own warmup.
- Replay. Mix a slice of general-domain text back into the domain corpus (§4.4). Pure in-domain continued pre-training is the fastest route to a model that aces your jargon and has forgotten how to hold a normal conversation.
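The learning-rate shape from the first point above (short linear warmup, then cosine decay) is easy to state as a function of the step. This is a minimal sketch: the pre-training peak of 3e-4, the 10% fraction, and the step counts are placeholder assumptions, not recommendations from the text.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

pretrain_peak = 3e-4                    # assumed original pre-training peak
dapt_peak = 0.1 * pretrain_peak         # "a fraction of the original peak" (assumed 10%)
total_steps, warmup_steps = 1000, 100   # placeholder run length

for step in (0, 50, 99, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at(step, total_steps, dapt_peak, warmup_steps):.2e}")
```

The fraction is the knob to sweep: too close to the pre-training peak risks overwriting general knowledge, too small and the domain text barely moves the weights.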
When to reach for it. Continued pre-training is expensive relative to LoRA fine-tuning and is the wrong tool for behavioral gaps (format, style, tool protocols) — those are §4.2 / Vol II · Ch 06 territory. It earns its cost when the gap is genuinely knowledge or language: the model would need to have read more of your world to do the task at all. A useful tell is perplexity — if the base model's perplexity on a held-out sample of your domain is high, DAPT has room to work; if it is already low, you mostly need task data, not more pre-training.
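The perplexity tell reduces to one formula: the exponential of the mean negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from the base model on a held-out domain sample; how you obtain them is library-specific and not shown. The two inputs are synthetic (a model that spreads probability uniformly over 50 or 4 candidate tokens), chosen so the answer is known in advance.

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Synthetic inputs with a known answer: uniform probability 1/N gives perplexity N.
unsure = [math.log(1 / 50)] * 200     # behaves like a 50-way guess per token
confident = [math.log(1 / 4)] * 200   # behaves like a 4-way guess per token

ppl_unsure = perplexity(unsure)
ppl_confident = perplexity(confident)
print(f"{ppl_unsure:.1f} {ppl_confident:.1f}")
```

Compare the domain figure against the same model's perplexity on general text of similar style; the gap, more than the absolute number, is what suggests DAPT has room to work.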
Curriculum & data mixing
Once you have curated, tagged, and possibly domain-adapted, two questions about order and proportion remain. Curriculum learning asks: in what sequence should examples be presented? Data mixing asks: in what ratio should different sources be combined? Both are about shaping the distribution the optimizer sees over the course of training, and both can move final quality more than another epoch ever would.
Curriculum
The original curriculum-learning result (Bengio et al., 2009) is that presenting examples in order of increasing difficulty — easy first, hard later — can speed convergence and reach better optima than uniform random shuffling, much as a syllabus does for a student. In LLM training the effect is real but contested: it helps most when there is a clear, reliable difficulty signal (sequence length, a grader's score, a reasoning-step count) and matters less when data is already abundant and diverse. The honest summary in 2026 is that curriculum is a useful lever for reasoning and code fine-tunes with measurable difficulty, and an over-engineered one for generic chat data where good shuffling is hard to beat.
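A common practical form of easy-to-hard ordering is staged rather than strictly sorted: rank by the difficulty signal, cut into a few stages, and shuffle within each stage so the model does not see a monotone length ramp. The sketch below is one such variant under assumptions; the function name, the stage count, and the use of word count as the difficulty proxy are all illustrative.

```python
import random

def curriculum_order(examples, difficulty, num_stages=3, seed=0):
    """Order examples easy-to-hard by stage, shuffling inside each stage."""
    if not examples:
        return []
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)
    size = -(-len(ranked) // num_stages)   # ceiling division
    ordered = []
    for start in range(0, len(ranked), size):
        stage = ranked[start:start + size]
        rng.shuffle(stage)                 # random order within a stage
        ordered.extend(stage)
    return ordered

def word_count(text):
    """Toy difficulty signal: number of whitespace-separated words."""
    return len(text.split())

examples = ["a b c d", "a", "a b c d e f", "a b", "a b c d e", "a b c"]
ordered = curriculum_order(examples, word_count, num_stages=3)
difficulties = [word_count(ex) for ex in ordered]
print(difficulties)
```

Swapping `word_count` for a grader's score or a reasoning-step count changes the signal without changing the mechanism.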
Data mixing & replay
Mixing is the more universally important of the two. When you fine-tune or continue-pre-train, you choose how to weight your sources — and crucially, how much replay (a fraction of the model's original, general-domain data) to fold back in. Replay is the single most reliable defense against forgetting: by keeping the old distribution partly present, you keep its gradients partly alive. The trade-off is direct — more replay protects old capability at the cost of slower adaptation to the new domain.
# Data mixing with replay: how batch composition shifts effective gradients.
# A simple two-domain model: expected per-step movement toward each domain
# is proportional to that domain's share of the batch (EQ OM4.4).

def new_share(r):
    return 1 - r   # fraction of batch from the new domain

def old_share(r):
    return r       # replay fraction from the old domain

print(f"{'replay r':>9} {'new-domain pull':>16} {'old-domain pull':>16}")
for r in (0.0, 0.01, 0.05, 0.1, 0.25, 0.5):
    print(f"{r:>9.2f} {new_share(r):>16.2f} {old_share(r):>16.2f}")

# break-even: replay you need so old-domain pull >= a target retention budget
target = 0.05   # want >= 5% of gradient mass on old data
need = target   # since old_share(r) = r
print(f"\nTo keep >= {target:.0%} of the gradient on old data, set r >= {need:.2f}.")
print("Even 5% replay keeps the old distribution alive while 95% of each")
print("batch still drives adaptation -- the standard anti-forgetting trick.")
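In a data loader, the replay fraction becomes a per-batch quota. The following is a minimal sketch of that idea, assuming two in-memory lists and sampling with replacement; the function name and the tuple-labelled toy data are hypothetical.

```python
import random

def mixed_batch(new_data, old_data, batch_size, replay, rng):
    """Draw a batch with a fixed replay quota from the old-domain pool."""
    n_old = round(batch_size * replay)   # replay slots in this batch
    n_new = batch_size - n_old
    batch = rng.choices(new_data, k=n_new) + rng.choices(old_data, k=n_old)
    rng.shuffle(batch)                   # avoid a fixed position for replay items
    return batch

rng = random.Random(0)
new_data = [("new", i) for i in range(1000)]   # stand-in for domain examples
old_data = [("old", i) for i in range(1000)]   # stand-in for general-domain replay

batch = mixed_batch(new_data, old_data, batch_size=40, replay=0.25, rng=rng)
n_old_in_batch = sum(1 for source, _ in batch if source == "old")
print(f"{n_old_in_batch} of {len(batch)} examples are replay")
```

A fixed quota keeps the old-domain share constant in every batch; sampling each slot independently with probability `replay` gives the same share only on average.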
Catastrophic forgetting & mitigations
Every technique in this chapter circles one failure mode. Catastrophic forgetting is the tendency of a neural network, trained sequentially on task B, to overwrite the weights that encoded task A, sometimes destroying a capability it had moments earlier. It is not a bug in the optimizer; it is a direct consequence of how gradient descent works. The loss on B says nothing about A, so the update is free to move in any direction that lowers B's loss, including directions that wreck A. The first careful study of it in neural networks is McCloskey & Cohen (1989); it is the central obstacle to continual learning, and it is exactly what a domain fine-tune risks doing to a model's general skills.
The cleanest way to feel it: fit a linear model to task A, record its error, then keep training on task B alone, and watch A's error climb as the shared weights are pulled toward B.
# Catastrophic forgetting in miniature: fit task A, then train on B,
# and measure how much task-A performance drops (no replay).
import numpy as np

rng = np.random.default_rng(0)
d = 8
wA = rng.normal(0, 1, d)   # task A's true weights
wB = rng.normal(0, 1, d)   # task B: a DIFFERENT relationship
XA = rng.normal(0, 1, (200, d)); yA = XA @ wA
XB = rng.normal(0, 1, (200, d)); yB = XB @ wB

w = np.linalg.lstsq(XA, yA, rcond=None)[0]   # learn task A
mseA_before = float(np.mean((XA @ w - yA) ** 2))

lr = 0.02   # now train ONLY on B (full-batch gradient descent)
for _ in range(300):
    grad = XB.T @ (XB @ w - yB) / len(XB)   # gradient of 0.5 * MSE on B
    w -= lr * grad

mseA_after = float(np.mean((XA @ w - yA) ** 2))
print(f"task A MSE before training on B : {mseA_before:.4f}")
print(f"task A MSE after training on B : {mseA_after:.4f}")
print(f"forgetting (MSE increase) : {mseA_after - mseA_before:.4f}")
print("\nA was solved exactly; fitting B with no replay overwrites the shared")
print("weights and A's error explodes. This is catastrophic forgetting.")
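Forgetting is usually reported as a drop in a held-out metric on the original capability, measured by a forgetting probe run at every checkpoint. The quantity to track is the gap between the model's score on the old task before and after adapting to the new one:
\[ \Delta_{\text{forget}} = S_{\text{old}}(\theta_0) - S_{\text{old}}(\theta), \]
where \(\theta_0\) are the weights before adaptation, \(\theta\) the weights at the current checkpoint, and \(S_{\text{old}}\) a higher-is-better score on the old task.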
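Tracking that gap (old-task score before adaptation minus the score at each checkpoint) takes only a few lines once the probe scores exist. The sketch below assumes a higher-is-better metric; the function name is hypothetical and the scores are placeholder numbers invented to exercise the code, not measurements.

```python
def forgetting_report(baseline, checkpoint_scores):
    """Return per-checkpoint forgetting gaps and the worst checkpoint.

    baseline: old-task score before adaptation (higher is better).
    checkpoint_scores: {training_step: old-task score at that checkpoint}.
    """
    gaps = {step: baseline - score for step, score in checkpoint_scores.items()}
    worst_step = max(gaps, key=gaps.get)
    return gaps, worst_step

# Placeholder probe scores, invented for illustration only.
baseline = 0.80
checkpoint_scores = {100: 0.79, 200: 0.74, 300: 0.76}

gaps, worst_step = forgetting_report(baseline, checkpoint_scores)
for step in sorted(gaps):
    print(f"step {step}: forgetting = {gaps[step]:+.3f}")
print(f"worst checkpoint: step {worst_step}")
```

A positive gap means the old capability got worse; a gap that grows from checkpoint to checkpoint is the signal to add replay or lower the learning rate.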
The mitigations, in rough order of cost and effectiveness:
| Mitigation | How it fights forgetting | Cost |
|---|---|---|
| Replay / rehearsal (§4.4) | keep old-distribution gradients alive in every batch | a slice of old data; small slowdown |
| Parameter-efficient FT (LoRA, §4.2) | freeze the base; learn a small add-on that can be removed | very low; near-zero forgetting of the frozen base |
| Lower LR / fewer epochs | limit how far weights drift from \(\theta_0\) | free; trades adaptation for safety |
| EWC & regularizers | penalize moving weights important to the old task | a Fisher-information pass; extra hyperparameter |
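Elastic Weight Consolidation (Kirkpatrick et al., 2017) is the canonical regularizer. It estimates how important each weight was to the old task, using the diagonal of the Fisher information matrix, \(F_i\), as a proxy for curvature, and adds a quadratic penalty that makes important weights stiff and unimportant ones free to move:
\[ \mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i \,(\theta_i - \theta_{0,i})^2, \]
where \(\mathcal{L}_{\text{new}}\) is the loss on the new task, \(\theta_0\) are the weights learned on the old task, and \(\lambda\) sets how strongly the old task's important weights are held in place.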
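The penalty is easiest to see on a linear toy problem with an old task A and a new task B. The sketch below makes several assumptions: the model is linear-Gaussian with unit noise variance, so the diagonal Fisher reduces to the mean squared feature value on task A; task A's features are scaled unevenly on purpose so the Fisher entries differ; and the penalty strength of 5.0 is arbitrary. It illustrates the mechanism and is not a recipe for LLM-scale EWC.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_a = rng.normal(0, 1, d)             # task A's true weights
w_b = rng.normal(0, 1, d)             # task B's true weights
scales = np.linspace(0.2, 2.0, d)     # task A relies on some features more than others
X_a = rng.normal(0, 1, (200, d)) * scales
y_a = X_a @ w_a
X_b = rng.normal(0, 1, (200, d))
y_b = X_b @ w_b

theta0 = np.linalg.lstsq(X_a, y_a, rcond=None)[0]   # weights after learning A
fisher = np.mean(X_a ** 2, axis=0)                  # diagonal Fisher (unit-noise assumption)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def train_on_b(lam, steps=300, lr=0.02):
    """Gradient descent on 0.5*MSE_B + (lam/2) * sum_i F_i (w_i - theta0_i)^2."""
    w = theta0.copy()
    for _ in range(steps):
        grad_b = X_b.T @ (X_b @ w - y_b) / len(X_b)
        grad_penalty = lam * fisher * (w - theta0)
        w -= lr * (grad_b + grad_penalty)
    return w

w_plain = train_on_b(lam=0.0)   # no protection
w_ewc = train_on_b(lam=5.0)     # EWC-style penalty, arbitrary strength

mse_a_plain, mse_a_ewc = mse(X_a, y_a, w_plain), mse(X_a, y_a, w_ewc)
mse_b_plain, mse_b_ewc = mse(X_b, y_b, w_plain), mse(X_b, y_b, w_ewc)
print(f"task A MSE  plain: {mse_a_plain:.3f}   EWC: {mse_a_ewc:.3f}")
print(f"task B MSE  plain: {mse_b_plain:.3f}   EWC: {mse_b_ewc:.3f}")
```

The penalized run keeps task A's error lower at the price of a worse fit on task B, which is the same adaptation-versus-retention trade that the penalty strength controls in the full method.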
You can now train a model that learns your domain without forgetting the world — the hard half of the open-model craft. Chapter 05 turns from making models capable to making them safe: red-teaming, adversarial probing, jailbreak taxonomies, and the evaluation discipline that decides whether a fine-tuned open model is fit to ship.
References
- Gururangan, S., Marasović, A., Swayamdipta, S. et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N. et al. (2017). Overcoming Catastrophic Forgetting in Neural Networks.
- Howard, J. & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification (ULMFiT).
- Bengio, Y., Louradour, J., Collobert, R. & Weston, J. (2009). Curriculum Learning.
- McCloskey, M. & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem.
- Chaudhry, A., Ranzato, M., Rohrbach, M. & Elhoseiny, M. (2019). Efficient Lifelong Learning with A-GEM.
- Luo, Y., Yang, Z., Meng, F. et al. (2023). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.