To tune or not to tune
Fine-tuning is the third tool to reach for, not the first. The escalation ladder:
| Approach | Changes | Right when… | Wrong when… |
|---|---|---|---|
| Prompting | nothing | Instructions + few-shot examples suffice (they usually do) | Behavior must be deeply consistent or token budget matters at scale |
| RAG | context | The gap is knowledge — fresh, private, or vast | The gap is behavior, format, or skill |
| Fine-tuning | weights | Style, format, domain dialect, tool protocols, narrow skills; latency/cost via smaller specialized models | You're trying to inject facts (fragile, stale) or fix what a bigger model does out of the box |
Full fine-tuning updates every weight — maximum capacity, but it costs training-grade memory (≈16 bytes/param with AdamW: a 7B model wants ~112 GB before activations), produces a full model copy per task, and courts catastrophic forgetting of general capability. Parameter-efficient fine-tuning (PEFT) exists to dodge all three.
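The ≈16 bytes/param figure can be unpacked with a little arithmetic. The sketch below assumes one common mixed-precision AdamW layout (bf16 weights and gradients, plus an fp32 master copy and two fp32 moment buffers). That breakdown is an assumption rather than something the text specifies, and other training setups land on different totals.

```python
# Assumed mixed-precision AdamW layout (bytes per parameter); not the only one in use.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_gb(n_params: float) -> float:
    """Memory for weights, gradients and optimizer state, excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

per_param = sum(BYTES_PER_PARAM.values())
print(f"bytes per parameter : {per_param}")
print(f"7B model            : {training_state_gb(7e9):.0f} GB before activations")
```

Under these assumptions, three quarters of the footprint (the master copy and the two moment buffers) exists only to serve the optimizer, which is the part PEFT shrinks most.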
LoRA: the low-rank hypothesis
Low-Rank Adaptation rests on an empirical observation: the weight update a fine-tune needs has low intrinsic rank — the task lives in a tiny subspace of the full parameter space. So freeze \(W_0\) and learn the update as a product of two thin matrices:
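Written out, and assuming the standard LoRA parameterization from Hu et al. (2021), which is what the code below appears to call EQ 6.1, the adapted weight is (the names \(d_{in}\), \(d_{out}\) and \(r\) are borrowed from that code):

\[
W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{out} \times r},\quad A \in \mathbb{R}^{r \times d_{in}},\quad r \ll \min(d_{in}, d_{out})
\]

Only \(A\) and \(B\) are trained, which is \(r\,(d_{in} + d_{out})\) parameters instead of \(d_{in}\, d_{out}\); when \(d_{in} = d_{out} = d\) this is the \(2dr\) quoted in the code.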
# LoRA algebra: full vs 2dr trainable params, merge check
import numpy as np
rng = np.random.default_rng(0)
d_in, d_out, r = 256, 256, 8
W0 = rng.normal(0, 0.02, (d_out, d_in)) # frozen base weight
A = rng.normal(0, 0.02, (r, d_in)) # adapter A (starts gaussian)
B = rng.normal(0, 0.02, (d_out, r)) # adapter B ("after training": nonzero)
full = d_out * d_in
lora = r * (d_in + d_out)
print(f"full fine-tune params : {full:,}")
print(f"LoRA params (2dr) : {lora:,}")
print(f"trainable fraction : {100*lora/full:.2f} %")
x = rng.normal(0, 1, (5, d_in)) # a batch of 5 activations
y_two_path = x @ W0.T + (x @ A.T) @ B.T # frozen path + adapter path
W_merged = W0 + B @ A # EQ 6.1 merged, alpha/r = 1
y_merged = x @ W_merged.T
print("merged == two-path :", np.allclose(y_two_path, y_merged))
print("max abs difference :", float(np.abs(y_two_path - y_merged).max()))
Where to attach, what rank. Original practice targeted only \(W_Q, W_V\); the current default is all linear layers (Q, K, V, O, gate, up, down), which beats raising the rank at equal parameter count. Ranks 8–64 cover most tasks; style transfers sit low, new skills (code dialects, tool-calling formats) sit higher. rsLoRA replaces the usual \(\alpha/r\) scale with \(\alpha/\sqrt{r}\) for stability at high rank; DoRA splits each weight into magnitude and direction, applying the low-rank update to the direction, for a small quality bump.
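To get a feel for what "equal parameter count" means here, the sketch below counts LoRA parameters per transformer block for the two target sets. The shapes are an assumption (a Llama-7B-like block with hidden size 4096 and MLP width 11008, chosen only for illustration), and the helper names are hypothetical. The last loop simply tabulates the two scaling rules for a few ranks.

```python
# Assumed Llama-7B-like shapes (d_in, d_out) for each linear layer in one block.
HIDDEN, MLP = 4096, 11008
LAYER_SHAPES = {
    "q": (HIDDEN, HIDDEN), "k": (HIDDEN, HIDDEN),
    "v": (HIDDEN, HIDDEN), "o": (HIDDEN, HIDDEN),
    "gate": (HIDDEN, MLP), "up": (HIDDEN, MLP), "down": (MLP, HIDDEN),
}

def lora_params_per_block(targets, r):
    """LoRA adds r * (d_in + d_out) trainable parameters per adapted layer."""
    return sum(r * sum(LAYER_SHAPES[name]) for name in targets)

qv_only = ["q", "v"]
all_linear = list(LAYER_SHAPES)

budget = lora_params_per_block(all_linear, r=8)
# Rank that Q,V-only adapters could afford with the same parameter budget.
equivalent_qv_rank = budget / lora_params_per_block(qv_only, r=1)

print(f"all linear, r=8 : {budget:,} params per block")
print(f"Q,V only, r=8   : {lora_params_per_block(qv_only, r=8):,} params per block")
print(f"Q,V-only rank at the same budget: {equivalent_qv_rank:.1f}")

# Standard LoRA scale vs the rsLoRA scale as the rank grows (alpha is illustrative).
alpha = 32
for r in (8, 64, 256):
    print(f"r={r:3d}: alpha/r = {alpha / r:.3f}   alpha/sqrt(r) = {alpha / r ** 0.5:.3f}")
```

With these assumed shapes, adapting all seven layers at rank 8 costs about as much as Q,V-only at rank 38, so those are the two configurations being compared.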
# Rank vs capacity: truncated SVD of a rank-16 target update
import numpy as np
rng = np.random.default_rng(0)
d = 256
U = rng.normal(0, 1, (d, 16)); V = rng.normal(0, 1, (16, d))
dW = U @ V / np.sqrt(d) # a "true" update of intrinsic rank 16
u, s, vt = np.linalg.svd(dW)
ranks, errs = [1, 4, 16, 64], []
for r in ranks:
    approx = (u[:, :r] * s[:r]) @ vt[:r] # best rank-r fit (Eckart-Young)
    e = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
    errs.append(e)
    print(f"rank {r:3d}: relative Frobenius error {e:.4f}")
print("\nerror hits zero exactly at the target's intrinsic rank (16);")
print("rank 64 buys nothing. LoRA's bet is that real dW looks like this.")
plot_xy(ranks, errs) # plotting helper, assumed to be supplied by the host environment; not defined in this snippet
QLoRA: fine-tuning on one GPU
QLoRA stacks three tricks so a 65–70B model fine-tunes on a single 48 GB card: (1) freeze the base weights in 4-bit NF4; (2) train bf16 LoRA adapters on top, dequantizing on the fly per matmul; (3) page optimizer states to CPU on memory spikes.
Gradients flow through the frozen 4-bit weights into the adapters only. Quality matches 16-bit LoRA closely on instruction-tuning benchmarks — the canonical result that made serious fine-tuning a consumer-hardware activity.
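The forward pass described above can be sketched in NumPy. This is a toy under stated assumptions, not the QLoRA implementation: the codebook is a uniform 16-level grid standing in for NF4 (which places its levels at normal-distribution quantiles), the block size of 64 and the fp32 per-block scale are illustrative choices, and the function names are hypothetical. There is no autograd here, so it shows only how the frozen 4-bit path and the higher-precision adapter path combine.

```python
import numpy as np

def quantize_blockwise(w, block=64, levels=16):
    """Blockwise absmax quantization onto a uniform grid in [-1, 1].

    Stand-in for NF4: same 4-bit, per-block-scale structure, different codebook.
    """
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True)   # one scale per block
    scale[scale == 0] = 1.0                           # avoid division by zero
    codebook = np.linspace(-1.0, 1.0, levels)
    # Index of the nearest codebook entry for every normalized weight.
    idx = np.abs((flat / scale)[:, :, None] - codebook).argmin(axis=2)
    return idx.astype(np.uint8), scale, codebook

def dequantize_blockwise(idx, scale, codebook, shape):
    """Look up each code and rescale by its block's absmax."""
    return (codebook[idx] * scale).reshape(shape)

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 256, 256, 8, 16
W0 = rng.normal(0, 0.02, (d_out, d_in))
idx, scale, codebook = quantize_blockwise(W0)         # frozen base, 4-bit codes

A = rng.normal(0, 0.02, (r, d_in))                    # trainable adapter
B = np.zeros((d_out, r))                              # LoRA's usual zero init for B

x = rng.normal(0, 1, (5, d_in))
W_hat = dequantize_blockwise(idx, scale, codebook, W0.shape)   # dequantize on the fly
y = x @ W_hat.T + (alpha / r) * (x @ A.T) @ B.T       # frozen path + adapter path

rel_err = np.linalg.norm(W0 - W_hat) / np.linalg.norm(W0)
bits_per_weight = 4 + 32 / 64                         # codes + one fp32 scale per 64 weights
print(f"relative quantization error : {rel_err:.3f}")
print(f"storage                     : {bits_per_weight} bits/weight")
print(f"65B base at 4 bits/weight   : {65e9 * 4 / 8 / 1e9:.1f} GB (codes only)")
```

The last line is the reason the 48 GB figure is plausible: the 4-bit codes for a 65B base take about 32.5 GB, leaving headroom for scales, adapters, their optimizer state, and activations.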
The PEFT zoo, briefly
| Method | What trains | Notes |
|---|---|---|
| LoRA / QLoRA | low-rank ΔW | The default. Mergeable, swappable, multi-tenant. |
| Adapters (serial) | small bottleneck MLPs inserted per block | The 2019 original; adds inference latency, now rare. |
| Prefix / prompt tuning | virtual KV prefixes or input embeddings | Tiny footprint; weaker on hard tasks; fully reversible. |
| (IA)³ | per-channel rescaling vectors | Orders of magnitude fewer params than LoRA; niche. |
| BitFit | bias terms only | A useful lower bound on “how little is enough”. |
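To put rough numbers on the "What trains" column, the sketch below counts trainable parameters per transformer block for four of the methods. Everything about the block is an assumption made for illustration: hidden size 4096, MLP width 11008, a bias on every linear layer (many current models have none), two serial adapters per block with bottleneck 64, and (IA)³ vectors on keys, values, and the MLP hidden activations. Prefix and prompt tuning are left out because their cost scales with the number of virtual tokens rather than with the block.

```python
# Hypothetical block: hidden size 4096, MLP width 11008, bias on every linear layer.
D, D_FF = 4096, 11008
LINEAR = {"q": (D, D), "k": (D, D), "v": (D, D), "o": (D, D),
          "gate": (D, D_FF), "up": (D, D_FF), "down": (D_FF, D)}

def lora(r=16):
    """Low-rank pair on every linear layer: r * (d_in + d_out) each."""
    return sum(r * (d_in + d_out) for d_in, d_out in LINEAR.values())

def serial_adapters(bottleneck=64, per_block=2):
    """Down-projection, up-projection and their biases, per inserted adapter."""
    return per_block * (2 * D * bottleneck + bottleneck + D)

def ia3():
    """One rescaling vector each for keys, values and the MLP hidden activations."""
    return D + D + D_FF

def bitfit():
    """One bias entry per output channel of every linear layer."""
    return sum(d_out for _, d_out in LINEAR.values())

counts = {
    "LoRA (r=16, all linear)": lora(),
    "Serial adapters (m=64)": serial_adapters(),
    "(IA)^3": ia3(),
    "BitFit": bitfit(),
}
full = sum(d_in * d_out for d_in, d_out in LINEAR.values())
for name, n in counts.items():
    print(f"{name:26s} {n:>10,}  ({100 * n / full:.3f} % of the block's linear weights)")
```

Under these assumptions every method trains well under 1% of the block's linear weights, and the gap between LoRA and (IA)³ is a factor of about 65.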
Everything in Chapter 05 composes with PEFT: DPO-with-LoRA is the standard budget alignment stack, and GRPO over LoRA adapters is increasingly how small reasoning fine-tunes ship.
A practical recipe
# Defaults that survive contact with reality (7–70B, instruction-style task)
base:   strongest instruct model that fits serving budget
method: QLoRA (NF4) · r=16 · α=32 · all linear layers · dropout 0.05
data:   500–50k examples; dedup; decontaminate against your evals;
        quality >> quantity; read 50 examples yourself
format: exact chat template of the base model (silent killer #1)
lr:     1e-4 (LoRA) · cosine decay · warmup 3% · 1–3 epochs
batch:  effective 64–128 sequences via gradient accumulation
eval:   held-out task metric + a general benchmark (forgetting probe)
        + manual review of 50 outputs per checkpoint
ship:   merge adapter for latency · or serve multi-LoRA (S-LoRA/vLLM)
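The lr and batch lines of the recipe translate into two small calculations. The sketch below is one plausible reading, assuming linear warmup over the first 3% of steps followed by cosine decay to zero. The function names, the decay floor, and the per-device batch of 4 are illustrative assumptions, not part of the recipe.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def accumulation_steps(effective_batch, per_device_batch, n_devices=1):
    """Micro-batches to accumulate so one optimizer step sees effective_batch sequences."""
    return math.ceil(effective_batch / (per_device_batch * n_devices))

total = 1000
for s in (0, 29, 30, 500, 999):
    print(f"step {s:4d}: lr = {lr_at(s, total):.2e}")
print("accumulation steps for batch 64 at 4 per device:", accumulation_steps(64, 4))
```

Gradient accumulation trades wall-clock time for memory: the effective batch stays at 64 while only 4 sequences' activations are resident at once.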
The four classic failures: (1) wrong/mismatched chat template — model answers fine but formats garbage; (2) eval contamination — your test set leaked into training data and the numbers are fiction; (3) overfitting epoch 3+ — loss down, vibes down; (4) silent capability regression — always probe general skills, not just the target task.
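Failure (2) is the easiest of the four to screen for mechanically. A minimal sketch, assuming word-level n-gram overlap is a good enough proxy for leakage: the 8-gram threshold, the whitespace tokenization, and the function names are all illustrative choices, and paraphrased leaks will slip past a check this simple.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_examples, eval_examples, n=8):
    """Split training examples into those sharing no n-gram with the eval set and the rest."""
    eval_grams = set()
    for ex in eval_examples:
        eval_grams |= ngrams(ex, n)
    clean, dropped = [], []
    for ex in train_examples:
        (dropped if ngrams(ex, n) & eval_grams else clean).append(ex)
    return clean, dropped

train = [
    "Translate the following sentence into French: the cat sat on the mat today",
    "Summarize the quarterly report in three bullet points for the finance team",
]
evals = ["Please translate the following sentence into French: the cat sat on the mat"]

clean, dropped = decontaminate(train, evals, n=8)
print(f"kept {len(clean)}, dropped {len(dropped)}")
```

Reading the dropped examples by hand is worth the few minutes: a high drop rate usually points to a shared upstream source rather than a handful of accidental duplicates.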
Adaptation changes what a model says; compression changes what it costs. Chapter 07: distillation, the quantization stack from absmax to GPTQ/AWQ, and why bits-per-weight is the real unit of deployment.
Further reading
- Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. — the low-rank update at the heart of this chapter.
- Dettmers, Pagnoni, Holtzman & Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs. — 4-bit NF4 base weights plus LoRA, fine-tuning on a single GPU.
- Houlsby et al. (2019). Parameter-Efficient Transfer Learning for NLP. — adapter modules, the ancestor of the PEFT family.
- Li & Liang (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. — prompt/prefix tuning, a complementary PEFT branch.
- Lester, Al-Rfou & Constant (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. — soft prompts become competitive at scale.
- Aghajanyan, Zettlemoyer & Gupta (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. — the empirical basis for the low-rank hypothesis.