Fine-tune vs prompt vs RAG
Fine-tuning is the third tool you reach for, not the first. Owning the weights makes it tempting to treat every gap as a training problem, but the two cheaper levers solve most of them. The escalation ladder, in order of effort:
| Approach | Changes | Reach for it when… | Wrong tool when… |
|---|---|---|---|
| Prompting | nothing | Instructions + a few examples already steer the base/instruct model where you need it | You need deep, consistent behavior, or a long prompt is taxing every request at scale |
| RAG | context | The gap is knowledge — private, fresh, or too large to memorize; facts that change weekly | The gap is behavior, format, tone, or a skill the model lacks |
| Fine-tuning | weights | Style, output format, domain dialect, tool-call protocols, a narrow skill; or collapsing a long system prompt into the weights to cut latency and cost | You are trying to inject facts (fragile, goes stale) or to fix what a larger model already does out of the box |
The sharpest distinction is knowledge versus behavior. RAG is a retrieval layer: it puts the right documents in the context window so the model can read them, which is exactly what you want when the answer depends on facts that move — a product catalog, this quarter's policy, a codebase that changes daily. Fine-tuning bakes a pattern into the parameters: how to phrase an answer, which JSON schema to emit, how to think through a domain-specific task. Trying to teach facts by fine-tuning is the classic mistake — the model learns them brittly, forgets the long tail, and you must retrain every time the facts change. The three are not rivals; production systems usually stack them: a fine-tuned model that follows your house style, fed retrieved context, behind a thin prompt.
A one-line test. Ask: "If I could paste the perfect paragraph into the prompt, would the problem be solved?" If yes, it is a knowledge gap → RAG (or just a better prompt). If the model still wouldn't behave the way you need even with perfect context, it is a behavior gap → fine-tune. Most teams discover, after honest testing, that prompting plus RAG covers roughly 80% of cases, and they reserve fine-tuning for the consistency, format, and cost wins that nothing else delivers.
There is also a cost angle unique to open weights. A fine-tune lets a smaller model match a larger one on a narrow task, and a small specialized model is cheaper to serve, faster to decode, and fits hardware the big model never could (the memory math of Open Models · §2). So fine-tuning is not only "make it better" — sometimes it is "make a 3B model do the one job you previously needed a 70B for."
Building a fine-tuning dataset
Once you have decided to tune, the dataset is the project. The method (§3.3) is a solved, ten-line affair; the data is where all the difficulty and almost all the quality live. A supervised fine-tune (SFT) is, mechanically, just next-token prediction over examples of the behavior you want, so the model becomes exactly as good as the examples are, and no better.
The unit is a conversation, expressed in the base model's chat template. An instruction example is a list of messages with roles, typically a system message, a user turn, and the gold assistant response you want the model to imitate. The trainer renders these into one flat token string using the model's exact template, and computes the loss only on the assistant tokens (the prompt is context, not a target).
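The assistant-only loss can be sketched as a label mask. This is a toy illustration, assuming a ChatML-style template and a whitespace split standing in for a real tokenizer; `render_turn`, `build_labels`, and `IGNORE` are illustrative names, and real trainers mask token ids rather than strings. The value -100 is the default ignore index of PyTorch's cross-entropy loss, which is why it is commonly used for masked positions.

```python
IGNORE = -100  # label value the loss skips (PyTorch cross-entropy default)

def render_turn(msg):
    # Assumed ChatML-style rendering of a single turn
    return f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"

def build_labels(msgs):
    """Return (tokens, labels); labels keep a token only inside assistant turns."""
    tokens, labels = [], []
    for msg in msgs:
        pieces = render_turn(msg).split()  # toy tokenizer: whitespace split
        tokens.extend(pieces)
        if msg["role"] == "assistant":
            # The first piece is the role header: context, not a target.
            # The closing <|im_end|> stays in the target so the model learns to stop.
            labels.extend([IGNORE] + pieces[1:])
        else:
            labels.extend([IGNORE] * len(pieces))
    return tokens, labels

example = [
    {"role": "user", "content": "count the orders"},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM orders;"},
]
tokens, labels = build_labels(example)
for tok, lab in zip(tokens, labels):
    target = "(masked)" if lab == IGNORE else lab
    print(f"{tok:28} -> {target}")
trained = sum(lab != IGNORE for lab in labels)
print(f"{trained} of {len(tokens)} positions contribute to the loss")
```

Every position still feeds the forward pass as context; the mask only decides which positions are scored.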
What separates a dataset that works from one that wastes a GPU-week:
- Quality dominates quantity. The LIMA result is the canonical evidence: 1,000 carefully curated examples produced a strongly aligned model, outperforming much larger but noisier sets. A few hundred to a few thousand excellent examples beat tens of thousands of mediocre ones. Read 50 of your own examples by hand before training — if you wouldn't accept the answer, neither should the model.
- Match the serving distribution. Train on prompts shaped like the ones you'll actually receive. A model fine-tuned on tidy textbook questions degrades on the messy, typo-ridden inputs of production.
- Cover the format exactly. If you need valid JSON, every assistant response in the data must be valid JSON. The model learns the surface form ruthlessly.
- Decontaminate against your evals. If test prompts leak into training, your metrics become fiction. Dedup and check overlap before you trust a number.
- Balance, then probe forgetting. A dataset that is 100% one narrow task will sharpen that task and dull everything else (catastrophic forgetting). Mix in a little general instruction data, or accept and measure the regression.
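The dedup and decontamination checks from the list above can be approximated with normalized exact-match dedup plus word n-gram overlap against the eval prompts. This is a minimal sketch, not a standard tool: the function names, the 5-gram window, and the toy prompts are all assumptions chosen for illustration.

```python
import re

def normalize(text):
    # Lowercase, drop punctuation, collapse whitespace
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", " ", text.lower())).strip()

def ngrams(text, n=5):
    words = normalize(text).split()
    if len(words) < n:
        # Short prompts are compared as a whole
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_prompts, eval_prompts, n=5):
    """Drop exact duplicates and any training prompt sharing an n-gram with an eval prompt."""
    eval_grams = set()
    for prompt in eval_prompts:
        eval_grams |= ngrams(prompt, n)
    seen, kept, dropped = set(), [], []
    for prompt in train_prompts:
        key = normalize(prompt)
        if key in seen or ngrams(prompt, n) & eval_grams:
            dropped.append(prompt)
        else:
            seen.add(key)
            kept.append(prompt)
    return kept, dropped

train = [
    "Count the orders",
    "count the  orders!",                                # duplicate after normalization
    "List all users who signed up in the last 30 days",  # overlaps the eval prompt
    "users older than 30?",
]
evals = ["Please list all users who signed up in the last 7 days"]
kept, dropped = decontaminate(train, evals)
print("kept:", kept)
print("dropped:", dropped)
```

A fixed n-gram window is a blunt instrument: it misses paraphrased leaks and can over-drop prompts that share boilerplate, so the dropped list is worth reading by hand.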
Sources, in rough order of value: your own logs and human-written gold answers (best, expensive); distillation — generate completions from a stronger model and curate them (cheap, scalable, but check the license of the teacher and the data); and existing open instruction sets for general capability ballast. The toy formatter below shows the shape every trainer expects — raw role-tagged turns rendered into one template string.
# Format a toy instruction dataset into chat-template strings (ChatML-style)
import numpy as np
# raw examples: each a list of {role, content} turns
data = [
    [{"role": "system", "content": "You are a terse SQL assistant."},
     {"role": "user", "content": "users older than 30?"},
     {"role": "assistant", "content": "SELECT * FROM users WHERE age > 30;"}],
    [{"role": "user", "content": "count the orders"},
     {"role": "assistant", "content": "SELECT COUNT(*) FROM orders;"}],
]
def render(msgs):  # ChatML: <|im_start|>role ... <|im_end|>
    s = ""
    for m in msgs:
        s += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return s
lens = []
for i, ex in enumerate(data):
    text = render(ex)
    lens.append(len(text.split()))  # crude whitespace token proxy
    print(f"--- example {i} ({lens[-1]} ~tokens) ---")
    print(text)
print(f"examples: {len(data)} | mean ~tokens: {np.mean(lens):.1f} "
      f"| max: {int(np.max(lens))}")
print("loss is computed ONLY on the assistant spans; system+user are context.")
How much do you actually need? For a style/format adaptation, a few hundred clean examples often suffice. For a genuine new skill (a code dialect, a tool-call protocol, a reasoning pattern), low thousands. Past ~10K examples on a single narrow task you are usually buying robustness and edge-case coverage, not headline quality — and your effort is better spent auditing the examples you have than collecting more.
LoRA / QLoRA in practice
The full theory of parameter-efficient fine-tuning — the low-rank hypothesis, the PEFT zoo, the NF4 quantization data type — is laid out in Vol II · Chapter 06. Here is the operating recap and the open-weights specifics, because LoRA is what makes "I own the weights" affordable on hardware you actually have.
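Full fine-tuning updates every parameter. With AdamW in mixed precision that costs roughly 16 bytes per parameter (weights + gradients + two optimizer moments), so a 7B model wants ~112 GB of weight, gradient, and optimizer state before activations, which overflows an 80 GB card. It also produces a full model copy per task and risks catastrophic forgetting. LoRA dodges all three. Freeze the pretrained weight \(W_0\) and learn the update as a product of two thin matrices:
\[ W = W_0 + \tfrac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad r \ll \min(d_{\text{in}}, d_{\text{out}}) \]
Only \(A\) and \(B\) receive gradients; \(\alpha\) is a fixed scaling constant.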
The trainable count is the whole reason it fits. For a square \(d \times d\) projection the adapter trains \(2dr\) parameters versus \(d^2\), a fraction of \(2r/d\). At \(d = 4096,\ r = 8\) that is \(2 \cdot 4096 \cdot 8 = 65{,}536\) out of \(16.8\)M, about \(0.39\%\) of the matrix. Gradients and optimizer state shrink proportionally, because they exist only for the adapters. The frozen base still has to sit in memory (about 14 GB for 7B parameters at 16 bits), but the training state that made up most of the 112 GB collapses to a few hundred MB.
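A small NumPy sketch of the LoRA forward pass makes the structure concrete. It assumes the usual convention of a weight of shape (d_out, d_in) with `B` of shape (d_out, r) and `A` of shape (r, d_in), and the initialization from the LoRA paper (random `A`, zero `B`) so that training starts exactly at the base model. The dimensions and the name `lora_forward` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8            # toy sizes, chosen for illustration

W0 = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))      # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init

def lora_forward(x, W0, A, B, alpha, r):
    # Base path plus low-rank path; the d_out x d_in update is never materialized
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(5, d_in))
y_init = lora_forward(x, W0, A, B, alpha, r)
print("matches base at init:", np.allclose(y_init, x @ W0.T))

trainable = A.size + B.size
print(f"trainable {trainable} vs full {W0.size} ({100 * trainable / W0.size:.1f} %)")

# Once B is non-zero (as after training), the two paths equal one dense update
B_trained = rng.normal(scale=0.1, size=(d_out, r))
y_lora = lora_forward(x, W0, A, B_trained, alpha, r)
y_dense = x @ (W0 + (alpha / r) * B_trained @ A).T
print("low-rank path == dense update:", np.allclose(y_lora, y_dense))
```

Computing `(x @ A.T) @ B.T` keeps the extra work proportional to \(r\) rather than to the full matrix size.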
QLoRA goes one step further so even 65–70B models tune on a single card. It freezes the base weights in 4-bit NF4 (a data type whose 16 levels sit at the quantiles of a Gaussian, where trained weights actually live), trains bf16 LoRA adapters on top, dequantizing per matmul, and pages optimizer state to CPU on memory spikes. Quality tracks 16-bit LoRA closely on instruction-tuning benchmarks — the result that made serious fine-tuning a consumer-hardware activity. A 70B base at 4 bits is \(70 \times 10^9 \times 0.5 = 35\) GB of frozen weights, small enough for a single 48 GB card with room for adapters and activations.
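The idea of placing 4-bit levels at Gaussian quantiles can be illustrated with a simplified block quantizer. This is a sketch of the principle only, not the NF4 code book itself: real NF4 differs in detail (for example, it reserves an exact zero), and the function names and the block of 64 weights are assumptions made for the example.

```python
import numpy as np
from statistics import NormalDist

def gaussian_quantile_levels(bits=4):
    """2**bits levels at evenly spaced quantiles of N(0, 1), scaled into [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n    # midpoints avoid the infinite 0 and 1 quantiles
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return q / np.abs(q).max()

def quantize_block(w, levels):
    """Absmax-scale one block, then store the index of the nearest level."""
    scale = float(np.abs(w).max()) or 1.0   # guard against an all-zero block
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale, levels):
    return levels[idx] * scale

levels = gaussian_quantile_levels(4)
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=64)         # one block of roughly Gaussian weights
idx, scale = quantize_block(w, levels)
w_hat = dequantize_block(idx, scale, levels)

print("levels:", np.round(levels, 3))
print(f"max abs error: {np.abs(w - w_hat).max():.5f} (block scale {scale:.5f})")
```

The levels crowd together near zero, where most weights lie, and spread out in the tails. Each weight costs 4 bits plus a share of its block's scale, which is why the frozen-weight footprint works out to about half a byte per parameter.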
# LoRA parameter count vs full fine-tune, across a real layer stack
# one transformer block of a 4096-dim model (Llama-ish): attn + MLP linears
# (name, d_in, d_out)
layers = [
    ("q_proj", 4096, 4096), ("k_proj", 4096, 1024), ("v_proj", 4096, 1024),
    ("o_proj", 4096, 4096), ("gate", 4096, 14336),
    ("up", 4096, 14336), ("down", 14336, 4096),
]
r = 8  # LoRA rank (alpha scales the update but does not change the count)
full_total = lora_total = 0
print(f"{'layer':>7} | {'full d_in*d_out':>15} | {'LoRA r(d_in+d_out)':>18}")
for name, d_in, d_out in layers:
    full_params = d_in * d_out
    lora_params = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    full_total += full_params
    lora_total += lora_params
    print(f"{name:>7} | {full_params:15,} | {lora_params:18,}")
print("-" * 50)
print(f"{'TOTAL':>7} | {full_total:15,} | {lora_total:18,}")
print(f"trainable fraction (1 block): {100*lora_total/full_total:.3f} %")
# AdamW full FT ~16 B/param of training state (weights + grads + moments);
# LoRA pays that only on the adapters (frozen base weights not counted here)
print(f"training state full: {16*full_total/1e6:8.1f} MB "
      f"LoRA: {16*lora_total/1e6:6.2f} MB per block")
Where to attach, what rank. The original LoRA paper targeted only \(W_Q, W_V\); the modern default is all linear layers, which beats raising the rank at equal parameter budget. Ranks 8–64 cover most tasks — style transfers sit low, new skills sit higher. Variants tune the edges: rsLoRA rescales by \(\alpha/\sqrt{r}\) for stability at high rank, DoRA decomposes magnitude from direction for a small quality bump, and DPO-with-LoRA is the standard budget-alignment stack (Vol II · §5).
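The difference between the classic \(\alpha/r\) factor and the rsLoRA factor \(\alpha/\sqrt{r}\) is easy to see numerically. The sketch below holds \(\alpha\) fixed at 16 across ranks, which is an assumption for illustration (many recipes instead scale \(\alpha\) with \(r\)); the function names are illustrative.

```python
import math

def lora_scale(alpha, r):
    return alpha / r                # classic LoRA scaling

def rslora_scale(alpha, r):
    return alpha / math.sqrt(r)     # rank-stabilized scaling

alpha = 16
print(f"{'rank':>5} | {'alpha/r':>8} | {'alpha/sqrt(r)':>13}")
for r in (4, 8, 16, 64, 256):
    print(f"{r:5d} | {lora_scale(alpha, r):8.4f} | {rslora_scale(alpha, r):13.4f}")
```

With a fixed \(\alpha\), the classic factor shrinks linearly as rank grows, so a high-rank adapter's update is scaled down much more aggressively than under the square-root rule.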
Evaluation & iteration
A fine-tune is not done when the loss curve looks good; it is done when it measurably beats the model you started from on the thing you care about, without quietly breaking everything else. Falling training loss tells you only that the model is fitting your training data; it says nothing about generalization, and past a couple of epochs it actively misleads (loss keeps falling while real quality drops, because the model overfits the exact phrasings in the set).
Build the eval before you train, and hold it out completely. A practical evaluation has three legs:
- The target metric. A held-out set of the task itself, scored automatically where possible — exact match, JSON-validity rate, pass@1 on tests, an LLM-as-judge rubric (Vol II · §5) for open-ended outputs. This is the number you are trying to move.
- A forgetting probe. A small general benchmark (a slice of MMLU, a reasoning set, a few coding tasks) that the base model already passed. If these regress, you traded general capability for narrow skill — sometimes acceptable, never acceptable silently.
- Human eyes. Read 50 outputs per checkpoint. Automated metrics miss tone, subtle format drift, and the "fluent but wrong" failure that no exact-match catches.
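Two of the automatic target metrics named above, JSON-validity rate and exact match, fit in a few lines. This is a minimal sketch with toy strings rather than real model outputs; the function names and the whitespace-only normalization are assumptions, and a real task may need stricter or looser matching.

```python
import json

def json_validity_rate(outputs):
    """Fraction of outputs that parse as JSON."""
    def parses(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(parses(o) for o in outputs) / len(outputs)

def exact_match_rate(preds, golds):
    """Fraction of predictions equal to the gold answer after collapsing whitespace."""
    def squash(s):
        return " ".join(s.split())
    return sum(squash(p) == squash(g) for p, g in zip(preds, golds)) / len(golds)

# Toy data for illustration only
outputs = ['{"sql": "SELECT 1"}', '{"sql": "SELECT 2"', "not json"]
preds = ["SELECT COUNT(*) FROM orders;", "select * from users"]
golds = ["SELECT  COUNT(*) FROM orders;", "SELECT * FROM users;"]

validity = json_validity_rate(outputs)
em = exact_match_rate(preds, golds)
print(f"JSON validity: {validity:.2f} | exact match: {em:.2f}")
```

Running the same functions on the base model, the prompted base model, and each checkpoint gives comparable numbers for the target metric; the forgetting probe is the same loop over a different held-out set.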
Then iterate on the variable that actually matters. The loop is almost always data → train → eval → fix the data, not endless hyperparameter sweeps. When the model fails, the failure is usually a hole in the dataset (a case you didn't cover, a format you weren't consistent about), and the fix is examples, not a different learning rate. Sweep only what is cheap and high-leverage: number of epochs (1–3, watch for overfit), learning rate (≈1e-4 for LoRA, cosine decay), and rank if quality is capped.
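The learning-rate shape mentioned above, a short warmup followed by cosine decay, can be written as a pure function of the step. The sketch assumes linear warmup over 3% of steps and decay all the way to zero; both the decay floor and the name `lr_at` are assumptions, and training frameworks ship their own schedulers.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 15, 29, 30, 250, 500, 750, 999):
    print(f"step {step:4d}: lr = {lr_at(step, total):.2e}")
```

Sweeping the peak value is then a one-argument change, which keeps the sweep cheap compared with rebuilding the dataset.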
The four classic failures, in order of frequency. (1) Wrong chat template — the model answers fine but the special tokens are misaligned, so output is malformed; verify the template end-to-end. (2) Eval contamination — a test example leaked into training and your headline number is fiction; dedup and check overlap. (3) Overfitting at epoch 3+ — training loss down, real quality down; stop earlier. (4) Silent capability regression — always run the forgetting probe, not just the target metric.
A useful sanity bar: if the fine-tune does not clearly beat a well-prompted base model on your held-out set, you have not yet earned the complexity of owning a custom checkpoint. The prompt-only baseline is the number every fine-tune must beat to justify itself.
Serving your fine-tune
A LoRA fine-tune leaves you with a choice at deploy time, and it is one of the quiet superpowers of the method. The adapter is a small set of \(A, B\) matrices; you can either fold them into the base weights or keep them separate.
- Merge for latency. Compute \(W_0 + \tfrac{\alpha}{r} BA\) once and save a normal full-precision (or re-quantized) checkpoint. The result is an ordinary model — zero inference overhead, served by any engine (llama.cpp, vLLM, SGLang) exactly like the base. Best when you have one fine-tune and want maximum tokens-per-second. Note that merging into an already-quantized base then re-quantizing can lose a little quality versus merging into the full-precision weights — merge high, quantize after.
- Keep separate for multi-tenancy. Load one base model in VRAM and swap many small adapters in and out, even batching requests for different adapters together. A rank-16 adapter is typically tens to a few hundred MB, depending on the base model and which layers it targets, versus tens of GB for the model it steers, so one GPU can host hundreds of "personalities." Frameworks like S-LoRA and vLLM's multi-LoRA serving make this a production pattern, ideal when you have many per-customer or per-task fine-tunes over a shared base.
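Both deployment options can be checked on a single toy layer. The sketch below uses random matrices as stand-ins for trained adapters, and hypothetical adapter names; it shows that a merged weight gives the same outputs as the separate low-rank path, and that several adapters can share one frozen base.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 24, 4, 8
W0 = rng.normal(size=(d_out, d_in))     # shared frozen base weight

def random_adapter():
    # Stand-in for a trained adapter: both factors are non-zero
    A = rng.normal(scale=0.1, size=(r, d_in))
    B = rng.normal(scale=0.1, size=(d_out, r))
    return A, B

# Hypothetical per-task adapters over the same base
adapters = {"support": random_adapter(), "sql": random_adapter()}

def forward_with_adapter(x, name):
    """Multi-tenant path: base matmul plus the selected adapter's low-rank path."""
    A, B = adapters[name]
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

def merge(name):
    """Merged path: fold the adapter into one ordinary dense weight."""
    A, B = adapters[name]
    return W0 + (alpha / r) * (B @ A)

x = rng.normal(size=(3, d_in))
W_sql = merge("sql")
merged_out = x @ W_sql.T
separate_out = forward_with_adapter(x, "sql")
other_out = forward_with_adapter(x, "support")

print("merged == separate:", np.allclose(merged_out, separate_out))
print("adapters give different outputs:", not np.allclose(separate_out, other_out))
```

The merged weight has the same shape as the base, which is why any engine can serve it unchanged; the separate path trades a small extra matmul per request for the ability to switch adapters without reloading the base.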
The economics of the merged path are the same memory math from Open Models · §2: footprint is bits-per-weight times parameters, decode is bandwidth-bound, KV cache grows with context and batch. A merged 4-bit fine-tune of a 7B model is the same ~3.5 GB and ~200 tok/s ceiling as its base — you changed what it says, not what it costs. For distribution, the GGUF you ship is the merged-then-quantized file; for an internal fleet serving many tasks, the multi-LoRA route keeps your VRAM bill flat as the number of fine-tunes grows.
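The footprint and decode-ceiling arithmetic can be reproduced in a few lines. The memory bandwidth of 700 GB/s is an assumed figure, picked because it reproduces the ~200 tok/s ceiling quoted above; substitute the number for your own hardware. The estimate is an upper bound that ignores KV-cache reads, activations, and compute.

```python
def weight_footprint_gb(params, bits_per_weight):
    """Weight memory in GB: parameters times bits per weight."""
    return params * bits_per_weight / 8 / 1e9

def decode_ceiling_tok_s(footprint_gb, bandwidth_gb_s):
    """Bandwidth-bound ceiling: each decoded token streams the weights once."""
    return bandwidth_gb_s / footprint_gb

BANDWIDTH_GB_S = 700  # assumed memory bandwidth, not a measured value

for name, params in (("7B", 7e9), ("70B", 70e9)):
    for bits in (16, 4):
        gb = weight_footprint_gb(params, bits)
        ceiling = decode_ceiling_tok_s(gb, BANDWIDTH_GB_S)
        print(f"{name:>3} @ {bits:2d}-bit: {gb:6.1f} GB weights, "
              f"ceiling ~{ceiling:6.1f} tok/s")
```

Nothing in either function depends on whether the weights were fine-tuned, which is the point: a merged fine-tune inherits its base model's cost profile exactly.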
One open-weights caveat worth stating plainly: licenses bind the fine-tune too. The base model's terms (and the license of any data you distilled from a teacher) flow through to your derivative. "Open weights" is not automatically "use however you like" — check the specific license before you ship a commercial product on top of a fine-tune.
# Open-weights fine-tune recipe that survives contact with reality
base:    strongest open instruct model that fits your serving budget
method:  QLoRA (NF4) · r=8–16 · alpha=2r · all linear layers · dropout 0.05
data:    hundreds–few-thousand curated examples; exact chat template;
         dedup; decontaminate vs evals; read 50 by hand
train:   lr 1e-4 · cosine · warmup 3% · 1–3 epochs · effective batch 32–128
eval:    target metric + forgetting probe + 50 manual reads / checkpoint;
         must beat a well-prompted base model
ship:    merge -> quantize -> GGUF for one model · or multi-LoRA for many
SFT teaches the model to imitate; the next chapter teaches it to be preferred. Training Techniques goes beyond supervised fine-tuning into the methods that shape behavior at a deeper level — preference optimization (DPO and friends), reward modeling, and the RL-style fine-tunes (GRPO over LoRA adapters) that increasingly ship small reasoning models on open weights.
References
- Hu, E. J., Shen, Y., Wallis, P. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
- Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.
- Zhou, C., Liu, P., Xu, P. et al. (2023). LIMA: Less Is More for Alignment.
- Aghajanyan, A., Zettlemoyer, L. & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.
- Liu, S.-Y., Wang, C.-Y., Yin, H. et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation.
- Sheng, Y., Cao, S., Li, D. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
- Hugging Face. PEFT: Parameter-Efficient Fine-Tuning — Documentation.