Fine-Tuning Open Models — AI Encyclopedia

3.1

Fine-tune vs prompt vs RAG

Fine-tuning is the third tool you reach for, not the first. Owning the weights makes it tempting to treat every gap as a training problem, but the cheaper two levers solve most of them. The escalation ladder, in order of effort:

Approach	Changes	Reach for it when…	Wrong tool when…
Prompting	nothing	Instructions + a few examples already steer the base/instruct model where you need it	You need deep, consistent behavior or the prompt tax is paid on every request at scale
RAG	context	The gap is knowledge — private, fresh, or too large to memorize; facts that change weekly	The gap is behavior, format, tone, or a skill the model lacks
Fine-tuning	weights	Style, output format, domain dialect, tool-call protocols, a narrow skill; or collapsing a long system prompt into the weights to cut latency and cost	You are trying to inject facts (fragile, goes stale) or to fix what a larger model already does out of the box

The sharpest distinction is knowledge versus behavior. RAG is a retrieval layer: it puts the right documents in the context window so the model can read them, which is exactly what you want when the answer depends on facts that move — a product catalog, this quarter's policy, a codebase that changes daily. Fine-tuning bakes a pattern into the parameters: how to phrase an answer, which JSON schema to emit, how to think through a domain-specific task. Trying to teach facts by fine-tuning is the classic mistake — the model learns them brittly, forgets the long tail, and you must retrain every time the facts change. The three are not rivals; production systems usually stack them: a fine-tuned model that follows your house style, fed retrieved context, behind a thin prompt.

DECIDE

A one-line test. Ask: "If I could paste the perfect paragraph into the prompt, would the problem be solved?" If yes, it is a knowledge gap → RAG (or just a better prompt). If the model still wouldn't behave the way you need even with perfect context, it is a behavior gap → fine-tune. Most teams discover, after honest testing, that prompting plus RAG covers the large majority of cases (80% is a rule of thumb, not a measurement), and reserve fine-tuning for the consistency, format, and cost wins that nothing else delivers.

There is also a cost angle unique to open weights. A fine-tune lets a smaller model match a larger one on a narrow task, and a small specialized model is cheaper to serve, faster to decode, and fits hardware the big model never could (the memory math of Open Models · §2). So fine-tuning is not only "make it better" — sometimes it is "make a 3B model do the one job you previously needed a 70B for."

RAG is often preferable to fine-tuning when the knowledge the model needs changes frequently. True or false?

Fine-tuning bakes patterns into weights and must be re-run whenever the underlying facts change, so it goes stale on fast-moving knowledge. RAG simply retrieves the current document into context, so updating the knowledge means updating the index, not the model. For frequently-changing knowledge, RAG is the right tool — the answer is true.

[Interactive instrument OM3.1: tune / prompt / RAG decision tree. Three inputs: (1) What is the gap? (2) Does the knowledge change often? (3) How many good examples can you get? Outputs: recommendation, primary lever, combine with, confidence.]

Toggle the three answers and watch the verdict move. The tree encodes the chapter's rule: knowledge gaps go to RAG (especially volatile ones), behavior gaps go to fine-tuning — but only once you have enough examples; below ~100 it usually recommends prompting first. The combinations matter as much as the leaves: a behavior gap over stable knowledge is the canonical fine-tune-plus-RAG stack.
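The rule the tree encodes can be written down as a few lines of Python. This is a minimal sketch, not the instrument's actual logic: the function name `recommend`, the confidence labels, and the exact cutoff of 100 examples (taken from the "~100" above) are assumptions made for illustration.

```python
def recommend(gap, knowledge_volatile, n_examples, min_examples=100):
    """Map the three answers to a primary lever (illustrative rule only).

    gap: "knowledge" or "behavior"
    knowledge_volatile: True if the relevant facts change often
    n_examples: good training examples you can realistically collect
    min_examples: assumed cutoff below which prompting comes first
    """
    if gap == "knowledge":
        # Knowledge gaps go to retrieval; volatility only raises confidence.
        confidence = "high" if knowledge_volatile else "medium"
        return {"primary": "RAG", "combine_with": "prompting",
                "confidence": confidence}
    if gap == "behavior":
        if n_examples < min_examples:
            # Too little data to train on yet: steer with the prompt first.
            return {"primary": "prompting",
                    "combine_with": "collect more examples",
                    "confidence": "medium"}
        # Enough examples: tune the behavior, keep facts in retrieval.
        return {"primary": "fine-tuning", "combine_with": "RAG",
                "confidence": "high"}
    raise ValueError("gap must be 'knowledge' or 'behavior'")


print(recommend("knowledge", True, 0))
print(recommend("behavior", False, 40))
print(recommend("behavior", False, 2000))
```

The useful property is that volatility never pushes a knowledge gap toward fine-tuning, and a behavior gap only reaches fine-tuning once the example count clears the cutoff.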

3.2

Building a fine-tuning dataset

Once you have decided to tune, the dataset is the project. The method (§3.3) is a solved, ten-line affair; the data is where all the difficulty and almost all the quality lives. A supervised fine-tune (SFT) is, mechanically, just next-token prediction over examples of the behavior you want — so the model becomes exactly as good as the examples are, and no better.

The unit is a conversation, expressed in the base model's chat template. An instruction example is a list of messages with roles — typically a system message, a user turn, and the gold assistant response you want the model to imitate. The trainer renders these into one flat token string using the model's exact template, and computes the loss only on the assistant tokens (the prompt is context, not a target):

EQ OM3.1 — SFT OBJECTIVE (LOSS ON COMPLETION ONLY) $$ \mathcal{L}_{\text{SFT}} \;=\; -\sum_{i \in \text{completion}} \log p_\theta\!\left( y_i \mid y_{<i},\, \text{prompt} \right) $$

Standard cross-entropy, but masked: the sum runs only over the assistant's tokens $y_i$. The prompt and system tokens are fed in but never scored, so the model learns to produce the answer given the question rather than to also reproduce the question. The single most common bug is using the wrong chat template: train with one delimiter set and serve with another, and the content of the answer may still be right while the output is malformed (runaway generations, stray role markers), because the special tokens that frame the model's turn no longer line up.
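The masking in EQ OM3.1 is easy to check numerically. The sketch below uses random logits as a stand-in for a model and assumes `logits[i]` is already aligned to be the predictive distribution for target `y_i` (real trainers shift by one position first). The helper name `sft_loss` and the toy sizes are illustrative.

```python
import numpy as np

def sft_loss(logits, targets, completion_mask):
    """Sum of -log p(y_i) over positions where completion_mask is True."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    token_nll = -log_probs[np.arange(len(targets)), targets]
    # Prompt positions are multiplied by 0 and drop out of the sum.
    return float((token_nll * completion_mask).sum())

rng = np.random.default_rng(0)
vocab, seq_len = 50, 10
logits = rng.normal(size=(seq_len, vocab))      # stand-in for model outputs
targets = rng.integers(0, vocab, size=seq_len)
mask = np.zeros(seq_len, dtype=bool)
mask[6:] = True                                  # last 4 tokens = assistant reply

loss = sft_loss(logits, targets, mask)

# Changing a prompt-position target leaves the loss untouched.
targets_changed = targets.copy()
targets_changed[0] = (targets[0] + 1) % vocab
loss_changed = sft_loss(logits, targets_changed, mask)
print(loss, loss_changed)
```

Because the prompt positions contribute nothing, the gradient carries no pressure to reproduce the question, only to produce the answer.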

What separates a dataset that works from one that wastes a GPU-week:

Quality dominates quantity. The LIMA result is the canonical evidence: 1,000 carefully curated examples produced a strongly aligned model, outperforming much larger but noisier sets. A few hundred to a few thousand excellent examples beat tens of thousands of mediocre ones. Read 50 of your own examples by hand before training — if you wouldn't accept the answer, neither should the model.
Match the serving distribution. Train on prompts shaped like the ones you'll actually receive. A model fine-tuned on tidy textbook questions degrades on the messy, typo-ridden inputs of production.
Cover the format exactly. If you need valid JSON, every assistant response in the data must be valid JSON. The model learns the surface form ruthlessly.
Decontaminate against your evals. If test prompts leak into training, your metrics become fiction. Dedup and check overlap before you trust a number.
Balance, then probe forgetting. A dataset that is 100% one narrow task will sharpen that task and dull everything else (catastrophic forgetting). Mix in a little general instruction data, or accept and measure the regression.
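The dedup and decontamination steps in the list above can be approximated with nothing more than normalized strings and n-gram overlap. This is a rough sketch under stated assumptions: the 3-gram size, the 0.5 overlap threshold, and the helper names are all illustrative choices, and real pipelines often use fuzzier matching.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def ngrams(text, n=3):
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def dedup(prompts):
    """Keep the first occurrence of each normalized prompt."""
    seen, kept = set(), []
    for p in prompts:
        key = normalize(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

def contaminated(train_prompts, eval_prompts, n=3, threshold=0.5):
    """Train prompts sharing >= threshold of their n-grams with an eval prompt."""
    eval_grams = [ngrams(e, n) for e in eval_prompts]
    flagged = []
    for p in train_prompts:
        grams = ngrams(p, n)
        if not grams:
            continue  # too short to compare at this n
        if any(len(grams & e) / len(grams) >= threshold for e in eval_grams):
            flagged.append(p)
    return flagged

train = ["Users older than 30?", "users older than 30 ?",
         "count the orders", "List all users older than 30 please"]
evals = ["list all users older than 30"]

unique = dedup(train)
leaks = contaminated(unique, evals)
print(len(unique), leaks)
```

Anything flagged should be removed from training (or from the eval set) before a metric is trusted.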

Sources, in rough order of value: your own logs and human-written gold answers (best, expensive); distillation — generate completions from a stronger model and curate them (cheap, scalable, but check the license of the teacher and the data); and existing open instruction sets for general capability ballast. The toy formatter below shows the shape every trainer expects — raw role-tagged turns rendered into one template string.

PYTHON · RUNNABLE IN-BROWSER

# Format a toy instruction dataset into chat-template strings (ChatML-style)
import numpy as np

# raw examples: each a list of {role, content} turns
data = [
    [{"role": "system",    "content": "You are a terse SQL assistant."},
     {"role": "user",      "content": "users older than 30?"},
     {"role": "assistant", "content": "SELECT * FROM users WHERE age > 30;"}],
    [{"role": "user",      "content": "count the orders"},
     {"role": "assistant", "content": "SELECT COUNT(*) FROM orders;"}],
]

def render(msgs):                              # ChatML: <|im_start|>role ... <|im_end|>
    s = ""
    for m in msgs:
        s += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return s

lens = []
for i, ex in enumerate(data):
    text = render(ex)
    lens.append(len(text.split()))            # crude whitespace token proxy
    print(f"--- example {i} ({lens[-1]} ~tokens) ---")
    print(text)

print(f"examples: {len(data)} | mean ~tokens: {np.mean(lens):.1f} "
      f"| max: {int(np.max(lens))}")
print("loss is computed ONLY on the assistant spans; system+user are context.")


[Interactive instrument OM3.2: dataset size vs gain (diminishing returns, LIMA intuition). Inputs: examples N (default 1,000), data quality (default 0.75). Outputs: estimated quality gain, marginal gain per 2× data, regime.]

A saturating curve: gain rises fast then flattens, and the ceiling is set by quality, not count. Slide quality down and the whole curve sags — no amount of mediocre data reaches a high-quality model. Slide N past a few thousand and watch the marginal gain per doubling collapse: this is the LIMA lesson made visible. The model is illustrative, not a benchmark prediction.
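One way to reproduce the shape described here is an exponential saturation curve. The formula below is an assumption chosen for illustration (the instrument's actual model is not given in the text), and the scale `n0 = 1000` is a hypothetical parameter, not a measured constant.

```python
import math

def est_gain(n, quality, n0=1000.0):
    """Assumed toy model: gain saturates toward a ceiling set by quality."""
    return quality * (1.0 - math.exp(-n / n0))

def marginal_gain_per_doubling(n, quality, n0=1000.0):
    """Extra gain from going from n to 2n examples."""
    return est_gain(2 * n, quality, n0) - est_gain(n, quality, n0)

for n in (250, 500, 1000, 2000, 4000, 8000):
    print(f"N={n:>5}  gain={est_gain(n, 0.75):.3f}  "
          f"per-doubling={marginal_gain_per_doubling(n, 0.75):.4f}")

# Mediocre data at huge scale vs. good data at modest scale.
print(est_gain(10**9, 0.5), est_gain(1000, 0.9))
```

Under this toy model the ceiling equals the quality setting, so a low-quality set can never catch a high-quality one no matter how large N grows.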

How much do you actually need? For a style/format adaptation, a few hundred clean examples often suffice. For a genuine new skill (a code dialect, a tool-call protocol, a reasoning pattern), low thousands. Past ~10K examples on a single narrow task you are usually buying robustness and edge-case coverage, not headline quality — and your effort is better spent auditing the examples you have than collecting more.

3.3

LoRA / QLoRA in practice

The full theory of parameter-efficient fine-tuning — the low-rank hypothesis, the PEFT zoo, the NF4 quantization data type — is laid out in Vol II · Chapter 06. Here is the operating recap and the open-weights specifics, because LoRA is what makes "I own the weights" affordable on hardware you actually have.

Full fine-tuning updates every parameter. With AdamW in mixed precision that costs roughly 16 bytes per parameter (weights + gradients + two optimizer moments), so a 7B model wants ~112 GB of weights, gradients, and optimizer state before activations, which overflows an 80 GB card. It also produces a full model copy per task and risks catastrophic forgetting. LoRA dodges all three. Freeze the pretrained weight $W_0$ and learn the update as a product of two thin matrices:

EQ OM3.2 — LoRA: A LOW-RANK UPDATE $$ W \;=\; W_0 + \Delta W \;=\; W_0 + \frac{\alpha}{r}\, B A, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\; B \in \mathbb{R}^{d_{\text{out}} \times r},\; r \ll d $$

$A$ starts Gaussian, $B$ starts at zero, so training begins exactly at the pretrained model and drifts away smoothly. The scale $\alpha/r$ keeps behavior stable across ranks. Trainable parameters per matrix drop from $d_{\text{out}}\,d_{\text{in}}$ to $r\,(d_{\text{in}} + d_{\text{out}})$. After training, $BA$ can be merged into $W_0$ for zero inference overhead, or kept separate and hot-swapped — one base model multiplexing many LoRA "personalities" (§3.5).
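Both properties of EQ OM3.2, the exact match with the base model at initialization and the zero-overhead merge, can be verified on a toy layer. The dimensions, init scale, and the helper name `lora_forward` below are illustrative assumptions; the "trained" $B$ is just random noise standing in for a real update.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 8, 16

W0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))     # Gaussian init
B = np.zeros((d_out, r))                       # zero init
x = rng.normal(size=d_in)

def lora_forward(x, W0, A, B, alpha, r):
    """Base path plus scaled low-rank path, never forming the full delta."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

# 1) At initialization B = 0, so the adapted layer equals the base layer.
y_init = lora_forward(x, W0, A, B, alpha, r)
same_at_init = np.allclose(y_init, W0 @ x)

# 2) Pretend training moved B, then merge the update into one matrix.
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W0 + (alpha / r) * (B @ A)
same_after_merge = np.allclose(W_merged @ x,
                               lora_forward(x, W0, A, B, alpha, r))

trainable = A.size + B.size                    # r * (d_in + d_out)
print(same_at_init, same_after_merge, trainable, W0.size)
```

Keeping the two paths separate at training time is what avoids materializing a full $d_{\text{out}} \times d_{\text{in}}$ gradient; merging afterwards is a one-off matrix add.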

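The trainable count is the whole reason it fits. For a square $d \times d$ projection the adapter trains $2dr$ parameters versus $d^2$, a fraction of $2r/d$. At $d = 4096,\ r = 8$ that is $2 \cdot 4096 \cdot 8 = 65{,}536$ out of $16.8$M, about $0.39\%$ of the matrix. Gradient and optimizer state shrink proportionally: the trainable state typically drops to well under a gigabyte, leaving the frozen base weights (about 14 GB for a 7B model in bf16, less once quantized) as the dominant cost instead of the 112 GB of a full fine-tune.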
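The memory arithmetic can be laid out explicitly. The sketch below uses one common mixed-precision accounting for the 16 bytes per parameter (the exact split is an assumption), takes the layer shapes from the Llama-style block used later in this section, and assumes 32 such blocks; activations are ignored throughout.

```python
GB = 1e9

# One common accounting of mixed-precision AdamW, in bytes per parameter.
BYTES_PER_TRAINED_PARAM = {
    "bf16 weights": 2, "bf16 grads": 2,
    "fp32 master weights": 4, "fp32 Adam m": 4, "fp32 Adam v": 4,
}
per_param = sum(BYTES_PER_TRAINED_PARAM.values())   # 16

params = 7e9
full_ft_gb = params * per_param / GB                # full fine-tune

# LoRA at r=8 on all linear layers of one block: (d_in, d_out) pairs.
shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),
          (4096, 14336), (4096, 14336), (14336, 4096)]
r, n_blocks = 8, 32                                  # n_blocks is assumed
adapter_params = n_blocks * sum(r * (din + dout) for din, dout in shapes)
adapter_state_gb = adapter_params * per_param / GB

frozen_bf16_gb = params * 2 / GB                     # base kept in bf16
frozen_4bit_gb = params * 0.5 / GB                   # base kept in 4-bit

print(f"full fine-tune state : {full_ft_gb:7.1f} GB")
print(f"adapter train state  : {adapter_state_gb:7.2f} GB")
print(f"LoRA, bf16 base      : {frozen_bf16_gb + adapter_state_gb:7.2f} GB")
print(f"LoRA, 4-bit base     : {frozen_4bit_gb + adapter_state_gb:7.2f} GB")
```

The trainable state is a rounding error next to the frozen base weights, which is why quantizing the base is the next lever once the adapters are in place.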

You attach a LoRA adapter of rank $ r = 8 $ to a square projection with $ d_{\text{in}} = d_{\text{out}} = 4096 $. How many trainable parameters does the adapter add (matrix $A$ plus matrix $B$)?

Trainable params $ = r(d_{\text{in}} + d_{\text{out}}) = 2dr = 2 \times 8 \times 4096 = $ 65536. That is about $0.39\%$ of the full $4096^2 = 16{,}777{,}216$-parameter update, and the gradient and optimizer state for this matrix shrink by the same factor, which is why the training overhead on top of the frozen base weights is small enough for modest GPUs.

QLoRA goes one step further so even 65–70B models tune on a single card. It freezes the base weights in 4-bit NF4 (a data type whose 16 levels sit at the quantiles of a Gaussian, where trained weights actually live), trains bf16 LoRA adapters on top, dequantizing per matmul, and pages optimizer state to CPU on memory spikes. Quality tracks 16-bit LoRA closely on instruction-tuning benchmarks — the result that made serious fine-tuning a consumer-hardware activity. A 70B base at 4 bits is $70 \times 10^9 \times 0.5 = 35$ GB of frozen weights, small enough for a single 48 GB card with room for adapters and activations.
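The idea of placing 16 levels at Gaussian quantiles can be illustrated with a simplified quantizer. This is not the exact NF4 codebook (the published construction differs, for instance it includes an exact zero); the level placement, the absmax scaling of a 64-weight block, and all function names are assumptions of this sketch.

```python
import numpy as np
from statistics import NormalDist

def gaussian_quantile_levels(k=16):
    """k levels at the midpoints of equal-probability bins of N(0, 1),
    rescaled so the outermost levels sit at about -1 and +1."""
    probs = (np.arange(k) + 0.5) / k
    q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return q / np.abs(q).max()

def quantize_block(w, levels):
    """Absmax-normalize one block, then snap each weight to the nearest level."""
    scale = float(np.abs(w).max()) or 1.0
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale          # 4-bit codes + one scale

def dequantize_block(idx, scale, levels):
    return levels[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64)              # one block of weights
levels = gaussian_quantile_levels()
idx, scale = quantize_block(w, levels)
w_hat = dequantize_block(idx, scale, levels)
max_err = float(np.abs(w - w_hat).max())

frozen_gb = 70e9 * 0.5 / 1e9                    # 70B params at 4 bits each
print(levels.round(3))
print(f"max abs error in block: {max_err:.5f}  | 70B frozen: {frozen_gb} GB")
```

The levels crowd together near zero, where most trained weights sit, and spread out in the tails, which is the property a uniform 4-bit grid lacks.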

PYTHON · RUNNABLE IN-BROWSER

# LoRA parameter count vs full fine-tune, across a real layer stack
import numpy as np

# one transformer block of a 4096-dim model (Llama-ish): attn + MLP linears
# (name, d_in, d_out)
layers = [
    ("q_proj", 4096, 4096), ("k_proj", 4096, 1024), ("v_proj", 4096, 1024),
    ("o_proj", 4096, 4096), ("gate",   4096, 14336),
    ("up",     4096, 14336), ("down",  14336, 4096),
]
r, alpha = 8, 16

full = lora = 0
print(f"{'layer':>7} | {'full d_in*d_out':>15} | {'LoRA r(d_in+d_out)':>18}")
for name, din, dout in layers:
    f = din * dout
    l = r * (din + dout)
    full += f; lora += l
    print(f"{name:>7} | {f:15,} | {l:18,}")

print("-" * 50)
print(f"{'TOTAL':>7} | {full:15,} | {lora:18,}")
print(f"trainable fraction (1 block): {100*lora/full:.3f} %")
# AdamW full FT ~16 B/param of training state (weights + grads + moments);
# LoRA pays that only on the adapters (frozen base weights not counted here)
print(f"train state  full: {16*full/1e6:8.1f} MB   "
      f"LoRA: {16*lora/1e6:6.2f} MB  per block")


[Interactive instrument OM3.3: LoRA rank vs trainable params for one linear layer (EQ OM3.2). Inputs: d_in (default 4,096), d_out (default 4,096), rank r (default 8). Outputs: full ΔW (d_in·d_out), LoRA r·(d_in+d_out), trainable % of the full update.]

Set both dims to 4,096 and rank to 8 to reproduce the exercise: 65,536 params, 0.39%. Notice that for a fixed layer the trainable count grows linearly in rank while the full update is fixed, so doubling the rank doubles both the adapter and its fraction of the full update (0.39% at rank 8, 0.78% at rank 16); the fraction stays small only because $r \ll d$. Current default: attach to all linear layers (Q, K, V, O, gate, up, down) at rank 8–64, with $\alpha = 2r$ a common starting point.

Where to attach, what rank. The original LoRA paper targeted only $W_Q, W_V$; the modern default is all linear layers, which beats raising the rank at equal parameter budget. Ranks 8–64 cover most tasks — style transfers sit low, new skills sit higher. Variants tune the edges: rsLoRA rescales by $\alpha/\sqrt{r}$ for stability at high rank, DoRA decomposes magnitude from direction for a small quality bump, and DPO-with-LoRA is the standard budget-alignment stack (Vol II · §5).
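To make "equal parameter budget" concrete, the sketch below computes which rank on $W_Q, W_V$ alone costs the same as rank 8 on all linear layers, and compares the two scaling rules. The layer shapes are the Llama-style ones used earlier in this section; the code shows only the arithmetic, not the quality claim, and fixing $\alpha = 16$ across ranks is an illustrative choice.

```python
import math

# (d_in, d_out) for one block of a 4096-dim model
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate": (4096, 14336),
    "up": (4096, 14336), "down": (14336, 4096),
}

def lora_params(r, targets):
    """Adapter parameters per block for rank r on the given layers."""
    return sum(r * (shapes[t][0] + shapes[t][1]) for t in targets)

budget = lora_params(8, list(shapes))                  # all linear, r=8
per_rank_qv = lora_params(1, ["q_proj", "v_proj"])     # cost of one rank on Q,V
equal_budget_rank = budget // per_rank_qv

print(f"all-linear r=8 : {budget:,} params per block")
print(f"Q,V only       : rank {equal_budget_rank} fits the same budget")

# Scaling factors with alpha held fixed while the rank grows.
alpha = 16
for r in (8, 16, 64):
    print(f"r={r:>2}  alpha/r={alpha / r:.3f}  "
          f"alpha/sqrt(r)={alpha / math.sqrt(r):.3f}")
```

With $\alpha$ fixed, the standard $\alpha/r$ factor shrinks eight-fold between rank 8 and rank 64, while the rsLoRA factor shrinks by only about 2.8×, which is the stability argument for high ranks.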

3.4

Evaluation & iteration

A fine-tune is not done when the loss curve looks good; it is done when it measurably beats the model you started from on the thing you care about, without quietly breaking everything else. Falling training loss only tells you the model is fitting your data; it says nothing about generalization, and past a couple of epochs it actively misleads (loss keeps falling while real quality drops, because the model overfits the exact phrasings in the set).

Build the eval before you train, and hold it out completely. A practical evaluation has three legs:

The target metric. A held-out set of the task itself, scored automatically where possible — exact match, JSON-validity rate, pass@1 on tests, an LLM-as-judge rubric (Vol II · §5) for open-ended outputs. This is the number you are trying to move.
A forgetting probe. A small general benchmark (a slice of MMLU, a reasoning set, a few coding tasks) that the base model already passed. If these regress, you traded general capability for narrow skill — sometimes acceptable, never acceptable silently.
Human eyes. Read 50 outputs per checkpoint. Automated metrics miss tone, subtle format drift, and the "fluent but wrong" failure that no exact-match catches.
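Two of the automatic target metrics named above, JSON-validity rate and exact match, need only the standard library. A minimal sketch follows; the function names and the whitespace-stripping rule in `exact_match` are illustrative choices, and the sample outputs are made up.

```python
import json

def json_validity_rate(outputs):
    """Fraction of outputs that parse as JSON with no surrounding text."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def exact_match(preds, golds):
    """Fraction of predictions equal to the gold answer after stripping."""
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)

outputs = ['{"a": 1}', "{'a': 1}", "[1, 2]", 'Sure! {"a": 1}']
validity = json_validity_rate(outputs)

preds = ["SELECT COUNT(*) FROM orders;", "SELECT * FROM users;"]
golds = ["SELECT COUNT(*) FROM orders; ", "SELECT * FROM users WHERE age > 30;"]
em = exact_match(preds, golds)
print(validity, em)
```

The strict parser is deliberate: single quotes and a chatty preamble both count as failures, because a downstream consumer would reject them too.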

Then iterate on the variable that actually matters. The loop is almost always data → train → eval → fix the data, not endless hyperparameter sweeps. When the model fails, the failure is usually a hole in the dataset (a case you didn't cover, a format you weren't consistent about), and the fix is examples, not a different learning rate. Sweep only what is cheap and high-leverage: number of epochs (1–3, watch for overfit), learning rate (≈1e-4 for LoRA, cosine decay), and rank if quality is capped.

PITFALLS

The four classic failures, in order of frequency. (1) Wrong chat template — the model answers fine but the special tokens are misaligned, so output is malformed; verify the template end-to-end. (2) Eval contamination — a test example leaked into training and your headline number is fiction; dedup and check overlap. (3) Overfitting at epoch 3+ — training loss down, real quality down; stop earlier. (4) Silent capability regression — always run the forgetting probe, not just the target metric.

A useful sanity bar: if the fine-tune does not clearly beat a well-prompted base model on your held-out set, you have not yet earned the complexity of owning a custom checkpoint. The prompt-only baseline is the number every fine-tune must beat to justify itself.
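The sanity bar and the forgetting probe combine naturally into a single release gate. The sketch below is one hypothetical way to encode it: the function name `should_ship` and both thresholds (`min_gain`, `max_regression`) are assumptions to be set per project, not recommended values.

```python
def should_ship(target_ft, target_prompted_base, probe_ft, probe_base,
                min_gain=0.02, max_regression=0.01):
    """Return (ok, reasons) for a fine-tuned checkpoint.

    target_*: score on the held-out task set (higher is better)
    probe_*:  score on the general forgetting probe (higher is better)
    """
    reasons = []
    if target_ft - target_prompted_base < min_gain:
        reasons.append("does not clearly beat the well-prompted baseline")
    if probe_base - probe_ft > max_regression:
        reasons.append("forgetting probe regressed")
    return (not reasons), reasons

print(should_ship(0.91, 0.80, 0.70, 0.70))   # clear win, no regression
print(should_ship(0.81, 0.80, 0.70, 0.70))   # gain too small
print(should_ship(0.91, 0.80, 0.60, 0.70))   # silent capability loss
```

Returning the reasons alongside the verdict keeps a regression from passing silently, which is the failure the probe exists to catch.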

3.5

Serving your fine-tune

A LoRA fine-tune leaves you with a choice at deploy time, and it is one of the quiet superpowers of the method. The adapter is a small set of $A, B$ matrices; you can either fold them into the base weights or keep them separate.

Merge for latency. Compute $W_0 + \tfrac{\alpha}{r} BA$ once and save a normal full-precision (or re-quantized) checkpoint. The result is an ordinary model — zero inference overhead, served by any engine (llama.cpp, vLLM, SGLang) exactly like the base. Best when you have one fine-tune and want maximum tokens-per-second. Note that merging into an already-quantized base then re-quantizing can lose a little quality versus merging into the full-precision weights — merge high, quantize after.
Keep separate for multi-tenancy. Load one base model in VRAM and swap many tiny adapters in and out, even batching requests for different adapters together. A typical rank-16 adapter is tens to a few hundred MB (depending on the base model's size and which layers it targets) versus tens of GB for the model it steers, so one GPU can host hundreds of "personalities." Frameworks like S-LoRA and vLLM's multi-LoRA serving make this a production pattern, ideal when you have many per-customer or per-task fine-tunes over a shared base.
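The batching trick behind the keep-separate path can be shown on a single toy layer: the base matmul runs once for the whole batch, and each request adds only its own low-rank term. This is a conceptual sketch, not how S-LoRA or vLLM implement it; the adapter names, sizes, and the `batched_forward` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 32, 4, 8

W0 = rng.normal(size=(d_out, d_in))                    # shared base weight
adapters = {                                           # name -> (A, B)
    name: (rng.normal(scale=0.1, size=(r, d_in)),
           rng.normal(scale=0.1, size=(d_out, r)))
    for name in ("customer_a", "customer_b")
}

def batched_forward(X, adapter_ids):
    """X: (batch, d_in); adapter_ids: one adapter name (or None) per row."""
    Y = X @ W0.T                                       # one matmul for everyone
    for name in set(adapter_ids):
        if name is None:
            continue                                   # plain base-model request
        rows = [i for i, a in enumerate(adapter_ids) if a == name]
        A, B = adapters[name]
        Y[rows] += (alpha / r) * (X[rows] @ A.T @ B.T) # per-adapter low-rank term
    return Y

X = rng.normal(size=(3, d_in))
ids = ["customer_a", "customer_b", None]
Y = batched_forward(X, ids)

# Each row matches what a dedicated merged model would have produced.
A_a, B_a = adapters["customer_a"]
merged_a = W0 + (alpha / r) * (B_a @ A_a)
print(np.allclose(Y[0], merged_a @ X[0]), np.allclose(Y[2], W0 @ X[2]))
```

The expensive base weights are read once per batch regardless of how many adapters are in flight, which is why adding fine-tunes barely moves the VRAM bill.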

The economics of the merged path are the same memory math from Open Models · §2: footprint is bits-per-weight times parameters, decode is bandwidth-bound, KV cache grows with context and batch. A merged 4-bit fine-tune of a 7B model is the same ~3.5 GB and ~200 tok/s ceiling as its base — you changed what it says, not what it costs. For distribution, the GGUF you ship is the merged-then-quantized file; for an internal fleet serving many tasks, the multi-LoRA route keeps your VRAM bill flat as the number of fine-tunes grows.

One open-weights caveat worth stating plainly: licenses bind the fine-tune too. The base model's terms (and the license of any data you distilled from a teacher) flow through to your derivative. "Open weights" is not automatically "use however you like" — check the specific license before you ship a commercial product on top of a fine-tune.

# Open-weights fine-tune recipe that survives contact with reality
base:     strongest open instruct model that fits your serving budget
method:   QLoRA (NF4) · r=8–16 · alpha=2r · all linear layers · dropout 0.05
data:     hundreds–few-thousand curated examples; exact chat template;
          dedup; decontaminate vs evals; read 50 by hand
train:    lr 1e-4 · cosine · warmup 3% · 1–3 epochs · effective batch 32–128
eval:     target metric + forgetting probe + 50 manual reads / checkpoint;
          must beat a well-prompted base model
ship:     merge -> quantize -> GGUF for one model · or multi-LoRA for many

SFT teaches the model to imitate; the next chapter teaches it to be preferred. Training Techniques goes beyond supervised fine-tuning into the methods that shape behavior at a deeper level — preference optimization (DPO and friends), reward modeling, and the RL-style fine-tunes (GRPO over LoRA adapters) that increasingly ship small reasoning models on open weights.

3.R

References

Hu, E. J., Shen, Y., Wallis, P. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 — the low-rank weight update (EQ OM3.2) at the heart of practical open-model fine-tuning.
Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023 — 4-bit NF4 base weights plus LoRA adapters; fine-tuning a 65B model on a single 48 GB GPU.
Zhou, C., Liu, P., Xu, P. et al. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023 — 1,000 curated examples beat far larger noisy sets; the evidence behind "quality over quantity."
Aghajanyan, A., Zettlemoyer, L. & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021 — the empirical low-rank hypothesis that motivates LoRA.
Liu, S.-Y., Wang, C.-Y., Yin, H. et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024 — decouples magnitude from direction for a quality gain over vanilla LoRA.
Sheng, Y., Cao, S., Li, D. et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. MLSys 2024 — multi-tenant serving of many adapters over one shared base model (§3.5).
Hugging Face. PEFT: Parameter-Efficient Fine-Tuning — Documentation. Official library docs for LoRA/QLoRA/DoRA training and adapter management.