06 · Fine-tuning — LLM Field Manual

6.1

To tune or not to tune

Fine-tuning is the third tool to reach for, not the first. The escalation ladder:

Approach	Changes	Right when…	Wrong when…
Prompting	nothing	Instructions + few-shot examples suffice (they usually do)	Behavior must be deeply consistent or token budget matters at scale
RAG	context	The gap is knowledge — fresh, private, or vast	The gap is behavior, format, or skill
Fine-tuning	weights	Style, format, domain dialect, tool protocols, narrow skills; latency/cost via smaller specialized models	You're trying to inject facts (fragile, stale) or fix what a bigger model does out of the box

Full fine-tuning updates every weight — maximum capacity, but it costs training-grade memory (≈16 bytes/param with AdamW: a 7B model wants ~112 GB before activations), produces a full model copy per task, and courts catastrophic forgetting of general capability. Parameter-efficient fine-tuning (PEFT) exists to dodge all three.

Full fine-tuning with AdamW costs ≈16 bytes per parameter (weights + gradients + two optimizer moments, in mixed precision). How many GB does a 7B-parameter model need for those states, before activations?

$7\times10^9 \text{ params} \times 16 \text{ bytes} = 1.12\times10^{11}$ bytes $= $ 112 GB. This is why a 7B full fine-tune already overflows a single 80 GB card.

6.2

LoRA: the low-rank hypothesis

Low-Rank Adaptation rests on an empirical observation: the weight update a fine-tune needs has low intrinsic rank — the task lives in a tiny subspace of the full parameter space. So freeze $W_0$ and learn the update as a product of two thin matrices:

EQ 6.1 — LoRA $$ W \;=\; W_0 + \Delta W \;=\; W_0 + \frac{\alpha}{r}\, B A, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\; B \in \mathbb{R}^{d_{\text{out}} \times r},\; r \ll d $$

$A$ starts Gaussian, $B$ starts at zero — so training begins exactly at the pre-trained model and drifts smoothly away. $\alpha/r$ rescales so behavior is stable across ranks. Trainable parameters per matrix drop from $d_{\text{out}} d_{\text{in}}$ to $r(d_{\text{in}} + d_{\text{out}})$. After training, $BA$ can be merged into $W_0$ — zero inference overhead — or kept separate and hot-swapped, letting one server multiplex hundreds of LoRA “personalities” over a single base model.

A square projection has $d_{\text{in}} = d_{\text{out}} = 4096$. You attach a LoRA adapter of rank $r = 8$. How many trainable parameters does the adapter add ($A$ plus $B$)?

Trainable params $= r(d_{\text{in}} + d_{\text{out}}) = 2dr = 2 \times 4096 \times 8 = $ 65536.

For that same $d = 4096$, $r = 8$ adapter, what percent of the full $d^2$ update does it train? (Give the percent, e.g. enter 0.39 for 0.39%.)

Fraction $= \dfrac{2dr}{d^2} = \dfrac{2r}{d} = \dfrac{16}{4096} = 0.00390625$. As a percent: $\times 100 = $ 0.39%. Rank 8 trains under four-tenths of one percent of the matrix.

PYTHON · RUNNABLE IN-BROWSER

# LoRA algebra: full vs 2dr trainable params, merge check
import numpy as np
rng = np.random.default_rng(0)
d_in, d_out, r = 256, 256, 8

W0 = rng.normal(0, 0.02, (d_out, d_in))   # frozen base weight
A  = rng.normal(0, 0.02, (r, d_in))       # adapter A (starts gaussian)
B  = rng.normal(0, 0.02, (d_out, r))      # adapter B ("after training": nonzero)

full = d_out * d_in
lora = r * (d_in + d_out)
print(f"full fine-tune params : {full:,}")
print(f"LoRA params (2dr)     : {lora:,}")
print(f"trainable fraction    : {100*lora/full:.2f} %")

x = rng.normal(0, 1, (5, d_in))           # a batch of 5 activations
y_two_path = x @ W0.T + (x @ A.T) @ B.T   # frozen path + adapter path
W_merged   = W0 + B @ A                   # EQ 6.1 merged, alpha/r = 1
y_merged   = x @ W_merged.T
print("merged == two-path    :", np.allclose(y_two_path, y_merged))
print("max abs difference    :", float(np.abs(y_two_path - y_merged).max()))

edits are live — break it on purpose

INSTRUMENT 6.1 — LoRA PARAMETER COUNTERONE d×d PROJECTION · EQ 6.1

MODEL WIDTH d 8,192

RANK r 16

TRAINABLE FRACTION OF THE MATRIX

FULL ΔW PARAMS (d²)

—

LoRA PARAMS (2dr)

—

TRAINABLE %

—

At d = 8,192, rank 16 trains 0.39% of the matrix. Applied across a 70B model's attention + MLP projections, a typical r = 16 adapter is ~200–400 MB of bf16 — versus 140 GB for the model it steers.

Where to attach, what rank. Original practice targeted only $W_Q, W_V$; current default is all linear layers (Q, K, V, O, gate, up, down), which beats raising the rank at equal parameter count. Ranks 8–64 cover most tasks; style transfers sit low, new skills (code dialects, tool-calling formats) sit higher. rsLoRA fixes the scale to $\alpha/\sqrt{r}$ for stability at high rank; DoRA decomposes magnitude from direction for a small quality bump.

PYTHON · RUNNABLE IN-BROWSER

# Rank vs capacity: truncated SVD of a rank-16 target update
import numpy as np
rng = np.random.default_rng(0)
d = 256
U = rng.normal(0, 1, (d, 16)); V = rng.normal(0, 1, (16, d))
dW = U @ V / np.sqrt(d)                   # a "true" update of intrinsic rank 16

u, s, vt = np.linalg.svd(dW)
ranks, errs = [1, 4, 16, 64], []
for r in ranks:
    approx = (u[:, :r] * s[:r]) @ vt[:r]  # best rank-r fit (Eckart-Young)
    e = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
    errs.append(e)
    print(f"rank {r:3d}: relative Frobenius error {e:.4f}")

print("\nerror hits zero exactly at the target's intrinsic rank (16);")
print("rank 64 buys nothing. LoRA's bet is that real dW looks like this.")
plot_xy(ranks, errs)

edits are live — break it on purpose

6.3

QLoRA: fine-tuning on one GPU

QLoRA stacks three tricks so a 65–70B model fine-tunes on a single 48 GB card: (1) freeze the base weights in 4-bit NF4; (2) train bf16 LoRA adapters on top, dequantizing on the fly per matmul; (3) page optimizer states to CPU on memory spikes.

EQ 6.2 — NF4: QUANTILES OF A GAUSSIAN $$ q_i = \mathrm{Quantile}_{\mathcal{N}(0,1)}\!\left( \delta + \frac{i}{15}\,(1 - 2\delta) \right), \quad i = 0, \ldots, 15, \quad \delta \approx 0.03 \qquad\text{(then normalized to } [-1, 1]\text{)} $$

Trained weights are approximately Gaussian, so NF4 places its 16 levels at Gaussian quantiles — equal probability mass per bin, minimizing expected error where weights actually live (dense near zero, sparse at the tails). A second-order trick, double quantization, quantizes the per-block scale factors themselves, saving another ~0.4 bits/param. Full quantization theory: Chapter 07.

NF4 stores each weight in $b = 4$ bits. How many distinct quantization levels does that allow? ($2^b$.)

A $b$-bit code addresses $2^b$ values: $2^4 = $ 16 levels. NF4 places these 16 at Gaussian quantiles rather than on a uniform grid.

QLoRA freezes the base weights at 4 bits (0.5 bytes/param). How many GB do the frozen weights of a 70B model occupy?

$70\times10^9 \times 0.5 \text{ bytes} = 3.5\times10^{10}$ bytes $= $ 35 GB — small enough to sit on one 48 GB card with room left for bf16 adapters and activations.

Gradients flow through the frozen 4-bit weights into the adapters only. Quality matches 16-bit LoRA closely on instruction-tuning benchmarks — the canonical result that made serious fine-tuning a consumer-hardware activity.

INSTRUMENT 6.2 — WILL IT FINE-TUNE?VRAM ESTIMATE · SINGLE NODE

MODEL SIZE 8B params

METHOD

Rule-of-thumb totals (weights + optimizer + modest activations at batch 1, seq 2K). Full fine-tuning a 70B wants ~1.1 TB; QLoRA squeezes the same model under 48 GB. Vertical lines mark common cards.

6.4

The PEFT zoo, briefly

Method	What trains	Notes
LoRA / QLoRA	low-rank ΔW	The default. Mergeable, swappable, multi-tenant.
Adapters (serial)	small bottleneck MLPs inserted per block	The 2019 original; adds inference latency, now rare.
Prefix / prompt tuning	virtual KV prefixes or input embeddings	Tiny footprint; weaker on hard tasks; fully reversible.
(IA)³	per-channel rescaling vectors	Orders of magnitude fewer params than LoRA; niche.
BitFit	bias terms only	A useful lower bound on “how little is enough”.

Everything in Chapter 05 composes with PEFT: DPO-with-LoRA is the standard budget alignment stack, and GRPO over LoRA adapters is increasingly how small reasoning fine-tunes ship.

6.5

A practical recipe

# Defaults that survive contact with reality (7–70B, instruction-style task)
base:        strongest instruct model that fits serving budget
method:      QLoRA (NF4) · r=16 · α=32 · all linear layers · dropout 0.05
data:        500–50k examples; dedup; decontaminate against your evals;
             quality >> quantity — read 50 examples yourself
format:      exact chat template of the base model (silent killer #1)
lr:          1e-4 (LoRA) · cosine decay · warmup 3% · 1–3 epochs
batch:       effective 64–128 sequences via gradient accumulation
eval:        held-out task metric + a general benchmark (forgetting probe)
             + manual review of 50 outputs per checkpoint
ship:        merge adapter for latency · or serve multi-LoRA (S-LoRA/vLLM)

PITFALLS

The four classic failures: (1) wrong/mismatched chat template — model answers fine but formats garbage; (2) eval contamination — your test set leaked into training data and the numbers are fiction; (3) overfitting epoch 3+ — loss down, vibes down; (4) silent capability regression — always probe general skills, not just the target task.

Adaptation changes what a model says; compression changes what it costs. Chapter 07: distillation, the quantization stack from absmax to GPTQ/AWQ, and why bits-per-weight is the real unit of deployment.

§

Fine-tuning

To tune or not to tune

LoRA: the low-rank hypothesis

QLoRA: fine-tuning on one GPU

The PEFT zoo, briefly

A practical recipe

Further reading