AI // ENCYCLOPEDIA / VOL II / 06 / FINE-TUNING
CHAPTER 06 / 10

Fine-tuning

Adapting a pre-trained model to your task is mostly a question of which parameters you allow to move. This chapter covers the spectrum from full fine-tuning to parameter-efficient methods, with particular focus on LoRA, whose low-rank algebra lets a laptop-class GPU specialize a multi-billion-parameter model.

READING TIME ≈ 20 MIN · BUILDS ON CH 04–05 · INSTRUMENTS: LoRA RANK · VRAM FIT
6.1

To tune or not to tune

Fine-tuning is the third tool to reach for, not the first. The escalation ladder:

| Approach | Changes | Right when… | Wrong when… |
| --- | --- | --- | --- |
| Prompting | nothing | Instructions + few-shot examples suffice (they usually do) | Behavior must be deeply consistent or token budget matters at scale |
| RAG | context | The gap is knowledge: fresh, private, or vast | The gap is behavior, format, or skill |
| Fine-tuning | weights | Style, format, domain dialect, tool protocols, narrow skills; latency/cost via smaller specialized models | You're trying to inject facts (fragile, stale) or fix what a bigger model does out of the box |

Full fine-tuning updates every weight — maximum capacity, but it costs training-grade memory (≈16 bytes/param with AdamW: a 7B model wants ~112 GB before activations), produces a full model copy per task, and courts catastrophic forgetting of general capability. Parameter-efficient fine-tuning (PEFT) exists to dodge all three.

Full fine-tuning with AdamW costs ≈16 bytes per parameter (weights + gradients + two optimizer moments, in mixed precision). How many GB does a 7B-parameter model need for those states, before activations?
\(7\times10^9 \text{ params} \times 16 \text{ bytes} = 1.12\times10^{11}\) bytes \(= \) 112 GB. This is why a 7B full fine-tune already overflows a single 80 GB card.
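The 16 bytes/param figure is easier to trust once it is itemized. The sketch below uses one common mixed-precision accounting; the exact split (bf16 weights and gradients, an fp32 master copy, two fp32 Adam moments) is an assumption, since the text only names the components.

```python
# Rough per-parameter memory accounting for full fine-tuning with AdamW.
# ASSUMPTION: this particular split is one common mixed-precision convention,
# not something the chapter specifies.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_state_gb(n_params: float) -> float:
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: {full_finetune_state_gb(n):,.0f} GB before activations")
```

Other precision choices shift the per-parameter total, so treat 16 as a planning number rather than a law.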
6.2

LoRA: the low-rank hypothesis

Low-Rank Adaptation rests on an empirical observation: the weight update a fine-tune needs has low intrinsic rank — the task lives in a tiny subspace of the full parameter space. So freeze \(W_0\) and learn the update as a product of two thin matrices:

EQ 6.1 — LoRA $$ W \;=\; W_0 + \Delta W \;=\; W_0 + \frac{\alpha}{r}\, B A, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\; B \in \mathbb{R}^{d_{\text{out}} \times r},\; r \ll \min(d_{\text{in}}, d_{\text{out}}) $$
\(A\) starts Gaussian and \(B\) starts at zero, so \(\Delta W = 0\) at step 0: training begins exactly at the pre-trained model and drifts smoothly away. The factor \(\alpha/r\) rescales the update so that behavior stays roughly stable across ranks. Trainable parameters per matrix drop from \(d_{\text{out}} d_{\text{in}}\) to \(r(d_{\text{in}} + d_{\text{out}})\). After training, the scaled update \(\frac{\alpha}{r} BA\) can be merged into \(W_0\) for zero inference overhead, or kept separate and hot-swapped, letting one server multiplex hundreds of LoRA “personalities” over a single base model.
A square projection has \(d_{\text{in}} = d_{\text{out}} = 4096\). You attach a LoRA adapter of rank \(r = 8\). How many trainable parameters does the adapter add (\(A\) plus \(B\))?
Trainable params \(= r(d_{\text{in}} + d_{\text{out}}) = 2dr = 2 \times 4096 \times 8 = \) 65536.
For that same \(d = 4096\), \(r = 8\) adapter, what percent of the full \(d^2\) update does it train? (Give the percent, e.g. enter 0.39 for 0.39%.)
Fraction \(= \dfrac{2dr}{d^2} = \dfrac{2r}{d} = \dfrac{16}{4096} = 0.00390625\). As a percent: \(\times 100 = \) 0.39%. Rank 8 trains under four-tenths of one percent of the matrix.
PYTHON
# LoRA algebra: full vs 2dr trainable params, merge check
import numpy as np
rng = np.random.default_rng(0)
d_in, d_out, r = 256, 256, 8

W0 = rng.normal(0, 0.02, (d_out, d_in))   # frozen base weight
A  = rng.normal(0, 0.02, (r, d_in))       # adapter A (starts gaussian)
B  = rng.normal(0, 0.02, (d_out, r))      # adapter B ("after training": nonzero)

full = d_out * d_in
lora = r * (d_in + d_out)
print(f"full fine-tune params : {full:,}")
print(f"LoRA params (2dr)     : {lora:,}")
print(f"trainable fraction    : {100*lora/full:.2f} %")

x = rng.normal(0, 1, (5, d_in))           # a batch of 5 activations
y_two_path = x @ W0.T + (x @ A.T) @ B.T   # frozen path + adapter path
W_merged   = W0 + B @ A                   # EQ 6.1 merged, alpha/r = 1
y_merged   = x @ W_merged.T
print("merged == two-path    :", np.allclose(y_two_path, y_merged))
print("max abs difference    :", float(np.abs(y_two_path - y_merged).max()))
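The merge check above fixes \(\alpha/r = 1\) and starts from a non-zero \(B\). A minimal sketch of the other two claims in EQ 6.1, that zero-initialized \(B\) makes the layer start exactly at the base model and that the \(\alpha/r\) factor must be carried into the merge, might look like this. The class name `LoRALinear` and the init standard deviation are illustrative assumptions, not a library API.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W0, r, alpha, rng):
        d_out, d_in = W0.shape
        self.W0 = W0                                  # frozen base weight
        self.A = rng.normal(0.0, 0.02, (r, d_in))     # Gaussian init (std is an assumption)
        self.B = np.zeros((d_out, r))                 # zero init, so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus scaled adapter path; the d_out x d_in update is never built.
        return x @ self.W0.T + self.scale * ((x @ self.A.T) @ self.B.T)

    def merged_weight(self):
        # EQ 6.1: fold the scaled update into one matrix for inference.
        return self.W0 + self.scale * (self.B @ self.A)

rng = np.random.default_rng(0)
W0 = rng.normal(0.0, 0.02, (64, 64))
layer = LoRALinear(W0, r=8, alpha=16, rng=rng)
x = rng.normal(0.0, 1.0, (4, 64))

starts_at_base = np.allclose(layer.forward(x), x @ W0.T)
layer.B = rng.normal(0.0, 0.02, layer.B.shape)        # stand-in for a trained B
merge_matches = np.allclose(layer.forward(x), x @ layer.merged_weight().T)
print("starts at base model  :", starts_at_base)
print("merge matches two-path:", merge_matches)
```

Dropping `self.scale` from `merged_weight` is the classic way to break it on purpose: the merged model then disagrees with the one that was trained whenever \(\alpha \neq r\).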
INSTRUMENT 6.1 — LoRA PARAMETER COUNTER · ONE d×d PROJECTION · EQ 6.1
Readouts: full ΔW params (d²), LoRA params (2dr), trainable % of the matrix.
At d = 8,192, rank 16 trains 0.39% of the matrix. Applied across a 70B model's attention + MLP projections, a typical r = 16 adapter is ~200–400 MB of bf16 — versus 140 GB for the model it steers.
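The adapter-size estimate can be reproduced with a few lines of arithmetic. The layer count and projection shapes below are assumed Llama-style 70B dimensions chosen for illustration; the chapter does not specify an architecture.

```python
# ASSUMPTION: Llama-style 70B shapes (80 blocks, grouped-query attention),
# used only to make the arithmetic concrete.
N_LAYERS, D_MODEL, D_KV, D_FFN = 80, 8192, 1024, 28672
R = 16

# (d_in, d_out) for each adapted projection in one transformer block
PROJECTIONS = {
    "q": (D_MODEL, D_MODEL),
    "k": (D_MODEL, D_KV),
    "v": (D_MODEL, D_KV),
    "o": (D_MODEL, D_MODEL),
    "gate": (D_MODEL, D_FFN),
    "up": (D_MODEL, D_FFN),
    "down": (D_FFN, D_MODEL),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in PROJECTIONS.values())
adapter_params = N_LAYERS * per_layer
adapter_mb = adapter_params * 2 / 1e6     # bf16 = 2 bytes per parameter
print(f"adapter params: {adapter_params:,}")
print(f"adapter size  : {adapter_mb:.0f} MB in bf16")
```

Under these assumed shapes the adapter comes to roughly 0.2B parameters and about 0.4 GB, close to the upper end of the range quoted above; attaching to fewer projections moves it toward the lower end.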

Where to attach, what rank. Original practice targeted only \(W_Q, W_V\); current default is all linear layers (Q, K, V, O, gate, up, down), which beats raising the rank at equal parameter count. Ranks 8–64 cover most tasks; style transfers sit low, new skills (code dialects, tool-calling formats) sit higher. rsLoRA fixes the scale to \(\alpha/\sqrt{r}\) for stability at high rank; DoRA decomposes magnitude from direction for a small quality bump.
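To see why the scale matters at high rank, compare the two multipliers directly. This is only the scalar factor from EQ 6.1 and its rsLoRA variant; the function name is illustrative.

```python
import math

def lora_scale(alpha: float, r: int, rank_stabilized: bool = False) -> float:
    """Multiplier on B @ A: alpha / r (LoRA) or alpha / sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

alpha = 32
for r in (8, 16, 64, 256):
    print(f"r={r:3d}   LoRA {lora_scale(alpha, r):6.3f}   rsLoRA {lora_scale(alpha, r, True):6.3f}")
```

With \(\alpha\) held fixed, going from rank 8 to rank 256 shrinks the LoRA multiplier 32-fold but the rsLoRA multiplier only about 5.7-fold, so high-rank adapters keep a usable update magnitude.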

PYTHON
# Rank vs capacity: truncated SVD of a rank-16 target update
import numpy as np
rng = np.random.default_rng(0)
d = 256
U = rng.normal(0, 1, (d, 16)); V = rng.normal(0, 1, (16, d))
dW = U @ V / np.sqrt(d)                   # a "true" update of intrinsic rank 16

u, s, vt = np.linalg.svd(dW)
ranks, errs = [1, 4, 16, 64], []
for r in ranks:
    approx = (u[:, :r] * s[:r]) @ vt[:r]  # best rank-r fit (Eckart-Young)
    e = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
    errs.append(e)
    print(f"rank {r:3d}: relative Frobenius error {e:.4f}")

print("\nerror hits zero exactly at the target's intrinsic rank (16);")
print("rank 64 buys nothing. LoRA's bet is that real dW looks like this.")
if "plot_xy" in globals():                # plot_xy is supplied by the in-browser runtime
    plot_xy(ranks, errs)
6.3

QLoRA: fine-tuning on one GPU

QLoRA stacks three tricks so a 65–70B model fine-tunes on a single 48 GB card: (1) freeze the base weights in 4-bit NF4; (2) train bf16 LoRA adapters on top, dequantizing on the fly per matmul; (3) page optimizer states to CPU on memory spikes.

EQ 6.2 — NF4: QUANTILES OF A GAUSSIAN $$ q_i = \mathrm{Quantile}_{\mathcal{N}(0,1)}\!\left( \delta + \frac{i}{15}\,(1 - 2\delta) \right), \quad i = 0, \ldots, 15, \quad \delta \approx 0.03 \qquad\text{(then normalized to } [-1, 1]\text{)} $$
Trained weights are approximately Gaussian, so NF4 places its 16 levels at Gaussian quantiles: equal probability mass per bin, which spends resolution where weights actually live (dense near zero, sparse at the tails). EQ 6.2 is the symmetric idealization; the table used in practice is built asymmetrically so that zero is exactly representable. A second-order trick, double quantization, quantizes the per-block scale factors themselves, saving another ~0.4 bits/param. Full quantization theory: Chapter 07.
NF4 stores each weight in \(b = 4\) bits. How many distinct quantization levels does that allow? (\(2^b\).)
A \(b\)-bit code addresses \(2^b\) values: \(2^4 = \) 16 levels. NF4 places these 16 at Gaussian quantiles rather than on a uniform grid.
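EQ 6.2 is short enough to evaluate with the standard library. The sketch below follows the formula literally and then round-trips one block of weights through absmax scaling and nearest-level lookup. It is an illustration of the idea, not the exact level table or block layout shipped by any quantization library; the function names and the sample block are hypothetical.

```python
from statistics import NormalDist

def gaussian_quantile_levels(bits: int = 4, delta: float = 0.03) -> list[float]:
    """Levels from EQ 6.2: Gaussian quantiles at evenly spaced probabilities,
    rescaled so the outermost levels sit at -1 and +1."""
    n = 2 ** bits
    normal = NormalDist()                 # standard normal N(0, 1)
    raw = [normal.inv_cdf(delta + i / (n - 1) * (1 - 2 * delta)) for i in range(n)]
    peak = max(abs(q) for q in raw)
    return [q / peak for q in raw]

def quantize_block(weights: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Absmax-scale one block into [-1, 1], then store the nearest level's index."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
             for w in weights]
    return codes, scale

def dequantize_block(codes: list[int], scale: float, levels: list[float]) -> list[float]:
    return [levels[c] * scale for c in codes]

levels = gaussian_quantile_levels()
block = [0.01, -0.02, 0.005, 0.04, -0.015, 0.03]      # hypothetical weights
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print("levels:", [round(q, 3) for q in levels])
print("codes :", codes)
print("max abs round-trip error:", max_err)
```

Each weight costs a 4-bit code plus its share of one scale per block, which is why the block size and the precision of the scales show up in the final bits/param.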
QLoRA freezes the base weights at 4 bits (0.5 bytes/param). How many GB do the frozen weights of a 70B model occupy?
\(70\times10^9 \times 0.5 \text{ bytes} = 3.5\times10^{10}\) bytes \(= \) 35 GB — small enough to sit on one 48 GB card with room left for bf16 adapters and activations.

Gradients flow through the frozen 4-bit weights into the adapters only. Quality matches 16-bit LoRA closely on instruction-tuning benchmarks — the canonical result that made serious fine-tuning a consumer-hardware activity.

INSTRUMENT 6.2 — WILL IT FINE-TUNE? · VRAM ESTIMATE · SINGLE NODE
Rule-of-thumb totals (weights + optimizer + modest activations at batch 1, seq 2K). Full fine-tuning a 70B wants ~1.1 TB; QLoRA squeezes the same model under 48 GB. In the interactive chart, vertical lines mark common card sizes.
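A back-of-envelope version of this estimate, using only the per-parameter costs already quoted in the chapter, can be sketched as follows. It ignores activations, quantization scale constants, and framework overhead; the adapter size and the list of card capacities are hypothetical round numbers.

```python
# Rule-of-thumb state memory, excluding activations and framework overhead.
FULL_FT_BYTES_PER_PARAM = 16.0    # from the text: AdamW in mixed precision
NF4_BYTES_PER_PARAM = 0.5         # from the text: 4-bit frozen base weights

def full_ft_gb(n_params: float) -> float:
    return n_params * FULL_FT_BYTES_PER_PARAM / 1e9

def qlora_gb(n_base: float, n_adapter: float) -> float:
    # ASSUMPTION: adapters are costed like a small full fine-tune (16 bytes/param).
    return (n_base * NF4_BYTES_PER_PARAM + n_adapter * FULL_FT_BYTES_PER_PARAM) / 1e9

def smallest_card(gb_needed: float, cards=(24, 48, 80)):
    """Smallest card capacity (GB) that exceeds the estimate, or None if none does."""
    return next((c for c in cards if gb_needed < c), None)

n_base, n_adapter = 70e9, 0.2e9   # adapter size is a hypothetical round number
for label, gb in [("full fine-tune", full_ft_gb(n_base)),
                  ("QLoRA", qlora_gb(n_base, n_adapter))]:
    print(f"{label:15s} {gb:8.1f} GB -> smallest single card: {smallest_card(gb)}")
```

Whatever is left between the estimate and the card's capacity is the budget for activations, which grows with batch size and sequence length.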
6.4

The PEFT zoo, briefly

| Method | What trains | Notes |
| --- | --- | --- |
| LoRA / QLoRA | low-rank ΔW | The default. Mergeable, swappable, multi-tenant. |
| Adapters (serial) | small bottleneck MLPs inserted per block | The 2019 original; adds inference latency, now rare. |
| Prefix / prompt tuning | virtual KV prefixes or input embeddings | Tiny footprint; weaker on hard tasks; fully reversible. |
| (IA)³ | per-channel rescaling vectors | Orders of magnitude fewer params than LoRA; niche. |
| BitFit | bias terms only | A useful lower bound on “how little is enough”. |

Everything in Chapter 05 composes with PEFT: DPO-with-LoRA is the standard budget alignment stack, and GRPO over LoRA adapters is increasingly how small reasoning fine-tunes ship.

6.5

A practical recipe

# Defaults that survive contact with reality (7–70B, instruction-style task)
base:        strongest instruct model that fits serving budget
method:      QLoRA (NF4) · r=16 · α=32 · all linear layers · dropout 0.05
data:        500–50k examples; dedup; decontaminate against your evals;
             quality >> quantity — read 50 examples yourself
format:      exact chat template of the base model (silent killer #1)
lr:          1e-4 (LoRA) · cosine decay · warmup 3% · 1–3 epochs
batch:       effective 64–128 sequences via gradient accumulation
eval:        held-out task metric + a general benchmark (forgetting probe)
             + manual review of 50 outputs per checkpoint
ship:        merge adapter for latency · or serve multi-LoRA (S-LoRA/vLLM)
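Two lines of the recipe are pure arithmetic and easy to get subtly wrong: the warmup-plus-cosine schedule and the gradient-accumulation count. A minimal sketch under stated assumptions (linear warmup, decay to zero, a hypothetical 10k-example run with micro-batch 4) follows; training frameworks ship their own schedulers, and this does not mirror any of them exactly.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4,
          warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def accumulation_steps(effective_batch: int, micro_batch: int, n_gpus: int = 1) -> int:
    """Gradient-accumulation steps needed to reach the effective batch size."""
    return math.ceil(effective_batch / (micro_batch * n_gpus))

n_examples, epochs, effective_batch = 10_000, 2, 64    # hypothetical run
total_steps = math.ceil(n_examples / effective_batch) * epochs
print("optimizer steps:", total_steps)
print("accumulation   :", accumulation_steps(effective_batch, micro_batch=4))
for s in (0, 8, total_steps // 2, total_steps - 1):
    print(f"step {s:4d}: lr = {lr_at(s, total_steps):.2e}")
```

At small dataset sizes the 3% warmup is only a handful of optimizer steps, which is worth checking before blaming the learning rate for an unstable start.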
PITFALLS

The four classic failures: (1) wrong/mismatched chat template — model answers fine but formats garbage; (2) eval contamination — your test set leaked into training data and the numbers are fiction; (3) overfitting epoch 3+ — loss down, vibes down; (4) silent capability regression — always probe general skills, not just the target task.
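Failure (2) is the one most amenable to an automated check. One simple approach, sketched here, flags any training example that shares a long word n-gram with the eval set. The choice of n = 8 and the normalization are assumptions to tune per dataset, and the sample strings are made up.

```python
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word n-grams after lowercasing and stripping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_texts: list[str], eval_texts: list[str], n: int = 8) -> list[int]:
    """Indices of training examples that share any n-gram with the eval set."""
    eval_grams: set[tuple[str, ...]] = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [i for i, text in enumerate(train_texts) if ngrams(text, n) & eval_grams]

train = [
    "Translate the following sentence into French: the cat sat on the mat.",
    "Summarize the quarterly report in three bullet points for the board.",
]
evals = ["Q: translate the following sentence into French: the cat sat on the mat"]
print("drop from training:", contaminated(train, evals))
```

Exact n-gram overlap misses paraphrased leaks, so it complements the manual read of examples rather than replacing it.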

NEXT

Adaptation changes what a model says; compression changes what it costs. Chapter 07: distillation, the quantization stack from absmax to GPTQ/AWQ, and why bits-per-weight is the real unit of deployment.

§

Further reading

  • Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. — the low-rank update at the heart of this chapter.
  • Dettmers, Pagnoni, Holtzman & Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs. — 4-bit NF4 base weights plus LoRA, fine-tuning on a single GPU.
  • Houlsby et al. (2019). Parameter-Efficient Transfer Learning for NLP. — adapter modules, the ancestor of the PEFT family.
  • Li & Liang (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. — prompt/prefix tuning, a complementary PEFT branch.
  • Lester, Al-Rfou & Constant (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. — soft prompts become competitive at scale.
  • Aghajanyan, Zettlemoyer & Gupta (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. — the empirical basis for the low-rank hypothesis.