Why bits are speed
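During autoregressive decoding at small batch, each new token requires reading all model weights from HBM once. The arithmetic is trivial relative to the data movement, so decode speed is bounded by memory bandwidth divided by the bytes of weights read per token. Halving the bits per weight roughly doubles that ceiling.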
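A rough sketch of the resulting ceiling on decode speed (memory bandwidth divided by the bytes of weights read per token) is shown here. The parameter count and bandwidth are illustrative assumptions rather than measurements, the function name `decode_tokens_per_sec` is hypothetical, and the estimate ignores KV-cache reads, activations, and kernel overheads.

```python
def decode_tokens_per_sec(n_params, bits_per_weight, hbm_bytes_per_sec):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = n_params * bits_per_weight / 8  # every weight is read once per token
    return hbm_bytes_per_sec / bytes_per_token

N_PARAMS = 8e9     # assumed: an 8B-parameter model
BANDWIDTH = 1e12   # assumed: 1 TB/s of memory bandwidth

for bits in (16, 8, 4):
    tps = decode_tokens_per_sec(N_PARAMS, bits, BANDWIDTH)
    print(f"{bits:2d}-bit weights: ~{tps:.1f} tokens/s upper bound")
```

Under these assumptions the ceiling scales inversely with bit width: going from 16-bit to 4-bit weights raises it by 4×.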
Distillation: small model, big teacher
Knowledge distillation trains a small student to match a large teacher's output distribution rather than the one-hot data labels. The classic loss blends soft and hard targets, with a temperature that exposes the teacher's “dark knowledge” — the relative probabilities of wrong answers:
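Following Hinton, Vinyals & Dean (2015), one common way to write that blend is given here. The symbols are assumptions chosen for this sketch: \(z^{(t)}\) and \(z^{(s)}\) are teacher and student logits, \(\tau\) is the temperature, \(\alpha\) is a mixing weight, and \(y\) is the one-hot label; the chapter's EQ 7.2 may use different notation.

\[
p^{(\tau)}_i(z) = \frac{\exp(z_i/\tau)}{\sum_j \exp(z_j/\tau)}, \qquad
\mathcal{L} = \alpha\,\tau^{2}\,\mathrm{KL}\!\left(p^{(\tau)}(z^{(t)}) \,\big\|\, p^{(\tau)}(z^{(s)})\right) + (1-\alpha)\,\mathrm{CE}\!\left(y,\; p^{(1)}(z^{(s)})\right)
\]

The \(\tau^{2}\) factor compensates for soft-target gradients shrinking as \(1/\tau^{2}\), so the balance between the two terms stays roughly stable when the temperature changes.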
# Dark knowledge: teacher logits softened at temperature tau
import numpy as np
classes = ["cat", "kitten", "lynx", "dog", "loaf", "carburetor"]
z = np.array([9.0, 6.5, 4.0, 2.5, 1.0, -4.0])  # teacher logits, cat photo
taus, ents = [1, 2, 5, 10], []
for tau in taus:
    p = np.exp(z / tau); p /= p.sum()  # EQ 7.2's softened softmax
    H = float(-np.sum(p * np.log2(p)))  # entropy of the softened target, in bits
    ents.append(H)
    row = " ".join(f"{c} {q:.3f}" for c, q in zip(classes, p))
    print(f"tau={tau:2d} H={H:.3f} bits | {row}")
print("\nat tau=1 the target is ~one-hot: a glorified label. by tau=5")
print("the ranking over WRONG answers (kitten >> carburetor) is visible --")
print("that structure is the extra signal the student trains on.")
# plot_xy is not defined in this snippet; it is assumed to be a plotting
# helper supplied by the page's runtime (entropy vs. temperature).
plot_xy(taus, ents)
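To connect the softened targets to a training signal, here is a minimal sketch of a blended soft/hard loss in the style of Hinton et al. (2015). The function name `distill_loss`, the mixing weight `alpha`, and the student logits are hypothetical choices for illustration; the teacher logits are the ones used above.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label, tau=2.0, alpha=0.5):
    p_t = softmax(teacher_logits, tau)   # softened teacher distribution
    p_s = softmax(student_logits, tau)   # softened student distribution
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # KL(teacher || student)
    hard = -np.log(softmax(student_logits)[label])     # cross-entropy on the true label
    return float(alpha * tau**2 * soft + (1 - alpha) * hard)

teacher = np.array([9.0, 6.5, 4.0, 2.5, 1.0, -4.0])   # teacher logits from the cat example
student = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])    # hypothetical: right label, no structure

print("student loss:          ", distill_loss(student, teacher, label=0))
print("student == teacher loss:", distill_loss(teacher, teacher, label=0))
```

The hypothetical student already picks the right class, so its hard-label term is small; most of its loss comes from the soft term, which penalizes it for treating "kitten" and "carburetor" as equally wrong.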
The three production flavors
- Logit/soft-label distillation (EQ 7.2): needs teacher logits — natural when you own the teacher (Gemini Flash from larger Gemini, Claude Haiku-class models, Llama-3.2-1B/3B from 8B/70B).
- Sequence-level / hard distillation: generate outputs from the teacher, SFT the student on them. All you need is API access — this is how DeepSeek-R1's reasoning was poured into Qwen/Llama students, and what most “distilled” open models mean.
- On-policy distillation (GKD-style): the student generates, the teacher grades/corrects each token (reverse-KL on student samples). Fixes exposure bias — the student gets feedback on its own mistakes, not just on teacher-perfect prefixes — and is rapidly becoming the default for reasoning transfer.
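The reverse KL named in the last bullet behaves differently from the forward KL used in soft-label distillation. The sketch below uses made-up next-token distributions to show the asymmetry; it illustrates only the per-token objective, not the sampling loop of a full on-policy method, and the helper `kl` is a hypothetical name.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for two strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([0.49, 0.49, 0.02])        # hypothetical: two good continuations, one bad
student_mode = np.array([0.96, 0.02, 0.02])   # commits to one good continuation
student_flat = np.array([0.34, 0.33, 0.33])   # spreads mass, including onto the bad token

for name, s in [("commits to one mode", student_mode), ("spreads out", student_flat)]:
    print(f"{name:20s} forward KL(t||s)={kl(teacher, s):.3f}  reverse KL(s||t)={kl(s, teacher):.3f}")
```

Forward KL prefers the spread-out student because it covers both teacher modes; reverse KL prefers the committed student because it puts little mass where the teacher puts little. That preference is one reason reverse KL suits a small student that cannot represent everything the teacher does.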
The frontier ladder. Standard industry economics: train one expensive flagship, then distill a family (pro/flash/nano) for the latency-cost curve. Capability flows downhill from each frontier generation into models 10–100× cheaper within months.
Quantization fundamentals
Quantization maps continuous weights onto a small grid of representable values. The workhorse is uniform affine quantization; for weights, the symmetric (zero-point-free) form:
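Written out to match the absmax scheme in the code that follows, with \(b\) the bit width, \(s\) the scale, and \(\hat{w}\) the dequantized weight (the symbol names are assumptions made here):

\[
s = \frac{\max_i |w_i|}{2^{\,b-1}-1}, \qquad
q_i = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w_i}{s}\right),\; -2^{\,b-1},\; 2^{\,b-1}-1\right), \qquad
\hat{w}_i = s\,q_i
\]

For \(b = 4\) this gives 16 integer levels from \(-8\) to \(7\), with the largest-magnitude weight in the tensor (or group) mapped to \(\pm 7\).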
# Absmax INT-k roundtrip: one global scale vs groups of 64
import numpy as np
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 10_000)
w[rng.random(10_000) < 0.004] *= 6  # rare outliers, like real layers
def rmse(w, bits, group):
    qmax = 2**(bits - 1) - 1
    out = np.empty_like(w)
    for i in range(0, len(w), group):
        blk = w[i:i+group]
        s = np.abs(blk).max() / qmax  # EQ 7.3's scale, per group
        out[i:i+group] = np.clip(np.round(blk / s), -qmax - 1, qmax) * s
    return np.sqrt(np.mean((out - w)**2))
print("bits | one global scale | group-wise (g=64)")
for bits in [8, 4, 3, 2]:
    print(f" {bits} | {rmse(w, bits, len(w)):.6f} | {rmse(w, bits, 64):.6f}")
print("\nthe outliers stretch the single scale s and crush the gaussian")
print("bulk onto a few levels; per-group scales quarantine them -- the")
print("whole reason GGUF/GPTQ ship a scale every 64-128 weights.")
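Per-group scales are not free, since each one adds storage. Here is a quick sketch of the bookkeeping, assuming one 16-bit scale per group and no zero-point; real formats pack their scales in their own ways, so treat the numbers as rough. The function name `effective_bits` is hypothetical.

```python
def effective_bits(weight_bits, group_size, scale_bits=16):
    """Average storage per weight when each group carries one scale."""
    return weight_bits + scale_bits / group_size

for g in (32, 64, 128):
    print(f"4-bit weights, group={g:3d}: {effective_bits(4, g):.3f} bits/weight")
```

Smaller groups isolate outliers better but cost more bits per weight, which is why "4-bit" formats usually land somewhat above 4 bits in practice.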
The outlier problem. LLM weight matrices are friendly (near-Gaussian) but activations are not: past ~6B parameters, a few hidden channels carry systematically huge magnitudes. Naïve W8A8 quantization breaks on them — the discovery (LLM.int8) that shaped every method since.
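A toy version of the failure, and of the mixed-precision workaround, is sketched below. The activations are synthetic, the outlier channel is planted by hand, and the magnitude threshold of 6.0 is an assumption for this example; the split shown is only in the spirit of LLM.int8(), not its implementation.

```python
import numpy as np

def absmax_int8(x):
    """Symmetric INT8 quantize-dequantize with a single scale for the whole array."""
    s = np.abs(x).max() / 127
    return np.clip(np.round(x / s), -128, 127) * s

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (16, 512))   # hypothetical activations: 16 tokens, 512 channels
x[:, 7] *= 60                     # one channel with systematically huge values

naive = absmax_int8(x)            # one scale for everything, stretched by channel 7

outlier = np.abs(x).max(axis=0) > 6.0                 # flag outlier channels (assumed threshold)
mixed = x.copy()                                      # outlier channels stay in high precision
mixed[:, ~outlier] = absmax_int8(x[:, ~outlier])      # everything else goes to INT8

normal = ~outlier
rmse_naive = rmse(naive[:, normal], x[:, normal])
rmse_mixed = rmse(mixed[:, normal], x[:, normal])
print(f"outlier channels found: {np.flatnonzero(outlier).tolist()}")
print(f"RMSE on normal channels, one scale:   {rmse_naive:.4f}")
print(f"RMSE on normal channels, mixed split: {rmse_mixed:.4f}")
```

With a single scale, the planted channel sets the step size and most ordinary activations round to zero; removing that one channel from the INT8 path restores a fine grid for the rest.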
Post-training quantization: the methods that matter
PTQ compresses a finished model with a small calibration set and no (or minimal) retraining — minutes to hours, and the way virtually every deployed quantized LLM is made.
GPTQ — error-correcting rounding
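The idea named in the heading can be sketched on a single weight row: round one weight at a time, then shift the not-yet-rounded weights to absorb the error, guided by the inverse Hessian of the layer's reconstruction loss. What follows is a simplified loop under stated assumptions (an integer grid with scale 1, a hand-picked 2×2 Hessian `H`, and no blocking, damping, or Cholesky reformulation); the function names are hypothetical and this is not the reference GPTQ implementation.

```python
import numpy as np

def quantize_error_compensated(w, H, scale=1.0):
    """Round weights left to right, pushing each rounding error onto later weights."""
    w = np.asarray(w, dtype=float).copy()
    Hinv = np.linalg.inv(H)
    q = np.zeros_like(w)
    for i in range(len(w)):
        q[i] = np.round(w[i] / scale) * scale
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]      # compensate the remaining weights
        # Drop coordinate i from the inverse Hessian for the remaining subproblem.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return q

H = np.array([[1.0, 0.9], [0.9, 1.0]])   # assumed toy Hessian (like X^T X for correlated inputs)
w = np.array([0.4, 0.4])

def output_error(w_hat):
    e = w - w_hat
    return float(e @ H @ e)              # squared output error implied by H

q_rtn = np.round(w)                      # plain round-to-nearest
q_ec = quantize_error_compensated(w, H)
print(f"round-to-nearest:  {q_rtn.tolist()}  error={output_error(q_rtn):.3f}")
print(f"error-compensated: {q_ec.tolist()}  error={output_error(q_ec):.3f}")
```

On this toy input, plain rounding sends both weights to 0 (error 0.608), while the compensated loop rounds the second weight up to 1 to cancel the first weight's error (error 0.088). Each weight individually is rounded "worse", but the layer output is closer.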
AWQ — protect what activations say matters
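A minimal sketch of the idea in the heading: scale up the weight rows that feed large-activation input channels before quantizing, and fold the inverse scale into the activations so the full-precision output is unchanged. The data is synthetic, the exponent grid and the per-column INT4 quantizer are assumptions for this example, and real AWQ adds details (group-wise quantization, scale normalization, clipping search) that are omitted here.

```python
import numpy as np

def quant_int4_per_column(W):
    """Symmetric absmax INT4 quantize-dequantize, one scale per output column."""
    qmax = 7
    s = np.abs(W).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(W / s), -qmax - 1, qmax) * s

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (256, 64))      # hypothetical calibration activations
X[:, :2] *= 30                       # two input channels the activations mark as salient
W = rng.normal(0, 0.02, (64, 32))    # hypothetical weights, shape (in_features, out_features)
ref = X @ W                          # full-precision layer output

def output_mse(alpha):
    s = np.abs(X).mean(axis=0) ** alpha               # per-input-channel scale from activation size
    W_q = quant_int4_per_column(W * s[:, None])       # scale salient rows up, then quantize
    return float(np.mean(((X / s) @ W_q - ref) ** 2)) # inverse scale folded into activations

alphas = np.linspace(0.0, 1.0, 11)   # alpha = 0 means no scaling, i.e. plain rounding
errors = [output_mse(a) for a in alphas]
best = int(np.argmin(errors))
print(f"alpha=0.0 (plain rounding): mse={errors[0]:.6f}")
print(f"best alpha={alphas[best]:.1f}:            mse={errors[best]:.6f}")
```

Because the search includes `alpha = 0`, the chosen scaling can never do worse than plain rounding on the calibration data; how much it helps depends on how concentrated the activation magnitudes are.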
The format landscape
| Format / method | Bits | Quality cost | Where you meet it |
|---|---|---|---|
| BF16 (reference) | 16 | — | Training output, quality baseline |
| FP8 (E4M3) | 8 | ≈ none | Datacenter serving on Hopper/Blackwell; weights + activations + KV |
| INT8 (SmoothQuant / LLM.int8) | 8 | negligible | Older datacenter GPUs, CPUs |
| INT4 group-wise (GPTQ / AWQ / GGUF Q4_K) | ~4.2–4.6 | small, task-dependent | The local-inference default (llama.cpp, Ollama) |
| NF4 (QLoRA) | ~4.1 | small | Fine-tuning base weights (CH 06) |
| MXFP4 / NVFP4 | 4 + micro-scales | small | Blackwell-native block-scaled FP4; GPT-OSS ships in it |
| ~2-bit (AQLM / QuIP#, vector quant) | 2–2.5 | visible | Research edge; rotations + codebooks |
KV-cache quantization (FP8/INT4 keys and values) composes with all of the above and directly multiplies serving concurrency — revisit Instrument 03.
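The concurrency claim can be checked with simple arithmetic. The model shape and memory budget below are assumptions picked to give round numbers, the function name `kv_cache_bytes` is hypothetical, and storage for quantization scales is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits):
    """Bytes of keys + values cached for one sequence (scales ignored)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len   # 2 = keys and values
    return elements * bits // 8

cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)  # assumed model shape
budget = 16 * 2**30                                                # assumed: 16 GiB for KV cache

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    per_seq = kv_cache_bytes(**cfg, bits=bits)
    print(f"{name}: {per_seq / 2**30:.2f} GiB per sequence, "
          f"{budget // per_seq} concurrent sequences")
```

Under these assumptions an 8K-token sequence costs 1 GiB at FP16, so the same budget holds 16 sequences at FP16, 32 at FP8, and 64 at INT4.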
Quantization-aware training
When PTQ's accuracy floor isn't enough (extreme bit widths, or shipping a flagship at FP4), train with quantization in the loop. The forward pass uses fake-quantized weights; the backward pass pretends rounding didn't happen and passes gradients straight through to the full-precision weights (the straight-through estimator).
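A minimal NumPy sketch of that training loop on a toy linear regression follows. The data, learning rate, and 4-bit absmax quantizer are assumptions for illustration; a real setup would use a framework's autograd and handle gradient clipping at the range boundaries.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantize-dequantize: values snap to the integer grid but stay in float."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return np.clip(np.round(w / s), -qmax - 1, qmax) * s

def loss(w_q, X, y):
    return float(np.mean((X @ w_q - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (512, 16))     # hypothetical inputs
w_true = rng.normal(0, 1, 16)       # hypothetical target weights
y = X @ w_true

w = np.full(16, 0.01)               # latent full-precision weights
initial_loss = loss(fake_quant(w), X, y)
lr = 0.05
for step in range(200):
    w_q = fake_quant(w)                           # forward pass sees quantized weights
    grad_wq = 2 * X.T @ (X @ w_q - y) / len(X)    # gradient of the loss w.r.t. w_q
    w -= lr * grad_wq                             # straight-through: treat d(w_q)/dw as 1

final_loss = loss(fake_quant(w), X, y)
print(f"loss with quantized weights: {initial_loss:.3f} -> {final_loss:.3f}")
```

The latent weights `w` accumulate small gradient updates that would be lost if applied to the grid values directly; only their fake-quantized version is ever used in the forward pass, so the model learns weights that work after rounding.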
Pruning & structured sparsity
Pruning zeroes connections outright. Magnitude pruning (drop the smallest \(|w|\)) needs no data; Wanda ranks by \(|w| \cdot \|x\|\) (weight × typical input magnitude) and prunes LLMs to ~50% unstructured sparsity with little loss and no retraining; SparseGPT runs a GPTQ-style reconstruction.
- Unstructured sparsity is hard to monetize — random zeros don't speed up dense matmul units.
- 2:4 semi-structured (two zeros in every four weights) is the exception: NVIDIA tensor cores execute it at up to 2× — the one sparsity pattern with first-class hardware.
- Structural pruning + heal: remove whole layers/heads/width, then distill briefly to recover (Minitron-style: 15B → 8B → 4B families at a fraction of from-scratch cost).
- MoE as “learned sparsity”: the most successful sparsity story of all is architectural — activate only the experts you need (Chapter 09).
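The scoring rules and the 2:4 pattern described above can be sketched together. The weights and calibration inputs are synthetic, the large-activation channel is planted by hand, and `mask_2_of_4` is a hypothetical helper; the Wanda score here is \(|w| \cdot \|x\|\) with the norm taken per input channel.

```python
import numpy as np

def mask_2_of_4(score):
    """Keep the 2 highest-scoring weights in every group of 4 along the input axis."""
    out_dim, in_dim = score.shape
    groups = score.reshape(out_dim, in_dim // 4, 4)
    keep = np.argsort(groups, axis=-1)[..., 2:]       # indices of the top 2 in each group
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (8, 16))     # hypothetical weights, shape (out_features, in_features)
X = rng.normal(0, 1, (128, 16))      # hypothetical calibration inputs
X[:, 3] *= 20                        # one input channel with large activations

scores = {
    "magnitude": np.abs(W),                                # needs no data
    "wanda": np.abs(W) * np.linalg.norm(X, axis=0),        # weight x input-channel norm
}
masks = {name: mask_2_of_4(score) for name, score in scores.items()}
for name, mask in masks.items():
    err = float(np.mean((X @ (W * mask).T - X @ W.T) ** 2))
    print(f"{name:9s} 2:4 mask: sparsity={1 - mask.mean():.2f}, output mse={err:.6f}")
```

Both masks zero exactly two of every four weights, so either could run on 2:4 hardware; they differ only in which weights survive. The activation-aware score is biased toward keeping weights on the loud input channel even when those weights are small.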
The model is trained, aligned, adapted, and shrunk. Chapter 08: what actually happens when a request arrives — prefill, decode, batching, paging, speculation, and the serving stack that turns weights into a product.
Further reading
- Hinton, Vinyals & Dean (2015). Distilling the Knowledge in a Neural Network. — the soft-target distillation objective.
- Frantar, Ashkboos, Hoefler & Alistarh (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. — one-shot weight quantization to 3–4 bits.
- Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. — protecting salient weights by activation scale.
- Dettmers, Lewis, Belkada & Zettlemoyer (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. — outlier-aware int8, and why naive quantization breaks at scale.
- Jacob et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. — the foundations of quantization-aware training.
- Frantar & Alistarh (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. — one-shot pruning, the sparsity counterpart to GPTQ.