07 · Compression — LLM Field Manual

7.1

Why bits are speed

During autoregressive decoding at small batch, each new token requires reading all model weights from HBM once. The arithmetic is trivial relative to the data movement, so:

EQ 7.1 — THE DECODE SPEED-OF-LIGHT $$ \text{tokens/s per sequence} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{bytes per parameter} \times N_{\text{active}}} $$

An H100 delivers 3.35 TB/s. A 70B dense model in FP16 is 140 GB (already more than one 80 GB card holds) and streams at best ~24 tok/s; in INT4 (35 GB) it fits on one card with a ceiling of ~96 tok/s. Halve the bits and the ceiling doubles, and the HBM the weights vacate becomes KV-cache room: quantization is the rare optimization that pays twice. ($N_{\text{active}}$ matters: MoE models read only the routed experts; see Chapter 09.)

An H100 has $3.35\times10^{12}$ B/s of bandwidth. Decoding a $70\text{B}$ dense model in INT4 (0.5 bytes/param), what is the single-stream tokens/s ceiling? $\;\text{tok/s} \approx \dfrac{\text{BW}}{\text{bytes}\times N}$.

Bytes streamed per token $= 0.5 \times 70\times10^9 = 3.5\times10^{10}$. Ceiling $= \dfrac{3.35\times10^{12}}{3.5\times10^{10}} = $ 95.7 tok/s. (In FP16 the same model reaches only ~24: cutting the bits to a quarter quadruples the ceiling.)
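EQ 7.1 is small enough to keep around as a helper. The sketch below is one way to write it; the name `decode_ceiling` and its arguments are illustrative, and it counts weight traffic only (no KV-cache reads, activations, or kernel overhead), so measured throughput should sit below these figures.

```python
def decode_ceiling(bandwidth_bytes_per_s, bytes_per_param, n_active_params):
    """Upper bound on single-stream tokens/s from EQ 7.1 (weights-only traffic)."""
    bytes_per_token = bytes_per_param * n_active_params
    return bandwidth_bytes_per_s / bytes_per_token

H100_BW = 3.35e12   # bytes/s, the figure used in this section
N_70B = 70e9        # dense model: every parameter is read once per token

ceilings = {
    name: decode_ceiling(H100_BW, bytes_per_param, N_70B)
    for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]
}
for name, tok_s in ceilings.items():
    print(f"{name:5s} ceiling ~{tok_s:5.1f} tok/s")
```

For an MoE model, pass only the parameters actually read per token as `n_active_params`.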

7.2

Distillation: small model, big teacher

Knowledge distillation trains a small student to match a large teacher's output distribution rather than the one-hot data labels. The classic loss blends soft and hard targets, with a temperature that exposes the teacher's “dark knowledge” — the relative probabilities of wrong answers:

EQ 7.2 — DISTILLATION LOSS (HINTON 2015) $$ \mathcal{L}_{\text{KD}} = (1-\lambda)\, \mathcal{L}_{\text{CE}}(y, p_S) \;+\; \lambda\, \tau^2\, \mathrm{KL}\!\Big( p_T^{(\tau)} \,\Big\|\, p_S^{(\tau)} \Big), \qquad p^{(\tau)} = \mathrm{softmax}(z / \tau) $$

A full distribution per token is a vastly richer signal than a single label — “the next token is cat, but kitten was nearly as good and carburetor was absurd” — which is why students train far more sample-efficiently than from raw text.

The distillation loss scales its soft-target KL term by $\tau^2$ (to keep gradient magnitudes comparable to the hard-label term). If you distill at temperature $\tau = 4$, by what factor is that KL term multiplied?

The prefactor is $\tau^2 = 4^2 = $ 16. Without it, softening the targets (which shrinks every gradient by roughly $1/\tau^2$) would silently down-weight the teacher signal.
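To make EQ 7.2 concrete, here is a minimal NumPy sketch of the loss for a single token position. The function names, the student logits, and the defaults $\tau = 4$, $\lambda = 0.5$ are assumptions chosen for illustration, not values prescribed by the text.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature softmax, p = softmax(z / tau), computed stably."""
    z = np.asarray(z, dtype=np.float64) / tau
    z = z - z.max()                      # subtract the max before exponentiating
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, tau=4.0, lam=0.5):
    """EQ 7.2 for one token position; returns (total, hard-label CE, soft KL)."""
    ce = -np.log(softmax(student_logits)[label])          # hard target, tau = 1
    p_t = softmax(teacher_logits, tau)                    # softened teacher
    p_s = softmax(student_logits, tau)                    # softened student
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) # KL(p_T || p_S)
    total = (1 - lam) * ce + lam * tau**2 * kl
    return float(total), float(ce), kl

teacher = np.array([9.0, 6.5, 4.0, 2.5, 1.0, -4.0])   # logits from the example in this section
student = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])    # hypothetical untrained student
total, ce, kl = kd_loss(student, teacher, label=0)
print(f"CE={ce:.4f}  KL={kl:.4f}  total={total:.4f}")
```

Setting `lam=0` recovers plain cross-entropy training, and the KL term vanishes when the student's logits match the teacher's.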

PYTHON · RUNNABLE IN-BROWSER

# Dark knowledge: teacher logits softened at temperature tau
import numpy as np
classes = ["cat", "kitten", "lynx", "dog", "loaf", "carburetor"]
z = np.array([9.0, 6.5, 4.0, 2.5, 1.0, -4.0])   # teacher logits, cat photo

taus, ents = [1, 2, 5, 10], []
for tau in taus:
    p = np.exp(z / tau); p /= p.sum()           # EQ 7.2's softened softmax
    H = float(-np.sum(p * np.log2(p)))
    ents.append(H)
    row = "  ".join(f"{c} {q:.3f}" for c, q in zip(classes, p))
    print(f"tau={tau:2d}  H={H:.3f} bits | {row}")

print("\nat tau=1 the target is ~one-hot: a glorified label. by tau=5")
print("the ranking over WRONG answers (kitten >> carburetor) is visible --")
print("that structure is the extra signal the student trains on.")
plot_xy(taus, ents)   # plot_xy: plotting helper supplied by the in-browser runtime, not part of NumPy

INSTRUMENT 7.1 — DARK KNOWLEDGE · TEACHER SOFTMAX AT TEMPERATURE τ

[Interactive instrument: a distillation-temperature control (τ, initially 1.0) with a readout of the entropy of the soft targets.]

A teacher classifying an image of a cat. At τ = 1 the target is nearly one-hot — barely more informative than a label. Raise τ and the structure appears: kitten ≈ cat, lynx plausible, loaf-of-bread amusingly possible, carburetor absurd. That ranking over wrong answers is what the student actually learns from.

The three production flavors

Logit/soft-label distillation (EQ 7.2): needs teacher logits — natural when you own the teacher (Gemini Flash from larger Gemini, Claude Haiku-class models, Llama-3.2-1B/3B from 8B/70B).
Sequence-level / hard distillation: generate outputs from the teacher, SFT the student on them. All you need is API access — this is how DeepSeek-R1's reasoning was poured into Qwen/Llama students, and what most “distilled” open models mean.
On-policy distillation (GKD-style): the student generates, the teacher grades/corrects each token (reverse-KL on student samples). Fixes exposure bias — the student gets feedback on its own mistakes, not just on teacher-perfect prefixes — and is rapidly becoming the default for reasoning transfer.

PATTERN

The frontier ladder. Standard industry economics: train one expensive flagship, then distill a family (pro/flash/nano) for the latency-cost curve. Capability flows downhill from each frontier generation into models 10–100× cheaper within months.

7.3

Quantization fundamentals

Quantization maps continuous weights onto a small grid of representable values. The workhorse is uniform affine quantization; for weights, the symmetric (zero-point-free) form:

EQ 7.3 — SYMMETRIC UNIFORM QUANTIZATION $$ \hat{w} = s \cdot \mathrm{clamp}\!\Big( \mathrm{round}\big( w / s \big),\, -2^{b-1},\, 2^{b-1}-1 \Big), \qquad s = \frac{\max_i |w_i|}{2^{b-1}-1} $$

One FP scale $s$ per group of weights; the integers are stored, the scale rides along. The whole game is choosing the granularity of $s$: per-tensor (cheapest, coarsest) → per-channel → per-group of 64–128 (the GGUF/GPTQ standard) — smaller groups isolate outliers at slightly more bits/param overhead.
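The "slightly more bits/param" overhead is easy to quantify. The sketch below assumes one 16-bit scale per group and nothing else (no zero-points, no second-level scales); real formats carry extra metadata, so treat the numbers as a lower bound. The name `effective_bits` is illustrative.

```python
def effective_bits(weight_bits, group_size, scale_bits=16):
    """Stored bits per weight when each group carries one scale of `scale_bits` bits."""
    return weight_bits + scale_bits / group_size

for group_size in (32, 64, 128):
    print(f"4-bit weights, group of {group_size:3d}: "
          f"{effective_bits(4, group_size):.3f} bits/param")
```

Halving the group size doubles the per-weight cost of the scales, which is the trade being made when smaller groups are used to isolate outliers.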

Symmetric $b$-bit quantization uses $s = \dfrac{\max_i |w_i|}{2^{b-1}-1}$. For a group whose largest magnitude is $\max_i|w_i| = 0.6$, quantized to $b = 4$ bits, what is the scale $s$?

Denominator $= 2^{4-1} - 1 = 2^3 - 1 = 7$. So $s = \dfrac{0.6}{7} = $ 0.0857. Each stored integer is one of $\{-8,\dots,7\}$, recovered as integer$\times s$.
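The same arithmetic can be run on a small group to see the stored integers. This is a sketch of EQ 7.3 under the assumption of one scale for the whole group; the six weights are made up for illustration.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """EQ 7.3: return the stored integers and the scale for one group."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    q = np.clip(np.round(w / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

w = np.array([0.60, -0.31, 0.05, -0.02, 0.22, -0.60])   # hypothetical group, max |w| = 0.6
q, s = quantize_symmetric(w, bits=4)
w_hat = q * s                                            # dequantized weights

print("scale s      :", round(float(s), 4))
print("stored ints  :", q.tolist())
print("max abs error:", float(np.abs(w_hat - w).max()))
```

Because the scale is set by the largest magnitude, nothing is clipped here and every weight lands within half a step ($s/2$) of its original value.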

PYTHON · RUNNABLE IN-BROWSER

# Absmax INT-k roundtrip: one global scale vs groups of 64
import numpy as np
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 10_000)
w[rng.random(10_000) < 0.004] *= 6        # rare outliers, like real layers

def rmse(w, bits, group):
    qmax = 2**(bits - 1) - 1
    out = np.empty_like(w)
    for i in range(0, len(w), group):
        blk = w[i:i+group]
        s = np.abs(blk).max() / qmax          # EQ 7.3's scale, per group
        out[i:i+group] = np.clip(np.round(blk / s), -qmax - 1, qmax) * s
    return np.sqrt(np.mean((out - w)**2))

print("bits | one global scale | group-wise (g=64)")
for bits in [8, 4, 3, 2]:
    print(f"  {bits}  |    {rmse(w, bits, len(w)):.6f}      |    {rmse(w, bits, 64):.6f}")

print("\nthe outliers stretch the single scale s and crush the gaussian")
print("bulk onto a few levels; per-group scales quarantine them -- the")
print("whole reason GGUF/GPTQ ship a scale every 64-128 weights.")

INSTRUMENT 7.2 — QUANTIZE A WEIGHT TENSOR · 8,192 WEIGHTS · GAUSSIAN + OUTLIERS

[Interactive instrument: a bit-width control (initially 4-bit) with readouts for format, levels, RMS error, and 70B model weight memory.]

Grey: original distribution. Mint: surviving levels — mass collapses onto the grid. At 2–3 bits with one global scale, the rare outliers stretch s and crush the bulk into a few levels; switch group-wise scales ON and watch the error fall. That single observation motivates most of §7.4.

The outlier problem. LLM weight matrices are friendly (near-Gaussian) but activations are not: past ~6B parameters, a few hidden channels carry systematically huge magnitudes. Naïve W8A8 quantization breaks on them: this is the discovery, from LLM.int8(), that has shaped every method since.
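A toy experiment shows why a single hot channel is enough to break naïve activation quantization. The tensor shape, the outlier channel index, and the 60× gain are all assumptions chosen for illustration; the point is only the relative change in error on the ordinary channels.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(256, 512))   # (tokens, hidden channels), synthetic
X_outlier = X.copy()
X_outlier[:, 7] *= 60.0                     # one systematically huge channel

def int8_per_tensor(x):
    """Absmax INT8 with a single scale for the whole activation tensor."""
    s = np.abs(x).max() / 127.0
    return np.clip(np.round(x / s), -128, 127) * s

def rmse_on_ordinary_channels(x, outlier_channel=7):
    """Round-trip error measured only on the channels that are not outliers."""
    err = int8_per_tensor(x) - x
    keep = np.ones(x.shape[1], dtype=bool)
    keep[outlier_channel] = False
    return float(np.sqrt(np.mean(err[:, keep] ** 2)))

clean = rmse_on_ordinary_channels(X)
with_outlier = rmse_on_ordinary_channels(X_outlier)
print(f"RMSE without outlier channel: {clean:.4f}")
print(f"RMSE with outlier channel   : {with_outlier:.4f}")
```

The outlier channel sets the scale for the whole tensor, so the remaining 511 channels are left with only a handful of usable levels.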

7.4

Post-training quantization: the methods that matter

PTQ compresses a finished model with a small calibration set and no (or minimal) retraining — minutes to hours, and the way virtually every deployed quantized LLM is made.

GPTQ — error-correcting rounding

EQ 7.4 — LAYER-WISE OBJECTIVE $$ \min_{\widehat{W}} \;\big\| W X - \widehat{W} X \big\|_F^2 $$

Don't preserve the weights; preserve the layer's output on real calibration activations $X$. GPTQ quantizes one column at a time and redistributes each column's rounding error onto the not-yet-quantized columns, using second-order information (the Hessian $H = XX^\top$, up to a constant factor) in the Optimal Brain Surgeon lineage. The result is 3–4 bit weights with minor loss, at billion-parameter scale, in hours on one GPU.
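The column-by-column update can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: it assumes one fixed scale per output row taken from the original weights, a 1% diagonal damping term, and no blocking or group-wise scales. The synthetic activations share a common component so that input channels are correlated, which is where error redistribution pays off.

```python
import numpy as np

def row_scales(W, bits):
    qmax = 2 ** (bits - 1) - 1
    return np.abs(W).max(axis=1) / qmax            # one scale per output row (assumption)

def round_to_grid(w, s, bits):
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / s), -qmax - 1, qmax) * s

def gptq_sketch(W, X, bits=4, damp=0.01):
    """Quantize W (rows x cols) column by column, pushing error onto later columns."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    s = row_scales(W, bits)
    H = X @ X.T                                    # Hessian of EQ 7.4 up to a constant
    H += damp * np.mean(np.diag(H)) * np.eye(cols) # damping keeps H well conditioned
    U = np.linalg.cholesky(np.linalg.inv(H)).T     # upper factor: inv(H) = U.T @ U
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        Q[:, i] = round_to_grid(w, s, bits)
        err = (w - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])  # compensate remaining columns
    return Q

rng = np.random.default_rng(0)
rows, cols, n = 8, 16, 256
W = rng.normal(0, 0.02, size=(rows, cols))
X = rng.normal(size=(cols, n)) + 2.0 * rng.normal(size=(1, n))  # correlated channels

Q_rtn = round_to_grid(W, row_scales(W, 4)[:, None], 4)          # plain round-to-nearest
Q_gptq = gptq_sketch(W, X, bits=4)
err_rtn = float(np.linalg.norm(W @ X - Q_rtn @ X) ** 2)
err_gptq = float(np.linalg.norm(W @ X - Q_gptq @ X) ** 2)
print(f"layer output error, round-to-nearest: {err_rtn:.4f}")
print(f"layer output error, GPTQ sketch     : {err_gptq:.4f}")
```

Both results live on the same 4-bit grid; the only difference is which grid point each weight is assigned to.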

AWQ — protect what activations say matters

EQ 7.5 — ACTIVATION-AWARE SCALING $$ \hat{y} = \big( W \,\mathrm{diag}(s) \big)_{\text{quantized}} \cdot \big( \mathrm{diag}(s)^{-1} x \big) $$

A small fraction (~1%) of weight channels — those multiplying large activations — cause most of the damage. AWQ scales them up before quantization (and inversely scales the activations, mathematically a no-op) so rounding error lands on channels that matter least. No reconstruction loop; robust across domains; the standard for 4-bit instruction models. SmoothQuant applies the same migration trick to enable fast W8A8.
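EQ 7.5 can be checked numerically. The sketch below assumes per-row 4-bit rounding and picks the channel scales as a power $\alpha$ of the mean activation magnitude, searching $\alpha$ over a small grid; the data, the shapes, and the search grid are illustrative assumptions rather than the published recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 32, 64, 256
W = rng.normal(0, 0.02, size=(d_out, d_in))
X = rng.normal(size=(d_in, n))
X[:4] *= 30.0                                   # a few input channels with large activations

def quantize_rows(W, bits=4):
    """Symmetric round-to-nearest with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / s), -qmax - 1, qmax) * s

def awq_output(W, X, scale):
    """EQ 7.5: quantize W diag(s), then apply it to diag(s)^-1 x."""
    Wq = quantize_rows(W * scale[None, :])
    return Wq @ (X / scale[:, None])

Y = W @ X
act_mag = np.abs(X).mean(axis=1)                # per-input-channel activation magnitude
results = []
for alpha in np.linspace(0.0, 1.0, 11):
    scale = act_mag ** alpha                    # alpha = 0 gives scale = 1 (plain rounding)
    err = float(np.mean((Y - awq_output(W, X, scale)) ** 2))
    results.append((float(alpha), err))

rtn_err = results[0][1]
best_alpha, best_err = min(results, key=lambda r: r[1])
print(f"plain rounding error: {rtn_err:.6e}")
print(f"best alpha={best_alpha:.1f}, error: {best_err:.6e}")
```

Since $\alpha = 0$ reproduces plain rounding, the search can never do worse than it on the calibration data; before quantization the rescaling itself changes nothing, which is the "mathematically a no-op" part.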

The format landscape

Format / method	Bits	Quality cost	Where you meet it
BF16 (reference)	16	—	Training output, quality baseline
FP8 (E4M3)	8	≈ none	Datacenter serving on Hopper/Blackwell; weights + activations + KV
INT8 (SmoothQuant / LLM.int8)	8	negligible	Older datacenter GPUs, CPUs
INT4 group-wise (GPTQ / AWQ / GGUF Q4_K)	~4.2–4.6	small, task-dependent	The local-inference default (llama.cpp, Ollama)
NF4 (QLoRA)	~4.1	small	Fine-tuning base weights (CH 06)
MXFP4 / NVFP4	4 + micro-scales	small	Blackwell-native block-scaled FP4; GPT-OSS ships in it
~2-bit (AQLM / QuIP#, vector quant)	2–2.5	visible	Research edge; rotations + codebooks

KV-cache quantization (FP8/INT4 keys and values) composes with all of the above and directly multiplies serving concurrency — revisit Instrument 03.
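A back-of-envelope calculation shows how KV precision translates into concurrency. The layer count, KV-head count, and head dimension below are assumed values for a 70B-class model with grouped-query attention, and the memory budget and context length are hypothetical; the sketch also ignores the small overhead of quantization scales.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies (factor 2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)   # assumed 70B-class configuration
kv_budget_bytes = 20e9                                  # hypothetical HBM left for KV cache
context_tokens = 8192                                   # hypothetical tokens per sequence

sequences = {}
for name, bytes_per_elem in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    per_sequence = kv_bytes_per_token(bytes_per_elem=bytes_per_elem, **shape) * context_tokens
    sequences[name] = int(kv_budget_bytes // per_sequence)
    print(f"{name:5s} {per_sequence / 1e9:5.2f} GB per sequence -> "
          f"{sequences[name]} concurrent sequences")
```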

7.5

Quantization-aware training

When PTQ's accuracy floor isn't enough — extreme bit widths, or shipping a flagship at FP4 — train with quantization in the loop. The forward pass uses fake-quantized weights; the backward pass pretends rounding didn't happen:

EQ 7.6 — STRAIGHT-THROUGH ESTIMATOR $$ \frac{\partial \mathcal{L}}{\partial w} \;\approx\; \frac{\partial \mathcal{L}}{\partial \hat{w}} \cdot \mathbb{1}\big[\, |w/s| \le 2^{b-1} \big] $$

round() has zero gradient almost everywhere, so the STE passes gradients straight through inside the clipping range. The model learns weights that sit comfortably on the grid. Cost: a (usually short) training run with training-grade infrastructure — reserved for high-volume deployments where the last percent matters. Llama-3.2's QAT+LoRA spins and Gemma's QAT releases are the open exemplars.
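The forward/backward split can be written out by hand. The following is a minimal NumPy sketch of EQ 7.6 with a toy objective (make the fake-quantized weights match a target vector); the fixed scale, learning rate, and step count are arbitrary assumptions, and a real QAT run would use an autodiff framework and the actual task loss.

```python
import numpy as np

def fake_quant(w, s, bits):
    """Forward pass: snap weights to the b-bit grid but keep them in floating point."""
    qmax = 2 ** (bits - 1) - 1
    return s * np.clip(np.round(w / s), -qmax - 1, qmax)

def ste_backward(grad_w_hat, w, s, bits):
    """EQ 7.6: pass the gradient through wherever w/s is inside the clipping range."""
    inside = np.abs(w / s) <= 2 ** (bits - 1)
    return grad_w_hat * inside

rng = np.random.default_rng(0)
target = rng.normal(0, 0.05, size=64)      # hypothetical values the layer should reach
w = np.zeros(64)                           # latent full-precision weights
s, bits, lr = 0.02, 4, 0.25                # fixed scale and step size (assumptions)

def loss(w):
    return float(np.sum((fake_quant(w, s, bits) - target) ** 2))

initial_loss = loss(w)
for _ in range(200):
    w_hat = fake_quant(w, s, bits)
    grad_w_hat = 2.0 * (w_hat - target)    # gradient of the loss w.r.t. w_hat
    w -= lr * ste_backward(grad_w_hat, w, s, bits)
final_loss = loss(w)
print(f"loss before: {initial_loss:.4f}   after: {final_loss:.4f}")
```

The latent weights `w` stay in full precision throughout; only `fake_quant(w)` is what the deployed model would store.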

7.6

Pruning & structured sparsity

Pruning zeroes connections outright. Magnitude pruning (drop the smallest $|w|$) needs no data; Wanda ranks by $|w| \cdot \|x\|$ (weight × typical input magnitude) and prunes LLMs to ~50% unstructured sparsity with little loss and no retraining; SparseGPT runs a GPTQ-style reconstruction.
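The difference between the two scoring rules shows up as soon as input channels differ in scale. The sketch below compares magnitude pruning with the $|w| \cdot \|x\|$ score at 50% sparsity, ranking within each output row; the synthetic activations with channel gains spanning two orders of magnitude are an assumption made to expose the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 32, 64, 512
W = rng.normal(0, 0.02, size=(d_out, d_in))
channel_gain = np.logspace(-1, 1, d_in)                 # input channels of very different scale
X = rng.normal(size=(d_in, n)) * channel_gain[:, None]  # synthetic calibration activations

def prune_rows(W, score, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights within each output row."""
    k = int(W.shape[1] * sparsity)
    drop = np.argsort(score, axis=1)[:, :k]             # indices of the k lowest scores
    pruned = W.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

x_norm = np.linalg.norm(X, axis=1)                      # ||x_j|| per input channel
W_magnitude = prune_rows(W, np.abs(W))
W_wanda = prune_rows(W, np.abs(W) * x_norm[None, :])

def output_error(W_pruned):
    return float(np.linalg.norm(W @ X - W_pruned @ X))

err_magnitude = output_error(W_magnitude)
err_wanda = output_error(W_wanda)
print(f"output error, magnitude pruning : {err_magnitude:.3f}")
print(f"output error, |w|*||x|| scoring : {err_wanda:.3f}")
```

Both pruned matrices have exactly half their entries zeroed; only the choice of which half differs.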

You prune a 7B-parameter model to 50% unstructured sparsity (Wanda-style). How many billions of weights remain nonzero?

Half are zeroed: $7\text{B} \times (1 - 0.50) = 7 \times 0.5 = $ 3.5B nonzero. (Without 2:4 structure or sparse kernels, though, those zeros rarely buy real speed on dense matmul units.)

Unstructured sparsity is hard to monetize — random zeros don't speed up dense matmul units.
2:4 semi-structured (two zeros in every four weights) is the exception: NVIDIA tensor cores execute it at up to 2× — the one sparsity pattern with first-class hardware.
Structural pruning + heal: remove whole layers/heads/width, then distill briefly to recover (Minitron-style: 15B → 8B → 4B families at a fraction of from-scratch cost).
MoE as “learned sparsity”: the most successful sparsity story of all is architectural — activate only the experts you need (Chapter 09).
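The 2:4 pattern from the list above is simple to construct. This sketch keeps the two largest-magnitude weights in every consecutive group of four along each row; choosing by magnitude is an assumption (the simplest criterion), and the code only builds the zero pattern, not the compressed layout that sparse kernels consume.

```python
import numpy as np

def prune_2_of_4(W):
    """Zero the two smallest-magnitude weights in each group of four along every row."""
    rows, cols = W.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be a multiple of 4")
    groups = W.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]   # two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(4, 16))
W_24 = prune_2_of_4(W)

nonzeros_per_group = (W_24.reshape(4, 4, 4) != 0).sum(axis=-1)
print("nonzeros in each group of four:")
print(nonzeros_per_group)
print("overall sparsity:", float((W_24 == 0).mean()))
```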

The model is trained, aligned, adapted, and shrunk. Chapter 08: what actually happens when a request arrives — prefill, decode, batching, paging, speculation, and the serving stack that turns weights into a product.
