05 · Post-training — LLM Field Manual

5.1

The alignment pipeline at a glance

FIG 5.A · FROM BASE MODEL TO ASSISTANT

Stacked, not exclusive. Production pipelines iterate several rounds: SFT → preference optimization → RL on verifiable tasks → safety-specific passes, each stage consuming the previous stage's model. Post-training costs somewhere between under 1% and about 10% of pre-training compute, and it is now where products differentiate (and where reasoning RL keeps growing that share).

5.2

Supervised fine-tuning

SFT is pre-training's loss on curated conversations: prompts plus high-quality demonstration responses, formatted in a chat template (special tokens delimiting system / user / assistant turns). The single technical wrinkle is masking — the loss is computed only on response tokens:

EQ 5.1 — SFT LOSS (RESPONSE-MASKED) $$ \mathcal{L}_{\text{SFT}} = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}} \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right) $$

The model learns to produce answers, not to predict prompts. Quality dominates quantity: a few thousand to low-millions of carefully written or model-generated-then-filtered examples. LIMA's “superficial alignment hypothesis” — that SFT mostly teaches style and format on top of pre-trained capability — holds up reasonably well, with the important exception of distilled reasoning traces (§5.7).
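The masking in EQ 5.1 can be sketched in a few lines of NumPy. This is a toy illustration, not a training loop: the random logits, the sequence length, and the helper names `log_softmax` and `sft_loss` are all assumptions made for the example. It sums the loss as EQ 5.1 does; real trainers often average over response tokens instead.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets, response_mask):
    """Sum of -log p(target) over response positions only (EQ 5.1, one example).

    logits:        (T, V) next-token scores at each position
    targets:       (T,)   the token that actually comes next
    response_mask: (T,)   1 where the target is a response token, 0 for prompt
    """
    logp = log_softmax(logits)
    token_nll = -logp[np.arange(len(targets)), targets]
    return float((token_nll * response_mask).sum())

rng = np.random.default_rng(0)
T, V = 6, 10                           # toy sequence length and vocabulary size
logits = rng.normal(size=(T, V))       # stand-in for model outputs
targets = rng.integers(0, V, size=T)
mask = np.array([0, 0, 0, 1, 1, 1])    # first 3 targets are prompt, last 3 response

full = sft_loss(logits, targets, np.ones(T))
masked = sft_loss(logits, targets, mask)
print(f"unmasked NLL {full:.3f}   response-only NLL {masked:.3f}")
```

Because prompt positions are multiplied by zero, changing the model's predictions on the prompt leaves the masked loss untouched, which is the sense in which the model is not trained to predict prompts.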

5.3

Reward models: judgment as a function

Humans cannot write a reward function for “helpful, honest, harmless” — but they can compare two answers. A reward model $r_\phi(x, y)$ (usually the LLM itself with a scalar head) is trained on those comparisons under the Bradley–Terry assumption:

EQ 5.2 — BRADLEY–TERRY PREFERENCE LOSS $$ \mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\, y_w \succ y_l)} \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big] $$

$y_w$ = chosen, $y_l$ = rejected. The model only learns reward differences — absolute scale is unidentified. Reward models are imperfect proxies, which makes them gameable: over-optimize and you harvest sycophancy, verbosity, and confident nonsense. That failure mode is called reward hacking, and the KL term in EQ 5.3 is its leash.
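A small check of the "absolute scale is unidentified" claim, assuming toy reward scores and a hypothetical helper `bt_loss` that implements EQ 5.2 for a batch: adding the same constant to every score leaves the loss unchanged, because only the margin enters the sigmoid.

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    # -log sigmoid(margin), written as log(1 + exp(-margin)) for stability
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.logaddexp(0.0, -margin)))

r_w = np.array([2.5, 0.3, -1.0])   # toy scores for chosen answers
r_l = np.array([1.5, 0.9, -2.0])   # toy scores for rejected answers

base = bt_loss(r_w, r_l)
shifted = bt_loss(r_w + 100.0, r_l + 100.0)   # add a constant to every score
print(f"loss {base:.4f}   after +100 shift {shifted:.4f}")
```

The two printed losses agree, so nothing in training pins down where zero reward sits; only differences between answers to the same prompt are meaningful.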

A reward model scores the chosen answer $ r(y_w) = 2.5 $ and the rejected one $ r(y_l) = 1.5 $. Under Bradley–Terry, what is $ \sigma(\Delta r) = P(\text{prefer } y_w) $? (Use $ e^{-1} = 0.3679 $.)

Margin $ \Delta r = 2.5 - 1.5 = 1.0 $. Then $ \sigma(1) = \dfrac{1}{1 + e^{-1}} = \dfrac{1}{1.3679} = $ 0.731.

PYTHON · RUNNABLE IN-BROWSER

# Bradley-Terry P(win) and the DPO gradient: solved pairs stop teaching
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

print("margin   P(y_w wins)   loss -log P   |dL/dmargin| = sigma(-margin)")
for m in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0):
    p = sigmoid(m)
    print(f"{m:6.1f} {p:12.4f} {-np.log(p):13.4f} {sigmoid(-m):18.4f}")

print("\nmargin = r(y_w) - r(y_l) for a reward model (EQ 5.2);")
print("margin = beta*(logratio_w - logratio_l) for DPO (EQ 5.5) -- same sigmoid.")
print("at margin 8 the gradient is 0.0003: a confidently-ordered pair is inert.")
print("misordered pairs (margin < 0) carry weight ~1 and dominate every update.")

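xs = np.linspace(-6, 6, 100)          # the gradient-magnitude curve
# plot_xy is not a NumPy function; it is assumed to be supplied by the
# page's in-browser runtime. Outside that runtime, swap in your own plotter.
plot_xy(xs, sigmoid(-xs))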


INSTRUMENT 5.1 · PREFERENCE MARGIN (EQ 5.2 · σ(Δr)): interactive slider for the reward margin Δ = r(y_w) − r(y_l), default 1.0, reading out P(human prefers y_w) under Bradley–Terry.

The same sigmoid is DPO's engine: there, Δ becomes β·(log-ratio of chosen − log-ratio of rejected), so pushing P(prefer) toward 1 directly reshapes the policy's probabilities. Note the flat tails — once a pair is confidently ordered, its gradient vanishes.

5.4

RLHF with PPO

With a reward model in hand, alignment becomes reinforcement learning: the LLM is a policy $\pi_\theta$, a full generated response is an episode, and we maximize reward without drifting far from the SFT model:

EQ 5.3 — THE RLHF OBJECTIVE $$ \max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta} \Big[ r_\phi(x, y) \Big] \;-\; \beta\, \mathrm{KL}\!\big[ \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \big] $$

The KL penalty against the frozen reference (SFT) policy keeps text fluent and bounds reward hacking: the policy may only spend so much “distributional budget” chasing reward. $\beta$ tunes the leash length.
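One common way to apply EQ 5.3 to a single sampled response is to estimate the KL term from the sampled tokens' log-probabilities and subtract it from the reward. The sketch below assumes that one-sample estimator and uses made-up log-probs; `kl_shaped_reward` is a hypothetical helper, not a library function.

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward minus beta times a one-sample KL estimate for one response.

    logp_policy / logp_ref: per-token log-probs of the sampled tokens under
    the current policy and the frozen reference.
    """
    kl_estimate = float(np.sum(np.asarray(logp_policy) - np.asarray(logp_ref)))
    return reward - beta * kl_estimate, kl_estimate

# toy numbers: the policy has become more confident than the reference
logp_policy = [-0.2, -0.1, -0.3, -0.1]
logp_ref    = [-1.0, -0.8, -1.2, -0.9]

for beta in (0.0, 0.05, 0.5):
    shaped, kl = kl_shaped_reward(3.0, logp_policy, logp_ref, beta)
    print(f"beta={beta:4.2f}  KL estimate={kl:.2f}  shaped reward={shaped:.3f}")
```

With the raw reward fixed, a larger $\beta$ makes the same drift from the reference more expensive, which is the leash-length reading of $\beta$.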

PPO (proximal policy optimization) is the workhorse algorithm. Its core is a clipped surrogate that forbids any single update from moving the policy too far:

EQ 5.4 — PPO CLIPPED SURROGATE $$ \mathcal{L}_{\text{PPO}} = -\,\mathbb{E}_t \Big[ \min\Big( \rho_t\, \hat{A}_t,\;\; \mathrm{clip}\big(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_t \Big) \Big], \qquad \rho_t = \frac{\pi_\theta(y_t \mid \cdot)}{\pi_{\theta_{\text{old}}}(y_t \mid \cdot)} $$

$\hat{A}_t$ is the advantage: how much better this token was than expected, estimated with GAE (generalized advantage estimation) on top of a learned value model. The clip at $\varepsilon \approx 0.2$ removes any incentive to push a token's probability ratio outside $[1-\varepsilon,\, 1+\varepsilon]$ in the direction the advantage favors. Total machinery in flight: policy + reference + reward + value models, four large networks. RLHF-PPO is famously an engineering project, which set the stage for the two simplifications that follow.
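EQ 5.4 is short enough to write out directly. The following is a minimal NumPy sketch with toy probabilities and advantages (all values are assumptions for illustration); it omits the value loss, entropy bonus, and KL term that a full PPO implementation would add.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))   # rho_t
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # take the more pessimistic of the two, then negate to get a loss
    return float(-np.mean(np.minimum(unclipped, clipped)))

old = np.log([0.10, 0.10, 0.10])   # old policy's probability for 3 tokens
adv = [1.0, 1.0, 1.0]              # all three tokens were better than expected
for p_new in (0.10, 0.12, 0.20):
    new = np.log([p_new] * 3)
    print(f"rho={p_new / 0.10:.1f}  loss={ppo_clip_loss(new, old, adv):.3f}")
```

Once the ratio passes $1+\varepsilon$ with a positive advantage, the loss stops improving, so there is no gradient pulling the policy further in that update. The `min` is deliberately one-sided: a move in the wrong direction (for example ratio 2 with a negative advantage) is not clipped and is penalized in full.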

5.5

DPO: preference optimization without RL

Direct Preference Optimization begins from a closed-form fact: the optimal policy for EQ 5.3 is $ \pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\, e^{r(x,y)/\beta} $. Inverting this expresses the reward in terms of the policy itself — substitute into the Bradley–Terry loss and the reward model cancels out entirely:

EQ 5.5 — DPO LOSS $$ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] $$

A plain classification loss over preference pairs: push the chosen response's likelihood ratio up, the rejected one's down. The language model is secretly its own reward model. No sampling loop, no value network, no RM: each pair needs only the log-probabilities of $y_w$ and $y_l$ under the policy and under the frozen reference, and the reference values can be precomputed once. Variants abound (IPO, KTO for unpaired thumbs-up/down, ORPO without a reference, SimPO length-normalized and also reference-free); DPO-family methods dominate open-weight post-training, while frontier labs still lean on online RL for the final stretch.
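The cancellation behind EQ 5.5 can be spelled out in two steps. The normalizer $Z(x)$ is notation introduced here for the derivation; it does not appear in the equations above. Taking logs of the optimal-policy expression and solving for the reward gives

$$ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), \qquad Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\, e^{r(x, y)/\beta} $$

Since $Z(x)$ depends only on the prompt, it drops out of any difference between two responses to the same prompt:

$$ r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} $$

Plugging this margin into EQ 5.2 and parameterizing $\pi^*$ by $\pi_\theta$ yields EQ 5.5.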

In DPO with $ \beta = 0.1 $, the chosen log-ratio is $ \log\frac{\pi_\theta}{\pi_{\text{ref}}}(y_w) = 2 $ and the rejected log-ratio is $ -3 $. What is the margin $ \beta\big(\text{logratio}_w - \text{logratio}_l\big) $ fed to the sigmoid?

Difference $ = 2 - (-3) = 5 $. Margin $ = \beta \times 5 = 0.1 \times 5 = $ 0.5 — the same quantity $ \Delta $ that Bradley–Terry's sigmoid consumes.
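The same arithmetic as a function, as a sketch under the assumption that sequence-level log-probabilities are already available. The four log-prob values are invented so that the log-ratios come out to $+2$ and $-3$, matching the worked example; `dpo_loss` is a hypothetical helper.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO margin and loss for one pair from sequence-level log-probs (EQ 5.5)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = np.logaddexp(0.0, -margin)       # equals -log sigmoid(margin)
    return float(margin), float(loss)

# toy sequence log-probs chosen so the log-ratios are +2 and -3
margin, loss = dpo_loss(logp_w=-40.0, logp_l=-55.0,
                        ref_logp_w=-42.0, ref_logp_l=-52.0, beta=0.1)
p_prefer = 1 / (1 + np.exp(-margin))
print(f"margin={margin:.2f}  P(prefer y_w)={p_prefer:.4f}  loss={loss:.4f}")
```

Note that only ratios against the reference matter: the chosen response has a lower absolute log-probability than many short strings would, yet it still earns a positive margin because the policy raised it relative to the reference.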

5.6

GRPO: RL slimmed for LLMs

Group Relative Policy Optimization (DeepSeek) keeps online RL but deletes the value network. For each prompt, sample a group of $G$ responses; the baseline is simply the group's mean reward:

EQ 5.6 — GROUP-RELATIVE ADVANTAGE $$ \hat{A}_i \;=\; \frac{r_i - \mathrm{mean}\big(r_1, \ldots, r_G\big)}{\mathrm{std}\big(r_1, \ldots, r_G\big)} $$

Each response is scored against its siblings: better than the group average ⇒ positive advantage, all its tokens reinforced (with PPO-style clipping and a KL term). Dropping the value network removes one of PPO's four large models, and with verifiable rewards the reward model goes too, leaving half the networks. It is a natural fit for verifiable rewards, where sampling 8–64 attempts per problem is exactly what you want anyway. This is the algorithm behind DeepSeek-R1; successors (DAPO, Dr. GRPO, GSPO) patch its length and difficulty biases.

A GRPO group of $ G = 4 $ responses earns verifiable rewards $ (1,\ 0,\ 1,\ 0) $. The group mean is $ 0.5 $ and the (population) std is $ 0.5 $. What is the advantage $ \hat{A}_i $ of a response that scored $ r_i = 1 $?

$ \hat{A}_i = \dfrac{r_i - \text{mean}}{\text{std}} = \dfrac{1 - 0.5}{0.5} = $ 1 — a correct answer sits one standard deviation above its siblings, so all its tokens get reinforced.
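EQ 5.6 in NumPy, as a minimal sketch. The small `eps` in the denominator is an assumption added here to avoid dividing by zero when every response in a group gets the same reward; it is not part of EQ 5.6.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    # np.std defaults to the population std (ddof=0), as in the worked example
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1, 0, 1, 0]))   # mixed group: correct answers positive
print(group_advantages([1, 1, 1, 1]))   # uniform group: no learning signal
```

The second call shows why uniform groups are wasted compute: every centered reward is zero, so no token in the group is reinforced or discouraged.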

PYTHON · RUNNABLE IN-BROWSER

# what a topology costs: one task, four shapes, tokens + wall-clock
UNIT = 30_000     # tokens a single agent burns solving the task alone
RESULT = 300      # a structured result handed back (never a transcript)

shapes = {"single agent": (UNIT, 1.00)}
# orchestrator + 3 parallel workers, each ~40% of the exploring + handback
shapes["orchestrator-workers"] = (int(0.25*UNIT + 3*0.4*UNIT + 3*RESULT), 0.25 + 0.40 + 0.10)
# pipeline: 4 serial stages at ~30% each, artifact checks at the seams
shapes["pipeline"] = (int(4*0.3*UNIT + 3*RESULT), 4 * 0.30)
# council: 3 full independent attempts + a judge reading three results
shapes["council + judge"] = (int(3*UNIT + 3*RESULT + 2_000), 1.00 + 0.10)

print(f"{'topology':22s}{'tokens':>8s}{'vs single':>10s}{'wall-clock':>11s}")
for name, (tok, wall) in shapes.items():
    print(f"{name:22s}{tok:8,d}{tok/UNIT:9.2f}x{wall:10.2f}")

o_tok, o_wall = shapes["orchestrator-workers"]
c_tok = shapes["council + judge"][0]
print(f"\nfan-out buys wall-clock, never tokens: the orchestrator runs "
      f"{1 - o_wall:.0%} faster for {o_tok/UNIT - 1:.0%} more tokens;")
print(f"the council pays {c_tok/UNIT:.1f}x for independent judgment — worth it only")
print("when no ground-truth verifier exists, because tests beat votes")


INSTRUMENT 5.2 · GROUP ADVANTAGES (EQ 5.6 · G = 8 · verifiable reward): interactive sampler reporting the group mean, the group std, and a reading of the resulting advantages.

Eight attempts at one math problem; r = 1 if the verifier accepts. Keep sampling — when a group comes back all-correct or all-wrong, advantages collapse to zero. Curriculum (problems near the model's edge) is what keeps GRPO's gradient alive.

5.7

Reasoning models: RL on verifiable rewards

The decisive shift of 2024–25: for math, code, and logic, you don't need a learned reward model at all. The answer is checkable — a unit test passes, the boxed number matches. RLVR (RL with verifiable rewards) optimizes against that binary signal:

EQ 5.7 — VERIFIABLE REWARD $$ r(x, y) = \mathbb{1}\big[\, \mathrm{verify}(x, y) \,\big] \;+\; \lambda_{\text{fmt}}\, \mathbb{1}\big[\,\text{format ok}\,\big] $$

Unhackable (to first order), scalable to as many problems as you can generate and check, no human raters. Trained with GRPO at scale, models spontaneously learn to emit long chains of thought, check their own work, backtrack, and try alternatives. DeepSeek-R1's training curves show response length and accuracy growing together, with the “aha moment” emerging rather than being taught.

RLVR reward with $ \lambda_{\text{fmt}} = 0.2 $: an answer passes the verifier ($ \mathrm{verify} = 1 $) and is correctly formatted ($ \text{format ok} = 1 $). What total reward $ r(x,y) $ does EQ 5.7 assign?

$ r = \mathbb{1}[\text{verify}] + \lambda_{\text{fmt}}\,\mathbb{1}[\text{format}] = 1 + 0.2 \times 1 = $ 1.2. A correct-but-unformatted answer would score only $ 1.0 $; a formatted-but-wrong one only $ 0.2 $.
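A toy version of EQ 5.7 for math answers, to make the three cases concrete. Everything specific here is an assumption for illustration: the format rule (final answer inside `\boxed{...}`), the fallback of reading the last number in an unformatted response, and exact string matching as the verifier. Real verifiers normalize answers far more carefully.

```python
import re

def extract_answer(response):
    # hypothetical format rule: the final answer sits inside \boxed{...}
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed.group(1).strip(), True
    # unformatted fallback: treat the last number in the text as the answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return (numbers[-1] if numbers else None), False

def rlvr_reward(response, gold, lam=0.2):
    answer, format_ok = extract_answer(response)
    verified = answer is not None and answer == gold.strip()
    return float(verified) + lam * float(format_ok)

for text in (r"so the total is \boxed{42}",
             "so the total is 42",
             r"so the total is \boxed{41}"):
    print(f"{rlvr_reward(text, '42'):.1f}  <-  {text}")
```

Keeping $\lambda_{\text{fmt}}$ well below 1 matters: the format bonus should nudge the model toward parseable output without ever making a well-formatted wrong answer competitive with a correct one.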

Test-time compute as a new scaling axis. o1/R1-class models trade tokens for accuracy: more thinking tokens, better answers — a dial (reasoning effort) exposed to users and a curve that compounds with train-time scaling.
Distilled reasoning. SFT on traces sampled from a strong reasoning model transfers a surprising fraction of the skill to small models (R1-distill family) — cheaper than running RL on the small model itself (Chapter 07).
Open challenge. Extending RLVR beyond verifiable domains — essays, strategy, taste — currently routes through model-as-judge rewards (rubric- or AI-feedback based), reintroducing the proxy-gaming problem in subtler form.

5.8

Constitutional AI & RLAIF

Human feedback does not scale to every edge case, and raters disagree. Constitutional AI (Anthropic) replaces much of the human signal with an explicit list of principles: the model critiques and revises its own outputs against the constitution (supervised phase), then an AI judge applies the same principles to generate preference labels for RL (RLAIF). The result is cheaper, more consistent, and — importantly — auditable: the normative choices live in a readable document rather than in a million implicit rating decisions.

Production stacks blend everything in this chapter: human preferences where stakes are high, AI feedback for breadth, verifiable rewards where possible, plus deliberate safety training (refusal boundaries, jailbreak robustness) and post-hoc evals/red-teaming as the release gate.
