The alignment pipeline at a glance
Supervised fine-tuning
SFT is pre-training's loss on curated conversations: prompts plus high-quality demonstration responses, formatted in a chat template (special tokens delimiting system / user / assistant turns). The single technical wrinkle is masking — the loss is computed only on response tokens:
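The masking step can be sketched in a few lines of NumPy. This is a minimal illustration, not a training loop: the function name `masked_sft_loss`, the array shapes, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def masked_sft_loss(logits, targets, response_mask):
    """Mean next-token cross-entropy over response positions only.

    logits:        (T, V) unnormalized scores for the next token at each position
    targets:       (T,)   the token id that actually comes next
    response_mask: (T,)   1 where the target is an assistant-response token, else 0
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    mask = response_mask.astype(float)
    # Prompt and template positions are multiplied by 0 and so contribute nothing.
    # Assumes at least one response token (mask.sum() > 0).
    return (token_nll * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
T, V = 6, 10                                # toy sequence length and vocabulary size
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
mask = np.array([0, 0, 0, 1, 1, 1])         # first three targets belong to the prompt

loss = masked_sft_loss(logits, targets, mask)
print(f"masked SFT loss: {loss:.4f}")
```

Because masked positions are zeroed before averaging, changing the logits at prompt positions leaves the loss (and therefore the gradient) untouched.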
Reward models: judgment as a function
Humans cannot write down a reward function for “helpful, honest, harmless”, but they can compare two answers. A reward model \(r_\phi(x, y)\) (usually a copy of the LLM with its output layer replaced by a scalar head) is trained on those comparisons under the Bradley–Terry assumption:
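In its usual form, the Bradley–Terry model and the loss it induces read as follows. The notation is assumed for this reconstruction: \(y_w\) is the preferred response, \(y_l\) the rejected one, \(\sigma\) the logistic sigmoid, and \(\mathcal{D}\) the dataset of comparisons.

```latex
\[
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),
\qquad
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
\]
```

Only the difference of the two rewards enters the loss, so the reward model is determined up to an additive constant per prompt.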
# Bradley-Terry P(win) and the DPO gradient: solved pairs stop teaching
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print("margin P(y_w wins) loss -log P |dL/dmargin| = sigma(-margin)")
for m in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0):
    p = sigmoid(m)  # Bradley-Terry probability that the preferred answer wins
    print(f"{m:6.1f} {p:12.4f} {-np.log(p):13.4f} {sigmoid(-m):18.4f}")
print("\nmargin = r(y_w) - r(y_l) for a reward model (EQ 5.2);")
print("margin = beta*(logratio_w - logratio_l) for DPO (EQ 5.5) -- same sigmoid.")
print("at margin 8 the gradient is 0.0003: a confidently-ordered pair is inert.")
print("misordered pairs (margin < 0) carry weight between 0.5 and 1 and dominate every update.")
xs = np.linspace(-6, 6, 100)  # the gradient-magnitude curve
# plot_xy is assumed to be a plotting helper supplied by the host page; skip it when absent
if "plot_xy" in globals():
    plot_xy(xs, sigmoid(-xs))
RLHF with PPO
With a reward model in hand, alignment becomes reinforcement learning: the LLM is a policy \(\pi_\theta\), a full generated response is an episode, and we maximize reward without drifting far from the SFT model:
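A common way to write this objective is shown below. The notation is assumed for this reconstruction: \(\pi_{\text{ref}}\) is the frozen SFT model, \(\beta > 0\) is the KL coefficient, and \(\mathcal{D}\) is the prompt distribution.

```latex
\[
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}
\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)\Big]
\]
```

A larger \(\beta\) keeps the policy closer to the SFT model; a smaller one lets it chase reward more aggressively, including any flaws in the reward model.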
PPO (proximal policy optimization) is the workhorse algorithm. Its core is a clipped surrogate objective that removes the incentive for any single update to move the policy too far from the one that generated the samples:
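The clipped surrogate from Schulman et al. (2017) can be sketched as below. The function name, the default `eps=0.2`, and the toy inputs are illustrative assumptions; a real RLHF implementation works per token and adds value-function and KL terms that are omitted here.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)                 # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the minimum makes the objective pessimistic: pushing the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors earns nothing.
    return np.minimum(unclipped, clipped).mean()

# Sweep the probability ratio for one sample with a positive advantage.
for r in (0.5, 0.9, 1.0, 1.1, 1.2, 1.5, 2.0):
    obj = ppo_clip_objective(np.log([r]), np.log([1.0]), np.array([1.0]))
    print(f"ratio {r:4.1f} -> objective {obj:5.2f}")
```

With a positive advantage the objective stops growing once the ratio exceeds `1 + eps`; with a negative advantage it stops improving once the ratio falls below `1 - eps`. Moves in the harmful direction are never clipped, so they are still penalized in full.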
DPO: preference optimization without RL
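Direct Preference Optimization begins from a closed-form fact: the optimal policy for EQ 5.3 is \( \pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\, e^{r(x,y)/\beta} \). Inverting this expresses the reward in terms of the policy itself, \( r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \), where the normalizer \(Z(x)\) depends only on the prompt. Substitute into the Bradley–Terry loss: \(Z(x)\) cancels in the difference between the two responses, and the explicit reward model drops out entirely: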
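The resulting loss, in the form given by Rafailov et al. (2023), is \( \mathcal{L}_{\text{DPO}} = -\mathbb{E}\big[\log \sigma\big(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\big)\big] \). A minimal NumPy sketch follows; the function name, the default `beta=0.1`, and the toy log-probabilities are assumptions, and each input is taken to be the summed log-probability of a whole response.

```python
import numpy as np

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from sequence log-probs of chosen (w) and rejected (l) responses."""
    logratio_w = pol_logp_w - ref_logp_w      # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    logratio_l = pol_logp_l - ref_logp_l
    margin = beta * (logratio_w - logratio_l)
    # -log sigmoid(margin) == log(1 + exp(-margin)), computed stably
    return np.logaddexp(0.0, -margin).mean()

# Toy batch of two preference pairs (made-up numbers).
ref_w = np.array([-12.0, -9.0])
ref_l = np.array([-11.0, -9.5])

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)                # policy == reference
loss_improved = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)  # policy favors y_w more
print(f"at reference: {loss_at_ref:.4f}   after shifting mass to y_w: {loss_improved:.4f}")
```

When the policy equals the reference, every margin is zero and the loss is \(\log 2\); the loss falls only as the policy raises the chosen response's log-ratio relative to the rejected one.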
GRPO: RL slimmed for LLMs
Group Relative Policy Optimization (DeepSeek) keeps online RL but deletes the value network. For each prompt, sample a group of \(G\) responses; the baseline is simply the group's mean reward:
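The group baseline can be sketched as below. The function name and the toy rewards are assumptions; dividing by the group's standard deviation follows the normalization described in the GRPO paper (Shao et al., 2024) and goes one step beyond the mean baseline described above.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()        # the group mean is the baseline
    return centered / (rewards.std() + eps)    # eps guards against a zero-variance group

# One prompt, G = 4 sampled responses scored 1 (correct) or 0 (incorrect).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)

# If every response in the group gets the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Every token of a response then shares that response's advantage in the policy-gradient update, which is what lets GRPO drop the separate value network.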
# what a topology costs: one task, four shapes, tokens + wall-clock
UNIT = 30_000 # tokens a single agent burns solving the task alone
RESULT = 300 # a structured result handed back (never a transcript)
shapes = {"single agent": (UNIT, 1.00)}
# orchestrator + 3 parallel workers, each ~40% of the exploring + handback
shapes["orchestrator-workers"] = (int(0.25*UNIT + 3*0.4*UNIT + 3*RESULT), 0.25 + 0.40 + 0.10)
# pipeline: 4 serial stages at ~30% each, artifact checks at the seams
shapes["pipeline"] = (int(4*0.3*UNIT + 3*RESULT), 4 * 0.30)
# council: 3 full independent attempts + a judge reading three results
shapes["council + judge"] = (int(3*UNIT + 3*RESULT + 2_000), 1.00 + 0.10)
print(f"{'topology':22s}{'tokens':>8s}{'vs single':>10s}{'wall-clock':>11s}")
for name, (tok, wall) in shapes.items():
    print(f"{name:22s}{tok:8,d}{tok/UNIT:9.2f}x{wall:10.2f}")
o_tok, o_wall = shapes["orchestrator-workers"]
c_tok = shapes["council + judge"][0]
print(f"\nfan-out buys wall-clock, never tokens: the orchestrator runs "
f"{1 - o_wall:.0%} faster for {o_tok/UNIT - 1:.0%} more tokens;")
print(f"the council pays {c_tok/UNIT:.1f}x for independent judgment — worth it only")
print("when no ground-truth verifier exists, because tests beat votes")
Reasoning models: RL on verifiable rewards
The decisive shift of 2024–25: for math, code, and logic, you don't need a learned reward model at all. The answer is checkable — a unit test passes, the boxed number matches. RLVR (RL with verifiable rewards) optimizes against that binary signal:
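A verifiable reward can be as small as a string check. The sketch below assumes a hypothetical convention in which the model writes its final answer inside `\boxed{...}`; the function name and the exact-match rule are illustrative, and real graders usually normalize answers (fractions, units, whitespace) far more carefully.

```python
import re

def boxed_answer_reward(response: str, reference: str) -> float:
    # Returns 1.0 if the last \boxed{...} in the response equals the reference, else 0.0.
    # The pattern does not handle nested braces such as \boxed{\frac{1}{2}}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0                      # no final answer given: no reward
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0

print(boxed_answer_reward(r"6 * 7 = 42, so the answer is \boxed{42}", "42"))
print(boxed_answer_reward(r"I think it is \boxed{41}", "42"))
print(boxed_answer_reward("The answer is 42.", "42"))
```

Rewards like this one feed directly into a policy-gradient method; nothing is learned on the reward side, so the reward model itself cannot drift, although a loose checker can still be exploited.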
- Test-time compute as a new scaling axis. o1/R1-class models trade tokens for accuracy: more thinking tokens, better answers — a dial (reasoning effort) exposed to users and a curve that compounds with train-time scaling.
- Distilled reasoning. SFT on traces sampled from a strong reasoning model transfers a surprising fraction of the skill to small models (R1-distill family) — cheaper than running RL on the small model itself (Chapter 07).
- Open challenge. Extending RLVR beyond verifiable domains — essays, strategy, taste — currently routes through model-as-judge rewards (rubric- or AI-feedback based), reintroducing the proxy-gaming problem in subtler form.
Constitutional AI & RLAIF
Human feedback does not scale to every edge case, and raters disagree. Constitutional AI (Anthropic) replaces much of the human signal with an explicit list of principles: the model critiques and revises its own outputs against the constitution (supervised phase), then an AI judge applies the same principles to generate preference labels for RL (RLAIF). The result is cheaper, more consistent, and — importantly — auditable: the normative choices live in a readable document rather than in a million implicit rating decisions.
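The two phases can be sketched as plain control flow. Everything here is hypothetical scaffolding: `critique`, `revise`, and `judge` stand in for calls to a language model, and the toy versions below only do string matching so that the example runs on its own. The sketch also walks every principle in order, which is a simplification chosen for the example.

```python
def critique_and_revise(prompt, draft, principles, critique, revise):
    """Supervised phase: critique the draft against each principle, then revise it."""
    for principle in principles:
        feedback = critique(prompt, draft, principle)
        draft = revise(prompt, draft, feedback)
    return draft   # revised responses become SFT targets

def ai_preference_label(prompt, a, b, principles, judge):
    """RLAIF phase: an AI judge orders two responses; returns (chosen, rejected)."""
    return (a, b) if judge(prompt, a, b, principles) else (b, a)

# Toy stand-ins for model calls: a "principle" is just a word to avoid.
def toy_critique(prompt, response, principle):
    return principle if principle in response else ""

def toy_revise(prompt, response, feedback):
    return response.replace(feedback, "[removed]") if feedback else response

def toy_judge(prompt, a, b, principles):
    violations = lambda text: sum(text.count(p) for p in principles)
    return violations(a) <= violations(b)   # True means a is at least as good as b

principles = ["insult", "threat"]
draft = "an insult and a threat"
revised = critique_and_revise("p", draft, principles, toy_critique, toy_revise)
chosen, rejected = ai_preference_label("p", draft, revised, principles, toy_judge)
print(revised)
print(chosen == revised)
```

The `(chosen, rejected)` pairs play the same role as human comparisons: they can train a reward model or be used directly in a preference objective.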
Production stacks blend everything in this chapter: human preferences where stakes are high, AI feedback for breadth, verifiable rewards where possible, plus deliberate safety training (refusal boundaries, jailbreak robustness) and post-hoc evals/red-teaming as the release gate.
You rarely get to post-train a frontier model — but you can adapt one. Chapter 06: fine-tuning as a consumer of all the machinery above, and the low-rank algebra that makes it affordable.
Further reading
- Ouyang et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). — the canonical SFT → reward model → PPO pipeline.
- Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences. — the preference-based reward learning that RLHF rests on.
- Schulman et al. (2017). Proximal Policy Optimization Algorithms (PPO). — the RL algorithm used to optimize against the reward model.
- Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO). — preference tuning without a separate RL loop.
- Shao et al. (2024). DeepSeekMath. — introduces GRPO, the critic-free RL variant tuned for LLMs.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. — RL on verifiable rewards producing emergent reasoning.
- Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. — RLAIF and the principle-based critique loop.