Bandits & the contextual case
Before the conversation, the slot machine. A multi-armed bandit is the smallest non-trivial RL problem: one state, \(K\) actions ("arms"), each returning a noisy reward, and a single dilemma — explore arms you are unsure about, or exploit the one that has paid best so far. There are no transitions and no credit assignment across time, which is exactly what makes it the cleanest laboratory for the explore–exploit trade-off of Chapter 01.
Add a twist and you get the frame that makes the rest of this chapter click. In a contextual bandit, before each pull the agent sees a context \(x\) and chooses an arm \(a\) conditioned on it; the reward depends on both. The episode is exactly one step long: observe, act, get rewarded, done. There is no \(s_{t+1}\) to plan toward, so the discount factor and the Bellman recursion fall away entirely.
That reframing is the load-bearing idea of the whole chapter. An LLM response is a single action drawn from a policy \(\pi_\theta(y \mid x)\) — yes, it is built token by token, but the reward arrives once, on the finished sequence, so the optimization is bandit-shaped even though the generation is sequential. The action space is the set of all token sequences, combinatorially huge, which is why we never enumerate arms; we sample, score, and nudge the sampling distribution. Two things are missing from EQ R6.1, and supplying them is the entire history that follows: where does \(r(x,a)\) come from when no environment hands it to us, and how do we optimize it when we cannot try every arm.
A caveat experts insist on: treating an LLM rollout as one bandit action throws away all intermediate structure. Per-token credit assignment (the dense-reward, token-level MDP view) is an active research frontier, and process-reward models that score reasoning steps rather than only final answers are exactly an attempt to reintroduce the horizon the bandit framing discards. The bandit picture is the right first model — not the last word.
RLHF — learning from human preferences
The reward in EQ R6.1 is the problem. "How good is this answer" has no closed form — helpfulness, honesty, and tone are not functions you can write down. The insight that unlocked modern alignment, due to Christiano and colleagues in 2017 and scaled to language by InstructGPT in 2022, is that people cannot reliably score a response on an absolute scale, but they can reliably compare two. So do not ask for a number; ask which of two completions is better, and learn a reward function that explains those choices.
The bridge from comparisons to a scalar is the Bradley–Terry model, a century-old model of paired comparisons. Assign each response a latent reward \(r_\phi(x,y)\); the probability that response \(y_w\) is preferred over \(y_l\) is the sigmoid of their reward difference.
The reward model \(r_\phi\) is itself a transformer — usually the supervised-fine-tuned policy with its token head replaced by a single scalar head reading the final hidden state. It is trained by maximum likelihood on a dataset of comparisons \(\{(x, y_w, y_l)\}\): minimize the negative log-likelihood of the human's choice under EQ R6.2.
With a learned \(r_\phi\) standing in for the human, the contextual-bandit objective of EQ R6.1 is finally concrete: maximize the reward model's score over completions the policy generates. The classic RLHF pipeline is three stages — supervised fine-tuning (SFT) to teach the format, reward-model training on preferences, then policy optimization against the reward model — and the third stage is the subject of §6.3.
# Bradley-Terry: fit a scalar reward per item from pairwise preferences (EQ R6.2-3)
import numpy as np
rng = np.random.default_rng(0)
# 4 responses with hidden "true" qualities; we only get to SEE comparisons
true_r = np.array([2.0, 1.0, 0.0, -1.0])
n = len(true_r)
# generate 600 noisy pairwise preferences: winner sampled by Bradley-Terry
pairs = rng.integers(0, n, (600, 2)); pairs = pairs[pairs[:,0] != pairs[:,1]]
p_win = 1 / (1 + np.exp(-(true_r[pairs[:,0]] - true_r[pairs[:,1]])))
i_wins = rng.random(len(pairs)) < p_win # True => left item won
r = np.zeros(n) # learned rewards, start at 0
for step in range(400): # gradient descent on EQ R6.3
w = np.where(i_wins, pairs[:,0], pairs[:,1]) # winner index per pair
l = np.where(i_wins, pairs[:,1], pairs[:,0]) # loser index per pair
pred = 1 / (1 + np.exp(-(r[w] - r[l]))) # P(winner beats loser) under model
g = np.zeros(n) # dL/dr ; (pred-1) flows to winner
np.add.at(g, w, (pred - 1)); np.add.at(g, l, (1 - pred))
r -= 0.05 * g / len(pairs)
r -= r.mean() # rewards fixed only up to a shift
print("true (centered):", (true_r - true_r.mean()).round(2))
print("learned(centered):", r.round(2))
print("ranking recovered:", list(np.argsort(-r)), "== ", list(np.argsort(-true_r)))
PPO for language models
Stage three optimizes the policy against the reward model. The workhorse is Proximal Policy Optimization (PPO), a policy-gradient method (Chapter 05) chosen for one property above all: it takes small, conservative steps. That conservatism is not incidental. The reward model is a fragile, frozen approximation; optimize against it too aggressively and the policy sprints off-distribution into regions where \(r_\phi\) is meaningless — and produces fluent nonsense that the reward model nonetheless loves.
PPO's mechanism is the clipped surrogate objective. Let \(\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) be the probability ratio between the updated and the data-collecting policy, and \(\hat A_t\) the advantage estimate. PPO maximizes the smaller of the unclipped and clipped products, which caps how far one update can move the policy.
On top of the clip, RLHF adds a second leash: a per-token KL penalty against the original SFT model. The reward actually optimized is not \(r_\phi\) alone but \(r_\phi\) minus a penalty for drifting away from where the policy started.
The cost is the story. Four large models resident simultaneously, online rollouts at every step, a separate value network to train, and a notorious sensitivity to hyperparameters — PPO-RLHF works, and it produced InstructGPT, ChatGPT, and the first generation of aligned assistants, but it is heavy, finicky, and hard to reproduce. Every method that follows is, in part, an attempt to keep RLHF's results while shedding its weight.
# PPO clipped surrogate vs the raw ratio objective (EQ R6.4)
import numpy as np
eps = 0.2
ratio = np.linspace(0.0, 2.5, 26) # pi_new / pi_old
def clip_obj(rho, A, eps=0.2):
return np.minimum(rho * A, np.clip(rho, 1-eps, 1+eps) * A)
A_pos, A_neg = 1.0, -1.0
obj_pos = clip_obj(ratio, A_pos, eps) # good action: A > 0
obj_neg = clip_obj(ratio, A_neg, eps) # bad action: A < 0
print(" ratio clip(A=+1) clip(A=-1)")
for r, op, on in list(zip(ratio, obj_pos, obj_neg))[::4]:
print(f" {r:5.2f} {op:8.3f} {on:9.3f}")
# the objective FLATTENS past the clip edges -> no reward for a giant step
print(f"\nA>0 objective is flat for ratio >= {1+eps}: ",
np.allclose(obj_pos[ratio >= 1+eps], 1+eps))
print(f"A<0 objective is flat for ratio <= {1-eps}: ",
np.allclose(obj_neg[ratio <= 1-eps], -(1-eps)))
plot_xy(ratio.tolist(), obj_pos.tolist())
DPO — preferences without RL
Here is the elegant turn. The KL-regularized objective of EQ R6.5 is not an open-ended search — it has a known, closed-form optimal policy. For a fixed reward \(r\), the policy that maximizes "expected reward minus \(\beta\)-KL to the reference" is the reference distribution reweighted by the exponentiated reward:
Rafailov and colleagues (2023) made the leap: substitute that inverted reward into the Bradley–Terry preference model (EQ R6.2). The intractable \(Z(x)\) is the same for both completions of a pair, so in the difference \(r(x,y_w) - r(x,y_l)\) it cancels exactly. What remains is a reward expressed purely as a log-ratio of the policy to the reference — and the entire reward-model-plus-RL pipeline collapses into a single supervised loss on preference pairs.
The gradient makes the behavior vivid. Its magnitude scales with how badly the implicit reward model currently ranks the pair — pairs the model already gets right contribute little, pairs it gets backwards contribute a lot — and its direction increases \(\log\pi_\theta(y_w\mid x)\) while decreasing \(\log\pi_\theta(y_l\mid x)\). DPO is preference learning that looks and runs exactly like supervised learning, and that simplicity made it the default for budget alignment almost overnight.
Honest caveats, because the field is not settled. DPO is offline: it optimizes on a fixed preference set and cannot explore beyond it, so it is sensitive to how well that data covers the policy's behavior, and the implicit reward can drift on out-of-distribution completions. Online and iterative variants (sampling fresh pairs, IPO's bounded objective, KTO's prospect-theory single-label loss) exist precisely to patch these gaps. Several careful studies find well-tuned PPO still edges out DPO on the hardest tasks; DPO's win is overwhelmingly one of simplicity and cost, not a clean dominance on quality.
# DPO loss on toy preferred/rejected pairs; check the gradient DIRECTION (EQ R6.7)
import numpy as np
beta = 1.0
# log-probs (policy and frozen reference) for chosen y_w and rejected y_l
lp_pi_w, lp_ref_w = -2.0, -2.3 # policy already prefers y_w a bit
lp_pi_l, lp_ref_l = -1.5, -2.4 # but policy still over-likes y_l
# implicit reward = beta * (log pi - log ref) -- the policy IS the reward model
r_w = beta * (lp_pi_w - lp_ref_w)
r_l = beta * (lp_pi_l - lp_ref_l)
gap = r_w - r_l
p_pref = 1 / (1 + np.exp(-gap)) # P(y_w > y_l) under EQ R6.2
loss = -np.log(p_pref)
print(f"implicit reward r_w={r_w:+.3f} r_l={r_l:+.3f} gap={gap:+.3f}")
print(f"P(chosen preferred) = {p_pref:.3f} DPO loss = {loss:.3f}")
# dL/d(logprob): coefficient (p_pref - 1) < 0 => RAISE logpi(y_w), LOWER logpi(y_l)
coef = p_pref - 1.0
g_w = beta * coef * (+1) # gradient wrt log pi(y_w)
g_l = beta * coef * (-1) # gradient wrt log pi(y_l)
print(f"\ngrad wrt logpi(y_w) = {g_w:+.3f} (negative -> ascent RAISES y_w)")
print(f"grad wrt logpi(y_l) = {g_l:+.3f} (positive -> ascent LOWERS y_l)")
print("direction: push probability mass from the rejected toward the chosen answer.")
GRPO & RLVR — verifiable rewards
DPO and PPO both lean on human preferences, with all the noise, expense, and gameability that entails. But for some tasks the reward needs no human at all: a math answer is right or wrong, code passes the unit tests or it does not. This is RLVR — reinforcement learning from verifiable rewards: replace the learned, hackable reward model with a deterministic checker that returns a clean, ungameable signal. It is the engine behind the reasoning models — DeepSeek-R1, OpenAI's o-series, and their kin — that surged through 2024–2025.
The optimizer of choice is GRPO — Group Relative Policy Optimization, introduced with DeepSeekMath. Its central move attacks PPO's most expensive component: the value network (the critic) that estimates a baseline for the advantage. GRPO deletes it. Instead, for each prompt it samples a group of \(G\) complete responses, scores them all, and uses the group's own statistics as the baseline — the advantage of a response is simply how far above or below the group average its reward sits.
Strip away the value network and what remains is almost startlingly simple: sample several answers, reward each (often just 1 for correct, 0 for wrong), standardize the rewards within the group, and push the policy toward the above-average answers. Run that loop on verifiable math and code, and reasoning behavior — longer chains of thought, self-checking, backtracking — emerges without any of it being explicitly supervised. That emergence, more than the algorithm itself, is what made GRPO the defining method of the reasoning era.
The recurring failure of every method in this chapter. The policy optimizes the measured reward, not the intended one — so any gap between them gets exploited. A reward model that slightly favors longer answers breeds verbosity; one that likes confident tone breeds confident wrongness; a verifiable checker with a loophole gets gamed by answers that pass the test without solving the task. This is Goodhart's law in a gradient: when a measure becomes a target, it ceases to be a good measure. The KL leash (EQ R6.5) is the main defense — it keeps the policy near the trustworthy region — but it only slows the drift, it does not remove the incentive.
# GRPO group-relative advantage from a group of sampled outputs (EQ R6.8)
import numpy as np
rng = np.random.default_rng(0)
# one prompt, G=8 sampled responses; verifiable reward = 1 if correct else 0
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float) # RLVR: pass/fail
G = len(correct)
mean = correct.mean()
std = correct.std() + 1e-8 # population std, EQ R6.8
adv = (correct - mean) / std # group-relative advantage
print(f"rewards : {correct.astype(int).tolist()}")
print(f"group mean (baseline) = {mean:.3f} group std = {std:.3f}")
print("advantages :", adv.round(3).tolist())
print("\ncorrect responses get +adv (reinforced), wrong get -adv (suppressed);")
print("the baseline is the GROUP itself -- no learned value network anywhere.")
# if EVERY sample is correct, std -> 0: the group gives no learning signal
allright = np.ones(G)
adv0 = (allright - allright.mean()) / (allright.std() + 1e-8)
print("\nall-correct group advantages:", adv0.round(3).tolist(),
"-> zero signal (nothing to prefer)")
Every method here turned a goal into a number and maximized it — and every failure was a player gaming the rules. Preference learning, reward hacking, and self-play are all strategic interaction in disguise. The Game Theory volume opens with the formal language for that: players, payoffs, strategies, and the equilibria that emerge when every agent optimizes against every other — including against the very humans whose preferences we just spent a chapter learning.
References
- Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
- Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback.