RL Meets LLMs — RLHF, DPO & GRPO

6.1

Bandits & the contextual case

Before the conversation, the slot machine. A multi-armed bandit is the smallest non-trivial RL problem: one state, $K$ actions ("arms"), each returning a noisy reward, and a single dilemma — explore arms you are unsure about, or exploit the one that has paid best so far. There are no transitions and no credit assignment across time, which is exactly what makes it the cleanest laboratory for the explore–exploit trade-off of Chapter 01.
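
The dilemma can be made concrete with a few lines of simulation. What follows is a minimal sketch, assuming Gaussian reward noise and an ε-greedy rule; neither is prescribed above, and the arm means, ε, and pull count are illustrative choices.

```python
import numpy as np

def run_eps_greedy(true_means, eps, n_pulls=2000, seed=0):
    """Play a K-armed Gaussian bandit with an epsilon-greedy rule."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)                  # pulls per arm
    est = np.zeros(K)                     # running mean reward per arm
    total = 0.0
    for _ in range(n_pulls):
        if rng.random() < eps:
            a = int(rng.integers(K))      # explore: uniform random arm
        else:
            a = int(np.argmax(est))       # exploit: best estimate so far
        reward = true_means[a] + rng.normal()
        counts[a] += 1
        est[a] += (reward - est[a]) / counts[a]   # incremental mean update
        total += reward
    return est, counts, total / n_pulls

true_means = np.array([0.2, 0.5, 1.0])    # hypothetical arm qualities
est, counts, avg = run_eps_greedy(true_means, eps=0.1)
print("estimated means:", est.round(2))
print("pulls per arm  :", counts.astype(int).tolist())
print("average reward :", round(avg, 3))
```

Setting `eps=0.0` removes exploration entirely, and the agent can lock onto whichever arm happened to pay first; that is the trade-off in miniature.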

Add a twist and you get the frame that makes the rest of this chapter click. In a contextual bandit, before each pull the agent sees a context $x$ and chooses an arm $a$ conditioned on it; the reward depends on both. The episode is exactly one step long: observe, act, get rewarded, done. There is no $s_{t+1}$ to plan toward, so the discount factor and the Bellman recursion fall away entirely.

EQ R6.1 — THE CONTEXTUAL-BANDIT OBJECTIVE $$ \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\; \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[\, r(x, a) \,\big] $$

$\mathcal{D}$ is the distribution of contexts; $\pi(a \mid x)$ the policy; $r(x,a)$ the reward for taking action $a$ in context $x$. Compare the full RL return (Vol RL · EQ R1.3): the sum over future steps has collapsed to a single expected reward, because the horizon is one. This is the exact shape of LLM alignment. Read $x$ as the prompt, $a$ as the entire generated response, and $r(x,a)$ as "how good was that answer" — and RLHF is nothing more than a contextual bandit over an astronomically large action space.

That reframing is the load-bearing idea of the whole chapter. An LLM response is a single action drawn from a policy $\pi_\theta(y \mid x)$ — yes, it is built token by token, but the reward arrives once, on the finished sequence, so the optimization is bandit-shaped even though the generation is sequential. The action space is the set of all token sequences, combinatorially huge, which is why we never enumerate arms; we sample, score, and nudge the sampling distribution. Two things are missing from EQ R6.1, and supplying them is the entire history that follows: where does $r(x,a)$ come from when no environment hands it to us, and how do we optimize it when we cannot try every arm.
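
"Sample, score, and nudge" can be sketched on a toy contextual bandit. The setup below is an assumption made for illustration: three contexts, four arms, a 0/1 reward, a tabular softmax policy, and a score-function (REINFORCE-style) update with a running-average baseline. None of these names or constants come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_arms = 3, 4
best_arm = np.array([2, 0, 3])            # hypothetical: the rewarded arm in each context
theta = np.zeros((n_ctx, n_arms))         # logits; pi(a|x) = softmax(theta[x])
baseline = np.zeros(n_ctx)                # running average reward per context

def policy(x):
    z = np.exp(theta[x] - theta[x].max())
    return z / z.sum()

lr = 0.5
for step in range(3000):
    x = int(rng.integers(n_ctx))          # observe a context x ~ D (uniform here)
    p = policy(x)
    a = int(rng.choice(n_arms, p=p))      # sample one action a ~ pi(.|x)
    r = float(a == best_arm[x])           # score it: one reward, then the episode ends
    grad_logp = -p                        # d log pi(a|x) / d theta[x] = onehot(a) - pi
    grad_logp[a] += 1.0
    theta[x] += lr * (r - baseline[x]) * grad_logp   # nudge the sampling distribution
    baseline[x] += 0.1 * (r - baseline[x])

for x in range(n_ctx):
    print(f"context {x}: pi = {policy(x).round(3)}  (rewarded arm {best_arm[x]})")
```

An LLM policy replaces the logit table with a network and the four arms with every possible token sequence, but the loop has the same three moves.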

A caveat experts insist on: treating an LLM rollout as one bandit action throws away all intermediate structure. Per-token credit assignment (the dense-reward, token-level MDP view) is an active research frontier, and process-reward models that score reasoning steps rather than only final answers are exactly an attempt to reintroduce the horizon the bandit framing discards. The bandit picture is the right first model — not the last word.

A contextual bandit episode is exactly how many environment steps long (observe context, take one action, receive one reward, terminate)? Enter the integer.

There is one context, one action, one reward, then termination — no $s_{t+1}$. The horizon is 1, which is why the discounted return of Chapter 01 collapses to a single expected reward (EQ R6.1).

6.2

RLHF — learning from human preferences

The reward in EQ R6.1 is the problem. "How good is this answer" has no closed form — helpfulness, honesty, and tone are not functions you can write down. The insight that unlocked modern alignment, due to Christiano and colleagues in 2017 and scaled to language by InstructGPT in 2022, is that people cannot reliably score a response on an absolute scale, but they can reliably compare two. So do not ask for a number; ask which of two completions is better, and learn a reward function that explains those choices.

The bridge from comparisons to a scalar is the Bradley–Terry model, a classical model of paired comparisons dating to Bradley and Terry (1952). Assign each response a latent reward $r_\phi(x,y)$; the probability that response $y_w$ is preferred over $y_l$ is the sigmoid of their reward difference.

EQ R6.2 — BRADLEY–TERRY PREFERENCE MODEL $$ P\big(y_w \succ y_l \mid x\big) \;=\; \frac{\exp r_\phi(x, y_w)}{\exp r_\phi(x, y_w) + \exp r_\phi(x, y_l)} \;=\; \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) $$

$\sigma$ is the logistic sigmoid; $y_w$ ("win") is the preferred completion, $y_l$ ("lose") the rejected one. Only the difference of rewards matters — the model is invariant to adding any constant to every reward, so the scale is fixed only up to a shift. Equal rewards give exactly $\sigma(0) = 0.5$: a coin flip when the two answers are equally good. Fitting $r_\phi$ is then a binary-classification problem on preference pairs.
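
The three properties just listed (softmax form equals sigmoid form, shift invariance, and the coin flip at equal rewards) are easy to check numerically. A small sketch, with reward values chosen arbitrarily for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_prob(r_w, r_l):
    """Bradley-Terry P(y_w beats y_l), written in the softmax form of EQ R6.2."""
    return np.exp(r_w) / (np.exp(r_w) + np.exp(r_l))

r_w, r_l = 1.3, 0.4                       # illustrative rewards, not from any dataset
p = bt_prob(r_w, r_l)
print("softmax form  :", round(p, 6))
print("sigmoid form  :", round(sigmoid(r_w - r_l), 6))         # same number
print("shifted by +5 :", round(bt_prob(r_w + 5, r_l + 5), 6))  # unchanged by the shift
print("equal rewards :", bt_prob(0.7, 0.7))                    # a coin flip
```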

The reward model $r_\phi$ is itself a transformer — usually the supervised-fine-tuned policy with its token head replaced by a single scalar head reading the final hidden state. It is trained by maximum likelihood on a dataset of comparisons $\{(x, y_w, y_l)\}$: minimize the negative log-likelihood of the human's choice under EQ R6.2.

EQ R6.3 — REWARD-MODEL LOSS $$ \mathcal{L}_{\text{RM}}(\phi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Big[\, \log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \Big] $$

This is logistic regression on the reward gap. The gradient pushes $r_\phi(x,y_w)$ up and $r_\phi(x,y_l)$ down until the model's predicted preference probability matches the humans'. A subtlety that bites in practice: the reward model is a frozen snapshot of human judgment, and as the policy drifts to exploit it (§6.5's reward hacking), its scores grow unreliable on exactly the off-distribution outputs the policy is now producing.

With a learned $r_\phi$ standing in for the human, the contextual-bandit objective of EQ R6.1 is finally concrete: maximize the reward model's score over completions the policy generates. The classic RLHF pipeline is three stages — supervised fine-tuning (SFT) to teach the format, reward-model training on preferences, then policy optimization against the reward model — and the third stage is the subject of §6.3.

PYTHON · RUNNABLE IN-BROWSER

# Bradley-Terry: fit a scalar reward per item from pairwise preferences (EQ R6.2-3)
import numpy as np
rng = np.random.default_rng(0)

# 4 responses with hidden "true" qualities; we only get to SEE comparisons
true_r = np.array([2.0, 1.0, 0.0, -1.0])
n = len(true_r)
# generate 600 noisy pairwise preferences: winner sampled by Bradley-Terry
pairs = rng.integers(0, n, (600, 2)); pairs = pairs[pairs[:,0] != pairs[:,1]]
p_win = 1 / (1 + np.exp(-(true_r[pairs[:,0]] - true_r[pairs[:,1]])))
i_wins = rng.random(len(pairs)) < p_win        # True => left item won

r = np.zeros(n)                                # learned rewards, start at 0
for step in range(400):                        # gradient descent on EQ R6.3
    w = np.where(i_wins, pairs[:,0], pairs[:,1])  # winner index per pair
    l = np.where(i_wins, pairs[:,1], pairs[:,0])  # loser index per pair
    pred = 1 / (1 + np.exp(-(r[w] - r[l])))    # P(winner beats loser) under model
    g = np.zeros(n)                            # dL/dr ; (pred-1) flows to winner
    np.add.at(g, w, (pred - 1)); np.add.at(g, l, (1 - pred))
    r -= 0.05 * g / len(pairs)
r -= r.mean()                                  # rewards fixed only up to a shift

print("true   (centered):", (true_r - true_r.mean()).round(2))
print("learned(centered):", r.round(2))
print("ranking recovered:", np.argsort(-r).tolist(), "==", np.argsort(-true_r).tolist())


INSTRUMENT R6.1 — PREFERENCE → REWARD-MODEL PIPELINE (BRADLEY–TERRY · EQ R6.2)

Interactive instrument. Inputs: reward r(y_w) of the chosen completion (default 2.0) and reward r(y_l) of the rejected completion (default 0.0). Readouts: reward gap Δ = r_w − r_l, P(chosen ≻ rejected) = σ(Δ), and RM loss −log σ(Δ).

The reward model only ever sees the gap between two completions, never an absolute score (EQ R6.2). Slide the two rewards: when they are equal, the preference probability sits at exactly 0.50, a coin flip, and the loss is $\log 2 \approx 0.69$. Push the chosen response above the rejected one and the loss falls toward zero while the sigmoid curve marks how confidently the model now predicts the human's pick. Make the gap negative (rate the rejected answer higher) and watch the loss climb above $\log 2$, growing roughly linearly with the size of the gap: the model is being told it ranked the pair backwards.

6.3

PPO for language models

Stage three optimizes the policy against the reward model. The workhorse is Proximal Policy Optimization (PPO), a policy-gradient method (Chapter 05) chosen for one property above all: it takes small, conservative steps. That conservatism is not incidental. The reward model is a fragile, frozen approximation; optimize against it too aggressively and the policy sprints off-distribution into regions where $r_\phi$ is meaningless — and produces fluent nonsense that the reward model nonetheless loves.

PPO's mechanism is the clipped surrogate objective. Let $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ be the probability ratio between the updated and the data-collecting policy, and $\hat A_t$ the advantage estimate. PPO maximizes the smaller of the unclipped and clipped products, which caps how far one update can move the policy.

EQ R6.4 — PPO CLIPPED SURROGATE $$ \mathcal{L}^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\Big[\, \min\big(\rho_t\,\hat A_t,\; \operatorname{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon)\,\hat A_t\big) \Big] $$

The ratio $\rho_t$ is clipped to $[1-\varepsilon,\, 1+\varepsilon]$ (typically $\varepsilon = 0.2$). When the advantage is positive, the objective stops rewarding the update once $\rho_t > 1+\varepsilon$; when negative, once $\rho_t < 1-\varepsilon$. The $\min$ makes the bound pessimistic — it removes the incentive to move the policy too far in a single step, a cheap surrogate for the trust region of TRPO without the second-order machinery.

On top of the clip, RLHF adds a second leash: a per-token KL penalty against the original SFT model. The reward actually optimized is not $r_\phi$ alone but $r_\phi$ minus a penalty for drifting away from where the policy started.

EQ R6.5 — THE KL-REGULARIZED RLHF REWARD $$ \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(y\mid x)\,\|\,\pi_{\text{ref}}(y\mid x)\big) $$

$\pi_{\text{ref}}$ is the frozen SFT reference; $\beta$ sets the strength of the leash. The KL term keeps the policy near a region where the reward model is trustworthy and where the model still speaks fluent, on-distribution language. The whole RLHF objective is this one line — and §6.4 shows it has a closed-form optimum, which is the crack DPO pries open. Standard PPO-RLHF needs four models in memory at once: policy, reference, reward model, and a value/critic network.
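
One common way to apply EQ R6.5 token by token is to charge $-\beta(\log\pi_\theta - \log\pi_{\text{ref}})$ at every generated token and add the reward-model score once, on the final token. That bookkeeping is an assumption here, since the text does not spell it out, and the log-probabilities below are made up. A sketch:

```python
import numpy as np

def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus r_phi on the last token."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    kl_per_token = logp_policy - logp_ref      # single-sample KL estimate at each token
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score                    # sequence-level reward arrives once, at the end
    return rewards, kl_per_token.sum()

# hypothetical log-probs of the sampled tokens under the policy and the frozen reference
logp_policy = [-0.5, -1.0, -0.2, -0.7]
logp_ref    = [-0.9, -1.1, -1.0, -0.8]
rewards, kl_est = kl_shaped_rewards(logp_policy, logp_ref, rm_score=2.0, beta=0.1)

print("per-token rewards      :", rewards.round(3).tolist())
print("sequence KL estimate   :", round(kl_est, 3))
print("total = r_phi - beta*KL:", round(rewards.sum(), 3))
```

Summed over the sequence, the shaped rewards reproduce the bracket of EQ R6.5 for this one sample: the reward-model score minus $\beta$ times the estimated KL.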

The cost is the story. Four large models resident simultaneously, online rollouts at every step, a separate value network to train, and a notorious sensitivity to hyperparameters — PPO-RLHF works, and it produced InstructGPT, ChatGPT, and the first generation of aligned assistants, but it is heavy, finicky, and hard to reproduce. Every method that follows is, in part, an attempt to keep RLHF's results while shedding its weight.

In PPO with $\varepsilon = 0.2$, the new policy makes an action four times as likely as the old policy, so $\rho_t = 4$, and the advantage $\hat A_t$ is positive. Using EQ R6.4, what effective ratio multiplies $\hat A_t$ in the clipped objective?

For positive $\hat A_t$ the objective is the $\min$, which selects the clipped branch once $\rho_t > 1+\varepsilon$. With $\varepsilon = 0.2$ the cap is $1 + 0.2 = $ 1.2: pushing the ratio from 1.2 toward 4 buys no extra objective, so PPO has no incentive to take the giant step.

PYTHON · RUNNABLE IN-BROWSER

# PPO clipped surrogate vs the raw ratio objective (EQ R6.4)
import numpy as np
eps = 0.2
ratio = np.linspace(0.0, 2.5, 26)             # pi_new / pi_old

def clip_obj(rho, A, eps=0.2):
    return np.minimum(rho * A, np.clip(rho, 1-eps, 1+eps) * A)

A_pos, A_neg = 1.0, -1.0
obj_pos = clip_obj(ratio, A_pos, eps)         # good action: A > 0
obj_neg = clip_obj(ratio, A_neg, eps)         # bad action:  A < 0

print(" ratio  clip(A=+1)  clip(A=-1)")
for r, op, on in list(zip(ratio, obj_pos, obj_neg))[::4]:
    print(f" {r:5.2f}   {op:8.3f}   {on:9.3f}")

# the objective FLATTENS past the clip edges -> no reward for a giant step
print(f"\nA>0 objective is flat for ratio >= {1+eps}: ",
      np.allclose(obj_pos[ratio >= 1+eps], 1+eps))
print(f"A<0 objective is flat for ratio <= {1-eps}: ",
      np.allclose(obj_neg[ratio <= 1-eps], -(1-eps)))
# plot_xy is assumed to be a helper supplied by the in-browser runtime; it is not part of NumPy
plot_xy(ratio.tolist(), obj_pos.tolist())


6.4

DPO — preferences without RL

Here is the elegant turn. The KL-regularized objective of EQ R6.5 is not an open-ended search — it has a known, closed-form optimal policy. For a fixed reward $r$, the policy that maximizes "expected reward minus $\beta$-KL to the reference" is the reference distribution reweighted by the exponentiated reward:

EQ R6.6 — THE OPTIMAL KL-REGULARIZED POLICY $$ \pi_r(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y}\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big) $$

This is a standard result (a Gibbs / Boltzmann distribution); the partition function $Z(x)$ is intractable because it sums over all sequences, which is why RLHF resorts to PPO instead of using it directly. But invert it — solve for $r$ in terms of $\pi_r$ — and the reward becomes a function of the policy itself, with $Z(x)$ appearing as an additive term that depends only on $x$.
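
On an action space small enough to enumerate, EQ R6.6 can be checked directly. The sketch below assumes a toy prompt with four possible responses, an invented reference distribution and reward vector, and $\beta = 0.5$. It builds the reweighted policy and confirms that randomly drawn rival policies do not beat it on the objective of EQ R6.5.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])   # toy reference over 4 "responses"
r = np.array([0.0, 1.0, 2.0, 0.5])         # toy rewards, one per response

def objective(pi):
    """Expected reward minus beta * KL(pi || pi_ref): EQ R6.5 for a single prompt."""
    return float(np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref)))

# EQ R6.6: reweight the reference by exp(r / beta), then normalize by Z
w = pi_ref * np.exp(r / beta)
Z = w.sum()
pi_star = w / Z

print("pi_star      :", pi_star.round(3))
print("J(pi_star)   :", round(objective(pi_star), 4))
print("beta * log Z :", round(beta * np.log(Z), 4))   # the optimal value equals beta * log Z

# randomly drawn rival policies should never score higher than pi_star
rivals = rng.dirichlet(np.ones(4), size=1000)
print("best rival J :", round(max(objective(p) for p in rivals), 4))
```

With four responses $Z$ is a four-term sum; with token sequences the same sum ranges over every possible completion, which is the intractability noted above.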

Rafailov and colleagues (2023) made the leap: substitute that inverted reward into the Bradley–Terry preference model (EQ R6.2). The intractable $Z(x)$ is the same for both completions of a pair, so in the difference $r(x,y_w) - r(x,y_l)$ it cancels exactly. What remains is a reward expressed purely as a log-ratio of the policy to the reference — and the entire reward-model-plus-RL pipeline collapses into a single supervised loss on preference pairs.
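
The cancellation is easy to see numerically. A sketch on an invented four-response example: invert EQ R6.6 to recover the reward from the policy, observe that the leftover is the same constant $\beta \log Z$ for every response, and confirm it drops out of a pairwise gap.

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])    # toy reference (illustrative)
r = np.array([0.0, 1.0, 2.0, 0.5])          # toy "true" rewards (illustrative)

w = pi_ref * np.exp(r / beta)
Z = w.sum()
pi_r = w / Z                                # optimal policy, EQ R6.6

# inverting EQ R6.6 gives r = beta * log(pi_r / pi_ref) + beta * log Z
implicit = beta * np.log(pi_r / pi_ref)     # computable from the two policies alone, no Z
print("r - implicit reward:", (r - implicit).round(4))   # one constant for every response
print("beta * log Z       :", round(beta * np.log(Z), 4))

# in a pairwise difference the constant cancels
y_w, y_l = 2, 0
print("true gap     :", r[y_w] - r[y_l])
print("implicit gap :", round(implicit[y_w] - implicit[y_l], 6))
```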

EQ R6.7 — THE DPO LOSS $$ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)}\!\left[\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right)\right] $$

Each of the two terms inside the sigmoid is an implicit reward $\hat r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$, one for $y_w$ and one for $y_l$: the policy is its own reward model. Minimizing this loss raises the likelihood of $y_w$ and lowers that of $y_l$, each measured relative to the reference. No reward model is trained, no rollouts are sampled, no RL loop runs; training is just a forward/backward pass on a fixed dataset, like ordinary supervised fine-tuning. The $\beta$ that was the KL strength in EQ R6.5 reappears here as the scale on the log-ratio gap inside the sigmoid, where it acts as an inverse temperature.

The gradient makes the behavior vivid. Its magnitude scales with how badly the implicit reward model currently ranks the pair — pairs the model already gets right contribute little, pairs it gets backwards contribute a lot — and its direction increases $\log\pi_\theta(y_w\mid x)$ while decreasing $\log\pi_\theta(y_l\mid x)$. DPO is preference learning that looks and runs exactly like supervised learning, and that simplicity made it the default for budget alignment almost overnight.
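
The magnitude claim follows from one line of calculus: for $L = -\log\sigma(\text{gap})$, the derivative with respect to the gap has size $\sigma(-\text{gap})$. A small sketch tabulating that weight for a few illustrative gaps:

```python
import numpy as np

def dpo_grad_weight(gap):
    """|dL/d(gap)| for L = -log sigmoid(gap), which equals sigmoid(-gap)."""
    return 1.0 / (1.0 + np.exp(gap))

# gap = implicit reward of the chosen answer minus that of the rejected answer
for gap in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"gap {gap:+.1f} -> gradient weight {dpo_grad_weight(gap):.3f}")
```

A pair ranked badly backwards (gap of −3) is weighted close to 1, while a pair already ranked confidently (gap of +3) is weighted close to 0.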

Honest caveats, because the field is not settled. DPO is offline: it optimizes on a fixed preference set and cannot explore beyond it, so it is sensitive to how well that data covers the policy's behavior, and the implicit reward can drift on out-of-distribution completions. Online and iterative variants (sampling fresh pairs, IPO's bounded objective, KTO's prospect-theory single-label loss) exist precisely to patch these gaps. Several careful studies find well-tuned PPO still edges out DPO on the hardest tasks; DPO's win is overwhelmingly one of simplicity and cost, not a clean dominance on quality.

True or false: DPO removes the need for a separately trained reward model and for an online RL optimization loop, optimizing preferences with a single supervised loss instead. (Answer true or false.)

EQ R6.7 depends only on the policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$ — the intractable $Z(x)$ cancelled and the reward model became implicit in the policy. There is no $r_\phi$ to train and no rollout loop; a single supervised gradient step on preference pairs suffices. The statement is true.

For a preference pair, the policy assigns the chosen completion twice the reference likelihood ($\pi_\theta/\pi_{\text{ref}} = 2$) and the rejected completion half ($\pi_\theta/\pi_{\text{ref}} = 0.5$). With $\beta = 1$, the implicit-reward gap is $\beta(\ln 2 - \ln 0.5) = 2\ln 2 \approx 1.386$. What preference probability $\sigma(\text{gap})$ does the model now assign to the chosen completion? (Use $\sigma(z) = 1/(1+e^{-z})$.)

Since $e^{2\ln 2} = (e^{\ln 2})^2 = 2^2 = 4$, we have $e^{-1.386} = 1/4 = 0.25$. So $\sigma(2\ln 2) = \dfrac{1}{1 + 0.25} = \dfrac{1}{1.25} = $ 0.8. The DPO gradient keeps pushing this toward 1 — raising $\pi_\theta(y_w)$, lowering $\pi_\theta(y_l)$.

PYTHON · RUNNABLE IN-BROWSER

# DPO loss on toy preferred/rejected pairs; check the gradient DIRECTION (EQ R6.7)
import numpy as np
beta = 1.0
# log-probs (policy and frozen reference) for chosen y_w and rejected y_l
lp_pi_w,  lp_ref_w  = -2.0, -2.3      # policy already prefers y_w a bit
lp_pi_l,  lp_ref_l  = -1.5, -2.4      # but policy still over-likes y_l

# implicit reward = beta * (log pi - log ref)  -- the policy IS the reward model
r_w = beta * (lp_pi_w - lp_ref_w)
r_l = beta * (lp_pi_l - lp_ref_l)
gap = r_w - r_l
p_pref = 1 / (1 + np.exp(-gap))               # P(y_w > y_l) under EQ R6.2
loss   = -np.log(p_pref)
print(f"implicit reward  r_w={r_w:+.3f}  r_l={r_l:+.3f}   gap={gap:+.3f}")
print(f"P(chosen preferred) = {p_pref:.3f}   DPO loss = {loss:.3f}")

# dL/d(logprob): coefficient (p_pref - 1) < 0 => RAISE logpi(y_w), LOWER logpi(y_l)
coef = p_pref - 1.0
g_w  = beta * coef * (+1)                      # gradient wrt log pi(y_w)
g_l  = beta * coef * (-1)                      # gradient wrt log pi(y_l)
print(f"\ngrad wrt logpi(y_w) = {g_w:+.3f}  (negative -> gradient DESCENT raises y_w)")
print(f"grad wrt logpi(y_l) = {g_l:+.3f}  (positive -> gradient DESCENT lowers y_l)")
print("direction: push probability mass from the rejected toward the chosen answer.")


INSTRUMENT R6.2 — DPO vs PPO (SAME OBJECTIVE · TWO MACHINES · EQ R6.5–R6.7)

Interactive instrument. Control: optimizer toggle (PPO-RLHF or DPO). Readouts: models in memory, online rollouts, separate reward model.

Both targets optimize the same KL-regularized objective (EQ R6.5). Toggle between them: PPO-RLHF trains a reward model, then samples online rollouts and runs four models at once (policy, reference, reward, critic); DPO proves that objective has a closed-form optimum (EQ R6.6), folds the reward into the policy (EQ R6.7), and reduces the whole thing to one supervised loss over a fixed preference set — two models, no rollouts, no reward model. The stages light up to show exactly which pieces each pipeline keeps.

6.5

GRPO & RLVR — verifiable rewards

DPO and PPO both lean on human preferences, with all the noise, expense, and gameability that entails. But for some tasks the reward needs no human at all: a math answer is right or wrong, code passes the unit tests or it does not. This is RLVR, reinforcement learning from verifiable rewards: replace the learned, hackable reward model with a deterministic checker whose signal is clean and far harder to game (though, as the reward-hacking note below makes clear, a checker with a loophole can still be exploited). It is the engine behind the reasoning models that surged through 2024–2025: DeepSeek-R1 is the most openly documented, and OpenAI's o-series and their kin are reported to rely on similar large-scale RL.

The optimizer of choice is GRPO — Group Relative Policy Optimization, introduced with DeepSeekMath. Its central move attacks PPO's most expensive component: the value network (the critic) that estimates a baseline for the advantage. GRPO deletes it. Instead, for each prompt it samples a group of $G$ complete responses, scores them all, and uses the group's own statistics as the baseline — the advantage of a response is simply how far above or below the group average its reward sits.

EQ R6.8 — GRPO GROUP-RELATIVE ADVANTAGE $$ \hat A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}, \qquad i = 1, \ldots, G $$

$r_i$ is the reward of the $i$-th sampled response to the same prompt; the baseline is the group mean and the scale is the group standard deviation. No learned value network is needed — the baseline that PPO spends a whole second model to estimate, GRPO reads straight off a batch of samples. A response beats its peers $\Rightarrow$ positive advantage $\Rightarrow$ its tokens are reinforced; it lags $\Rightarrow$ negative $\Rightarrow$ suppressed. The normalized advantage then enters a PPO-style clipped objective (EQ R6.4) with the usual KL leash to the reference.

Strip away the value network and what remains is almost startlingly simple: sample several answers, reward each (often just 1 for correct, 0 for wrong), standardize the rewards within the group, and push the policy toward the above-average answers. Run that loop on verifiable math and code, and reasoning behavior — longer chains of thought, self-checking, backtracking — emerges without any of it being explicitly supervised. That emergence, more than the algorithm itself, is what made GRPO the defining method of the reasoning era.
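
That loop fits in a few lines on a toy problem. The sketch below makes several simplifying assumptions that are not part of GRPO as described: the "policy" is a softmax over five candidate answers to one prompt, exactly one answer is verifiably correct, and the update is a plain policy-gradient step with no clipping and no KL term.

```python
import numpy as np

rng = np.random.default_rng(0)
K, G, lr = 5, 8, 0.5
correct_answer = 3                        # hypothetical index of the verifiably correct answer
theta = np.zeros(K)                       # logits of a toy policy over K candidate answers

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):
    p = softmax(theta)
    samples = rng.choice(K, size=G, p=p)                   # sample a group of G answers
    rewards = (samples == correct_answer).astype(float)    # verifier: 1 if correct else 0
    std = rewards.std()
    if std == 0.0:
        continue                          # all right or all wrong: the group carries no signal
    adv = (rewards - rewards.mean()) / std                 # EQ R6.8
    grad = np.zeros(K)
    for a, A in zip(samples, adv):
        g = -p.copy()
        g[a] += 1.0                       # d log pi(a) / d theta = onehot(a) - pi
        grad += A * g
    theta += lr * grad / G                # push toward above-average answers

print("P(correct answer) after training:", round(softmax(theta)[correct_answer], 3))
```

The `continue` branch covers the case where every sample in the group receives the same reward: the standardized advantage is undefined, and nothing is learned from that prompt.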

REWARD HACKING

The recurring failure of every method in this chapter. The policy optimizes the measured reward, not the intended one — so any gap between them gets exploited. A reward model that slightly favors longer answers breeds verbosity; one that likes confident tone breeds confident wrongness; a verifiable checker with a loophole gets gamed by answers that pass the test without solving the task. This is Goodhart's law in a gradient: when a measure becomes a target, it ceases to be a good measure. The KL leash (EQ R6.5) is the main defense — it keeps the policy near the trustworthy region — but it only slows the drift, it does not remove the incentive.
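
A toy version of the failure, under loudly stated assumptions: five possible responses, an invented "true quality" vector, and a proxy reward that agrees with it on ordinary responses but overrates two low-quality ones. Instead of running an optimizer, the sketch uses the KL-regularized optimum $\pi \propto \pi_{\text{ref}} \exp(r/\beta)$ for the proxy and loosens the leash by shrinking $\beta$.

```python
import numpy as np

pi_ref  = np.array([0.50, 0.30, 0.15, 0.04, 0.01])   # toy reference over 5 responses
quality = np.array([0.0, 1.0, 2.0, 0.5, -1.0])        # what we actually want (invented)
proxy   = np.array([0.0, 1.0, 2.0, 2.5, 3.0])         # what the reward model scores (invented)

def optimal_policy(beta):
    """KL-regularized optimum for the proxy reward: pi_ref * exp(proxy / beta), normalized."""
    logits = np.log(pi_ref) + proxy / beta
    w = np.exp(logits - logits.max())     # subtract the max for numerical stability
    return w / w.sum()

betas = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.1]          # leash weakens from left to right
rows = []
for beta in betas:
    pi = optimal_policy(beta)
    kl = float(np.sum(pi * np.log(pi / pi_ref)))
    rows.append((beta, float(pi @ proxy), float(pi @ quality), kl))
    print(f"beta={beta:5.2f}  proxy={rows[-1][1]:6.3f}  true={rows[-1][2]:6.3f}  KL={kl:6.3f}")
```

As $\beta$ shrinks, the proxy score and the KL from the reference rise at every step, while true quality improves at first and then falls once probability mass moves onto the overrated responses.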

True or false: GRPO estimates the advantage of each response from the statistics of a group of sampled outputs for the same prompt, removing the need for a separately learned value (critic) network. (Answer true or false.)

EQ R6.8's baseline is the group mean and its scale the group std — both read directly off a batch of $G$ sampled responses, never from a learned critic. That is precisely how GRPO drops PPO's value network. The statement is true.

A GRPO group of $G = 4$ responses to one prompt scores rewards $(1, 0, 0, 1)$ (1 = correct). Using EQ R6.8, what is the standardized advantage $\hat A_i$ of a correct response? (Mean $= 0.5$; population std $= 0.5$.)

Mean $= (1+0+0+1)/4 = 0.5$. Variance $= \frac{1}{4}\big[(0.5)^2\cdot 4\big] = 0.25$, so std $= 0.5$. A correct response: $\hat A = (1 - 0.5)/0.5 = $ 1. A wrong one gets $(0-0.5)/0.5 = -1$. The two advantages are symmetric, and the baseline came from the group itself rather than from any extra network.

PYTHON · RUNNABLE IN-BROWSER

# GRPO group-relative advantage from a group of sampled outputs (EQ R6.8)
import numpy as np
rng = np.random.default_rng(0)

# one prompt, G=8 sampled responses; verifiable reward = 1 if correct else 0
correct = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)   # RLVR: pass/fail
G = len(correct)

mean = correct.mean()
std  = correct.std() + 1e-8                    # population std, EQ R6.8
adv  = (correct - mean) / std                  # group-relative advantage

print(f"rewards            : {correct.astype(int).tolist()}")
print(f"group mean (baseline) = {mean:.3f}   group std = {std:.3f}")
print("advantages         :", adv.round(3).tolist())
print("\ncorrect responses get +adv (reinforced), wrong get -adv (suppressed);")
print("the baseline is the GROUP itself -- no learned value network anywhere.")

# if EVERY sample is correct, std -> 0: the group gives no learning signal
allright = np.ones(G)
adv0 = (allright - allright.mean()) / (allright.std() + 1e-8)
print("\nall-correct group advantages:", adv0.round(3).tolist(),
      "-> zero signal (nothing to prefer)")


INSTRUMENT R6.3 — REWARD HACKING (PROXY REWARD vs TRUE QUALITY · EQ R6.5)

Interactive instrument. Controls: optimization pressure in steps (default 40) and KL leash β (default 0.20). Readouts: proxy reward r_φ, true quality, KL from reference.

The mint curve is what the reward model measures; the blue curve is the true quality you actually want. Early optimization lifts both — the proxy is a decent stand-in near the reference. Crank up the pressure and they diverge: the proxy keeps climbing while true quality peaks and falls as the policy learns to exploit the reward model's blind spots. This gap is reward hacking, and the dashed line is where true quality turns over. Tighten the KL leash β and the policy stays near the reference — flatter proxy gains, but the divergence is delayed and shallower. Loosen it toward 0 and the hack arrives fast and hard.

Every method here turned a goal into a number and maximized it — and every failure was a player gaming the rules. Preference learning, reward hacking, and self-play are all strategic interaction in disguise. The Game Theory volume opens with the formal language for that: players, payoffs, strategies, and the equilibria that emerge when every agent optimizes against every other — including against the very humans whose preferences we just spent a chapter learning.

6.R

References

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS — the original preference-to-reward pipeline and the Bradley–Terry reward model (EQ R6.2–R6.3) that RLHF scaled to language.
Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS — the three-stage SFT → reward model → PPO RLHF recipe behind ChatGPT; source of the KL-regularized objective (EQ R6.5).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv — the clipped surrogate objective (EQ R6.4) used as the RLHF policy optimizer.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS — DPO; the closed-form optimal policy (EQ R6.6) and the supervised preference loss (EQ R6.7) that skip the reward model and RL loop.
Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv — introduces GRPO; the group-relative advantage (EQ R6.8) that removes PPO's value network.
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv — RLVR with GRPO at scale; reasoning behavior emerging from verifiable rewards (§6.5).
Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback. NeurIPS — the reward-hacking dynamics of over-optimizing a learned reward model (Instrument R6.3, §6.5).