Policy Gradients & Actor-Critic

4.1

Optimizing the policy directly

The value-based methods of the previous chapters — Q-learning, SARSA — all share a shape: learn a value function, then read a policy off it by taking $\arg\max_a Q(s,a)$. The value function is the object you fit; the policy is a side effect. Policy-gradient methods invert this. They treat the policy as the primary object, give it its own parameters $\theta$, and optimize those parameters to maximize expected return directly. There is no $\arg\max$ at the end — the policy is the answer.

Write the policy as a differentiable function $\pi_\theta(a \mid s)$: a neural network whose output is a probability distribution over actions, with parameters $\theta$ you can move. The quantity we want to maximize is the expected return under that policy — the same return from Chapter 01 (EQ R1.3), now viewed as a function of $\theta$:

EQ R4.1 — THE OBJECTIVE $$ J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[\, R(\tau) \,\big] \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r_{t+1} \,\right] $$

$\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory the policy rolls out; $R(\tau)$ is its total discounted return. The expectation is over every source of randomness — the policy's action choices and the environment's transitions. We are no longer fitting a value; we are doing gradient ascent on the thing we actually care about. The only obstacle is that the distribution we average over, $\pi_\theta$, is itself what we are differentiating — and that is exactly what the next section resolves.
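To make EQ R4.1 concrete, here is a minimal sketch that estimates $J(\theta)$ by averaging sampled returns. The task is a hypothetical toy, not one from this chapter: a state-free policy picks action 1 with probability `p_right` on each of three steps and earns reward 1 whenever it does, so the exact value $p\,(1 + \gamma + \gamma^2)$ is available for comparison. The names `estimate_J`, `exact_J` and `p_right` are illustrative assumptions.

```python
import numpy as np

def estimate_J(p_right, gamma=0.9, horizon=3, episodes=20_000, seed=0):
    """Monte-Carlo estimate of J for a toy task (assumed for illustration):
    each step pays reward 1 if action 1 is chosen, 0 otherwise."""
    rng = np.random.default_rng(seed)
    took_action_1 = rng.random((episodes, horizon)) < p_right
    rewards = took_action_1.astype(float)          # r_{t+1} for t = 0..horizon-1
    discounts = gamma ** np.arange(horizon)        # gamma^t
    returns = rewards @ discounts                  # R(tau) for every episode
    return returns.mean()

def exact_J(p_right, gamma=0.9, horizon=3):
    """Closed form for the same toy task: p * sum_t gamma^t."""
    return p_right * np.sum(gamma ** np.arange(horizon))

for p in (0.3, 0.7):
    print(f"p = {p:.1f}:  sampled J = {estimate_J(p):.3f}   exact J = {exact_J(p):.3f}")
```

Here $J$ is visibly a function of the policy parameter: moving `p_right` moves the average return, and that dependence is what gradient ascent exploits.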

Why bother, when value methods already work? Three reasons make policy gradients indispensable, not merely an alternative.

Continuous and high-dimensional action spaces. Taking $\arg\max_a Q(s,a)$ over a continuous $a$ — a torque, a steering angle, a 50-joint robot pose — is itself an optimization problem at every step. A policy network simply outputs the action (or its distribution), no inner search required. This is why robotics and control are policy-gradient territory.
Stochastic optimal policies. The greedy policy of a value method is deterministic. But under partial observability, and in any game where being predictable can be exploited, the optimal policy can be irreducibly random: rock-paper-scissors has no good deterministic strategy. Policy gradients can represent and learn such policies natively.
Smooth improvement. A small change to $\theta$ is a small change to the policy. Value methods can flip the entire greedy policy from one $\arg\max$ to another over an infinitesimal change in $Q$, which makes their learning brittle. Gradient ascent on $\pi_\theta$ moves the behavior continuously.

The cost of this directness is the dominant theme of the chapter: policy-gradient estimates are unbiased but high-variance. You are estimating a gradient from noisy rollouts of a stochastic policy in a stochastic world. Taming that variance — first with baselines (§4.3), then with a learned critic (§4.4) — is most of what separates a toy from a working algorithm.

4.2

The policy gradient theorem

To ascend $J(\theta)$ we need its gradient. The difficulty is that $\theta$ appears inside the distribution we are taking the expectation over, so we cannot just differentiate the integrand. The fix is the log-derivative trick (also called the score-function or likelihood-ratio estimator), an identity that turns the gradient of an expectation into an expectation of a gradient:

EQ R4.2 — THE LOG-DERIVATIVE TRICK $$ \nabla_\theta\, \mathbb{E}_{x \sim p_\theta}[\,f(x)\,] \;=\; \mathbb{E}_{x \sim p_\theta}\!\big[\, f(x)\, \nabla_\theta \log p_\theta(x) \,\big] $$

The single identity behind every policy gradient. It follows from $\nabla_\theta p_\theta = p_\theta\, \nabla_\theta \log p_\theta$ (because $\nabla \log p = \nabla p / p$). Its magic is that the right-hand side is itself an expectation under $p_\theta$, so it can be estimated by sampling: all we need is the score $\nabla_\theta \log p_\theta(x)$ at each sample, and $f$ itself never has to be differentiated. For trajectories, the environment's transition probabilities $P$ drop out entirely, because they do not depend on $\theta$: we never need a model of the world.
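A quick numerical check of EQ R4.2, as a sketch under assumed choices: take $p_\theta = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so that $\mathbb{E}[f] = \theta^2 + 1$ and the true gradient is $2\theta$. For this Gaussian the score is $\nabla_\theta \log p_\theta(x) = x - \theta$. The distribution, the test function and the sample size are illustrative, not taken from the text.

```python
import numpy as np

def score_function_gradient(theta, n_samples=200_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[x^2] with the log-derivative trick."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, 1.0, size=n_samples)   # samples from p_theta
    f = x ** 2                                   # f is only evaluated, never differentiated
    score = x - theta                            # grad_theta log N(x; theta, 1)
    return np.mean(f * score)

theta = 1.5
print(f"score-function estimate: {score_function_gradient(theta):.3f}")
print(f"exact gradient 2*theta : {2 * theta:.3f}")
```

The estimate lands near the exact value only because the sample count is large; with a few hundred samples it scatters widely, a first glimpse of the variance problem this chapter keeps returning to.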

Apply this to the objective. A trajectory's probability factorizes into the environment's transitions (which do not depend on $\theta$) and the policy's action choices (which do). When we take $\nabla_\theta \log p_\theta(\tau)$, every transition term differentiates to zero and only the policy terms survive. The result is the policy gradient theorem:

EQ R4.3 — THE POLICY GRADIENT THEOREM $$ \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \Psi_t \,\right] $$

$\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the score — the direction in parameter space that makes the action just taken more likely. $\Psi_t$ is a scalar weight that says how much we should reinforce that action. The whole zoo of policy-gradient algorithms is one choice: what to plug in for $\Psi_t$. The full return $R(\tau)$, the return-to-go $G_t$, the advantage $A^\pi(s_t,a_t)$, the TD error — each is a valid $\Psi_t$, and they trade bias against variance differently (§4.3–4.4).
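The step that the theorem compresses can be written out in three lines. This is a sketch for the simplest weight, $\Psi_t = R(\tau)$, and it writes $\rho_0$ for the initial-state distribution, a symbol introduced here only for the derivation.

$$ p_\theta(\tau) \;=\; \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) $$

$$ \nabla_\theta \log p_\theta(\tau) \;=\; \underbrace{\nabla_\theta \log \rho_0(s_0)}_{=\,0} \;+\; \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \;+\; \sum_{t=0}^{T} \underbrace{\nabla_\theta \log P(s_{t+1} \mid s_t, a_t)}_{=\,0} $$

$$ \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[\, R(\tau)\, \nabla_\theta \log p_\theta(\tau) \,\big] \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; R(\tau) \,\right] $$

The first line is the factorization, the second takes the log and differentiates, and the third applies EQ R4.2 with $x = \tau$ and $f = R$.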

The intuition is worth stating in plain language, because it is the entire algorithm. Each gradient step nudges the parameters to increase the log-probability of actions that led to high reward, and decrease it for actions that led to low reward, weighted by how good the outcome was. The policy is not told the right action — it is only told whether what it did was, on balance, worth doing more often. That is trial-and-error learning written as calculus.

A softmax policy over two actions currently assigns $\pi_\theta(a \mid s) = 0.6$ to the action the agent actually sampled. For a softmax parameterization the score with respect to that action's logit is $1 - \pi_\theta(a \mid s)$. What is the score $\nabla_{\theta_a} \log \pi_\theta(a \mid s)$?

For a softmax (the standard discrete policy), $\nabla_{\theta_i} \log \pi_\theta(a \mid s) = \mathbb{1}[i = a] - \pi_\theta(i \mid s)$. For the sampled action itself $(i = a)$ the indicator is $1$, so the score is $1 - \pi_\theta(a \mid s) = 1 - 0.6 = $ 0.4. It is positive because raising this action's logit raises its (sub-one) probability — the update will push exactly that way if the reward weight $\Psi_t$ is positive.
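The softmax score formula is easy to confirm numerically. The sketch below builds two logits that give the sampled action probability 0.6 (the specific logits are an assumption; any pair whose difference is $\ln 1.5$ works) and compares the analytic score with a finite-difference gradient of $\log \pi_\theta(a \mid s)$.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return z / z.sum()

def softmax_score(logits, a):
    """Analytic gradient of log pi(a) w.r.t. every logit: 1[i = a] - pi(i)."""
    score = -softmax(logits)
    score[a] += 1.0
    return score

def finite_difference_score(logits, a, eps=1e-6):
    """Central-difference gradient of log pi(a), for comparison."""
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        bump = np.zeros_like(logits)
        bump[i] = eps
        up = np.log(softmax(logits + bump)[a])
        down = np.log(softmax(logits - bump)[a])
        grad[i] = (up - down) / (2 * eps)
    return grad

logits = np.array([np.log(1.5), 0.0])    # gives pi = [0.6, 0.4]
a = 0                                    # the sampled action
print("pi            :", softmax(logits))
print("analytic score:", softmax_score(logits, a))
print("finite diff   :", finite_difference_score(logits, a))
```

The other action's score comes out as $-0.4$: with two actions, raising the sampled action's log-probability necessarily lowers the alternative's.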

Two technical notes experts will insist on. First, the theorem is exact for the discounted objective only with a subtle discounting of the state distribution that practical implementations almost universally ignore; the resulting estimator is a slightly biased but well-behaved approximation that everyone uses. Second, the score-function estimator is unbiased but, as warned, high-variance — the same trajectory return $R(\tau)$ multiplies every action's score, so a single lucky or unlucky rollout swings the whole gradient. Fixing that is §4.3.

4.3

REINFORCE & the baseline

The oldest and simplest realization of EQ R4.3 is REINFORCE (Williams, 1992): a pure Monte-Carlo policy gradient. Run an episode to completion, compute the return-to-go $G_t$ from each step, and take one gradient step with $\Psi_t = G_t$. No value function, no bootstrapping — just rollouts and the log-derivative trick.

EQ R4.4 — REINFORCE UPDATE $$ \theta \;\leftarrow\; \theta \;+\; \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; G_t, \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_{k+1} $$

$\alpha$ is the learning rate, $G_t$ the return-to-go from step $t$. Note that only rewards after $a_t$ appear in $G_t$ — an action cannot be credited for reward that preceded it, the causality refinement that already cuts variance versus weighting by the whole-episode return. REINFORCE is unbiased and dead simple, but it learns slowly: it must wait for an entire episode, and the raw magnitude of $G_t$ makes its gradient estimates extremely noisy.
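The return-to-go in EQ R4.4 is usually computed in a single backward pass, using $G_t = r_{t+1} + \gamma\, G_{t+1}$. A minimal sketch (the function name `returns_to_go` and the example rewards are illustrative):

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k-t) * r_{k+1}, computed back to front.

    rewards[t] holds r_{t+1}, the reward that follows action a_t.
    """
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Three steps of reward 1 with gamma = 0.5 give G = 1.75, 1.5, 1.0
print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))
```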

That noise has a specific and fixable cause. Suppose every reward in your environment is large and positive — say returns hover around $+100$. Then every action gets reinforced (its log-probability pushed up), just by different amounts. The gradient is dominated by the shared offset of $100$ rather than by the differences that actually distinguish good actions from bad. The estimator is still unbiased, but its variance is enormous and learning crawls.

The cure is a baseline: subtract a reference value $b(s)$ from the return before weighting the score. The remarkable fact — the one that makes baselines free — is that any baseline that does not depend on the action leaves the gradient unbiased, because the expected score is zero:

EQ R4.5 — BASELINE LEAVES THE GRADIENT UNBIASED $$ \mathbb{E}_{a \sim \pi_\theta}\!\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\; b(s) \,\big] \;=\; b(s)\, \nabla_\theta \!\sum_{a} \pi_\theta(a \mid s) \;=\; b(s)\, \nabla_\theta\, 1 \;=\; 0 $$

Because probabilities sum to one, $\sum_a \pi_\theta(a\mid s) = 1$ is constant, so its gradient is exactly zero. Subtracting $b(s)$ therefore adds zero in expectation — the gradient stays pointed the same way — while it can dramatically reduce variance by re-centering the returns around their typical value. The near-optimal choice for $b(s)$ is the state-value $V^\pi(s)$: then the weight becomes the advantage $G_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)$, which asks the only question that matters — did this action beat the policy's own average from here?

So REINFORCE-with-baseline weights each score by $G_t - b(s_t)$. With $b(s) = V^\pi(s)$, an action that did better than expected gets reinforced and one that did worse gets suppressed — even if both produced positive raw return. This is the conceptual hinge of the chapter, and it points straight at actor-critic: if a learned $V^\pi(s)$ is the best baseline, learn one.
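EQ R4.5 can be verified exactly, without sampling, for a softmax policy: sum the baseline-weighted score over every action, weighted by that action's probability. In this sketch the logits and the baseline value are arbitrary assumed numbers.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_baseline_term(logits, b):
    """E_{a ~ pi}[ grad log pi(a) * b ], summed exactly over all actions."""
    pi = softmax(logits)
    n = len(logits)
    total = np.zeros(n)
    for a in range(n):
        score = np.eye(n)[a] - pi        # 1[i = a] - pi(i)
        total += pi[a] * score * b
    return total

logits = np.array([0.3, -1.2, 2.0])      # arbitrary policy
# Every component is zero up to floating-point rounding, however large b is.
print(expected_baseline_term(logits, b=100.0))
```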

True or false: subtracting a baseline $b(s)$ that depends only on the state (not the action) reduces the variance of the policy-gradient estimate without introducing any bias. (Answer true or false.)

By EQ R4.5, $\mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a\mid s)\, b(s)] = b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s) = b(s)\,\nabla_\theta 1 = 0$. The subtracted term contributes nothing in expectation, so the gradient is unchanged (no bias), while re-centering the returns can sharply cut variance. The statement is true — this is the single most important variance-reduction tool in policy-gradient RL.

PYTHON · RUNNABLE IN-BROWSER

# REINFORCE on a 2-armed bandit: a softmax policy ascends toward the better arm
import numpy as np
rng = np.random.default_rng(0)

true_mean = np.array([1.0, 2.0])          # arm 1 is genuinely better
theta = np.zeros(2)                        # policy logits (one state, no transitions)
alpha = 0.1

for t in range(400):
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy pi(a)
    a = rng.choice(2, p=p)                           # sample an action
    reward = true_mean[a] + rng.normal(0, 1)         # noisy reward
    score = -p.copy(); score[a] += 1.0               # d log pi / d theta = 1[i=a] - p
    theta += alpha * reward * score                  # EQ R4.4, one-step bandit
    if t in (0, 50, 200, 399):
        print(f"step {t:3d}:  pi = [{p[0]:.3f}, {p[1]:.3f}]")

p = np.exp(theta - theta.max()); p /= p.sum()
print(f"\nfinal policy pi(better arm) = {p[1]:.3f}  (started at 0.500)")
print("no value function, no argmax, just gradient ascent on the logits.")


INSTRUMENT R4.1 — POLICY-GRADIENT ON A BANDIT · SOFTMAX POLICY · ONLINE ASCENT · EQ R4.4

Controls: learning rate α (default 0.10); reward gap, arm B − arm A (default 1.0). Readouts: π(better arm), steps run, average reward.

Two arms; the green one pays more on average. The mint curve is the policy's probability of pulling the better arm — it starts at exactly 0.5 (no preference) and climbs as ascent reinforces the actions that earned reward. Press STEP ×20 to advance 20 rollouts at a time. Raise the learning rate and it climbs faster but jitters more; shrink the reward gap to zero and the two arms become indistinguishable, so the policy has nothing to learn and the curve wanders near 0.5. This is EQ R4.4 with one state — policy gradients stripped to their skeleton.

The instrument above also exposes the variance problem viscerally: with a small reward gap the curve thrashes, because the gradient signal is buried in noise. The next demonstration isolates exactly that effect — and the baseline's cure.

PYTHON · RUNNABLE IN-BROWSER

# Baseline = variance reduction. A large constant reward offset wrecks the
# naive gradient; subtracting a running baseline restores it. (EQ R4.5)
import numpy as np

def train(use_baseline, seed=1, steps=300, offset=10.0):
    r = np.random.default_rng(seed)
    theta = np.zeros(2); b = 0.0; sq = []
    mean = np.array([0.0, 1.0]) + offset          # arm 1 better, but huge offset
    for t in range(steps):
        p = np.exp(theta - theta.max()); p /= p.sum()
        a = r.choice(2, p=p)
        reward = mean[a] + r.normal(0, 1)
        adv = reward - (b if use_baseline else 0.0)   # baseline-subtracted weight
        score = -p.copy(); score[a] += 1.0
        g = adv * score
        sq.append(g[1] ** 2)                           # squared gradient (one coord)
        theta += 0.1 * g
        b += 0.1 * (reward - b)                        # running estimate of E[return]
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p[1], float(np.mean(sq))

p_no,  v_no  = train(False)
p_yes, v_yes = train(True)
print(f"no baseline : pi(best) = {p_no:.3f}   mean grad^2 = {v_no:.3f}")
print(f"w/ baseline : pi(best) = {p_yes:.3f}   mean grad^2 = {v_yes:.3f}")
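# sq holds squared gradients, so this ratio compares second moments, not variances
print(f"\nsecond-moment ratio (no baseline / baseline) = {v_no/v_yes:.1f}x")
print("better arm preferred:  no baseline =", p_no > 0.5, "  w/ baseline =", p_yes > 0.5)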


INSTRUMENT R4.2 — BASELINE VARIANCE REDUCTION · SAME GRADIENT, RE-CENTERED RETURNS · EQ R4.5

Control: reward offset, a constant added to all arms (default 10). Plot: distribution of the gradient weight Ψ across rollouts. Readouts: E[Ψ²] with no baseline, E[Ψ²] with the V(s) baseline, second-moment reduction.

The histogram shows the scalar weight $\Psi$ that multiplies the score, over many sampled returns. Grey is the raw return $G$; mint is the advantage $G - V(s)$ after subtracting the baseline. The two clouds have the same spread — but the mint one is re-centered on zero. Since the score has zero mean, the gradient estimator's variance is governed by $\mathbb{E}[\Psi^2]$, the second moment shown in the readouts, and re-centering $\Psi$ on zero collapses it. Crank the reward offset up: the grey weights march off to the right (every action looks "good"), inflating $\mathbb{E}[\Psi^2]$, while the baselined weights stay parked around zero. Both give the same expected gradient — EQ R4.5 — but the mint one is far easier to estimate from a handful of samples.

4.4

Actor-critic methods

REINFORCE-with-baseline still has a Monte-Carlo heart: it waits for a full episode and uses the actual return $G_t$. That keeps it unbiased but slow and noisy. Actor-critic methods take the natural next step suggested by §4.3 — learn the baseline as its own function — and then go further, using that learned value function to bootstrap, replacing the full return with a one-step estimate. Two networks, two jobs:

The actor is the policy $\pi_\theta(a \mid s)$. It chooses actions and is updated by the policy gradient — pushed toward actions the critic judges better than average.
The critic is a value estimator $V_w(s)$ (or $Q_w(s,a)$) with its own parameters $w$. It learns how much return to expect from a state, and supplies that estimate as both the baseline and the bootstrap target for the actor.

The actor acts; the environment returns reward and the next state; the critic scores how that step compared to its own prediction and feeds the advantage back to the actor. The actor learns what to do; the critic learns how good it is. They co-evolve.

The glue between them is the TD error $\delta$, the one-step temporal-difference signal (Chapter 03). It is the difference between a slightly-better-informed estimate of value — this step's reward plus the discounted value of where we landed — and the critic's current prediction:

EQ R4.6 — THE TD ERROR AS ADVANTAGE ESTIMATE $$ \delta_t \;=\; r_{t+1} + \gamma\, V_w(s_{t+1}) - V_w(s_t) \;\approx\; A^\pi(s_t, a_t) $$

$\delta_t$ is a low-variance, one-sample estimate of the advantage: if the step turned out better than the critic expected, $\delta_t > 0$ and the actor reinforces the action; if worse, $\delta_t < 0$ and it suppresses it. Bootstrapping from $V_w(s_{t+1})$ trades a little bias for a large variance cut — the actor no longer waits for the full return, and the noisy $G_t$ is replaced by reward plus one value lookup. This is the bias–variance dial at the heart of actor-critic.

The two updates, applied every step (online, no episode boundary required), are:

EQ R4.7 — ACTOR AND CRITIC UPDATES $$ \underbrace{\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\textbf{actor: policy gradient, weighted by }\delta_t} \qquad \underbrace{w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)}_{\textbf{critic: TD(0) regression}} $$

The same $\delta_t$ drives both: it tells the actor which way to push the policy and tells the critic how wrong its value estimate was. The critic update is ordinary semi-gradient TD(0): fit $V_w$ toward $r_{t+1} + \gamma V_w(s_{t+1})$, treating that target as a constant. The danger is that the two are learning simultaneously from each other: a biased critic biases the actor, which shifts the data the critic sees. Stability tricks exist precisely to keep this coupled system from spiraling: separate step sizes (usually with the critic learning faster than the actor, so that its estimates keep up with the changing policy), target networks, and careful tuning.
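A minimal tabular sketch of EQ R4.6 and EQ R4.7 working together. The environment is a hypothetical four-state corridor invented for illustration: the agent starts at the left end, action 1 moves right, action 0 moves left (or stays put at the wall), and reaching the right end pays +1 and ends the episode. The step sizes, episode count and step cap are untuned assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train_actor_critic(episodes=2000, gamma=0.9, alpha_theta=0.2, alpha_w=0.2, seed=0):
    rng = np.random.default_rng(seed)
    goal = 3                                  # states 0, 1, 2 are non-terminal
    theta = np.zeros((goal, 2))               # actor: one pair of logits per state
    V = np.zeros(goal + 1)                    # critic: V[goal] stays 0 (terminal)
    for _ in range(episodes):
        s = 0
        for _step in range(200):              # step cap keeps early episodes finite
            pi = softmax(theta[s])
            a = rng.choice(2, p=pi)
            s_next = s + 1 if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            delta = r + gamma * V[s_next] - V[s]      # TD error, EQ R4.6
            score = -pi
            score[a] += 1.0                           # 1[i = a] - pi(i)
            theta[s] += alpha_theta * delta * score   # actor update, EQ R4.7
            V[s] += alpha_w * delta                   # critic update (tabular: grad of V[s] is 1)
            s = s_next
            if s == goal:
                break
    return theta, V

theta, V = train_actor_critic()
for s in range(3):
    print(f"state {s}:  pi(right) = {softmax(theta[s])[1]:.3f}   V = {V[s]:.3f}")
```

Nothing is learned until the goal is first reached, because every $\delta_t$ is zero while the critic is all zeros. After that the value estimates propagate leftward one state at a time, and the actor follows them.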

Where this sits on the spectrum is the clean way to remember it. REINFORCE uses the full Monte-Carlo return $G_t$: zero bias, maximum variance, must wait for the episode to end. One-step actor-critic uses $\delta_t$: some bias from bootstrapping, much lower variance, learns online. In between sits a continuum — $n$-step returns and, most commonly today, Generalized Advantage Estimation (GAE), which exponentially blends advantage estimates across all horizons with a single knob $\lambda$ to tune the bias–variance trade-off explicitly.
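For reference, GAE (Schulman et al., 2016) is commonly computed by a backward recursion over TD errors, $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$. The sketch below assumes a single episode that ends at the last step, with `values` holding one extra entry for the final state; the function name and the sample numbers are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one episode.

    rewards[t] is r_{t+1}; values has len(rewards) + 1 entries, the last being
    the value of the final state (0 if the episode terminated there).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error, EQ R4.6
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.8, 0.0])     # made-up critic outputs; terminal value 0
print("lambda = 0 (one-step TD errors):  ", gae_advantages(rewards, values, 0.9, 0.0))
print("lambda = 1 (Monte-Carlo G_t - V): ", gae_advantages(rewards, values, 0.9, 1.0))
```

The two extremes recover the endpoints of the spectrum: $\lambda = 0$ returns the one-step TD errors, and $\lambda = 1$ returns the Monte-Carlo return minus the critic's value.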

True or false: in an actor-critic method, the critic is the component that estimates the value function (such as $V_w(s)$), while the actor is the policy that selects actions. (Answer true or false.)

Yes. The actor is the parameterized policy $\pi_\theta(a\mid s)$ that chooses actions; the critic is the value estimator $V_w(s)$ (or $Q_w(s,a)$) that judges them. The critic's value estimate supplies the baseline and the bootstrap target — via the TD error $\delta_t$ of EQ R4.6 — that the actor's policy gradient is weighted by. The statement is true.

A critic estimates $V_w(s) = 5.0$ for the current state and $V_w(s') = 6.0$ for the next. The agent takes an action, receives reward $r = 0.5$, and $\gamma = 0.9$. What is the TD error $\delta = r + \gamma V_w(s') - V_w(s)$ that drives both updates?

$\delta = 0.5 + 0.9 \times 6.0 - 5.0 = 0.5 + 5.4 - 5.0 = $ 0.9. Because $\delta > 0$, the step beat the critic's expectation: the actor will make this action more likely and the critic will revise $V_w(s)$ upward.

INSTRUMENT R4.3 — ACTOR-CRITIC ARCHITECTURE · TD ERROR FLOWS TO BOTH HEADS · EQ R4.6–R4.7

Controls: reward r (default 0.50), V(s) (default 5.0), V(s′) (default 6.0), discount γ (default 0.90). Readouts: TD error δ = r + γV(s′) − V(s), actor signal, critic signal.

The single scalar $\delta$ (EQ R4.6) is computed from one transition and routed to both heads (EQ R4.7). Raise $V(s')$ or the reward until the bootstrap target $r + \gamma V(s')$ exceeds the critic's current guess $V(s)$ and $\delta$ goes positive, so the actor reinforces the action and the critic raises $V(s)$. Drag the reward far enough negative and $\delta$ flips: the action is suppressed and $V(s)$ is pulled down. Watch $\gamma$ scale how much the next state's value counts: at $\gamma = 0$ the critic is purely myopic and $\delta$ reduces to $r - V(s)$. One number, two learners.

4.5

A2C / A3C

Naive online actor-critic has a quiet flaw inherited from all on-policy gradient methods: consecutive samples from a single rollout are highly correlated, and that correlation inflates gradient variance and destabilizes training. Value-based deep RL (DQN) broke the correlation with a replay buffer, but a policy gradient must be estimated on-policy — from data the current policy generated — so a buffer of stale experience is off-limits. The answer DeepMind shipped in 2016 was to break correlation a different way: run many actors in parallel.

A3C, Asynchronous Advantage Actor-Critic (Mnih et al., 2016), launches many actor-learners, each with its own copy of the policy, exploring different parts of the environment simultaneously and asynchronously applying their gradients to a shared set of parameters. Because the workers are in different states at any instant, the gradients they contribute are decorrelated. The parallelism itself plays the role the replay buffer played for DQN, and it does so while keeping each worker's updates on-policy, up to the small lag that asynchrony introduces.

EQ R4.8 — THE ADVANTAGE ACTOR-CRITIC OBJECTIVE $$ \nabla_\theta J \;=\; \mathbb{E}\!\big[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \hat{A}_t \,\big] \;+\; \beta\, \nabla_\theta\, \mathcal{H}\!\big[\pi_\theta(\cdot \mid s_t)\big], \qquad \hat{A}_t = \sum_{i=0}^{n-1}\gamma^{\,i} r_{t+i+1} + \gamma^{\,n} V_w(s_{t+n}) - V_w(s_t) $$

$\hat{A}_t$ is the $n$-step advantage — the bias–variance compromise between REINFORCE $(n = \infty)$ and one-step actor-critic $(n = 1)$. The second term is an entropy bonus: $\mathcal{H}[\pi]$ rewards the policy for staying uncertain, which discourages premature collapse onto a single action and keeps the agent exploring. $\beta$ tunes its strength. This objective — $n$-step advantage plus entropy regularization — is the template virtually every modern policy-gradient algorithm (A2C, PPO, IMPALA) builds on.
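The two ingredients of EQ R4.8 in code, as a hedged sketch: an $n$-step advantage computed from a reward sequence and critic values, and the entropy of a categorical policy. The array layout (`values[k]` aligned with `rewards[k]`, plus one trailing value) and the function names are assumptions for illustration, and the sketch requires $t + n$ to stay within the recorded trajectory.

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma):
    """A_hat_t = sum_{i<n} gamma^i r_{t+i+1} + gamma^n V(s_{t+n}) - V(s_t).

    rewards[k] is r_{k+1}; values[k] is V_w(s_k); requires t + n <= len(rewards).
    """
    discounts = gamma ** np.arange(n)
    n_step_return = np.dot(discounts, rewards[t:t + n]) + gamma ** n * values[t + n]
    return n_step_return - values[t]

def entropy(pi):
    """Shannon entropy H[pi] = -sum_a pi(a) log pi(a), in nats."""
    pi = np.asarray(pi, dtype=float)
    nonzero = pi[pi > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(nonzero * np.log(nonzero))

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.8, 0.0])      # made-up critic outputs
print("1-step advantage at t=0:", n_step_advantage(rewards, values, t=0, n=1, gamma=0.9))
print("3-step advantage at t=0:", n_step_advantage(rewards, values, t=0, n=3, gamma=0.9))
print("entropy, uniform policy:", entropy([0.5, 0.5]))
print("entropy, peaked policy :", entropy([0.99, 0.01]))
```

With $n = 1$ the advantage reduces to the TD error of EQ R4.6, and the entropy is largest for the uniform policy, which is why adding $\beta\,\mathcal{H}$ to the objective pulls the policy away from premature certainty.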

A2C, Advantage Actor-Critic, is the synchronous sibling and, in practice, the one most people reach for. Later experiments found that the asynchrony in A3C was not the source of the benefit; the parallelism was. So A2C runs the same many environments in lockstep, batches their transitions into one large synchronized update, and gets equal or better results with simpler, more GPU-friendly code. The lesson stuck: gather diverse on-policy experience in parallel, batch it, update once.

Algorithm	Ψ weight	Bias / variance	Data collection
REINFORCE	G_t (full return)	unbiased · high variance	one episode at a time
REINFORCE + baseline	G_t − V(s)	unbiased · lower variance	one episode at a time
One-step actor-critic	δ_t (TD error)	biased · low variance	fully online
A3C	n-step Â + entropy	tunable via n	parallel · asynchronous
A2C	n-step Â + entropy	tunable via n	parallel · synchronous

An honest caveat. Vanilla policy gradients — even with advantages and entropy — are notoriously step-size sensitive: too large a step can collapse the policy in a way it cannot recover from, because the update changes the very distribution the next batch is drawn from. The line of work that fixed this — trust regions (TRPO) and the clipped surrogate objective of PPO — is what made policy gradients robust enough to dominate, and is the natural sequel to this chapter. PPO is also, not coincidentally, the workhorse of RLHF: aligning a language model is a policy-gradient problem in disguise, with the reward model as the environment.

We have the policy-gradient skeleton; now scale it with deep networks and make it stable. Chapter 05 takes these ideas into deep reinforcement learning proper — function approximation with neural networks, the deadly triad of bootstrapping, off-policy learning and approximation, DQN on the value side, and the trust-region and clipped objectives (TRPO, PPO) that turned the brittle gradient ascent of this chapter into the reliable engine behind game-playing agents, robotics, and RLHF.

4.R

References

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 — the original REINFORCE estimator (EQ R4.4) and the log-derivative / score-function trick behind every policy gradient.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 — the policy gradient theorem (EQ R4.3) and its compatibility with a learned value function, the formal basis of actor-critic.
Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning. ICML 2016 — A3C: parallel actor-learners, the n-step advantage and entropy bonus of EQ R4.8 (and the synchronous A2C that followed).
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — Chapter 13 develops policy-gradient methods, REINFORCE with baselines, and actor-critic exactly as framed here.
Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR 2016 — GAE, the λ-blended advantage estimator that sets the bias–variance dial between TD and Monte-Carlo (§4.4).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv — PPO's clipped surrogate objective, the step-size fix that made policy gradients robust and the workhorse of RLHF (the §4.5 sequel).