Optimizing the policy directly
The value-based methods of the previous chapters — Q-learning, SARSA — all share a shape: learn a value function, then read a policy off it by taking \(\arg\max_a Q(s,a)\). The value function is the object you fit; the policy is a side effect. Policy-gradient methods invert this. They treat the policy as the primary object, give it its own parameters \(\theta\), and optimize those parameters to maximize expected return directly. There is no \(\arg\max\) at the end — the policy is the answer.
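Write the policy as a differentiable function \(\pi_\theta(a \mid s)\): a neural network whose output is a probability distribution over actions, with parameters \(\theta\) you can move. The quantity we want to maximize is the expected return under that policy, the same return from Chapter 01 (EQ R1.3), now viewed as a function of \(\theta\):
\[ J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[ R(\tau) \big] \tag{R4.1} \]
Here \(\tau = (s_0, a_0, s_1, a_1, \dots)\) is a trajectory, \(p_\theta(\tau)\) is the distribution over trajectories induced by running \(\pi_\theta\) in the environment, and \(R(\tau)\) is the trajectory's return.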
Why bother, when value methods already work? Three reasons make policy gradients indispensable, not merely an alternative.
- Continuous and high-dimensional action spaces. Taking \(\arg\max_a Q(s,a)\) over a continuous \(a\) — a torque, a steering angle, a 50-joint robot pose — is itself an optimization problem at every step. A policy network simply outputs the action (or its distribution), no inner search required. This is why robotics and control are policy-gradient territory.
- Stochastic optimal policies. The greedy policy of a value method is deterministic. But in partially-observed environments and in every game with a bluff, the optimal policy is irreducibly random — rock-paper-scissors has no good deterministic strategy. Policy gradients can represent and learn such policies natively.
- Smooth improvement. A small change to \(\theta\) is a small change to the policy. Value methods can flip the entire greedy policy from one \(\arg\max\) to another over an infinitesimal change in \(Q\), which makes their learning brittle. Gradient ascent on \(\pi_\theta\) moves the behavior continuously.
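The third point can be made concrete with a few lines of NumPy. The sketch below is illustrative only: the action values are made-up numbers, and reusing them as softmax logits is an assumption chosen to keep the comparison minimal. It shows a tiny change in \(Q\) flipping the greedy action while a softmax policy over the same numbers barely moves.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical action values for one state; the two actions are nearly tied.
q_before = np.array([1.000, 1.001])
q_after = np.array([1.001, 1.000])    # a tiny change that swaps the ranking

# Greedy (argmax) policy: the chosen action flips outright.
greedy_before = int(np.argmax(q_before))
greedy_after = int(np.argmax(q_after))

# Softmax policy with the same numbers as logits: the distribution barely moves.
pi_before = softmax(q_before)
pi_after = softmax(q_after)
policy_shift = float(np.abs(pi_after - pi_before).max())

print(f"greedy action: {greedy_before} -> {greedy_after}")
print(f"softmax policy: {pi_before.round(4)} -> {pi_after.round(4)}")
print(f"largest change in any action probability: {policy_shift:.5f}")
```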
The cost of this directness is the dominant theme of the chapter: policy-gradient estimates are unbiased but high-variance. You are estimating a gradient from noisy rollouts of a stochastic policy in a stochastic world. Taming that variance — first with baselines (§4.3), then with a learned critic (§4.4) — is most of what separates a toy from a working algorithm.
The policy gradient theorem
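To ascend \(J(\theta)\) we need its gradient. The difficulty is that \(\theta\) appears inside the distribution we are taking the expectation over, so we cannot just differentiate the integrand. The fix is the log-derivative trick (also called the score-function or likelihood-ratio estimator), an identity that turns the gradient of an expectation into an expectation of a gradient:
\[ \nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\big[ f(x) \big] = \mathbb{E}_{x \sim p_\theta}\big[ f(x) \, \nabla_\theta \log p_\theta(x) \big] \tag{R4.2} \]
It follows from \(\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)\): the factor of \(p_\theta(x)\) lets the right-hand side be written as an expectation again, which is what makes it estimable from samples.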
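The identity is easy to check numerically. The sketch below is a sanity check under stated assumptions, not part of any algorithm: it uses a three-outcome categorical distribution with softmax logits and an arbitrary payoff vector `f` (both made up for illustration), computes the exact gradient of the expectation in closed form, and compares it with the sampled score-function estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])    # logits of a categorical p_theta(x), illustrative values
f = np.array([1.0, 3.0, -2.0])         # arbitrary payoff f(x) for x = 0, 1, 2
p = softmax(theta)

# Exact gradient of E[f(x)] = sum_x p(x) f(x) with respect to the logits.
# For softmax, d p(x) / d theta_i = p(x) * (1[x == i] - p(i)),
# which gives grad_i = p(i) * (f(i) - E[f]).
exact = p * (f - p @ f)

# Score-function estimate: average f(x) * grad log p(x) over samples x ~ p.
n = 200_000
x = rng.choice(3, size=n, p=p)
score = np.eye(3)[x] - p               # grad_theta log p(x) = onehot(x) - p, one row per sample
estimate = (f[x][:, None] * score).mean(axis=0)

print("exact    :", exact.round(4))
print("estimate :", estimate.round(4))
```

The estimate never differentiates `f` and never needs `p` in closed form beyond its score, which is the property that carries over to trajectories.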
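Apply this to the objective. A trajectory's probability factorizes into the environment's transitions (which do not depend on \(\theta\)) and the policy's action choices (which do). When we take \(\nabla_\theta \log p_\theta(\tau)\), every transition term differentiates to zero and only the policy terms survive. The result is the policy gradient theorem:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi_t \Big] \tag{R4.3} \]
Here \(\Psi_t\) is the weight on step \(t\)'s score. In the basic form it is the whole-trajectory return, \(\Psi_t = R(\tau)\); the later sections substitute lower-variance choices.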
The intuition is worth stating in plain language, because it is the entire algorithm. Each gradient step nudges the parameters to increase the log-probability of actions that led to high reward, and decrease it for actions that led to low reward, weighted by how good the outcome was. The policy is not told the right action — it is only told whether what it did was, on balance, worth doing more often. That is trial-and-error learning written as calculus.
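To see the theorem at work, here is a minimal sketch on a made-up two-state, two-action, two-step problem. The transition table `P1`, the reward table `R`, and the tabular softmax parameterization are all assumptions for illustration. It compares the score-function estimate, which uses only policy scores and never differentiates the transitions, against a finite-difference gradient of the exactly enumerated objective.

```python
import numpy as np

rng = np.random.default_rng(0)
P1 = np.array([[0.9, 0.2],     # P1[s, a] = probability of landing in state 1
               [0.5, 0.5]])
R = np.array([[0.0, 1.0],      # R[s, a] = immediate reward
              [2.0, -1.0]])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])    # tabular logits theta[s, a]

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # pi[s, a]

def exact_return(theta):
    """J(theta) for a two-step episode starting in state 0, by enumeration."""
    pi = policy(theta)
    v_last = (pi * R).sum(axis=1)               # expected reward of the final step, per state
    q0 = R[0] + (1 - P1[0]) * v_last[0] + P1[0] * v_last[1]
    return float(pi[0] @ q0)

# Reference gradient: central finite differences on the exact objective.
fd = np.zeros_like(theta)
eps = 1e-5
for idx in np.ndindex(*theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    fd[idx] = (exact_return(theta + d) - exact_return(theta - d)) / (2 * eps)

# Score-function estimate: average of R(tau) * sum_t grad log pi(a_t | s_t).
n = 200_000
pi = policy(theta)
s0 = np.zeros(n, dtype=int)
a0 = (rng.random(n) < pi[0, 1]).astype(int)
s1 = (rng.random(n) < P1[s0, a0]).astype(int)
a1 = (rng.random(n) < pi[s1, 1]).astype(int)
ret = R[s0, a0] + R[s1, a1]                     # undiscounted trajectory return

grad = np.zeros_like(theta)
for s, a in ((s0, a0), (s1, a1)):
    score = np.zeros((n, 2, 2))
    score[np.arange(n), s] = np.eye(2)[a] - pi[s]   # only the visited state's row is nonzero
    grad += (ret[:, None, None] * score).mean(axis=0)

print("finite differences:\n", fd.round(3))
print("score-function estimate:\n", grad.round(3))
```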
Two technical notes experts will insist on. First, the theorem is exact for the discounted objective only if the step-\(t\) term carries an extra factor of \(\gamma^t\) (with \(\gamma\) the discount factor), a discounting of the state distribution that practical implementations almost universally drop; the resulting estimator is slightly biased but well behaved, and it is the one everyone uses. Second, the score-function estimator is unbiased but, as warned, high-variance: the same trajectory return \(R(\tau)\) multiplies every action's score, so a single lucky or unlucky rollout swings the whole gradient. Fixing that is §4.3.
REINFORCE & the baseline
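The oldest and simplest realization of EQ R4.3 is REINFORCE (Williams, 1992), a pure Monte-Carlo policy gradient. Run an episode to completion, compute the return-to-go \(G_t\) from each step, and take one gradient step with \(\Psi_t = G_t\). No value function, no bootstrapping, just rollouts and the log-derivative trick. With step size \(\alpha\), the update after each episode is:
\[ \theta \leftarrow \theta + \alpha \sum_{t} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \tag{R4.4} \]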
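The return-to-go is the one quantity REINFORCE has to compute from a finished episode. A minimal sketch follows; the helper name `returns_to_go` and the optional discount `gamma` are assumptions for illustration, and the backward accumulation is just one convenient way to compute it.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    g = np.zeros(len(rewards))
    running = 0.0
    # Walk backward so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

rewards = [1.0, 0.0, 2.0]
g_undiscounted = returns_to_go(rewards)             # [3.0, 2.0, 2.0]
g_discounted = returns_to_go(rewards, gamma=0.5)    # [1.5, 1.0, 2.0]
print(g_undiscounted, g_discounted)
```

Each \(G_t\) then weights the score of the action taken at step \(t\), so rewards earned before an action was taken never count for or against it.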
The noise in REINFORCE's gradient estimates has a specific and fixable cause. Suppose every reward in your environment is large and positive, with returns hovering around \(+100\). Then every action gets reinforced (its log-probability pushed up), just by different amounts. The gradient is dominated by the shared offset of \(100\) rather than by the differences that actually distinguish good actions from bad. The estimator is still unbiased, but its variance is enormous and learning crawls.
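The cure is a baseline: subtract a reference value \(b(s)\) from the return before weighting the score. The remarkable fact, the one that makes baselines free, is that any baseline that does not depend on the action leaves the gradient unbiased, because the expected score is zero:
\[ \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) \big] = b(s) \, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s) \, \nabla_\theta 1 = 0 \tag{R4.5} \]
(For continuous actions the sum becomes an integral; the argument is unchanged.)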
So REINFORCE-with-baseline weights each score by \(G_t - b(s_t)\). With \(b(s) = V^\pi(s)\), an action that did better than expected gets reinforced and one that did worse gets suppressed — even if both produced positive raw return. This is the conceptual hinge of the chapter, and it points straight at actor-critic: if a learned \(V^\pi(s)\) is the best baseline, learn one.
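Both halves of the baseline argument can be verified exactly on a small example, with no sampling. The sketch below assumes a single state, a three-action softmax policy, and made-up deterministic action values `q` with a large shared offset. It enumerates the actions to compute the mean and total variance of the weighted score with and without the baseline \(b = V = \mathbb{E}_\pi[q]\).

```python
import numpy as np

theta = np.array([0.5, -0.3, 0.1])          # illustrative logits
pi = np.exp(theta - theta.max())
pi /= pi.sum()
q = np.array([101.0, 100.0, 102.0])          # hypothetical action values, large shared offset
scores = np.eye(3) - pi                       # row a is grad_theta log pi(a)

# Expected score under the policy: sum_a pi(a) * grad log pi(a), which should be zero.
expected_score = pi @ scores

def grad_mean_and_var(b):
    """Exact mean and total variance of (q[a] - b) * score[a] under a ~ pi."""
    samples = (q - b)[:, None] * scores
    mean = pi @ samples
    var = pi @ ((samples - mean) ** 2)
    return mean, float(var.sum())

mean_raw, var_raw = grad_mean_and_var(0.0)
mean_base, var_base = grad_mean_and_var(pi @ q)    # b = V = E_pi[q]

print("expected score :", expected_score.round(12))
print("mean gradient  :", mean_raw.round(4), "vs", mean_base.round(4))
print(f"total variance : {var_raw:.2f} (no baseline) vs {var_base:.4f} (baseline)")
```

The two mean gradients agree, so the baseline changes nothing about where the update points on average; only the spread around that mean shrinks.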
# REINFORCE on a 2-armed bandit: a softmax policy ascends toward the better arm
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([1.0, 2.0])      # arm 1 is genuinely better
theta = np.zeros(2)                   # policy logits (one state, no transitions)
alpha = 0.1

for t in range(400):
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy pi(a)
    a = rng.choice(2, p=p)                          # sample an action
    reward = true_mean[a] + rng.normal(0, 1)        # noisy reward
    score = -p.copy(); score[a] += 1.0              # d log pi / d theta = 1[i=a] - p
    theta += alpha * reward * score                 # EQ R4.4, one-step bandit
    if t in (0, 50, 200, 399):
        print(f"step {t:3d}: pi = [{p[0]:.3f}, {p[1]:.3f}]")

p = np.exp(theta - theta.max()); p /= p.sum()
print(f"\nconverged policy pi(better arm) = {p[1]:.3f} (started at 0.500)")
print("the policy climbed -- no value function, no argmax, just gradient ascent.")
The example above also exposes the variance problem directly: with a small reward gap the policy's probabilities thrash from step to step, because the gradient signal is buried in noise. The next demonstration isolates exactly that effect, along with the baseline's cure.
# Baseline = variance reduction. A large constant reward offset wrecks the
# naive gradient; subtracting a running baseline restores it. (EQ R4.5)
import numpy as np

def train(use_baseline, seed=1, steps=300, offset=10.0):
    r = np.random.default_rng(seed)
    theta = np.zeros(2); b = 0.0; sq = []
    mean = np.array([0.0, 1.0]) + offset              # arm 1 better, but huge offset
    for t in range(steps):
        p = np.exp(theta - theta.max()); p /= p.sum()
        a = r.choice(2, p=p)
        reward = mean[a] + r.normal(0, 1)
        adv = reward - (b if use_baseline else 0.0)   # baseline-subtracted weight
        score = -p.copy(); score[a] += 1.0
        g = adv * score
        sq.append(g[1] ** 2)                          # squared gradient (one coord)
        theta += 0.1 * g
        b += 0.1 * (reward - b)                       # running estimate of E[return]
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p[1], float(np.mean(sq))

p_no, v_no = train(False)
p_yes, v_yes = train(True)
print(f"no baseline : pi(best) = {p_no:.3f} mean grad^2 = {v_no:.3f}")
print(f"w/ baseline : pi(best) = {p_yes:.3f} mean grad^2 = {v_yes:.3f}")
# mean grad^2 is a second moment measured along the training path, used here
# as a rough proxy for gradient variance.
print(f"\nmean grad^2 ratio (no baseline / baseline) = {v_no/v_yes:.1f}x")
print(f"better arm found: no baseline = {p_no > 0.5}, w/ baseline = {p_yes > 0.5}")
Actor-critic methods
REINFORCE-with-baseline still has a Monte-Carlo heart: it waits for a full episode and uses the actual return \(G_t\). That keeps it unbiased but slow and noisy. Actor-critic methods take the natural next step suggested by §4.3 — learn the baseline as its own function — and then go further, using that learned value function to bootstrap, replacing the full return with a one-step estimate. Two networks, two jobs:
- The actor is the policy \(\pi_\theta(a \mid s)\). It chooses actions and is updated by the policy gradient — pushed toward actions the critic judges better than average.
- The critic is a value function \(V_w(s)\) (or \(Q_w(s,a)\)) with its own parameters \(w\). It learns how much return to expect from a state and supplies that estimate as both the baseline and the bootstrap target for the actor.
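The glue between them is the TD error \(\delta\), the one-step temporal-difference signal (Chapter 03). It is the difference between a slightly-better-informed estimate of value (this step's reward plus the discounted value of where we landed) and the critic's current prediction:
\[ \delta_t = r_t + \gamma \, V_w(s_{t+1}) - V_w(s_t) \]
Here \(r_t\) is the reward received on the transition from \(s_t\) to \(s_{t+1}\), \(\gamma\) is the discount factor, and \(V_w(s_{t+1})\) is taken to be zero when \(s_{t+1}\) is terminal.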
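The two updates, applied every step (online, no episode boundary required), are:
\[ w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha_\theta \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \]
The critic moves its prediction toward the bootstrapped target; the actor takes a policy-gradient step with \(\Psi_t = \delta_t\). The step sizes \(\alpha_w\) and \(\alpha_\theta\) are separate.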
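A tabular sketch makes the per-step loop concrete. Everything about the environment here is an assumption for illustration: a four-state chain with a terminal goal at the right end, a cost of \(-1\) per step, and hand-picked step sizes. With tables instead of networks, the critic's gradient is just an increment to `V[s]`, and the actor's score is the usual softmax score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, goal = 4, 3             # chain 0-1-2-3; state 3 is terminal
gamma, alpha_actor, alpha_critic = 1.0, 0.1, 0.1
theta = np.zeros((n_states, 2))   # actor logits; action 0 = left, 1 = right
V = np.zeros(n_states)            # critic table; V[goal] is never updated and stays 0

def step(s, a):
    """Hypothetical toy dynamics: move left or right, pay -1 per step."""
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == goal

for episode in range(1000):
    s = 0
    for _ in range(200):          # step cap keeps early episodes bounded
        pi = np.exp(theta[s] - theta[s].max())
        pi /= pi.sum()
        a = rng.choice(2, p=pi)
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]                      # TD error
        V[s] += alpha_critic * delta               # critic: move V(s) toward the target
        score = -pi
        score[a] += 1.0                            # grad log pi(a | s) for softmax logits
        theta[s] += alpha_actor * delta * score    # actor: policy gradient with Psi = delta
        if done:
            break
        s = s_next

pi_right = np.array([np.exp(theta[s, 1]) / np.exp(theta[s]).sum() for s in range(goal)])
print("pi(right | s) for s = 0, 1, 2:", pi_right.round(3))
print("V:", V.round(2))
```

Both updates happen inside the inner loop, one transition at a time; nothing waits for the episode to end.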
Where this sits on the spectrum is the clean way to remember it. REINFORCE uses the full Monte-Carlo return \(G_t\): zero bias, maximum variance, must wait for the episode to end. One-step actor-critic uses \(\delta_t\): some bias from bootstrapping, much lower variance, learns online. In between sits a continuum — \(n\)-step returns and, most commonly today, Generalized Advantage Estimation (GAE), which exponentially blends advantage estimates across all horizons with a single knob \(\lambda\) to tune the bias–variance trade-off explicitly.
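The GAE recursion is short enough to write out. The sketch below assumes a single trajectory segment with no episode boundary inside it, critic predictions supplied as an array, and a bootstrap value `last_value` for the state after the final step (zero if that state is terminal); the function name and argument layout are illustrative, not taken from any particular library.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values_ext = np.append(np.asarray(values, dtype=float), last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    adv = np.zeros(len(rewards))
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.2, 0.1])      # made-up critic predictions V(s_0..s_2)

adv_td = gae(rewards, values, last_value=0.0, gamma=0.9, lam=0.0)   # one-step TD errors
adv_mc = gae(rewards, values, last_value=0.0, gamma=0.9, lam=1.0)   # Monte-Carlo return minus V
print(adv_td, adv_mc)
```

Setting `lam=0.0` recovers the one-step TD error and `lam=1.0` recovers the discounted return-to-go minus the critic's prediction, the two ends of the spectrum described above.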
A2C / A3C
Naive online actor-critic has a quiet flaw inherited from all on-policy gradient methods: consecutive samples from a single rollout are highly correlated, and that correlation inflates gradient variance and destabilizes training. Value-based deep RL (DQN) broke the correlation with a replay buffer, but a policy gradient must be estimated on-policy — from data the current policy generated — so a buffer of stale experience is off-limits. The answer DeepMind shipped in 2016 was to break correlation a different way: run many actors in parallel.
A3C — Asynchronous Advantage Actor-Critic (Mnih et al., 2016) — launches many actor-learners, each with its own copy of the policy, exploring different parts of the environment simultaneously and asynchronously pushing gradients to a shared parameter server. Because the workers are in different states at any instant, the gradients they contribute are decorrelated — the parallelism itself plays the role the replay buffer played for DQN, and it does so while keeping the updates strictly on-policy.
A2C (Advantage Actor-Critic) is the synchronous sibling and, in practice, the one most people reach for. Follow-up implementations found that the asynchrony in A3C was not the source of the benefit; the parallelism was. So A2C runs many environments in lockstep, batches their transitions into one large synchronized update, and gets equal or better results with simpler, more GPU-friendly code. The lesson stuck: gather diverse on-policy experience in parallel, batch it, update once.
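The synchronous pattern (step every environment copy with the same current policy, then make one averaged update) can be sketched on the two-armed bandit used earlier. This is an A2C-style batching illustration under heavy simplifying assumptions, not A2C itself: there are no states, no n-step returns, and no entropy term, and a running mean reward stands in for the learned critic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_envs, steps, alpha = 16, 300, 0.1
true_mean = np.array([1.0, 2.0])      # the same two-armed bandit in every environment copy
theta = np.zeros(2)                    # one shared set of policy logits
baseline = 0.0                         # running mean reward, a stand-in for a learned critic

for t in range(steps):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    # All copies act with the same current policy, so the batch is on-policy.
    actions = rng.choice(2, size=n_envs, p=pi)
    rewards = true_mean[actions] + rng.normal(0, 1, n_envs)   # one transition per copy, in lockstep
    adv = rewards - baseline
    scores = np.eye(2)[actions] - pi                          # grad log pi for each copy
    theta += alpha * (adv[:, None] * scores).mean(axis=0)     # one update from the whole batch
    baseline += 0.1 * (rewards.mean() - baseline)

pi = np.exp(theta - theta.max())
pi /= pi.sum()
print(f"pi(better arm) after {steps} batched updates: {pi[1]:.3f}")
```

Averaging over the batch is what reduces the noise of each update; the data never goes stale because it is discarded after the step that used it.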
| Algorithm | Ψ weight | Bias / variance | Data collection |
|---|---|---|---|
| REINFORCE | G_t (full return) | unbiased · high variance | one episode at a time |
| REINFORCE + baseline | G_t − V(s) | unbiased · lower variance | one episode at a time |
| One-step actor-critic | δ_t (TD error) | biased · low variance | fully online |
| A3C | n-step advantage + entropy bonus | tunable via n | parallel · asynchronous |
| A2C | n-step advantage + entropy bonus | tunable via n | parallel · synchronous |
An honest caveat. Vanilla policy gradients — even with advantages and entropy — are notoriously step-size sensitive: too large a step can collapse the policy in a way it cannot recover from, because the update changes the very distribution the next batch is drawn from. The line of work that fixed this — trust regions (TRPO) and the clipped surrogate objective of PPO — is what made policy gradients robust enough to dominate, and is the natural sequel to this chapter. PPO is also, not coincidentally, the workhorse of RLHF: aligning a language model is a policy-gradient problem in disguise, with the reward model as the environment.
We have the policy-gradient skeleton; now scale it with deep networks and make it stable. Chapter 05 takes these ideas into deep reinforcement learning proper — function approximation with neural networks, the deadly triad of bootstrapping, off-policy learning and approximation, DQN on the value side, and the trust-region and clipped objectives (TRPO, PPO) that turned the brittle gradient ascent of this chapter into the reliable engine behind game-playing agents, robotics, and RLHF.
References
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning.
- Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation.
- Mnih, V. et al. (2016). Asynchronous methods for deep reinforcement learning.
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
- Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms.