Function approximation & the deadly triad
Tabular RL — a separate entry in a lookup table for every state-action pair — is exact and has clean convergence guarantees. It is also useless the moment the world is large. A 210×160 RGB Atari frame has more configurations than there are atoms in the universe; a tabular agent would never visit the same state twice, let alone learn from it. The escape is function approximation: parameterize the value function or policy with a model \(f_\theta\) — a neural network — that generalizes across states, so that what it learns in one state transfers to similar states it has never seen.
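The size claim is easy to check with a back-of-the-envelope calculation. The sketch below assumes 8 bits per colour channel and uses the commonly quoted order of magnitude of 10^80 atoms in the observable universe; both figures are assumptions, not taken from the text.

```python
import math

# Assumptions: 8-bit channels (256 levels) and ~10^80 atoms in the observable universe.
HEIGHT, WIDTH, CHANNELS, LEVELS = 210, 160, 3, 256
ATOMS_LOG10 = 80

n_values = HEIGHT * WIDTH * CHANNELS            # independent pixel values per frame
frames_log10 = n_values * math.log10(LEVELS)    # log10 of LEVELS ** n_values

print(f"pixel values per frame : {n_values}")
print(f"distinct frames        : about 10^{frames_log10:.0f}")
print(f"atoms in the universe  : about 10^{ATOMS_LOG10}")
```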
This single substitution is what "deep" reinforcement learning means. It is also where the guarantees fall apart. Tabular Q-learning converges; the same algorithm with a neural network in the loop can diverge spectacularly, with values exploding toward infinity or the policy collapsing to a single useless action. Sutton and Barto named the cause the deadly triad. Instability is provoked when three ingredients are present at once: function approximation (generalizing across states), bootstrapping (building targets from the current estimates), and off-policy training (learning from data the current policy did not generate).
Why does the combination misbehave? In Q-learning the regression target \(y = r + \gamma \max_{a'} Q_\theta(s', a')\) is computed using the same network \(Q_\theta\) we are updating. A gradient step that raises \(Q_\theta(s,a)\) also raises \(Q_\theta(s',a')\) for similar \((s',a')\), because function approximation shares parameters across inputs and the change leaks to neighbors. That raises the target, which raises the next estimate. The network is chasing a target it moves every time it takes a step toward it. With a lookup table, where entries do not leak into each other, or with on-policy data, this loop is damped; with off-policy data, generalization, and bootstrapping together, it can amplify without bound.
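The feedback loop can be reproduced in a few lines. The sketch below uses a classic two-state construction with a single shared weight `w` (state values `w` and `2w`) and applies the semi-gradient TD update to one transition only, which is the kind of skewed sampling an off-policy data distribution can produce. The feature values, step size, and function name are illustrative assumptions.

```python
def td_weight_trajectory(gamma, alpha=0.1, w0=1.0, steps=50):
    """Semi-gradient TD(0) on a single transition s -> s' with reward 0.

    Linear values with one shared weight: V(s) = 1 * w, V(s') = 2 * w.
    Only this transition is ever updated (an off-policy sampling pattern).
    """
    w = w0
    trajectory = [w]
    for _ in range(steps):
        v_s, v_next = 1.0 * w, 2.0 * w
        td_error = 0.0 + gamma * v_next - v_s   # bootstrapped target minus estimate
        w += alpha * td_error * 1.0             # gradient of V(s) w.r.t. w is the feature, 1
        trajectory.append(w)
    return trajectory

diverging = td_weight_trajectory(gamma=0.9)   # 2 * gamma > 1: every step grows w
shrinking = td_weight_trajectory(gamma=0.4)   # 2 * gamma < 1: every step shrinks w

print(f"gamma=0.9: w goes from {diverging[0]:.2f} to {diverging[-1]:.2f}")
print(f"gamma=0.4: w goes from {shrinking[0]:.2f} to {shrinking[-1]:.2f}")
```

The true values are zero in both runs, because the reward is always zero. Each update multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so the estimate grows whenever `gamma > 0.5`: raising V(s) raises the target twice as fast, and the estimate never catches it.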
The triad is a diagnosis, not a theorem: it identifies the conditions under which divergence is possible, not a guarantee that it happens. In practice well-tuned deep agents are stable far more often than the worst case suggests — but the failure mode is real, it is hard to predict in advance, and the engineering of §5.2 and §5.3 is the field's accumulated wisdom for staying out of its way.
Deep Q-Networks — replay & target nets
The 2015 DQN paper is the landmark: a single architecture, learning straight from raw pixels and a score, was evaluated on 49 Atari games and reached more than 75% of a human tester's score on 29 of them. The network itself is unremarkable: a small convnet mapping a stack of four frames to one Q-value per action. The two ideas that made it stable are the lesson, and both are direct countermeasures to the deadly triad.
Experience replay
Instead of learning from each transition the instant it occurs and then discarding it, DQN writes every transition \((s, a, r, s')\) into a large circular replay buffer and trains on random minibatches sampled from it. This buys two things. First, it breaks the temporal correlation between consecutive samples: successive frames of one episode are near-identical and violate the i.i.d. assumption that standard SGD analyses lean on; sampling at random from a buffer of a million transitions restores approximate independence. Second, it reuses each experience many times, turning a precious environment interaction into many gradient updates, which is a large gain in sample efficiency.
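A minimal sketch of such a buffer, assuming a FIFO eviction policy and uniform sampling. The class name, the method names, and the extra `done` flag stored with each transition are illustrative choices, not details from the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal ordering.
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=5)
for t in range(8):                               # 8 consecutive steps of one toy episode
    buffer.push(s=t, a=0, r=0.0, s_next=t + 1, done=(t == 7))

states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)
print("stored states :", [tr[0] for tr in buffer.storage])
print("sampled states:", list(states))
```

With a capacity of 5, the three oldest transitions have already been overwritten, and the sampled minibatch is drawn from the surviving five in no particular order.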
The target network
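The second stabilizer attacks the moving-target problem head on. DQN keeps a separate copy of the network, the target network \(Q_{\theta^-}\), whose weights are frozen and only periodically copied from the online network \(Q_\theta\) (every \(C\) steps in the original; modern code often uses a slow Polyak average instead). The regression target is computed with the frozen copy, so it does not move while the online network chases it. With minibatches drawn from the replay buffer \(\mathcal{D}\), the target and squared-error loss are:
\[ y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'), \qquad L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\big[(y - Q_\theta(s, a))^2\big] \] (EQ R5.2)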
The target network's weights are refreshed by a hard copy every \(C\) steps, or by a soft Polyak (exponential) update applied every step, the form most continuous-control code now uses:
\[ \theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^-, \qquad 0 < \tau \ll 1 \] (EQ R5.3)
```python
# DQN target + replay on a toy 4-state chain MDP (EQ R5.2, EQ R5.3)
import numpy as np

rng = np.random.default_rng(0)

# states 0..3, action "go" advances one state; reward +1 only on reaching s3
nS, gamma = 4, 0.9

def step(s):
    """Deterministic toy dynamics: returns (next_state, reward, done)."""
    ns = min(s + 1, 3)
    r = 1.0 if ns == 3 else 0.0
    done = (ns == 3)
    return ns, r, done

# fill a replay buffer with (s, ns, r, done) transitions from random starts in {0, 1, 2}
buffer = [(int(s), *step(int(s))) for s in rng.integers(0, 3, size=400)]

Q = np.zeros(nS)    # "online" tabular value (one per state, greedy action)
Qt = Q.copy()       # target copy, moved only by the slow Polyak update below
lr, tau = 0.5, 0.1

for it in range(60):
    s, ns, r, done = buffer[rng.integers(len(buffer))]   # sample from replay
    y = r if done else r + gamma * Qt[ns]                # target uses the slow copy Qt
    Q[s] += lr * (y - Q[s])                              # gradient step on online Q only
    Qt = tau * Q + (1 - tau) * Qt                        # Polyak soft update (EQ R5.3)

true = np.array([gamma**2, gamma**1, gamma**0, 0.0])     # exact V from each state
print("learned Q :", Q.round(3).tolist())
print("true V    :", true.round(3).tolist())
print("max error :", round(float(np.abs(Q - true).max()), 4))
print("\nfreezing the target (Qt) is what stops Q from chasing its own moving estimate.")
```
Proximal Policy Optimization
DQN learns a value and acts greedily; it is confined to discrete actions and is famously fiddly to tune. The other half of deep RL learns the policy directly (Chapter 04). Vanilla policy gradients are unbiased but high-variance and brittle: a single overlarge step can push the policy into a region where it collects no reward, and with no good data it never recovers. The fix that won the field is Proximal Policy Optimization (PPO) — robust, simple to implement, and the workhorse behind everything from robotics to the RLHF that aligns language models (Chapter 06).
PPO descends from Trust Region Policy Optimization (TRPO), whose principle is: improve the policy, but never step so far that the new policy is unrecognizably different from the old one, because the advantage estimates were collected under the old policy and stop being valid far from it. TRPO enforces this with a hard KL-divergence constraint and a second-order optimization — correct but heavy. PPO achieves nearly the same effect with a first-order trick that fits in a few lines: clip the probability ratio.
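Let \(r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) be the ratio of the new policy's probability of the taken action to the old policy's. \(r_t = 1\) means no change; \(r_t > 1\) means the new policy is more likely to take that action. With \(\hat{A}_t\) the advantage estimate and \(\varepsilon\) a small clip range, PPO maximizes the clipped surrogate objective:
\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\big)\Big] \] (EQ R5.4)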
The asymmetry is the clever part. Read it case by case. When the advantage is positive (a good action, push its probability up) the objective stops rewarding any increase in \(r_t\) past \(1+\varepsilon\): the upside is capped, so the update cannot over-commit. When the advantage is negative (a bad action, push its probability down) the clipping floors the ratio at \(1-\varepsilon\), so the objective gains nothing from pushing \(r_t\) lower, again removing the incentive to over-correct. Crucially, the \(\min\) only ever removes incentive when the policy has already moved far enough in the favorable direction. It never clips in a way that prevents undoing a too-large step, so the policy can always claw back from a mistake. That single property is why PPO is forgiving where vanilla policy gradients are not.
```python
# PPO clipped objective on toy advantages, swept over the ratio r (EQ R5.4)
import numpy as np

eps = 0.2
r = np.linspace(0.0, 2.0, 21)   # candidate probability ratios

def ppo_obj(r, A, eps=0.2):
    unclipped = r * A
    clipped = np.clip(r, 1 - eps, 1 + eps) * A
    return np.minimum(unclipped, clipped)   # pessimistic lower bound

A_pos, A_neg = +1.0, -1.0
L_pos = ppo_obj(r, A_pos, eps)
L_neg = ppo_obj(r, A_neg, eps)

print("   r  L(A=+1)  L(A=-1)")
for ri, lp, ln in zip(r, L_pos, L_neg):
    print(f"{ri:4.2f}  {lp:+6.3f}   {ln:+6.3f}")

print(f"\nclip interval at eps={eps}: [{1-eps:.2f}, {1+eps:.2f}]")
print("A>0: objective FLATTENS once r exceeds 1.20 (no reward for over-stepping).")
print("A<0: objective FLATTENS once r drops below 0.80 (no reward for over-correcting).")

# plot_xy is a helper supplied by the page's interactive sandbox; skip it elsewhere
if "plot_xy" in globals():
    plot_xy(r.tolist(), L_pos.tolist())
```
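In training code the ratio is usually formed from stored log-probabilities rather than swept over a grid. The sketch below shows one way that might look, assuming per-sample log-probabilities and advantages are already available as NumPy arrays. The function name, the hypothetical batch, and the clipped-fraction diagnostic are illustrative additions.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate (a loss to minimise) and the clipped fraction."""
    ratio = np.exp(logp_new - logp_old)                   # r_t = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = np.minimum(unclipped, clipped)
    clip_frac = np.mean(np.abs(ratio - 1.0) > eps)        # share of ratios outside the interval
    return float(-surrogate.mean()), float(clip_frac)

# Hypothetical batch of four actions: old and new probabilities plus advantages.
logp_old = np.log(np.array([0.50, 0.50, 0.20, 0.20]))
logp_new = np.log(np.array([0.75, 0.25, 0.30, 0.10]))
adv = np.array([+1.0, +1.0, -1.0, -1.0])

loss, clip_frac = ppo_clip_loss(logp_new, logp_old, adv)
print(f"loss = {loss:.3f}, fraction of ratios outside [0.8, 1.2] = {clip_frac:.2f}")
```

The four samples cover the four cases. The first (good action, ratio 1.5) and the last (bad action, ratio 0.5) have already moved far enough in the favorable direction, so the clipped branch is selected and they contribute no gradient. The middle two moved the wrong way, so the unclipped branch is selected and the update can still pull them back.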
Continuous control — DDPG & SAC
DQN's \(\max_{a'}\) over actions is fine when there are four buttons; it is intractable when the action is a vector of continuous torques, because the maximization is itself an optimization problem at every step. Continuous control — robot arms, locomotion, autonomous driving — needs a different shape of algorithm. The dominant family is actor–critic, which keeps a learned policy (the actor) and a learned value (the critic) and lets them improve each other.
- DDPG (Deep Deterministic Policy Gradient). An off-policy actor–critic that you can read as "DQN for continuous actions". A deterministic actor \(\mu_\theta(s)\) outputs the action directly, so the critic's \(\max\) is replaced by \(Q(s, \mu_\theta(s))\) — no inner optimization. It inherits DQN's replay buffer and target networks (with Polyak updates, EQ R5.3) and adds exploration noise to the actor's output. Powerful, but notoriously sensitive to hyperparameters.
- TD3 (Twin Delayed DDPG). Three targeted fixes for DDPG's pathologies: twin critics (take the minimum of two Q-networks to fight the over-estimation bias DQN also suffers); delayed actor updates (update the policy less often than the critic, so it chases a more settled target); and target-policy smoothing (add noise to the target action so the critic cannot exploit sharp peaks). Together they make off-policy continuous control far more reliable.
- SAC (Soft Actor–Critic). The current default for continuous control. SAC is built on maximum-entropy RL: the objective adds an entropy bonus, so the agent is rewarded not only for return but for staying as random as it can while still doing well. This yields strong, automatic exploration, robustness to hyperparameters, and excellent sample efficiency.
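Of the three methods in the list above, TD3's target is the easiest to show in isolation. The sketch below computes a bootstrapped target with target-policy smoothing and the twin-critic minimum. The actor and critics are hypothetical closed-form stand-ins rather than learned networks, and the noise scale, noise clip, and action limit are illustrative defaults.

```python
import numpy as np

def td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Bootstrapped target with target-policy smoothing and a twin-critic minimum."""
    a_next = actor_targ(s_next)
    # Target-policy smoothing: add clipped Gaussian noise to the target action.
    noise = np.clip(rng.normal(0.0, noise_std, size=a_next.shape), -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Twin critics: bootstrap from the smaller of the two estimates.
    q_min = np.minimum(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min

# Toy stand-ins for the three target networks (hypothetical, not learned).
actor_targ = lambda s: np.tanh(s)
q1_targ = lambda s, a: 1.0 - (a - 0.3) ** 2
q2_targ = lambda s, a: 0.5 - (a - 0.3) ** 2    # deliberately more pessimistic than q1

rng = np.random.default_rng(0)
s_next = np.array([0.1, -0.4, 2.0])
r = np.array([0.0, 1.0, 0.5])
done = np.array([0.0, 0.0, 1.0])               # the last transition ends its episode

y = td3_target(r, done, s_next, actor_targ, q1_targ, q2_targ, rng)
print("TD3 targets:", y.round(3).tolist())
```

The third fix, delayed actor updates, lives in the training loop rather than in the target: the actor and the target networks are updated only once every few critic updates.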
A useful mental map: PPO is the robust, on-policy default when you can afford to throw away data after each batch (and it dominates RLHF for that simplicity). SAC is the sample-efficient, off-policy default when interactions are expensive — a real robot, a slow simulator. DQN and its descendants own discrete-action problems. There is no universal winner; the right choice is dictated by action space, sample budget, and how much tuning you can tolerate.
Stability & reproducibility
Deep RL works, and it is also, honestly, among the least reproducible corners of mainstream machine learning. The reason traces straight back to §5.1: the agent generates its own data, so a tiny early difference in behavior steers it toward an entirely different region of experience, and the gap compounds. The most uncomfortable symptom is seed sensitivity: the same algorithm, the same code, and the same hyperparameters, with only the random seed changed, can produce wildly different learning curves, one seed solving the task and another never leaving the floor.
This is not a rumor; it was documented carefully. Henderson et al. (2018) showed that reported results in deep-RL papers were routinely driven by a handful of lucky seeds, that the choice of seed could matter as much as the choice of algorithm, and that comparisons drawn from too few runs were often statistically meaningless. The practical consequences are now widely accepted:
- Report many seeds, not one. A single learning curve is anecdote. Five to ten independent seeds, with the spread shown — not just the best or the mean — is the minimum honest unit of evidence.
- Show the distribution. Mean ± standard deviation, or better, the interquartile range and confidence intervals; aggregate protocols like RLiable exist precisely to stop cherry-picking. The variance across seeds is itself a result — a high-variance method may be worse in practice than a lower-mean but reliable one.
- Pin the stack. Environment version, library version, hardware, and every hyperparameter, because deep-RL outcomes are sensitive to all of them — implementation details that sound cosmetic (reward scaling, observation normalization, the exact advantage estimator) routinely swing final performance more than the headline algorithm does.
```python
# Why one seed lies: variance of final return across seeds (EQ R5.6)
import numpy as np

# simulate 8 seeds of a high-variance deep-RL run: some solve it, some stall
rng = np.random.default_rng(7)
seeds = 8

# bimodal outcome: ~60% reach a good return ~180, ~40% get stuck near ~40
solved = rng.random(seeds) < 0.6
final = np.where(solved, rng.normal(180, 15, seeds),
                 rng.normal(40, 20, seeds)).clip(0)

mean = final.mean()
s = final.std(ddof=1)        # sample std (N-1)
se = s / np.sqrt(seeds)      # standard error of the mean

print("per-seed final return:", final.round(1).tolist())
print(f"mean    G_bar = {mean:6.1f}")
print(f"std     s     = {s:6.1f}   (this is what one seed cannot show you)")
print(f"std-err SE    = {se:6.1f}   (shrinks only as 1/sqrt(N))")
print(f"if you reported ONLY seed 0: {final[0]:.1f}  <- anecdote, not evidence")

# plot_scatter is a helper supplied by the page's interactive sandbox; skip it elsewhere
if "plot_scatter" in globals():
    plot_scatter(list(range(seeds)), final.tolist(), solved.astype(int).tolist())
```
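The interquartile range and confidence intervals recommended above can be computed with plain NumPy. The sketch below uses a percentile bootstrap over seeds. It is a simplified stand-in, not the RLiable API, and the per-seed returns are made-up numbers for illustration.

```python
import numpy as np

def summarize_seeds(final_returns, n_boot=10_000, level=0.95, seed=0):
    """Mean with a percentile-bootstrap CI, plus median and IQR across seeds."""
    x = np.asarray(final_returns, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample seeds with replacement and recompute the mean each time.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    q25, q50, q75 = np.quantile(x, [0.25, 0.50, 0.75])
    return {"mean": float(x.mean()), "ci": (float(lo), float(hi)),
            "median": float(q50), "iqr": (float(q25), float(q75))}

# Made-up final returns for 8 seeds: five solve the task, three stall.
returns = [182.0, 35.5, 171.2, 190.4, 48.9, 176.3, 22.0, 185.7]
stats = summarize_seeds(returns)

print(f"mean   = {stats['mean']:.1f}")
print(f"95% CI = [{stats['ci'][0]:.1f}, {stats['ci'][1]:.1f}]")
print(f"median = {stats['median']:.1f}, IQR = [{stats['iqr'][0]:.1f}, {stats['iqr'][1]:.1f}]")
```

With a bimodal outcome like this one the interval around the mean is wide, which is the honest summary: eight seeds are not enough to pin the expected return down tightly.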
The deep-RL reproducibility checklist. (1) One-seed results are anecdotes; report ≥5, ideally with IQR/CIs. (2) The deadly triad can diverge silently; watch the Q-values, not just the reward. (3) Reward scaling and observation normalization can swing outcomes more than the algorithm name; log them. (4) "Beats SOTA" from a different env version or evaluation protocol is not a comparison. (5) Tuning on the test environment is contamination, exactly as in supervised learning.
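Logging the stack and the seed can be as simple as writing a small manifest next to each run. The sketch below is one possible shape for it. The field names, the `ToyChain-v0` environment id, and the hyperparameter keys are hypothetical, and a real project would also record the versions of its RL and simulator libraries.

```python
import json
import platform
import random
import sys

import numpy as np

def make_run_manifest(seed, env_id, hyperparams):
    """Collect what is needed to rerun an experiment (illustrative field names)."""
    return {
        "seed": seed,
        "env_id": env_id,
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "platform": platform.platform(),
        "hyperparams": hyperparams,
    }

def seed_everything(seed):
    """Seed the stdlib RNG and return a NumPy generator to pass around explicitly."""
    random.seed(seed)
    return np.random.default_rng(seed)

hyperparams = {"lr": 0.0003, "gamma": 0.99, "reward_scale": 1.0, "normalize_obs": True}
manifest = make_run_manifest(seed=3, env_id="ToyChain-v0", hyperparams=hyperparams)
rng = seed_everything(manifest["seed"])

print(json.dumps(manifest, indent=2))
```

Writing the details that sound cosmetic (reward scale, observation normalization) into the same record as the seed keeps them from being lost when results are compared later.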
The clip that stabilized policy gradients is about to stabilize something far larger. Chapter 06 turns PPO outward: the same clipped objective, with a language model as the policy and a learned reward model standing in for the environment, is the engine of RLHF — and its leaner successors, DPO and GRPO, that align the models you talk to every day.
References
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. & Moritz, P. (2015). Trust Region Policy Optimization.
- Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor.
- van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning.
- Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning.
- Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods.
- Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters.