REINFORCEMENT LEARNING · CHAPTER 05 / 06

Deep Reinforcement Learning — DQN & PPO

The tabular methods of the earlier chapters store one number per state or state-action pair, which is impractical the instant the state is a screen of pixels or a robot's joint angles. The fix is to replace the table with a neural network. That substitution scales RL to Atari and robotics, but it introduces a new failure mode: an instability that replay buffers and clipped objectives exist to tame. This chapter covers two algorithms that made deep RL work: DQN, which stabilized value learning with a replay buffer and a frozen target network, and PPO, whose clipped surrogate objective made policy gradients robust enough to become the field's default.

LEVEL: ADVANCED · READING TIME: ≈ 28 MIN · BUILDS ON: CH 04, POLICY GRADIENTS · INSTRUMENTS: DQN STABILIZERS, PPO CLIP, SEED VARIANCE
5.1

Function approximation & the deadly triad

Tabular RL — a separate entry in a lookup table for every state-action pair — is exact and has clean convergence guarantees. It is also useless the moment the world is large. A 210×160 RGB Atari frame has more configurations than there are atoms in the universe; a tabular agent would never visit the same state twice, let alone learn from it. The escape is function approximation: parameterize the value function or policy with a model \(f_\theta\) — a neural network — that generalizes across states, so that what it learns in one state transfers to similar states it has never seen.
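
The size claim is easy to check with a few lines of arithmetic. The sketch below assumes 8 bits per colour channel and uses the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable universe; both figures are assumptions made for illustration, and the variable names are hypothetical.

```python
import math

# Assumed for illustration: 8-bit channels, ~10^80 atoms in the observable universe.
HEIGHT, WIDTH, CHANNELS = 210, 160, 3
VALUES_PER_CHANNEL = 256
ATOMS_LOG10 = 80

n_values = HEIGHT * WIDTH * CHANNELS                      # numbers needed to describe one frame
frames_log10 = n_values * math.log10(VALUES_PER_CHANNEL)  # log10 of the count of distinct frames

print(f"values per frame       : {n_values}")
print(f"log10(distinct frames) : {frames_log10:.1f}")
print(f"log10(atoms, assumed)  : {ATOMS_LOG10}")
print("a lookup table would need one row per frame and one column per action")
```

The exponent alone has six digits, which is why no amount of memory or experience rescues the tabular approach here.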

This single substitution is what "deep" reinforcement learning means. It is also where the guarantees fall apart. Tabular Q-learning converges; the same algorithm with a neural network in the loop can diverge spectacularly — values exploding to infinity, the policy collapsing to a single useless action. Sutton and Barto named the cause the deadly triad: instability is provoked when three ingredients are present at once.

EQ R5.1 — THE DEADLY TRIAD $$ \underbrace{\text{function approximation}}_{\text{generalize across states}} \;+\; \underbrace{\text{bootstrapping}}_{\text{target uses your own estimate}} \;+\; \underbrace{\text{off-policy learning}}_{\text{train on data from another policy}} \;\Longrightarrow\; \text{risk of divergence} $$
Each ingredient is individually benign — and individually almost indispensable. Function approximation is forced on us by large state spaces. Bootstrapping (a TD target \(r + \gamma \max_{a'} Q(s', a')\) that depends on the network's own output) is what makes learning sample-efficient. Off-policy learning lets us reuse old data instead of throwing it away after one gradient step. Present all three and the value estimates can chase their own moving target into divergence. Every algorithm in this chapter is, at heart, a recipe for keeping the triad's three forces in balance rather than letting them resonate.

Why does the combination misbehave? In Q-learning the regression target \(y = r + \gamma \max_{a'} Q_\theta(s', a')\) is computed using the same network \(Q_\theta\) we are updating. A gradient step that raises \(Q_\theta(s,a)\) also tends to raise \(Q_\theta(s',a')\) for similar \((s',a')\), because function approximation lets the change leak to neighbors. That raises the target, which raises the next estimate. The network is chasing a target it moves every time it takes a step toward it. With on-policy data or a tabular representation this loop is damped; with off-policy data, generalization, and bootstrapping together, it can amplify without bound.
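
The feedback loop can be reproduced in a few lines. The sketch below is in the spirit of the two-state example often used to illustrate the triad: one shared weight `w` stands in for the network (function approximation), the target is built from the current estimate (bootstrapping), and updates are applied only to the A → B transition (an off-policy weighting of states). The feature values, step size, and names are illustrative assumptions.

```python
import numpy as np

gamma, alpha = 0.99, 0.1

# Two states share a single weight w: V(A) = 1*w, V(B) = 2*w.
# The only transition used is A -> B with reward 0, so the correct value is w = 0.
phi_A, phi_B = 1.0, 2.0

def semi_gradient_td(w, steps):
    """Repeatedly update on the A -> B transition only (off-policy state weighting)."""
    history = [w]
    for _ in range(steps):
        target = 0.0 + gamma * phi_B * w      # bootstrapped target built from the current w
        td_error = target - phi_A * w
        w = w + alpha * td_error * phi_A      # semi-gradient: the target is treated as a constant
        history.append(w)
    return np.array(history)

w_hist = semi_gradient_td(w=1.0, steps=50)
print("w after 0, 10, 25, 50 updates:", w_hist[[0, 10, 25, 50]].round(3).tolist())
print("growth factor per update     :", round(1 + alpha * (gamma * phi_B - phi_A), 3))
```

Each update multiplies `w` by a factor greater than one whenever \(\gamma \cdot 2 > 1\), so the estimate moves away from the correct value of zero instead of toward it.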

The triad is a diagnosis, not a theorem: it identifies the conditions under which divergence is possible, not a guarantee that it happens. In practice well-tuned deep agents are stable far more often than the worst case suggests — but the failure mode is real, it is hard to predict in advance, and the engineering of §5.2 and §5.3 is the field's accumulated wisdom for staying out of its way.

According to the deadly triad (EQ R5.1), how many ingredients must be present together for off-policy value learning with neural networks to risk divergence?
The triad names exactly three: function approximation, bootstrapping, and off-policy learning. The answer is 3. Remove any one — e.g. switch to on-policy Monte-Carlo targets (no bootstrapping) or a tabular value (no approximation) — and the convergence story is far safer.
5.2

Deep Q-Networks — replay & target nets

The 2015 DQN paper is the landmark: a single architecture, learning straight from raw pixels and a score, reached human-level play on most of 49 Atari games. The network itself is unremarkable — a small convnet mapping a stack of four frames to one Q-value per action. The two ideas that made it stable are the lesson, and both are direct countermeasures to the deadly triad.

Experience replay

Instead of learning from each transition the instant it occurs and then discarding it, DQN writes every transition \((s, a, r, s')\) into a large circular replay buffer and trains on random minibatches sampled from it. This buys two things. First, it breaks the temporal correlation between consecutive samples: successive frames of one episode are near-identical and violate the i.i.d. assumption every SGD convergence proof leans on; shuffling from a buffer of a million transitions restores approximate independence. Second, it reuses each experience many times, turning a precious environment interaction into many gradient updates — a large gain in sample efficiency.
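
A circular buffer with uniform sampling might be sketched as follows. The class name, the `done` flag in the stored tuple, and sampling with replacement are simplifying assumptions for illustration, not details taken from the DQN paper.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-capacity circular buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.storage = []
        self.write_pos = 0                                # next slot to overwrite once full
        self.rng = np.random.default_rng(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.write_pos] = transition     # overwrite the oldest entry
        self.write_pos = (self.write_pos + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform over stored transitions, with replacement (a simplification).
        idx = self.rng.integers(len(self.storage), size=batch_size)
        return [self.storage[i] for i in idx]

    def __len__(self):
        return len(self.storage)

buf = ReplayBuffer(capacity=5)
for t in range(8):                                        # 8 writes into 5 slots evict steps 0..2
    buf.add((t, 0, 0.0, t + 1, False))

stored_steps = sorted(tr[0] for tr in buf.storage)
sampled_steps = [tr[0] for tr in buf.sample(4)]
print("stored time steps :", stored_steps)
print("sampled time steps:", sampled_steps)
```

The sampled minibatch mixes time steps in no particular order, which is the decorrelation the paragraph above describes.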

The target network

The second stabilizer attacks the moving-target problem head on. DQN keeps a separate copy of the network, the target network \(Q_{\theta^-}\), whose weights are frozen and only periodically copied from the online network \(Q_\theta\) (every \(C\) steps in the original; modern code often uses a slow Polyak average instead). The regression target is computed with the frozen copy, so it does not move while the online network chases it.

EQ R5.2 — THE DQN LOSS (WITH A FROZEN TARGET) $$ \mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\Big(\, \underbrace{r + \gamma \max_{a'} Q_{\theta^-}(s', a')}_{\text{target } y,\ \text{frozen}} \;-\; Q_\theta(s, a) \,\Big)^{2}\right] $$
\(\mathcal{D}\) is the replay buffer; \((s,a,r,s')\) a minibatch sampled uniformly from it. The target \(y\) uses the frozen parameters \(\theta^-\); the gradient flows only through \(Q_\theta(s,a)\) — never through the target. Stop-gradient on the bootstrap target plus a buffer that decorrelates samples is the whole stabilization recipe. For a terminal transition the \(\gamma \max\) term is dropped: \(y = r\). Double-DQN refines this by choosing the next action with the online net but evaluating it with the target net, which curbs the systematic over-estimation that a single \(\max\) introduces.
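
As a worked instance of EQ R5.2, the sketch below computes the DQN target, the Double-DQN target, and the loss for a tiny minibatch. All Q-values are made-up numbers and the array names are hypothetical; in a real agent they would come from forward passes of the online and target networks.

```python
import numpy as np

gamma = 0.99

# Hypothetical minibatch of 3 transitions with 2 actions. The Q-values are made up.
rewards = np.array([1.0, 0.0, 2.0])
dones   = np.array([False, False, True])                          # last transition is terminal
actions = np.array([0, 1, 0])                                     # actions actually taken in s
q_online_s    = np.array([[0.5, 0.2], [0.1, 0.4], [1.0, 0.3]])    # Q_theta(s, .)
q_online_next = np.array([[1.0, 3.0], [2.0, 1.5], [9.0, 9.0]])    # Q_theta(s', .)
q_target_next = np.array([[2.0, 1.0], [0.5, 4.0], [9.0, 9.0]])    # Q_theta^-(s', .)

not_done = 1.0 - dones.astype(float)                 # zeroes the bootstrap term at terminals
rows = np.arange(len(rewards))

# DQN (EQ R5.2): max over the target network's own values.
y_dqn = rewards + gamma * not_done * q_target_next.max(axis=1)

# Double DQN: the online net picks the action, the target net scores it.
a_star = q_online_next.argmax(axis=1)
y_double = rewards + gamma * not_done * q_target_next[rows, a_star]

# Squared TD error on the taken actions; y is treated as a constant.
q_taken = q_online_s[rows, actions]
loss = np.mean((y_dqn - q_taken) ** 2)

print("DQN targets       :", y_dqn.round(3).tolist())
print("Double-DQN targets:", y_double.round(3).tolist())
print("DQN loss          :", round(float(loss), 4))
```

The Double-DQN targets can never exceed the DQN targets, because scoring any single action with the target network gives at most that network's maximum.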

The target network's weights are refreshed by a hard copy every \(C\) steps, or by a soft Polyak (exponential) update applied every step — the form most continuous-control code now uses:

EQ R5.3 — POLYAK (SOFT) TARGET UPDATE $$ \theta^- \;\leftarrow\; \tau\, \theta \;+\; (1 - \tau)\, \theta^-, \qquad 0 < \tau \ll 1 $$
With \(\tau\) small (say \(0.005\)) the target net is a slowly-trailing exponential moving average of the online net: it moves, but far too slowly to resonate with the online updates. \(\tau = 1\) recovers a hard copy every step (no smoothing at all); \(\tau \to 0\) freezes the target forever. \(\tau\) trades stability against the speed at which the target tracks genuine improvement — too small and learning crawls, too large and the moving-target instability creeps back.
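
To get a feel for how slowly the target trails, the sketch below counts how many soft updates it takes to close half the gap to an online weight that is held fixed. Holding the online weight fixed is a simplifying assumption (in training it keeps moving), and the helper name `polyak` is hypothetical.

```python
import math

def polyak(theta_target, theta_online, tau):
    """One soft update of a single target weight (EQ R5.3)."""
    return tau * theta_online + (1.0 - tau) * theta_target

tau = 0.005
theta_online = 10.0          # held fixed here; in training it keeps moving
theta_target = 2.0

gap0 = theta_online - theta_target
steps = 0
while theta_online - theta_target > 0.5 * gap0:
    theta_target = polyak(theta_target, theta_online, tau)
    steps += 1

half_life = math.log(2) / -math.log(1 - tau)
print("steps to close half the gap (simulated):", steps)
print("closed form ln(2) / -ln(1 - tau)       :", round(half_life, 1))
```

Because each update shrinks the gap by the factor \(1 - \tau\), the gap after \(k\) updates is \((1-\tau)^k\) times the original, which is where the closed form comes from.
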
True or false: DQN's experience-replay buffer breaks the correlation between consecutive training samples by storing transitions and drawing random minibatches from the whole buffer rather than learning from each transition in order. (Answer true or false.)
Consecutive frames within an episode are highly correlated and badly violate the i.i.d. assumption SGD relies on. Sampling uniformly from a buffer of up to a million past transitions mixes experiences from many different times and episodes, restoring approximate independence — that decorrelation, together with sample reuse, is the buffer's whole purpose. The statement is true.
A target-network weight is updated by a Polyak step (EQ R5.3) with \(\tau = 0.005\). The online weight is \(\theta = 10\) and the current target weight is \(\theta^- = 2\). What is the new target weight \(\theta^-\)?
\(\theta^- \leftarrow \tau\,\theta + (1-\tau)\,\theta^- = 0.005 \times 10 + 0.995 \times 2 = 0.05 + 1.99 = \) 2.04. The target inches only \(0.04\) toward the online value — exactly the slow trailing average that keeps the bootstrap target from chasing itself.
PYTHON · RUNNABLE IN-BROWSER
# DQN target + replay on a toy 4-state chain MDP (EQ R5.2, EQ R5.3)
import numpy as np
rng = np.random.default_rng(0)

# states 0..3, action "go" advances one state; reward +1 only on reaching s3
nS, gamma = 4, 0.9
def step(s):                                   # deterministic toy dynamics
    ns = min(s + 1, 3); r = 1.0 if ns == 3 else 0.0; done = (ns == 3)
    return ns, r, done

# fill a replay buffer with transitions from random starts
buffer = [(s, *step(s)) for s in rng.integers(0, 3, size=400)]

Q  = np.zeros(nS)                              # "online" tabular value (one per state, greedy action)
Qt = Q.copy()                                  # target copy: trails Q slowly, never trained directly
lr, tau = 0.5, 0.1
for _ in range(60):
    s, ns, r, done = buffer[rng.integers(len(buffer))]   # sample uniformly from replay
    y = r if done else r + gamma * Qt[ns]                # target uses the slow-moving copy Qt
    Q[s] += lr * (y - Q[s])                              # update the online Q only
    Qt = tau * Q + (1 - tau) * Qt                        # Polyak soft update (EQ R5.3)

true = np.array([gamma**2, gamma**1, gamma**0, 0.0])     # exact V from each state
print("learned Q :", Q.round(3).tolist())
print("true   V :", true.round(3).tolist())
print("max error :", round(float(np.abs(Q - true).max()), 4))
print("\ncomputing the target from the slow-moving copy Qt is what stops Q from chasing its own estimate.")
INSTRUMENT R5.1 — DQN STABILIZERS · REPLAY BUFFER · TARGET NET · EQ R5.2
Each curve is the learned value of the start state over training on the toy chain, plotted against its true value (the dashed mint line). With both stabilizers on, the estimate climbs smoothly to the truth. With the target network off, the bootstrap target chases itself and the curve overshoots and oscillates. With replay off, learning from a single correlated stream becomes jagged and slow. With both off, the deadly triad's instability appears in miniature.
5.3

Proximal Policy Optimization

DQN learns a value and acts greedily; it is confined to discrete actions and is famously fiddly to tune. The other half of deep RL learns the policy directly (Chapter 04). Vanilla policy gradients are unbiased but high-variance and brittle: a single overlarge step can push the policy into a region where it collects no reward, and with no good data it never recovers. The fix that won the field is Proximal Policy Optimization (PPO) — robust, simple to implement, and the workhorse behind everything from robotics to the RLHF that aligns language models (Chapter 06).

PPO descends from Trust Region Policy Optimization (TRPO), whose principle is: improve the policy, but never step so far that the new policy is unrecognizably different from the old one, because the advantage estimates were collected under the old policy and stop being valid far from it. TRPO enforces this with a hard KL-divergence constraint and a second-order optimization — correct but heavy. PPO achieves nearly the same effect with a first-order trick that fits in a few lines: clip the probability ratio.

Let \(r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\) be the ratio of the new policy's probability of the taken action to the old policy's. \(r_t = 1\) means no change; \(r_t > 1\) means the new policy is more likely to take that action. PPO maximizes the clipped surrogate objective:

EQ R5.4 — PPO CLIPPED SURROGATE OBJECTIVE $$ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\Big[\, \min\big(\, r_t(\theta)\, \hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat{A}_t \,\big) \Big] $$
\(\hat{A}_t\) is the estimated advantage (typically from GAE, Chapter 04): how much better the action was than the policy's average. The \(\mathrm{clip}\) confines the ratio to \([1-\varepsilon,\, 1+\varepsilon]\) (the default \(\varepsilon = 0.2\) gives \([0.8,\, 1.2]\)). The outer \(\min\) takes the more pessimistic of the clipped and unclipped terms, so the objective is a lower bound on the unclipped surrogate \(r_t(\theta)\,\hat{A}_t\). The effect: once an update would move the policy too far in the rewarding direction, the gradient for that sample switches off, so there is no incentive to step past the trust region. No KL constraint, no second-order solve: a one-line guardrail that made policy gradients dependable.
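
In practice the ratio is usually computed from stored log-probabilities. The sketch below evaluates EQ R5.4 on a small batch and also reports the fraction of samples on which the clip binds. The function name, the returned diagnostic, and all numbers in the batch are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate (EQ R5.4) plus the fraction of samples where the clip binds."""
    ratio = np.exp(logp_new - logp_old)                  # r_t, computed in log space
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_sample = np.minimum(unclipped, clipped)
    clip_binds = clipped < unclipped                     # the min selected the clipped term
    return per_sample.mean(), clip_binds.mean(), ratio

# Hypothetical batch of 4 samples: probabilities and advantages are made up.
p_old = np.array([0.20, 0.50, 0.40, 0.10])
p_new = np.array([0.30, 0.45, 0.20, 0.11])
adv   = np.array([+2.0, +1.0, -1.0, -0.5])

objective, clip_frac, ratio = ppo_clip_objective(np.log(p_new), np.log(p_old), adv)
print("ratios         :", ratio.round(3).tolist())
print("mean objective :", round(float(objective), 4))
print("clip fraction  :", float(clip_frac))
```

A gradient-descent optimizer would minimize the negative of this value. Samples counted in the clip fraction contribute no gradient, so a fraction near one suggests the policy has already moved about as far as the clip allows on that batch.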

The asymmetry is the clever part. Read it case by case. When the advantage is positive (a good action, push its probability up) the objective stops rewarding any increase in \(r_t\) past \(1+\varepsilon\): the upside is capped, so the update cannot over-commit. When the advantage is negative (a bad action, push its probability down) the clipping floors the ratio at \(1-\varepsilon\), again removing the incentive to over-correct. Crucially, the \(\min\) only ever removes incentive when the policy has already moved far enough in the favorable direction. It never clips in a way that prevents undoing a too-large step, so the policy can always claw back from a mistake. That single property is why PPO is forgiving where vanilla policy gradients are not.
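
The four cases can be checked numerically. The sketch below estimates the derivative of the per-sample objective with respect to the ratio by a central finite difference; the helper `d_obj_dr` and the chosen test points are illustrative assumptions.

```python
import numpy as np

def ppo_obj(r, A, eps=0.2):
    return np.minimum(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

def d_obj_dr(r, A, eps=0.2, h=1e-6):
    """Central finite difference of the per-sample objective with respect to the ratio."""
    return (ppo_obj(r + h, A, eps) - ppo_obj(r - h, A, eps)) / (2 * h)

cases = [
    ("A>0, r=1.5 (already moved up far enough)",   1.5, +1.0),
    ("A>0, r=0.5 (moved the wrong way)",           0.5, +1.0),
    ("A<0, r=0.5 (already moved down far enough)", 0.5, -1.0),
    ("A<0, r=1.5 (moved the wrong way)",           1.5, -1.0),
]
for label, r, A in cases:
    print(f"{label:45s} dL/dr = {d_obj_dr(r, A):+.1f}")
```

The derivative vanishes only in the two cases where the ratio has already overshot in the favorable direction. In the two "wrong way" cases it stays equal to the advantage, which is the signal that lets the policy recover.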

PPO clips the probability ratio \(r_t(\theta)\) to the interval \([1-\varepsilon,\, 1+\varepsilon]\). For the standard \(\varepsilon = 0.2\), what is the upper end of the clipping interval, \(1 + \varepsilon\)?
\(1 + \varepsilon = 1 + 0.2 = \) 1.2. (The lower end is \(1 - 0.2 = 0.8\).) Once the new policy is more than 20% more likely to take an advantageous action than the old policy was, the clipped objective stops rewarding any further increase — the trust region, made of arithmetic.
For one sample with ratio \(r_t = 1.5\), advantage \(\hat{A}_t = +2\), and \(\varepsilon = 0.2\), evaluate the per-sample PPO objective \(\min\!\big(r_t\hat{A}_t,\ \mathrm{clip}(r_t,\,0.8,\,1.2)\,\hat{A}_t\big)\) from EQ R5.4.
Unclipped term: \(r_t\hat{A}_t = 1.5 \times 2 = 3\). Clipped ratio: \(\mathrm{clip}(1.5, 0.8, 1.2) = 1.2\), so the clipped term is \(1.2 \times 2 = 2.4\). The objective is the minimum: \(\min(3,\ 2.4) = \) 2.4. The advantage is positive and the ratio has already exceeded \(1+\varepsilon\), so PPO caps the reward at the clipped value — pushing \(r_t\) higher would buy nothing.
PYTHON · RUNNABLE IN-BROWSER
# PPO clipped objective on toy advantages, swept over the ratio r (EQ R5.4)
import numpy as np

eps = 0.2
r = np.linspace(0.0, 2.0, 21)                  # candidate probability ratios

def ppo_obj(r, A, eps=0.2):
    unclipped = r * A
    clipped   = np.clip(r, 1 - eps, 1 + eps) * A
    return np.minimum(unclipped, clipped)      # pessimistic lower bound

A_pos, A_neg = +1.0, -1.0
L_pos = ppo_obj(r, A_pos, eps)
L_neg = ppo_obj(r, A_neg, eps)

print(" r     L(A=+1)   L(A=-1)")
for ri, lp, ln in zip(r, L_pos, L_neg):
    print(f"{ri:4.2f}   {lp:+6.3f}   {ln:+6.3f}")

print(f"\nclip interval at eps={eps}: [{1-eps:.2f}, {1+eps:.2f}]")
print("A>0: objective FLATTENS once r exceeds 1.20 (no reward for over-stepping).")
print("A<0: objective FLATTENS once r drops below 0.80 (no reward for over-correcting).")
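# plot_xy is not defined in this listing; it is presumably supplied by the in-browser runner.
if "plot_xy" in globals():
    plot_xy(r.tolist(), L_pos.tolist())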
INSTRUMENT R5.2 — PPO CLIP VISUALIZER · L^CLIP vs RATIO · EQ R5.4
CLIP INTERVAL: [0.80, 1.20] · OBJECTIVE FLATTENS AT r = 1.20
The mint curve is the clipped objective \(L^{\text{CLIP}}\) as a function of the ratio \(r_t\); the faint grey line is the unclipped \(r_t\hat{A}_t\) that vanilla policy gradients would chase off to infinity. For a positive advantage the mint curve rises, then goes flat past \(1+\varepsilon\): the gradient dies, so no update can over-step. For a negative advantage the flat shoulder appears below \(1-\varepsilon\) instead. Widening \(\varepsilon\) loosens the trust region (bigger, riskier steps); narrowing it gives timid, stable ones. The default \(\varepsilon = 0.2\) is what most PPO code ships with.
5.4

Continuous control — DDPG & SAC

DQN's \(\max_{a'}\) over actions is fine when there are four buttons; it is intractable when the action is a vector of continuous torques, because the maximization is itself an optimization problem at every step. Continuous control — robot arms, locomotion, autonomous driving — needs a different shape of algorithm. The dominant family is actor–critic, which keeps a learned policy (the actor) and a learned value (the critic) and lets them improve each other.

  • DDPG (Deep Deterministic Policy Gradient). An off-policy actor–critic that you can read as "DQN for continuous actions". A deterministic actor \(\mu_\theta(s)\) outputs the action directly, so the critic's \(\max\) is replaced by \(Q(s, \mu_\theta(s))\) — no inner optimization. It inherits DQN's replay buffer and target networks (with Polyak updates, EQ R5.3) and adds exploration noise to the actor's output. Powerful, but notoriously sensitive to hyperparameters.
  • TD3 (Twin Delayed DDPG). Three targeted fixes for DDPG's pathologies: twin critics (take the minimum of two Q-networks to fight the over-estimation bias DQN also suffers); delayed actor updates (update the policy less often than the critic, so it chases a more settled target); and target-policy smoothing (add noise to the target action so the critic cannot exploit sharp peaks). Together they make off-policy continuous control far more reliable.
  • SAC (Soft Actor–Critic). The current default for continuous control. SAC is built on maximum-entropy RL: the objective adds an entropy bonus, so the agent is rewarded not only for return but for staying as random as it can while still doing well. This yields strong, automatic exploration, robustness to hyperparameters, and excellent sample efficiency.
EQ R5.5 — MAXIMUM-ENTROPY OBJECTIVE (SAC) $$ J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \pi}\!\Big[\, R(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big], \qquad \mathcal{H}\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[-\log \pi(a\mid s)\big] $$
The familiar return, plus a per-step entropy bonus \(\alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\) that pays the agent to keep its action distribution spread out. The expectation form of \(\mathcal{H}\) covers both cases: a sum over actions when they are discrete, an integral over the density when they are continuous, as in SAC. The temperature \(\alpha\) sets the price of randomness: large \(\alpha\) keeps the policy exploratory and stochastic, \(\alpha \to 0\) recovers ordinary reward maximization. Entropy turns exploration from a bolt-on heuristic (the ε-greedy noise of DQN) into a first-class term of the objective, and modern SAC tunes \(\alpha\) automatically to hold entropy at a target, removing one of the most painful knobs in RL. The price: SAC is an off-policy method that, like DDPG/TD3, leans on replay buffers and target networks for stability.
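
The role of the temperature is easiest to see in a single state with a handful of discrete actions, where the policy that maximizes expected value plus \(\alpha\) times entropy is a softmax of the action values divided by \(\alpha\). The sketch below uses that simplification; it is not SAC itself, which handles continuous actions and learns its critic. The action values and function names are made up for illustration.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                                   # treat 0 * log(0) as 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def soft_objective(p, q, alpha):
    """Expected value plus alpha times entropy, for a single state."""
    return float(np.dot(p, q)) + alpha * entropy(p)

def softmax_policy(q, alpha):
    z = (q - q.max()) / alpha                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 0.8, 0.0])                    # made-up action values for one state
greedy  = np.array([1.0, 0.0, 0.0])
uniform = np.full(3, 1.0 / 3.0)

for alpha in (0.05, 0.5, 5.0):
    pi = softmax_policy(q, alpha)
    print(f"alpha={alpha:<4}  pi={pi.round(3).tolist()}  H={entropy(pi):.3f}  "
          f"J(pi)={soft_objective(pi, q, alpha):.3f}  "
          f"J(greedy)={soft_objective(greedy, q, alpha):.3f}  "
          f"J(uniform)={soft_objective(uniform, q, alpha):.3f}")
```

At small \(\alpha\) the softmax policy is nearly greedy; at large \(\alpha\) it approaches uniform. In every row its objective is at least as high as that of the greedy and uniform policies.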

A useful mental map: PPO is the robust, on-policy default when you can afford to throw away data after each batch (and it dominates RLHF for that simplicity). SAC is the sample-efficient, off-policy default when interactions are expensive — a real robot, a slow simulator. DQN and its descendants own discrete-action problems. There is no universal winner; the right choice is dictated by action space, sample budget, and how much tuning you can tolerate.

5.5

Stability & reproducibility

Deep RL works — and it is also, honestly, the least reproducible corner of mainstream machine learning. The reason traces straight back to §5.1: the agent generates its own data, so a tiny early difference in behavior steers it toward an entirely different region of experience, and the gap compounds. The most uncomfortable symptom is seed sensitivity: the same algorithm, the same code, the same hyperparameters, changing only the random seed, can produce wildly different learning curves — one seed solving the task, another never leaving the floor.

This is not a rumor; it was documented carefully. Henderson et al. (2018) showed that reported results in deep-RL papers were routinely driven by a handful of lucky seeds, that the choice of seed could matter as much as the choice of algorithm, and that comparisons drawn from too few runs were often statistically meaningless. The practical consequences are now widely accepted:

  • Report many seeds, not one. A single learning curve is anecdote. Five to ten independent seeds, with the spread shown — not just the best or the mean — is the minimum honest unit of evidence.
  • Show the distribution. Mean ± standard deviation, or better, the interquartile range and confidence intervals; aggregate protocols like RLiable exist precisely to stop cherry-picking. The variance across seeds is itself a result — a high-variance method may be worse in practice than a lower-mean but reliable one.
  • Pin the stack. Environment version, library version, hardware, and every hyperparameter, because deep-RL outcomes are sensitive to all of them — implementation details that sound cosmetic (reward scaling, observation normalization, the exact advantage estimator) routinely swing final performance more than the headline algorithm does.
EQ R5.6 — WHAT A SINGLE SEED HIDES $$ \bar{G} = \frac{1}{N}\sum_{i=1}^{N} G^{(i)}, \qquad \mathrm{SE} = \frac{s}{\sqrt{N}}, \qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\big(G^{(i)} - \bar{G}\big)^2 $$
\(G^{(i)}\) is the final return of seed \(i\); \(\bar{G}\) the mean across \(N\) seeds; \(s\) the sample standard deviation; \(\mathrm{SE}\) the standard error of the mean, which shrinks only as \(1/\sqrt{N}\). With \(N = 1\) there is no \(s\) and no \(\mathrm{SE}\) — the number you report has an error bar you simply cannot see. Because deep-RL seed variance is large, the \(\sqrt{N}\) in the denominator is brutal: halving your uncertainty costs four times the compute. This is the arithmetic behind "run more seeds".
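
The interquartile range and confidence interval recommended above can be computed directly from per-seed returns. The sketch below uses a percentile bootstrap over seeds; the eight return values are made up for illustration, and with so few seeds the resulting interval is itself only a rough guide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up final returns for 8 seeds of one agent: a bimodal spread, for illustration only.
returns = np.array([182.0, 35.0, 171.0, 190.0, 48.0, 176.0, 22.0, 185.0])
N = len(returns)

mean = returns.mean()
se = returns.std(ddof=1) / np.sqrt(N)                       # EQ R5.6
q25, q75 = np.percentile(returns, [25, 75])                 # interquartile range

# Percentile bootstrap: resample seeds with replacement, recompute the mean each time.
resamples = rng.choice(returns, size=(10_000, N), replace=True)
boot_means = resamples.mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {mean:.1f}, SE = {se:.1f}")
print(f"IQR  = [{q25:.1f}, {q75:.1f}]")
print(f"95% bootstrap CI for the mean = [{ci_low:.1f}, {ci_high:.1f}]")
print(f"seeds needed to halve the SE  = {4 * N}")
```

Reporting the interval alongside the mean makes it visible when two methods' results overlap too much to call a winner.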
PYTHON · RUNNABLE IN-BROWSER
# Why one seed lies: variance of final return across seeds (EQ R5.6)
import numpy as np

# simulate 8 seeds of a high-variance deep-RL run: some solve it, some stall
rng = np.random.default_rng(7)
seeds = 8
# bimodal outcome: ~60% reach a good return ~180, ~40% get stuck near ~40
solved = rng.random(seeds) < 0.6
final = np.where(solved, rng.normal(180, 15, seeds),
                          rng.normal(40, 20, seeds)).clip(0)

mean = final.mean()
s    = final.std(ddof=1)                        # sample std (N-1)
se   = s / np.sqrt(seeds)                        # standard error of the mean

print("per-seed final return:", final.round(1).tolist())
print(f"mean    G_bar = {mean:6.1f}")
print(f"std     s     = {s:6.1f}   (this is what one seed cannot show you)")
print(f"std-err SE    = {se:6.1f}   (shrinks only as 1/sqrt(N))")
print(f"if you reported ONLY seed 0: {final[0]:.1f}  <- anecdote, not evidence")
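# plot_scatter is not defined in this listing; it is presumably supplied by the in-browser runner.
if "plot_scatter" in globals():
    plot_scatter(list(range(seeds)), final.tolist(), solved.astype(int).tolist())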
INSTRUMENT R5.3 — REWARD-CURVE VARIANCE ACROSS SEEDS · SAME ALGORITHM · DIFFERENT SEEDS · EQ R5.6
Every faint curve is one seed of the same deep-RL agent: identical code, identical hyperparameters, only the random seed differs. The bright mint line is the mean across them; the shaded band is ± one standard deviation. With N = 1 you are left with a single anecdotal curve that could be the lucky run or the doomed one, and you cannot tell which. As N grows the mean steadies while the band reveals the true spread the field learned to report. Raising the noise shows why high-variance methods demand many seeds before any comparison means anything.
PITFALLS

The deep-RL reproducibility checklist. (1) One-seed results are anecdotes: report ≥5 seeds, ideally with IQR/CIs. (2) Value estimates under the deadly triad can diverge silently; watch the Q-values, not just the reward. (3) Reward scaling and observation normalization can swing outcomes more than the algorithm name does; log them. (4) "Beats SOTA" from a different env version or evaluation protocol is not a comparison. (5) Tuning on the test environment is contamination, exactly as in supervised learning.

NEXT

The clip that stabilized policy gradients is about to stabilize something far larger. Chapter 06 turns PPO outward: the same clipped objective, with a language model as the policy and a learned reward model standing in for the environment, is the engine of RLHF, which, along with its leaner successors DPO and GRPO, aligns the models you talk to every day.

5.R

References

  1. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — the DQN paper; experience replay and the frozen target network (EQ R5.2) learning Atari from pixels.
  2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347 — the clipped surrogate objective (EQ R5.4) at the heart of §5.3.
  3. Schulman, J., Levine, S., Moritz, P., Jordan, M. I. & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477 — the KL-constrained trust region PPO approximates with a first-order clip.
  4. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290 — the maximum-entropy objective (EQ R5.5) and the continuous-control default of §5.4.
  5. van Hasselt, H., Guez, A. & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016 (arXiv:1509.06461) — double-DQN, decoupling action selection from evaluation to curb over-estimation.
  6. Lillicrap, T. P. et al. (2016). Continuous control with deep reinforcement learning. arXiv:1509.02971 — DDPG, the deterministic actor–critic for continuous actions (§5.4).
  7. Fujimoto, S., van Hoof, H. & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv:1802.09477 — TD3; twin critics, delayed updates, and target smoothing.
  8. Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI 2018 (arXiv:1709.06560) — the reproducibility and seed-variance study behind §5.5 and EQ R5.6.