Model-Free Value Methods — TD & Q-Learning

3.1

Learning without a model

Chapter 02 was a luxury we will now give up. There, the agent knew the environment — the transition function $P(s' \mid s, a)$ and the reward $R(s, a)$ — and could compute the optimal policy by pure thought, sweeping the Bellman equations until they converged. That is planning, and it is exactly the engine behind the Gridworld instrument of Chapter 01. But knowing $P$ and $R$ is a strong assumption that almost never holds. A robot does not have a probability table for how its motors slip on a given floor. A game-playing agent is not handed the rules as equations. The model is missing, and the agent must work without it.

This is the regime of model-free reinforcement learning: the agent never builds or is given a model of the world. Instead it learns directly from samples — actual transitions $(s, a, r, s')$ it experiences by acting. Where dynamic programming computes an expectation over all possible next states by summing against $P$, model-free methods replace that expectation with sampled experience. They do not ask "what is the average outcome of this action?"; they take the action, see one outcome, and nudge their estimate toward it. Average enough samples and you recover the expectation the model would have given you — the law of large numbers doing the work the model used to do.

Same goal — the optimal value function and policy — reached two ways. Planning sums against a known model; learning averages sampled experience. This chapter lives entirely on the right.

Two questions organize everything that follows, and they map onto two classical problems. Prediction (also called policy evaluation): given a fixed policy $\pi$, estimate its value function $V^\pi$ — how good is this way of behaving? Control: find a good policy in the first place, typically by estimating action-values $Q^\pi$ and improving the policy toward them. Sections 3.2 and 3.3 solve prediction two different ways; Sections 3.4 and 3.5 turn the better of them into control.

One design choice cuts across all of it and deserves naming now. A method is on-policy if it learns about the very policy it is using to act, and off-policy if it can learn about one policy (say, the greedy optimal one) while behaving according to another (say, an exploratory one). It sounds like a technicality. It is the single most consequential distinction in this chapter — it is exactly what separates Q-learning from SARSA — and we will see it decide how an agent behaves on the edge of a cliff.

A useful sanity check on the whole enterprise: a model-free agent can become superhuman at a game without ever being able to describe the game. It learns which moves are worth what, not why. That is a strength — no modeling effort — and a weakness — no transfer, no planning ahead, every new world relearned from scratch. The model-based methods of later chapters trade sample efficiency back for exactly that missing structure.

3.2

Monte-Carlo prediction

The most direct way to estimate a value from experience is to take the definition literally. The value $V^\pi(s)$ is the expected return from $s$ (Chapter 01, EQ R1.6). So play out complete episodes under $\pi$, and for each state record the actual return that followed it. Average those returns and you have a sample estimate of the expectation. This is the Monte-Carlo (MC) method: estimate an expectation by averaging samples of it.

EQ R3.1 — MONTE-CARLO VALUE ESTIMATE $$ V(s) \;\leftarrow\; V(s) + \alpha\,\big[\, G_t - V(s) \,\big], \qquad G_t = \sum_{k=0}^{\infty} \gamma^k\, R_{t+k+1} $$

After an episode ends, compute the actual return $G_t$ that followed each visit to $s$, and move the estimate a fraction $\alpha$ of the way toward it. With $\alpha = 1/N(s)$ — one over the number of visits — this is exactly the running sample mean. The target $G_t$ is the true, observed return: no model, no bootstrap, nothing estimated stands in for it. That purity is MC's defining virtue and its defining limitation.

The error term $G_t - V(s)$ is worth dwelling on, because the same shape recurs in every update rule in this chapter. It is a prediction error: the gap between what actually happened ($G_t$) and what we predicted would happen ($V(s)$). Learning is nothing but repeatedly shrinking that gap, with $\alpha$ setting how aggressively. A large $\alpha$ chases the latest sample; a small $\alpha$ averages patiently over many. Set $\alpha$ too high and the estimate jitters with noise; too low and it crawls.

MC's strengths are real. It is unbiased — $G_t$ is an honest sample of the return, so the estimate converges to the true $V^\pi$ with no systematic error. It makes no Markov assumption — it never reasons about next-state values, only whole returns, so it works even where the state representation is imperfect. And it is simple to state. But it has a hard limitation that motivates the rest of the chapter: you must wait until the episode ends to compute $G_t$. In a long episode that is slow; in a continuing task that never terminates, it is impossible. MC also tends to have high variance, because $G_t$ is the sum of a long chain of random rewards and random transitions — one unlucky tail can swing it wildly.

An episode from state $s$ yields rewards $1, 1, 1$ on three successive steps and then terminates. With $\gamma = 0.9$, what is the Monte-Carlo target $G_t$ used to update $V(s)$ in EQ R3.1?

The MC target is the full observed return: $G_t = 1 + 0.9\cdot 1 + 0.9^2\cdot 1 = 1 + 0.9 + 0.81 = $ 2.71. MC waits for the whole episode and uses this actual number — never an estimate of it.

PYTHON · RUNNABLE IN-BROWSER

# Monte-Carlo prediction on the 5-state random walk (Sutton & Barto, ex. 6.2)
# True values are linear: V(s) = s / 6.  Start every estimate at 0.5.
import numpy as np
rng = np.random.default_rng(1)

def episode():                         # random walk; left of 1 -> 0 (r=0), right of 5 -> 6 (r=1)
    s, traj = 3, []
    while 1 <= s <= 5:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        traj.append((s, 1.0 if s2 == 6 else 0.0))
        s = s2
    return traj

V = np.full(7, 0.5); V[0] = V[6] = 0.0    # terminals have value 0
alpha = 0.05
for _ in range(200):
    traj = episode()
    G = traj[-1][1]                       # gamma = 1, so every state's return = final reward
    for (s, _r) in traj:
        V[s] += alpha * (G - V[s])        # EQ R3.1: nudge toward the actual return

true = np.array([s / 6 for s in range(1, 6)])
print("MC estimate V[1..5] :", V[1:6].round(3))
print("true value  V[1..5] :", true.round(3))
print("mean abs error      :", round(float(np.abs(V[1:6] - true).mean()), 4))

edits are live — break it on purpose

3.3

Temporal-difference learning — TD(0)

Here is the idea this chapter is built around, and it is one of the most beautiful in machine learning. MC waits for the full return $G_t$. But recall the recursive form of the return (Chapter 01, EQ R1.4): $G_t = R_{t+1} + \gamma\, G_{t+1}$. We do not have $G_{t+1}$ — the episode has not finished — but we do have an estimate of it: $V(s_{t+1})$, our current guess for the value of the next state. So substitute the guess in for the unknown return. The update becomes:

EQ R3.2 — TD(0) UPDATE $$ V(s_t) \;\leftarrow\; V(s_t) + \alpha\,\big[\, \underbrace{R_{t+1} + \gamma\, V(s_{t+1})}_{\text{TD target}} - V(s_t) \,\big] $$

The target is no longer the full return but one real reward plus the discounted estimate of the rest. This is bootstrapping: updating an estimate from another estimate. It means you can learn from a single step, online, before the episode is anywhere near over — even in a task that never ends. The bracket is the TD error $\delta_t$, the surprise between what you expected and what one step of reality plus your own forecast now suggests.

The bracketed quantity is named: the temporal-difference error, $\delta_t = R_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)$. It is the difference between two successive predictions of the same return — one made before seeing $R_{t+1}$, one after — hence "temporal difference". When $\delta_t = 0$ everywhere, predictions are self-consistent and learning stops; this is the sampled, online cousin of the Bellman fixed point from Chapter 02. The TD error is not merely an algorithmic device: dopamine neurons in the brain encode a signal strikingly close to $\delta_t$, which is part of why TD learning is one of the rare ideas that crossed from machine learning into neuroscience.

EQ R3.3 — THE TD ERROR $$ \delta_t \;=\; R_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) $$

$\delta_t > 0$: the step went better than predicted — raise $V(s_t)$. $\delta_t < 0$: worse than predicted — lower it. Every value method in modern RL, up to and including the critics inside today's policy-gradient and RLHF stacks, is ultimately driving a TD error to zero. MC's error $G_t - V(s_t)$ is the special case where you wait for the whole return instead of bootstrapping after one step.

The trade-off between MC and TD is a genuine one, not a free lunch, and the honest framing is bias versus variance. The TD target $R_{t+1} + \gamma V(s_{t+1})$ is biased: it leans on $V(s_{t+1})$, which is only a guess and is wrong early in training, so TD is "learning from a guess". But it has much lower variance than MC, because it depends on only one random reward and one transition rather than an entire random trajectory. In practice TD's lower variance usually wins — it converges faster on most problems — but MC's lack of bias and its independence from the Markov assumption keep it relevant, and the two are endpoints of a spectrum (TD($\lambda$), $n$-step returns) that interpolates between them. There is no universal winner; which is better is genuinely problem-dependent, a point Sutton & Barto are careful to make and which remains true today.

Property	Monte-Carlo	TD(0)
Target	G_t (full return)	R + γ·V(s′) (one step + bootstrap)
Updates	at episode end only	every step, online
Continuing tasks	cannot (needs termination)	yes
Bias	unbiased	biased (bootstraps)
Variance	high	low
Needs Markov state	no	yes (relies on V(s′))

A TD(0) agent in state $s$ takes a step, gets reward $R_{t+1} = 0$, and lands in $s'$ with current estimate $V(s') = 1$. The old estimate is $V(s) = 0.5$ and $\gamma = 1$. What is the TD error $\delta_t$ (EQ R3.3)?

$\delta_t = R_{t+1} + \gamma\,V(s') - V(s) = 0 + 1\cdot 1 - 0.5 = $ 0.5. Positive, so the step beat expectations and $V(s)$ gets raised by $\alpha\cdot 0.5$ — even though the episode has not ended and no real return was ever observed.

PYTHON · RUNNABLE IN-BROWSER

# TD(0) value estimation vs Monte-Carlo on the same random walk
import numpy as np
rng = np.random.default_rng(1)

def episode():
    s, traj = 3, []
    while 1 <= s <= 5:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        traj.append((s, 1.0 if s2 == 6 else 0.0, s2))
        s = s2
    return traj

true = np.array([s / 6 for s in range(1, 6)])
Vtd = np.full(7, 0.5); Vtd[0] = Vtd[6] = 0.0      # TD(0)
Vmc = np.full(7, 0.5); Vmc[0] = Vmc[6] = 0.0      # Monte-Carlo
a = 0.05
for _ in range(150):
    traj = episode()
    for (s, r, s2) in traj:                         # TD: bootstrap every step (EQ R3.2)
        Vtd[s] += a * (r + 1.0 * Vtd[s2] - Vtd[s])
    G = traj[-1][1]
    for (s, r, s2) in traj:                         # MC: wait for the return (EQ R3.1)
        Vmc[s] += a * (G - Vmc[s])

print("state         :", list(range(1, 6)))
print("TD(0) estimate:", Vtd[1:6].round(3).tolist())
print("MC    estimate:", Vmc[1:6].round(3).tolist())
print("true value    :", true.round(3).tolist())
plot_xy(list(range(1, 6)), Vtd[1:6].tolist())      # TD curve vs the linear truth

edits are live — break it on purpose

INSTRUMENT R3.1 — TD vs MC UPDATEONE STEP · TD TARGET = R + γV(s′) · MC TARGET = G

OLD ESTIMATE V(s) 0.50

REWARD R 0.00

NEXT-STATE V(s′) 1.00

FULL RETURN G 1.50

LEARNING RATE α 0.50

DISCOUNT γ 0.90

TD TARGET R+γV(s′)

—

TD ERROR δ

—

NEW V(s) — TD

—

NEW V(s) — MC

—

The two markers are the targets each method aims at from the same old estimate (the dashed line): mint is the bootstrapped TD target $R+\gamma V(s')$, blue is the full Monte-Carlo return $G$. The arrow shows the step $\alpha\,[\text{target}-V(s)]$ each takes. Slide $V(s')$ and only the TD target moves — that dependence on a guess is the bias TD pays for learning online. Slide $G$ and only MC moves — that swing is the variance MC pays for being unbiased. Set $\alpha = 1$ and each estimate jumps straight onto its target in a single step.

3.4

Q-learning (off-policy)

Prediction estimates the value of a given policy. Control finds a good policy. To do control without a model we need action-values $Q(s, a)$, not state-values $V(s)$ — because, as Chapter 01 argued, you cannot act greedily from $V$ without knowing where actions lead, but greedy action selection from $Q$ is trivial: take $\arg\max_a Q(s, a)$, no model required. So we run TD on $Q$ instead of $V$. The most famous instance is Q-learning (Watkins, 1989), and it has a remarkable property hidden in one symbol.

EQ R3.4 — Q-LEARNING UPDATE $$ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha\,\Big[\, R_{t+1} + \gamma\, \underbrace{\max_{a'} Q(s_{t+1}, a')}_{\text{greedy bootstrap}} - Q(s_t, a_t) \,\Big] $$

The bootstrap uses $\max_{a'} Q(s_{t+1}, a')$ — the value of the best next action, regardless of what the agent actually does next. This is what makes Q-learning off-policy: it learns the optimal action-value function $Q^*$ while behaving with any sufficiently exploratory policy. The agent can blunder around ε-greedily, even act randomly, and still converge to $Q^*$ — provided every state–action pair keeps getting visited and $\alpha$ decays appropriately. This decoupling of behavior from learning is the property that made deep Q-networks possible.

The convergence guarantee is one of the cornerstone results of the field. Watkins & Dayan (1992) proved that for a finite MDP, tabular Q-learning converges to the optimal $Q^*$ with probability 1, under two conditions: every state–action pair is visited infinitely often, and the learning rate satisfies the Robbins–Monro conditions $\sum_t \alpha_t = \infty,\ \sum_t \alpha_t^2 < \infty$ (informally: $\alpha$ must shrink, but not too fast). It is worth being honest about the gap between theory and practice here: those conditions are about the tabular case. The moment $Q$ is approximated by a neural network — as in deep Q-learning — the guarantee evaporates, and the combination of bootstrapping, off-policy learning, and function approximation (the "deadly triad") can diverge. Much of deep RL engineering is heuristics to tame that triad; it is contested territory, not settled.

One more honest caveat that motivated a whole follow-up algorithm: the $\max$ operator introduces maximization bias. Because $\max$ over noisy estimates systematically picks the ones that happen to be overestimated, Q-learning tends to overestimate action-values. Double Q-learning (van Hasselt, 2010) fixes this by decoupling the action selected by the max from the value used to evaluate it — the idea behind Double DQN. Q-learning is foundational, not flawless.

Apply a single Q-learning update (EQ R3.4) with $\alpha = 0.5$, reward $r = 1$, discount $\gamma = 0$, current $Q(s,a) = 0$, and $\max_{a'} Q(s', a') = 0$. What is the new $Q(s,a)$?

$Q \leftarrow 0 + 0.5\big[\,1 + 0\cdot 0 - 0\,\big] = 0.5 \times 1 = $ 0.5. With $\gamma = 0$ the bootstrap term drops out entirely, so the update is purely a half-step toward the immediate reward — exactly the myopic, one-step learning a zero discount implies.

True or false: Q-learning is off-policy — it learns about the greedy optimal policy via the $\max_{a'}$ bootstrap while behaving with a different, exploratory policy — whereas SARSA is on-policy, bootstrapping from the action it actually takes next. (Answer true or false.)

Q-learning's target uses $\max_{a'} Q(s', a')$, the value of the best next action irrespective of what the agent does — so it learns $Q^*$ regardless of the behavior policy: off-policy. SARSA's target uses $Q(s', a')$ for the action $a'$ the agent will actually take under its current (e.g. ε-greedy) policy — so it learns the value of that very policy: on-policy. The statement is true.

PYTHON · RUNNABLE IN-BROWSER

# Tabular Q-learning on a 1x4 corridor: states 0..3, state 3 is the goal (+1)
import numpy as np
rng = np.random.default_rng(0)
n, gamma, alpha, eps = 4, 0.9, 0.5, 0.2
Q = np.zeros((n, 2))                        # actions: 0 = left, 1 = right

def step(s, a):
    s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

for ep in range(2000):                      # learn by acting eps-greedily
    s = 0
    for _ in range(50):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * Q[s2].max())   # EQ R3.4: greedy bootstrap
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
        if done: break

print("learned Q-values (rows = states 0..3, cols = [left, right]):")
print(Q.round(3))
print("greedy policy   :", ["L R"[int(np.argmax(Q[s])) * 2] for s in range(n)])
print("optimal V*(0)   :", round(gamma ** 3, 3), "(= gamma^3, 3 steps to the goal)")

edits are live — break it on purpose

INSTRUMENT R3.2 — Q-LEARNING GRIDWORLDTHE Q-TABLE & GREEDY POLICY FORM LIVE · EQ R3.4

LEARNING RATE α 0.50

EXPLORATION ε 0.20

DISCOUNT γ 0.95

RUN

EPISODES TRAINED

max_a Q AT START

—

GREEDY PATH LENGTH

—

A 4×4 gridworld with a goal (+1, top-right), a trap (−1), a step cost, and stochastic-free moves. Each cell shows its greedy action and $\max_a Q(s,a)$; the shading is that value. Press TRAIN and watch the Q-table fill in from the goal outward — value propagates one cell per episode-batch, exactly as the bootstrap in EQ R3.4 carries reward backward through the chain. Drop ε toward 0 and the agent stops exploring: it may lock onto a decent-but-suboptimal path because it never tried the alternatives. Raise it and the table fills more completely but the agent wanders. Unlike Chapter 01's instrument, this learns purely from sampled steps — it is never told where the goal is.

3.5

SARSA (on-policy) & exploration

Change one symbol in the Q-learning update and you get a different algorithm with a different personality. Instead of bootstrapping from the best next action, bootstrap from the action the agent actually takes next under its current policy. The update now uses the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ — state, action, reward, state, action — which is exactly where the name SARSA comes from.

EQ R3.5 — SARSA UPDATE $$ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha\,\Big[\, R_{t+1} + \gamma\, \underbrace{Q(s_{t+1}, a_{t+1})}_{\text{action actually taken}} - Q(s_t, a_t) \,\Big] $$

The only change from EQ R3.4 is $\max_{a'} Q(s_{t+1}, a') \to Q(s_{t+1}, a_{t+1})$, where $a_{t+1}$ is drawn from the agent's own (e.g. ε-greedy) policy. SARSA therefore learns the value of the policy it is actually following — including the cost of its own exploration — which is exactly what "on-policy" means. Q-learning evaluates a greedy policy it does not follow; SARSA evaluates the noisy policy it does.

That difference is not academic, and the cleanest illustration is the cliff-walking example from Sutton & Barto. An agent must walk from start to goal along the edge of a cliff; stepping off costs a large penalty. Both algorithms use ε-greedy exploration. Q-learning learns the optimal path — right along the cliff edge — because its $\max$ bootstrap evaluates the greedy policy, which never falls. But because it still explores while following that knife-edge route, its random ε-steps occasionally pitch it off the cliff, so its online reward during training is worse. SARSA learns a safer path one row back from the edge, because its on-policy target accounts for the fact that it sometimes takes random actions — so it learns the value of behaving exploratorily and routes around the risk. Q-learning finds the better policy; SARSA earns more reward while learning. Which you want depends on whether failures during training are cheap or catastrophic — a real engineering decision, not a theoretical curiosity.

INTUITION

Q-learning is an optimist; SARSA is a realist. Q-learning assumes it will act greedily from the next state on, so it learns the value of the best-case continuation. SARSA assumes it will keep exploring like it currently does, so it learns the value of its actual, imperfect behavior. As $\varepsilon \to 0$ the two converge — with no exploration there is no difference between "best next action" and "action actually taken".

Both algorithms lean on the exploration we met in Chapter 01: ε-greedy with an annealed $\varepsilon$. The annealing is not optional polish — it is what reconciles two requirements that pull in opposite directions. Convergence needs every state–action pair visited infinitely often (so $\varepsilon$ must stay positive), but a good final policy needs the agent to eventually stop throwing away reward on random moves (so $\varepsilon$ must vanish). The resolution is GLIE — Greedy in the Limit with Infinite Exploration — schedules that keep $\varepsilon > 0$ forever but send it to 0, the textbook example being $\varepsilon_t = 1/t$. A practical schedule decays $\varepsilon$ from near 1 toward a small floor; the shape of that decay is one of the most-tuned knobs in applied RL.

EQ R3.6 — EXPONENTIAL ε-DECAY $$ \varepsilon_t \;=\; \varepsilon_{\min} + (\varepsilon_0 - \varepsilon_{\min})\, e^{-t/\tau} $$

Start at $\varepsilon_0$ (often 1.0), decay with time constant $\tau$ toward a floor $\varepsilon_{\min}$ (often 0.01–0.05). Early on the agent explores widely and fills in its Q-table; late on it exploits what it has learned. The floor matters: a small permanent $\varepsilon$ hedges against a non-stationary world where the best action can change. Too fast a decay and the agent commits before it has seen enough — premature exploitation, the most common silent failure in applied value-based RL.

An exponential ε-decay schedule (EQ R3.6) uses $\varepsilon_0 = 1.0$, $\varepsilon_{\min} = 0.05$, and time constant $\tau = 500$. As $t \to \infty$, what value does $\varepsilon_t$ approach?

As $t \to \infty$, $e^{-t/\tau} \to 0$, so the decaying term $(\varepsilon_0 - \varepsilon_{\min})\,e^{-t/\tau} \to 0$ and only the floor survives: $\varepsilon_t \to \varepsilon_{\min} = $ 0.05. The agent never stops exploring entirely — it keeps a 5% random-action rate forever, a hedge against a changing world.

PYTHON · RUNNABLE IN-BROWSER

# Exponential epsilon-decay (EQ R3.6) and what fraction of moves stay random
import numpy as np

eps0, eps_min, tau = 1.0, 0.05, 500.0
t = np.arange(0, 3000)
eps = eps_min + (eps0 - eps_min) * np.exp(-t / tau)   # EQ R3.6

for step in (0, 250, 500, 1000, 3000 - 1):
    print(f"t = {step:5d}   epsilon = {eps[step]:.3f}   "
          f"P(random move) = {eps[step]:.1%}")

print(f"\nfloor approached as t -> inf : {eps_min}")
print(f"epsilon at one time-constant : {eps[int(tau)]:.3f}  "
      f"(~ eps_min + 0.368*(eps0-eps_min))")
plot_xy(t.tolist(), eps.tolist())                     # the classic decay curve

edits are live — break it on purpose

INSTRUMENT R3.3 — ε-DECAY EXPLOREREXPLORE-THEN-EXPLOIT · EQ R3.6

START ε₀ 1.00

FLOOR ε_min 0.05

TIME CONSTANT τ 500

ε AT STEP 0

—

ε AT τ (1 e-fold)

—

STEP TO REACH ε = 0.1

—

The curve is $\varepsilon_t$ over training steps: a high-exploration mint region early, decaying to a thin blue floor. The shaded area under the curve is roughly the total "exploration budget" the agent spends. Shrink $\tau$ and the agent commits fast — efficient if the world is simple, premature if it is not. Raise the floor and it never fully exploits — wasteful in a fixed world, prudent in a changing one. There is no universally right curve; this is a budget you allocate against how hard the problem is and how costly mistakes are.

Value methods learn what every action is worth, then act greedily — but they choke when actions are continuous or the state space is too vast to tabulate. Chapter 04 takes the other road: policy gradients, which parameterize the policy directly and push its parameters up the gradient of expected return. We will meet REINFORCE, the variance problem that nearly sinks it, and the actor–critic methods that put a TD-learned value function (everything you just built) back to work as the critic.

3.R

References

Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the off-policy control algorithm of EQ R3.4 and its convergence proof to $Q^*$.
Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3 — the original TD(0) and TD(λ) prediction methods (EQ R3.2, EQ R3.3) and the bootstrapping idea at the heart of this chapter.
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical treatment of MC, TD, Q-learning, SARSA, the cliff-walking comparison, and GLIE exploration.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; off-policy Q-learning (EQ R3.4) with a neural Q-function, experience replay, and a target network.
van Hasselt, H. (2010). Double Q-learning. NeurIPS 23 — diagnoses and corrects the maximization bias of the $\max$ operator in EQ R3.4.
Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning 38 — convergence of SARSA (EQ R3.5) and the GLIE exploration conditions of §3.5.