Learning without a model
Chapter 02 was a luxury we will now give up. There, the agent knew the environment — the transition function \(P(s' \mid s, a)\) and the reward \(R(s, a)\) — and could compute the optimal policy by pure thought, sweeping the Bellman equations until they converged. That is planning, and it is exactly the engine behind the Gridworld instrument of Chapter 01. But knowing \(P\) and \(R\) is a strong assumption that almost never holds. A robot does not have a probability table for how its motors slip on a given floor. A game-playing agent is not handed the rules as equations. The model is missing, and the agent must work without it.
This is the regime of model-free reinforcement learning: the agent never builds or is given a model of the world. Instead it learns directly from samples — actual transitions \((s, a, r, s')\) it experiences by acting. Where dynamic programming computes an expectation over all possible next states by summing against \(P\), model-free methods replace that expectation with sampled experience. They do not ask "what is the average outcome of this action?"; they take the action, see one outcome, and nudge their estimate toward it. Average enough samples and you recover the expectation the model would have given you — the law of large numbers doing the work the model used to do.
Two questions organize everything that follows, and they map onto two classical problems. Prediction (also called policy evaluation): given a fixed policy \(\pi\), estimate its value function \(V^\pi\) — how good is this way of behaving? Control: find a good policy in the first place, typically by estimating action-values \(Q^\pi\) and improving the policy toward them. Sections 3.2 and 3.3 solve prediction two different ways; Sections 3.4 and 3.5 turn the better of them into control.
One design choice cuts across all of it and deserves naming now. A method is on-policy if it learns about the very policy it is using to act, and off-policy if it can learn about one policy (say, the greedy optimal one) while behaving according to another (say, an exploratory one). It sounds like a technicality. It is the single most consequential distinction in this chapter — it is exactly what separates Q-learning from SARSA — and we will see it decide how an agent behaves on the edge of a cliff.
A useful sanity check on the whole enterprise: a model-free agent can become superhuman at a game without ever being able to describe the game. It learns which moves are worth what, not why. That is a strength — no modeling effort — and a weakness — no transfer, no planning ahead, every new world relearned from scratch. The model-based methods of later chapters trade sample efficiency back for exactly that missing structure.
Monte-Carlo prediction
The most direct way to estimate a value from experience is to take the definition literally. The value \(V^\pi(s)\) is the expected return from \(s\) (Chapter 01, EQ R1.6). So play out complete episodes under \(\pi\), and for each state record the actual return that followed it. Average those returns and you have a sample estimate of the expectation. This is the Monte-Carlo (MC) method: estimate an expectation by averaging samples of it.
The error term \(G_t - V(s)\) is worth dwelling on, because the same shape recurs in every update rule in this chapter. It is a prediction error: the gap between what actually happened (\(G_t\)) and what we predicted would happen (\(V(s)\)). Learning is nothing but repeatedly shrinking that gap, with \(\alpha\) setting how aggressively. A large \(\alpha\) chases the latest sample; a small \(\alpha\) averages patiently over many. Set \(\alpha\) too high and the estimate jitters with noise; too low and it crawls.
MC's strengths are real. It is unbiased — \(G_t\) is an honest sample of the return, so the estimate converges to the true \(V^\pi\) with no systematic error. It makes no Markov assumption — it never reasons about next-state values, only whole returns, so it works even where the state representation is imperfect. And it is simple to state. But it has a hard limitation that motivates the rest of the chapter: you must wait until the episode ends to compute \(G_t\). In a long episode that is slow; in a continuing task that never terminates, it is impossible. MC also tends to have high variance, because \(G_t\) is the sum of a long chain of random rewards and random transitions — one unlucky tail can swing it wildly.
# Monte-Carlo prediction on the 5-state random walk (Sutton & Barto, ex. 6.2)
# True values are linear: V(s) = s / 6. Start every estimate at 0.5.
import numpy as np
rng = np.random.default_rng(1)
def episode(): # random walk; left of 1 -> 0 (r=0), right of 5 -> 6 (r=1)
s, traj = 3, []
while 1 <= s <= 5:
s2 = s + (1 if rng.random() < 0.5 else -1)
traj.append((s, 1.0 if s2 == 6 else 0.0))
s = s2
return traj
V = np.full(7, 0.5); V[0] = V[6] = 0.0 # terminals have value 0
alpha = 0.05
for _ in range(200):
traj = episode()
G = traj[-1][1] # gamma = 1, so every state's return = final reward
for (s, _r) in traj:
V[s] += alpha * (G - V[s]) # EQ R3.1: nudge toward the actual return
true = np.array([s / 6 for s in range(1, 6)])
print("MC estimate V[1..5] :", V[1:6].round(3))
print("true value V[1..5] :", true.round(3))
print("mean abs error :", round(float(np.abs(V[1:6] - true).mean()), 4))
Temporal-difference learning — TD(0)
Here is the idea this chapter is built around, and it is one of the most beautiful in machine learning. MC waits for the full return \(G_t\). But recall the recursive form of the return (Chapter 01, EQ R1.4): \(G_t = R_{t+1} + \gamma\, G_{t+1}\). We do not have \(G_{t+1}\) — the episode has not finished — but we do have an estimate of it: \(V(s_{t+1})\), our current guess for the value of the next state. So substitute the guess in for the unknown return. The update becomes:
The bracketed quantity is named: the temporal-difference error, \(\delta_t = R_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)\). It is the difference between two successive predictions of the same return — one made before seeing \(R_{t+1}\), one after — hence "temporal difference". When \(\delta_t = 0\) everywhere, predictions are self-consistent and learning stops; this is the sampled, online cousin of the Bellman fixed point from Chapter 02. The TD error is not merely an algorithmic device: dopamine neurons in the brain encode a signal strikingly close to \(\delta_t\), which is part of why TD learning is one of the rare ideas that crossed from machine learning into neuroscience.
The trade-off between MC and TD is a genuine one, not a free lunch, and the honest framing is bias versus variance. The TD target \(R_{t+1} + \gamma V(s_{t+1})\) is biased: it leans on \(V(s_{t+1})\), which is only a guess and is wrong early in training, so TD is "learning from a guess". But it has much lower variance than MC, because it depends on only one random reward and one transition rather than an entire random trajectory. In practice TD's lower variance usually wins — it converges faster on most problems — but MC's lack of bias and its independence from the Markov assumption keep it relevant, and the two are endpoints of a spectrum (TD(\(\lambda\)), \(n\)-step returns) that interpolates between them. There is no universal winner; which is better is genuinely problem-dependent, a point Sutton & Barto are careful to make and which remains true today.
| Property | Monte-Carlo | TD(0) |
|---|---|---|
| Target | G_t (full return) | R + γ·V(s′) (one step + bootstrap) |
| Updates | at episode end only | every step, online |
| Continuing tasks | cannot (needs termination) | yes |
| Bias | unbiased | biased (bootstraps) |
| Variance | high | low |
| Needs Markov state | no | yes (relies on V(s′)) |
# TD(0) value estimation vs Monte-Carlo on the same random walk
import numpy as np
rng = np.random.default_rng(1)
def episode():
s, traj = 3, []
while 1 <= s <= 5:
s2 = s + (1 if rng.random() < 0.5 else -1)
traj.append((s, 1.0 if s2 == 6 else 0.0, s2))
s = s2
return traj
true = np.array([s / 6 for s in range(1, 6)])
Vtd = np.full(7, 0.5); Vtd[0] = Vtd[6] = 0.0 # TD(0)
Vmc = np.full(7, 0.5); Vmc[0] = Vmc[6] = 0.0 # Monte-Carlo
a = 0.05
for _ in range(150):
traj = episode()
for (s, r, s2) in traj: # TD: bootstrap every step (EQ R3.2)
Vtd[s] += a * (r + 1.0 * Vtd[s2] - Vtd[s])
G = traj[-1][1]
for (s, r, s2) in traj: # MC: wait for the return (EQ R3.1)
Vmc[s] += a * (G - Vmc[s])
print("state :", list(range(1, 6)))
print("TD(0) estimate:", Vtd[1:6].round(3).tolist())
print("MC estimate:", Vmc[1:6].round(3).tolist())
print("true value :", true.round(3).tolist())
plot_xy(list(range(1, 6)), Vtd[1:6].tolist()) # TD curve vs the linear truth
Q-learning (off-policy)
Prediction estimates the value of a given policy. Control finds a good policy. To do control without a model we need action-values \(Q(s, a)\), not state-values \(V(s)\) — because, as Chapter 01 argued, you cannot act greedily from \(V\) without knowing where actions lead, but greedy action selection from \(Q\) is trivial: take \(\arg\max_a Q(s, a)\), no model required. So we run TD on \(Q\) instead of \(V\). The most famous instance is Q-learning (Watkins, 1989), and it has a remarkable property hidden in one symbol.
The convergence guarantee is one of the cornerstone results of the field. Watkins & Dayan (1992) proved that for a finite MDP, tabular Q-learning converges to the optimal \(Q^*\) with probability 1, under two conditions: every state–action pair is visited infinitely often, and the learning rate satisfies the Robbins–Monro conditions \(\sum_t \alpha_t = \infty,\ \sum_t \alpha_t^2 < \infty\) (informally: \(\alpha\) must shrink, but not too fast). It is worth being honest about the gap between theory and practice here: those conditions are about the tabular case. The moment \(Q\) is approximated by a neural network — as in deep Q-learning — the guarantee evaporates, and the combination of bootstrapping, off-policy learning, and function approximation (the "deadly triad") can diverge. Much of deep RL engineering is heuristics to tame that triad; it is contested territory, not settled.
One more honest caveat that motivated a whole follow-up algorithm: the \(\max\) operator introduces maximization bias. Because \(\max\) over noisy estimates systematically picks the ones that happen to be overestimated, Q-learning tends to overestimate action-values. Double Q-learning (van Hasselt, 2010) fixes this by decoupling the action selected by the max from the value used to evaluate it — the idea behind Double DQN. Q-learning is foundational, not flawless.
# Tabular Q-learning on a 1x4 corridor: states 0..3, state 3 is the goal (+1)
import numpy as np
rng = np.random.default_rng(0)
n, gamma, alpha, eps = 4, 0.9, 0.5, 0.2
Q = np.zeros((n, 2)) # actions: 0 = left, 1 = right
def step(s, a):
s2 = min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
return s2, (1.0 if s2 == 3 else 0.0), s2 == 3
for ep in range(2000): # learn by acting eps-greedily
s = 0
for _ in range(50):
a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
s2, r, done = step(s, a)
target = r + (0.0 if done else gamma * Q[s2].max()) # EQ R3.4: greedy bootstrap
Q[s, a] += alpha * (target - Q[s, a])
s = s2
if done: break
print("learned Q-values (rows = states 0..3, cols = [left, right]):")
print(Q.round(3))
print("greedy policy :", ["L R"[int(np.argmax(Q[s])) * 2] for s in range(n)])
print("optimal V*(0) :", round(gamma ** 3, 3), "(= gamma^3, 3 steps to the goal)")
SARSA (on-policy) & exploration
Change one symbol in the Q-learning update and you get a different algorithm with a different personality. Instead of bootstrapping from the best next action, bootstrap from the action the agent actually takes next under its current policy. The update now uses the quintuple \((s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\) — state, action, reward, state, action — which is exactly where the name SARSA comes from.
That difference is not academic, and the cleanest illustration is the cliff-walking example from Sutton & Barto. An agent must walk from start to goal along the edge of a cliff; stepping off costs a large penalty. Both algorithms use ε-greedy exploration. Q-learning learns the optimal path — right along the cliff edge — because its \(\max\) bootstrap evaluates the greedy policy, which never falls. But because it still explores while following that knife-edge route, its random ε-steps occasionally pitch it off the cliff, so its online reward during training is worse. SARSA learns a safer path one row back from the edge, because its on-policy target accounts for the fact that it sometimes takes random actions — so it learns the value of behaving exploratorily and routes around the risk. Q-learning finds the better policy; SARSA earns more reward while learning. Which you want depends on whether failures during training are cheap or catastrophic — a real engineering decision, not a theoretical curiosity.
Q-learning is an optimist; SARSA is a realist. Q-learning assumes it will act greedily from the next state on, so it learns the value of the best-case continuation. SARSA assumes it will keep exploring like it currently does, so it learns the value of its actual, imperfect behavior. As \(\varepsilon \to 0\) the two converge — with no exploration there is no difference between "best next action" and "action actually taken".
Both algorithms lean on the exploration we met in Chapter 01: ε-greedy with an annealed \(\varepsilon\). The annealing is not optional polish — it is what reconciles two requirements that pull in opposite directions. Convergence needs every state–action pair visited infinitely often (so \(\varepsilon\) must stay positive), but a good final policy needs the agent to eventually stop throwing away reward on random moves (so \(\varepsilon\) must vanish). The resolution is GLIE — Greedy in the Limit with Infinite Exploration — schedules that keep \(\varepsilon > 0\) forever but send it to 0, the textbook example being \(\varepsilon_t = 1/t\). A practical schedule decays \(\varepsilon\) from near 1 toward a small floor; the shape of that decay is one of the most-tuned knobs in applied RL.
# Exponential epsilon-decay (EQ R3.6) and what fraction of moves stay random
import numpy as np
eps0, eps_min, tau = 1.0, 0.05, 500.0
t = np.arange(0, 3000)
eps = eps_min + (eps0 - eps_min) * np.exp(-t / tau) # EQ R3.6
for step in (0, 250, 500, 1000, 3000 - 1):
print(f"t = {step:5d} epsilon = {eps[step]:.3f} "
f"P(random move) = {eps[step]:.1%}")
print(f"\nfloor approached as t -> inf : {eps_min}")
print(f"epsilon at one time-constant : {eps[int(tau)]:.3f} "
f"(~ eps_min + 0.368*(eps0-eps_min))")
plot_xy(t.tolist(), eps.tolist()) # the classic decay curve
Value methods learn what every action is worth, then act greedily — but they choke when actions are continuous or the state space is too vast to tabulate. Chapter 04 takes the other road: policy gradients, which parameterize the policy directly and push its parameters up the gradient of expected return. We will meet REINFORCE, the variance problem that nearly sinks it, and the actor–critic methods that put a TD-learned value function (everything you just built) back to work as the critic.
References
- Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning.
- Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences.
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning.
- van Hasselt, H. (2010). Double Q-learning.
- Singh, S., Jaakkola, T., Littman, M. L. & Szepesvári, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms.