The Bellman equations
A value function answers one question: starting here, and acting in some way forever after, how much reward do I expect to collect? The trick that makes that infinite sum tractable is recursion. The value of a state is the immediate reward plus the discounted value of wherever you land — the future is just another instance of the same problem, one step smaller. Writing that observation down for a fixed policy \(\pi\) gives the Bellman expectation equation.
For control we want the best policy, not a fixed one. Replace the average over the policy with a maximum over actions and you get the Bellman optimality equation — the equation this whole chapter exists to solve.
It is worth being precise about why a solution even exists. Define the Bellman optimality operator \(\mathcal{T}\) acting on any value table \(V\):
Policy evaluation
Before improving a policy you must score it. Given a fixed \(\pi\), policy evaluation computes \(V^\pi\) — the answer to EQ R2.1. You could solve the linear system directly (invert an \(|\mathcal{S}| \times |\mathcal{S}|\) matrix), but for anything beyond toy sizes the iterative method is cheaper and is the template for everything that follows. Turn the Bellman expectation equation into an update rule: take your current estimate, plug it into the right-hand side, and read off a new estimate.
The canonical example is Sutton & Barto's \(4 \times 4\) gridworld: an agent moves up/down/left/right, the two opposite corners are terminal, and every move costs \(-1\) until the agent escapes (\(\gamma = 1\)). Under the equiprobable random policy — each direction chosen with probability \(0.25\) — iterative policy evaluation converges to a value surface that grows more negative the farther a cell sits from a terminal. The corner-adjacent cells settle near \(-14\); the cells deepest in the interior reach \(-20\) to \(-22\). Those numbers are not arbitrary; they are the expected number of steps a random walker takes to stumble out, negated.
# Iterative policy evaluation (EQ R2.4): 4x4 gridworld, random policy
import numpy as np
TERM = {0, 15} # two terminal corners
def step(s, a): # 0=up 1=down 2=left 3=right; off-edge = stay
r, c = divmod(s, 4)
if a == 0: r = max(r - 1, 0)
elif a == 1: r = min(r + 1, 3)
elif a == 2: c = max(c - 1, 0)
else: c = min(c + 1, 3)
return r * 4 + c
V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
delta = 0.0
for s in range(16):
if s in TERM: # value of a terminal state is 0
continue
v = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in range(4))
delta = max(delta, abs(v - V[s]))
V[s] = v
if delta < 1e-4:
break
print(f"converged in {sweep + 1} sweeps")
print(np.round(V, 1).reshape(4, 4)) # matches Sutton & Barto Fig 4.1
What policy evaluation cannot do. It scores a policy; it does not improve one. By itself it would just confirm that wandering randomly is expensive. The leverage comes from pairing evaluation with a greedy step — the subject of §2.4 — or from folding the two together, which is §2.3.
Value iteration
Why fully evaluate a policy you are about to throw away? Value iteration short-circuits the loop: do a single backup, but use the max over actions instead of the policy average. That is the Bellman optimality operator \(\mathcal{T}\) (EQ R2.3) applied as an update rule, and iterating it drives \(V\) straight to \(V^{*}\) without ever naming an intermediate policy.
The convergence guarantee rests on the contraction property promised in §2.1. For any two value tables \(U, V\), the max-norm distance after a backup shrinks:
# Value iteration (EQ R2.5): converged optimal value function
import numpy as np
TERM = {0, 15}
def step(s, a):
r, c = divmod(s, 4)
if a == 0: r = max(r - 1, 0)
elif a == 1: r = min(r + 1, 3)
elif a == 2: c = max(c - 1, 0)
else: c = min(c + 1, 3)
return r * 4 + c
V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
delta = 0.0
for s in range(16):
if s in TERM:
continue
best = max(-1 + gamma * V[step(s, a)] for a in range(4))
delta = max(delta, abs(best - V[s]))
V[s] = best # in-place (Gauss-Seidel) backup
if delta < 1e-9:
break
print(f"value iteration converged in {sweep + 1} sweeps")
print("V* = -(steps to nearest terminal):")
print(V.reshape(4, 4))
Policy iteration
Policy iteration takes the opposite tack to value iteration. Rather than interleaving one tiny backup with one tiny improvement, it alternates two full-strength phases: evaluate the current policy completely (§2.2), then make it greedy with respect to the values you just computed. Repeat until the greedy step changes nothing.
The two algorithms are endpoints of a spectrum that the generalized view unifies. Policy iteration runs evaluation to convergence before improving; value iteration improves after a single evaluation backup; modified policy iteration sits in between, running \(m\) evaluation sweeps per improvement. All three are instances of generalized policy iteration (GPI): evaluation and improvement chasing each other until they agree — and where they agree, the policy is greedy with respect to its own value, which is precisely the Bellman optimality condition.
Two processes, one fixed point. Evaluation makes the value consistent with the policy; improvement makes the policy greedy with respect to the value. Each step undoes a little of the other's work — until the only place they both rest is the optimal pair \((V^{*}, \pi^{*})\). Almost every RL algorithm in the rest of this volume, model-free included, is a flavor of GPI with the exact backup of EQ R2.5 replaced by a sampled estimate.
# Policy iteration (EQ R2.4 + EQ R2.7): print the optimal policy
import numpy as np
TERM = {0, 15}
def step(s, a):
r, c = divmod(s, 4)
if a == 0: r = max(r - 1, 0)
elif a == 1: r = min(r + 1, 3)
elif a == 2: c = max(c - 1, 0)
else: c = min(c + 1, 3)
return r * 4 + c
gamma = 1.0
pi = np.zeros(16, dtype=int) # start: everyone goes "up"
def evaluate(pi): # iterative policy evaluation
V = np.zeros(16)
for _ in range(2000):
d = 0.0
for s in range(16):
if s in TERM: continue
v = -1 + gamma * V[step(s, pi[s])]
d = max(d, abs(v - V[s])); V[s] = v
if d < 1e-10: break
return V
for rounds in range(50):
V = evaluate(pi)
stable = True
for s in range(16):
if s in TERM: continue
old = pi[s]
pi[s] = int(np.argmax([-1 + gamma * V[step(s, a)] for a in range(4)]))
if pi[s] != old: stable = False
if stable: break
arrows = np.array(list("^v<>")) # up down left right
g = arrows[pi]; g[0] = "*"; g[15] = "*" # mark terminals
print(f"policy iteration converged in {rounds + 1} improvement rounds")
print(g.reshape(4, 4))
Why DP needs the model
Every backup in this chapter contains the same fingerprint: \(\sum_{s', r} p(s', r \mid s, a)[\cdot]\). That sum is an expectation over the environment's dynamics, and to compute it you must know \(p(s', r \mid s, a)\) — the full transition and reward model. This is the defining assumption of dynamic programming, and it is exactly what the next chapter abandons.
- You rarely have the model. A robot does not ship with a transition table; a game agent is not handed the opponent's policy; a recommender does not know how a user will react. The whole field of model-free RL exists because \(p\) is usually unavailable, and learning it well enough to plan against can be harder than learning to act directly.
- Even with the model, the sweep is expensive. Each sweep touches every state and, per state, every action and every possible successor: \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) per sweep for a dense model. This is the curse of dimensionality — a robot arm with ten joints discretized into a hundred positions each has \(100^{10}\) states. Exact DP is a guarantee that does not scale; its value is conceptual and as a subroutine inside approximate methods.
- The escape route is sampling. Replace the exact expectation \(\sum_{s'} p(s'\mid s,a)[\cdot]\) with a sample drawn by actually taking action \(a\) and observing where you land. That single substitution — backup over a sampled transition instead of the known distribution — turns value iteration into Q-learning and policy evaluation into temporal-difference learning. Everything keeps the GPI skeleton of this chapter; only the backup target changes from a model expectation to a sampled estimate.
So DP is not a dead end — it is the ground truth the rest of RL approximates. When you see TD learning's update \(V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]\) in the next chapter, recognize it as EQ R2.4 with the expectation replaced by one sample and a learning rate \(\alpha\) to smooth the noise. The Bellman equation is the destination; sampling is how you get there without a map.
| Method | Backup | Improvement | Converges in | Needs model? |
|---|---|---|---|---|
| Policy evaluation | expected, π-average | none | → Vπ, geometric | yes |
| Value iteration | expected, max | implicit, every sweep | → V*, geometric (γ) | yes |
| Policy iteration | expected, max | full greedy step | → π*, finite rounds | yes |
| TD / Q-learning | sampled | greedy / ε-greedy | → approx, asymptotic | no |
Drop the model, keep the Bellman equation. Chapter 03 takes the exact expectations you just iterated and replaces them with samples from experience — Monte-Carlo returns and temporal-difference learning, the first algorithms that learn a value function from interaction alone, with no transition table in sight.
References
- Bellman, R. (1957). Dynamic Programming.
- Bellman, R. (1957). A Markovian Decision Process.
- Howard, R. A. (1960). Dynamic Programming and Markov Processes.
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
- Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems.
- Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.).