Dynamic Programming — Value & Policy Iteration

2.1

The Bellman equations

A value function answers one question: starting here, and acting in some way forever after, how much reward do I expect to collect? The trick that makes that infinite sum tractable is recursion. The value of a state is the immediate reward plus the discounted value of wherever you land — the future is just another instance of the same problem, one step smaller. Writing that observation down for a fixed policy $\pi$ gives the Bellman expectation equation.

EQ R2.1 — BELLMAN EXPECTATION EQUATION $$ V^\pi(s) \;=\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^\pi(s') \,\Big] $$

Read it right-to-left as an average over what could happen: pick an action from the policy $\pi(a\mid s)$, let the environment roll the transition $p(s', r \mid s, a)$, collect reward $r$, and add the discounted value of the next state. Because $V^\pi$ appears on both sides, this is not a definition you evaluate once — it is a system of $|\mathcal{S}|$ linear equations whose unique solution is the value function. Everything in this chapter is a way of solving it.

For control we want the best policy, not a fixed one. Replace the average over the policy with a maximum over actions and you get the Bellman optimality equation — the equation this whole chapter exists to solve.

EQ R2.2 — BELLMAN OPTIMALITY EQUATION $$ V^{*}(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^{*}(s') \,\Big] \;=\; \max_a Q^{*}(s, a) $$

The optimal value of a state is achieved by the single best action, assuming you continue optimally thereafter. This is no longer linear — the $\max$ makes it a nonlinear fixed-point equation — but it still has a unique solution $V^{*}$, and once you have it the optimal policy falls out for free: act greedily, $\pi^{*}(s) = \arg\max_a Q^{*}(s, a)$. The deterministic-environment special case is the familiar $V^{*}(s) = \max_a\,[\,r + \gamma\, V^{*}(s')\,]$.

It is worth being precise about why a solution even exists. Define the Bellman optimality operator $\mathcal{T}$ acting on any value table $V$:

EQ R2.3 — THE BELLMAN OPTIMALITY OPERATOR $$ (\mathcal{T}V)(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V(s') \,\Big] $$

$\mathcal{T}$ takes a guess at the value function and returns a better guess. $V^{*}$ is exactly its fixed point: $\mathcal{T}V^{*} = V^{*}$. The decisive fact — proven in §2.3 — is that for $\gamma < 1$ the operator is a $\gamma$-contraction in the max-norm: it shrinks the distance between any two value tables by at least a factor $\gamma$. The Banach fixed-point theorem then guarantees a unique fixed point that you reach by iterating from anywhere. That single property is why every algorithm below works.

True or false: in a deterministic environment the optimal value satisfies $ V^{*}(s) = \max_a\,[\,r + \gamma\, V^{*}(s')\,] $, where $s'$ and $r$ are the state and reward that action $a$ produces. (Answer true or false.)

This is exactly EQ R2.2 specialized to a deterministic transition, where the sum $\sum_{s',r} p(s',r\mid s,a)[\cdot]$ collapses to a single term because each action leads to one $(s', r)$ with probability 1. The answer is true.

INSTRUMENT R2.1 — BELLMAN-BACKUP EXPLORERONE STATE · TWO ACTIONS · EQ R2.2

A single state with two actions. Each action gives an immediate reward and lands in a successor whose value you already estimate. The backup computes $Q(a) = r_a + \gamma\,V(s'_a)$ for each action and takes the max — one cell of the table that value iteration sweeps over.

DISCOUNT γ 0.90

ACTION A · reward r 5

ACTION A · next-state value V(s′) 10

ACTION B · reward r 1

ACTION B · next-state value V(s′) 20

Q(A) = r + γ·V(s′)

—

Q(B) = r + γ·V(s′)

—

V(s) = max Q · GREEDY ACTION

—

Push γ to 0 and the backup becomes myopic — only the immediate reward matters, so action A (higher r) wins. Push γ toward 1 and the future dominates: action B's richer successor takes over. The crossover is the whole story of discounting. The greedy arrow lights up the winning action; this single cell, evaluated for every state, is one sweep of value iteration.

2.2

Policy evaluation

Before improving a policy you must score it. Given a fixed $\pi$, policy evaluation computes $V^\pi$ — the answer to EQ R2.1. You could solve the linear system directly (invert an $|\mathcal{S}| \times |\mathcal{S}|$ matrix), but for anything beyond toy sizes the iterative method is cheaper and is the template for everything that follows. Turn the Bellman expectation equation into an update rule: take your current estimate, plug it into the right-hand side, and read off a new estimate.

EQ R2.4 — ITERATIVE POLICY EVALUATION $$ V_{k+1}(s) \;\leftarrow\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V_k(s') \,\Big] $$

Start from any $V_0$ (zeros are fine), sweep this backup over every state, repeat. This is the Bellman expectation operator $\mathcal{T}^\pi$, and it too is a $\gamma$-contraction, so $V_k \to V^\pi$ geometrically. A sweep is one full pass over the state space; convergence is declared when the largest change $\max_s |V_{k+1}(s) - V_k(s)|$ drops below a tolerance $\theta$. In place updates — overwriting $V(s)$ as you go rather than buffering a fresh copy — converge faster and use half the memory; this is Gauss–Seidel style and is what production code does.

The canonical example is Sutton & Barto's $4 \times 4$ gridworld: an agent moves up/down/left/right, the two opposite corners are terminal, and every move costs $-1$ until the agent escapes ($\gamma = 1$). Under the equiprobable random policy — each direction chosen with probability $0.25$ — iterative policy evaluation converges to a value surface that grows more negative the farther a cell sits from a terminal. The corner-adjacent cells settle near $-14$; the cells deepest in the interior reach $-20$ to $-22$. Those numbers are not arbitrary; they are the expected number of steps a random walker takes to stumble out, negated.

PYTHON · RUNNABLE IN-BROWSER

# Iterative policy evaluation (EQ R2.4): 4x4 gridworld, random policy
import numpy as np
TERM = {0, 15}                      # two terminal corners
def step(s, a):                     # 0=up 1=down 2=left 3=right; off-edge = stay
    r, c = divmod(s, 4)
    if   a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, 3)
    return r * 4 + c

V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
    delta = 0.0
    for s in range(16):
        if s in TERM:               # value of a terminal state is 0
            continue
        v = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in range(4))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-4:
        break

print(f"converged in {sweep + 1} sweeps")
print(np.round(V, 1).reshape(4, 4))  # matches Sutton & Barto Fig 4.1

edits are live — break it on purpose

What policy evaluation cannot do. It scores a policy; it does not improve one. By itself it would just confirm that wandering randomly is expensive. The leverage comes from pairing evaluation with a greedy step — the subject of §2.4 — or from folding the two together, which is §2.3.

2.3

Value iteration

Why fully evaluate a policy you are about to throw away? Value iteration short-circuits the loop: do a single backup, but use the max over actions instead of the policy average. That is the Bellman optimality operator $\mathcal{T}$ (EQ R2.3) applied as an update rule, and iterating it drives $V$ straight to $V^{*}$ without ever naming an intermediate policy.

EQ R2.5 — VALUE ITERATION $$ V_{k+1}(s) \;\leftarrow\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V_k(s') \,\Big] $$

One greedy backup per state per sweep; no inner evaluation loop. You can read it as policy evaluation truncated to a single step before re-improving. When the max change across a sweep falls below $\theta$, stop and read off the greedy policy $\pi(s) = \arg\max_a Q(s, a)$ once. On the gridworld with $\gamma = 1$ this converges in just four sweeps to $V^{*}(s) = -(\text{shortest-path distance to the nearest terminal})$ — information propagates outward from the goals one ring per sweep, exactly like a flood fill.

The convergence guarantee rests on the contraction property promised in §2.1. For any two value tables $U, V$, the max-norm distance after a backup shrinks:

EQ R2.6 — $\mathcal{T}$ IS A γ-CONTRACTION $$ \lVert \mathcal{T}U - \mathcal{T}V \rVert_\infty \;\le\; \gamma\, \lVert U - V \rVert_\infty $$

Because the operator differs between $U$ and $V$ only through the discounted future term $\gamma V(s')$, and $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, every backup multiplies the worst-case error by at most $\gamma$. So after $k$ sweeps the error is bounded by $\gamma^{k}$ times the initial error — geometric convergence, with the rate set entirely by $\gamma$. At $\gamma = 0.9$ you lose roughly one digit of error every ~22 sweeps; the closer $\gamma$ creeps to 1, the slower the crawl, which is why long-horizon problems are genuinely hard. (At $\gamma = 1$, as in the episodic gridworld, contraction is not strict — convergence instead relies on every state reaching a terminal, i.e. a proper policy.)

True or false: value iteration converges because the Bellman optimality operator $\mathcal{T}$ is a contraction (for $\gamma < 1$), so the Banach fixed-point theorem guarantees a unique fixed point reached from any starting $V_0$. (Answer true or false.)

EQ R2.6 establishes $\lVert \mathcal{T}U - \mathcal{T}V \rVert_\infty \le \gamma \lVert U - V \rVert_\infty$ with $\gamma < 1$, which is exactly the definition of a contraction mapping. The Banach fixed-point theorem then guarantees a unique fixed point $V^{*}$ and convergence to it from any initialization. The answer is true.

PYTHON · RUNNABLE IN-BROWSER

# Value iteration (EQ R2.5): converged optimal value function
import numpy as np
TERM = {0, 15}
def step(s, a):
    r, c = divmod(s, 4)
    if   a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, 3)
    return r * 4 + c

V, gamma = np.zeros(16), 1.0
for sweep in range(1000):
    delta = 0.0
    for s in range(16):
        if s in TERM:
            continue
        best = max(-1 + gamma * V[step(s, a)] for a in range(4))
        delta = max(delta, abs(best - V[s]))
        V[s] = best                 # in-place (Gauss-Seidel) backup
    if delta < 1e-9:
        break

print(f"value iteration converged in {sweep + 1} sweeps")
print("V*  =  -(steps to nearest terminal):")
print(V.reshape(4, 4))

edits are live — break it on purpose

INSTRUMENT R2.2 — VALUE-ITERATION STEPPER4×4 GRIDWORLD · γ=1 · EQ R2.5

Two terminal corners (mint). Every move costs $-1$. Step the backup one sweep at a time and watch value flood outward from the goals — exactly one ring per sweep — then settle into $V^{*}(s) = -(\text{distance to nearest terminal})$.

CONTROL

SWEEP k

MAX CHANGE Δ THIS SWEEP

—

STATUS

INITIAL

At sweep 0 every non-terminal cell reads 0. After sweep 1, cells one step from a goal know they are worth $-1$; after sweep 2, the next ring learns $-2$; the deepest cells lock in by sweep 4 and Δ hits 0 — converged. The arrows show the greedy policy implied by the current values: even mid-iteration they already point roughly toward the exits.

2.4

Policy iteration

Policy iteration takes the opposite tack to value iteration. Rather than interleaving one tiny backup with one tiny improvement, it alternates two full-strength phases: evaluate the current policy completely (§2.2), then make it greedy with respect to the values you just computed. Repeat until the greedy step changes nothing.

EQ R2.7 — POLICY IMPROVEMENT (GREEDY STEP) $$ \pi'(s) \;=\; \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\Big[\, r + \gamma\, V^\pi(s') \,\Big] $$

Given $V^\pi$, act greedily one step then follow $\pi$ — this can only help. The policy improvement theorem guarantees $V^{\pi'}(s) \ge V^\pi(s)$ for every state, with strict improvement somewhere unless $\pi$ is already optimal. So the sequence of policies is monotonically non-decreasing in value, and since a finite MDP has only finitely many deterministic policies, the loop must terminate at $\pi^{*}$ in a finite number of steps — no tolerance, no approximation. On the $4\times 4$ gridworld it converges in four improvement rounds.

The two algorithms are endpoints of a spectrum that the generalized view unifies. Policy iteration runs evaluation to convergence before improving; value iteration improves after a single evaluation backup; modified policy iteration sits in between, running $m$ evaluation sweeps per improvement. All three are instances of generalized policy iteration (GPI): evaluation and improvement chasing each other until they agree — and where they agree, the policy is greedy with respect to its own value, which is precisely the Bellman optimality condition.

GPI

Two processes, one fixed point. Evaluation makes the value consistent with the policy; improvement makes the policy greedy with respect to the value. Each step undoes a little of the other's work — until the only place they both rest is the optimal pair $(V^{*}, \pi^{*})$. Almost every RL algorithm in the rest of this volume, model-free included, is a flavor of GPI with the exact backup of EQ R2.5 replaced by a sampled estimate.

PYTHON · RUNNABLE IN-BROWSER

# Policy iteration (EQ R2.4 + EQ R2.7): print the optimal policy
import numpy as np
TERM = {0, 15}
def step(s, a):
    r, c = divmod(s, 4)
    if   a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 3)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, 3)
    return r * 4 + c

gamma = 1.0
pi = np.zeros(16, dtype=int)                 # start: everyone goes "up"
def evaluate(pi):                            # iterative policy evaluation
    V = np.zeros(16)
    for _ in range(2000):
        d = 0.0
        for s in range(16):
            if s in TERM: continue
            v = -1 + gamma * V[step(s, pi[s])]
            d = max(d, abs(v - V[s])); V[s] = v
        if d < 1e-10: break
    return V

for rounds in range(50):
    V = evaluate(pi)
    stable = True
    for s in range(16):
        if s in TERM: continue
        old = pi[s]
        pi[s] = int(np.argmax([-1 + gamma * V[step(s, a)] for a in range(4)]))
        if pi[s] != old: stable = False
    if stable: break

arrows = np.array(list("^v<>"))           # up down left right
g = arrows[pi]; g[0] = "*"; g[15] = "*"      # mark terminals
print(f"policy iteration converged in {rounds + 1} improvement rounds")
print(g.reshape(4, 4))

edits are live — break it on purpose

INSTRUMENT R2.3 — POLICY-ITERATION VISUALIZEREVALUATE ⇄ IMPROVE · EQ R2.7

The same gridworld. Each round runs full policy evaluation, then a greedy improvement. Watch the arrows snap toward the exits and the values lock in. The two phases alternate; convergence is when an improvement step changes no arrow.

CONTROL

ROUND

PHASE

INIT

ARROWS CHANGED LAST IMPROVE

—

Start: every cell points up — a deliberately bad policy whose values are deeply negative. Hit EVALUATE to score it, then IMPROVE to make it greedy; alternate. The policy is optimal the first time an improvement changes zero arrows. Notice the values are already exact for the current policy after each EVALUATE — that is what separates this from value iteration's single-backup steps.

2.5

Why DP needs the model

Every backup in this chapter contains the same fingerprint: $\sum_{s', r} p(s', r \mid s, a)[\cdot]$. That sum is an expectation over the environment's dynamics, and to compute it you must know $p(s', r \mid s, a)$ — the full transition and reward model. This is the defining assumption of dynamic programming, and it is exactly what the next chapter abandons.

You rarely have the model. A robot does not ship with a transition table; a game agent is not handed the opponent's policy; a recommender does not know how a user will react. The whole field of model-free RL exists because $p$ is usually unavailable, and learning it well enough to plan against can be harder than learning to act directly.
Even with the model, the sweep is expensive. Each sweep touches every state and, per state, every action and every possible successor: $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep for a dense model. This is the curse of dimensionality — a robot arm with ten joints discretized into a hundred positions each has $100^{10}$ states. Exact DP is a guarantee that does not scale; its value is conceptual and as a subroutine inside approximate methods.
The escape route is sampling. Replace the exact expectation $\sum_{s'} p(s'\mid s,a)[\cdot]$ with a sample drawn by actually taking action $a$ and observing where you land. That single substitution — backup over a sampled transition instead of the known distribution — turns value iteration into Q-learning and policy evaluation into temporal-difference learning. Everything keeps the GPI skeleton of this chapter; only the backup target changes from a model expectation to a sampled estimate.

So DP is not a dead end — it is the ground truth the rest of RL approximates. When you see TD learning's update $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ in the next chapter, recognize it as EQ R2.4 with the expectation replaced by one sample and a learning rate $\alpha$ to smooth the noise. The Bellman equation is the destination; sampling is how you get there without a map.

Method	Backup	Improvement	Converges in	Needs model?
Policy evaluation	expected, π-average	none	→ V^π, geometric	yes
Value iteration	expected, max	implicit, every sweep	→ V*, geometric (γ)	yes
Policy iteration	expected, max	full greedy step	→ π*, finite rounds	yes
TD / Q-learning	sampled	greedy / ε-greedy	→ approx, asymptotic	no

Drop the model, keep the Bellman equation. Chapter 03 takes the exact expectations you just iterated and replaces them with samples from experience — Monte-Carlo returns and temporal-difference learning, the first algorithms that learn a value function from interaction alone, with no transition table in sight.

2.R

References

Bellman, R. (1957). Dynamic Programming. Princeton University Press — the founding text; the principle of optimality and the recursive value relation behind EQ R2.1–R2.3.
Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the MDP formalization and the optimality equation in its original form.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press — introduced policy iteration (EQ R2.7) and the policy improvement theorem.
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press, Ch. 4 — iterative policy evaluation, value iteration, policy iteration, and the 4×4 gridworld (Fig. 4.1) used throughout this chapter.
Puterman, M. L. & Shin, M. C. (1978). Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24(11) — the m-sweep interpolation between value and policy iteration cited in §2.4.
Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific — the contraction-mapping convergence analysis (EQ R2.6) in full rigor.