The Reinforcement Learning Problem

1.1

Agents, environments & rewards

Reinforcement learning is the study of learning by interaction. There is an agent — the thing that learns and decides — and an environment — everything else, the world the agent acts in and cannot directly control. They meet in a loop that runs forever, or until the episode ends. At each step the agent observes a state, chooses an action, and the environment responds with a reward and a new state. That single sentence is the whole game; everything in this volume is built on it.

The interaction loop. The agent emits an action; the environment returns a reward and the next state. Nothing else passes between them — no labels, no gradient, no ground-truth "correct action".

The contrast with supervised learning is sharper than it first appears, and it is the reason RL is its own field rather than a corner of classification. Three differences matter:

The signal is evaluative, not instructive. A label says "the answer was 7." A reward says only "that was worth +1" — it never reveals what the best action would have been. The agent must infer the better action by trying alternatives, which it can only do by acting differently.
Feedback is delayed. The move that loses a chess game may have been made twenty turns earlier. Reward arrives long after the action that earned it, so the agent must solve a credit-assignment problem: which of my past decisions deserves the blame or the praise?
The data is not i.i.d. — it is generated by the agent. A supervised dataset sits still while you fit it. In RL the distribution of states the agent encounters is produced by its own policy; change the policy and you change the data. An agent that never drives into the city never learns city driving. This feedback between policy and data is the defining difficulty, and it is exactly the bold idea this chapter is built around.

The entire framework rests on one deceptively strong assumption, the reward hypothesis: that any goal we care about can be expressed as the maximization of expected cumulative scalar reward. Win the game; minimize fuel; keep the robot upright; finish the task a human would approve of. It is a remarkably general claim, and most of the field's hardest practical failures — reward hacking, specification gaming, agents that maximize the literal number while violating its intent — are failures of writing down the reward, not of optimizing it. We will return to that honesty repeatedly.

A word on what "reward" is not. It is not a label and not a loss. It is part of the environment's definition, chosen by the problem designer, and the agent treats it as given and immutable. The agent's job is never to question the reward — only to collect as much of it, over time, as it can.

1.2

The Markov Decision Process

To do anything rigorous we need to formalize that loop, and the standard formalization is the Markov Decision Process (MDP). An MDP is a Markov chain (see Stats · Markov Chains) with two additions: the transitions now depend on an action the agent chooses, and every transition emits a reward. It is the bridge between "an agent acting in a world" and "a problem we can solve with mathematics".

EQ R1.1 — THE MDP TUPLE $$ \mathcal{M} = \langle\, \mathcal{S},\ \mathcal{A},\ P,\ R,\ \gamma \,\rangle $$

$\mathcal{S}$ is the set of states; $\mathcal{A}$ the set of actions; $P(s' \mid s, a)$ the transition probability of landing in $s'$ after taking $a$ in $s$; $R(s, a)$ the expected immediate reward; and $\gamma \in [0, 1]$ the discount factor (§1.3). Five objects fully specify the world. Everything an RL algorithm computes — values, policies, plans — is a function of this tuple alone.

The word Markov carries the load. The state is assumed to be a sufficient statistic of the history: given the current state, the future is conditionally independent of the past. Where you go next depends only on where you are and what you do — not on how you got here.

EQ R1.2 — THE MARKOV PROPERTY $$ P\big(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0\big) \;=\; P\big(s_{t+1} \mid s_t, a_t\big) $$

The full history collapses into the current state. This is not a law of nature — it is a property of how you choose to define the state. If a single video frame is not Markov (you cannot tell which way the ball is moving), stack four frames and it becomes Markov. Most of the art of applying RL is engineering a state representation that makes this assumption true enough to be useful. When the agent cannot observe the full state — only a partial observation — the problem becomes a POMDP, strictly harder, and is the subject of later chapters.

The transition function $P$ and reward function $R$ together are called the model of the environment. A crucial fork in the field follows from whether the agent knows them. If $P$ and $R$ are known, the agent can plan — compute the best policy by pure thought, no interaction required, which is the dynamic programming of Chapter 02. If they are unknown, the agent must learn from samples, which is the model-free reinforcement learning of the chapters after. The MDP is the common language for both.

Symbol	Name	What it is	Gridworld example
𝒮	states	every distinguishable situation	each cell of the grid
𝒜	actions	the agent's choices in a state	up, down, left, right
P(s′\|s,a)	transitions	where actions lead (maybe stochastic)	"move up" → cell above (90%), slip sideways (10%)
R(s,a)	reward	scalar feedback per step	+1 at the goal, −1 in a trap, −0.04 per step
γ	discount	how much the future counts	0.9 — distant rewards matter, but less

PYTHON · RUNNABLE IN-BROWSER

# Define a tiny 3-state MDP and compute a trajectory's discounted return
import numpy as np

# states 0,1,2 ; action "go" moves you forward; reward is given on arrival
# rewards collected along one episode: s0 -> s1 -> s2 (terminal)
rewards = np.array([0.0, 1.0, 1.0, 1.0])   # r_1, r_2, r_3, r_4 received over time
gamma   = 0.9

# discounted return G = sum_t gamma^t * r_{t+1}   (EQ R1.3)
discounts = gamma ** np.arange(len(rewards))
G = float(np.sum(discounts * rewards))

print("reward sequence :", rewards.tolist())
print("discount weights:", discounts.round(4).tolist())
print(f"discount factor : gamma = {gamma}")
print(f"discounted return G = {G:.4f}")

# the famous case: rewards [1,1,1] at t=0,1,2 with gamma=0.9
g3 = sum(gamma**t * 1.0 for t in range(3))
print(f"\nreturn of [1,1,1] at gamma=0.9 : {g3:.2f}   (= 1 + 0.9 + 0.81)")

edits are live — break it on purpose

INSTRUMENT R1.1 — MDP ANATOMYSTATES · ACTIONS · TRANSITIONS · REWARDS

SELECTED STATE

SLIP PROBABILITY 0.10

INTENDED ACTION

→

P(INTENDED)

—

REWARD ON ARRIVAL

—

A four-state chain laid bare. Pick a state and watch its action edges fan out: the mint arrow is where the agent intends to go, the faint grey arrows are where it might slip. Raise the slip probability and the world grows stochastic — the same action no longer guarantees the same outcome, which is exactly what makes the agent need a policy over states rather than a fixed plan of moves. The GOAL state is terminal; its only edge loops back to itself with the terminal reward.

1.3

Returns & discounting

The agent does not maximize the immediate reward — that would be greedy and shortsighted. It maximizes the return: the total reward accumulated from now to the end of time. For an episode that terminates, the return is simply the sum of rewards. But many problems never terminate, and an infinite sum of rewards is itself infinite and meaningless to compare. The fix is to discount: weight a reward received $k$ steps in the future by $\gamma^k$.

EQ R1.3 — THE DISCOUNTED RETURN $$ G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^k\, R_{t+k+1} $$

$G_t$ is the return from time $t$ onward. The discount $\gamma \in [0, 1]$ sets the agent's horizon: a reward $k$ steps away is worth $\gamma^k$ of an immediate one. This single number encodes how far-sighted the agent is. If every reward is bounded by $R_{\max}$, the geometric series guarantees the return is finite — $|G_t| \le R_{\max}/(1-\gamma)$ — which is the whole reason discounting exists for continuing tasks.

Discounting earns its place for three independent reasons, and it helps to keep them distinct. Mathematically, $\gamma < 1$ makes the infinite sum converge, so returns are comparable numbers rather than diverging infinities. Economically, it mirrors the time value of reward — a unit now is worth more than a unit later, exactly as in interest and inflation. Behaviorally, it expresses uncertainty about the future: at each step there is effectively a $(1-\gamma)$ chance the world ends, so $\gamma$ is the per-step survival probability. The return is the expected total reward under that geometric lifetime.

EQ R1.4 — THE RECURSIVE FORM (WHY EVERYTHING IS BELLMAN) $$ G_t \;=\; R_{t+1} + \gamma\, G_{t+1} $$

Factor one $\gamma$ out of EQ R1.3 and the return splits cleanly: this step's reward, plus the discounted return of everything after. This one-line recursion is the seed of every value-function method in the rest of the volume — the Bellman equations of Chapter 02 are nothing but EQ R1.4 with an expectation wrapped around it. Recognizing returns as self-referential is the conceptual leap from "sum up rewards" to "reinforcement learning".

The endpoints of $\gamma$ sharpen the intuition. At $\gamma = 0$ the return collapses to $R_{t+1}$: the agent cares only about the immediate reward and is purely myopic, blind to consequences. As $\gamma \to 1$ every future reward counts almost fully and the agent becomes far-sighted, willing to suffer many small penalties now for a large payoff later — the patience a maze or a game of chess demands. Most problems live in between, and the choice of $\gamma$ genuinely changes the optimal policy, not just the numbers: a maze-solver at $\gamma = 0.99$ will take a longer, safer route that a $\gamma = 0.8$ agent rejects as too distant to be worth it.

An agent receives a reward of $1$ at each of three consecutive steps and then the episode ends. With discount $\gamma = 0.9$, what is the discounted return $G$ from the first step? (Use EQ R1.3.)

$G = \gamma^0\cdot 1 + \gamma^1\cdot 1 + \gamma^2\cdot 1 = 1 + 0.9 + 0.81 = $ 2.71. Compare the undiscounted sum, which would be exactly 3 — discounting shaves off the value of the two later rewards.

True or false: setting $\gamma = 0$ makes the agent purely myopic — it optimizes only the immediate reward $R_{t+1}$ and ignores all future consequences. (Answer true or false.)

In EQ R1.3 every term except the first carries a factor of $\gamma^k$ with $k \ge 1$; at $\gamma = 0$ all of those vanish ($0^k = 0$), leaving $G_t = R_{t+1}$ alone. The agent values only the next reward and is blind to everything after it. The statement is true.

PYTHON · RUNNABLE IN-BROWSER

# How the discount factor reshapes the same reward stream (EQ R1.3)
import numpy as np

rewards = np.ones(40)                      # a steady +1 every step, 40 steps
horizons = []
for gamma in (0.0, 0.5, 0.9, 0.99):
    w = gamma ** np.arange(len(rewards))   # discount weights
    G = float((w * rewards).sum())
    # "effective horizon" 1/(1-gamma): steps that meaningfully count
    eff = np.inf if gamma >= 1 else 1.0 / (1.0 - gamma)
    horizons.append((gamma, G, eff))
    print(f"gamma={gamma:4} -> return {G:7.3f}   effective horizon ~ {eff:6.1f} steps")

print("\ngamma=0 is myopic: return is exactly the first reward (1.0).")
print("near gamma=1 the agent sums almost all 40 rewards and plans far ahead.")
# closed form for an infinite stream of +1: 1/(1-gamma)
print("infinite-stream limits 1/(1-g):",
      [round(1/(1-g),2) for g in (0.5,0.9,0.99)])
plot_xy([h[0] for h in horizons], [h[1] for h in horizons])

edits are live — break it on purpose

INSTRUMENT R1.2 — DISCOUNT-FACTOR EXPLORERγ AND THE AGENT'S HORIZON · EQ R1.3

DISCOUNT γ 0.90

EFFECTIVE HORIZON 1/(1−γ)

—

RETURN OF STEADY +1

—

WEIGHT ON STEP 10

—

Each bar is the weight $\gamma^k$ the agent places on the reward $k$ steps ahead. At $\gamma = 0$ only the first bar survives — pure myopia. Slide toward 1 and the bars flatten into a long, slowly-decaying tail: the agent's gaze stretches further into the future. The effective horizon $1/(1-\gamma)$ is a useful rule of thumb for how many steps actually matter — γ = 0.9 sees about 10 steps, γ = 0.99 about 100. Notice how violently the horizon stretches as γ approaches 1; that sensitivity is why γ is one of the most consequential knobs in all of RL.

1.4

Policies & value functions

A policy is the agent's behavior: a rule that maps states to actions. It is the object RL ultimately searches for — the solution to the MDP. A policy can be deterministic, $a = \pi(s)$, always taking the same action in a state; or stochastic, $\pi(a \mid s)$, a probability distribution over actions. Stochastic policies matter more than they look: they are how an agent explores, and in some problems (and in every adversarial game) the optimal policy is irreducibly random.

EQ R1.5 — THE POLICY $$ \pi(a \mid s) \;=\; \Pr\big(A_t = a \mid S_t = s\big), \qquad \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1 $$

A policy is a conditional distribution over actions given the state. The entire goal of reinforcement learning is to find a policy that maximizes expected return. Crucially the policy depends on the state only — not on the time step or the history — which is exactly what the Markov property (EQ R1.2) buys us: in a Markov world, an optimal policy that depends only on the current state always exists.

To improve a policy we need to know how good it is, and "good" means expected return. This gives the two central quantities of the field. The state-value function $V^\pi(s)$ is the expected return from starting in state $s$ and following $\pi$ thereafter. The action-value function $Q^\pi(s, a)$ is the expected return from taking action $a$ in state $s$ and then following $\pi$.

EQ R1.6 — STATE-VALUE AND ACTION-VALUE $$ V^\pi(s) \;=\; \mathbb{E}_\pi\!\big[\, G_t \mid S_t = s \,\big], \qquad Q^\pi(s, a) \;=\; \mathbb{E}_\pi\!\big[\, G_t \mid S_t = s,\, A_t = a \,\big] $$

$V^\pi(s)$ answers "how much reward can I expect from here, behaving like this?" and $Q^\pi(s,a)$ answers "and how much if I commit to action $a$ first?". They are linked by $V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)$. The difference $Q^\pi(s,a) - V^\pi(s)$ — the advantage — is the single most useful quantity in modern policy-gradient RL, because it says whether an action beats the policy's own average without you needing to know the absolute scale.

Why two functions? Because of what each one lets you do without a model. If you know $V^\pi$ but not the transitions $P$, you cannot act greedily — you would need to know where each action leads to compare states. But if you know $Q^\pi$, choosing the best action is trivial: take $\arg\max_a Q(s,a)$, no model required. This is precisely why model-free control methods (Q-learning, SARSA) learn $Q$ rather than $V$. The value functions are the connective tissue of the whole field: estimate them, and a good policy falls out.

A subtlety experts will insist on: values are always relative to a policy. There is no such thing as "the value of a state" in the abstract — only its value under some way of behaving. The exception is the optimal value function $V^{*}(s) = \max_\pi V^\pi(s)$, the best achievable from each state, which Chapter 02 shows how to compute. Solving an MDP means finding $V^{*}$ (or $Q^{*}$) and reading the optimal policy off it.

In state $s$ the policy is $\pi(\text{up}\mid s) = 0.7$ and $\pi(\text{down}\mid s) = 0.3$. The action-values are $Q(s,\text{up}) = 10$ and $Q(s,\text{down}) = 7$. What is the state-value $V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)$?

$V^\pi(s) = 0.7 \times 10 + 0.3 \times 7 = 7 + 2.1 = $ 9.1. The state-value is just the policy-weighted average of the action-values — which is why a policy that puts more weight on the better action raises $V$.

PYTHON · RUNNABLE IN-BROWSER

# V from Q, the advantage, and which action a greedy agent would pick (EQ R1.6)
import numpy as np

actions = ["up", "down", "left", "right"]
Q  = np.array([10.0, 7.0, 4.0, 9.0])       # Q(s, a) for one state s
pi = np.array([0.70, 0.10, 0.05, 0.15])    # current stochastic policy pi(a|s)
assert np.isclose(pi.sum(), 1.0)

V = float(pi @ Q)                          # V(s) = sum_a pi(a|s) Q(s,a)
advantage = Q - V                          # A(s,a) = Q(s,a) - V(s)

print(f"state value V(s)        : {V:.3f}")
for a, q, adv, p in zip(actions, Q, advantage, pi):
    flag = "  <- above average" if adv > 0 else ""
    print(f"  {a:5s}  Q={q:5.1f}  A={adv:+5.2f}  pi={p:.2f}{flag}")

greedy = actions[int(np.argmax(Q))]
print(f"\ngreedy action (argmax Q): {greedy}")
print("a policy gradient step pushes pi toward actions with positive advantage.")

edits are live — break it on purpose

1.5

Exploration vs exploitation

Here is the dilemma that has no counterpart in supervised learning. To collect reward, the agent should exploit — take the action it currently believes is best. But its beliefs are estimates built from limited experience, and the only way to improve them is to explore — try actions it is unsure about, which may be worse. Every step forces a choice between cashing in what you know and gathering information that might let you do better later. Lean too far toward exploitation and you lock onto a mediocre habit, never discovering the better path you never tried. Lean too far toward exploration and you squander reward forever testing options you already know are bad.

This tension is fundamental, not an artifact of any algorithm, and it is sharpened by the feedback we opened the chapter with: because the agent's actions decide what data it sees, an action never taken is an action never learned from. A supervised learner sees every labeled example in the dataset whether it likes it or not. An RL agent sees only the consequences of what it chose to do — so under-exploration is self-reinforcing, a blind spot the agent cannot detect from inside.

The simplest workable strategy is ε-greedy: with probability $1-\varepsilon$ take the current best (greedy) action, and with probability $\varepsilon$ take a uniformly random action instead. It is crude but remarkably effective, and it is the exploration rule baked into the first generation of deep-RL agents.

EQ R1.7 — ε-GREEDY ACTION SELECTION $$ \pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[1.2em] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases} $$

With probability $1-\varepsilon$ the agent exploits the greedy action; with probability $\varepsilon$ it picks uniformly at random (so even the greedy action keeps a small $\varepsilon/|\mathcal{A}|$ slice of the random mass). $\varepsilon$ is the exploration rate, and it is almost always annealed — started high so the agent samples widely, then decayed toward zero as its value estimates sharpen and exploitation becomes safe. At $\varepsilon = 0$ the policy is purely greedy; at $\varepsilon = 1$ it is a uniformly random walk.

ε-greedy is not the only tool, and it has a real weakness worth stating: it explores blindly, treating a clearly-terrible action and a plausibly-good-but-untested one as equally worth a random visit. Smarter schemes — optimism in the face of uncertainty (UCB, which adds a bonus for actions tried less often), Boltzmann / softmax exploration (sample in proportion to estimated value), and posterior sampling (Thompson sampling) — direct exploration toward what is genuinely uncertain rather than uniformly random. The multi-armed bandit, the simplest possible RL problem (one state, no transitions, just the explore–exploit trade-off in isolation), is where these are studied cleanly and where the regret bounds that quantify "how much you lose by not knowing the best arm" are proved. We treat bandits in their own chapter.

An ε-greedy agent uses $\varepsilon = 0.1$ over $|\mathcal{A}| = 4$ actions. Using EQ R1.7, what total probability does it place on the single greedy action $\arg\max_a Q(s,a)$?

$1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} = 1 - 0.1 + \dfrac{0.1}{4} = 0.9 + 0.025 = $ 0.925. The remaining $0.075$ is split evenly across the other three actions ($0.025$ each), so the four probabilities sum to 1.

PYTHON · RUNNABLE IN-BROWSER

# Epsilon-greedy action selection, with an annealed exploration rate (EQ R1.7)
import numpy as np
rng = np.random.default_rng(0)

Q = np.array([1.0, 5.0, 2.0, 4.0])         # estimated action-values, 4 actions
nA = len(Q)
greedy = int(np.argmax(Q))                 # action index 1 here (value 5.0)

def eps_greedy(Q, eps):
    if rng.random() < eps:                 # explore: uniform random
        return rng.integers(len(Q))
    return int(np.argmax(Q))               # exploit: the greedy action

# anneal epsilon from 1.0 down to a 0.05 floor, and report the rate
print(" step   epsilon   P(greedy)=1-eps+eps/|A|   empirical greedy share")
for step in (0, 50, 200, 1000):
    eps = max(0.05, 1.0 * (0.995 ** step)) # exponential decay with a floor
    p_greedy = 1 - eps + eps / nA
    picks = [eps_greedy(Q, eps) for _ in range(4000)]
    share = np.mean(np.array(picks) == greedy)
    print(f" {step:5d}   {eps:6.3f}        {p_greedy:6.3f}                {share:6.3f}")

print(f"\ngreedy action = index {greedy} (Q = {Q[greedy]}).")
print("as epsilon decays, the agent shifts from exploring to exploiting it.")

edits are live — break it on purpose

INSTRUMENT R1.3 — GRIDWORLD EXPLORERSET REWARDS · WATCH A POLICY EMERGE · VALUE ITERATION

DISCOUNT γ 0.90

STEP COST −0.04

PAINT

VIEW

VALUES + POLICY

SWEEPS TO CONVERGE

—

V AT START CELL

—

Click cells to paint a goal (+1), a trap (−1), or a wall, then watch value iteration flood the grid: the mint shading is the state-value $V(s)$, and the arrows are the greedy policy it implies — a plan the agent never wrote, only computed. Make the step cost more negative and the policy grows impatient, hugging the shortest path and accepting risk near the trap; soften it and the agent takes longer, safer detours. Raise γ and distant goals pull harder, reshaping arrows two and three cells away. This is the planning of Chapter 02, run live on whatever world you paint. (This solves a known MDP by dynamic programming — the agent here is given $P$ and $R$; the model-free chapters drop that luxury.)

We can now state the problem exactly — but not yet solve it. Chapter 02 supplies the first machinery: the Bellman equations, which turn the recursive return (EQ R1.4) into a fixed-point system, and the dynamic-programming algorithms — policy evaluation, policy iteration, and value iteration (the engine behind the Gridworld instrument above) — that compute the optimal value function and policy when the model $P$ and $R$ is known. Everything after that is the harder, real-world case: learning the same answers from experience alone.

1.R

References

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press — the canonical text; the agent–environment loop, MDPs, returns, value functions, and exploration as framed in this chapter.
Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics 6(5) — the formal origin of the MDP (EQ R1.1) and the recursive value relation behind EQ R1.4.
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning 8 — the action-value function $Q^\pi$ (EQ R1.6) and the convergence result that grounds model-free control.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 — deep Q-networks; the modern demonstration that ε-greedy exploration (EQ R1.7) scales to high-dimensional state spaces.
Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 — UCB and the regret framework for principled exploration beyond ε-greedy (§1.5).
Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 — an early, lucid survey of the problem formulation used throughout this chapter.