Agents, environments & rewards
Reinforcement learning is the study of learning by interaction. There is an agent — the thing that learns and decides — and an environment — everything else, the world the agent acts in and cannot directly control. They meet in a loop that runs forever, or until the episode ends. At each step the agent observes a state, chooses an action, and the environment responds with a reward and a new state. That single sentence is the whole game; everything in this volume is built on it.
The contrast with supervised learning is sharper than it first appears, and it is the reason RL is its own field rather than a corner of classification. Three differences matter:
- The signal is evaluative, not instructive. A label says "the answer was 7." A reward says only "that was worth +1" — it never reveals what the best action would have been. The agent must infer the better action by trying alternatives, which it can only do by acting differently.
- Feedback is delayed. The move that loses a chess game may have been made twenty turns earlier. Reward arrives long after the action that earned it, so the agent must solve a credit-assignment problem: which of my past decisions deserves the blame or the praise?
- The data is not i.i.d. — it is generated by the agent. A supervised dataset sits still while you fit it. In RL the distribution of states the agent encounters is produced by its own policy; change the policy and you change the data. An agent that never drives into the city never learns city driving. This feedback between policy and data is the defining difficulty, and it is exactly the bold idea this chapter is built around.
The entire framework rests on one deceptively strong assumption, the reward hypothesis: that any goal we care about can be expressed as the maximization of expected cumulative scalar reward. Win the game; minimize fuel; keep the robot upright; finish the task a human would approve of. It is a remarkably general claim, and most of the field's hardest practical failures — reward hacking, specification gaming, agents that maximize the literal number while violating its intent — are failures of writing down the reward, not of optimizing it. We will return to that honesty repeatedly.
A word on what "reward" is not. It is not a label and not a loss. It is part of the environment's definition, chosen by the problem designer, and the agent treats it as given and immutable. The agent's job is never to question the reward — only to collect as much of it, over time, as it can.
The Markov Decision Process
To do anything rigorous we need to formalize that loop, and the standard formalization is the Markov Decision Process (MDP). An MDP is a Markov chain (see Stats · Markov Chains) with two additions: the transitions now depend on an action the agent chooses, and every transition emits a reward. It is the bridge between "an agent acting in a world" and "a problem we can solve with mathematics".
The word Markov carries the load. The state is assumed to be a sufficient statistic of the history: given the current state, the future is conditionally independent of the past. Where you go next depends only on where you are and what you do — not on how you got here.
The transition function \(P\) and reward function \(R\) together are called the model of the environment. A crucial fork in the field follows from whether the agent knows them. If \(P\) and \(R\) are known, the agent can plan — compute the best policy by pure thought, no interaction required, which is the dynamic programming of Chapter 02. If they are unknown, the agent must learn from samples, which is the model-free reinforcement learning of the chapters after. The MDP is the common language for both.
| Symbol | Name | What it is | Gridworld example |
|---|---|---|---|
| 𝒮 | states | every distinguishable situation | each cell of the grid |
| 𝒜 | actions | the agent's choices in a state | up, down, left, right |
| P(s′|s,a) | transitions | where actions lead (maybe stochastic) | "move up" → cell above (90%), slip sideways (10%) |
| R(s,a) | reward | scalar feedback per step | +1 at the goal, −1 in a trap, −0.04 per step |
| γ | discount | how much the future counts | 0.9 — distant rewards matter, but less |
# Define a tiny 3-state MDP and compute a trajectory's discounted return
import numpy as np
# states 0,1,2 ; action "go" moves you forward; reward is given on arrival
# rewards collected along one episode: s0 -> s1 -> s2 (terminal)
rewards = np.array([0.0, 1.0, 1.0, 1.0]) # r_1, r_2, r_3, r_4 received over time
gamma = 0.9
# discounted return G = sum_t gamma^t * r_{t+1} (EQ R1.3)
discounts = gamma ** np.arange(len(rewards))
G = float(np.sum(discounts * rewards))
print("reward sequence :", rewards.tolist())
print("discount weights:", discounts.round(4).tolist())
print(f"discount factor : gamma = {gamma}")
print(f"discounted return G = {G:.4f}")
# the famous case: rewards [1,1,1] at t=0,1,2 with gamma=0.9
g3 = sum(gamma**t * 1.0 for t in range(3))
print(f"\nreturn of [1,1,1] at gamma=0.9 : {g3:.2f} (= 1 + 0.9 + 0.81)")
Returns & discounting
The agent does not maximize the immediate reward — that would be greedy and shortsighted. It maximizes the return: the total reward accumulated from now to the end of time. For an episode that terminates, the return is simply the sum of rewards. But many problems never terminate, and an infinite sum of rewards is itself infinite and meaningless to compare. The fix is to discount: weight a reward received \(k\) steps in the future by \(\gamma^k\).
Discounting earns its place for three independent reasons, and it helps to keep them distinct. Mathematically, \(\gamma < 1\) makes the infinite sum converge, so returns are comparable numbers rather than diverging infinities. Economically, it mirrors the time value of reward — a unit now is worth more than a unit later, exactly as in interest and inflation. Behaviorally, it expresses uncertainty about the future: at each step there is effectively a \((1-\gamma)\) chance the world ends, so \(\gamma\) is the per-step survival probability. The return is the expected total reward under that geometric lifetime.
The endpoints of \(\gamma\) sharpen the intuition. At \(\gamma = 0\) the return collapses to \(R_{t+1}\): the agent cares only about the immediate reward and is purely myopic, blind to consequences. As \(\gamma \to 1\) every future reward counts almost fully and the agent becomes far-sighted, willing to suffer many small penalties now for a large payoff later — the patience a maze or a game of chess demands. Most problems live in between, and the choice of \(\gamma\) genuinely changes the optimal policy, not just the numbers: a maze-solver at \(\gamma = 0.99\) will take a longer, safer route that a \(\gamma = 0.8\) agent rejects as too distant to be worth it.
# How the discount factor reshapes the same reward stream (EQ R1.3)
import numpy as np
rewards = np.ones(40) # a steady +1 every step, 40 steps
horizons = []
for gamma in (0.0, 0.5, 0.9, 0.99):
w = gamma ** np.arange(len(rewards)) # discount weights
G = float((w * rewards).sum())
# "effective horizon" 1/(1-gamma): steps that meaningfully count
eff = np.inf if gamma >= 1 else 1.0 / (1.0 - gamma)
horizons.append((gamma, G, eff))
print(f"gamma={gamma:4} -> return {G:7.3f} effective horizon ~ {eff:6.1f} steps")
print("\ngamma=0 is myopic: return is exactly the first reward (1.0).")
print("near gamma=1 the agent sums almost all 40 rewards and plans far ahead.")
# closed form for an infinite stream of +1: 1/(1-gamma)
print("infinite-stream limits 1/(1-g):",
[round(1/(1-g),2) for g in (0.5,0.9,0.99)])
plot_xy([h[0] for h in horizons], [h[1] for h in horizons])
Policies & value functions
A policy is the agent's behavior: a rule that maps states to actions. It is the object RL ultimately searches for — the solution to the MDP. A policy can be deterministic, \(a = \pi(s)\), always taking the same action in a state; or stochastic, \(\pi(a \mid s)\), a probability distribution over actions. Stochastic policies matter more than they look: they are how an agent explores, and in some problems (and in every adversarial game) the optimal policy is irreducibly random.
To improve a policy we need to know how good it is, and "good" means expected return. This gives the two central quantities of the field. The state-value function \(V^\pi(s)\) is the expected return from starting in state \(s\) and following \(\pi\) thereafter. The action-value function \(Q^\pi(s, a)\) is the expected return from taking action \(a\) in state \(s\) and then following \(\pi\).
Why two functions? Because of what each one lets you do without a model. If you know \(V^\pi\) but not the transitions \(P\), you cannot act greedily — you would need to know where each action leads to compare states. But if you know \(Q^\pi\), choosing the best action is trivial: take \(\arg\max_a Q(s,a)\), no model required. This is precisely why model-free control methods (Q-learning, SARSA) learn \(Q\) rather than \(V\). The value functions are the connective tissue of the whole field: estimate them, and a good policy falls out.
A subtlety experts will insist on: values are always relative to a policy. There is no such thing as "the value of a state" in the abstract — only its value under some way of behaving. The exception is the optimal value function \(V^{*}(s) = \max_\pi V^\pi(s)\), the best achievable from each state, which Chapter 02 shows how to compute. Solving an MDP means finding \(V^{*}\) (or \(Q^{*}\)) and reading the optimal policy off it.
# V from Q, the advantage, and which action a greedy agent would pick (EQ R1.6)
import numpy as np
actions = ["up", "down", "left", "right"]
Q = np.array([10.0, 7.0, 4.0, 9.0]) # Q(s, a) for one state s
pi = np.array([0.70, 0.10, 0.05, 0.15]) # current stochastic policy pi(a|s)
assert np.isclose(pi.sum(), 1.0)
V = float(pi @ Q) # V(s) = sum_a pi(a|s) Q(s,a)
advantage = Q - V # A(s,a) = Q(s,a) - V(s)
print(f"state value V(s) : {V:.3f}")
for a, q, adv, p in zip(actions, Q, advantage, pi):
flag = " <- above average" if adv > 0 else ""
print(f" {a:5s} Q={q:5.1f} A={adv:+5.2f} pi={p:.2f}{flag}")
greedy = actions[int(np.argmax(Q))]
print(f"\ngreedy action (argmax Q): {greedy}")
print("a policy gradient step pushes pi toward actions with positive advantage.")
Exploration vs exploitation
Here is the dilemma that has no counterpart in supervised learning. To collect reward, the agent should exploit — take the action it currently believes is best. But its beliefs are estimates built from limited experience, and the only way to improve them is to explore — try actions it is unsure about, which may be worse. Every step forces a choice between cashing in what you know and gathering information that might let you do better later. Lean too far toward exploitation and you lock onto a mediocre habit, never discovering the better path you never tried. Lean too far toward exploration and you squander reward forever testing options you already know are bad.
This tension is fundamental, not an artifact of any algorithm, and it is sharpened by the feedback we opened the chapter with: because the agent's actions decide what data it sees, an action never taken is an action never learned from. A supervised learner sees every labeled example in the dataset whether it likes it or not. An RL agent sees only the consequences of what it chose to do — so under-exploration is self-reinforcing, a blind spot the agent cannot detect from inside.
The simplest workable strategy is ε-greedy: with probability \(1-\varepsilon\) take the current best (greedy) action, and with probability \(\varepsilon\) take a uniformly random action instead. It is crude but remarkably effective, and it is the exploration rule baked into the first generation of deep-RL agents.
ε-greedy is not the only tool, and it has a real weakness worth stating: it explores blindly, treating a clearly-terrible action and a plausibly-good-but-untested one as equally worth a random visit. Smarter schemes — optimism in the face of uncertainty (UCB, which adds a bonus for actions tried less often), Boltzmann / softmax exploration (sample in proportion to estimated value), and posterior sampling (Thompson sampling) — direct exploration toward what is genuinely uncertain rather than uniformly random. The multi-armed bandit, the simplest possible RL problem (one state, no transitions, just the explore–exploit trade-off in isolation), is where these are studied cleanly and where the regret bounds that quantify "how much you lose by not knowing the best arm" are proved. We treat bandits in their own chapter.
# Epsilon-greedy action selection, with an annealed exploration rate (EQ R1.7)
import numpy as np
rng = np.random.default_rng(0)
Q = np.array([1.0, 5.0, 2.0, 4.0]) # estimated action-values, 4 actions
nA = len(Q)
greedy = int(np.argmax(Q)) # action index 1 here (value 5.0)
def eps_greedy(Q, eps):
if rng.random() < eps: # explore: uniform random
return rng.integers(len(Q))
return int(np.argmax(Q)) # exploit: the greedy action
# anneal epsilon from 1.0 down to a 0.05 floor, and report the rate
print(" step epsilon P(greedy)=1-eps+eps/|A| empirical greedy share")
for step in (0, 50, 200, 1000):
eps = max(0.05, 1.0 * (0.995 ** step)) # exponential decay with a floor
p_greedy = 1 - eps + eps / nA
picks = [eps_greedy(Q, eps) for _ in range(4000)]
share = np.mean(np.array(picks) == greedy)
print(f" {step:5d} {eps:6.3f} {p_greedy:6.3f} {share:6.3f}")
print(f"\ngreedy action = index {greedy} (Q = {Q[greedy]}).")
print("as epsilon decays, the agent shifts from exploring to exploiting it.")
We can now state the problem exactly — but not yet solve it. Chapter 02 supplies the first machinery: the Bellman equations, which turn the recursive return (EQ R1.4) into a fixed-point system, and the dynamic-programming algorithms — policy evaluation, policy iteration, and value iteration (the engine behind the Gridworld instrument above) — that compute the optimal value function and policy when the model \(P\) and \(R\) is known. Everything after that is the harder, real-world case: learning the same answers from experience alone.
References
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.).
- Bellman, R. (1957). A Markovian Decision Process.
- Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning.
- Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem.
- Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996). Reinforcement Learning: A Survey.