A definition that survives the hype
“Agent” is the most marketing-soaked word in the field, applied with equal confidence to a cron job with an API key and to a system that ships production code unsupervised. A definition that survives contact with both vendors and reality has exactly four components:
| Component | Role | Without it you have… |
|---|---|---|
| LLM | the policy — picks the next action from everything seen so far | ordinary software |
| Tools | actuators — the only way the model touches the world | a chatbot: all talk, no hands |
| Loop | feedback — observations return as input to the next decision | a one-shot pipeline: open-loop, no recovery |
| Goal | termination — defines what “done” means and who decides it | a screensaver that bills by the token |
Formally, the agent is a fixed policy unrolled against an environment. The state is nothing more exotic than the transcript so far: the goal plus every action taken and every observation returned.
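One way to write that sentence down, as a sketch rather than the chapter's own numbered equation: the symbols below are assumed notation, with $g$ standing for the initial context (system prompt, tool schemas, user goal), $\pi_\theta$ for the model, and $\mathrm{Env}$ for whatever the harness and tools return.

```latex
s_t = (g,\; a_1, o_1,\; \dots,\; a_{t-1}, o_{t-1}), \qquad
a_t \sim \pi_\theta(\,\cdot \mid s_t), \qquad
o_t = \mathrm{Env}(a_t), \qquad
s_{t+1} = s_t \oplus (a_t, o_t)
```

Here $\oplus$ is plain concatenation. Nothing in $\theta$ changes during the run; the only thing that evolves is the transcript $s_t$.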
The boundary worth defending is the one between workflows and agents. In a workflow, your code owns control flow and the model fills in slots — summarize this, classify that — along paths fixed before the run started. In an agent, the model owns control flow: it decides what happens next, how many steps to take, and when the job is done. Everything in between is a gradient, which §1.3 turns into a ladder. The distinction matters because the two fail differently: workflows fail like software (loudly, reproducibly, at a known step), agents fail like employees (plausibly, variably, sometimes silently) — and everything in this volume is about engineering around the second failure style.
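The difference in who owns control flow can be made concrete with a minimal sketch. Everything here is hypothetical: `Model` is a stand-in for any prompt-to-text call, and the `"tool: argument"` / `"done: answer"` reply convention is invented for the illustration, not a real function-calling format.

```python
from typing import Callable

# Hypothetical stand-in for a model call: maps a prompt to text.
Model = Callable[[str], str]
Tools = dict[str, Callable[[str], str]]

def workflow(ticket: str, model: Model) -> str:
    """Workflow: the code fixes the path; the model only fills slots."""
    summary = model(f"summarize: {ticket}")
    label = model(f"classify: {summary}")
    return f"{label}: {summary}"            # always exactly two model calls

def agent(goal: str, model: Model, tools: Tools, max_turns: int = 8) -> str:
    """Agent: the model picks each next step and decides when to stop."""
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_turns):
        # Assumed reply convention: "tool_name: argument" or "done: answer".
        decision = model("\n".join(transcript))
        name, _, arg = decision.partition(": ")
        if name == "done":
            return arg                      # the model, not the code, ended the run
        tool = tools.get(name)
        obs = tool(arg) if tool else f"unknown tool {name!r}"
        transcript.append(f"OBS: {obs}")
    return "escalate: turn budget exhausted"
```

In `workflow` the number and order of calls is visible in the source; in `agent` both are decided at runtime, which is why the only hard guarantee the code can give is the turn cap.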
What “judgment” buys. A while-loop with judgment is not a put-down — the judgment is the whole product. Fixed automation handles enumerable cases; the agent's bet is that a strong policy over open-ended situations beats an exhaustive case analysis nobody can actually write. You pay for that bet in variance. The discipline of this volume is deciding, task by task, whether the bet is worth it.
Anatomy of the loop
Here is the object of study for the next five chapters — the canonical loop, essentially as it appears inside every production coding agent, stripped of error handling:
# The canonical agentic loop: the ~10 lines under every agent product
def run(system_prompt, tool_schemas, user_goal):
    context = [system_prompt, tool_schemas, user_goal]
    turns, spend = 0, 0.0
    while turns < MAX_TURNS and spend < BUDGET:
        reply = llm(context)                          # the only intelligent step
        turns, spend = turns + 1, spend + reply.cost  # assumed: each reply reports its cost
        if reply.tool_calls:
            results = [harness.execute(c) for c in reply.tool_calls]
            context += [reply, results]               # the transcript IS the state
        else:
            return reply.text                         # the model decided it is done
    return escalate("budget exhausted: hand back to a human")
One turn = one model emission. The model returns either tool calls — structured, schema-conforming action requests (Vol III · Ch 05) — or plain text, which the loop reads as “finished.” The harness executes the calls it approves, in a sandbox it controls, and appends whatever came back — stdout, an error trace, a screenshot, a search result — as an observation. Then the model is called again on the longer transcript. That is the entire trick: the model never touches the world, and the world never touches the model; they only exchange tokens through the context.
# a complete working agent: mock LLM policy + two tools + the loop

def search(q):
    # toy lookup table standing in for a real search tool
    return {"speed of light km/s": "299792.458"}.get(q, "no results")

def calc(expr):
    # eval with empty builtins is fine for this demo; it is not a real sandbox
    return str(eval(expr, {"__builtins__": {}}, {}))

TOOLS = {"search": search, "calc": calc}

def llm(context):  # rule-based stand-in for pi_theta
    seen = " ".join(context)
    if "1079252848" in seen:
        return ("answer observed and verified, stop", None,
                "light covers 1,079,252,848.8 km in one hour")
    if "299792.458" in seen:
        return ("have km/s, need km/h: multiply by 3600", "calc", "299792.458 * 3600")
    return ("no constant in context yet, look it up", "search", "speed of light km/s")

context, turn = ["GOAL: how far does light travel in one hour, in km?"], 0
while turn < 6:  # the harness's hard budget
    turn += 1
    thought, tool, arg = llm(context)
    print(f"turn {turn} | THOUGHT {thought}")
    if tool is None:
        print(f"turn {turn} | ANSWER {arg}")
        break
    obs = TOOLS[tool](arg)
    print(f"turn {turn} | ACT {tool}({arg!r}) -> OBS {obs}")
    context.append(f"OBS: {obs}")
print()
print("an agent is a while-loop with judgment")
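The mock above calls `TOOLS[tool](arg)` directly. In the loop described earlier, a harness sits in between: it runs only the calls it approves and turns whatever comes back, including failures, into an observation. A minimal sketch of that layer follows; the `approved` set, the `DENIED`/`ERROR` string formats, and the `MAX_OBS_CHARS` cap are all assumptions made for the illustration.

```python
from typing import Callable

MAX_OBS_CHARS = 2_000   # assumed cap so one noisy tool cannot flood the context

def execute(tool: str, arg: str,
            tools: dict[str, Callable[[str], str]],
            approved: set[str]) -> str:
    """Run one tool call and always hand back a string observation."""
    if tool not in tools:
        return f"ERROR: unknown tool {tool!r}"
    if tool not in approved:
        return f"DENIED: {tool!r} needs approval"
    try:
        out = str(tools[tool](arg))
    except Exception as exc:   # a failure is an observation, not a crash
        out = f"ERROR: {type(exc).__name__}: {exc}"
    return out[:MAX_OBS_CHARS]
```

Returning errors as text instead of raising keeps the loop alive and gives the model something to react to on the next turn, which is the recovery path an open-loop pipeline lacks.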
Watch the loop run. The task is the smallest real agentic episode there is — a failing test, a config file, and a model that has to find the bug, fix it, and prove the fix:
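Why does the verification turn matter so much? Because an open-loop system multiplies its per-step reliability across the whole horizon. If each step succeeds independently with probability p, a run of n steps succeeds with probability p^n, which decays quickly as n grows.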
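The compounding is easy to check numerically. The sketch below assumes independent steps, and for the closed-loop case it assumes an idealized verifier that always catches a failed step and allows a fixed number of independent retries; real verifiers are imperfect, so treat the second function as an upper bound.

```python
def open_loop_success(p: float, n: int) -> float:
    """Probability that all n steps succeed when nothing is ever checked."""
    return p ** n

def checked_success(p: float, n: int, attempts: int) -> float:
    """Same run, but each step is verified and retried up to `attempts` times.

    Assumes a perfect verifier and independent retries (idealized).
    """
    per_step = 1 - (1 - p) ** attempts
    return per_step ** n

if __name__ == "__main__":
    for n in (5, 20, 50):
        print(n, round(open_loop_success(0.95, n), 3),
              round(checked_success(0.95, n, attempts=2), 3))
```

At 95% per-step reliability, a 20-step open-loop run finishes cleanly only about 36% of the time; one verified retry per step lifts the same run above 95% under these assumptions.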
Degrees of autonomy: pick the lowest rung that works
Autonomy is not a binary; it is a ladder of who owns control flow. Each rung hands the model more of the run — and hands you more variance, more cost, and a harder evaluation problem:
| Rung | Control flow owned by | Model decides | Evaluates like |
|---|---|---|---|
| R0 · Workflow | your code, fully | content of each slot | software — unit tests per step |
| R1 · Router | your code, one branch point | one classification | a classifier — precision / recall |
| R2 · Single-tool agent | model, inside one loop · one tool | each call + when to stop | task success under a turn cap |
| R3 · Multi-step agent | model, open toolset | plan, actions, ordering, stop | end-state verifier on trajectories |
| R4 · Multi-agent | an orchestrating model | decomposition + everything below | per-subagent verifiers + a merge gate |
The design rule is unfashionable and correct: take the lowest rung that solves the task. Every rung you climb without needing to converts a debuggable system into a stochastic one. A fixed sequence of LLM calls is still rung 0 — multiple steps are not autonomy; autonomy begins when the number or identity of the steps is decided at runtime. And rung 4 is justified by exactly two things — parallelism across independent subtasks, and context isolation when one window can't hold the job — not by the theater of models “collaborating.” The evidence on multi-agent debate and role-play is mixed at best: at matched token budgets, a single strong agent frequently wins. When someone proposes rung 4, ask which of the two real justifications applies.
# the cost of climbing the ladder: single-shot vs a 5-turn agent
base = 800 + 150                        # system prompt + user goal, tokens
out_per_turn, obs_per_turn = 120, 350   # action text + tool observation
single = base + 300                     # rung 0: one call, one answer

total_in = total_out = 0
ctx = base
print("turn  context resent  output")
for t in range(1, 6):
    total_in += ctx
    total_out += out_per_turn
    print(f"{t:>4}  {ctx:>14,}  {out_per_turn:>6}")
    ctx += out_per_turn + obs_per_turn  # this turn's action + obs ride along forever
agent = total_in + total_out

print(f"\nsingle-shot call : {single:6,d} tokens")
print(f"5-turn agent     : {agent:6,d} tokens")
print(f"multiplier       : x{agent/single:.1f}")
print("the transcript is the state, so every turn re-buys all previous turns:")
print("autonomy compounds cost quadratically, and the model picks the turn count")
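The quadratic term can be read off in closed form. In the notation assumed here, $b$ is the base context (`base`), $o$ the output per turn (`out_per_turn`), and $d$ the growth per turn (`out_per_turn + obs_per_turn`); caching, compaction, and per-token price differences between input and output are ignored.

```latex
\text{input}(n) = \sum_{t=1}^{n} \bigl(b + (t-1)\,d\bigr) = n\,b + d\,\frac{n(n-1)}{2},
\qquad
\text{total}(n) = \text{input}(n) + n\,o
```

With the script's numbers ($b = 950$, $d = 470$, $o = 120$, $n = 5$) this gives $\text{input} = 4750 + 4700 = 9450$ and $\text{total} = 10050$ tokens, about 8 times the 1,250-token single-shot call. The $n\,b$ term is linear; the $d\,n(n-1)/2$ term is what grows quadratically with the turn count.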
Autonomy is also a permission grant. The ladder above is about who decides; in production it is mirrored by what the system is allowed to do — read-only vs write access, sandboxed vs live, human-approved vs autonomous actions. The two ladders should climb together: a rung-3 agent with rung-0 permissions (everything gated) is a safe way to earn trust; a rung-1 router with production write access is how incidents happen. Chapter 04 makes this precise.
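A permission ladder of this kind can be sketched as a small gate in the harness. The three access levels and the three verdicts below are hypothetical names chosen for the illustration; they are not the scheme Chapter 04 defines.

```python
from enum import Enum

class Access(Enum):
    """Hypothetical action classes, ordered from least to most dangerous."""
    READ = 1
    WRITE_SANDBOX = 2
    WRITE_LIVE = 3

class Verdict(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    DENY = "deny"

def gate(action: Access, granted: Access, auto_up_to: Access) -> Verdict:
    """Decide what happens to one requested action.

    granted    -- highest level the system may perform at all
    auto_up_to -- highest level it may perform without human approval
    """
    if action.value > granted.value:
        return Verdict.DENY
    if action.value > auto_up_to.value:
        return Verdict.ASK_HUMAN
    return Verdict.ALLOW
```

Under this sketch, "rung-0 permissions" means setting `auto_up_to` to `Access.READ`, so every write is routed through a human however much control flow the model owns.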
What changed in 2024–26
The loop itself is old — ReAct (Yao et al., 2022) ran reason-act-observe cycles by pure prompting, parsing actions out of free text with a regex and a prayer. What changed is that every link in the loop got trained instead of prompted:
- Tool use moved into the weights. Function calling stopped being a parsing convention and became a post-training target: models are fine-tuned and RL-trained to emit schema-valid tool calls in dedicated formats, to choose between tools, and to decide when no tool is needed. Reliability went from “mostly parses” to a substrate you can build on (Vol III · Ch 05). On the supply side, MCP (late 2024) standardized how tools describe themselves, so any agent can discover and call any conforming tool — the USB moment for actuators.
- RLVR went long-horizon. The recipe that built reasoning models (Vol II · Ch 05) — sample, verify the outcome, reinforce the trajectory — was extended from single-turn math to entire tool-using episodes: reward arrives at the end of a multi-turn rollout (did the tests pass? was the file produced?), and credit flows back through every intermediate decision. Out of this came trained agentic behaviors nobody prompted: decomposing before acting, checking work mid-stream, recovering from a failed call instead of repeating it.
- Computer use made pixels a tool. From late 2024, frontier models ship with screenshot-in, click/keystroke-out interfaces — the actuator of last resort that turns any GUI into an agent environment, no API required. It remains the slowest and most fragile rung of the tool stack, but the trendline is steep: on OSWorld, success rates went from roughly 15% at launch to above 60% within two years, against a human baseline near 72%.
- Coding agents became the proof case. Software is the perfect agent habitat: rich tools (read, grep, edit, run), a verifiable reward signal (compilers and tests are free oracles), and unbounded demand. On SWE-bench (real GitHub issues, graded by held-out tests), resolution rates went from low single digits at the benchmark's late-2023 release to above 70% on its human-validated Verified subset by 2025. Whatever agents become elsewhere, they became real in code first, because code is where EQ A1.2's verifier is built into the environment.
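The phrase "reward arrives at the end and credit flows back" from the RLVR item above can be illustrated with the textbook return computation from Sutton & Barto. This is a toy sketch of outcome-only credit assignment, not a description of any lab's training recipe; production methods add baselines, normalization, and much else.

```python
def returns_to_go(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Discounted return from each step onward: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# A four-turn episode where only the final check (say, the tests pass) pays out.
episode = [0.0, 0.0, 0.0, 1.0]
print(returns_to_go(episode))             # every turn is credited with the outcome
print(returns_to_go(episode, gamma=0.5))  # earlier turns get geometrically less
```

With `gamma = 1.0` every intermediate decision receives the full terminal reward, which is the sense in which a single end-of-episode verdict can reinforce an entire trajectory.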
The most useful single number for tracking all of this is METR's horizon: take tasks humans need minutes-to-hours to do, and measure the longest task length (in human time) the model completes at 50% reliability. Fit across 2019–2025 frontier models, it doubles on a startlingly steady clock, roughly every seven months in METR's published fit.
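A steady doubling clock is just an exponential, which makes back-of-envelope extrapolation simple. In the sketch below the doubling time is a parameter, and the starting horizon and the numbers in the usage lines are placeholders chosen for round arithmetic, not measurements.

```python
import math

def horizon_minutes(h0_minutes: float, months_elapsed: float,
                    doubling_months: float) -> float:
    """Task horizon after `months_elapsed`, given a fixed doubling time."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

def months_until(target_minutes: float, h0_minutes: float,
                 doubling_months: float) -> float:
    """Months needed for the horizon to grow from h0 to the target."""
    return doubling_months * math.log2(target_minutes / h0_minutes)

if __name__ == "__main__":
    # Placeholder inputs: a 60-minute horizon today and a 7-month doubling time.
    print(horizon_minutes(60, months_elapsed=14, doubling_months=7))
    print(months_until(480, h0_minutes=60, doubling_months=7))
```

Any such extrapolation inherits the assumption that the fitted trend continues, and the horizon is defined at 50% reliability, so it says nothing about how long a task can be run at the reliability a production system needs.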
What did not change: the ten lines of §1.2. The 2022 prompted loop and the 2026 trained one are structurally identical — better policy, same plumbing. That is exactly why the plumbing is worth a volume: the model improves on someone else's schedule; the harness improves on yours.
The four hard problems
Everything difficult about agents is downstream of one fact: the loop runs unattended, accumulating state, spending money, and touching the world, on a policy you cannot inspect. Four problems fall out, and they fill the rest of this volume:
Notice that none of the four is “make the model smarter.” The model arrives with its capabilities fixed; agent engineering is everything you wrap around EQ A1.1 so that a fallible policy produces reliable work. The encyclopedias of 2020 would have called this prompt engineering; it has grown into systems engineering with a stochastic component in the middle.
The loop's state is the context, and the context is always running out. Chapter 02: context engineering — what actually belongs in the window, compaction without amnesia, memory that survives the episode, and why the best agents read less than you think.
Further reading
- Russell, S. & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). — defines the rational agent / percept–act loop this whole volume builds on.
- Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction (2nd ed.). — the canonical treatment of agents acting in an environment to maximize return.
- Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. — the paper that fused chain-of-thought with tool actions into the modern LLM loop.
- Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. — an early, vivid demonstration of long-horizon autonomy and skill acquisition.
- Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. — shows self-reflection on outcomes as a way to improve across loop iterations.
- Anthropic (2024). Building Effective Agents. — a practitioner's taxonomy of workflows versus agents and when autonomy is worth its cost.