01 · From Chat to Agents: The Loop

1.1

A definition that survives the hype

“Agent” is the most marketing-soaked word in the field, applied with equal confidence to a cron job with an API key and to a system that ships production code unsupervised. A definition that survives contact with both vendors and reality has exactly four components:

Component	Role	Without it you have…
LLM	the policy — picks the next action from everything seen so far	ordinary software
Tools	actuators — the only way the model touches the world	a chatbot: all talk, no hands
Loop	feedback — observations return as input to the next decision	a one-shot pipeline: open-loop, no recovery
Goal	termination — defines what “done” means and who decides it	a screensaver that bills by the token

Formally, the agent is a fixed policy unrolled against an environment. The state is nothing more exotic than the transcript so far:

EQ A1.1 — THE LOOP AS A POLICY $$ s_t \;=\; \big(\,g;\; a_1, o_1,\; \ldots,\; a_{t-1}, o_{t-1}\big), \qquad a_t \,\sim\, \pi_\theta(\cdot \mid s_t), \qquad o_t \,=\, E(a_t), \qquad \text{until } a_t \in \mathcal{A}_{\mathrm{stop}} $$

$g$ the goal, $a_t$ an action (a tool call, or a final answer), $o_t$ the observation the environment $E$ returns, $\pi_\theta$ the frozen LLM. Two things deserve a stare. First, the weights never change at runtime — every scrap of within-episode “learning” lives in $s_t$, the context, which is why Chapter 02 exists. Second, the model itself emits the stop action: termination is a decision, sampled from the same distribution as everything else, and it can be wrong in both directions — quitting early or looping forever (Chapter 05).

The boundary worth defending is the one between workflows and agents. In a workflow, your code owns control flow and the model fills in slots — summarize this, classify that — along paths fixed before the run started. In an agent, the model owns control flow: it decides what happens next, how many steps to take, and when the job is done. Everything in between is a gradient, which §1.3 turns into a ladder. The distinction matters because the two fail differently: workflows fail like software (loudly, reproducibly, at a known step), agents fail like employees (plausibly, variably, sometimes silently) — and everything in this volume is about engineering around the second failure style.
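To make the boundary concrete, here is a minimal sketch that runs the same scripted stand-in model under both regimes. Every name in it (`fake_llm`, `run_workflow`, `run_agent`, the `CALL`/`DONE` reply format, the ticket example) is hypothetical and chosen only for illustration; the one thing it is meant to show is who owns the control flow.

```python
# Hypothetical sketch: one stub model, two owners of control flow.

def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a model; a real system would call an LLM here."""
    if prompt.startswith("summarize:"):
        return "summary(" + prompt[len("summarize:"):].strip() + ")"
    if prompt.startswith("classify:"):
        return "bug" if "error" in prompt else "question"
    # Agent mode: ask for one lookup, then declare the job done.
    if "OBS:" not in prompt:
        return "CALL lookup ticket-42"
    return "DONE ticket-42 is a duplicate"


def run_workflow(ticket: str) -> dict:
    """Workflow: the path is fixed in code; the model only fills slots."""
    summary = fake_llm(f"summarize: {ticket}")   # step 1, always runs
    label = fake_llm(f"classify: {ticket}")      # step 2, always runs
    return {"summary": summary, "label": label, "model_calls": 2}


def run_agent(goal: str, max_turns: int = 5) -> dict:
    """Agent: the model's reply picks the next step and decides when to stop."""
    tools = {"lookup": lambda arg: f"{arg} duplicates ticket-7"}
    transcript = [f"GOAL: {goal}"]
    for turn in range(1, max_turns + 1):
        reply = fake_llm("\n".join(transcript))
        if reply.startswith("DONE"):             # the model chose to stop
            return {"answer": reply[5:], "model_calls": turn}
        _, tool, arg = reply.split(" ", 2)       # the model chose the action
        transcript.append(f"OBS: {tools[tool](arg)}")
    return {"answer": None, "model_calls": max_turns}  # harness cut it off


print(run_workflow("error on login page"))
print(run_agent("triage ticket-42"))
```

In `run_workflow` the number of model calls is a constant you can read off the source; in `run_agent` it is an output of the run, which is the property that makes agents harder to test.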

What “judgment” buys. Calling an agent “a while-loop with judgment” is not a put-down: the judgment is the whole product. Fixed automation handles enumerable cases; the agent's bet is that a strong policy over open-ended situations beats an exhaustive case analysis nobody can actually write. You pay for that bet in variance. The discipline of this volume is deciding, task by task, whether the bet is worth it.

1.2

Anatomy of the loop

Here is the object of study for the next five chapters — the canonical loop, essentially as it appears inside every production coding agent, stripped of error handling:

# The canonical agentic loop: the ~10 lines under every agent product
context = [system_prompt, tool_schemas, user_goal]
turns, spend = 0, 0.0
while turns < MAX_TURNS and spend < BUDGET:
    reply = llm(context)                          # the only intelligent step
    turns, spend = turns + 1, spend + reply.cost  # reply.cost: assumed per-call cost field
    if reply.tool_calls:
        results = [harness.execute(c) for c in reply.tool_calls]
        context += [reply, results]               # the transcript IS the state
    else:
        return reply.text                         # the model decided it is done
return escalate("budget exhausted: hand back to a human")

One turn = one model emission. The model returns either tool calls — structured, schema-conforming action requests (Vol III · Ch 05) — or plain text, which the loop reads as “finished.” The harness executes the calls it approves, in a sandbox it controls, and appends whatever came back — stdout, an error trace, a screenshot, a search result — as an observation. Then the model is called again on the longer transcript. That is the entire trick: the model never touches the world, and the world never touches the model; they only exchange tokens through the context.

PYTHON · RUNNABLE IN-BROWSER

# a complete working agent: mock LLM policy + two tools + the loop
def search(q):  return {"speed of light km/s": "299792.458"}.get(q, "no results")
def calc(expr): return str(eval(expr, {"__builtins__": {}}, {}))  # demo only: eval is not a safe sandbox
TOOLS = {"search": search, "calc": calc}

def llm(context):                              # rule-based stand-in for pi_theta
    seen = " ".join(context)
    if "1079252848" in seen:
        return ("answer observed and verified — stop", None,
                "light covers 1,079,252,848.8 km in one hour")
    if "299792.458" in seen:
        return ("have km/s, need km/h: multiply by 3600", "calc", "299792.458 * 3600")
    return ("no constant in context yet — look it up", "search", "speed of light km/s")

context, turn = ["GOAL: how far does light travel in one hour, in km?"], 0
while turn < 6:                                # the harness's hard budget
    turn += 1
    thought, tool, arg = llm(context)
    print(f"turn {turn} | THOUGHT {thought}")
    if tool is None:
        print(f"turn {turn} | ANSWER  {arg}")
        break
    obs = TOOLS[tool](arg)
    print(f"turn {turn} | ACT     {tool}({arg!r}) -> OBS {obs}")
    context.append(f"OBS: {obs}")

print()
print("an agent is a while-loop with judgment")


FIG A1.A · THE AGENTIC LOOP: DATA PATH

Tokens are the only interface. The model's sole output is text; the environment's sole input to the model is text appended to context. Every agent failure mode in Chapter 05 is ultimately a corruption of this picture: bad state in, bad action out, repeat.

Watch the loop run. The task is the smallest real agentic episode there is — a failing test, a config file, and a model that has to find the bug, fix it, and prove the fix:

INSTRUMENT A1.1 · LOOP STEP-SIM (scripted episode · 6 turns · EQ A1.1 live)

GOAL g

“A test started failing after yesterday's config change. Find the bug in config/ and fix it. The suite must pass.”

[Interactive instrument: controls, plus readouts for turn, context (simulated tokens), tool calls, and status.]

STEP advances one event: model reasoning (grey), tool call (mint), observation (blue) — or hit AUTO and watch the whole episode. Three lessons hide in plain sight: the model never sees the repo, only observations its own calls produced; turn 5 re-runs the tests because an unverified fix is a guess; and the token counter only ever goes up — the loop's state grows monotonically, which is the problem Chapter 02 inherits. The transcript is scripted; real episodes differ run to run.

Why does the verification turn matter so much? Because an open-loop system multiplies its per-step reliability across the whole horizon:

EQ A1.2 — THE COMPOUNDING-ERROR BOUND $$ P(\text{episode succeeds}) \;=\; \prod_{t=1}^{n} p_t \;\overset{\text{indep.}}{=}\; p^{\,n}, \qquad 0.99^{60} \approx 0.55, \qquad 0.95^{60} \approx 0.046 $$

With per-step success $p$ and no feedback, a 60-step task collapses: 99% steps give a coin flip, 95% steps give near-certain failure. Real agents sit on both sides of this bound. They beat it because the loop lets observed errors be repaired — a red test is not a failure, it is information — and they undershoot it because their errors correlate (one wrong belief poisons every subsequent step). The engineering consequence: verifiable feedback is worth more than raw per-step accuracy. A tool that turns silent errors into visible observations (run the tests, render the page, validate the schema) is the cheapest reliability you will ever buy.
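A small calculation makes the value of feedback concrete. The sketch below rests on a deliberately simple model that is an assumption of this example, not something the text specifies: each attempt at a step succeeds independently with probability `p`, a failed attempt is noticed with probability `detect`, a noticed failure may be retried up to `retries` times, and an unnoticed failure sinks the episode. The names `step_success` and `episode_success` are illustrative. Correlated errors, which the paragraph above flags as the reason real agents can undershoot the bound, are not modeled.

```python
def step_success(p: float, detect: float = 0.0, retries: int = 0) -> float:
    """P(one step ends correct) under the toy retry model.

    An attempt is repeated only if it failed AND the failure was observed,
    which happens with probability (1 - p) * detect.
    """
    repeat = (1.0 - p) * detect
    # Succeed on the first attempt, or after 1, 2, ... observed failures.
    return p * sum(repeat ** i for i in range(retries + 1))


def episode_success(p: float, n: int, detect: float = 0.0, retries: int = 0) -> float:
    """P(all n steps end correct), assuming steps are independent."""
    return step_success(p, detect, retries) ** n


scenarios = [
    ("open loop (EQ A1.2)", dict(detect=0.0, retries=0)),
    ("half of failures visible, 1 retry", dict(detect=0.5, retries=1)),
    ("90% of failures visible, 2 retries", dict(detect=0.9, retries=2)),
]
for p in (0.99, 0.95):
    for label, kwargs in scenarios:
        print(f"p={p:.2f}  {label:36s} P(success over 60 steps) = "
              f"{episode_success(p, 60, **kwargs):.3f}")
```

With `detect=0` the function reduces to $p^{\,n}$ exactly. Raising `detect` stands in for adding a verifier (tests, schema checks), and in this toy model it moves the episode-level number far more than a comparable bump in `p` would.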

An open-loop agent runs a 40-step task with per-step success $p = 0.99$ and no feedback. By EQ A1.2, what is $P(\text{episode succeeds}) = p^{\,n}$? Give a probability between 0 and 1.

$P = 0.99^{40}$. Take logs: $40 \ln 0.99 = 40 \times (-0.01005) = -0.4020$, so $P = e^{-0.4020} \approx 0.669$. A 1% per-step error rate already costs a third of all episodes at 40 steps. The answer is 0.669.

Now drop the per-step success to $p = 0.97$ over $n = 20$ steps. What is $p^{\,n} = 0.97^{20}$? Give a probability between 0 and 1.

$0.97^{20}$: square up — $0.97^2 = 0.9409$, $0.97^4 = 0.8853$, $0.97^8 = 0.7837$, $0.97^{16} = 0.6143$; then $0.97^{20} = 0.97^{16} \times 0.97^4 = 0.6143 \times 0.8853 \approx 0.544$. Even a "good" 97% step is a coin-flippy 54% at twenty steps — which is why the loop, not the per-step number, decides reliability. The answer is 0.544.

1.3

Degrees of autonomy: pick the lowest rung that works

Autonomy is not a binary; it is a ladder of who owns control flow. Each rung hands the model more of the run — and hands you more variance, more cost, and a harder evaluation problem:

Rung	Control flow owned by	Model decides	Evaluates like
R0 · Workflow	your code, fully	content of each slot	software — unit tests per step
R1 · Router	your code, one branch point	one classification	a classifier — precision / recall
R2 · Single-tool agent	model, inside one loop · one tool	each call + when to stop	task success under a turn cap
R3 · Multi-step agent	model, open toolset	plan, actions, ordering, stop	end-state verifier on trajectories
R4 · Multi-agent	an orchestrating model	decomposition + everything below	per-subagent verifiers + a merge gate

The design rule is unfashionable and correct: take the lowest rung that solves the task. Every rung you climb without needing to converts a debuggable system into a stochastic one. A fixed sequence of LLM calls is still rung 0 — multiple steps are not autonomy; autonomy begins when the number or identity of the steps is decided at runtime. And rung 4 is justified by exactly two things — parallelism across independent subtasks, and context isolation when one window can't hold the job — not by the theater of models “collaborating.” The evidence on multi-agent debate and role-play is mixed at best: at matched token budgets, a single strong agent frequently wins. When someone proposes rung 4, ask which of the two real justifications applies.

PYTHON · RUNNABLE IN-BROWSER

# the cost of climbing the ladder: single-shot vs a 5-turn agent
base = 800 + 150                 # system prompt + user goal, tokens
out_per_turn, obs_per_turn = 120, 350    # action text + tool observation

single = base + 300              # rung 0: one call, one answer
total_in = total_out = 0
ctx = base
print("turn   context resent   output")
for t in range(1, 6):
    total_in += ctx
    total_out += out_per_turn
    print(f"  {t}          {ctx:6,d}      {out_per_turn}")
    ctx += out_per_turn + obs_per_turn   # this turn's action + obs ride along forever

agent = total_in + total_out
print(f"\nsingle-shot call : {single:6,d} tokens")
print(f"5-turn agent     : {agent:6,d} tokens")
print(f"multiplier       : x{agent/single:.1f}")
print("the transcript is the state, so every turn re-buys all previous turns —")
print("autonomy compounds cost quadratically, and the model picks the turn count")


A single-shot (rung 0) call sends 1,250 tokens. Suppose a 5-turn agent on the same job, re-sending its growing transcript every turn, totals 6,000 tokens (the script above, with its own per-turn sizes, totals 10,050). What is the token multiplier, agent ÷ single-shot?

Multiplier $= 6000 / 1250 = 4.8$. Because the transcript is the state, every turn re-buys all previous turns: autonomy compounds cost super-linearly, and the model, not you, picks the turn count. The answer is 4.8.
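The super-linear growth can be read off a closed form. As a sketch, assume what the cost script assumes: a constant base context $b$, a constant per-turn increment $\delta$ (action text plus observation), and a constant output $o$ per turn. The symbols $b$, $\delta$, $o$ and $C(n)$ are introduced here for illustration only. Turn $t$ re-sends $b + (t-1)\,\delta$ input tokens, so over $n$ turns:

$$ C(n) \;=\; \sum_{t=1}^{n} \big[\, b + (t-1)\,\delta \,\big] \;+\; n\,o \;=\; n\,b \;+\; \delta\,\frac{n(n-1)}{2} \;+\; n\,o $$

With the script's values ($b = 950$, $\delta = 120 + 350 = 470$, $o = 120$, $n = 5$) this gives $4750 + 4700 + 600 = 10{,}050$ tokens, matching the total the script prints. The middle term is the quadratic one: doubling the turn count roughly quadruples it, while the other two terms only double.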

INSTRUMENT A1.2 · AUTONOMY LADDER (6 use cases · hand-mapped · teaches restraint)

[Interactive instrument: a use-case selector, with readouts for the recommended rung and how to evaluate it.]

Pick a use case; the recommended rung lights up, everything above it is flagged OVERKILL. The mapping is hand-written judgment, not an algorithm — the point is the habit: before reaching for an agent, ask what the cheapest structure is that still solves the task. The default case is the trap: it feels agentic (steps! tools! a schedule!) and is a plain pipeline.

Autonomy is also a permission grant. The ladder above is about who decides; in production it is mirrored by what the system is allowed to do — read-only vs write access, sandboxed vs live, human-approved vs autonomous actions. The two ladders should climb together: a rung-3 agent with rung-0 permissions (everything gated) is a safe way to earn trust; a rung-1 router with production write access is how incidents happen. Chapter 04 makes this precise.
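The permission ladder can be sketched as a lookup the harness consults before executing anything. The tier names, action kinds, and `gate` function below are hypothetical, invented for this illustration; they are not the scheme Chapter 04 defines, only one plausible shape for it.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"   # pause the loop until a person approves
    DENY = "deny"


# Hypothetical permission tiers; a real harness would define its own.
POLICY = {
    "read_only":  {"read": Decision.ALLOW, "write": Decision.DENY,      "deploy": Decision.DENY},
    "gated":      {"read": Decision.ALLOW, "write": Decision.ASK_HUMAN, "deploy": Decision.ASK_HUMAN},
    "autonomous": {"read": Decision.ALLOW, "write": Decision.ALLOW,     "deploy": Decision.ASK_HUMAN},
}


def gate(tier: str, action_kind: str) -> Decision:
    """Decide what the harness does with a requested action.

    Unknown action kinds are denied: the safe default is to fail closed.
    """
    return POLICY[tier].get(action_kind, Decision.DENY)


# The model proposes; the tier, not the model, decides what actually runs.
for tier in POLICY:
    print(tier, [gate(tier, kind).value for kind in ("read", "write", "deploy", "rm_rf")])
```

Under this sketch, a rung-3 agent can run at the `gated` tier while it earns trust, and promoting it is a one-line policy change, not a rewrite of the agent.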

1.4

What changed in 2024–26

The loop itself is old — ReAct (Yao et al., 2022) ran reason-act-observe cycles by pure prompting, parsing actions out of free text with a regex and a prayer. What changed is that every link in the loop got trained instead of prompted:

Tool use moved into the weights. Function calling stopped being a parsing convention and became a post-training target: models are fine-tuned and RL-trained to emit schema-valid tool calls in dedicated formats, to choose between tools, and to decide when no tool is needed. Reliability went from “mostly parses” to a substrate you can build on (Vol III · Ch 05). On the supply side, MCP (late 2024) standardized how tools describe themselves, so any agent can discover and call any conforming tool — the USB moment for actuators.
RLVR went long-horizon. The recipe that built reasoning models (Vol II · Ch 05) — sample, verify the outcome, reinforce the trajectory — was extended from single-turn math to entire tool-using episodes: reward arrives at the end of a multi-turn rollout (did the tests pass? was the file produced?), and credit flows back through every intermediate decision. Out of this came trained agentic behaviors nobody prompted: decomposing before acting, checking work mid-stream, recovering from a failed call instead of repeating it.
Computer use made pixels a tool. From late 2024, frontier models ship with screenshot-in, click/keystroke-out interfaces — the actuator of last resort that turns any GUI into an agent environment, no API required. It remains the slowest and most fragile rung of the tool stack, but the trendline is steep: on OSWorld, success rates went from roughly 15% at launch to above 60% within two years, against a human baseline near 72%.
Coding agents became the proof case. Software is the perfect agent habitat: rich tools (read, grep, edit, run), a verifiable reward signal (compilers and tests are free oracles), and unbounded demand. On SWE-bench Verified — real GitHub issues, graded by held-out tests — resolution rates went from low single digits in late 2023 to above 70% by 2025. Whatever agents become elsewhere, they became real in code first, because code is where EQ A1.2's verifier is built into the environment.

The most useful single number for tracking all of this is METR's horizon: take tasks humans need minutes-to-hours to do, and measure the longest task length (in human time) the model completes at 50% reliability. Fit across 2019–2025 frontier models, it doubles on a startlingly steady clock:

EQ A1.3 — THE HORIZON FIT (EMPIRICAL, ILLUSTRATIVE) $$ h(t) \;=\; h_0 \cdot 2^{\,(t - t_0)/T_d}, \qquad T_d \,\approx\, 7\ \text{months} $$

$h$ the 50%-success task horizon in human time, $T_d$ the doubling period (Kwa et al., 2025). Honest caveats, all load-bearing: the task suite is software-heavy; the 2024–25 segment ran faster than the fit (≈4 months); and at an 80% success bar the horizon shrinks ~5×, which is the gap between a demo and a product. This is an empirical fit, not a law — but it is the cleanest quantitative statement of why this volume exists: the loop's economics improve on a schedule, and harness engineering decides who gets to cash that in.

Take $T_d = 7$ months and a model whose horizon today ($t = t_0$) is $h_0 = 15$ minutes. Using EQ A1.3, what is the projected 50%-success horizon $h$ 21 months later, in minutes?

Doublings $= (t - t_0)/T_d = 21/7 = 3$, so $h = h_0 \cdot 2^{3} = 15 \times 8 = 120$ minutes. Three doubling periods turn a 15-minute horizon into a two-hour one, provided the fit holds, which the caveats under EQ A1.3 warn it may not. The answer is 120.
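The same arithmetic as a pair of helper functions, for trying other starting points and doubling periods. This is a minimal sketch of EQ A1.3 only; the function names are illustrative, and it inherits every caveat of the fit.

```python
import math


def horizon(h0_minutes: float, months_elapsed: float, doubling_months: float = 7.0) -> float:
    """EQ A1.3: projected 50%-success horizon, in minutes."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_until(h0_minutes: float, target_minutes: float, doubling_months: float = 7.0) -> float:
    """Invert EQ A1.3: months needed to grow from h0 to the target horizon."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


print(horizon(15, 21))                        # the worked example: 15 min, 21 months
print(months_until(15, 8 * 60))               # months until an 8-hour horizon
print(horizon(15, 21, doubling_months=4.0))   # same start, the faster 2024-25 pace
```

The first call reproduces the 120-minute answer. The third shows how sensitive the projection is to $T_d$, which is why the choice between the 7-month and 4-month figures matters.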

What did not change: the ten lines of §1.2. The 2022 prompted loop and the 2026 trained one are structurally identical — better policy, same plumbing. That is exactly why the plumbing is worth a volume: the model improves on someone else's schedule; the harness improves on yours.

1.5

The four hard problems

Everything difficult about agents is downstream of one fact: the loop runs unattended, accumulating state, spending money, and touching the world, on a policy you cannot inspect. Four problems fall out, and they fill the rest of this volume:

CH 02 · CONTEXT

state

The transcript grows monotonically; the window and the model's attention do not. Compaction, memory, sub-agent isolation — engineering what the policy gets to see.

CH 03–04 · TOOLS & HARNESS

body

Tool design, permissions, sandboxing, budgets. The model decides; the harness does — and the harness is the part you control completely.

CH 05 · WHEN LOOPS GO WRONG

failure

Doom loops, derailment, runaway spend, prompt injection through observations. Closed-loop systems fail in closed-loop ways.

CH 06 · EVALS

proof

Trajectories are nondeterministic and expensive. pass@k versus pass^k, end-state verifiers, and how to know your agent works before your users do.
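The two metrics named in the last card pull in opposite directions as $k$ grows. A minimal sketch, assuming $k$ independent attempts that each succeed with the same probability $p$ (an idealization, and the function names are illustrative): pass@k asks whether at least one attempt succeeds, pass^k whether all of them do.

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent attempts succeeds)."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent attempts succeed)."""
    return p ** k


p = 0.7  # illustrative per-attempt success rate, not a measured number
for k in (1, 2, 5, 10):
    print(f"k={k:2d}  pass@k={pass_at_k(p, k):.3f}  pass^k={pass_hat_k(p, k):.3f}")
```

At $k = 1$ the two agree; beyond that, pass@k flatters an agent that sometimes works, while pass^k measures whether it works every time, which is the number a user running it unattended experiences.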

Notice that none of the four is “make the model smarter.” The model arrives with its capabilities fixed; agent engineering is everything you wrap around EQ A1.1 so that a fallible policy produces reliable work. The encyclopedias of 2020 would have called this prompt engineering; it has grown into systems engineering with a stochastic component in the middle.

The loop's state is the context, and the context is always running out. Chapter 02: context engineering — what actually belongs in the window, compaction without amnesia, memory that survives the episode, and why the best agents read less than you think.
