The reliability problem
Every agent demo that dazzles in five steps and dies in fifty is the same chart. A task that takes \(n\) sequential steps, each succeeding independently with probability \(p\), completes with probability \(p^n\), and exponentials are merciless to multi-step work:

\[ P(\text{task succeeds}) = p^n \tag{EQ A5.1} \]
The independence assumption in \(p^n\) is the optimistic case. Real agent failures corrupt state: a wrong file edit, a hallucinated API response accepted as fact, a misread error message — each one lowers the conditional \(p_i\) for every step that follows, because later steps now reason from a poisoned context. Uncaught errors don't just subtract one step; they bend the whole remaining curve downward. The practical reading of EQ A5.1 is therefore a floor on pessimism, not a ceiling.
Three levers exist, and this chapter is about the third:
- Lower \(n\) — fewer, bigger steps. Mostly a harness problem (Chapter 04): one well-designed tool that does in one verified call what five primitive calls did in sequence.
- Raise \(p\) — better models, better prompts, better tool ergonomics. Necessary, but no realistic \(p\) survives large \(n\) raw: even 99.9% per step is only 90.5% at a hundred steps.
- Break the compounding — stop multiplying raw step probabilities by catching failures before they propagate. This is what verification and retries do, and it is the only lever that changes the shape of the curve rather than its constants.
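To see how the first two levers trade against each other, the short calculation below inverts EQ A5.1: for a given per-step \(p\), it asks how many raw steps fit before task success drops under a target. It is a minimal sketch under the independence assumption; the function name `max_steps` and the 90% target are illustrative choices, not something the chapter defines.

```python
import math


def max_steps(p: float, target: float) -> int:
    """Largest n with p**n >= target, assuming independent steps (EQ A5.1)."""
    return math.floor(math.log(target) / math.log(p))


# Hypothetical target: keep whole-task success at or above 90%.
for p in (0.95, 0.99, 0.999):
    n = max_steps(p, 0.90)
    print(f"p = {p}: at most {n} raw steps keep task success >= 90%")
```

Raising \(p\) from 0.99 to 0.999 buys roughly ten times the step budget, yet the budget is still finite, which is why the third lever matters.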
Retries done right
The standard answer to flaky steps is retries, and with one crucial precondition it works spectacularly. If a failed attempt can be detected and retried, per-step success stops being \(p\) and becomes the probability that at least one of \(k\) attempts lands:

\[ p_{\text{eff}} = 1 - (1 - p)^k \tag{EQ A5.2} \]
```python
# compounding failure (EQ A5.1) and the verified-retry rescue (EQ A5.2)
import numpy as np


def plot_xy(x, y):
    """No-op stand-in (assumption): the original plotting helper is not defined here.
    Swap in a real plotting call, e.g. matplotlib's plt.plot(x, y), to draw the curves."""


steps = np.arange(1, 101)  # task lengths n = 1..100

print(" p/step P(50 steps) P(100 steps)")
for p in (0.95, 0.99, 0.999):
    plot_xy(steps, p ** steps)  # raw success curve p^n
    print(f" {p:5.3f} {p**50:10.1%} {p**100:11.1%}")

# the rescue: 2 retries behind a perfect verifier, base p = 0.95
p, k = 0.95, 3  # k attempts per step
p_eff = 1 - (1 - p) ** k  # EQ A5.2
plot_xy(steps, p_eff ** steps)
print(f"\nrescued: p = 0.95 with {k - 1} verified retries -> p_eff = {p_eff:.6f}")
print(f" P(50 steps) = {p_eff**50:.1%} vs {0.95**50:.1%} raw")
print(f"\npunchline: 0.99^50 = {0.99**50:.3f}, so a 1% per-step error rate is")
print("little better than a coin flip at fifty steps; the verifier, not the model,")
print("is what bends the curve back toward 1")
```
The formula hides two assumptions that fail independently in practice. First, attempts must be detectably wrong. An agent with no verifier cannot trigger a retry on a step it believes succeeded — and a model that just produced a wrong answer usually believes exactly that. Blind regeneration without a selection signal leaves you sampling from the same marginal distribution: success probability \(p\), no matter how many times you roll. Second, attempts must be independent-ish. Same model, same prompt, same poisoned context — the second attempt fails for the same reason the first did. You retry into the same wall.
With an imperfect verifier of accuracy \(v\) (the probability that it labels a given output correctly), the algebra has to account for both failure directions: good work wrongly rejected and bad work wrongly approved.
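One way to work this out, offered as a sketch rather than the chapter's own formula, is to assume that attempts are independent, that the verifier is right with the same probability \(v\) on good and bad outputs, and that the first approved attempt is kept. Per attempt, the step is accepted and correct with probability \(pv\), accepted and wrong with probability \((1-p)(1-v)\), and rejected otherwise. The function name `retry_outcomes` is hypothetical.

```python
def retry_outcomes(p: float, v: float, k: int):
    """Return (P(accepted and correct), P(accepted but wrong), P(all k rejected)).

    Assumptions (illustrative): independent attempts, a symmetric verifier of
    accuracy v, and the first approved attempt is kept.
    """
    good = p * v                  # correct output, correctly approved
    bad = (1 - p) * (1 - v)       # wrong output, wrongly approved
    reject = 1 - good - bad       # good work rejected or bad work caught
    if good + bad == 0:           # the verifier never approves anything
        return 0.0, 0.0, 1.0
    accepted = 1 - reject ** k    # some attempt within k gets approved
    scale = accepted / (good + bad)
    return good * scale, bad * scale, reject ** k


for v in (1.0, 0.9, 0.7):
    g, b, x = retry_outcomes(p=0.95, v=v, k=3)
    print(f"v = {v:.1f}: correct {g:.4f}  silently wrong {b:.4f}  exhausted {x:.4f}")
```

With \(v = 1\) the first term collapses to \(1 - (1-p)^k\), matching EQ A5.2. For \(v < 1\) the silently-wrong share of accepted steps is fixed by the ratio of the two acceptance terms, so under these assumptions extra retries reduce exhaustion but not the rate of wrong work among the steps that get through.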
Vary the approach, not just the seed. Because failures are correlated, the highest-value retry changes something structural: a different decomposition of the step, a different tool (read the file instead of trusting the summary), a fresh context that drops the transcript of the failed attempt (failure text in context actively steers regeneration toward the same hole), a different model. A useful escalation ladder for attempt \(j\): same approach with the verifier's rejection reason appended → same goal, new strategy, clean context → escalate to a stronger model → escalate to a human. And retry at the smallest failing unit — re-running one tool call is cheap; re-running the task because verification only happens at the end converts a step failure into a task failure, which is precisely the compounding you were trying to escape.
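The escalation ladder can be sketched as a small driver that walks the rungs in order and stops at the first verified output. Everything named here (`Verdict`, `LADDER`, `run_step`, and the toy `attempt` and `verify` callables) is hypothetical scaffolding for illustration; a real harness would plug in its own model calls and verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    ok: bool
    reason: str = ""


# Hypothetical rung names following the ladder described above.
LADDER = ("first_try", "same_approach_plus_reason",
          "new_strategy_clean_context", "stronger_model")


def run_step(attempt: Callable[[str, str], str],
             verify: Callable[[str], Verdict]) -> Tuple[Optional[str], List[str]]:
    """Walk the ladder for ONE step; return (verified output or None, rungs tried)."""
    tried: List[str] = []
    reason = ""
    for rung in LADDER:
        tried.append(rung)
        # Only the second rung sees the rejection reason; later rungs start
        # from a clean context so the failed attempt cannot steer them.
        hint = reason if rung == "same_approach_plus_reason" else ""
        output = attempt(rung, hint)
        verdict = verify(output)
        if verdict.ok:
            return output, tried
        reason = verdict.reason
    tried.append("human")  # every automated rung was rejected: hand off
    return None, tried


# Toy stand-ins: the step only succeeds once the strategy changes.
def toy_attempt(rung: str, hint: str) -> str:
    return "42" if rung == "new_strategy_clean_context" else "41"


def toy_verify(output: str) -> Verdict:
    return Verdict(output == "42", "" if output == "42" else "expected 42")


result, path = run_step(toy_attempt, toy_verify)
print(result, path)
```

The driver retries a single step, which keeps the retry at the smallest failing unit rather than re-running the whole task.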
Stop conditions: budgets, progress, loops
Retries fix steps that fail loudly. The more expensive pathology is the loop that never fails at all — it just stops going anywhere. Every production agent needs an explicit answer to "when does this loop end?", and "when the model decides it's done" is not an answer: the model's judgment is the thing being supervised.
| Stop condition | Trigger | Implementation notes |
|---|---|---|
| Budgets | tokens · tool calls · wall-clock · dollars | Hard caps enforced by the harness, not the prompt. Set per-step and per-task; an agent that is told its remaining budget often self-corrects, but the cap must hold either way. |
| Progress detection | no measurable progress in W steps | Requires defining a progress signal up front: tests passing, items checked off, diff distance to goal. No signal moving for a window of W steps → stop or escalate. Token consumption is not progress. |
| Loop detection | repeated state or action | Hash the last few (tool, arguments) pairs; the same call with the same arguments returning the same result N times is a cycle, period. Also catch A→B→A→B oscillation (edit, revert, edit, revert). |
| Watchdog | external supervisor trips any of the above | Lives outside the agent's context — a process or a cheap second model reading the trace. On trip: kill, snapshot state, summarize for post-mortem or human handoff. |
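The rows of the table can be combined into one harness-side supervisor. The sketch below is an assumption-laden illustration: the class name `Watchdog`, its thresholds, and the integer `progress` signal are all hypothetical, and a production version would also track tokens, dollars, and wall-clock.

```python
import hashlib
import json
from collections import deque


class Watchdog:
    """Harness-side supervisor: budget, loop, and progress checks (illustrative)."""

    def __init__(self, max_calls=50, window=5, repeat_limit=3):
        self.max_calls = max_calls
        self.window = window
        self.repeat_limit = repeat_limit
        self.calls = 0
        self.recent = deque(maxlen=max(4, repeat_limit))  # hashes of recent calls
        self.progress = deque(maxlen=window + 1)          # baseline + last W readings

    def observe(self, tool, args, result, progress):
        """Record one tool call; return a stop reason, or None to continue."""
        self.calls += 1
        blob = json.dumps([tool, args, result], sort_keys=True)
        self.recent.append(hashlib.sha256(blob.encode()).hexdigest())
        self.progress.append(progress)
        r = list(self.recent)
        if self.calls >= self.max_calls:
            return "budget exhausted"
        if len(r) >= self.repeat_limit and len(set(r[-self.repeat_limit:])) == 1:
            return "loop: identical call repeated"
        if len(r) >= 4 and r[-1] == r[-3] and r[-2] == r[-4] and r[-1] != r[-2]:
            return "loop: A-B-A-B oscillation"
        # No reading in the last W steps beats the reading from W steps ago.
        if len(self.progress) == self.window + 1 and max(self.progress) <= self.progress[0]:
            return "no progress"
        return None


# Twelve greps with varied arguments: no two calls hash the same, but the
# progress signal never moves, so the window check trips.
dog = Watchdog(max_calls=50, window=5)
for i in range(1, 13):
    stop = dog.observe("grep", {"pattern": f"variant_{i}"}, "no match", progress=0)
    if stop:
        print(f"stopped at call {i}: {stop}")
        break
```

Exact-match hashing alone would miss this trace because every call differs slightly, which is why the progress check sits beside it.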
The agent that greps forever. A classic trace: the agent greps for a symbol, gets no match, and concludes — reasonably — that it should search differently. Then it greps eleven more times, varying the casing, the directory, the regex flavor. Each call is locally sensible; the trajectory is a flat line. The model is the last to know it's looping, because its own context normalizes the repetition — by call eight, a transcript full of greps makes another grep look like the established procedure. This is why watchdogs are external by definition: you do not ask the loop whether it is a loop.
One reframe makes teams much better at this: a clean stop is a success mode. An agent that halts at budget with a structured summary — what was attempted, what's verified-done, what failed, what it would try next — has produced a resumable artifact. An agent that thrashes until someone kills the process has produced a forensic exercise. Design the abort path with the same care as the happy path; Chapter 06 makes both observable.
Plan–act–verify–revise: the canonical inner loop
Sections 5.1–5.3 assemble into one structure, and nearly every serious agent system converges on it independently: plan the next move, act on the smallest meaningful unit, verify the result against something the actor doesn't control, revise on failure. The loop's power is where it puts verification: after every act, not at the end. Per-step verification turns one long chain of \(n\) multiplied probabilities into \(n\) short, independently recoverable chains — it is EQ A5.2 applied at the finest grain available.
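A minimal skeleton of the loop, assuming the plan is an ordered list of step descriptions and that `act`, `verify`, and `revise` are supplied by the harness. The function name `run_loop` and the toy callables are hypothetical; the point is only where the verification call sits.

```python
def run_loop(plan, act, verify, revise, max_actions=20):
    """Run steps in order, verifying after every act rather than at the end.

    Returns (list of verified steps, status string).
    """
    queue, done = list(plan), []
    actions = 0
    while queue:
        if actions >= max_actions:
            return done, "budget exhausted"
        step = queue.pop(0)
        result = act(step)
        actions += 1
        if verify(step, result):
            done.append(step)          # only verified steps count as done
            continue
        revised = revise(step, result)
        if revised is None:
            return done, f"escalate: no revision for {step!r}"
        queue.insert(0, revised)       # retry the failing unit, not the whole task
    return done, "complete"


# Toy stand-ins: "transform" always fails, its revision succeeds.
def toy_act(step):
    return "oops" if step == "transform" else step.upper()


def toy_verify(step, result):
    return result == step.upper()


def toy_revise(step, result):
    return "transform-v2" if step == "transform" else None


done, status = run_loop(["parse", "transform", "write"], toy_act, toy_verify, toy_revise)
print(done, status)
```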
Plan gates are the cheap insurance on the front edge: before the first expensive action, check the plan mechanically. Do the files it references exist? Does it cover every stated constraint? Is its step count inside budget? Does it touch anything on the do-not-touch list? A plan gate is a verifier for intentions — it costs one cheap model call or a few assertions, and it catches the class of failure that no amount of step-level retrying can fix, because every step can succeed while the plan marches confidently toward the wrong goal.
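The four questions above translate almost directly into assertions. The sketch below assumes a plan is a list of dictionaries with `files` and `covers` keys; that schema, the function name `plan_gate`, and the sample paths are all hypothetical.

```python
import tempfile
from pathlib import Path


def plan_gate(plan, constraints, do_not_touch, max_steps, root):
    """Return a list of problems found in the plan; an empty list means it passes."""
    problems = []
    if len(plan) > max_steps:
        problems.append(f"plan has {len(plan)} steps, budget allows {max_steps}")
    covered = set()
    for i, step in enumerate(plan, start=1):
        covered.update(step.get("covers", ()))
        for rel in step.get("files", ()):
            if any(rel == p or rel.startswith(p.rstrip("/") + "/") for p in do_not_touch):
                problems.append(f"step {i} touches protected path {rel}")
            elif not (Path(root) / rel).exists():
                problems.append(f"step {i} references missing file {rel}")
    for constraint in sorted(set(constraints) - covered):
        problems.append(f"constraint not covered: {constraint}")
    return problems


# Toy repository with one real file, checked against a flawed two-step plan.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "src").mkdir()
    (Path(root) / "src" / "app.py").write_text("print('hi')\n")
    plan = [
        {"files": ["src/app.py"], "covers": ["keep public API"]},
        {"files": ["src/missing.py", "secrets/key.pem"], "covers": []},
    ]
    problems = plan_gate(plan, constraints=["keep public API", "no new deps"],
                         do_not_touch=["secrets"], max_steps=5, root=root)
print("\n".join(problems))
```

None of these checks needs a model call, so the gate can run before every plan execution at negligible cost.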
Re-planning triggers formalize the edge that runs from revise back to plan. The revise stage must diagnose, not just retry: an execution failure (right idea, flaky step) routes back to ACT with a varied approach; a plan failure routes back to PLAN. Concrete triggers that production systems use: the same step rejected \(N\) times despite varied approaches (the plan assumed something false); verification revealing the world differs from the plan's premise (the API the plan depends on is deprecated); burn rate, meaning actual cost per completed step exceeding the plan's implicit estimate by a multiple. Re-planning from a summarized state is cheap; discovering at step 40 that step 3's plan was wrong is not.
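The three triggers can be written as a small routing function. This is a sketch with assumed thresholds: `max_rejections` and `burn_multiple` are hypothetical tuning knobs, and `premise_holds` stands in for whatever check the verifier uses to confirm the plan's assumptions.

```python
def diagnose(rejections, premise_holds, cost_per_step, planned_cost_per_step,
             max_rejections=3, burn_multiple=2.0):
    """Route a failure back to ACT (execution problem) or PLAN (plan problem)."""
    if not premise_holds:
        return "PLAN", "world differs from the plan's premise"
    if rejections >= max_rejections:
        return "PLAN", f"step rejected {rejections} times despite varied approaches"
    if cost_per_step > burn_multiple * planned_cost_per_step:
        return "PLAN", "burn rate exceeds the plan's estimate"
    return "ACT", "execution failure: retry with a varied approach"


print(diagnose(rejections=1, premise_holds=True, cost_per_step=900, planned_cost_per_step=800))
print(diagnose(rejections=3, premise_holds=True, cost_per_step=900, planned_cost_per_step=800))
print(diagnose(rejections=0, premise_holds=False, cost_per_step=900, planned_cost_per_step=800))
```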
Multi-agent topologies
Multi-agent is not a virtue; it is a topology decision, and the null hypothesis — one agent, one loop — wins more often than the conference talks suggest. Splitting work across agents pays only when it buys one of three things: parallelism over genuinely independent subtasks, context isolation (each worker gets a clean, focused window instead of one bloated one), or independence of judgment (diversity or adversarial pressure that a single context cannot produce, because one context anchors itself). Five recurring shapes:
| Topology | Shape | Wins when… | Fails when… |
|---|---|---|---|
| Orchestrator–workers | one planner fans out, owns synthesis | work decomposes cleanly and results must merge coherently in one place | subtasks are coupled; orchestrator becomes the bottleneck and the context hog |
| Pipeline | serial stages, artifact handoff | staged transforms with machine-checkable interfaces between stages | early-stage errors amplify downstream; stage latencies add up serially |
| Council + judge | parallel independent opinions → aggregator | diverse judgments are the product: review, ranking, curation | a ground-truth verifier exists — tests beat votes, always |
| Debate | adversaries argue before a judge | one contested claim, high cost of being wrong | open-ended generation; rewards persuasiveness, which is not truth |
| Swarm | homogeneous workers, shared queue | many independent, near-identical units of work | shared mutable state — coupled edits become merge conflicts |
```python
# what a topology costs: one task, four shapes, tokens + wall-clock
UNIT = 30_000  # tokens a single agent burns solving the task alone
RESULT = 300   # a structured result handed back (never a transcript)

# each shape maps to (total tokens, wall-clock relative to the single agent)
shapes = {"single agent": (UNIT, 1.00)}
# orchestrator + 3 parallel workers, each doing ~40% of the exploring, plus handback
shapes["orchestrator-workers"] = (int(0.25*UNIT + 3*0.4*UNIT + 3*RESULT), 0.25 + 0.40 + 0.10)
# pipeline: 4 serial stages at ~30% each, artifact checks at the seams
shapes["pipeline"] = (int(4*0.3*UNIT + 3*RESULT), 4 * 0.30)
# council: 3 full independent attempts + a judge reading three results
shapes["council + judge"] = (int(3*UNIT + 3*RESULT + 2_000), 1.00 + 0.10)

print(f"{'topology':22s}{'tokens':>8s}{'vs single':>10s}{'wall-clock':>11s}")
for name, (tok, wall) in shapes.items():
    print(f"{name:22s}{tok:8,d}{tok/UNIT:9.2f}x{wall:10.2f}")

o_tok, o_wall = shapes["orchestrator-workers"]
c_tok = shapes["council + judge"][0]
print(f"\nfan-out buys wall-clock, never tokens: the orchestrator runs "
      f"{1 - o_wall:.0%} faster for {o_tok/UNIT - 1:.0%} more tokens;")
print(f"the council pays {c_tok/UNIT:.1f}x for independent judgment, worth it only")
print("when no ground-truth verifier exists, because tests beat votes")
```
Two rules govern every topology, and violating either converts multi-agent from a speedup into a liability:
- Hand off results, not transcripts. A worker returns an artifact plus a structured summary — what was done, what's verified, what's unresolved — never its raw conversation. Transcripts carry the worker's dead ends, hallucinated intermediates, and tone into the consumer's context, where they poison downstream reasoning and burn the window. The interface between agents is a contract, exactly like a function signature; Chapter 03's tool-design discipline applies to agents talking to agents.
- Parallelize only independent subtasks. Two agents editing the same file is a merge-conflict generator with extra steps; two agents researching with a shared mutable notes doc will overwrite each other's reasoning. Enforce a single-writer rule per resource, and remember EQ A5.1 cuts both ways: every handoff is itself a step that can fail. Adding agents adds steps — the topology must remove more failure surface from the critical path than its own coordination adds.
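Both rules can be made mechanical. The sketch below shows one possible shape for the handoff contract and a single-writer registry; the names `Handoff` and `WriterRegistry`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    """What a worker returns: an artifact plus a structured summary, no transcript."""
    artifact: str             # path or ID of the produced artifact
    done: tuple = ()          # what was done
    verified: tuple = ()      # what has passed a check
    unresolved: tuple = ()    # what the consumer still has to deal with


class WriterRegistry:
    """Single-writer rule: the first agent to claim a resource owns it."""

    def __init__(self):
        self._owner = {}

    def claim(self, resource: str, agent: str) -> bool:
        # setdefault records the first claimant and returns the current owner
        return self._owner.setdefault(resource, agent) == agent


registry = WriterRegistry()
print(registry.claim("src/billing/invoice.py", "worker-1"))  # first claim wins
print(registry.claim("src/billing/invoice.py", "worker-2"))  # second writer refused

report = Handoff(artifact="patches/billing.diff",
                 done=("migrated invoice module",),
                 verified=("unit tests pass",),
                 unresolved=("one flaky integration test",))
print(report)
```

Freezing the dataclass keeps the handoff immutable once returned, which fits the idea of treating the interface like a function signature.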
Long-horizon patterns
Past a few hundred steps, the enemy stops being step failure and becomes state amnesia: the context window fills, compaction or restarts shed detail, and the agent forgets what it decided and why. The long-horizon patterns all share one move — get the program state out of the context window and into something durable, so the model becomes a stateless worker against external state.
Compaction checkpoints. Compaction at an arbitrary moment amputates mid-thought reasoning, and the successor context inherits a summary of confusion. Checkpoint deliberately instead: at clean boundaries — typically right after a verify-pass — write a durable record of goal, decisions made (with reasons), verified state, next action, then compact or restart from that record. The agent should be able to die at any checkpoint and a fresh instance continue from the file alone. If it can't, the checkpoint is decorative.
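A checkpoint of this kind can be as small as one JSON file written atomically. The sketch below assumes a local filesystem; the function names `write_checkpoint` and `resume`, the file name, and the sample values are illustrative.

```python
import json
import os
import tempfile


def write_checkpoint(path, goal, decisions, verified_state, next_action):
    """Write the four-part record to a temp file, then swap it into place."""
    record = {"goal": goal, "decisions": decisions,
              "verified_state": verified_state, "next_action": next_action}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
    # os.replace swaps atomically, so a crash never leaves a half-written file
    os.replace(tmp, path)


def resume(path):
    """What a fresh instance reads to continue from the file alone."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    write_checkpoint(
        ckpt,
        goal="migrate legacy API",
        decisions=[{"what": "use a codemod", "why": "too many call sites to edit by hand"}],
        verified_state={"tests": "green", "shards_done": 2},
        next_action="migrate shard 3",
    )
    restored = resume(ckpt)
print(restored["next_action"])
```

Storing the reason next to each decision is what lets a successor context avoid re-litigating choices it can no longer remember making.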
External task lists as program counters. The oldest idea in computing, rediscovered: keep the loop variable outside the loop.
```text
# tasks.md: the program counter lives outside the context window
[x] 01 inventory call sites of legacy API # done · 312 sites
[x] 02 write codemod + unit tests # done · tests green
[>] 03 migrate src/billing/** (shard 3/9) # in progress, resume here
[ ] 04 migrate src/auth/**
[ ] 05 run full suite · bisect any failures
```
invariant: every beat → read list · do ONE unchecked item · verify · update list · exit
Marking an item done only after verification makes the list a record of truth, not of intention — and making each item idempotent (safe to re-run if the agent died mid-item) makes crashes cost one item instead of one mission.
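Reading and advancing such a list takes only a few lines. The sketch below assumes the `[x]` / `[>]` / `[ ]` line format shown in the listing; the regular expression and the function names `next_item` and `mark_done` are hypothetical.

```python
import re

# Matches lines such as "[>] 03 migrate src/billing/**"
ITEM = re.compile(r"^\[(?P<mark>[ x>])\] (?P<id>\d+) (?P<rest>.*)$")


def next_item(lines):
    """ID of the first item not marked done, or None when the list is finished."""
    for line in lines:
        m = ITEM.match(line)
        if m and m["mark"] != "x":
            return m["id"]
    return None


def mark_done(lines, item_id, verified):
    """Check an item off only if verification passed; otherwise leave the list as is."""
    if not verified:
        return list(lines)
    out = []
    for line in lines:
        m = ITEM.match(line)
        if m and m["id"] == item_id:
            line = f"[x] {m['id']} {m['rest']}"
        out.append(line)
    return out


tasks = [
    "[x] 01 inventory call sites of legacy API",
    "[x] 02 write codemod + unit tests",
    "[>] 03 migrate src/billing/** (shard 3/9)",
    "[ ] 04 migrate src/auth/**",
]
print(next_item(tasks))
tasks = mark_done(tasks, "03", verified=True)
print(next_item(tasks))
```

Because `mark_done` refuses to update on a failed verification, a crashed or rejected item is simply picked up again on the next read.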
Heartbeat loops. For work that outlives any session — monitoring, week-long migrations, slow external dependencies — invert the architecture: instead of one agent that must survive, schedule a recurring re-entry. Each beat: wake, read durable state, do one bounded unit of work, write state back, exit. Reliability comes from the boundedness: each beat is a short chain with small \(n\) and full verification, so EQ A5.1 never gets room to compound. A long-horizon agent is a chain of short reliable sessions, not one heroic context.
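One beat can be sketched as a function that a scheduler (cron, a queue worker, or similar) would call repeatedly. The JSON state layout with `pending` and `done` lists, the function name `beat`, and the toy work items are assumptions made for illustration.

```python
import json
import os
import tempfile


def beat(state_path, do_unit, verify):
    """One heartbeat: read durable state, do ONE unit, verify, write back, return."""
    with open(state_path, encoding="utf-8") as fh:
        state = json.load(fh)
    if not state["pending"]:
        return "idle"
    item = state["pending"][0]
    result = do_unit(item)
    if verify(item, result):
        state["pending"].pop(0)
        state["done"].append(item)
        outcome = "advanced"
    else:
        outcome = "retry next beat"   # state unchanged, so the item is re-run
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    return outcome


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"pending": ["shard-1", "shard-2"], "done": []}, fh)
    # Three scheduled wake-ups; each call is a fresh, bounded session.
    outcomes = [beat(path, lambda item: item.upper(), lambda item, res: res == item.upper())
                for _ in range(3)]
    with open(path, encoding="utf-8") as fh:
        final_state = json.load(fh)
print(outcomes, final_state)
```

Nothing survives between calls except the file, which is the property that makes each beat a short, independent chain.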
Cost & latency engineering
Once the loop is reliable, it is usually overpaying: a verified retry loop happily runs the flagship model on steps a model a tenth the price handles identically. Model-tier routing assigns each step the cheapest tier that clears its required \(p\): mechanical steps (formatting, extraction, glue) go to a fast cheap model, judgment steps (planning, diagnosis, revision after repeated failure) to the strong one. The verifier is what makes this safe: routing without verification is gambling with a smaller bankroll; routing with verification is an asymmetry you can price.
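One hedged way to price it, assuming a cheap draft is always verified and escalated to the strong model (and verified again) whenever it is rejected: if the cheap tier's output is accepted with probability \(q\), the expected cost per step is \(c_{\text{cheap}} + c_{\text{verify}} + (1-q)(c_{\text{strong}} + c_{\text{verify}})\). The symbols, function names, and cost figures below are illustrative assumptions.

```python
def routed_cost(c_cheap, c_strong, c_verify, q):
    """Expected cost per step: cheap draft, verify, escalate with probability 1 - q."""
    return c_cheap + c_verify + (1 - q) * (c_strong + c_verify)


def always_strong_cost(c_strong, c_verify):
    """Baseline: every step goes to the strong model and is verified once."""
    return c_strong + c_verify


def break_even_q(c_cheap, c_strong, c_verify):
    """Acceptance rate above which routing is cheaper than always-strong."""
    return (c_cheap + c_verify) / (c_strong + c_verify)


# Toy prices in arbitrary units: the cheap tier costs a tenth of the strong one.
c_cheap, c_strong, c_verify = 1.0, 10.0, 0.5
for q in (0.5, 0.8, 0.95):
    print(f"q = {q:.2f}: routed {routed_cost(c_cheap, c_strong, c_verify, q):.2f} "
          f"vs always-strong {always_strong_cost(c_strong, c_verify):.2f}")
print(f"break-even acceptance rate: {break_even_q(c_cheap, c_strong, c_verify):.3f}")
```

Under these toy prices the cheap tier only has to be accepted about one time in seven for routing to pay, and the break-even climbs as verification gets more expensive relative to generation.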
If this looks familiar, it should: it is speculative decoding (Vol II · Ch 08) lifted from the token level to the task level. There, a small draft model proposes tokens and the large model verifies them in one cheap parallel pass, keeping the large model's exact distribution while shifting most of the work to the cheap one. Here, a cheap agent proposes a step result and a verifier accepts or escalates. Same theorem, same precondition: the scheme only pays because verification is cheaper than generation. That asymmetry is real for tests, compilers, schema checks, and constrained judges with rubrics; it is contested for open-ended quality judgments, where the LLM-as-judge has biases of its own — Chapter 06 measures exactly how much you can trust it.
Latency obeys different algebra than cost: it follows the critical path, not the sum. Fan-out across independent subtasks costs more tokens but collapses wall-clock to the slowest branch plus synthesis; verification adds latency only if it serializes — so run cheap checks concurrently with the next step's draft when steps are independent, batch verifications where they aren't, and remember the orchestrator that must read every worker's output is a serial drain at the end of every parallel fan-out. Topology, routing, retries, and stop conditions are all one budget in three currencies — success probability, dollars, and seconds — and loop engineering is the art of spending each where its marginal return is highest.
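The difference between the two algebras fits in a few lines: tokens add up across branches, while wall-clock follows the slowest branch plus the serial synthesis step. The branch timings and token counts below are made-up numbers, and the function names are hypothetical. Token totals are held equal across both layouts to isolate the latency effect; real fan-out usually adds tokens on top.

```python
def serial(stages):
    """stages: list of (seconds, tokens) run one after another."""
    return sum(s for s, _ in stages), sum(t for _, t in stages)


def fan_out(branches, synthesis):
    """Parallel branches, then one serial synthesis step by the orchestrator."""
    wall = max(s for s, _ in branches) + synthesis[0]      # critical path
    tokens = sum(t for _, t in branches) + synthesis[1]    # cost still sums
    return wall, tokens


branches = [(40, 1_000), (90, 1_200), (60, 900)]   # (seconds, tokens) per worker
synthesis = (15, 500)                              # orchestrator reads and merges

print("serial :", serial(branches + [synthesis]))
print("fan-out:", fan_out(branches, synthesis))
```

The synthesis term stays on the critical path in both layouts, which is the serial drain the paragraph above warns about.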
A loop you cannot measure is a loop you cannot trust. Chapter 06: evals for agents — pass@k versus pass^k, trajectory scoring, the observability traces that catch the grep-forever loop in minute two instead of hour two, and the cost dashboards that tell you whether any of this engineering paid for itself.
Further reading
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. — the reasoning substrate underneath plan–act–verify loops.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. — branching search over plans, the basis of revise-and-retry strategies.
- Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. — formalizes the verify-then-revise inner loop without extra training.
- Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. — a reference framework for multi-agent topologies and orchestration.
- Hong, S. et al. (2023). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. — role-based agent teams with structured handoffs for long-horizon work.
- Anthropic (2025). How We Built Our Multi-Agent Research System. — practitioner account of orchestrator–worker patterns, token cost, and coordination failure modes.