05 · Loop Engineering & Multi-Agent Patterns

5.1

The reliability problem

Every agent demo that dazzles in five steps and dies in fifty is the same chart. A task that takes $n$ sequential steps, each succeeding with probability $p$, completes with probability $p^n$ — and exponentials are merciless to multi-step work:

EQ A5.1 — COMPOUNDING FAILURE $$ P(\text{task}) \;=\; \prod_{i=1}^{n} p_i \;\approx\; p^{\,n} \qquad\Longrightarrow\qquad 0.99^{50} \approx 0.605, \qquad 0.95^{50} \approx 0.077 $$

Each $p_i$ is the probability step $i$ succeeds given that everything before it succeeded. A 1% per-step error rate is a 40% task failure rate at fifty steps. At 95% per step — a flattering number for a nontrivial tool call — fifty steps succeed less than 8% of the time. This single equation explains why "the demo worked" and "it works" are different claims.

An agent takes the right action with probability $p = 0.98$ at each step of an $n = 30$-step task, errors fatal and independent. By EQ A5.1, what is $P(\text{task}) = p^{\,n} = 0.98^{30}$?

$0.98^{30}$: $30 \ln 0.98 = 30 \times (-0.02020) = -0.6061$, so $P = e^{-0.6061} \approx 0.545$. A 98%-per-step agent is barely better than a coin flip at thirty steps, and since real errors corrupt state, this is an optimistic upper bound on success, not the expectation. The answer is 0.545.
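EQ A5.1 can also be read backwards: given $p$, how long a chain can you afford before success drops below a threshold? A minimal sketch, assuming independent and fatal errors as in the worked example; the helper name `steps_until_below` is illustrative and does not come from the text.

```python
import math

def steps_until_below(p: float, threshold: float = 0.5) -> int:
    """Smallest n with p**n < threshold, from n > ln(threshold) / ln(p)."""
    if not (0.0 < p < 1.0 and 0.0 < threshold < 1.0):
        raise ValueError("p and threshold must lie strictly between 0 and 1")
    return math.floor(math.log(threshold) / math.log(p)) + 1

for p in (0.95, 0.98, 0.99, 0.999):
    n = steps_until_below(p)
    print(f"p = {p:.3f}: P(task) drops below 50% at n = {n} (p^n = {p**n:.3f})")
```

At $p = 0.99$ the crossover is step 69; at $p = 0.95$ it is step 14. Raising $p$ moves the cliff further out, but the cliff is still there.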

The independence assumption in $p^n$ is the optimistic case. Real agent failures corrupt state: a wrong file edit, a hallucinated API response accepted as fact, a misread error message — each one lowers the conditional $p_i$ for every step that follows, because later steps now reason from a poisoned context. Uncaught errors don't just subtract one step; they bend the whole remaining curve downward. The practical reading of EQ A5.1 is therefore a floor on pessimism, not a ceiling.

Three levers exist, and this chapter is about the third:

Lower $n$ — fewer, bigger steps. Mostly a harness problem (Chapter 04): one well-designed tool that does in one verified call what five primitive calls did in sequence.
Raise $p$ — better models, better prompts, better tool ergonomics. Necessary, but no realistic $p$ survives large $n$ raw: even 99.9% per step is only 90.5% at a hundred steps.
Break the compounding — stop multiplying raw step probabilities by catching failures before they propagate. This is what verification and retries do, and it is the only lever that changes the shape of the curve rather than its constants.

5.2

Retries done right

The standard answer to flaky steps is retries, and with one crucial precondition it works spectacularly. If a failed attempt can be detected and retried, per-step success stops being $p$ and becomes the probability that at least one of $k$ attempts lands:

EQ A5.2 — RETRY WITH A VERIFIER $$ P_{\text{step}} \;=\; 1 - (1 - p)^{k} $$

$k$ attempts per step (so $k-1$ retries), under a perfect verifier — something that always tells success from failure: a test suite, a schema validator, a compiler. At $p = 0.9$, three attempts give $P_{\text{step}} = 0.999$; the exponential now works for you. The fine print is the verifier. Without one, this equation is fiction — see EQ A5.3.

A step succeeds $p = 0.9$ per attempt, behind a perfect verifier, with $k = 3$ attempts. By EQ A5.2, what is $P_{\text{step}} = 1 - (1-p)^k$?

$(1-p)^k = 0.1^3 = 0.001$, so $P_{\text{step}} = 1 - 0.001 = 0.999$. The same exponential that mauled you in EQ A5.1 now works for you — each retry multiplies the failure probability down. The answer is 0.999.
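The retry loop behind EQ A5.2 is short enough to write out. The sketch below is one possible shape, not a reference implementation: `run_with_retries`, the toy `outputs` table, and the JSON check standing in for a schema validator are all assumptions made for illustration.

```python
import json
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")

def run_with_retries(
    attempt: Callable[[int], T],
    verify: Callable[[T], bool],
    k: int,
) -> Tuple[Optional[T], int]:
    """Try up to k times; return (verified result or None, attempts used)."""
    for j in range(1, k + 1):
        result = attempt(j)
        if verify(result):          # the verifier, not the actor, decides success
            return result, j
    return None, k                  # budget exhausted: report failure, do not guess

def is_valid_json(text: str) -> bool:
    """Stand-in verifier: accepts only text that parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# toy step: the first two attempts emit malformed output, the third is valid
outputs = {1: "{not json", 2: "", 3: '{"status": "ok"}'}

result, used = run_with_retries(lambda j: outputs[j], is_valid_json, k=3)
print(result, used)
```

Returning `None` when the budget runs out keeps failure visible to the caller instead of letting an unverified result continue down the chain.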

PYTHON · RUNNABLE IN-BROWSER

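# compounding failure (EQ A5.1) and the verified-retry rescue (EQ A5.2)
import numpy as np

try:
    plot_xy                          # assumed to be supplied by the in-browser runtime
except NameError:
    def plot_xy(x, y):               # no-op fallback so the script also runs standalone
        pass

steps = np.arange(1, 101)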

print(" p/step    P(50 steps)   P(100 steps)")
for p in (0.95, 0.99, 0.999):
    plot_xy(steps, p ** steps)
    print(f"  {p:5.3f}    {p**50:10.1%}   {p**100:11.1%}")

# the rescue: 2 retries behind a perfect verifier, base p = 0.95
p, k = 0.95, 3                       # k attempts per step
p_eff = 1 - (1 - p) ** k             # EQ A5.2
plot_xy(steps, p_eff ** steps)
print(f"\nrescued: p = 0.95 with {k - 1} verified retries -> p_eff = {p_eff:.6f}")
print(f"         P(50 steps) = {p_eff**50:.1%}   vs   {0.95**50:.1%} raw")
print(f"\npunchline: 0.99^50 = {0.99**50:.3f}: a 1% per-step error rate sinks")
print("about 40% of tasks at fifty steps; the verifier, not the model,")
print("is what bends the curve back toward 1")


The formula hides two assumptions that fail independently in practice. First, attempts must be detectably wrong. An agent with no verifier cannot trigger a retry on a step it believes succeeded — and a model that just produced a wrong answer usually believes exactly that. Blind regeneration without a selection signal leaves you sampling from the same marginal distribution: success probability $p$, no matter how many times you roll. Second, attempts must be independent-ish. Same model, same prompt, same poisoned context — the second attempt fails for the same reason the first did. You retry into the same wall.

With an imperfect verifier of accuracy $v$ (probability it labels a given output correctly), the algebra is honest about both failure directions — good work wrongly rejected, bad work wrongly approved:

EQ A5.3 — IMPERFECT VERIFIER $$ P_{\text{step}} \;=\; \underbrace{p\,v\;\frac{1 - c^{\,k-1}}{1 - c}}_{\text{approved before the last attempt}} \;+\; \underbrace{c^{\,k-1}\, p}_{\text{shipped on the final attempt}}\,, \qquad c \;=\; p\,(1-v) + (1-p)\,v $$

$c$ is the per-attempt rejection probability (correct work wrongly rejected, plus incorrect work rightly rejected); the model assumes the agent ships its final attempt when the retry budget runs out. At $v = 1$ this collapses to EQ A5.2. At $v = 0.5$ it collapses to $P_{\text{step}} = p$ exactly: a coin-flip verifier makes every retry worthless — it rejects good work as often as bad, and the retries cancel to nothing. Verifier quality is not a tuning detail; it is the term that decides whether retries exist at all.

A step succeeds $p = 0.8$ per attempt, but the verifier is a coin flip, $v = 0.5$, with $k = 3$ attempts. By EQ A5.3 (which collapses to $P_{\text{step}} = p$ when $v = 0.5$), what is $P_{\text{step}}$?

At $v = 0.5$ the verifier rejects good work as often as it rejects bad work, so the retries cancel and $P_{\text{step}} = p = 0.8$ exactly — the three attempts buy nothing. Verifier quality, not retry count, is what makes retries worth running. The answer is 0.8.
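EQ A5.3 is easy to mistype, so a direct transcription with its two limiting cases checked is worth having. This is a sketch under the equation's own model (the agent ships its final attempt when the budget runs out). The `expected_attempts` helper is inferred from that same model rather than taken from a formula stated in the text, and both function names are illustrative.

```python
def p_step(p: float, v: float, k: int) -> float:
    """EQ A5.3: per-step success with k attempts and verifier accuracy v."""
    c = p * (1 - v) + (1 - p) * v            # per-attempt rejection probability
    # correct work approved on attempts 1..k-1; written as a sum so c = 1 is safe
    approved_early = p * v * sum(c ** j for j in range(k - 1))
    shipped_last = c ** (k - 1) * p           # final attempt ships regardless
    return approved_early + shipped_last

def expected_attempts(p: float, v: float, k: int) -> float:
    """Mean attempts per step: attempt j+1 runs only if the first j were rejected."""
    c = p * (1 - v) + (1 - p) * v
    return sum(c ** j for j in range(k))

print(f"v = 1.0 : {p_step(0.9, 1.0, 3):.3f}   (EQ A5.2 gives 0.999)")
print(f"v = 0.5 : {p_step(0.8, 0.5, 3):.3f}   (collapses to p = 0.8)")
eff = p_step(0.99, 0.9, 3)
print(f"p = 0.99, v = 0.9, k = 3: P_step = {eff:.4f}, P(50 steps) = {eff**50:.1%}")
```

With $p = 0.99$, $v = 0.9$, and $k = 3$, the last line lands at roughly 94% over fifty steps, against about 61% for bare $p^{50}$.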

INSTRUMENT A5.1 · RELIABILITY CALCULATOR · EQ A5.1–A5.3 · LIVE

Controls (defaults): per-step success p = 99.0% · steps n = 50 · verifier on/off · retries = 2 (that is, $k = 3$ attempts in the notation of EQ A5.2) · verifier accuracy v = 90%

Readouts: effective per-step P · P(task) at n steps · steps until P < 50% · expected attempts per step

Defaults: p = 99%, two retries, a 90%-accurate verifier → ≈ 94% task success at 50 steps, versus ≈ 61% raw. Now toggle the verifier OFF: the formula switches from EQ A5.3 to bare $p^n$ and the retry slider goes dead — without a success signal, retries are blind re-rolls that change nothing. Turn it back ON and walk v down to 50%: the curves merge again. That convergence is the chapter's thesis in one gesture.

Vary the approach, not just the seed. Because failures are correlated, the highest-value retry changes something structural: a different decomposition of the step, a different tool (read the file instead of trusting the summary), a fresh context that drops the transcript of the failed attempt (failure text in context actively steers regeneration toward the same hole), a different model. A useful escalation ladder for attempt $j$: same approach with the verifier's rejection reason appended → same goal, new strategy, clean context → escalate to a stronger model → escalate to a human. And retry at the smallest failing unit — re-running one tool call is cheap; re-running the task because verification only happens at the end converts a step failure into a task failure, which is precisely the compounding you were trying to escape.
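The escalation ladder above can be written down as data, which keeps the retry policy out of the prompt and inside the harness. A minimal sketch: the `Rung` fields, the tier labels, and the choice to give the stronger model a fresh context are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rung:
    change: str            # what differs structurally from the failed attempt
    fresh_context: bool    # drop the failed attempt's transcript?
    tier: str              # hypothetical tier label, not a real model id

# rung i is used after the i-th rejection; order follows the ladder in the text
LADDER = (
    Rung("same approach, verifier's rejection reason appended", False, "base"),
    Rung("same goal, new strategy", True, "base"),
    Rung("stronger model", True, "strong"),   # fresh context here is an assumption
)

def next_rung(rejections: int) -> Optional[Rung]:
    """Retry configuration after `rejections` failures; None means hand off to a human."""
    if rejections < 1:
        raise ValueError("the ladder starts after the first rejection")
    return LADDER[rejections - 1] if rejections <= len(LADDER) else None

for r in range(1, 5):
    rung = next_rung(r)
    print(r, rung.change if rung else "escalate to a human")
```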

5.3

Stop conditions: budgets, progress, loops

Retries fix steps that fail loudly. The more expensive pathology is the loop that never fails at all — it just stops going anywhere. Every production agent needs an explicit answer to "when does this loop end?", and "when the model decides it's done" is not an answer: the model's judgment is the thing being supervised.

Stop condition	Trigger	Implementation notes
Budgets	tokens · tool calls · wall-clock · dollars	Hard caps enforced by the harness, not the prompt. Set per-step and per-task; an agent that is told its remaining budget often self-corrects, but the cap must hold either way.
Progress detection	no measurable progress in W steps	Requires defining a progress signal up front: tests passing, items checked off, diff distance to goal. No signal moving for a window of W steps → stop or escalate. Token consumption is not progress.
Loop detection	repeated state or action	Hash the last few (tool, arguments) pairs; the same call with the same arguments returning the same result N times is a cycle, period. Also catch A→B→A→B oscillation (edit, revert, edit, revert).
Watchdog	external supervisor trips any of the above	Lives outside the agent's context — a process or a cheap second model reading the trace. On trip: kill, snapshot state, summarize for post-mortem or human handoff.
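Three of the table's rows fit in one small harness-side class. The sketch below is illustrative only: the class name, the thresholds, and the convention that `progress_signal` is a number where larger means more progress are all assumptions, and a production watchdog would also track tokens, wall-clock, and dollars.

```python
import hashlib
import json
from collections import Counter, deque

class Watchdog:
    """Trips on a call budget, repeated calls, A-B-A-B oscillation, or stalled progress."""

    def __init__(self, max_calls=50, repeat_limit=3, window=6):
        self.max_calls = max_calls
        self.repeat_limit = repeat_limit           # N identical (tool, args, result) triples
        self.window = window                       # W steps without progress
        self.calls = 0
        self.recent = deque(maxlen=window)         # fingerprints of the last few calls
        self.progress = deque(maxlen=window + 1)   # signal now vs. W steps ago

    @staticmethod
    def _fingerprint(tool, args, result):
        blob = json.dumps([tool, args, result], sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()[:16]

    def observe(self, tool, args, result, progress_signal):
        """Record one step; return a stop reason, or None to continue."""
        self.calls += 1
        self.recent.append(self._fingerprint(tool, args, result))
        self.progress.append(progress_signal)

        if self.calls >= self.max_calls:
            return "budget: tool-call cap reached"
        if Counter(self.recent).most_common(1)[0][1] >= self.repeat_limit:
            return "loop: same call, same arguments, same result"
        r = list(self.recent)
        if len(r) >= 4 and r[-1] == r[-3] and r[-2] == r[-4] and r[-1] != r[-2]:
            return "loop: A-B-A-B oscillation"
        if len(self.progress) == self.window + 1 and self.progress[-1] <= self.progress[0]:
            return "stall: no progress in window"
        return None

# every call differs, so the repeat check never fires; only the progress window does
dog = Watchdog()
for i in range(12):
    reason = dog.observe("grep", {"pattern": f"variant_{i}"}, "no match", progress_signal=0)
    if reason:
        break
print(i + 1, reason)   # → 7 stall: no progress in window
```

A trace in which every call is slightly different slips past exact-match loop detection, which is why the progress window is a separate check.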

FIELD NOTE

The agent that greps forever. A classic trace: the agent greps for a symbol, gets no match, and concludes — reasonably — that it should search differently. Then it greps eleven more times, varying the casing, the directory, the regex flavor. Each call is locally sensible; the trajectory is a flat line. The model is the last to know it's looping, because its own context normalizes the repetition — by call eight, a transcript full of greps makes another grep look like the established procedure. This is why watchdogs are external by definition: you do not ask the loop whether it is a loop.

One reframe makes teams much better at this: a clean stop is a success mode. An agent that halts at budget with a structured summary — what was attempted, what's verified-done, what failed, what it would try next — has produced a resumable artifact. An agent that thrashes until someone kills the process has produced a forensic exercise. Design the abort path with the same care as the happy path; Chapter 06 makes both observable.

5.4

Plan–act–verify–revise: the canonical inner loop

Sections 5.1–5.3 assemble into one structure, and nearly every serious agent system converges on it independently: plan the next move, act on the smallest meaningful unit, verify the result against something the actor doesn't control, revise on failure. The loop's power is where it puts verification: after every act, not at the end. Per-step verification turns one long chain of $n$ multiplied probabilities into $n$ short, independently recoverable chains — it is EQ A5.2 applied at the finest grain available.

FIG A5.1 · THE INNER LOOP: GATES, RETRY PATH, RE-PLAN PATH, EXTERNAL WATCHDOG

Two distinct failure edges. The solid edge retries the act with a varied approach; the dashed edge abandons the plan itself. Conflating them — retrying forever under a broken plan — is the single most common loop pathology. The watchdog supervises from outside the context window.

Plan gates are the cheap insurance on the front edge: before the first expensive action, check the plan mechanically. Do the files it references exist? Does it cover every stated constraint? Is its step count inside budget? Does it touch anything on the do-not-touch list? A plan gate is a verifier for intentions — it costs one cheap model call or a few assertions, and it catches the class of failure that no amount of step-level retrying can fix, because every step can succeed while the plan marches confidently toward the wrong goal.

Re-planning triggers formalize the dashed edge. The revise stage must diagnose, not just retry: an execution failure (right idea, flaky step) routes back to ACT with a varied approach; a plan failure routes back to PLAN. Concrete triggers that production systems use: the same step rejected $N$ times despite varied approaches (the plan assumed something false); verification revealing the world differs from the plan's premise (the API the plan depends on is deprecated); burn rate — actual cost per completed step exceeding the plan's implicit estimate by a multiple. Re-planning from a summarized state is cheap; discovering at step 40 that step 3's plan was wrong is not.
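The two failure edges are easier to keep apart once they are separate branches in code. The following is a toy sketch, not a production loop: the callables, the step names, and the single re-plan trigger it implements (a step rejected on every allowed attempt) are assumptions chosen to keep the example deterministic.

```python
from typing import Callable, List

def run_inner_loop(
    make_plan: Callable[[str], List[str]],
    plan_gate: Callable[[List[str]], bool],
    act: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_attempts_per_step: int = 3,
    max_replans: int = 2,
) -> dict:
    """Plan, act, verify, revise; always return a structured summary."""
    state = "start"
    done: List[str] = []
    for replan in range(max_replans + 1):
        plan = make_plan(state)
        if not plan_gate(plan):                  # verifier for intentions
            state = "plan rejected by gate"
            continue                             # re-plan edge: back to PLAN
        failed_step = None
        for step in plan:
            if step in done:                     # verified work is not redone
                continue
            for attempt in range(1, max_attempts_per_step + 1):
                result = act(step, attempt)      # attempt number lets act vary its approach
                if verify(step, result):         # retry edge ends on a verified result
                    done.append(step)
                    break
            else:
                failed_step = step               # rejected every time: suspect the plan
                break
        if failed_step is None:
            return {"status": "done", "verified": done, "replans": replan}
        state = f"step '{failed_step}' rejected {max_attempts_per_step} times"
    return {"status": "stopped", "verified": done, "last_state": state}

# toy world: the first plan depends on an API that is gone; the revised plan works
def make_plan(state):
    api = "call v2 api" if "rejected" in state else "call v1 api"
    return ["read config", api]

def act(step, attempt):
    if step == "read config":
        return "ok"
    if step == "call v1 api":
        return "410 gone"
    return "200 ok" if attempt >= 2 else "timeout"

summary = run_inner_loop(
    make_plan,
    plan_gate=lambda plan: 0 < len(plan) <= 5,
    act=act,
    verify=lambda step, result: result in {"ok", "200 ok"},
)
print(summary)
```

The v1 call burns its three attempts and triggers one re-plan; the v2 call needs one ordinary retry. Both outcomes are visible in the returned summary.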

5.5

Multi-agent topologies

Multi-agent is not a virtue; it is a topology decision, and the null hypothesis — one agent, one loop — wins more often than the conference talks suggest. Splitting work across agents pays only when it buys one of three things: parallelism over genuinely independent subtasks, context isolation (each worker gets a clean, focused window instead of one bloated one), or independence of judgment (diversity or adversarial pressure that a single context cannot produce, because one context anchors itself). Five recurring shapes:

Topology	Shape	Wins when…	Fails when…
Orchestrator–workers	one planner fans out, owns synthesis	work decomposes cleanly and results must merge coherently in one place	subtasks are coupled; orchestrator becomes the bottleneck and the context hog
Pipeline	serial stages, artifact handoff	staged transforms with machine-checkable interfaces between stages	early-stage errors amplify downstream; stage latencies add up serially
Council + judge	parallel independent opinions → aggregator	diverse judgments are the product: review, ranking, curation	a ground-truth verifier exists — tests beat votes, always
Debate	adversaries argue before a judge	one contested claim, high cost of being wrong	open-ended generation; rewards persuasiveness, which is not truth
Swarm	homogeneous workers, shared queue	many independent, near-identical units of work	shared mutable state — coupled edits become merge conflicts

PYTHON · RUNNABLE IN-BROWSER

# what a topology costs: one task, four shapes, tokens + wall-clock
UNIT = 30_000     # tokens a single agent burns solving the task alone
RESULT = 300      # a structured result handed back (never a transcript)

shapes = {"single agent": (UNIT, 1.00)}
# orchestrator + 3 parallel workers, each ~40% of the exploring + handback
shapes["orchestrator-workers"] = (int(0.25*UNIT + 3*0.4*UNIT + 3*RESULT), 0.25 + 0.40 + 0.10)
# pipeline: 4 serial stages at ~30% each, artifact checks at the seams
shapes["pipeline"] = (int(4*0.3*UNIT + 3*RESULT), 4 * 0.30)
# council: 3 full independent attempts + a judge reading three results
shapes["council + judge"] = (int(3*UNIT + 3*RESULT + 2_000), 1.00 + 0.10)

print(f"{'topology':22s}{'tokens':>8s}{'vs single':>10s}{'wall-clock':>11s}")
for name, (tok, wall) in shapes.items():
    print(f"{name:22s}{tok:8,d}{tok/UNIT:9.2f}x{wall:11.2f}")

o_tok, o_wall = shapes["orchestrator-workers"]
c_tok = shapes["council + judge"][0]
print(f"\nfan-out buys wall-clock, never tokens: the orchestrator runs "
      f"{1 - o_wall:.0%} faster for {o_tok/UNIT - 1:.0%} more tokens;")
print(f"the council pays {c_tok/UNIT:.1f}x for independent judgment — worth it only")
print("when no ground-truth verifier exists, because tests beat votes")


A single agent solves a task in 30,000 tokens. A council + judge runs 3 full independent attempts (30,000 each), hands back 3 results of 300 tokens, and the judge reads them with 2,000 tokens of overhead. Roughly how many × the single-agent token cost is the council?

Council tokens $= 3 \times 30{,}000 + 3 \times 300 + 2{,}000 = 90{,}000 + 900 + 2{,}000 = 92{,}900$. Multiplier $= 92{,}900 / 30{,}000 \approx 3.1$. You pay ~3× for independent judgment — worth it only when no ground-truth verifier exists, because tests beat votes. The answer is 3.1.

Two rules govern every topology, and violating either converts multi-agent from a speedup into a liability:

Hand off results, not transcripts. A worker returns an artifact plus a structured summary — what was done, what's verified, what's unresolved — never its raw conversation. Transcripts carry the worker's dead ends, hallucinated intermediates, and tone into the consumer's context, where they poison downstream reasoning and burn the window. The interface between agents is a contract, exactly like a function signature; Chapter 03's tool-design discipline applies to agents talking to agents.
Parallelize only independent subtasks. Two agents editing the same file is a merge-conflict generator with extra steps; two agents researching with a shared mutable notes doc will overwrite each other's reasoning. Enforce a single-writer rule per resource, and remember EQ A5.1 cuts both ways: every handoff is itself a step that can fail. Adding agents adds steps — the topology must remove more failure surface from the critical path than its own coordination adds.
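Both rules can be enforced mechanically before any worker starts. A minimal sketch, assuming a handoff is a small typed record and resources are identified by path; the `Handoff` fields mirror the summary described above, while the class name, `assign_writers`, and the example paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class Handoff:
    """What a worker returns: an artifact plus a structured summary, never a transcript."""
    artifact: str                                   # path or id of the produced artifact
    done: List[str]
    verified: List[str]
    unresolved: List[str] = field(default_factory=list)

def assign_writers(subtasks: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each resource to its single writer; raise if two workers claim one."""
    owner: Dict[str, str] = {}
    for worker, resources in subtasks.items():
        for res in resources:
            if res in owner:
                raise ValueError(f"{res!r} claimed by both {owner[res]!r} and {worker!r}")
            owner[res] = worker
    return owner

owners = assign_writers({"worker-1": ["src/billing/a.py"], "worker-2": ["src/auth/b.py"]})
print(owners)

try:
    assign_writers({"worker-1": ["notes.md"], "worker-2": ["notes.md"]})
except ValueError as err:
    print("rejected fan-out:", err)
```

Rejecting the fan-out at assignment time is cheaper than discovering the conflict at merge time.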

INSTRUMENT A5.2 · TOPOLOGY PICKER · 6 SCENARIOS · 5 SHAPES

ORCHESTRATOR–WORKERS: Decomposable work; results must merge in one place.
PIPELINE: Staged transforms with checkable interfaces.
COUNCIL + JUDGE: Independent parallel judgments, then aggregation.
DEBATE: One contested claim; high cost of being wrong.
SWARM: Many homogeneous independent units of work.

Pick a scenario; the winning shape lights up mint, the trap lights up red. Real systems nest these — an orchestrator whose workers are pipelines, a debate whose judge polls a council. The picker shows the dominant pattern; hybrids are the norm, and the single-agent null hypothesis should still beat all five for any task that fits in one context window.

5.6

Long-horizon patterns

Past a few hundred steps, the enemy stops being step failure and becomes state amnesia: the context window fills, compaction or restarts shed detail, and the agent forgets what it decided and why. The long-horizon patterns all share one move — get the program state out of the context window and into something durable, so the model becomes a stateless worker against external state.

Compaction checkpoints. Compaction at an arbitrary moment amputates mid-thought reasoning, and the successor context inherits a summary of confusion. Checkpoint deliberately instead: at clean boundaries — typically right after a verify-pass — write a durable record of goal, decisions made (with reasons), verified state, next action, then compact or restart from that record. The agent should be able to die at any checkpoint and a fresh instance continue from the file alone. If it can't, the checkpoint is decorative.
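A checkpoint that a fresh instance can resume from needs little more than the four fields named above and a write that cannot be left half-finished. The sketch below is one way to do it with the standard library; the field names, file name, and example values are illustrative assumptions.

```python
import json
import os
import tempfile
from pathlib import Path

REQUIRED = {"goal", "decisions", "verified_state", "next_action"}

def write_checkpoint(path: Path, record: dict) -> None:
    """Write to a temp file, then swap it in, so a crash never leaves a partial record."""
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"checkpoint missing fields: {sorted(missing)}")
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    os.replace(tmp, path)            # rename within one directory replaces the old file in one step

def resume(path: Path) -> dict:
    """A fresh instance starts from the file alone."""
    return json.loads(path.read_text())

with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    write_checkpoint(ckpt, {
        "goal": "migrate call sites off the legacy API",
        "decisions": [{"what": "use a codemod", "why": "too many sites to edit by hand"}],
        "verified_state": "codemod unit tests green",
        "next_action": "migrate the next shard",
    })
    restored = resume(ckpt)

print(restored["next_action"])
```

Refusing to write a record that lacks a required field is a cheap guard against a decorative checkpoint.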

External task lists as program counters. The oldest idea in computing, rediscovered: keep the loop variable outside the loop.

# tasks.md — the program counter lives outside the context window
[x] 01 inventory call sites of legacy API        # done · 312 sites
[x] 02 write codemod + unit tests                # done · tests green
[>] 03 migrate src/billing/** (shard 3/9)        # in progress — resume here
[ ] 04 migrate src/auth/**
[ ] 05 run full suite · bisect any failures
invariant: every beat → read list · do ONE unchecked item · verify · update list · exit

Marking an item done only after verification makes the list a record of truth, not of intention — and making each item idempotent (safe to re-run if the agent died mid-item) makes crashes cost one item instead of one mission.

Heartbeat loops. For work that outlives any session — monitoring, week-long migrations, slow external dependencies — invert the architecture: instead of one agent that must survive, schedule a recurring re-entry. Each beat: wake, read durable state, do one bounded unit of work, write state back, exit. Reliability comes from the boundedness: each beat is a short chain with small $n$ and full verification, so EQ A5.1 never gets room to compound. A long-horizon agent is a chain of short reliable sessions, not one heroic context.
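One beat of that loop, run against a task list in the format shown earlier, might look like the sketch below. It is deliberately tiny and makes several assumptions: items are idempotent, the list is held as a list of lines, and `do_item` and `verify` are hypothetical callables supplied by the harness.

```python
import re

TASK_RE = re.compile(r"^\[(?P<mark>[ x>])\] (?P<id>\d+) (?P<text>.*)$")

def beat(lines, do_item, verify):
    """One heartbeat: read list, do ONE unchecked item, verify, update list, exit."""
    for i, line in enumerate(lines):
        m = TASK_RE.match(line)
        if not m or m["mark"] == "x":
            continue                                 # skip non-task lines and finished items
        ok = verify(m["id"], do_item(m["id"]))
        mark = "x" if ok else ">"                    # checked off only after verification
        return lines[:i] + [f"[{mark}] {m['id']} {m['text']}"] + lines[i + 1:]
    return lines                                     # nothing left to do

tasks = [
    "[x] 01 inventory call sites of legacy API",
    "[>] 03 migrate src/billing/**",
    "[ ] 04 migrate src/auth/**",
]

def always_ok(item_id, result):
    return True

after_one = beat(tasks, do_item=lambda item_id: "migrated", verify=always_ok)
after_two = beat(after_one, do_item=lambda item_id: "migrated", verify=always_ok)
print("\n".join(after_two))
```

Each call touches exactly one item and returns, so a crash between beats costs at most the item in flight.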

5.7

Cost & latency engineering

Once the loop is reliable, it is usually overpaying: a verified retry loop happily runs the flagship model on steps a model a tenth the price handles identically. Model-tier routing assigns each step the cheapest tier that clears its required $p$ — mechanical steps (formatting, extraction, glue) to a fast cheap model, judgment steps (planning, diagnosis, revision after repeated failure) to the strong one. The verifier is what makes this safe: routing without verification is gambling with a smaller bankroll; routing with verification is an asymmetry you can price exactly:

EQ A5.4 — DRAFTER + VERIFIER EXPECTED COST $$ \mathbb{E}[\text{cost}] \;=\; c_{\text{draft}} + c_{\text{verify}} + (1 - a)\,c_{\text{strong}} \;<\; c_{\text{strong}} \quad\Longleftrightarrow\quad c_{\text{draft}} + c_{\text{verify}} \;<\; a\,c_{\text{strong}} $$

$a$ is the acceptance rate — the fraction of cheap drafts the verifier passes. With a drafter at a tenth the flagship's price, verification at a twentieth, and $a = 0.7$: expected cost $= 0.1 + 0.05 + 0.3 = 0.45$ of always-flagship, at flagship-grade output quality wherever the verifier is sound. The strong model is paid only for the failures of the cheap one.

A drafter costs $c_{\text{draft}} = 0.1$, verification $c_{\text{verify}} = 0.05$, the strong model $c_{\text{strong}} = 1$ (in flagship units), and the verifier accepts $a = 0.7$ of cheap drafts. By EQ A5.4, what is $\mathbb{E}[\text{cost}] = c_{\text{draft}} + c_{\text{verify}} + (1-a)\,c_{\text{strong}}$?

$\mathbb{E}[\text{cost}] = 0.1 + 0.05 + (1 - 0.7)\times 1 = 0.1 + 0.05 + 0.3 = 0.45$ of always-flagship — a 55% saving at flagship-grade quality wherever the verifier is sound. The strong model is paid only for the 30% of drafts the cheap one gets wrong. The answer is 0.45.
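EQ A5.4 also yields a break-even acceptance rate, which is usually the number worth monitoring. A short sketch in flagship units, assuming the cost model of EQ A5.4 exactly as stated; the function names are illustrative.

```python
def expected_cost(c_draft: float, c_verify: float, c_strong: float, a: float) -> float:
    """EQ A5.4: every step pays draft + verify; rejected drafts also pay the strong model."""
    return c_draft + c_verify + (1 - a) * c_strong

def break_even_acceptance(c_draft: float, c_verify: float, c_strong: float) -> float:
    """Routing beats always-flagship only when a exceeds this value."""
    return (c_draft + c_verify) / c_strong

cost = expected_cost(0.1, 0.05, 1.0, a=0.7)
a_min = break_even_acceptance(0.1, 0.05, 1.0)
print(f"expected cost = {cost:.2f} of always-flagship; break-even a = {a_min:.2f}")
```

With these prices the scheme pays for itself as long as the verifier accepts more than 15% of cheap drafts. If the acceptance rate falls below that line, routing costs more than always calling the strong model.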

If this looks familiar, it should: it is speculative decoding (Vol II · Ch 08) lifted from the token level to the task level. There, a small draft model proposes tokens and the large model verifies them in one cheap parallel pass, keeping the large model's exact distribution while shifting most of the work to the cheap one. Here, a cheap agent proposes a step result and a verifier accepts or escalates. Same theorem, same precondition: the scheme only pays because verification is cheaper than generation. That asymmetry is real for tests, compilers, schema checks, and constrained judges with rubrics; it is contested for open-ended quality judgments, where the LLM-as-judge has biases of its own — Chapter 06 measures exactly how much you can trust it.

Latency obeys different algebra than cost: it follows the critical path, not the sum. Fan-out across independent subtasks costs more tokens but collapses wall-clock to the slowest branch plus synthesis; verification adds latency only if it serializes — so run cheap checks concurrently with the next step's draft when steps are independent, batch verifications where they aren't, and remember the orchestrator that must read every worker's output is a serial drain at the end of every parallel fan-out. Topology, routing, retries, and stop conditions are all one budget in three currencies — success probability, dollars, and seconds — and loop engineering is the art of spending each where its marginal return is highest.

A loop you cannot measure is a loop you cannot trust. Chapter 06: evals for agents — pass@k versus pass^k, trajectory scoring, the observability traces that catch the grep-forever loop in minute two instead of hour two, and the cost dashboards that tell you whether any of this engineering paid for itself.
