Why agent evals are hard
Classical LLM evaluation assumes a clean contract: one input, one output, one reference to compare against. Agents break every clause of that contract at once:
- Nondeterminism is structural, not incidental. Sampling temperature, tool side effects, live APIs, mutable filesystems, rate limits — run the same task twice and you get two different trajectories, sometimes two different outcomes. A single-run eval of an agent is a coin flip wearing a lab coat. The fix is statistical: multiple seeded runs per task, and reported variance, not just a point estimate (the noise-floor logic of Vol III · EQ P7.1 applies with larger error bars).
- Errors compound across steps. A model that takes the right action 98% of the time finishes a 30-step task at \(0.98^{30} \approx 0.55\) if errors are fatal and independent. In practice they are neither, fully: recovery loops (Ch 05) buy back some of that. But the geometry is real: per-step metrics wildly overstate end-to-end reliability, and small per-step gains move task success a lot (under the same assumptions, 99% per step gives \(0.99^{30} \approx 0.74\)).
- There is no single right answer. Two correct trajectories for "fix this bug" may share zero tool calls and produce different-but-both-valid patches. String-matching transcripts is hopeless; you must verify outcomes (does the test suite pass? does the row exist?) or score process quality with rubrics — never diff trajectories against a golden transcript.
- The world is part of the system under test. A flaky dependency, a changed API response, a repo that drifted since the task was authored — all show up as "agent regressions." Serious harnesses pin the environment: containerized repos, recorded API fixtures, snapshot databases.
- Evals are expensive. One end-to-end run can take minutes and cost real dollars, so statistical power is something you budget for, not something you get for free. This is exactly why the next section is a pyramid and not a single metric.
The consequence: agent evaluation is a layered system you engineer, with the same care as the agent itself. Teams that treat it as an afterthought discover their agent's true success rate from customer tickets.
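The compounding arithmetic from the list above can be sketched in a few lines. This is a minimal illustration under the same simplification the text already flags as only partly true (every error is fatal and independent); the function names are hypothetical.

```python
def task_success(per_step: float, steps: int) -> float:
    """End-to-end success if every step must succeed and errors are independent."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target ** (1.0 / steps)


for p in (0.95, 0.98, 0.99):
    print(f"per-step {p:.0%} -> 30-step task {task_success(p, 30):.1%}")

print(f"90% task success over 30 steps needs "
      f"{required_per_step(0.90, 30):.2%} per step")
```

Read in the inverse direction, the same formula is a planning tool: pick the task-level reliability you need, then see what per-step accuracy it implies before recovery loops are counted.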
The eval pyramid
Like the testing pyramid in software, agent evals trade fidelity against speed. Cheap, deterministic checks run thousands of times a day and catch most regressions; expensive end-to-end runs are the ground truth you can only afford nightly. Each layer catches what the one below cannot see:
| Layer | Unit scored | Oracle | Cost / run | What it misses |
|---|---|---|---|---|
| L0 — Prompt evals | one completion | asserts: contains / parses / classifies correctly | ~free | Everything multi-step; passes while the agent loops forever |
| L1 — Tool-call accuracy | one decision | golden tool call: name + args match (semantically, not byte-wise) | ¢ | Compounding: 95% per-step ≠ 95% per-task |
| L2 — Trajectory evals | the step sequence | rubric, scored by humans or an LLM judge (§6.4) | ¢¢ | Judge bias; a beautiful trajectory can still end in the wrong answer |
| L3 — End-to-end success | final world state | harness verification: tests pass, record exists, file correct | $ | The why — a pass/fail bit with no diagnosis |
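The L1 oracle compares tool calls "semantically, not byte-wise". One way that might look, as a sketch: parse both calls and compare them after normalization. What counts as equivalent is task-specific; the two rules below (null equals omitted, whitespace runs are insignificant) and the `name`/`args` layout are assumptions for illustration.

```python
import json
from typing import Any


def normalize(value: Any) -> Any:
    """Canonicalize a parsed JSON value so trivially different encodings compare equal."""
    if isinstance(value, dict):
        # Assumption: an argument explicitly set to null equals an omitted argument.
        return {k: normalize(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        # Assumption: runs of whitespace carry no meaning in string arguments.
        return " ".join(value.split())
    return value


def tool_call_matches(actual_json: str, golden_json: str) -> bool:
    """True when the tool name and the normalized arguments agree."""
    actual, golden = json.loads(actual_json), json.loads(golden_json)
    return (actual["name"] == golden["name"]
            and normalize(actual.get("args", {})) == normalize(golden.get("args", {})))
```

Key order needs no special handling because parsed JSON objects compare as unordered dictionaries. Looser equivalences (synonymous queries, reordered lists) would need per-tool comparators and are out of scope for this sketch.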
The raw material for every layer is the same: a golden set — a small, frozen, versioned collection of tasks with verified outcomes. The discipline you learned for prompts (Vol III Ch 07) transfers intact: 30–200 tasks drawn from real traffic and real failures, decontaminated from anything the model might have memorized, pinned alongside the prompt and tool versions they test. Every production incident you fix becomes a new golden task. The set is never edited casually — when it changes, every historical score changes meaning.
The pyramid is also a debugging router. L3 fails but L2 looks clean? Suspect the environment or the verifier. L2 degrades while L1 holds? The individual decisions are fine but the plan is drifting — look at context engineering (Ch 02). L1 drops after a tool-description edit? You just measured the blast radius of a one-line change.
pass@k, read honestly
pass@k answers: if I let the agent attempt the task \(k\) times, what is the probability that at least one attempt succeeds? Estimating it naively is a trap. You could sample \(k\) runs, check if any passed, and average — but that wastes samples and has brutal variance. You could compute the per-attempt success rate \(c/n\) from \(n\) runs and plug it into \(1-(1-c/n)^k\) — but that estimator is biased low: it treats your \(k\) hypothetical draws as resampling with replacement from the empirical rate, and the bias is worst exactly where benchmarks live (small \(n\), small \(c\)). The fix, popularized by the Codex paper (Chen et al., 2021), is combinatorial:
# the unbiased pass@k estimator (EQ A6.1), exact, via math.comb
from math import comb
def pass_at_k(n, c, k):  # 1 - C(n-c, k) / C(n, k)
    if n - c < k:  # fewer than k failures: every k-subset contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
n = 20
print(f"n = {n} samples per task")
print(f"{'c':>4s}{'pass@1':>9s}{'pass@5':>9s}{'pass@10':>9s}")
for c in (2, 5, 10):
    p1, p5, p10 = (pass_at_k(n, c, k) for k in (1, 5, 10))
    print(f"{c:4d}{p1:9.1%}{p5:9.1%}{p10:9.1%}")
ks = list(range(1, 16))
# plot_xy is assumed to be a plotting helper supplied by the host page; it is not defined here
plot_xy(ks, [pass_at_k(n, 5, k) for k in ks])  # the c = 5 curve
p1, p8 = pass_at_k(n, 5, 1), pass_at_k(n, 5, 8)
print(f"\nc = 5: pass@1 = {p1:.1%} but pass@8 = {p8:.1%}, "
      f"a {p8/p1:.1f}x flattery factor")
print("pass@8 measures the harness's attempt budget plus an oracle to pick")
print("the winner; a headline that omits k is marketing, not measurement")
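The claim that the plug-in estimator is biased low can be checked without simulation. The sketch below computes the exact expectation of each estimator when the number of successes follows a binomial distribution; the values of `p`, `n`, and `k` are illustrative and the helper names are hypothetical.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k); math.comb returns 0 when n - c < k."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_plugin(n: int, c: int, k: int) -> float:
    """Naive plug-in estimate: 1 - (1 - c/n)^k."""
    return 1.0 - (1.0 - c / n) ** k


def expected_estimate(estimator, p: float, n: int, k: int) -> float:
    """Exact expectation of an estimator when c ~ Binomial(n, p)."""
    return sum(comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
               for c in range(n + 1))


p, n, k = 0.05, 20, 8   # illustrative values, not taken from any benchmark
truth = 1.0 - (1.0 - p) ** k
print(f"true pass@{k}  = {truth:.4f}")
print(f"E[unbiased]  = {expected_estimate(pass_at_k_unbiased, p, n, k):.4f}")
print(f"E[plug-in]   = {expected_estimate(pass_at_k_plugin, p, n, k):.4f}")
```

For \(k \le n\) the combinatorial estimator's expectation equals the true value, while the plug-in expectation falls below it. The gap follows from convexity: \((1-\hat{p})^k\) is convex in \(\hat{p}\), so averaging it over noisy \(\hat{p}\) overstates the failure probability.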
Now the honesty clause. pass@1 is what a production agent experiences: one attempt, no second chances. pass@8 is what a system with eight attempts and a free oracle to pick the winner experiences. A model that solves a task 5% of the time per attempt posts pass@1 = 5% but pass@8 ≈ 34% — a 7× flattery factor, earned entirely by the harness, not the model. pass@8 is a legitimate number when you actually run best-of-\(k\) with a cheap verifier (unit tests, schema checks) to select among attempts. It is marketing when the headline omits the \(k\). When you read any benchmark claim, demand four facts: k, n, the sampling temperature (pass@1 at \(T=0\) and pass@1 averaged over \(T=0.8\) samples are different quantities), and the scaffold.
That last one dominates more than most people expect. SWE-bench-style harness evals are the L3 gold standard for coding agents: the harness drops the agent into a containerized repo at a pinned commit, the agent produces a patch, and the harness runs the repo's own tests — the previously failing ones must now pass (FAIL_TO_PASS) without breaking the rest (PASS_TO_PASS). Resolution is verified by execution, not by judgment. But the published number is a property of the agent system — model + scaffold + tools + retry budget — and leaderboard climbs mix all four. The Verified subset exists because hundreds of original tasks turned out to be unsolvable or under-specified; contamination remains a live concern for any repo that predates a model's training cutoff. Read harness numbers as: "this scaffold, this model, this k." Nothing more transfers.
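The resolution rule described above reduces to a single conjunction. The sketch below is not the SWE-bench harness API; the function name and the shape of `test_results` are assumptions chosen to make the rule explicit.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """A task is resolved only if every listed test passes after the patch.

    A test missing from `test_results` (for example, it never ran) counts as failed.
    """
    must_pass = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in must_pass)


# Hypothetical test names for illustration.
results = {"test_bug_fixed": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_a"]))                     # True
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_a", "test_existing_b"]))  # False
```

Treating a missing test as a failure is the conservative choice: a patch that prevents the suite from running should not score as a pass.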
LLM-judged trajectories
End-to-end harnesses output one bit per run. When the bit is 0, you need to know which step went wrong — and grading hundreds of fifty-step transcripts is not a job humans accept twice. So you hire a judge model to score trajectories. The craft is to make the judge grade checkable per-step properties, not vibes:
| Rubric item (per step) | Form | What it catches |
|---|---|---|
| Tool choice defensible? | binary + cite the step | Search-when-it-should-read, write-before-verify |
| Arguments grounded? | binary | Args invented rather than copied from a prior observation — the hallucinated-ID classic |
| Result actually read? | binary | Next action contradicts what the tool just returned |
| State re-verified after mutation? | binary | Fire-and-forget writes; assuming success on a 500 |
| Termination correct? | binary, end of run | Declared victory early; gave up with budget remaining; asked the user what a tool could answer |
Binary, citable items keep the judge auditable: every "no" must point to a step number, which a human can check in seconds. Aggregate to a per-run score, then track the distribution across the golden set.
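A sketch of how binary, citable verdicts might be stored and aggregated. The unweighted mean is an assumption (a team may weight termination errors more heavily), and the class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    item: str      # rubric item, e.g. "args_grounded"
    step: int      # the step number the judge cites
    passed: bool


def run_score(verdicts: list[Verdict]) -> float:
    """Fraction of rubric checks passed in one run (unweighted)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.passed for v in verdicts) / len(verdicts)


def failures_by_item(verdicts: list[Verdict]) -> dict[str, list[int]]:
    """Map each failed rubric item to the step numbers a human should check."""
    cited: dict[str, list[int]] = defaultdict(list)
    for v in verdicts:
        if not v.passed:
            cited[v.item].append(v.step)
    return dict(cited)
```

Keeping the cited step on every verdict is what makes the audit cheap: a reviewer opens the trace at the listed steps instead of rereading the whole run.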
Every judge bias from Vol III Ch 07 applies — doubly. Trajectories are long, so the judge inherits the lost-in-the-middle problem and quietly skims steps 12 through 38. Style bias gets worse: an agent that narrates its failure confidently ("I have verified the fix") outscores one that succeeds tersely, unless the rubric forces evidence citations. Position bias contaminates pairwise trajectory comparisons exactly as it does single completions — judge both orders or don't bother. Self-preference is sharper still, because the judge recognizes its own family's action style, not just prose style: judge with a different model family than the agent. And add one bias unique to this setting: outcome leakage — if the judge can see that the run ended in success, it retroactively scores every step as reasonable. Strip the final outcome when you want a genuine process grade.
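"Judge both orders" can be made mechanical. In the sketch below the `Judge` interface (a callable that returns "first" or "second") is a hypothetical stand-in for whatever judge call a team actually uses; a preference is accepted only when it survives swapping the presentation order.

```python
from typing import Callable, Optional

# Hypothetical interface: given two trajectories in presentation order,
# the judge returns "first" or "second".
Judge = Callable[[str, str], str]


def order_robust_preference(judge: Judge, traj_a: str, traj_b: str) -> Optional[str]:
    """Return "a" or "b" only if the judge agrees with itself in both orders."""
    verdict_ab = judge(traj_a, traj_b)   # a is shown first
    verdict_ba = judge(traj_b, traj_a)   # b is shown first
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    # Disagreement means the verdict depended on position, not content.
    return winner_ab if winner_ab == winner_ba else None
```

The share of pairs that come back as `None` is itself a useful number: it estimates how much of the judge's signal is position rather than content.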
Calibrate before you trust: hand-label 50–100 trajectories, measure judge–human agreement per rubric item, and only automate the items where agreement is high. Keep a standing human spot-audit (5–10% of judged runs, forever). A judge you never audit is a metric drifting in the dark.
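Per-item calibration can be sketched with Cohen's kappa, a chance-corrected agreement statistic for two raters. The 0.8 threshold below is an assumption, not a figure from this chapter, and the function names are hypothetical.

```python
def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:   # both raters gave the same constant label
        return 1.0
    return (observed - expected) / (1 - expected)


def items_safe_to_automate(labels: dict[str, tuple[list[bool], list[bool]]],
                           min_kappa: float = 0.8) -> list[str]:
    """Rubric items whose judge-human kappa clears the (assumed) threshold."""
    return [item for item, (human, judge) in labels.items()
            if cohen_kappa(human, judge) >= min_kappa]
```

Raw percent agreement flatters items where almost every step passes; kappa discounts the agreement two raters would reach by always answering "yes".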
Observability: the trace is the artifact
For an agent, the log is not a debugging aid — the trace is the primary artifact the system produces, more durable than any single answer. A production-grade trace records, for every step: the model and prompt version, the rendered context (or a hash plus a pointer to it), the tool called, the exact arguments, the result (truncated, with a content hash), input and output token counts, cache hit tokens, latency, and the stop reason. Structure it as a span tree — sub-agents (Ch 05) nest naturally — and store it where engineers can query it, not where it rotates out after 24 hours.
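The per-step record listed above maps naturally onto a small data structure. This is one possible shape, not a standard schema; every field and helper name here is an assumption for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Stable fingerprint so a truncated result can still be matched to the full one."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Span:
    step: int
    model: str
    prompt_version: str
    context_hash: str      # hash of the rendered context; the full text lives elsewhere
    tool: str
    arguments: dict
    result_preview: str    # truncated tool result
    result_hash: str       # hash of the untruncated result
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    stop_reason: str
    children: list["Span"] = field(default_factory=list)   # nested sub-agent spans

    def total_tokens(self) -> int:
        """Input plus output tokens for this span and everything nested under it."""
        own = self.input_tokens + self.output_tokens
        return own + sum(child.total_tokens() for child in self.children)


def make_span(step: int, tool: str, arguments: dict, result: str,
              preview_chars: int = 200, **fields) -> Span:
    """Build a span, truncating the result and hashing the full text."""
    return Span(step=step, tool=tool, arguments=arguments,
                result_preview=result[:preview_chars],
                result_hash=content_hash(result), **fields)
```

Storing the hash of the untruncated result alongside the preview keeps the trace small while still letting a replay confirm it is reading the same observation.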
Raw traces become knowledge through a failure taxonomy. Five classes cover the overwhelming majority of agent failures, and each has a recognizable trace signature:
| Failure class | Trace signature | Usual root cause | First fix to try |
|---|---|---|---|
| Wrong tool | plausible call, wrong instrument for the goal | Overlapping or vague tool descriptions | Sharpen descriptions; merge near-duplicate tools (Ch 03) |
| Bad args | schema errors, or valid-but-wrong values | Loose schemas; required context truncated away | Tighten schemas + validate server-side; check what compaction dropped |
| Hallucinated state | acts on a file / ID / result that no observation ever returned | Model filled a gap from priors instead of reading | Force a read-before-write discipline; keep ground truth in context (Ch 02) |
| Gave up | premature "I cannot" with budget and tools remaining | Over-cautious prompt; one failed call treated as fatal | Prompt for retry-with-variation; surface remaining budget to the model |
| Loop | same call (or A→B→A cycle) with near-identical args, 3+ times | No new information entering context between attempts | Loop detector in the runtime; inject "you already tried this" (Ch 05) |
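The loop signature in the last row is simple enough to detect in the runtime. A minimal sketch, assuming calls are available as `(tool, args)` pairs; the window size is an assumption, and matching is exact after canonicalizing argument order, so it will miss calls whose arguments differ only slightly.

```python
import json
from collections import Counter


def call_key(tool: str, args: dict) -> str:
    """Canonical string for a call so that argument order does not matter."""
    return tool + ":" + json.dumps(args, sort_keys=True)


def detect_loop(calls: list[tuple[str, dict]],
                threshold: int = 3, window: int = 12) -> bool:
    """True if any identical call appears `threshold` or more times in the last `window` calls."""
    recent = [call_key(tool, args) for tool, args in calls[-window:]]
    return any(count >= threshold for count in Counter(recent).values())
```

Counting within a window also catches A→B→A cycles, because A alone reaches the threshold. Catching near-identical arguments would need a similarity measure on top of this.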
Tag every failed run with one (or more) of these classes — by judge, by heuristic, or by hand — and the vague complaint "the agent is flaky" becomes a Pareto chart with an owner per bar. In practice the distribution is never uniform, and the top class usually points at one tool description or one compaction rule.
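Turning tags into a Pareto table takes a counter and a running sum. In this sketch each failed run carries a list of class tags; the function name and the row layout are hypothetical.

```python
from collections import Counter


def pareto(tagged_runs: list[list[str]]) -> list[tuple[str, int, float]]:
    """Rows of (failure class, count, cumulative share), most frequent class first."""
    counts = Counter(tag for tags in tagged_runs for tag in tags)
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in counts.most_common():
        running += count
        rows.append((name, count, running / total))
    return rows


runs = [["loop"], ["loop", "bad_args"], ["wrong_tool"], ["loop"]]
for name, count, cumulative in pareto(runs):
    print(f"{name:12s}{count:4d}{cumulative:8.0%}")
```

Because a run may carry more than one tag, shares are computed over tags rather than over runs.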
Debugging an agent means replaying the trace. Because the trace stores every tool result, you can re-run the model deterministically against recorded observations — no live side effects — and bisect for the exact step where the agent's belief diverged from the world. Change the prompt, replay the same recorded run, and watch whether the divergence step moves. This is the agent engineer's equivalent of a time-travel debugger, and it is the single strongest argument for paying the storage bill on full traces.
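The replay idea can be sketched as a scan over the recorded steps. Here `policy` is a hypothetical stand-in for the model under a changed prompt and is assumed to be deterministic; observations always come from the recording, so no live tool is called.

```python
from typing import Callable, Optional

Action = tuple[str, str]                  # (tool name, serialized args)
Step = tuple[Action, str]                 # (action taken, observation returned)
Policy = Callable[[list[Step]], Action]   # history so far -> next action


def first_divergence(recorded: list[Step], policy: Policy) -> Optional[int]:
    """Index of the first step where the policy departs from the recording, else None."""
    history: list[Step] = []
    for index, (action, observation) in enumerate(recorded):
        if policy(history) != action:
            return index
        # Feed back the recorded observation instead of calling the live tool.
        history.append((action, observation))
    return None
```

Past the divergence step the recording no longer applies: the new action has no recorded observation, so the replay either stops there or falls back to a sandboxed tool.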
Traces are radioactive. They contain user data, secrets that transited tool calls, and occasionally credentials a tool should never have returned. Redact at write time (not query time), encrypt at rest, scope access, and set retention deliberately: 100% of traces for days-to-weeks, a sample plus all failures for the long term.
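Redacting at write time means the scrubber sits in the only code path that persists a span. The patterns below are illustrative assumptions, not a complete secret detector; a real deployment needs patterns for its own credential formats and should treat regexes as one layer among several.

```python
import re

# Illustrative patterns only; they will miss many real secret formats.
PATTERNS = [
    (re.compile(r"(?i)\b(authorization:\s*bearer)\s+\S+"), r"\1 [REDACTED]"),
    (re.compile(r"(?i)\b(api[_-]?key|password|token)(\s*[=:]\s*)\S+"), r"\1\2[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]


def redact(text: str) -> str:
    """Apply every pattern in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def write_span(sink: list, payload: str) -> None:
    """The only write path: nothing reaches storage unredacted."""
    sink.append(redact(payload))
```

Funnelling all writes through one function is the design point: query-time redaction leaves the raw secret on disk, while write-time redaction means it was never stored.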
Cost & token accounting
An agent's bill is a sum over loop iterations, and each iteration re-sends the (growing) context. Two prices apply to input: the full rate for fresh tokens and a deep discount — commonly around 10× cheaper — for tokens served from the provider's prefix cache, which is why cache-friendly context layout (Ch 02) is a line item, not a nicety:
# EQ A6.2 end to end: cost per run is vanity, cost per resolved is truth
S, IN_TOK, OUT_TOK, H = 20, 12_000, 700, 0.60  # steps, in/out per step, cache hit
TIERS = {"frontier": (3.00, 15.00, 0.65),  # $/Mtok in, $/Mtok out, resolution
         "mid": (0.80, 4.00, 0.55),
         "small": (0.15, 0.60, 0.30)}
def run_cost(p_in, p_out):  # cached input billed at 10%
    inp = S * IN_TOK * (H * 0.1 * p_in + (1 - H) * p_in) / 1e6
    return inp + S * OUT_TOK * p_out / 1e6
print(f"{'tier':10s}{'$/run':>8s}{'resolve':>9s}{'$/resolved':>12s}{'monthly@10K':>13s}")
for name, (p_in, p_out, resolve) in TIERS.items():
    c = run_cost(p_in, p_out)
    print(f"{name:10s}{c:8.3f}{resolve:9.0%}{c/resolve:12.3f}{10_000*c:13,.0f}")
# tier routing: 14 mechanical steps on small, 6 decision steps on frontier
c_routed = (14 / S) * run_cost(0.15, 0.60) + (6 / S) * run_cost(3.00, 15.00)
for resolve in (0.60, 0.20):
    print(f"routed, resolution {resolve:.0%}: ${c_routed:.3f}/run "
          f"-> ${c_routed/resolve:.3f}/resolved")
print("\nrouting at held resolution beats frontier ($0.30 vs $0.83); the same")
print("routing at collapsed resolution loses to it ($0.90): never approve a")
print("routing change on cost per run; approve it on cost per resolved task")
Cost per resolved task is the real KPI because it correctly punishes false economies. A worked example with the instrument's illustrative prices — 20 steps, 12K average input and 700 output tokens per step, 60% cache hit rate. All-frontier ($3.00 in / $15.00 out per Mtok): about $0.54 per run; at a 65% resolution rate, ≈ $0.83 per resolved task, or ≈ $5,400 a month at 10K runs. Now route the 14 mechanical steps (file reads, greps, summarization) to the small tier ($0.15 / $0.60) and keep the 6 decision steps on frontier: the run drops to ≈ $0.18 — 3× cheaper. If resolution holds near 60%, cost per resolved task falls to ≈ $0.30 and the routing pays. If resolution collapses to 20% — which sloppy down-tiering absolutely can do — cost per resolved task is ≈ $0.90 and the "savings" made you worse off than frontier-everywhere. Never approve a routing change on cost per run; approve it on cost per resolved task, re-measured.
Token accounting belongs in the trace (§6.5), per step, not just per run — it is how you discover that one chatty tool returns 40K tokens of JSON nobody reads, or that a prompt edit silently broke prefix caching and doubled the bill. Budget caps (next section) are enforced from these same counters, inside the loop, in real time.
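Enforcing caps "inside the loop" means checking the counters before every step, not after the run. A minimal sketch: the `Budget` class, the cap names, and the `step` callable (which returns whether the run is done plus the tokens and dollars it just spent) are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_dollars: float
    max_seconds: float
    steps: int = 0
    tokens: int = 0
    dollars: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, dollars: float) -> None:
        """Record one completed step using the same counters the trace stores."""
        self.steps += 1
        self.tokens += tokens
        self.dollars += dollars

    def tripped(self) -> Optional[str]:
        """Name of the first cap reached, or None if the run may continue."""
        if self.steps >= self.max_steps:
            return "steps"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.dollars >= self.max_dollars:
            return "dollars"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock"
        return None


def run_with_budget(step: Callable[[], tuple[bool, int, float]], budget: Budget) -> str:
    """Call `step` until it reports done or a cap trips."""
    while True:
        cap = budget.tripped()
        if cap is not None:
            return f"budget_exceeded:{cap}"   # caller summarizes, stops, and logs the class
        done, tokens, dollars = step()
        budget.charge(tokens, dollars)
        if done:
            return "done"
```

Returning the name of the tripped cap lets the trace record which ceiling ended the run, so cap hits can be tallied as a failure class of their own.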
The production checklist
Everything in this volume condenses to six controls. An agent missing any of them is a demo, whatever the deck says:
| Control | What it is | Trip-wire / discipline |
|---|---|---|
| Eval gate | No prompt, model, tool, or scaffold change ships without a green golden-set run | Block on any regression beyond the suite's measured noise floor — not on "looks fine" |
| Budget caps | Per-run ceilings on steps, tokens, dollars, and wall-clock, enforced inside the loop | Cap hit → graceful summarize-and-stop, logged as its own failure class |
| Kill switch | One flag halts new runs and drains in-flight ones | Fire-drill it quarterly; a kill switch you have never pulled is a hypothesis |
| Trace retention | 100% of traces for days–weeks; all failures + a sample, long-term; redacted at write time | You cannot replay what you discarded; you cannot leak what you redacted |
| Regression suite | Golden set + one new task per fixed incident, run nightly and pre-deploy | The suite only grows; deletions require a written reason |
| Incident playbook | Failure taxonomy → owner → rollback procedure, written before the incident | Every postmortem ends by adding a golden task and, where possible, a runtime guard |
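The eval gate's trip-wire, "any regression beyond the suite's measured noise floor", needs the noise floor as a number. One way to get it, sketched below, is to run the unchanged baseline on the golden set several times and use the spread of its scores; the two-standard-deviation tolerance and the function names are assumptions.

```python
from statistics import mean, stdev


def noise_floor(baseline_scores: list[float], z: float = 2.0) -> float:
    """Tolerated drop: z sample standard deviations of repeated baseline suite scores."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to measure noise")
    return z * stdev(baseline_scores)


def gate_passes(baseline_scores: list[float], candidate_score: float,
                z: float = 2.0) -> bool:
    """True if the candidate is no worse than the baseline mean minus the noise floor."""
    return candidate_score >= mean(baseline_scores) - noise_floor(baseline_scores, z)


baseline = [0.70, 0.72, 0.68, 0.71, 0.69]   # illustrative suite pass rates
print(gate_passes(baseline, 0.68))          # inside the noise band
print(gate_passes(baseline, 0.65))          # a drop larger than the measured noise
```

A single candidate run is itself noisy, so a stricter gate would average several candidate runs before comparing; this sketch keeps the comparison to one score for brevity.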
The through-line of this chapter — and this volume — is a single habit: close the loop. Traces feed the taxonomy, the taxonomy feeds the golden set, the golden set gates the next change, and the cost accounting tells you whether any of it is worth running. Agents do not become reliable by being built well once; they become reliable by being measured forever.
Four volumes of theory earn you exactly nothing until they survive contact with practice. THE GYM is where that happens: drills across all four volumes, from foundations through prompting to agent engineering, with instruments scoring you instead of the model. Go lift.
Further reading
- Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code. Introduces HumanEval and the pass@k metric this chapter reads honestly.
- Jimenez, C. E. et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? The benchmark that set the bar for end-state-verified agent tasks.
- Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. A multi-environment suite for measuring agents across interactive tasks.
- Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. The reference study on using models as judges, including their biases; it grades chat responses, and this chapter carries the method over to trajectories.
- Sigelman, B. et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. The distributed-tracing model behind agent observability and spans.
- Ribeiro, M. T. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Argues for capability-targeted test suites over single aggregate scores.