VOLUME IV — AGENT ENGINEERING · CHAPTER 06 / 06

Evals, Observability & Cost

The first five chapters of this volume covered how to build agents. This one covers how to know whether they work, why they fail, and what they cost. In production you ship what you measure, and an unmeasured agent is an outage with a head start. We climb the eval pyramid, derive the unbiased pass@k estimator, audit a trajectory judge, make traces first-class artifacts, and close on the KPI that decides everything: cost per resolved task.

LEVEL: ADVANCED · READING TIME: ≈ 25 MIN · BUILDS ON: VOL IV CH 01–05, VOL III CH 07 · INSTRUMENTS: PASS@K EXPLORER, COST OF A TASK
6.1

Why agent evals are hard

Classical LLM evaluation assumes a clean contract: one input, one output, one reference to compare against. Agents break every clause of that contract at once:

  • Nondeterminism is structural, not incidental. Sampling temperature, tool side effects, live APIs, mutable filesystems, rate limits — run the same task twice and you get two different trajectories, sometimes two different outcomes. A single-run eval of an agent is a coin flip wearing a lab coat. The fix is statistical: multiple seeded runs per task, and reported variance, not just a point estimate (the noise-floor logic of Vol III · EQ P7.1 applies with larger error bars).
  • Errors compound across steps. A model that takes the right action 98% of the time finishes a 30-step task at \(0.98^{30} \approx 0.55\) if errors are fatal and independent. They are neither, fully — recovery loops (Ch 05) buy back some of that — but the geometry is real: per-step metrics wildly overstate end-to-end reliability, and small per-step gains move task success a lot.
  • There is no single right answer. Two correct trajectories for "fix this bug" may share zero tool calls and produce different-but-both-valid patches. String-matching transcripts is hopeless; you must verify outcomes (does the test suite pass? does the row exist?) or score process quality with rubrics — never diff trajectories against a golden transcript.
  • The world is part of the system under test. A flaky dependency, a changed API response, a repo that drifted since the task was authored — all show up as "agent regressions." Serious harnesses pin the environment: containerized repos, recorded API fixtures, snapshot databases.
  • Evals are expensive. One end-to-end run can take minutes and cost real dollars, so statistical power is something you budget for, not something you get for free. This is exactly why the next section is a pyramid and not a single metric.

The consequence: agent evaluation is a layered system you engineer, with the same care as the agent itself. Teams that treat it as an afterthought discover their agent's true success rate from customer tickets.

A model takes the right action 97% of the time per step. If errors were fatal and independent over a 30-step task, what end-to-end success rate would \(0.97^{30}\) predict? Give a probability between 0 and 1.
\(0.97^{30}\): \(0.97^{10} = 0.7374\), so \(0.97^{30} = 0.7374^{3} \approx 0.401\). Per-step metrics wildly overstate end-to-end reliability — which is exactly why a 97%-per-step headline is not a 97%-per-task agent, and why the pyramid measures whole tasks at the top. The answer is 0.401.
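
The compounding arithmetic is easy to tabulate. A minimal sketch, assuming fatal and independent per-step errors (the same idealization the text uses); the function name `end_to_end` is illustrative, not part of any harness:

```python
def end_to_end(p_step: float, steps: int) -> float:
    """Task success rate if every step must succeed, independently."""
    return p_step ** steps

# Small per-step gains move the 30-step figure a lot.
for p in (0.95, 0.97, 0.98, 0.99):
    print(f"per-step {p:.0%} -> 30-step task {end_to_end(p, 30):.3f}")
```

Moving from 98% to 99% per step lifts the 30-step figure from about 0.55 to about 0.74, which is the geometry behind "small per-step gains move task success a lot."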
6.2

The eval pyramid

Like the testing pyramid in software, agent evals trade fidelity against speed. Cheap, deterministic checks run thousands of times a day and catch most regressions; expensive end-to-end runs are the ground truth you can only afford nightly. Each layer catches what the one below cannot see:

FIG A6.A · THE EVAL PYRAMID: VOLUME DOWN, FIDELITY UP
Layers, top to bottom: L3 · END-TO-END TASK SUCCESS (harness verifies world state · nightly · dollars/run); L2 · TRAJECTORY EVALS (rubric or judge over the full step sequence); L1 · TOOL-CALL ACCURACY (right tool, right args, schema-valid · vs golden calls); L0 · PROMPT EVALS (deterministic asserts on single completions · CI on every commit). Axes: fidelity and cost rise toward the top; volume and speed rise toward the base.
| Layer | Unit scored | Oracle | Cost / run | What it misses |
| --- | --- | --- | --- | --- |
| L0 · Prompt evals | one completion | asserts: contains / parses / classifies correctly | ~free | Everything multi-step; passes while the agent loops forever |
| L1 · Tool-call accuracy | one decision | golden tool call: name + args match (semantically, not byte-wise) | ¢ | Compounding: 95% per-step ≠ 95% per-task |
| L2 · Trajectory evals | the step sequence | rubric, scored by humans or an LLM judge (§6.4) | ¢¢ | Judge bias; a beautiful trajectory can still end in the wrong answer |
| L3 · End-to-end success | final world state | harness verification: tests pass, record exists, file correct | $ | The why: a pass/fail bit with no diagnosis |
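
The L1 oracle compares a predicted tool call against a golden one without being fooled by formatting. A minimal sketch, assuming tool calls are plain dicts with `name` and `args` keys (a hypothetical shape) and that key order is the only harmless difference; real semantic matching (path normalization, equivalent queries) needs per-tool rules on top of this:

```python
import json

def canonical(call: dict) -> str:
    """Serialize a tool call with sorted keys so key order and whitespace do not matter."""
    return json.dumps({"name": call["name"], "args": call["args"]}, sort_keys=True)

def tool_call_matches(predicted: dict, golden: dict) -> bool:
    """True when tool name and arguments agree after canonicalization."""
    return canonical(predicted) == canonical(golden)

golden = {"name": "read_file", "args": {"path": "src/app.py", "max_bytes": 4096}}
reordered = {"name": "read_file", "args": {"max_bytes": 4096, "path": "src/app.py"}}
wrong_path = {"name": "read_file", "args": {"path": "src/main.py", "max_bytes": 4096}}

print(tool_call_matches(reordered, golden))    # same call, different key order
print(tool_call_matches(wrong_path, golden))   # different argument value
```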

The raw material for every layer is the same: a golden set — a small, frozen, versioned collection of tasks with verified outcomes. The discipline you learned for prompts (Vol III Ch 07) transfers intact: 30–200 tasks drawn from real traffic and real failures, decontaminated from anything the model might have memorized, pinned alongside the prompt and tool versions they test. Every production incident you fix becomes a new golden task. The set is never edited casually — when it changes, every historical score changes meaning.

The pyramid is also a debugging router. L3 fails but L2 looks clean? Suspect the environment or the verifier. L2 degrades while L1 holds? The individual decisions are fine but the plan is drifting — look at context engineering (Ch 02). L1 drops after a tool-description edit? You just measured the blast radius of a one-line change.
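
The routing rules in this paragraph fit in a few lines. A sketch only: the function name `route` and the boolean inputs (each meaning "this layer is within its noise floor") are assumptions made for illustration:

```python
def route(l1_ok: bool, l2_ok: bool, l3_ok: bool) -> str:
    """Map which pyramid layers regressed to the first place to look."""
    if not l1_ok:
        return "L1 dropped: check recent tool-description or schema edits"
    if not l2_ok:
        return "L2 degraded, L1 holds: plan is drifting, check context engineering"
    if not l3_ok:
        return "L3 fails, L2 clean: suspect the environment or the verifier"
    return "all layers green"

print(route(l1_ok=True, l2_ok=True, l3_ok=False))
```

Checking the lowest layer first is a design choice: a regression in individual decisions usually explains the failures above it.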

6.3

pass@k, read honestly

pass@k answers: if I let the agent attempt the task \(k\) times, what is the probability that at least one attempt succeeds? Estimating it naively is a trap. You could sample \(k\) runs, check if any passed, and average — but that wastes samples and has brutal variance. You could compute the per-attempt success rate \(c/n\) from \(n\) runs and plug it into \(1-(1-c/n)^k\) — but that estimator is biased low: it treats your \(k\) hypothetical draws as resampling with replacement from the empirical rate, and the bias is worst exactly where benchmarks live (small \(n\), small \(c\)). The fix, popularized by the Codex paper (Chen et al., 2021), is combinatorial:

EQ A6.1 — UNBIASED pass@k ESTIMATOR $$ \widehat{\text{pass@}k} \;=\; \mathop{\mathbb{E}}_{\text{tasks}} \left[\, 1 \;-\; \frac{\dbinom{n-c}{k}}{\dbinom{n}{k}} \,\right] $$
Per task: draw \(n \ge k\) samples, count \(c\) correct. \(\binom{n-c}{k}/\binom{n}{k}\) is exactly the probability that a uniformly random \(k\)-subset of your \(n\) samples contains zero successes — so its complement is the chance at least one of \(k\) passes. This is unbiased for any \(n \ge k\); the plug-in \(1-(1-c/n)^k\) systematically underestimates. If \(n - c < k\), the numerator is zero and the estimate is exactly 1. Compute it as the product \(\prod_{i=0}^{k-1}\frac{n-c-i}{\,n-i\,}\) — never with raw factorials, which overflow past \(n \approx 170\).
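
The product form and the bias claim can both be checked directly. A minimal sketch: `pass_at_k` uses the running product from EQ A6.1, `plug_in` is the naive estimator, and `expectation` (an illustrative helper) computes each estimator's exact expected value when \(c \sim \text{Binomial}(n, p)\), so no simulation noise is involved:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator via the running product; no factorials involved."""
    if n - c < k:
        return 1.0
    miss = 1.0
    for i in range(k):
        miss *= (n - c - i) / (n - i)
    return 1.0 - miss

def plug_in(n: int, c: int, k: int) -> float:
    """Naive estimator: plug the empirical rate c/n into 1 - (1 - p)^k."""
    return 1.0 - (1.0 - c / n) ** k

def expectation(estimator, n: int, k: int, p: float) -> float:
    """Exact expected value of an estimator when c ~ Binomial(n, p)."""
    return sum(comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
               for c in range(n + 1))

n, k, p = 10, 5, 0.1
truth = 1.0 - (1.0 - p) ** k
print(f"true pass@{k} = {truth:.4f}")
print(f"E[unbiased] = {expectation(pass_at_k, n, k, p):.4f}")
print(f"E[plug-in]  = {expectation(plug_in, n, k, p):.4f}")
```

The unbiased estimator's expectation matches the true value to floating-point precision; the plug-in's expectation falls below it.
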
You drew \(n = 20\) samples for a task and \(c = 5\) passed. By EQ A6.1, the unbiased pass@2 is \(1 - \dfrac{n-c}{n}\cdot\dfrac{n-c-1}{n-1}\). What is pass@2?
The probability a random 2-subset has zero successes is \(\dfrac{15}{20}\cdot\dfrac{14}{19} = 0.75 \times 0.7368 = 0.5526\). So pass@2 \(= 1 - 0.5526 = 0.447\). Note it tops the per-attempt rate of \(5/20 = 0.25\) — a second attempt with an oracle to pick the winner is real lift, which is why any pass@k headline must state its \(k\). The answer is 0.447.
PYTHON · RUNNABLE IN-BROWSER
# the unbiased pass@k estimator (EQ A6.1), exact, via math.comb
from math import comb

def pass_at_k(n, c, k):              # 1 - C(n-c, k) / C(n, k)
    if n - c < k: return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n = 20
print(f"n = {n} samples per task")
print(f"{'c':>4s}{'pass@1':>9s}{'pass@5':>9s}{'pass@10':>9s}")
for c in (2, 5, 10):
    p1, p5, p10 = (pass_at_k(n, c, k) for k in (1, 5, 10))
    print(f"{c:4d}{p1:9.1%}{p5:9.1%}{p10:9.1%}")

ks = list(range(1, 16))
curve = [pass_at_k(n, 5, k) for k in ks]        # the c = 5 curve
if "plot_xy" in globals():                      # plot_xy is supplied by the in-browser runtime
    plot_xy(ks, curve)

p1, p8 = pass_at_k(n, 5, 1), pass_at_k(n, 5, 8)
print(f"\nc = 5: pass@1 = {p1:.1%} but pass@8 = {p8:.1%} — "
      f"a {p8/p1:.1f}x flattery factor")
print("pass@8 measures the harness's attempt budget plus an oracle to pick")
print("the winner — a headline that omits k is marketing, not measurement")

Now the honesty clause. pass@1 is what a production agent experiences: one attempt, no second chances. pass@8 is what a system with eight attempts and a free oracle to pick the winner experiences. A model that solves a task 5% of the time per attempt posts pass@1 = 5% but pass@8 ≈ 34% — a 7× flattery factor, earned entirely by the harness, not the model. pass@8 is a legitimate number when you actually run best-of-\(k\) with a cheap verifier (unit tests, schema checks) to select among attempts. It is marketing when the headline omits the \(k\). When you read any benchmark claim, demand four facts: k, n, the sampling temperature (pass@1 at \(T=0\) and pass@1 averaged over \(T=0.8\) samples are different quantities), and the scaffold.

That last one dominates more than most people expect. SWE-bench-style harness evals are the L3 gold standard for coding agents: the harness drops the agent into a containerized repo at a pinned commit, the agent produces a patch, and the harness runs the repo's own tests — the previously failing ones must now pass (FAIL_TO_PASS) without breaking the rest (PASS_TO_PASS). Resolution is verified by execution, not by judgment. But the published number is a property of the agent system — model + scaffold + tools + retry budget — and leaderboard climbs mix all four. The Verified subset exists because hundreds of original tasks turned out to be unsolvable or under-specified; contamination remains a live concern for any repo that predates a model's training cutoff. Read harness numbers as: "this scaffold, this model, this k." Nothing more transfers.
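
The resolution rule described here reduces to two conjunctions. A sketch of that rule only, not the harness's actual code: the function name, the dict-of-booleans result format, and the test names are all hypothetical:

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True only if every target test now passes and no guarded test broke.
    A test missing from the results counts as a failure."""
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    intact = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and intact

results = {"test_parse_empty": True, "test_parse_basic": True, "test_render": False}
# The patch fixed the target test but broke an unrelated one: not resolved.
print(is_resolved(results, ["test_parse_empty"], ["test_parse_basic", "test_render"]))
```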

INSTRUMENT A6.1 · PASS@K EXPLORER (EQ A6.1 · EXACT ESTIMATOR)
Readouts: PASS@1 (= c/n) · PASS@k AT SELECTED k · FLATTERY (PASS@k ÷ PASS@1)
WEAK MODEL: 5% per-attempt success becomes ≈35% at k = 8 — sampling does the work. STRONG MODEL: pass@8 saturates near 100% and stops discriminating between good and great. LUCKY BENCHMARK: with only n = 10 samples and c = 1, pass@8 reads 80% off a single success — small-n harnesses make weak models look heroic. The curve is the exact estimator, not a fit.
6.4

LLM-judged trajectories

End-to-end harnesses output one bit per run. When the bit is 0, you need to know which step went wrong — and grading hundreds of fifty-step transcripts is not a job humans accept twice. So you hire a judge model to score trajectories. The craft is to make the judge grade checkable per-step properties, not vibes:

| Rubric item (per step) | Form | What it catches |
| --- | --- | --- |
| Tool choice defensible? | binary + cite the step | Search-when-it-should-read, write-before-verify |
| Arguments grounded? | binary | Args invented rather than copied from a prior observation (the hallucinated-ID classic) |
| Result actually read? | binary | Next action contradicts what the tool just returned |
| State re-verified after mutation? | binary | Fire-and-forget writes; assuming success on a 500 |
| Termination correct? | binary, end of run | Declared victory early; gave up with budget remaining; asked the user what a tool could answer |

Binary, citable items keep the judge auditable: every "no" must point to a step number, which a human can check in seconds. Aggregate to a per-run score, then track the distribution across the golden set.
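
One way to hold the judge to that contract is to reject any failing verdict that cites no step. A minimal sketch, assuming verdicts arrive as simple records; the `Verdict` type, its field names, and the equal-weight average are illustrative choices, not a prescribed scoring scheme:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Verdict:
    item: str                  # rubric item, e.g. "arguments_grounded"
    passed: bool
    cited_step: Optional[int]  # step number backing the verdict

def run_score(verdicts: list[Verdict]) -> float:
    """Fraction of verdicts that passed; rejects any 'no' without a step citation."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    for v in verdicts:
        if not v.passed and v.cited_step is None:
            raise ValueError(f"failing verdict for {v.item!r} cites no step")
    return sum(v.passed for v in verdicts) / len(verdicts)

verdicts = [
    Verdict("tool_choice_defensible", True, 3),
    Verdict("arguments_grounded", False, 7),
    Verdict("result_read", True, 7),
    Verdict("termination_correct", True, None),
]
score = run_score(verdicts)
print(f"run score: {score:.2f}")
```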

Every judge bias from Vol III Ch 07 applies — doubly. Trajectories are long, so the judge inherits the lost-in-the-middle problem and quietly skims steps 12 through 38. Style bias gets worse: an agent that narrates its failure confidently ("I have verified the fix") outscores one that succeeds tersely, unless the rubric forces evidence citations. Position bias contaminates pairwise trajectory comparisons exactly as it does single completions — judge both orders or don't bother. Self-preference is sharper still, because the judge recognizes its own family's action style, not just prose style: judge with a different model family than the agent. And add one bias unique to this setting: outcome leakage — if the judge can see that the run ended in success, it retroactively scores every step as reasonable. Strip the final outcome when you want a genuine process grade.
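
"Judge both orders" is mechanical enough to sketch. Here a judge is any callable that sees two trajectories and answers `"first"` or `"second"` (a hypothetical interface); the two toy judges stand in for a real model call:

```python
from typing import Callable

Judge = Callable[[str, str], str]   # returns "first" or "second"

def judge_both_orders(judge: Judge, a: str, b: str) -> str:
    """Return 'a' or 'b' only if the verdict survives swapping the order; else 'tie'."""
    forward = judge(a, b)            # a shown first
    backward = judge(b, a)           # b shown first
    a_wins = forward == "first" and backward == "second"
    b_wins = forward == "second" and backward == "first"
    return "a" if a_wins else "b" if b_wins else "tie"

def always_first(x: str, y: str) -> str:   # toy judge with pure position bias
    return "first"

def longer_wins(x: str, y: str) -> str:    # toy judge with a content-based rule
    return "first" if len(x) >= len(y) else "second"

print(judge_both_orders(always_first, "trajectory A", "trajectory B"))
print(judge_both_orders(longer_wins, "a long trajectory", "short"))
```

A purely position-biased judge collapses to a tie under this check, while a judge whose preference depends on content keeps its verdict.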

Calibrate before you trust: hand-label 50–100 trajectories, measure judge–human agreement per rubric item, and only automate the items where agreement is high. Keep a standing human spot-audit (5–10% of judged runs, forever). A judge you never audit is a metric drifting in the dark.
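
Agreement is worth measuring with a chance-corrected statistic as well as the raw rate. A minimal sketch for one binary rubric item, using Cohen's kappa (a swapped-in standard statistic, not something the rubric above prescribes); the label lists are made-up illustration data:

```python
def agreement(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired binary labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    pj, ph = sum(judge) / n, sum(human) / n
    expected = pj * ph + (1 - pj) * (1 - ph)      # agreement expected by chance
    if expected == 1.0:                           # both raters used a single label
        return observed, float("nan")             # kappa is undefined here
    return observed, (observed - expected) / (1 - expected)

judge_labels = [True, True, True, False, False, True, False, True, True, False]
human_labels = [True, True, False, False, False, True, False, True, True, True]
raw, kappa = agreement(judge_labels, human_labels)
print(f"raw agreement {raw:.2f}, kappa {kappa:.2f}")
```

In this toy data the raw agreement of 0.80 shrinks to a kappa of about 0.58 once chance agreement is removed, which is why the raw rate alone can flatter a judge.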

6.5

Observability: the trace is the artifact

For an agent, the log is not a debugging aid — the trace is the primary artifact the system produces, more durable than any single answer. A production-grade trace records, for every step: the model and prompt version, the rendered context (or a hash plus a pointer to it), the tool called, the exact arguments, the result (truncated, with a content hash), input and output token counts, cache hit tokens, latency, and the stop reason. Structure it as a span tree — sub-agents (Ch 05) nest naturally — and store it where engineers can query it, not where it rotates out after 24 hours.
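
The fields listed above map naturally onto a record type. A sketch of one possible shape, not a standard schema: the `Span` class, `make_span` helper, placeholder model and version strings, and the 200-character preview limit are all assumptions made for illustration:

```python
from dataclasses import dataclass, field
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint so a truncated result can still be matched to its payload."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class Span:
    step: int
    model: str
    prompt_version: str
    context_hash: str          # hash of the rendered context
    tool: str
    arguments: dict
    result_preview: str        # truncated result
    result_hash: str           # hash of the full result
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    stop_reason: str
    children: list["Span"] = field(default_factory=list)   # sub-agent spans nest here

def make_span(step: int, tool: str, arguments: dict, result: str,
              input_tokens: int, output_tokens: int, preview_chars: int = 200) -> Span:
    """Build a span from one tool call; model, version, and timing are placeholders."""
    return Span(step=step, model="example-model", prompt_version="v1",
                context_hash=content_hash(f"context@{step}"), tool=tool,
                arguments=arguments, result_preview=result[:preview_chars],
                result_hash=content_hash(result), input_tokens=input_tokens,
                output_tokens=output_tokens, cached_tokens=0, latency_ms=0.0,
                stop_reason="tool_use")

def total_input_tokens(span: Span) -> int:
    """Sum input tokens over a span and all nested sub-agent spans."""
    return span.input_tokens + sum(total_input_tokens(c) for c in span.children)

root = make_span(1, "delegate", {"task": "find failing test"}, "sub-agent finished", 1200, 80)
root.children.append(make_span(1, "grep", {"pattern": "FAILED"}, "x" * 5000, 900, 40))
print(total_input_tokens(root))
```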

Raw traces become knowledge through a failure taxonomy. Five classes cover the overwhelming majority of agent failures, and each has a recognizable trace signature:

| Failure class | Trace signature | Usual root cause | First fix to try |
| --- | --- | --- | --- |
| Wrong tool | plausible call, wrong instrument for the goal | Overlapping or vague tool descriptions | Sharpen descriptions; merge near-duplicate tools (Ch 03) |
| Bad args | schema errors, or valid-but-wrong values | Loose schemas; required context truncated away | Tighten schemas + validate server-side; check what compaction dropped |
| Hallucinated state | acts on a file / ID / result that no observation ever returned | Model filled a gap from priors instead of reading | Force a read-before-write discipline; keep ground truth in context (Ch 02) |
| Gave up | premature "I cannot" with budget and tools remaining | Over-cautious prompt; one failed call treated as fatal | Prompt for retry-with-variation; surface remaining budget to the model |
| Loop | same call (or A→B→A cycle) with near-identical args, 3+ times | No new information entering context between attempts | Loop detector in the runtime; inject "you already tried this" (Ch 05) |

Tag every failed run with one (or more) of these classes — by judge, by heuristic, or by hand — and the vague complaint "the agent is flaky" becomes a Pareto chart with an owner per bar. In practice the distribution is never uniform, and the top class usually points at one tool description or one compaction rule.
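
Both the heuristic tagging and the Pareto tally are a few lines each. A minimal sketch, assuming tool calls are JSON-serializable dicts and treating "near-identical" as "identical after key sorting" (a deliberate simplification); the tag strings are illustrative:

```python
import json
from collections import Counter

def detect_loop(calls: list[dict], threshold: int = 3) -> bool:
    """True if any identical (tool, args) call occurs `threshold` or more times."""
    counts = Counter(json.dumps(c, sort_keys=True) for c in calls)
    return any(n >= threshold for n in counts.values())

calls = [
    {"name": "search", "args": {"q": "config path"}},
    {"name": "read_file", "args": {"path": "a.yaml"}},
    {"name": "search", "args": {"q": "config path"}},
    {"name": "read_file", "args": {"path": "a.yaml"}},
    {"name": "search", "args": {"q": "config path"}},
]
print(detect_loop(calls))   # the A→B→A→B→A cycle repeats the search call three times

# Pareto tally over tagged failures, most frequent class first.
tags = ["loop", "bad_args", "loop", "wrong_tool", "loop", "gave_up", "bad_args"]
pareto = Counter(tags).most_common()
print(pareto)
```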

Debugging an agent means replaying the trace. Because the trace stores every tool result, you can re-run the model deterministically against recorded observations — no live side effects — and bisect for the exact step where the agent's belief diverged from the world. Change the prompt, replay the same recorded run, and watch whether the divergence step moves. This is the agent engineer's equivalent of a time-travel debugger, and it is the single strongest argument for paying the storage bill on full traces.
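
Replay can be sketched with the model reduced to a deterministic function of the observations seen so far. Everything here is a stand-in: `policy` replaces the real model call, the trace is a list of hypothetical `action`/`observation` dicts, and the scan is linear rather than a true bisection:

```python
from typing import Callable, Optional

def first_divergence(trace: list[dict],
                     policy: Callable[[list[str]], str]) -> Optional[int]:
    """Feed recorded observations to `policy`; return the first step index where
    its action differs from the recorded one, or None if the replay matches."""
    observations: list[str] = []
    for i, step in enumerate(trace):
        if policy(observations) != step["action"]:
            return i
        observations.append(step["observation"])   # recorded result, no live call
    return None

trace = [
    {"action": "read_config", "observation": "port: 8080"},
    {"action": "grep_port", "observation": "server.py:12"},
    {"action": "edit_port", "observation": "ok"},
]

def original_policy(obs: list[str]) -> str:   # reproduces the recorded run
    return ["read_config", "grep_port", "edit_port"][len(obs)]

def edited_policy(obs: list[str]) -> str:     # e.g. after a prompt change
    return "read_config" if not obs else "edit_port"

print(first_divergence(trace, original_policy))
print(first_divergence(trace, edited_policy))
```

The first replay matches the recording end to end; the edited policy diverges at step index 1, where it skips the grep and edits immediately.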

CAVEAT

Traces are radioactive. They contain user data, secrets that transited tool calls, and occasionally credentials a tool should never have returned. Redact at write time (not query time), encrypt at rest, scope access, and set retention deliberately: 100% of traces for days-to-weeks, a sample plus all failures for the long term.
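
Redacting at write time means the raw value never reaches storage. A sketch under loud assumptions: the two regular expressions are illustrative only and nowhere near a complete secret detector, and the in-memory list stands in for a real trace store:

```python
import re

# Illustrative patterns only; a real deployment needs a vetted, broader set.
REDACTIONS = [
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
]

def redact(text: str) -> str:
    """Apply every redaction pattern in order."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def write_span(store: list[str], raw_result: str) -> None:
    """Redact before the text reaches storage, so the raw value is never persisted."""
    store.append(redact(raw_result))

store: list[str] = []
write_span(store, "Authorization: Bearer abc.def-123 for alice@example.com")
print(store[0])
```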

6.6

Cost & token accounting

An agent's bill is a sum over loop iterations, and each iteration re-sends the (growing) context. Two prices apply to input: the full rate for fresh tokens and a deep discount — commonly around 10× cheaper — for tokens served from the provider's prefix cache, which is why cache-friendly context layout (Ch 02) is a line item, not a nicety:

EQ A6.2 — COST OF A RUN, COST OF A RESULT $$ C_{\text{run}} \;=\; \sum_{s=1}^{S} \Big[\, n^{(s)}_{\text{in}} \big( h\, p_{\text{cached}} + (1-h)\, p_{\text{in}} \big) \;+\; n^{(s)}_{\text{out}}\, p_{\text{out}} \,\Big], \qquad C_{\text{resolved}} \;=\; \frac{C_{\text{run}}}{\Pr[\text{resolved}]} $$
\(S\) steps; \(n^{(s)}_{\text{in}}, n^{(s)}_{\text{out}}\) input/output tokens at step \(s\); \(h\) the cache hit rate; \(p_{\text{cached}} \approx 0.1\, p_{\text{in}}\) on most current pricing. Because context accumulates, \(n^{(s)}_{\text{in}}\) grows roughly linearly in \(s\) — so without compaction, run cost grows quadratically in step count: the sum of a linearly growing context is \(O(S^2)\). And the right-hand identity is the one executives should see: dividing by the resolution rate converts "cost per attempt" into cost per task actually solved — the only number comparable across models, scaffolds, and vendors.
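
EQ A6.2 with per-step token counts makes the quadratic growth visible. A minimal sketch: the base context of 2,000 tokens, growth of 1,000 tokens per step, and the prices are illustrative assumptions; `cached_ratio=0.1` encodes \(p_{\text{cached}} \approx 0.1\, p_{\text{in}}\):

```python
def run_cost(n_in: list[int], n_out: list[int], h: float,
             p_in: float, p_out: float, cached_ratio: float = 0.1) -> float:
    """EQ A6.2 with per-step token counts; prices in dollars per million tokens."""
    blended_in = h * cached_ratio * p_in + (1 - h) * p_in
    return sum(i * blended_in + o * p_out for i, o in zip(n_in, n_out)) / 1e6

def growing_context(steps: int, base: int = 2_000, growth: int = 1_000) -> list[int]:
    """Input tokens per step when each step appends `growth` tokens to the context."""
    return [base + growth * s for s in range(steps)]

for steps in (10, 20, 40):
    cost = run_cost(growing_context(steps), [700] * steps, h=0.6, p_in=3.00, p_out=15.00)
    print(f"{steps:3d} steps -> ${cost:.3f} per run")
```

Doubling the step count from 20 to 40 multiplies total input tokens by about 3.7 under these assumptions, approaching 4× as accumulated context dominates the base.
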
An agent run costs \(C_{\text{run}} = \$0.54\) and resolves the task 65% of the time. By EQ A6.2, what is the cost per resolved task, \(C_{\text{resolved}} = C_{\text{run}} / \Pr[\text{resolved}]\), in dollars?
\(C_{\text{resolved}} = 0.54 / 0.65 \approx \$0.83\). Cost per run is vanity; dividing by the resolution rate gives the only figure comparable across models and scaffolds — a cheaper run that resolves less often can easily cost more per task solved. The answer is 0.83.
PYTHON · RUNNABLE IN-BROWSER
# EQ A6.2 end to end: cost per run is vanity, cost per resolved is truth
S, IN_TOK, OUT_TOK, H = 20, 12_000, 700, 0.60   # steps, in/out per step, cache hit
TIERS = {"frontier": (3.00, 15.00, 0.65),       # $/Mtok in, $/Mtok out, resolution
         "mid":      (0.80,  4.00, 0.55),
         "small":    (0.15,  0.60, 0.30)}

def run_cost(p_in, p_out):                      # cached input billed at 10%
    inp = S * IN_TOK * (H * 0.1 * p_in + (1 - H) * p_in) / 1e6
    return inp + S * OUT_TOK * p_out / 1e6

print(f"{'tier':10s}{'$/run':>8s}{'resolve':>9s}{'$/resolved':>12s}{'monthly@10K':>13s}")
for name, (p_in, p_out, resolve) in TIERS.items():
    c = run_cost(p_in, p_out)
    print(f"{name:10s}{c:8.3f}{resolve:9.0%}{c/resolve:12.3f}{10_000*c:13,.0f}")

# tier routing: 14 mechanical steps on small, 6 decision steps on frontier
c_routed = (14 / S) * run_cost(0.15, 0.60) + (6 / S) * run_cost(3.00, 15.00)
for resolve in (0.60, 0.20):
    print(f"routed, resolution {resolve:.0%}: ${c_routed:.3f}/run "
          f"-> ${c_routed/resolve:.3f}/resolved")
print("\nrouting at held resolution beats frontier ($0.30 vs $0.83); the same")
print("routing at collapsed resolution loses to it ($0.90) — never approve a")
print("routing change on cost per run; approve it on cost per resolved task")

Cost per resolved task is the real KPI because it correctly punishes false economies. A worked example with the instrument's illustrative prices — 20 steps, 12K average input and 700 output tokens per step, 60% cache hit rate. All-frontier ($3.00 in / $15.00 out per Mtok): about $0.54 per run; at a 65% resolution rate, ≈ $0.83 per resolved task, or ≈ $5,400 a month at 10K runs. Now route the 14 mechanical steps (file reads, greps, summarization) to the small tier ($0.15 / $0.60) and keep the 6 decision steps on frontier: the run drops to ≈ $0.18 — 3× cheaper. If resolution holds near 60%, cost per resolved task falls to ≈ $0.30 and the routing pays. If resolution collapses to 20% — which sloppy down-tiering absolutely can do — cost per resolved task is ≈ $0.90 and the "savings" made you worse off than frontier-everywhere. Never approve a routing change on cost per run; approve it on cost per resolved task, re-measured.

After routing the mechanical steps to a cheap tier, an agent run costs $0.18. At 10,000 runs per month, what is the monthly bill, in dollars?
Monthly \(= \$0.18 \times 10{,}000 = \$1{,}800\). But never approve this routing on cost per run alone — re-measure cost per resolved task first, because down-tiering that quietly drops the resolution rate can make the cheaper run more expensive per task solved. The answer is 1800.
INSTRUMENT A6.2 · COST OF A TASK (EQ A6.2 · ILLUSTRATIVE PRICE TIERS)
Readouts: COST / RUN · COST / RESOLVED TASK · MONTHLY @ 10K RUNS
Bars show cost per run for all three tiers at your dials (cached input billed at 10% of the input rate); the selected tier feeds the readouts. Push steps to 60 and watch output stay a rounding error while input dominates — then raise the cache hit rate and reclaim most of it. The trap to internalize: switching tiers moves the bar instantly, but the RESOLUTION RATE slider is your honesty dial — drop it to what the cheap tier actually achieves before celebrating. Prices are illustrative; the algebra is EQ A6.2 exactly.

Token accounting belongs in the trace (§6.5), per step, not just per run — it is how you discover that one chatty tool returns 40K tokens of JSON nobody reads, or that a prompt edit silently broke prefix caching and doubled the bill. Budget caps (next section) are enforced from these same counters, inside the loop, in real time.

6.7

The production checklist

Everything in this volume condenses to six controls. An agent missing any of them is a demo, whatever the deck says:

| Control | What it is | Trip-wire / discipline |
| --- | --- | --- |
| Eval gate | No prompt, model, tool, or scaffold change ships without a green golden-set run | Block on any regression beyond the suite's measured noise floor, not on "looks fine" |
| Budget caps | Per-run ceilings on steps, tokens, dollars, and wall-clock, enforced inside the loop | Cap hit → graceful summarize-and-stop, logged as its own failure class |
| Kill switch | One flag halts new runs and drains in-flight ones | Fire-drill it quarterly; a kill switch you have never pulled is a hypothesis |
| Trace retention | 100% of traces for days–weeks; all failures + a sample, long-term; redacted at write time | You cannot replay what you discarded; you cannot leak what you redacted |
| Regression suite | Golden set + one new task per fixed incident, run nightly and pre-deploy | The suite only grows; deletions require a written reason |
| Incident playbook | Failure taxonomy → owner → rollback procedure, written before the incident | Every postmortem ends by adding a golden task and, where possible, a runtime guard |
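
Of the six controls, budget caps are the easiest to show in code, because they live inside the loop. A minimal sketch: the `Budget` class, the `run` driver, and the `(done, tokens, dollars)` step signature are assumptions for illustration; a real runtime would summarize and log where the comment indicates:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional
import time

@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_dollars: float
    max_seconds: float
    steps: int = 0
    tokens: int = 0
    dollars: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, dollars: float) -> None:
        """Record one completed step against the counters."""
        self.steps += 1
        self.tokens += tokens
        self.dollars += dollars

    def cap_hit(self) -> Optional[str]:
        """Name of the first exhausted cap, or None if the run may continue."""
        if self.steps >= self.max_steps:
            return "steps"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.dollars >= self.max_dollars:
            return "dollars"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock"
        return None

def run(budget: Budget, step: Callable[[], tuple[bool, int, float]]) -> str:
    """Drive the loop; `step` returns (done, tokens_used, dollars_spent)."""
    while True:
        cap = budget.cap_hit()
        if cap is not None:
            return f"budget_exceeded:{cap}"   # summarize-and-stop would go here
        done, tokens, dollars = step()
        budget.charge(tokens, dollars)
        if done:
            return "done"

outcome = run(Budget(max_steps=5, max_tokens=3500, max_dollars=1.0, max_seconds=60.0),
              lambda: (False, 1000, 0.01))
print(outcome)
```

Returning the cap's name rather than a bare failure lets the trace tag the run as its own failure class, as the table's trip-wire column asks.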

The through-line of this chapter — and this volume — is a single habit: close the loop. Traces feed the taxonomy, the taxonomy feeds the golden set, the golden set gates the next change, and the cost accounting tells you whether any of it is worth running. Agents do not become reliable by being built well once; they become reliable by being measured forever.

NEXT

Four volumes of theory earn you exactly nothing until they survive contact with practice. THE GYM is where that happens: drills across all four volumes — foundations, prompting, and agent engineering — with instruments scoring you instead of the model. Go lift.

§

Further reading

  • Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code. — introduces HumanEval and the pass@k metric this chapter reads honestly.
  • Jimenez, C. E. et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. — the benchmark that set the bar for end-state-verified agent tasks.
  • Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. — a multi-environment suite for measuring agents across interactive tasks.
  • Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. — the reference study on using models as judges, including the position, verbosity, and self-enhancement biases that carry over to trajectory grading.
  • Sigelman, B. et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. — the distributed-tracing model behind agent observability and spans.
  • Ribeiro, M. T. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. — argues for capability-targeted test suites over single aggregate scores.