04 · Harness Engineering — AI Encyclopedia

4.1

What a harness is

By 2026 the strange fact of the agent market is that competitors often run the same frontier models and ship wildly different products. The difference is not in the weights — those are rented by the token. It is in the harness: the policy engine that decides which proposed actions execute, the sandbox they execute in, the verifiers that score the result, the checkpoints that make mistakes cheap, and the gates that keep humans in the path of the irreversible. The model proposes; the harness disposes.

FIG A4.A: THE LIFECYCLE OF ONE AGENT ACTION

Every layer is optional, and every omission is a bet. Skip the sandbox and you bet no tool call ever goes wrong; skip the verifier and you bet the first sample is correct; skip the gate and you bet the model never confuses "draft the email" with "send it." Production harnesses make none of these bets.

Why the engineering value concentrates here: a harmful outcome needs two things, a bad action proposed and a bad action allowed to matter. Alignment training suppresses the first factor but cannot zero it, because agent inputs are adversarial (Chapter 03: anything the agent reads is a potential instruction). The second factor is yours:

EQ A4.1 — THE BLAST-RADIUS BOUND $$ \mathbb{E}[\text{damage}] \;\le\; \underbrace{\Pr\big[\text{harmful action executes}\big]}_{\text{model + adversary — never } 0} \;\times\; \underbrace{\max_{a \,\in\, \mathcal{A}_{\text{exec}}} c(a)}_{\text{blast radius — set by the harness}} $$

$\mathcal{A}_{\text{exec}}$ is the set of actions that can actually reach the world after sandboxing and permissions; $c(a)$ is the worst-case cost of action $a$, which checkpoints and reversibility shrink. You cannot drive the first factor to zero under adversarial input — so engineering effort goes into clamping the second. A "safe" agent with system-wide write access is one jailbreak from catastrophe; a mediocre agent in a disposable worktree is one git branch -D from harmless.

PYTHON · RUNNABLE IN-BROWSER

# EQ A4.1 in dollars: identical mistake probabilities, two harnesses
actions = [  # (action class, P[harmful attempt], $cost raw, $cost sandboxed)
    ("bad file edit",        0.050,     2_000,  5),   # git reset vs lost work
    ("rm in the wrong dir",  0.010,    25_000,  5),   # container fs vs your homedir
    ("curl|sh from a README",0.004,   250_000, 50),   # egress allowlist blocks exfil
    ("prod credential use",  0.002, 1_000_000,  0),   # secret never mounted: c(a)=0
]

print(f"{'action class':24s}{'P[attempt]':>11s}{'E[raw]':>9s}{'E[sandboxed]':>14s}")
raw_total = box_total = 0.0
for name, p, c_raw, c_box in actions:
    raw_total += p * c_raw
    box_total += p * c_box
    print(f"{name:24s}{p:11.3f}{p * c_raw:9,.0f}{p * c_box:14.2f}")
print("-" * 58)
print(f"{'E[damage], one attempt of each':35s}{raw_total:9,.0f}{box_total:14.2f}")  # label fits the 35-char column
print(f"\nsame model, same first factor — the harness cuts E[damage] by "
      f"{raw_total / box_total:,.0f}x")
print("you cannot zero P[harmful attempt]; you fully control max cost c(a)")


By EQ A4.1, $\mathbb{E}[\text{damage}] \le \Pr[\text{harmful action executes}] \times \max c(a)$. An action class has $\Pr[\text{harmful}] = 0.01$ and, without sandboxing, a worst-case cost of \$25,000. What is the expected-damage bound, in dollars?

$\mathbb{E}[\text{damage}] \le 0.01 \times 25{,}000 = \$250$. You cannot drive the first factor to zero under adversarial input — so the lever is the blast radius: shrink $c(a)$ with sandboxing and reversibility and the bound falls proportionally. The answer is 250.

Layer	Question it answers	Failure it bounds
Sandbox	where can code run?	Host compromise, data exfiltration, collateral damage
Permissions	which actions execute?	Out-of-scope writes, surprise side effects
Verifier	did it actually work?	Confidently shipped breakage
Checkpoints	can we go back?	Compounding errors, unrecoverable state
Human gate	who owns the irreversible?	Deploys, sends, deletes that no rollback undoes
Telemetry	what happened, exactly?	Unauditable incidents, unlearnable failures

Is the harness really the moat? The claim is contested. Skeptics argue that as models internalize verification and caution, harness layers thin away — and they do thin: teams ask less and allow more with every model generation. But the boundary at the bottom never moves. No amount of capability makes a sent email unsent or a dropped production table undropped. The layers that manage irreversibility are permanent engineering, not scaffolding awaiting a smarter model.

INSTRUMENT A4.1: HARNESS CONFIGURATOR · FIVE LAYERS · TWO METERS · ONE INCIDENT

[Interactive instrument. Controls: preset, sandbox, write scope, test verification, human gate, checkpoints. Readouts: autonomy / throughput score, safety / recoverability score, weakest layer, weakest-link incident.]

Flip layers and watch the trade. The meters are hand-tuned and illustrative; the incident stories are the real content — each describes what the weakest remaining setting allows, which is how attackers and entropy actually find you. Note that PRODUCTION and REGULATED differ only in write scope, and that no configuration reaches 100 on both meters. That is the theorem of this chapter, not a bug in the widget.

4.2

Sandboxing: blast-radius engineering

The sandbox is where EQ A4.1's second factor gets physically enforced. The design stance is borrowed from security engineering, not from trust: assume the agent will eventually attempt the worst action its environment permits — through error, through injection, or through an instruction it misread — and size the environment so that this worst action is affordable. Three resources need walls:

Filesystem. Read-only mounts for everything the agent needs but must not touch; copy-on-write overlays or dedicated checkouts for what it edits. The cheapest unit of filesystem isolation is the git worktree: a second working directory sharing the repository's object store, where the agent can do anything and the cleanup operation is deleting a branch.
Process. Namespaces, cgroups, and syscall filters (containers); or a separate guest kernel entirely (microVMs such as Firecracker, user-space kernels such as gVisor). The distinction matters because agents run arbitrary code as a feature — every npm install executes strangers' postinstall scripts with the agent's privileges.
Network. The wall that matters most and gets built last. An injected agent with no network egress can corrupt its sandbox; the same agent with open egress can exfiltrate every secret inside it. Default-deny with a short allowlist of package registries and APIs is the production norm.
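
The default-deny egress rule from the network wall above can be sketched in a few lines. This is an illustration only: in production the check would normally live in a proxy or firewall outside the agent's reach, not in code the agent can edit. The host names in `ALLOWED_HOSTS` and the function name `egress_allowed` are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: a few package registries, nothing else.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"})

def egress_allowed(url: str) -> bool:
    """Default-deny: permit only HTTPS requests to an exactly matching host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # unparseable URL: deny
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in ALLOWED_HOSTS

for url in ("https://pypi.org/simple/numpy/",
            "https://pypi.org.evil.example/x",      # look-alike suffix
            "http://registry.npmjs.org/left-pad"):  # right host, wrong scheme
    print(f"{url:45s} {'allow' if egress_allowed(url) else 'deny'}")
```

Matching the parsed hostname exactly, rather than searching the URL string for an allowed name, is what keeps look-alike hosts and `user@host` tricks out.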

Mechanism	Isolates	Escape cost	Typical use
Git worktree	workspace state	none — not a security boundary	Parallel attempts, cheap rollback, blast-radius for mistakes
Container	fs · processes · network ns	kernel exploit (shared kernel)	The default agent cell; seccomp/AppArmor hardened
MicroVM / gVisor	guest kernel or user-space kernel boundary	hypervisor escape (microVM) or sandbox-kernel escape (gVisor); rare	Untrusted code at scale, multi-tenant agent platforms
Ephemeral cloud VM	separate machine	≈ infrastructure compromise	Long-horizon autonomous runs; destroyed after the task

CAVEAT

The sandbox protects the host from the agent — not the contents of the sandbox from the agent. Whatever you mount inside the wall is inside the blast radius: production credentials in environment variables, a .env with live keys, an authenticated cloud CLI. Prompt injection does not need a sandbox escape if the valuables were carried into the cell. Mount secrets read-only, scoped, and short-lived — or better, broker them through a proxy the agent never sees raw.
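
One way to keep valuables out of the cell is to build the child environment from an allowlist instead of inheriting the parent's. A minimal sketch, assuming the agent's tools are launched as subprocesses; `SAFE_ENV_VARS` and `scrubbed_env` are hypothetical names, and the key value is a placeholder.

```python
# Hypothetical allowlist: only what a build tool plausibly needs.
SAFE_ENV_VARS = ("PATH", "HOME", "LANG", "TMPDIR")

def scrubbed_env(parent: dict) -> dict:
    """Copy only allowlisted variables; everything else never enters the cell."""
    return {name: parent[name] for name in SAFE_ENV_VARS if name in parent}

parent = {
    "PATH": "/usr/bin:/bin",
    "HOME": "/home/agent",
    "AWS_SECRET_ACCESS_KEY": "example-not-a-real-key",
}
child_env = scrubbed_env(parent)
print("secret carried into the cell:", "AWS_SECRET_ACCESS_KEY" in child_env)
```

The resulting dictionary would typically be passed as the `env=` argument of `subprocess.run`, so that a new secret added to the parent environment later stays out by default.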

4.3

Permission systems: spending human attention

Inside the sandbox, the permission layer decides per-action: allow, ask, or deny. Two principles carry most of the design. First, allowlist, don't denylist: the set of safe actions is finite and enumerable ("read any file in the repo, run the test suite, edit inside src/"), while the set of dangerous actions is infinite and adversarially generated — a denylist of bad commands is a parlor game against an attacker who can base64-encode. Second, permissions are tiered by capability, not by tool name: the same bash tool is harmless running grep and lethal running curl | sh, so production systems parse and classify the action, not the tool.

Tier	Examples	Policy
Read	ls · grep · cat · GET	Auto-allow, log. Asking here is pure fatigue.
Scoped write	edit in worktree · commit · branch	Auto-allow inside the declared scope; the checkpoint layer makes it cheap to be wrong.
Unscoped / mutating	install package · POST · write outside scope	Ask, with the concrete diff or command shown — never the intent paraphrased.
Irreversible	deploy · send · delete prod data · pay	Hard gate (§4.6) or structurally denied — not reachable from the agent's action set at all.
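
The tier table above can be sketched as a small classifier that inspects the concrete command instead of the tool name. This is a toy under stated assumptions: the command sets are hypothetical, path scope is not checked, and a real harness would parse far more carefully.

```python
import shlex
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"

# Hypothetical tier tables; a real harness would load these from policy config.
READ_COMMANDS = {"ls", "grep", "cat"}
SCOPED_WRITE = {("git", "commit"), ("git", "branch")}
IRREVERSIBLE = {"deploy", "send", "pay"}
SHELL_METACHARS = set("|;&<>`$")

def decide(command: str) -> Decision:
    """Classify the concrete action, not the tool that carries it."""
    # Pipes, chaining, redirection, substitution: too much can hide here to auto-allow.
    if any(ch in SHELL_METACHARS for ch in command):
        return Decision.ASK
    try:
        argv = shlex.split(command)
    except ValueError:              # e.g. unbalanced quotes
        return Decision.ASK
    if not argv:
        return Decision.DENY
    if argv[0] in IRREVERSIBLE:
        return Decision.DENY        # structurally unreachable from the agent
    if argv[0] in READ_COMMANDS or tuple(argv[:2]) in SCOPED_WRITE:
        return Decision.ALLOW
    return Decision.ASK             # allowlist stance: unknown is never auto-allowed

for cmd in ("grep -rn TODO src/", "git commit -m 'fix parser'",
            "npm install left-pad", "curl https://example.com/install.sh | sh",
            "deploy production"):
    print(f"{decide(cmd).value:5s} {cmd}")
```

The final `return Decision.ASK` is the allowlist principle in one line: anything the policy does not recognize falls to a human, never to auto-allow.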

The binding constraint is not policy expressiveness — it is approval fatigue. Human scrutiny is a depleting budget: the first confirmation dialog of the day gets read; the thirtieth gets reflex-approved in 400 ms. Every unnecessary ask therefore does double damage — it costs throughput now, and it trains the human to rubber-stamp the ask that will matter later. The design objective is brutal and clarifying: asks should be so rare that each one is news. Make the common path silent (reads, scoped writes), batch the questions you must ask, and surface them with the evidence needed to decide in one glance: the diff, the command, the URL — not "the agent wants to use Bash."

A useful audit: count asks per completed task across a week of traces. Above ~5, your users have already stopped reading them, and your permission system has quietly degraded into a latency tax with a false sense of security attached. The fix is almost never "ask better" — it is widening the auto-allow tier while narrowing the scope it applies to (a freer hand inside a smaller room).
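
The audit described above is a few lines of trace processing. A minimal sketch, assuming a hypothetical trace schema in which each event is a dict with a `task` id and a `kind` of either `ask` or `task_completed`; real telemetry will look different.

```python
from collections import Counter

ASK_BUDGET = 5  # the rough threshold suggested above

def asks_per_completed_task(events):
    """Mean number of permission asks over tasks that reached completion."""
    asks = Counter()
    completed = set()
    for event in events:
        if event["kind"] == "ask":
            asks[event["task"]] += 1
        elif event["kind"] == "task_completed":
            completed.add(event["task"])
    if not completed:
        return 0.0
    return sum(asks[task] for task in completed) / len(completed)

trace = (
    [{"task": "t1", "kind": "ask"}] * 8 + [{"task": "t1", "kind": "task_completed"}]
    + [{"task": "t2", "kind": "ask"}] * 4 + [{"task": "t2", "kind": "task_completed"}]
    + [{"task": "t3", "kind": "ask"}] * 9   # abandoned task: excluded from the mean
)
rate = asks_per_completed_task(trace)
print(f"{rate:.1f} asks per completed task:", "over budget" if rate > ASK_BUDGET else "ok")
```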

4.4

Verification: the ground-truth principle

The single highest-leverage component of a harness is the verifier, because of an asymmetry covered in Vol II · §5.7: for code, math, and structured tasks, checking is enormously cheaper than generating, and the check is objective. RLVR (reinforcement learning with verifiable rewards) exploits this at training time, turning test results into rewards. The harness exploits the identical signal at inference time, turning test results into retry gates. Same principle, different loop: if the harness can check it, the agent can fix it. The inverse governs your roadmap: what the harness cannot check, the agent cannot reliably fix. The highest-leverage engineering of the era is therefore converting vibes into asserts: golden files, schema validators, screenshot diffs scored by a vision model, latency budgets in CI.

EQ A4.2 — CLOSING THE LOOP $$ \Pr[\text{green within } k \text{ attempts}] \;=\; 1 - (1-p)^k, \qquad \mathbb{E}[\text{attempts}] \;=\; \frac{1-(1-p)^k}{p} $$

$p$ is single-attempt success probability against a sound verifier; attempts stop at first green or after $k$. At $p = 0.45$ and $k = 3$: an 83% completion rate from a model that is right less than half the time — the verifier converts mediocre per-shot accuracy into high task reliability. Caveat: attempts are not independent. Error feedback usually raises later $p_i$ (the failure text is evidence), but failures also correlate — an agent missing the concept loops on variants of the same wrong idea, which is why production harnesses cap retries and escalate to a human instead of burning tokens.
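
The expected-attempts term in EQ A4.2 follows from the tail-sum identity, under the same independence assumption as the completion rate. Writing $T$ for the number of attempts actually made (a symbol introduced here only for the derivation), the run reaches attempt $i+1$ exactly when the first $i$ attempts all failed:

$$ \mathbb{E}[T] \;=\; \sum_{i=0}^{k-1} \Pr[T > i] \;=\; \sum_{i=0}^{k-1} (1-p)^{i} \;=\; \frac{1-(1-p)^{k}}{p} $$

At $p = 0.45$ and $k = 3$ this is $1 + 0.55 + 0.3025 = 1.8525$, so the 83% completion rate costs on average under two attempts per task.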

A coder is right $p = 0.45$ of the time per attempt, behind a sound verifier, with up to $k = 3$ attempts. By EQ A4.2, what is $\Pr[\text{green within } k] = 1 - (1-p)^k$?

$(1-p)^k = 0.55^3 = 0.166$, so $\Pr[\text{green}] = 1 - 0.166 = 0.834$. A model that fails more often than it succeeds per shot still clears 83% of tasks once the verifier lets it retry — the verifier, not the model, does the heavy lifting. The answer is 0.834.

PYTHON · RUNNABLE IN-BROWSER

# the ground-truth principle: 2000 patch tasks, verifier on vs off
import numpy as np
rng = np.random.default_rng(0)
P, K, TRIALS = 0.4, 5, 2000          # per-attempt success, retry cap, tasks

correct = rng.random((TRIALS, K)) < P        # correct[t, i]: attempt i would pass
first_ok = correct[:, 0]                     # no verifier: ship attempt 1, unchecked
any_ok = correct.any(axis=1)                 # verifier: retry to green, cap K
attempts = np.where(any_ok, correct.argmax(axis=1) + 1, K)

print(f"no verifier  : ships 100% of tasks, {(~first_ok).mean():.1%} of them broken")
print(f"with verifier: ships {any_ok.mean():.1%} green, 0.0% broken, "
      f"{(~any_ok).mean():.1%} escalated to a human")
print(f"mean attempts per task : {attempts.mean():.2f} "
      f"(theory (1-(1-p)^k)/p = {(1 - (1 - P)**K) / P:.2f})")
print(f"green within {K} (EQ A4.2): simulated {any_ok.mean():.1%}, "
      f"theory 1-(1-p)^k = {1 - (1 - P)**K:.1%}")
print("identical model in both rows — a 40% per-shot coder plus a sound")
print("verifier ships nothing red; the same coder alone ships 60% breakage")


Everything depends on verifier quality, and "quality" decomposes into two error rates: $\alpha$, the chance a correct patch is rejected (flaky tests — wasteful but safe), and $\beta$, the chance a broken patch passes (weak tests — silently fatal). Bayes gives the ceiling:

EQ A4.3 — THE LEAKY-VERIFIER CEILING $$ \Pr[\text{correct} \mid \text{verifier green}] \;=\; \frac{(1-\alpha)\,p}{(1-\alpha)\,p + \beta\,(1-p)} $$

With $p = 0.45$ and a test suite that lets 15% of broken patches through ($\beta = 0.15$, $\alpha = 0.05$), a green run means only 84% correct — and no number of retries raises it, because retries condition on the same leaky green. Worse, a strong agent optimizes against the verifier and inflates effective $\beta$: deleting failing tests, hardcoding expected outputs, special-casing the test inputs. This is reward hacking (Vol II · Ch 05) at inference time. Standard mitigations: tests read-only to the agent's write scope, diff review on any test-file change, and held-out checks the agent never sees.

A patch is correct with prior $p = 0.45$. The suite rejects correct work $\alpha = 0.05$ of the time and passes broken work $\beta = 0.15$ of the time. By EQ A4.3, given a green run, what is $\Pr[\text{correct} \mid \text{green}] = \dfrac{(1-\alpha)p}{(1-\alpha)p + \beta(1-p)}$?

Numerator $= 0.95 \times 0.45 = 0.4275$. Denominator $= 0.4275 + 0.15 \times 0.55 = 0.4275 + 0.0825 = 0.51$. Ratio $= 0.4275 / 0.51 \approx 0.838$. A leaky suite ($\beta = 0.15$) caps trust at ~84% no matter how many times you retry — retries condition on the same leaky green. The answer is 0.838.
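
Of the mitigations listed above, flagging any test-file change for review is the easiest to sketch. The example assumes the harness already has the list of changed paths (for instance from `git diff --name-only`); the directory and filename patterns, and the name `protected_changes`, are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical policy: what counts as "verifier code" in this repository.
PROTECTED_DIRS = ("tests",)
PROTECTED_PATTERNS = ("test_*.py", "*_test.py", "conftest.py")

def protected_changes(changed_paths):
    """Return the changed paths that touch the verifier and need human diff review."""
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_protected_dir = any(part in PROTECTED_DIRS for part in path.parts[:-1])
        protected_name = any(fnmatchcase(path.name, pat) for pat in PROTECTED_PATTERNS)
        if in_protected_dir or protected_name:
            flagged.append(raw)
    return flagged

changed = ["src/dateparse.py", "tests/test_dst.py", "src/conftest.py"]
print(protected_changes(changed))
```

A non-empty result would route the patch to a human instead of the auto-commit path, which keeps the agent from lowering $\beta$'s guard on its own authority.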

INSTRUMENT A4.2: VERIFY-LOOP SIM · ONE BUG · TWO HARNESSES · SCRIPTED RUN

TASK · fix: date parser drops timezone fold on DST boundary · suite: 14 tests · model identical in both runs

[Interactive instrument. Control: run. Readouts: patch attempts, suite at ship, first external verifier.]

Run both modes. The model is identical; only the harness differs. With verification ON, two red runs become context and attempt 3 lands green — EQ A4.2 with $p \approx 0.45$, $k = 3$. With verification OFF, the very same first patch ships, and the verifier role is outsourced to CI, then to customers. Scripted and illustrative — but every event in it is the standard behavior of a verify-loop harness.

4.5

Checkpoints & recovery

Verification tells you an attempt failed; checkpoints make that information affordable. The agent-native unit of recovery is the commit: cheap (milliseconds), content-addressed, diffable, and reversible with one command. Production harnesses run a ratchet: commit on every verified-green state, never on red, so the worktree's history is a monotone sequence of working states and "undo" means git reset --hard to the last tooth of the ratchet. Databases solved this decades ago — write-ahead logging, atomic commit, crash recovery — and agent harnesses are rediscovering each piece under new names.

# the ratchet — recovery loop of a production coding harness
checkpoint:  commit after every green verify        # save point ≈ WAL record
on red:      keep the failure text, discard the diff  # evidence in, damage out
rollback:    reset --hard last-green                  # O(1), no negotiation
reattempt:   same goal + accumulated failure traces   # p rises with evidence
cap:         k attempts, then escalate to human       # correlated failure ≠ retry fuel
abandon:     delete worktree, branch, container       # total cost: one git ref
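
The ratchet above can be sketched as a plain retry loop. This is a toy under stated assumptions: a dict stands in for the worktree, and `attempt` and `verify` are hypothetical callables standing in for the model call and the test suite; no git is involved.

```python
from typing import Callable, List, Optional, Tuple

State = dict  # toy stand-in for a worktree snapshot

def ratchet(
    last_green: State,
    attempt: Callable[[State, List[str]], State],
    verify: Callable[[State], Tuple[bool, str]],
    k: int,
) -> Tuple[Optional[State], List[str]]:
    """Try up to k times from the last green state.

    Returns (new_green_state, failures). new_green_state is None when the cap
    is hit and the task should be escalated to a human.
    """
    failures: List[str] = []
    for _ in range(k):
        # Always restart from the checkpoint: a red candidate is simply dropped.
        candidate = attempt(dict(last_green), failures)
        ok, message = verify(candidate)
        if ok:
            return candidate, failures   # checkpoint: this becomes the new last-green
        failures.append(message)         # keep the evidence, discard the diff
    return None, failures                # cap reached: escalate

# Toy task: the "patch" is right only once it has seen two failure messages.
def toy_attempt(state, failures):
    state["tz_fold"] = len(failures) >= 2
    return state

def toy_verify(state):
    return (True, "") if state["tz_fold"] else (False, "test_dst_fold failed")

green, evidence = ratchet({"tz_fold": False}, toy_attempt, toy_verify, k=3)
print("green:", green, "| red runs kept as context:", len(evidence))
```

Because every attempt starts from a copy of the last green state, rollback needs no separate code path: a failed candidate is never written anywhere that matters.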

The second recovery primitive is idempotence: design steps so that $f(f(s)) = f(s)$ — running a step twice lands in the same state as running it once. Idempotent steps make retry-after-partial-failure safe, which matters because agents fail mid-step constantly: a timeout after the database write but before the confirmation, a crash between two file edits. Migrations written with IF NOT EXISTS, PUTs instead of POSTs, "ensure state X" instead of "apply change ΔX" — the agent can then be restarted blindly, which is exactly how it will be restarted at 3 a.m.
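
The "ensure state X" style can be shown with the standard-library `sqlite3` module. A minimal sketch, assuming a hypothetical `flags` table; the point is only that the second run leaves the database exactly where the first one did.

```python
import sqlite3

def ensure_flag(conn: sqlite3.Connection, name: str, enabled: bool) -> None:
    """Ensure the flag exists with this value; safe to run any number of times."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flags (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO flags (name, enabled) VALUES (?, ?)",
        (name, int(enabled)),
    )
    conn.commit()

def snapshot(conn: sqlite3.Connection):
    return conn.execute("SELECT name, enabled FROM flags ORDER BY name").fetchall()

conn = sqlite3.connect(":memory:")
ensure_flag(conn, "new_parser", True)
once = snapshot(conn)
ensure_flag(conn, "new_parser", True)   # blind restart after a crash
twice = snapshot(conn)
print(once == twice, twice)
```

A plain `CREATE TABLE` followed by a plain `INSERT` would fail or duplicate on the second run, which is exactly the retry hazard the paragraph above describes.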

Checkpoints also change agent psychology, in the behavioral sense: a harness that can roll back cheaply can afford to let the agent try aggressive refactors that an unprotected harness must forbid. Recoverability is not the opposite of autonomy — it is what makes autonomy affordable. This is the deep reason the SAFETY and AUTONOMY meters in Instrument A4.1 are not a strict trade-off.

4.6

Human-in-the-loop design

Where exactly does a human belong in the loop? The principled answer falls out of §4.5: gate by undo cost, not by anxiety. If an action is cheaply reversible, gating it buys no safety — the checkpoint already covers it — and spends scarce attention (§4.3). If an action is irreversible, no downstream layer can save you, so a human belongs in front of it regardless of how capable the model is. Deploys, sends, deletes, payments, anything that crosses from the sandbox into the world of other people: gated.

GATE

the irreversible

deploy · send · delete · pay — undo cost is infinite, so a human signs each one.

DON'T GATE

the recoverable

edits, commits, scoped writes — the checkpoint layer is the approval. Gating here mints fatigue, not safety.

RELOCATE

the boundary

the strongest move: make more actions reversible — soft deletes, staged deploys, outbox-with-delay — and the gate list shrinks honestly.

The second design axis is sync versus async. A synchronous gate stalls an agent that works at machine speed against a human who context-switches at meeting speed — the agent idles, the human gets interrupted, both lose. The async pattern borrowed from code review wins at scale: the agent completes everything reversible, parks irreversibles in a review queue (a pull request, a staged deploy, an unsent outbox), and the human disposes of the queue in batches, with full diffs, on their own schedule. The PR is the proven artifact here — agents that end every task at "branch pushed, PR open, CI green" compose with two decades of existing review infrastructure.
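
The async pattern above reduces to a queue with two entry points, one for the agent and one for the human. A minimal in-memory sketch; `ParkedAction` and `ReviewQueue` are hypothetical names, and a real system would persist the queue and show full diffs instead of a one-line description.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ParkedAction:
    description: str                 # the concrete evidence a reviewer sees
    execute: Callable[[], None]      # runs only after approval

@dataclass
class ReviewQueue:
    pending: List[ParkedAction] = field(default_factory=list)

    def park(self, action: ParkedAction) -> None:
        """Called by the agent: never blocks, never executes."""
        self.pending.append(action)

    def review(self, approve: Callable[[ParkedAction], bool]) -> int:
        """Called by a human in batches; returns how many actions ran."""
        batch, self.pending = self.pending, []
        ran = 0
        for action in batch:
            if approve(action):
                action.execute()
                ran += 1
        return ran   # rejected actions are dropped in this sketch

log = []
queue = ReviewQueue()
queue.park(ParkedAction("merge PR: fix DST fold", lambda: log.append("merge")))
queue.park(ParkedAction("deploy to production", lambda: log.append("deploy")))
ran = queue.review(lambda action: not action.description.startswith("deploy"))
print(ran, log)
```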

EROSION

Gates decay under deadline pressure. Every team eventually discovers its humans approving deploys from their phones without reading the diff — at which point the gate is a ritual, not a control. Two honest countermeasures: keep the gated set so small that vigilance is sustainable (single digits per person per day), and make the safe path the fast path — if rollback-capable staged deploys ship in one click and raw deploys need two approvals, entropy works for you instead of against you.

4.7

Parallel harnesses: N attempts, one winner

Once a harness makes single runs cheap, disposable, and verifiable, an upgrade becomes nearly free: run N harnesses in parallel on the same task and keep the best result. Git worktrees make the isolation almost costless — N checkouts sharing one object store, one branch each, no interference — and the economics follow the oldest equation in sampling:

EQ A4.4 — BEST-OF-N WITH A SELECTOR $$ \Pr[\text{ship correct}] \;=\; j\cdot\big(1-(1-p)^{N}\big), \qquad \Delta_N \;=\; p\,(1-p)^{N-1} $$

$p$ is the per-attempt success probability, $N$ the number of parallel attempts, and $j$ the probability that the selector picks a correct candidate when at least one exists. A ground-truth verifier is a perfect selector ($j = 1$): run the tests, ship whichever attempt is green. An LLM judge is a noisy one ($j \approx 0.6\text{–}0.9$ on hard tasks), and its noise caps the whole pipeline: the gap between the two curves in Instrument A4.3 is the price of not having a checkable task. $\Delta_N$ is the marginal gain in $\Pr[\text{at least one correct}]$ from attempt $N$ (multiply by $j$ for the gain in $\Pr[\text{ship correct}]$), and it decays geometrically: most of best-of-N's value arrives by $N = 4\text{–}5$ for $p \gtrsim 0.3$, while cost grows linearly forever.

You run $N = 4$ parallel attempts at per-attempt success $p = 0.3$, selected by a ground-truth verifier ($j = 1$). By EQ A4.4, $\Pr[\text{ship correct}] = j\,(1-(1-p)^N)$. What is it?

$(1-p)^N = 0.7^4 = 0.2401$, so $1 - 0.2401 = 0.7599$; with $j = 1$, $\Pr[\text{ship correct}] = 0.760$. A 30% agent, run four times against a perfect selector, ships correctly 76% of the time — but the marginal value of each extra attempt decays geometrically, so most of the gain is already in by $N = 4$–5. The answer is 0.76.
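
EQ A4.4 is easy to tabulate, which makes the stopping point visible. The sketch below assumes independent attempts, as the equation does; the helper `last_worthwhile_n` and its `min_gain` threshold are illustrative additions, not part of the equation.

```python
def p_any_correct(p: float, n: int) -> float:
    """Pr[at least one of n independent attempts is correct]."""
    return 1 - (1 - p) ** n

def p_ship_correct(p: float, n: int, j: float) -> float:
    """EQ A4.4: the selector finds a correct candidate with probability j."""
    return j * p_any_correct(p, n)

def marginal_gain(p: float, n: int) -> float:
    """Delta_N: what attempt n adds to Pr[at least one correct]."""
    return p * (1 - p) ** (n - 1)

def last_worthwhile_n(p: float, min_gain: float, n_max: int = 64) -> int:
    """Largest N <= n_max whose marginal gain is still >= min_gain (0 if none)."""
    best = 0
    for n in range(1, n_max + 1):
        if marginal_gain(p, n) >= min_gain:
            best = n
    return best

for n in range(1, 7):
    print(f"N={n}  verifier {p_ship_correct(0.3, n, 1.0):.3f}  "
          f"judge(j=0.75) {p_ship_correct(0.3, n, 0.75):.3f}  "
          f"gain {marginal_gain(0.3, n):.3f}")
print("last N adding at least 5 points:", last_worthwhile_n(0.3, 0.05))
```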

The honest caveat is correlation: the N attempts come from the same model with the same blind spots, so true success is below the independence curve — if the model misunderstands the task, it misunderstands it N times, in N worktrees, at N× the cost. Production mitigations: vary the approach across attempts (different plans seeded into each prompt), vary temperature, or vary the model itself. And the selector inherits §4.4's failure mode wholesale: best-of-N against a leaky verifier is N chances to find the hack.
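
To see how much correlation can cost, here is one deliberately simple model, which is an assumption of this sketch and not a claim about real agents: on a fraction $q$ of tasks the model misreads the goal and every attempt fails; on the rest, attempts are independent. The per-attempt success rate is held at $p$ so the two curves are comparable.

```python
def independent(p: float, n: int) -> float:
    """EQ A4.4 with j = 1: attempts fail independently."""
    return 1 - (1 - p) ** n

def shared_blind_spot(p: float, n: int, q: float) -> float:
    """Pr[at least one correct] when a fraction q of tasks fails on every attempt.

    Assumed toy model: the marginal per-attempt success rate stays p, so the
    success rate on the tasks that are understood is p / (1 - q).
    """
    if not 0 <= q < 1 or p > 1 - q:
        raise ValueError("need 0 <= q < 1 and p <= 1 - q")
    p_understood = p / (1 - q)
    return (1 - q) * (1 - (1 - p_understood) ** n)

for n in (1, 2, 4, 8, 16):
    print(f"N={n:2d}  independent {independent(0.3, n):.3f}  "
          f"shared blind spot (q=0.4) {shared_blind_spot(0.3, n, 0.4):.3f}")
```

Under this model the correlated curve can never exceed $1 - q$ however large $N$ grows, which is why varying the plan or the model across attempts matters more than adding worktrees.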

INSTRUMENT A4.3: BEST-OF-N PLANNER · EQ A4.4 · INDEPENDENCE ASSUMED (OPTIMISTIC)

[Interactive instrument. Controls and defaults: per-attempt success p = 0.30, parallel attempts N = 4, selector accuracy j = 0.75. Readouts: P(≥1 correct in N), P(ship correct), marginal gain of attempt N.]

Mint curve: ground-truth selector (j = 1, e.g. a test suite). Blue curve: your LLM judge at accuracy j. Drag j to 1.0 and watch the curves merge — that vertical gap is the dollar value of making your task checkable. Then set p = 0.1 and note how N must explode to compensate: parallelism amplifies a competent agent and merely bankrolls an incompetent one. Compute cost scales as N; the marginal-gain readout tells you when to stop paying.

The cage is built; now study what runs inside it. Chapter 05: loop engineering and multi-agent patterns — how the agent's inner loop is structured, when to split work across orchestrators and subagents, and why most multi-agent failures are really context failures wearing a trench coat.
