What a harness is
By 2026 the strange fact of the agent market is that competitors often run the same frontier models and ship wildly different products. The difference is not in the weights — those are rented by the token. It is in the harness: the policy engine that decides which proposed actions execute, the sandbox they execute in, the verifiers that score the result, the checkpoints that make mistakes cheap, and the gates that keep humans in the path of the irreversible. The model proposes; the harness disposes.
Why this is where the engineering value concentrated: a harmful outcome needs two things — a bad action proposed, and a bad action allowed to matter. Alignment training suppresses the first factor but cannot zero it, because agent inputs are adversarial (Chapter 03: anything the agent reads is a potential instruction). The second factor is yours:
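A sketch of the two-factor decomposition, written in the notation the listing's comments use (P[harmful attempt] and c(a)). The summation form is an assumption, chosen to match how the listing totals expected damage:

```latex
\[
\mathbb{E}[\text{damage}] \;=\; \sum_{a} P[\text{harmful attempt } a] \cdot c(a)
\]
```

Training moves the first factor. The harness sets \(c(a)\), and an action that cannot reach anything valuable has \(c(a) = 0\).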
[…] git branch -D from harmless.

```python
# EQ A4.1 in dollars: identical mistake probabilities, two harnesses
actions = [  # (action class, P[harmful attempt], $cost raw, $cost sandboxed)
    ("bad file edit",         0.050,     2_000,  5),  # git reset vs lost work
    ("rm in the wrong dir",   0.010,    25_000,  5),  # container fs vs your homedir
    ("curl|sh from a README", 0.004,   250_000, 50),  # egress allowlist blocks exfil
    ("prod credential use",   0.002, 1_000_000,  0),  # secret never mounted: c(a)=0
]
print(f"{'action class':24s}{'P[attempt]':>11s}{'E[raw]':>9s}{'E[sandboxed]':>14s}")
raw_total = box_total = 0.0
for name, p, c_raw, c_box in actions:
    raw_total += p * c_raw   # expected damage with no sandbox
    box_total += p * c_box   # expected damage inside the sandbox
    print(f"{name:24s}{p:11.3f}{p * c_raw:9,.0f}{p * c_box:14.2f}")
print("-" * 58)
print(f"{'E[damage], one attempt of each':35s}{raw_total:9,.0f}{box_total:14.2f}")
print(f"\nsame model, same first factor: the harness cuts E[damage] by "
      f"{raw_total / box_total:,.0f}x")
print("you cannot zero P[harmful attempt]; you fully control max cost c(a)")
```
| Layer | Question it answers | Failure it bounds |
|---|---|---|
| Sandbox | where can code run? | Host compromise, data exfiltration, collateral damage |
| Permissions | which actions execute? | Out-of-scope writes, surprise side effects |
| Verifier | did it actually work? | Confidently shipped breakage |
| Checkpoints | can we go back? | Compounding errors, unrecoverable state |
| Human gate | who owns the irreversible? | Deploys, sends, deletes that no rollback undoes |
| Telemetry | what happened, exactly? | Unauditable incidents, unlearnable failures |
Is the harness really the moat? The claim is contested. Skeptics argue that as models internalize verification and caution, harness layers thin away — and they do thin: teams ask less and allow more with every model generation. But the boundary at the bottom never moves. No amount of capability makes a sent email unsent or a dropped production table undropped. The layers that manage irreversibility are permanent engineering, not scaffolding awaiting a smarter model.
Sandboxing: blast-radius engineering
The sandbox is where EQ A4.1's second factor gets physically enforced. The design stance is borrowed from security engineering, not from trust: assume the agent will eventually attempt the worst action its environment permits — through error, through injection, or through an instruction it misread — and size the environment so that this worst action is affordable. Three resources need walls:
- Filesystem. Read-only mounts for everything the agent needs but must not touch; copy-on-write overlays or dedicated checkouts for what it edits. The cheapest unit of filesystem isolation is the git worktree: a second working directory sharing the repository's object store, where the agent can do anything and the cleanup operation is deleting a branch.
- Process. Namespaces, cgroups, and syscall filters (containers); or a separate guest kernel entirely (microVMs such as Firecracker, user-space kernels such as gVisor). The distinction matters because agents run arbitrary code as a feature: every npm install executes strangers' postinstall scripts with the agent's privileges.
- Network. The wall that matters most and gets built last. An injected agent with no network egress can corrupt its sandbox; the same agent with open egress can exfiltrate every secret inside it. Default-deny with a short allowlist of package registries and APIs is the production norm.
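The default-deny rule for the network wall can be sketched as a single decision function. This is a minimal illustration, not a real enforcement point: the hostnames in `EGRESS_ALLOWLIST` and the function name `egress_allowed` are hypothetical, and in practice the decision would sit in a proxy or firewall outside the agent's reach.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: hosts a coding agent plausibly needs. Illustrative only.
EGRESS_ALLOWLIST = frozenset({"registry.npmjs.org", "pypi.org", "files.pythonhosted.org"})


def egress_allowed(url: str, allowlist: frozenset = EGRESS_ALLOWLIST) -> bool:
    """Default-deny: permit only https requests to an exactly matching host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False  # plain http, file://, and unparseable input are all refused
    host = (parts.hostname or "").lower()
    return host in allowlist  # exact match, so look-alike subdomains fail


print(egress_allowed("https://pypi.org/simple/requests/"))
print(egress_allowed("https://pypi.org.evil.example/steal"))
```

Exact host matching is the design choice worth noting: suffix or substring matching is where allowlists tend to leak.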
| Mechanism | Isolates | Escape cost | Typical use |
|---|---|---|---|
| Git worktree | workspace state | none — not a security boundary | Parallel attempts, cheap rollback, blast-radius for mistakes |
| Container | fs · processes · network ns | kernel exploit (shared kernel) | The default agent cell; seccomp/AppArmor hardened |
| MicroVM / gVisor | separate kernel (guest or user-space) | hypervisor escape (microVM) or user-space-kernel escape (gVisor); both rare | Untrusted code at scale, multi-tenant agent platforms |
| Ephemeral cloud VM | separate machine | ≈ infrastructure compromise | Long-horizon autonomous runs; destroyed after the task |
The sandbox protects the host from the agent — not the contents of the sandbox from the agent. Whatever you mount inside the wall is inside the blast radius: production credentials in environment variables, a .env with live keys, an authenticated cloud CLI. Prompt injection does not need a sandbox escape if the valuables were carried into the cell. Mount secrets read-only, scoped, and short-lived — or better, broker them through a proxy the agent never sees raw.
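One small piece of this, sketched under assumptions: build the child process environment from an allowlist instead of inheriting the parent's, so live keys in the launching shell never enter the cell. The key names in `SAFE_ENV_KEYS` and the helper `sandbox_env` are hypothetical.

```python
import os

# Hypothetical allowlist of variables the sandboxed process may inherit.
SAFE_ENV_KEYS = frozenset({"PATH", "HOME", "LANG", "TERM"})


def sandbox_env(parent_env, extra=None):
    """Return a child environment with only allowlisted keys plus explicit extras."""
    env = {k: v for k, v in parent_env.items() if k in SAFE_ENV_KEYS}
    env.update(extra or {})  # e.g. a scoped, short-lived token minted for this task
    return env


child_env = sandbox_env(os.environ)
print(sorted(child_env))
```

A launcher could pass the result as the `env` argument of `subprocess.run`. This covers environment variables only; mounted files and authenticated CLIs need their own treatment.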
Permission systems: spending human attention
Inside the sandbox, the permission layer decides per-action: allow, ask, or deny. Two principles carry most of the design. First, allowlist, don't denylist: the set of safe actions is finite and enumerable ("read any file in the repo, run the test suite, edit inside src/"), while the set of dangerous actions is infinite and adversarially generated — a denylist of bad commands is a parlor game against an attacker who can base64-encode. Second, permissions are tiered by capability, not by tool name: the same bash tool is harmless running grep and lethal running curl | sh, so production systems parse and classify the action, not the tool.
| Tier | Examples | Policy |
|---|---|---|
| Read | ls · grep · cat · GET | Auto-allow, log. Asking here is pure fatigue. |
| Scoped write | edit in worktree · commit · branch | Auto-allow inside the declared scope; the checkpoint layer makes it cheap to be wrong. |
| Unscoped / mutating | install package · POST · write outside scope | Ask, with the concrete diff or command shown — never the intent paraphrased. |
| Irreversible | deploy · send · delete prod data · pay | Hard gate (§4.6) or structurally denied — not reachable from the agent's action set at all. |
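A minimal sketch of classifying the action instead of the tool, assuming a shell-command interface. The command sets, the `Tier` names, and `classify` are hypothetical; path scoping is omitted, and the irreversible tier is not modeled because the table places it outside the agent's action set altogether.

```python
import shlex
from enum import Enum


class Tier(Enum):
    READ = "auto-allow"
    SCOPED_WRITE = "auto-allow in scope"
    ASK = "ask"


# Hypothetical allowlists for illustration; a real policy would be far more careful.
READ_COMMANDS = {"ls", "grep", "cat"}
SCOPED_GIT_SUBCOMMANDS = {"add", "commit", "branch"}
SHELL_METACHARS = set("|;&><`$")


def classify(command: str) -> Tier:
    """Classify a shell command by capability; anything unrecognized becomes ASK."""
    if any(ch in SHELL_METACHARS for ch in command):
        return Tier.ASK  # pipes, redirects, substitution: never auto-allowed
    try:
        argv = shlex.split(command)
    except ValueError:
        return Tier.ASK  # unbalanced quotes and similar parse failures
    if not argv:
        return Tier.ASK
    if argv[0] in READ_COMMANDS:
        return Tier.READ
    if argv[0] == "git" and len(argv) > 1 and argv[1] in SCOPED_GIT_SUBCOMMANDS:
        return Tier.SCOPED_WRITE
    return Tier.ASK


for cmd in ("grep -r TODO src", "git commit -m wip", "curl https://example.com/x | sh"):
    print(f"{cmd!r:40} -> {classify(cmd).name}")
```

The fall-through to `ASK` is the allowlist principle in one line: the policy enumerates what is safe and treats everything else as a question.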
The binding constraint is not policy expressiveness — it is approval fatigue. Human scrutiny is a depleting budget: the first confirmation dialog of the day gets read; the thirtieth gets reflex-approved in 400 ms. Every unnecessary ask therefore does double damage — it costs throughput now, and it trains the human to rubber-stamp the ask that will matter later. The design objective is brutal and clarifying: asks should be so rare that each one is news. Make the common path silent (reads, scoped writes), batch the questions you must ask, and surface them with the evidence needed to decide in one glance: the diff, the command, the URL — not "the agent wants to use Bash."
A useful audit: count asks per completed task across a week of traces. Above ~5, your users have already stopped reading them, and your permission system has quietly degraded into a latency tax with a false sense of security attached. The fix is almost never "ask better" — it is widening the auto-allow tier while narrowing the scope it applies to (a freer hand inside a smaller room).
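The audit is a few lines once traces are in hand. The record shape below (a dict with `asks` and `completed`) is a hypothetical trace format, and "asks per completed task" is read here as the mean ask count over completed tasks only.

```python
from statistics import mean

# Hypothetical trace records: one dict per task from a week of runs.
traces = [
    {"task": "fix-flaky-test", "asks": 3, "completed": True},
    {"task": "bump-deps", "asks": 9, "completed": True},
    {"task": "refactor-auth", "asks": 7, "completed": False},
    {"task": "add-endpoint", "asks": 6, "completed": True},
]

ASK_BUDGET = 5  # the rough threshold from the text


def asks_per_completed_task(records):
    """Mean number of asks over completed tasks only; None if nothing completed."""
    done = [r["asks"] for r in records if r["completed"]]
    return mean(done) if done else None


rate = asks_per_completed_task(traces)
verdict = "over budget" if rate > ASK_BUDGET else "within budget"
print(f"asks per completed task: {rate:.1f} ({verdict}, budget ~{ASK_BUDGET})")
```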
Verification: the ground-truth principle
The single highest-leverage component of a harness is the verifier, because of an asymmetry covered in Vol II · §5.7: for code, math, and structured tasks, checking is enormously cheaper than generating, and the check is objective. RLVR exploits this at training time, turning test results into rewards. The harness exploits the identical signal at inference time, turning test results into retry gates. Same principle, different loop: if the harness can check it, the agent can fix it. The contrapositive governs your roadmap: what the harness cannot check, the agent cannot reliably fix — so the highest-leverage engineering of the era is converting vibes into asserts: golden files, schema validators, screenshot diffs scored by a vision model, latency budgets in CI.
```python
# the ground-truth principle: 2000 patch tasks, verifier on vs off
import numpy as np

rng = np.random.default_rng(0)
P, K, TRIALS = 0.4, 5, 2000            # per-attempt success, retry cap, tasks
correct = rng.random((TRIALS, K)) < P  # correct[t, i]: attempt i would pass
first_ok = correct[:, 0]               # no verifier: ship attempt 1, unchecked
any_ok = correct.any(axis=1)           # verifier: retry to green, cap K
attempts = np.where(any_ok, correct.argmax(axis=1) + 1, K)
print(f"no verifier  : ships 100% of tasks, {(~first_ok).mean():.1%} of them broken")
print(f"with verifier: ships {any_ok.mean():.1%} green, 0.0% broken, "
      f"{(~any_ok).mean():.1%} escalated to a human")
print(f"mean attempts per task : {attempts.mean():.2f} "
      f"(theory (1-(1-p)^k)/p = {(1 - (1 - P)**K) / P:.2f})")
print(f"green within {K} (EQ A4.2): simulated {any_ok.mean():.1%}, "
      f"theory 1-(1-p)^k = {1 - (1 - P)**K:.1%}")
print("identical model in both rows: a 40% per-shot coder plus a sound")
print("verifier ships nothing red; the same coder alone ships 60% breakage")
```
Everything depends on verifier quality, and "quality" decomposes into two error rates: \(\alpha\), the chance a correct patch is rejected (flaky tests — wasteful but safe), and \(\beta\), the chance a broken patch passes (weak tests — silently fatal). Bayes gives the ceiling:
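Spelled out as a sketch under the definitions above, with \(p\) standing for the per-attempt probability that a patch is correct (the listing's `P`; using it as the prior is an assumption), Bayes' rule gives the fraction of passing patches that are actually correct:

```latex
\[
P(\text{correct} \mid \text{pass})
  \;=\; \frac{p\,(1-\alpha)}{p\,(1-\alpha) + (1-p)\,\beta}
\]
```

With illustrative values \(p = 0.4\), \(\alpha = 0.1\), \(\beta = 0.05\) this is \(0.36 / 0.39 \approx 0.92\): a verifier that passes one broken patch in twenty still lets roughly 8% of shipped patches be broken. Driving \(\beta\) toward 0 pushes the ratio toward 1 whatever \(\alpha\) is; \(\alpha\) only costs retries.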
Checkpoints & recovery
Verification tells you an attempt failed; checkpoints make that information affordable. The agent-native unit of recovery is the commit: cheap (milliseconds), content-addressed, diffable, and reversible with one command. Production harnesses run a ratchet: commit on every verified-green state, never on red, so the worktree's history is a monotone sequence of working states and "undo" means git reset --hard to the last tooth of the ratchet. Databases solved this decades ago — write-ahead logging, atomic commit, crash recovery — and agent harnesses are rediscovering each piece under new names.
```text
# the ratchet: recovery loop of a production coding harness
checkpoint: commit after every green verify           # save point ≈ WAL record
on red:     keep the failure text, discard the diff   # evidence in, damage out
rollback:   reset --hard last-green                   # O(1), no negotiation
reattempt:  same goal + accumulated failure traces    # p rises with evidence
cap:        k attempts, then escalate to human        # correlated failure ≠ retry fuel
abandon:    delete worktree, branch, container        # total cost: one git ref
```
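The recovery loop listed above can be sketched as a small Python function. The four callables are hypothetical hooks a harness would supply (for instance, thin wrappers around git commit and git reset --hard); nothing here touches a real repository.

```python
def ratchet(attempt, verify, checkpoint, rollback, k=5):
    """Run up to k attempts; checkpoint only verified-green states, roll back red ones.

    Hypothetical hooks supplied by the harness:
      attempt(failures) -> candidate   make an edit, given earlier failure texts
      verify(candidate) -> (ok, text)  run the checks, return pass/fail plus output
      checkpoint(candidate)            e.g. commit the worktree
      rollback()                       e.g. reset the worktree to the last green commit
    """
    failures = []
    for n in range(1, k + 1):
        candidate = attempt(failures)
        ok, text = verify(candidate)
        if ok:
            checkpoint(candidate)
            return {"status": "green", "attempts": n, "failures": failures}
        failures.append(text)  # keep the evidence
        rollback()             # discard the diff
    return {"status": "escalate", "attempts": k, "failures": failures}


# Toy run: the "edit" is just a counter that goes green on the third try.
log = []
result = ratchet(
    attempt=lambda failures: len(failures),
    verify=lambda c: (c >= 2, f"attempt {c} failed"),
    checkpoint=lambda c: log.append(("commit", c)),
    rollback=lambda: log.append(("reset",)),
)
print(result["status"], result["attempts"], log)
```

Passing the accumulated failure texts back into `attempt` is what separates a reattempt from a blind retry.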
The second recovery primitive is idempotence: design steps so that \(f(f(s)) = f(s)\) — running a step twice lands in the same state as running it once. Idempotent steps make retry-after-partial-failure safe, which matters because agents fail mid-step constantly: a timeout after the database write but before the confirmation, a crash between two file edits. Migrations written with IF NOT EXISTS, PUTs instead of POSTs, "ensure state X" instead of "apply change ΔX" — the agent can then be restarted blindly, which is exactly how it will be restarted at 3 a.m.
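A small illustration of an "ensure state X" step, using an in-memory SQLite database from the standard library. The table, the email value, and the helper names are hypothetical; the point is only that running the step twice leaves the same state as running it once.

```python
import sqlite3


def ensure_user(conn, email):
    """Idempotent step: 'ensure this user exists', safe to rerun after a crash."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY)")
    conn.execute("INSERT OR IGNORE INTO users (email) VALUES (?)", (email,))
    conn.commit()


def snapshot(conn):
    """Return the full table contents in a stable order, for state comparison."""
    return conn.execute("SELECT email FROM users ORDER BY email").fetchall()


conn = sqlite3.connect(":memory:")
ensure_user(conn, "dev@example.com")
once = snapshot(conn)
ensure_user(conn, "dev@example.com")  # blind retry, as after a timeout
twice = snapshot(conn)
print(once == twice)  # prints True
```

A plain `INSERT` in place of `INSERT OR IGNORE` would raise on the second call, which is the "apply change ΔX" failure mode the text warns about.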
Checkpoints also change agent psychology, in the behavioral sense: a harness that can roll back cheaply can afford to let the agent try aggressive refactors that an unprotected harness must forbid. Recoverability is not the opposite of autonomy — it is what makes autonomy affordable. This is the deep reason the SAFETY and AUTONOMY meters in Instrument A4.1 are not a strict trade-off.
Human-in-the-loop design
Where exactly does a human belong in the loop? The principled answer falls out of §4.5: gate by undo cost, not by anxiety. If an action is cheaply reversible, gating it buys no safety — the checkpoint already covers it — and spends scarce attention (§4.3). If an action is irreversible, no downstream layer can save you, so a human belongs in front of it regardless of how capable the model is. Deploys, sends, deletes, payments, anything that crosses from the sandbox into the world of other people: gated.
The second design axis is sync versus async. A synchronous gate stalls an agent that works at machine speed against a human who context-switches at meeting speed — the agent idles, the human gets interrupted, both lose. The async pattern borrowed from code review wins at scale: the agent completes everything reversible, parks irreversibles in a review queue (a pull request, a staged deploy, an unsent outbox), and the human disposes of the queue in batches, with full diffs, on their own schedule. The PR is the proven artifact here — agents that end every task at "branch pushed, PR open, CI green" compose with two decades of existing review infrastructure.
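The two axes combine into a simple shape: reversible actions run at once, irreversible ones are parked with their evidence, and a human clears the queue in one batch. The sketch below assumes a boolean `reversible` flag on each action; the class and field names are hypothetical, and "executed" only records the action instead of performing it.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    reversible: bool    # hypothetical flag: can a checkpoint undo this?
    evidence: str = ""  # the diff, command, or URL a reviewer needs


@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def submit(self, action: Action) -> str:
        """Run reversible actions at once; park irreversible ones for a human."""
        if action.reversible:
            self.executed.append(action)
            return "executed"
        self.pending.append(action)
        return "parked"

    def review_batch(self, approve) -> list:
        """One human pass over the queue; approve(action) -> bool. Rejections are dropped."""
        approved = [a for a in self.pending if approve(a)]
        self.executed.extend(approved)
        self.pending.clear()
        return approved


queue = ReviewQueue()
print(queue.submit(Action("edit src/app.py", reversible=True)))
print(queue.submit(Action("deploy to prod", reversible=False, evidence="diff of release branch")))
print(len(queue.pending))
```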
Gates decay under deadline pressure. Every team eventually discovers its humans approving deploys from their phones without reading the diff — at which point the gate is a ritual, not a control. Two honest countermeasures: keep the gated set so small that vigilance is sustainable (single digits per person per day), and make the safe path the fast path — if rollback-capable staged deploys ship in one click and raw deploys need two approvals, entropy works for you instead of against you.
Parallel harnesses: N attempts, one winner
Once a harness makes single runs cheap, disposable, and verifiable, an upgrade becomes nearly free: run N harnesses in parallel on the same task and keep the best result. Git worktrees make the isolation almost costless — N checkouts sharing one object store, one branch each, no interference — and the economics follow the oldest equation in sampling:
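Stated as a sketch, assuming each attempt succeeds independently with probability \(p\), and matching the form the earlier listing labels EQ A4.2 with the retry cap \(k\) replaced by the number of parallel attempts \(N\):

```latex
\[
P(\text{at least one of } N \text{ attempts succeeds}) \;=\; 1 - (1 - p)^{N}
\]
```

For \(p = 0.4\) and \(N = 5\) this gives \(1 - 0.6^{5} \approx 0.92\).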
The honest caveat is correlation: the N attempts come from the same model with the same blind spots, so true success is below the independence curve — if the model misunderstands the task, it misunderstands it N times, in N worktrees, at N× the cost. Production mitigations: vary the approach across attempts (different plans seeded into each prompt), vary temperature, or vary the model itself. And the selector inherits §4.4's failure mode wholesale: best-of-N against a leaky verifier is N chances to find the hack.
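How far below the independence curve correlation can push best-of-N is easy to see in a toy model. The model is an assumption made for illustration, not a measurement: a fraction `misread` of tasks is misunderstood, so every attempt on them fails, and the remaining tasks get independent attempts at a rate chosen so the overall per-attempt success stays at `p`.

```python
def best_of_n_independent(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1 - (1 - p) ** n


def best_of_n_correlated(p: float, n: int, misread: float) -> float:
    """Toy correlation model (an assumption, not a measured one).

    A fraction `misread` of tasks is misunderstood, so every attempt fails.
    On the rest, attempts are independent with success rate p / (1 - misread),
    which keeps the overall per-attempt success rate equal to p.
    """
    p_ok = p / (1 - misread)
    if not 0 <= p_ok <= 1:
        raise ValueError("misread fraction too large for this p")
    return (1 - misread) * (1 - (1 - p_ok) ** n)


for n in (1, 2, 5, 10):
    ind = best_of_n_independent(0.4, n)
    cor = best_of_n_correlated(0.4, n, misread=0.3)
    print(f"N={n:2d}  independent={ind:.3f}  correlated={cor:.3f}")
```

Both curves start at the same point for N = 1, but the correlated one is capped at `1 - misread` no matter how many worktrees are spent, which is the case for varying plans or models across attempts.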
The cage is built; now study what runs inside it. Chapter 05: loop engineering and multi-agent patterns — how the agent's inner loop is structured, when to split work across orchestrators and subagents, and why most multi-agent failures are really context failures wearing a trench coat.
Further reading
- Saltzer, J. H. & Schroeder, M. D. (1975). The Protection of Information in Computer Systems. — the origin of least privilege and fail-safe defaults that govern sandboxing.
- Yee, B. et al. (2009). Native Client: A Sandbox for Portable, Untrusted x86 Native Code. — a canonical study of isolating untrusted execution, the core of a harness sandbox.
- Goldberg, I. et al. (1996). A Secure Environment for Untrusted Helper Applications (Janus). — early system-call interposition, the ancestor of permission-mediated tool access.
- Wu, T. et al. (2022). AI Chains: Transparent and Controllable Human-AI Interaction via Chaining LLM Prompts. — design study of human-in-the-loop checkpoints over LLM steps.
- Amershi, S. et al. (2019). Guidelines for Human-AI Interaction. — eighteen evidence-based design principles for approval, correction, and recovery.
- Anthropic (2025). Claude Code: Best Practices for Agentic Coding. — practitioner guidance on sandboxes, permission gating, and parallel worktrees in a real harness.