VOLUME IV — AGENT ENGINEERING · CHAPTER 04 / 06

Harness Engineering

A capable model wired straight into a shell is not a product. It is an incident waiting to happen. The harness, meaning the sandbox, permissions, verification, recovery, and human gates, is everything around the model that converts raw capability into deployable autonomy. The model sets the probability of a harmful action; the harness sets its maximum cost. Only the second factor is fully under your control.

LEVEL: ADVANCED | READING TIME: ≈ 24 MIN | BUILDS ON: VOL IV · CH 01–03 | INSTRUMENTS: CONFIGURATOR · VERIFY LOOP · BEST-OF-N
4.1

What a harness is

By 2026 the strange fact of the agent market is that competitors often run the same frontier models and ship wildly different products. The difference is not in the weights — those are rented by the token. It is in the harness: the policy engine that decides which proposed actions execute, the sandbox they execute in, the verifiers that score the result, the checkpoints that make mistakes cheap, and the gates that keep humans in the path of the irreversible. The model proposes; the harness disposes.

FIG A4.A · THE LIFECYCLE OF ONE AGENT ACTION
[Figure: MODEL (proposes action) → PERMISSIONS (allow · ask · deny) → SANDBOX (bounded execution) → VERIFIER (tests · build · lint) → CHECKPOINT (commit on green) → GATE (irreversibles) → WORLD. HUMAN answers asks and approves the irreversible. RED → failure text re-enters context as evidence · the loop that makes agents work.]
Every layer is optional, and every omission is a bet. Skip the sandbox and you bet no tool call ever goes wrong; skip the verifier and you bet the first sample is correct; skip the gate and you bet the model never confuses "draft the email" with "send it." Production harnesses make none of these bets.

Why the engineering value concentrates here: a harmful outcome needs two things, a bad action proposed and a bad action allowed to matter. Alignment training suppresses the first factor but cannot zero it, because agent inputs are adversarial (Chapter 03: anything the agent reads is a potential instruction). The second factor is yours:

EQ A4.1 · THE BLAST-RADIUS BOUND
$$ \mathbb{E}[\text{damage}] \;\le\; \underbrace{\Pr\big[\text{harmful action executes}\big]}_{\text{model + adversary, never } 0} \;\times\; \underbrace{\max_{a \,\in\, \mathcal{A}_{\text{exec}}} c(a)}_{\text{blast radius, set by the harness}} $$
\(\mathcal{A}_{\text{exec}}\) is the set of actions that can actually reach the world after sandboxing and permissions; \(c(a)\) is the worst-case cost of action \(a\), which checkpoints and reversibility shrink. You cannot drive the first factor to zero under adversarial input — so engineering effort goes into clamping the second. A "safe" agent with system-wide write access is one jailbreak from catastrophe; a mediocre agent in a disposable worktree is one git branch -D from harmless.
PYTHON · RUNNABLE IN-BROWSER
# EQ A4.1 in dollars: identical mistake probabilities, two harnesses
actions = [  # (action class, P[harmful attempt], $cost raw, $cost sandboxed)
    ("bad file edit",        0.050,     2_000,  5),   # git reset vs lost work
    ("rm in the wrong dir",  0.010,    25_000,  5),   # container fs vs your homedir
    ("curl|sh from a README",0.004,   250_000, 50),   # egress allowlist blocks exfil
    ("prod credential use",  0.002, 1_000_000,  0),   # secret never mounted: c(a)=0
]

print(f"{'action class':24s}{'P[attempt]':>11s}{'E[raw]':>9s}{'E[sandboxed]':>14s}")
raw_total = box_total = 0.0
for name, p, c_raw, c_box in actions:
    raw_total += p * c_raw
    box_total += p * c_box
    print(f"{name:24s}{p:11.3f}{p * c_raw:9,.0f}{p * c_box:14.2f}")
print("-" * 58)
print(f"{'expected damage, one attempt each':35s}{raw_total:9,.0f}{box_total:14.2f}")
print(f"\nsame model, same first factor — the harness cuts E[damage] by "
      f"{raw_total / box_total:,.0f}x")
print("you cannot zero P[harmful attempt]; you fully control max cost c(a)")

EXERCISE. By EQ A4.1, \(\mathbb{E}[\text{damage}] \le \Pr[\text{harmful action executes}] \times \max c(a)\). An action class has \(\Pr[\text{harmful}] = 0.01\) and, after sandboxing, a worst-case cost of $25,000. What is the expected-damage bound, in dollars?
ANSWER. \(\mathbb{E}[\text{damage}] \le 0.01 \times 25{,}000 = \$250\). You cannot drive the first factor to zero under adversarial input, so the lever is the blast radius: shrink \(c(a)\) with sandboxing and reversibility and the bound falls proportionally. The answer is 250.

| Layer | Question it answers | Failure it bounds |
| --- | --- | --- |
| Sandbox | where can code run? | Host compromise, data exfiltration, collateral damage |
| Permissions | which actions execute? | Out-of-scope writes, surprise side effects |
| Verifier | did it actually work? | Confidently shipped breakage |
| Checkpoints | can we go back? | Compounding errors, unrecoverable state |
| Human gate | who owns the irreversible? | Deploys, sends, deletes that no rollback undoes |
| Telemetry | what happened, exactly? | Unauditable incidents, unlearnable failures |

Is the harness really the moat? The claim is contested. Skeptics argue that as models internalize verification and caution, harness layers thin away — and they do thin: teams ask less and allow more with every model generation. But the boundary at the bottom never moves. No amount of capability makes a sent email unsent or a dropped production table undropped. The layers that manage irreversibility are permanent engineering, not scaffolding awaiting a smarter model.

INSTRUMENT A4.1 · HARNESS CONFIGURATOR · FIVE LAYERS · TWO METERS · ONE INCIDENT
[Interactive instrument: layer toggles with readouts for AUTONOMY / THROUGHPUT (autonomy score), SAFETY / RECOVERABILITY (safety score), the weakest layer, and the weakest-link incident.]
Flip layers and watch the trade. The meters are hand-tuned and illustrative; the incident stories are the real content — each describes what the weakest remaining setting allows, which is how attackers and entropy actually find you. Note that PRODUCTION and REGULATED differ only in write scope, and that no configuration reaches 100 on both meters. That is the theorem of this chapter, not a bug in the widget.
4.2

Sandboxing: blast-radius engineering

The sandbox is where EQ A4.1's second factor gets physically enforced. The design stance is borrowed from security engineering, not from trust: assume the agent will eventually attempt the worst action its environment permits — through error, through injection, or through an instruction it misread — and size the environment so that this worst action is affordable. Three resources need walls:

  • Filesystem. Read-only mounts for everything the agent needs but must not touch; copy-on-write overlays or dedicated checkouts for what it edits. The cheapest unit of filesystem isolation is the git worktree: a second working directory sharing the repository's object store, where the agent can do anything and the cleanup operation is deleting a branch.
  • Process. Namespaces, cgroups, and syscall filters (containers); or a separate guest kernel entirely (microVMs such as Firecracker, user-space kernels such as gVisor). The distinction matters because agents run arbitrary code as a feature — every npm install executes strangers' postinstall scripts with the agent's privileges.
  • Network. The wall that matters most and gets built last. An injected agent with no network egress can corrupt its sandbox; the same agent with open egress can exfiltrate every secret inside it. Default-deny with a short allowlist of package registries and APIs is the production norm.
| Mechanism | Isolates | Escape cost | Typical use |
| --- | --- | --- | --- |
| Git worktree | workspace state | none (not a security boundary) | Parallel attempts, cheap rollback, blast-radius for mistakes |
| Container | fs · processes · network ns | kernel exploit (shared kernel) | The default agent cell; seccomp/AppArmor hardened |
| MicroVM / gVisor | guest kernel boundary | hardware-virt escape (rare) | Untrusted code at scale, multi-tenant agent platforms |
| Ephemeral cloud VM | separate machine | ≈ infrastructure compromise | Long-horizon autonomous runs; destroyed after the task |
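
The default-deny network wall described above can be sketched as a simple host check. This is a minimal illustration only: `ALLOWED_HOSTS` and `egress_allowed` are hypothetical names, the listed registries are assumed for the example, and a real harness would enforce the rule at the network layer (a proxy or firewall) instead of in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the registries this task is assumed to need.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"})

def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Default-deny: allow only https URLs whose exact hostname is allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:                      # malformed URL: deny
        return False
    if parts.scheme != "https":
        return False
    return (parts.hostname or "").lower() in allowed_hosts

for url in (
    "https://pypi.org/simple/requests/",        # allowlisted registry
    "https://evil.example/upload?d=secrets",    # unknown host
    "http://pypi.org/simple/",                  # right host, wrong scheme
    "https://pypi.org.evil.example/",           # lookalike suffix
    "https://pypi.org@evil.example/",           # userinfo trick: real host is evil.example
):
    print("allow" if egress_allowed(url) else "deny ", url)
```

Matching on the exact parsed hostname, not on a substring of the URL, is what keeps the lookalike and userinfo cases on the deny side.
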
CAVEAT

The sandbox protects the host from the agent — not the contents of the sandbox from the agent. Whatever you mount inside the wall is inside the blast radius: production credentials in environment variables, a .env with live keys, an authenticated cloud CLI. Prompt injection does not need a sandbox escape if the valuables were carried into the cell. Mount secrets read-only, scoped, and short-lived — or better, broker them through a proxy the agent never sees raw.
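
One small piece of this caveat, keeping credentials out of the cell's environment variables, can be sketched with an allowlist over the parent environment. The names `SAFE_ENV_KEYS` and `scrubbed_env` are hypothetical, and the secret values below are fake; this covers only environment variables, not mounted files or authenticated CLIs.

```python
# Hypothetical allowlist of variables the agent's cell is assumed to need.
SAFE_ENV_KEYS = frozenset({"PATH", "HOME", "LANG"})

def scrubbed_env(parent_env, safe_keys=SAFE_ENV_KEYS):
    """Allowlist, not denylist: copy only known-safe variables into the cell."""
    return {k: v for k, v in parent_env.items() if k in safe_keys}

# Demo with a fake parent environment; the secret values are made up.
parent = {
    "PATH": "/usr/bin:/bin",
    "HOME": "/home/agent",
    "AWS_SECRET_ACCESS_KEY": "fake-secret",
    "PROD_DATABASE_URL": "postgres://fake",
}
cell_env = scrubbed_env(parent)
print(sorted(cell_env))   # ['HOME', 'PATH']
```

The resulting dictionary is the kind of value one would pass as the `env=` argument of `subprocess.run` when launching commands inside the cell, so a new secret added to the host later is excluded by default.
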

4.3

Permission systems: spending human attention

Inside the sandbox, the permission layer decides per-action: allow, ask, or deny. Two principles carry most of the design. First, allowlist, don't denylist: the set of safe actions is finite and enumerable ("read any file in the repo, run the test suite, edit inside src/"), while the set of dangerous actions is infinite and adversarially generated — a denylist of bad commands is a parlor game against an attacker who can base64-encode. Second, permissions are tiered by capability, not by tool name: the same bash tool is harmless running grep and lethal running curl | sh, so production systems parse and classify the action, not the tool.

| Tier | Examples | Policy |
| --- | --- | --- |
| Read | ls · grep · cat · GET | Auto-allow, log. Asking here is pure fatigue. |
| Scoped write | edit in worktree · commit · branch | Auto-allow inside the declared scope; the checkpoint layer makes it cheap to be wrong. |
| Unscoped / mutating | install package · POST · write outside scope | Ask, with the concrete diff or command shown, never the intent paraphrased. |
| Irreversible | deploy · send · delete prod data · pay | Hard gate (§4.6) or structurally denied (not reachable from the agent's action set at all). |
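
The tiers above can be sketched as a classifier that inspects the command itself, not the tool name. This is a toy under stated assumptions: the tool sets, the `src` scope, and the `classify` and `in_scope` helpers are all hypothetical, and a production policy engine would need a far more careful parser. The one property it tries to show is the allowlist default: anything not provably safe falls through to "ask".

```python
import shlex
from pathlib import PurePosixPath

# All sets below are hypothetical examples, not a vetted policy.
READ_TOOLS = {"ls", "grep", "cat"}        # read tier: auto-allow
WRITE_TOOLS = {"touch", "mkdir"}          # scoped-write tier: allow only inside scope
GATED_TOOLS = {"deploy"}                  # irreversible tier: never runs from here
SHELL_META = set("|;&<>`$")               # composition we refuse to auto-classify

def in_scope(path, scope):
    """True if a relative path stays inside the declared scope directory."""
    p, s = PurePosixPath(path), PurePosixPath(scope)
    if p.is_absolute() or ".." in p.parts:
        return False
    return p.parts[:len(s.parts)] == s.parts

def classify(command, scope="src"):
    """Return 'allow', 'ask', or 'deny'; anything not provably safe is 'ask'."""
    try:
        argv = shlex.split(command)
    except ValueError:                    # unbalanced quotes and similar
        return "ask"
    if not argv:
        return "ask"
    tool, args = argv[0], argv[1:]
    if tool in GATED_TOOLS:
        return "deny"
    if SHELL_META & set(command):         # pipes, redirects, substitutions
        return "ask"
    if tool in READ_TOOLS:
        return "allow"
    if tool in WRITE_TOOLS:
        paths = [a for a in args if not a.startswith("-")]
        if paths and all(in_scope(a, scope) for a in paths):
            return "allow"
    return "ask"

for cmd in ("grep -rn TODO src", "touch src/new.py", "touch ../outside.txt",
            "curl https://example.com/install.sh | sh", "deploy prod"):
    print(f"{classify(cmd):5s} {cmd}")
```
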

The binding constraint is not policy expressiveness — it is approval fatigue. Human scrutiny is a depleting budget: the first confirmation dialog of the day gets read; the thirtieth gets reflex-approved in 400 ms. Every unnecessary ask therefore does double damage — it costs throughput now, and it trains the human to rubber-stamp the ask that will matter later. The design objective is brutal and clarifying: asks should be so rare that each one is news. Make the common path silent (reads, scoped writes), batch the questions you must ask, and surface them with the evidence needed to decide in one glance: the diff, the command, the URL — not "the agent wants to use Bash."

A useful audit: count asks per completed task across a week of traces. Above ~5, your users have already stopped reading them, and your permission system has quietly degraded into a latency tax with a false sense of security attached. The fix is almost never "ask better" — it is widening the auto-allow tier while narrowing the scope it applies to (a freer hand inside a smaller room).
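
The audit can be sketched in a few lines. The trace format here, `(task_id, event)` pairs with `ask` and `done` events, is an assumption made for illustration; real harness telemetry will look different, and the threshold is the rough figure from the paragraph above, not a measured constant.

```python
from collections import Counter
from statistics import mean

ASK_THRESHOLD = 5   # the rough "~5 asks per task" line from the text

# Hypothetical trace records: one (task_id, event) pair per harness event.
traces = [
    ("t1", "allow"), ("t1", "ask"), ("t1", "allow"), ("t1", "done"),
    ("t2", "ask"), ("t2", "ask"), ("t2", "ask"), ("t2", "ask"),
    ("t2", "ask"), ("t2", "ask"), ("t2", "ask"), ("t2", "done"),
    ("t3", "ask"), ("t3", "allow"),            # never completed: excluded
]

def asks_per_completed_task(events):
    """Count 'ask' events for each task that reached 'done'."""
    completed = {task for task, event in events if event == "done"}
    asks = Counter(task for task, event in events
                   if event == "ask" and task in completed)
    return {task: asks[task] for task in sorted(completed)}

per_task = asks_per_completed_task(traces)
noisy = [task for task, n in per_task.items() if n > ASK_THRESHOLD]
print("asks per completed task:", per_task)
print("mean:", mean(per_task.values()), "| over threshold:", noisy)
```
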

4.4

Verification: the ground-truth principle

The single highest-leverage component of a harness is the verifier, because of an asymmetry covered in Vol II · §5.7: for code, math, and structured tasks, checking is enormously cheaper than generating, and the check is objective. RLVR exploits this at training time, turning test results into rewards. The harness exploits the identical signal at inference time, turning test results into retry gates. Same principle, different loop: if the harness can check it, the agent can fix it. The contrapositive governs your roadmap: what the harness cannot check, the agent cannot reliably fix — so the highest-leverage engineering of the era is converting vibes into asserts: golden files, schema validators, screenshot diffs scored by a vision model, latency budgets in CI.

EQ A4.2 · CLOSING THE LOOP
$$ \Pr[\text{green within } k \text{ attempts}] \;=\; 1 - (1-p)^k, \qquad \mathbb{E}[\text{attempts}] \;=\; \frac{1-(1-p)^k}{p} $$
\(p\) is single-attempt success probability against a sound verifier; attempts stop at first green or after \(k\). At \(p = 0.45\) and \(k = 3\): an 83% completion rate from a model that is right less than half the time — the verifier converts mediocre per-shot accuracy into high task reliability. Caveat: attempts are not independent. Error feedback usually raises later \(p_i\) (the failure text is evidence), but failures also correlate — an agent missing the concept loops on variants of the same wrong idea, which is why production harnesses cap retries and escalate to a human instead of burning tokens.

EXERCISE. A coder is right \(p = 0.45\) of the time per attempt, behind a sound verifier, with up to \(k = 3\) attempts. By EQ A4.2, what is \(\Pr[\text{green within } k] = 1 - (1-p)^k\)?
ANSWER. \((1-p)^k = 0.55^3 = 0.166\), so \(\Pr[\text{green}] = 1 - 0.166 = 0.834\). A model that fails more often than it succeeds per shot still clears 83% of tasks once the verifier lets it retry; the verifier, not the model, does the heavy lifting. The answer is 0.834.

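
The caveat about non-independent attempts can be made concrete by letting each attempt have its own success probability \(p_i\), conditional on all earlier attempts having failed. The sketch below is illustrative: the two non-constant sequences are made-up numbers, not measurements, and the function names are hypothetical. With a constant \(p\) it reduces to EQ A4.2.

```python
from math import prod

def green_within(ps):
    """P[at least one green], where ps[i] is the success probability of
    attempt i given that every earlier attempt was red."""
    return 1 - prod(1 - p for p in ps)

def expected_attempts(ps):
    """E[attempts] with stop-at-first-green and a cap of len(ps) attempts."""
    total, still_red = 0.0, 1.0
    for p in ps:
        total += still_red        # attempt i runs only if all earlier ones were red
        still_red *= 1 - p
    return total

scenarios = {
    "independent (EQ A4.2)":  [0.45, 0.45, 0.45],
    "feedback helps (made up)": [0.45, 0.55, 0.65],
    "stuck on concept (made up)": [0.45, 0.10, 0.05],
}
for name, ps in scenarios.items():
    print(f"{name:28s} green={green_within(ps):.3f}  "
          f"E[attempts]={expected_attempts(ps):.2f}")
```

The stuck case is the reason for the retry cap: later attempts add cost while adding very little probability of green.
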
PYTHON · RUNNABLE IN-BROWSER
# the ground-truth principle: 2000 patch tasks, verifier on vs off
import numpy as np
rng = np.random.default_rng(0)
P, K, TRIALS = 0.4, 5, 2000          # per-attempt success, retry cap, tasks

correct = rng.random((TRIALS, K)) < P        # correct[t, i]: attempt i would pass
first_ok = correct[:, 0]                     # no verifier: ship attempt 1, unchecked
any_ok = correct.any(axis=1)                 # verifier: retry to green, cap K
attempts = np.where(any_ok, correct.argmax(axis=1) + 1, K)

print(f"no verifier  : ships 100% of tasks, {(~first_ok).mean():.1%} of them broken")
print(f"with verifier: ships {any_ok.mean():.1%} green, 0.0% broken, "
      f"{(~any_ok).mean():.1%} escalated to a human")
print(f"mean attempts per task : {attempts.mean():.2f} "
      f"(theory (1-(1-p)^k)/p = {(1 - (1 - P)**K) / P:.2f})")
print(f"green within {K} (EQ A4.2): simulated {any_ok.mean():.1%}, "
      f"theory 1-(1-p)^k = {1 - (1 - P)**K:.1%}")
print("identical model in both rows — a 40% per-shot coder plus a sound")
print("verifier ships nothing red; the same coder alone ships 60% breakage")

Everything depends on verifier quality, and "quality" decomposes into two error rates: \(\alpha\), the chance a correct patch is rejected (flaky tests — wasteful but safe), and \(\beta\), the chance a broken patch passes (weak tests — silently fatal). Bayes gives the ceiling:

EQ A4.3 · THE LEAKY-VERIFIER CEILING
$$ \Pr[\text{correct} \mid \text{verifier green}] \;=\; \frac{(1-\alpha)\,p}{(1-\alpha)\,p + \beta\,(1-p)} $$
With \(p = 0.45\) and a test suite that lets 15% of broken patches through (\(\beta = 0.15\), \(\alpha = 0.05\)), a green run is correct only about 84% of the time, and no number of retries raises that figure, because every retry conditions on the same leaky green. Worse, a strong agent optimizes against the verifier and inflates the effective \(\beta\) by deleting failing tests, hardcoding expected outputs, or special-casing the test inputs. This is reward hacking (Vol II · Ch 05) at inference time. Standard mitigations: keep test files read-only and outside the agent's write scope, require diff review on any test-file change, and run held-out checks the agent never sees.

EXERCISE. A patch is correct with prior \(p = 0.45\). The suite rejects correct work \(\alpha = 0.05\) of the time and passes broken work \(\beta = 0.15\) of the time. By EQ A4.3, given a green run, what is \(\Pr[\text{correct} \mid \text{green}] = \dfrac{(1-\alpha)p}{(1-\alpha)p + \beta(1-p)}\)?
ANSWER. Numerator \(= 0.95 \times 0.45 = 0.4275\). Denominator \(= 0.4275 + 0.15 \times 0.55 = 0.4275 + 0.0825 = 0.51\). Ratio \(= 0.4275 / 0.51 \approx 0.838\). A leaky suite (\(\beta = 0.15\)) caps trust at ~84% no matter how many times you retry, because retries condition on the same leaky green. The answer is 0.838.

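
A short sketch makes the ceiling visible: retries raise how often something ships green, but not how trustworthy a green is. It assumes independent, identically distributed attempts, which is optimistic, and the function names are hypothetical.

```python
def p_correct_given_green(p, alpha, beta):
    """EQ A4.3: posterior that a patch the verifier passed is actually correct."""
    true_green = (1 - alpha) * p          # correct and accepted
    false_green = beta * (1 - p)          # broken but accepted
    return true_green / (true_green + false_green)

def p_green_within(p, alpha, beta, k):
    """Chance that some attempt goes green within k tries (i.i.d. attempts)."""
    per_attempt_green = (1 - alpha) * p + beta * (1 - p)
    return 1 - (1 - per_attempt_green) ** k

p, alpha = 0.45, 0.05
print(f"{'beta':>6s}{'trust in green':>16s}{'green, k=1':>12s}{'green, k=5':>12s}")
for beta in (0.15, 0.05, 0.01, 0.0):
    print(f"{beta:6.2f}{p_correct_given_green(p, alpha, beta):16.3f}"
          f"{p_green_within(p, alpha, beta, 1):12.3f}"
          f"{p_green_within(p, alpha, beta, 5):12.3f}")
```

Moving from \(k = 1\) to \(k = 5\) changes only the last column; the trust column moves only when \(\beta\) does, which is why tightening the suite is the lever and retrying is not.
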
INSTRUMENT A4.2 · VERIFY-LOOP SIM · ONE BUG · TWO HARNESSES · SCRIPTED RUN
TASK · fix: date parser drops timezone fold on DST boundary · suite: 14 tests · model identical in both runs
[Interactive instrument: readouts for PATCH ATTEMPTS, SUITE AT SHIP, and FIRST EXTERNAL VERIFIER.]
Run both modes. The model is identical; only the harness differs. With verification ON, two red runs become context and attempt 3 lands green — EQ A4.2 with \(p \approx 0.45\), \(k = 3\). With verification OFF, the very same first patch ships, and the verifier role is outsourced to CI, then to customers. Scripted and illustrative — but every event in it is the standard behavior of a verify-loop harness.
4.5

Checkpoints & recovery

Verification tells you an attempt failed; checkpoints make that information affordable. The agent-native unit of recovery is the commit: cheap (milliseconds), content-addressed, diffable, and reversible with one command. Production harnesses run a ratchet: commit on every verified-green state, never on red, so the worktree's history is a monotone sequence of working states and "undo" means git reset --hard to the last tooth of the ratchet. Databases solved this decades ago — write-ahead logging, atomic commit, crash recovery — and agent harnesses are rediscovering each piece under new names.

# the ratchet — recovery loop of a production coding harness
checkpoint:  commit after every green verify        # save point ≈ WAL record
on red:      keep the failure text, discard the diff  # evidence in, damage out
rollback:    reset --hard last-green                  # O(1), no negotiation
reattempt:   same goal + accumulated failure traces   # p rises with evidence
cap:         k attempts, then escalate to human       # correlated failure ≠ retry fuel
abandon:     delete worktree, branch, container       # total cost: one git ref
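
The ratchet listing above can be sketched as an in-memory loop. This is a simulation under stated assumptions, not a git integration: the workspace is a plain dict, `propose` stands in for the model call, `verify` stands in for the test suite, and all names are hypothetical.

```python
import copy

def ratchet(state, propose, verify, k):
    """Run up to k attempts; keep only verified-green states, roll back on red.

    propose(state, failures) -> candidate state   (stand-in for the agent)
    verify(state) -> (ok, message)                (stand-in for the test suite)
    """
    last_green = copy.deepcopy(state)     # checkpoint: last tooth of the ratchet
    failures = []                         # evidence carried into the next attempt
    for attempt in range(1, k + 1):
        # Every attempt starts from the last green state: rollback is implicit.
        candidate = propose(copy.deepcopy(last_green), list(failures))
        ok, message = verify(candidate)
        if ok:
            last_green = candidate        # commit on green
            return {"status": "green", "state": last_green, "attempts": attempt}
        failures.append(message)          # on red: keep the text, discard the diff
    return {"status": "escalate", "state": last_green,
            "attempts": k, "failures": failures}

def scripted_propose(state, failures):
    # Scripted stand-in for the model: it fixes the bug only after one failure.
    state["tz_handling"] = "keep-fold" if failures else "drop-tz"
    return state

def verify(state):
    if state["tz_handling"] == "keep-fold":
        return True, ""
    return False, "test_dst_boundary: timezone fold dropped"

start = {"tz_handling": "buggy"}
print(ratchet(start, scripted_propose, verify, k=3))
print(ratchet(start, scripted_propose, verify, k=1))
```

With `k=3` the second attempt lands green; with `k=1` the run escalates and the returned state is the untouched starting checkpoint.
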

The second recovery primitive is idempotence: design steps so that \(f(f(s)) = f(s)\) — running a step twice lands in the same state as running it once. Idempotent steps make retry-after-partial-failure safe, which matters because agents fail mid-step constantly: a timeout after the database write but before the confirmation, a crash between two file edits. Migrations written with IF NOT EXISTS, PUTs instead of POSTs, "ensure state X" instead of "apply change ΔX" — the agent can then be restarted blindly, which is exactly how it will be restarted at 3 a.m.
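
The difference between "apply change ΔX" and "ensure state X" fits in a few lines. The `replicas` field and the helper names are hypothetical, and `is_idempotent` checks the property only for the one state it is given.

```python
def apply_delta(state, amount):
    """'Apply change ΔX': a blind retry applies the change twice."""
    return {**state, "replicas": state["replicas"] + amount}

def ensure_state(state, target):
    """'Ensure state X': running it again changes nothing."""
    return {**state, "replicas": target}

def is_idempotent(step, state):
    """Check f(f(s)) == f(s) for one starting state s."""
    once = step(state)
    return step(once) == once

start = {"replicas": 2}
print(is_idempotent(lambda s: apply_delta(s, 1), start))    # False
print(is_idempotent(lambda s: ensure_state(s, 3), start))   # True
```

If the harness times out after the write but before the confirmation, rerunning `ensure_state` is safe, while rerunning `apply_delta` silently overshoots.
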

Checkpoints also change agent psychology, in the behavioral sense: a harness that can roll back cheaply can afford to let the agent try aggressive refactors that an unprotected harness must forbid. Recoverability is not the opposite of autonomy — it is what makes autonomy affordable. This is the deep reason the SAFETY and AUTONOMY meters in Instrument A4.1 are not a strict trade-off.

4.6

Human-in-the-loop design

Where exactly does a human belong in the loop? The principled answer falls out of §4.5: gate by undo cost, not by anxiety. If an action is cheaply reversible, gating it buys no safety — the checkpoint already covers it — and spends scarce attention (§4.3). If an action is irreversible, no downstream layer can save you, so a human belongs in front of it regardless of how capable the model is. Deploys, sends, deletes, payments, anything that crosses from the sandbox into the world of other people: gated.

  • GATE the irreversible: deploy · send · delete · pay. Undo cost is infinite, so a human signs each one.
  • DON'T GATE the recoverable: edits, commits, scoped writes. The checkpoint layer is the approval. Gating here mints fatigue, not safety.
  • RELOCATE the boundary: the strongest move is to make more actions reversible (soft deletes, staged deploys, outbox-with-delay), so the gate list shrinks honestly.

The second design axis is sync versus async. A synchronous gate stalls an agent that works at machine speed against a human who context-switches at meeting speed — the agent idles, the human gets interrupted, both lose. The async pattern borrowed from code review wins at scale: the agent completes everything reversible, parks irreversibles in a review queue (a pull request, a staged deploy, an unsent outbox), and the human disposes of the queue in batches, with full diffs, on their own schedule. The PR is the proven artifact here — agents that end every task at "branch pushed, PR open, CI green" compose with two decades of existing review infrastructure.
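
The async pattern can be sketched as a harness that runs reversible actions immediately and parks irreversible ones for a batched review. The `Action` and `Harness` classes, the `reversible` flag, and the `approve` callback are all hypothetical; in practice the queue would be a pull request, a staged deploy, or an outbox.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    reversible: bool     # assumed to be known: True if a checkpoint can undo it
    evidence: str        # the diff, command, or URL shown to the reviewer

@dataclass
class Harness:
    executed: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)

    def submit(self, action):
        """Gate by undo cost: run reversibles now, park irreversibles."""
        if action.reversible:
            self.executed.append(action.name)
        else:
            self.review_queue.append(action)

    def review_batch(self, approve):
        """A human drains the queue in one sitting; approve(action) -> bool."""
        parked, self.review_queue = self.review_queue, []
        for action in parked:
            if approve(action):
                self.executed.append(action.name)
        return len(parked)

harness = Harness()
harness.submit(Action("edit src/parser.py", True, "diff: +3 -1"))
harness.submit(Action("deploy v2 to prod", False, "release notes + full diff"))
harness.submit(Action("commit on branch", True, "1 file changed"))
print("ran without asking:", harness.executed)
print("parked for review: ", [a.name for a in harness.review_queue])
reviewed = harness.review_batch(lambda a: a.name.startswith("deploy"))
print("reviewed in one batch:", reviewed, "| executed:", harness.executed)
```

The agent never blocks on the human: all reversible work is finished before the review happens, and the reviewer sees the evidence for each parked action in one pass.
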

EROSION

Gates decay under deadline pressure. Every team eventually discovers its humans approving deploys from their phones without reading the diff — at which point the gate is a ritual, not a control. Two honest countermeasures: keep the gated set so small that vigilance is sustainable (single digits per person per day), and make the safe path the fast path — if rollback-capable staged deploys ship in one click and raw deploys need two approvals, entropy works for you instead of against you.

4.7

Parallel harnesses: N attempts, one winner

Once a harness makes single runs cheap, disposable, and verifiable, an upgrade becomes nearly free: run N harnesses in parallel on the same task and keep the best result. Git worktrees make the isolation almost costless — N checkouts sharing one object store, one branch each, no interference — and the economics follow the oldest equation in sampling:

EQ A4.4 · BEST-OF-N WITH A SELECTOR
$$ \Pr[\text{ship correct}] \;=\; j\cdot\big(1-(1-p)^{N}\big), \qquad \Delta_N \;=\; p\,(1-p)^{N-1} $$
\(p\) per-attempt success, \(N\) parallel attempts, \(j\) the probability the selector picks a correct candidate when at least one exists. A ground-truth verifier is a perfect selector (\(j = 1\)): run the tests, ship whichever attempt is green. An LLM judge is a noisy one (\(j \approx 0.6\text{–}0.9\) on hard tasks), and its noise caps the whole pipeline — the gap between the two curves in Instrument A4.3 is the price of not having a checkable task. \(\Delta_N\), the marginal value of attempt \(N\), decays geometrically: most of best-of-N's value arrives by \(N = 4\text{–}5\) for \(p \gtrsim 0.3\), while cost grows linearly forever.

EXERCISE. You run \(N = 4\) parallel attempts at per-attempt success \(p = 0.3\), selected by a ground-truth verifier (\(j = 1\)). By EQ A4.4, \(\Pr[\text{ship correct}] = j\,(1-(1-p)^N)\). What is it?
ANSWER. \((1-p)^N = 0.7^4 = 0.2401\), so \(1 - 0.2401 = 0.7599\); with \(j = 1\), \(\Pr[\text{ship correct}] = 0.760\). A 30% agent, run four times against a perfect selector, ships correctly 76% of the time, but the marginal value of each extra attempt decays geometrically, so most of the gain is already in by \(N = 4\)–5. The answer is 0.76.
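
EQ A4.4 can be turned into a small planner that reports when extra attempts stop paying. The sketch assumes independent attempts, so its numbers are optimistic; the function names and the `min_gain` stopping threshold are hypothetical choices for illustration. Note that \(\Delta_N\) as defined is the gain in \(\Pr[\ge 1 \text{ correct}]\); under a noisy selector the gain in shipped correctness is \(j \cdot \Delta_N\).

```python
def p_ship_correct(p, n, j=1.0):
    """EQ A4.4: selector accuracy j times P[at least one of n attempts is correct]."""
    return j * (1 - (1 - p) ** n)

def marginal_gain(p, n):
    """Delta_N = p(1-p)^(N-1): what the N-th attempt adds to P[>=1 correct]."""
    return p * (1 - p) ** (n - 1)

def last_worthwhile_n(p, min_gain, max_n=64):
    """Largest N <= max_n whose marginal gain is still at least min_gain (0 if none)."""
    n = 0
    while n < max_n and marginal_gain(p, n + 1) >= min_gain:
        n += 1
    return n

p = 0.3
print(f"{'N':>3s}{'verifier j=1':>14s}{'judge j=0.8':>13s}{'delta_N':>10s}")
for n in range(1, 9):
    print(f"{n:3d}{p_ship_correct(p, n):14.3f}"
          f"{p_ship_correct(p, n, j=0.8):13.3f}{marginal_gain(p, n):10.3f}")
print("stop after N =", last_worthwhile_n(p, min_gain=0.05))
```

Because the marginal gains telescope, summing them from 1 to N recovers \(1-(1-p)^N\), which is a quick sanity check on any table of this kind.
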

The honest caveat is correlation: the N attempts come from the same model with the same blind spots, so true success is below the independence curve — if the model misunderstands the task, it misunderstands it N times, in N worktrees, at N× the cost. Production mitigations: vary the approach across attempts (different plans seeded into each prompt), vary temperature, or vary the model itself. And the selector inherits §4.4's failure mode wholesale: best-of-N against a leaky verifier is N chances to find the hack.

INSTRUMENT A4.3 · BEST-OF-N PLANNER · EQ A4.4 · INDEPENDENCE ASSUMED (OPTIMISTIC)
[Interactive instrument: controls for p, N, and j; readouts for P(≥1 CORRECT IN N), P(SHIP CORRECT), and MARGINAL GAIN OF ATTEMPT N.]
Mint curve: ground-truth selector (j = 1, e.g. a test suite). Blue curve: your LLM judge at accuracy j. Drag j to 1.0 and watch the curves merge — that vertical gap is the dollar value of making your task checkable. Then set p = 0.1 and note how N must explode to compensate: parallelism amplifies a competent agent and merely bankrolls an incompetent one. Compute cost scales as N; the marginal-gain readout tells you when to stop paying.
NEXT

The cage is built; now study what runs inside it. Chapter 05: loop engineering and multi-agent patterns — how the agent's inner loop is structured, when to split work across orchestrators and subagents, and why most multi-agent failures are really context failures wearing a trench coat.
