07 · Evaluation & The Prompt Lab

7.1

Evals for prompts: 20 examples beat 0

The standard failure mode of prompt work looks like this: edit the prompt, eyeball one output, decide it "feels better," ship. That is an experiment with $n = 1$, no control, and a judge — you — who wants the change to work. The difference between prompt tinkering and prompt engineering is not cleverness; it is a number that moves when the prompt improves. Three eval designs cover nearly every case:

Design	When	Scoring	Output
Golden set	The output is checkable: a label, a number, an extraction, a schema, code that runs	exact match · contains · regex · schema-valid · tests pass	pass rate
Pairwise A/B	No single right answer — emails, summaries, explanations	human or LLM judge picks the better of two outputs for the same input	win rate
Rubric scoring	Quality has named dimensions you can argue about	per-criterion verdicts: accurate? cited? under length? right register?	per-criterion pass rates

Build the golden set from real traffic, not invented inputs: pull 20–50 cases, deliberately oversample the weird tail (ambiguous tickets, hostile users, inputs in the wrong language), and write down the expected behavior at collection time — deciding what "correct" means after seeing the model's answer is how wishful grading creeps in. For rubric scoring, prefer many small boolean criteria over one holistic 1–10: "did it cite the source span? Y/N" is stable across graders; "rate the quality" is not.
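A golden-set harness can be very small. The sketch below is one possible shape, not a prescribed one: the `GoldenCase` record, the scorer names, and the `fake_model` stand-in are hypothetical, and the three tickets are invented purely to make the block runnable (a real set comes from traffic, as described above).

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt_input: str
    expected: str   # written down at collection time, before any model output is seen
    scorer: str     # "exact", "contains", or "regex"

def passes(case: GoldenCase, output: str) -> bool:
    """Apply the case's own checkable scoring rule to one model output."""
    out = output.strip()
    if case.scorer == "exact":
        return out == case.expected
    if case.scorer == "contains":
        return case.expected.lower() in out.lower()
    if case.scorer == "regex":
        return re.search(case.expected, out) is not None
    raise ValueError(f"unknown scorer: {case.scorer}")

def pass_rate(cases, model):
    """Run every case through `model` (any callable str -> str) and score it."""
    results = {c.case_id: passes(c, model(c.prompt_input)) for c in cases}
    return sum(results.values()) / len(results), results

# Illustrative cases only; real ones are pulled from traffic.
CASES = [
    GoldenCase("t1", "Where is my refund?", "billing", "exact"),
    GoldenCase("t2", "App crashes on login", "bug", "contains"),
    GoldenCase("t3", "Order 44812 arrived wrong", r"#?44812", "regex"),
]

def fake_model(text: str) -> str:
    """Stand-in for a real model call (assumption: canned answers)."""
    canned = {
        "Where is my refund?": "billing",
        "App crashes on login": "Label: BUG (high priority)",
        "Order 44812 arrived wrong": "I could not find an order number.",
    }
    return canned[text]

rate, per_case = pass_rate(CASES, fake_model)
print(f"pass rate: {rate:.2f}")
print(per_case)
```

Keeping the per-case results next to the aggregate matters: the pass rate says whether the prompt moved, and the per-case map says which case moved it.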

Why does a set as small as 20 matter so much? Because the information gain from $0 \to 20$ examples is the largest you will ever get — it converts "I think this is better" into "this broke 4 of 20 cases." But binomial noise sets a floor on what small sets can resolve:

EQ P7.1 — THE NOISE FLOOR OF A GOLDEN SET $$ \widehat{p} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left[\,\text{pass}_i\,\right], \qquad \mathrm{SE}\!\left(\widehat{p}\right) = \sqrt{\frac{\widehat{p}\,(1-\widehat{p})}{n}}, \qquad \text{95\% CI} \;\approx\; \widehat{p} \pm 1.96\,\mathrm{SE} $$

At $n = 20$ and $\widehat{p} = 0.7$, the standard error is $\approx 0.10$, so the 95% confidence interval spans about ±20 points. Twenty examples will catch a collapse (90% → 55%) and rank clearly different prompts; they cannot certify a 5-point refinement. Inverting the formula, pinning the pass rate to within $\pm h$ needs roughly $n \approx (1.96/h)^2\, p(1-p)$ cases; at $p = 0.7$, ±10 points wants ~80 and ±5 points wants ~320. Twenty examples beat zero by more than a thousand beat twenty: start small, and grow the set as the deltas you chase shrink.
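The arithmetic of EQ P7.1 is short enough to keep next to your eval. The sketch below uses only the standard library; the function names are illustrative, and the default $p = 0.7$ is simply the running example from the text, not a recommended value.

```python
import math

def standard_error(p_hat: float, n: int) -> float:
    """Binomial standard error of a pass rate (EQ P7.1)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def normal_ci(p_hat: float, n: int, z: float = 1.96):
    """Textbook normal 95% interval around the observed pass rate."""
    half = z * standard_error(p_hat, n)
    return p_hat - half, p_hat + half

def cases_needed(half_width: float, p: float = 0.7, z: float = 1.96) -> int:
    """Invert EQ P7.1: cases needed for an interval of +/- half_width."""
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

se20 = standard_error(0.7, 20)
lo, hi = normal_ci(0.7, 20)
n_10pt = cases_needed(0.10)
n_5pt = cases_needed(0.05)

print(f"n=20, p=0.7: SE = {se20:.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
print(f"cases for +/-10 points: {n_10pt}")
print(f"cases for +/-5 points : {n_5pt}")
```

Rounding up gives 81 and 323 cases, which is where the ~80 and ~320 quoted above come from. Halving the error bar costs four times the examples.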

A golden set of $n = 21$ examples gives a pass rate of $\widehat{p} = 0.7$. Using EQ P7.1, what is the standard error $\mathrm{SE}(\widehat{p})$?

$\mathrm{SE} = \sqrt{\dfrac{\widehat{p}(1-\widehat{p})}{n}} = \sqrt{\dfrac{0.7\times 0.3}{21}} = \sqrt{\dfrac{0.21}{21}} = \sqrt{0.01} =$ 0.10. The 95% CI is $\pm 1.96\times 0.10 \approx \pm 20$ points — wide enough to catch a collapse but never a 5-point refinement. That is the noise floor a single green run ignores.

There is a cheap statistical upgrade most teams skip: run both prompts on the same items and compare per-item, rather than comparing two independent pass rates. Item difficulty is shared noise, and pairing cancels it:

EQ P7.2 — WHY PAIRED BEATS POOLED $$ \mathrm{Var}\!\left(\widehat{p}_A - \widehat{p}_B\right) \;=\; \frac{1}{n}\Big[\, p_A(1-p_A) + p_B(1-p_B) - 2\,\mathrm{Cov}(X_A, X_B) \,\Big] $$

$X_A, X_B$ are pass/fail indicators for the two prompts on the same item. Hard items sink both prompts and easy items lift both, so $\mathrm{Cov} > 0$ — and the subtraction shrinks the variance of the difference, often by half or more. The items that decide the comparison are the discordant ones (A passes where B fails, or vice versa); a sign test over just those pairs (McNemar's logic) is the correct significance check, and it is three lines of arithmetic.
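The sign test over discordant pairs can be sketched with the standard library alone. This is the exact binomial form of McNemar's test; the function name and the 20-item verdict lists are made up for illustration.

```python
from math import comb

def paired_sign_test(a_pass, b_pass):
    """Exact two-sided sign test over discordant pairs (McNemar's logic).

    a_pass, b_pass: one bool per item, same item order for both prompts.
    Returns (a_only, b_only, p_value).
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("both prompts must be scored on the same items")
    a_only = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    b_only = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    m = a_only + b_only          # concordant items carry no information
    if m == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under "no difference", each discordant pair is a fair coin flip.
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return a_only, b_only, min(1.0, 2 * tail)

# Illustrative verdicts: A passes 12 of 20, B passes 16 of 20.
a_pass = [True] * 12 + [False] * 8
b_pass = [True] * 11 + [False] + [True] * 5 + [False] * 3

a_only, b_only, p_value = paired_sign_test(a_pass, b_pass)
print(f"A-only passes: {a_only}, B-only passes: {b_only}, p = {p_value:.3f}")
```

In this made-up example B leads by 20 points, yet only six items disagree and the split is 5 to 1, so the p-value is about 0.22. The headline gap looks decisive; the discordant pairs say it is not yet.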

So put a number on it. Forty paired comparisons is a typical first eval; the cell below counts B's wins over A and wraps the rate in a Wilson confidence interval — the small-sample-correct version of EQ P7.1, which (unlike the textbook normal interval) never runs off the end of $[0,1]$ and stays honest near the edges. Read whether the interval clears 50%: if it doesn't, you have not yet earned the right to call B better.

PYTHON · RUNNABLE IN-BROWSER

# Paired A/B win rate over 40 comparisons with a Wilson 95% confidence interval
import numpy as np
rng = np.random.default_rng(0)
n, p_true = 40, 0.65                    # comparisons, B's true per-item win prob over A
wins = rng.random(n) < p_true          # one paired verdict per item (did B beat A?)
k = int(wins.sum()); p = k / n         # observed win rate
z = 1.96                                # 95%

# Wilson score interval -- correct for small n and rates near the edges
center = p + z*z / (2*n)
half = z * np.sqrt(p*(1-p)/n + z*z / (4*n*n))
lo, hi = (center - half) / (1 + z*z/n), (center + half) / (1 + z*z/n)
naive = z * np.sqrt(p*(1-p)/n)         # textbook normal half-width, for contrast

print(f"B won {k}/{n} paired comparisons -> win rate {100*p:.1f}%")
print(f"Wilson 95% CI : [{100*lo:.1f}%, {100*hi:.1f}%]  (width {100*(hi-lo):.1f} pts)")
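# Compute the verdict rather than hard-coding it, so the message stays true when n is edited
verdict = "inside: tie not ruled out" if (p - naive) <= 0.5 <= (p + naive) else "outside: tie ruled out"
print(f"naive normal  : {100*p:.1f}% +/- {100*naive:.1f}  -- 50% is {verdict}")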

edits are live — raise n to 200 and watch the interval clear 50%

B wins 23 of 40 — a 57.5% win rate that looks like a result. The Wilson interval is [42.2%, 71.5%]: nearly 30 points wide, and it straddles 50%. With true skill of 65%, forty comparisons still cannot rule out a coin flip. This is EQ P7.1's noise floor in its most common operational form — the wide bar is why a single green run does not close the ticket. Push n to 200 and the interval finally lifts off 50%.

In a paired A/B eval, prompt B beats prompt A on $23$ of $40$ head-to-head comparisons. What is B's observed win rate?

$23 / 40 = 0.575 =$ 57.5%. It looks like a win — but the Wilson 95% interval around it is roughly [42%, 72%] and straddles 50%, so forty comparisons have not yet earned the right to call B better. Grow $n$ until the interval clears the coin flip.

CONTAMINATION

If you iterate against your golden set, you become the overfit. Every time you read a failing case and patch the prompt for it, that case stops measuring generalization. Split even tiny sets — 30 for development, 10 you only touch before shipping — and refresh the holdout from live traffic on a schedule. This is Vol II's eval-decontamination discipline (Vol II · Ch 04) shrunk to prompt scale; the failure mode is identical, only faster.
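One way to keep the holdout honest is to make membership a pure function of the case ID, so nobody chooses which cases are held out and the assignment does not reshuffle when new traffic is added. The sketch below is an assumption about how you might do that, not part of the discipline itself: the `bucket` helper and the `ticket-NNN` IDs are hypothetical, and a hash threshold lands near the 30/10 ratio rather than exactly on it.

```python
import hashlib

def bucket(case_id: str, holdout_fraction: float = 0.25) -> str:
    """Assign a case to 'dev' or 'holdout' from a stable hash of its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64   # uniform in [0, 1)
    return "holdout" if u < holdout_fraction else "dev"

# Hypothetical IDs standing in for 40 collected cases.
case_ids = [f"ticket-{i:03d}" for i in range(40)]
dev = [c for c in case_ids if bucket(c) == "dev"]
holdout = [c for c in case_ids if bucket(c) == "holdout"]

print(f"dev: {len(dev)} cases, holdout: {len(holdout)} cases")
```

Because the assignment depends only on the ID, refreshing the holdout from live traffic is just a matter of adding new IDs; existing cases never migrate from holdout into development.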

7.2

LLM-as-judge — and the judge's rap sheet

Pairwise and rubric evals need a grader, and humans do not scale to nightly CI. The fix is to hire a model: show a judge model the input, the two candidate outputs (or one output plus a rubric), and ask for a verdict. The canonical result (MT-Bench, Zheng et al., 2023) is that a frontier judge agrees with human majority preference roughly 80–85% of the time — about as often as humans agree with each other. That made LLM-as-judge the default instrument of applied evaluation. It also imported a defendant's worth of biases:

Bias	Symptom	Mitigation
Position bias	The same pair, presented in the other order, gets a different verdict; most judges systematically favor one slot (direction varies by model)	judge every pair in both orders — average the verdicts, or count only wins that survive the swap (EQ P7.3)
Length bias	Longer answers win regardless of content; verbosity reads as effort	length-controlled win rates (the AlpacaEval 2 fix), explicit rubric line "do not reward length", compare at matched lengths
Self-preference	Judges score their own family's outputs higher — they recognize and reward their own style	judge from a different model family; or a panel of judges across families, majority vote
Style over substance	Confident tone, headers, and bullet polish outscore a correct but plain answer; errors inside fluent prose go unnoticed	give the judge a reference answer to compare against; grade correctness as its own criterion, isolated from presentation

The protocol that survives these biases is boring and effective: fixed rubric, reference answer when one exists, both orders, low temperature, verdict in a parseable field. Asking the judge to justify before deciding makes verdicts easier to audit; evidence that it makes them more accurate is mixed — treat the justification as a debugging artifact, not a guarantee (the faithfulness caveat of Ch 04 applies to judges too).

EQ P7.3 — ORDER-SWAP DEBIASING & THE FLIP RATE $$ \widehat{w}(A) \;=\; \tfrac{1}{2}\Big[\, \widehat{w}_{A\text{-first}} + \widehat{w}_{A\text{-second}} \,\Big], \qquad \Phi \;=\; \Pr\!\Big[\, \text{verdict}_{AB} \neq \text{verdict}_{BA} \,\Big] $$

$\widehat{w}$ is A's win rate; averaging the two presentation orders cancels position bias to first order. $\Phi$, the flip rate, is the audit you can run with zero ground truth: judge every pair twice with the order swapped and count changed verdicts. A flip means the judge read the seating chart, the dice, or both — not the quality. MT-Bench's stricter variant declares a win only if it survives both orders and calls everything else a tie.
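Given the two sets of verdicts, EQ P7.3 and the stricter both-orders rule reduce to counting. The sketch below assumes binary verdicts (no judge-declared ties) and uses an illustrative function name and a made-up set of ten pairs.

```python
def order_swap_report(a_first_wins, a_second_wins):
    """Summarize one set of pairs judged in both presentation orders.

    a_first_wins[i]  : did A win pair i when A was shown first?
    a_second_wins[i] : did A win pair i when A was shown second?
    """
    if len(a_first_wins) != len(a_second_wins):
        raise ValueError("both orders must cover the same pairs")
    n = len(a_first_wins)
    pairs = list(zip(a_first_wins, a_second_wins))
    w_first = sum(a_first_wins) / n
    w_second = sum(a_second_wins) / n
    return {
        "w_a_first": w_first,
        "w_a_second": w_second,
        "w_averaged": 0.5 * (w_first + w_second),          # EQ P7.3, left side
        "flip_rate": sum(x != y for x, y in pairs) / n,    # EQ P7.3, right side
        # Stricter variant: a win counts only if it survives both orders.
        "strict_a_wins": sum(x and y for x, y in pairs),
        "strict_b_wins": sum((not x) and (not y) for x, y in pairs),
        "strict_ties": sum(x != y for x, y in pairs),
    }

# Illustrative verdicts for ten pairs.
a_first = [True] * 8 + [False] * 2
a_second = [True] * 5 + [False] * 5

report = order_swap_report(a_first, a_second)
for name, value in report.items():
    print(f"{name:>14}: {value}")
```

Reporting the averaged rate and the strict tally together is a cheap honesty check: when they disagree sharply, most of the apparent preference lives in the pairs that flipped.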

Judged A-first, prompt A wins $80\%$ of pairs; judged A-second, A wins only $50\%$. Using the order-swap debias in EQ P7.3, what is A's position-corrected win rate $\widehat{w}(A)$?

$\widehat{w}(A) = \tfrac{1}{2}\big(0.80 + 0.50\big) = \tfrac{1}{2}(1.30) =$ 0.65. The 30-point gap between the two orders was pure seating bias; averaging cancels it to first order, leaving a position-corrected win rate of ~65% (an estimate, not the ground truth: judge noise still blurs it toward 50%).

INSTRUMENT P7.1 — JUDGE BIAS DEMO · SEEDED · 3,000 PAIRS PER SETTING · EQ P7.3

Controls (defaults): TRUE QUALITY GAP Δ (A − B) = +0.30 · POSITION BIAS β = 0.35 · JUDGE NOISE σ = 0.50 · PROTOCOL (what your eval reports)

Readouts: REPORTED WIN%(A) · TRUE WIN%(A) · FLIP RATE Φ ON SWAP · REPORTED − TRUE

A toy judge: on each item, A's real quality edge is Δ plus item-to-item spread; the judge adds β to whichever response it reads first, plus fresh noise σ per reading. At the defaults A genuinely wins ≈ 70% of items — but judged A-first the eval reports ≈ 81%, judged A-second ≈ 48%. The seating chart outvotes the quality gap. Toggle to SWAP & AVERAGE: the asymmetry cancels, though judge noise still compresses the gap toward 50% — debiasing fixes the tilt, not the blur. Set Δ = 0 and watch a pure-bias "preference" appear from nothing. Note Φ stays above zero even at β = 0: independent re-reads disagree near the boundary, so the flip rate measures total verdict instability — position bias is its systematic part, visible as the gap between the two single-order bars.

Calibrate the judge itself. Before trusting any judge pipeline, run it on 20–30 pairs you have hand-labeled and check agreement; published judge–human agreement transfers poorly across domains. And reuse the noise-floor logic of EQ P7.1: a judge-scored win rate over 50 pairs carries ±14-point error bars at $w \approx 0.5$ — wide enough to swallow most prompt tweaks.
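The calibration check is a comparison of two label lists plus the same error bar as EQ P7.1. In the sketch below the function name and the 25 labels are invented for illustration; the labels stand in for your own hand-labeled pairs and the judge's verdicts on them.

```python
import math

def judge_agreement(judge_labels, human_labels, z: float = 1.96):
    """Agreement rate between judge and human verdicts, with a normal error bar."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must cover the same pairs")
    n = len(human_labels)
    disagreements = [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]
    rate = 1 - len(disagreements) / n
    half = z * math.sqrt(rate * (1 - rate) / n)
    return rate, half, disagreements

# Illustrative labels: which answer ("A" or "B") won each of 25 pairs.
human = ["A"] * 15 + ["B"] * 10
judge = list(human)
for i in (0, 1, 2):
    judge[i] = "B"       # judge preferred B where the human preferred A
for i in (15, 16):
    judge[i] = "A"       # and the reverse

rate, half, disagreements = judge_agreement(judge, human)
print(f"agreement: {100*rate:.0f}% +/- {100*half:.0f} points on {len(human)} pairs")
print(f"pairs to read by hand: {disagreements}")
```

With 25 pairs, an observed 80% agreement still carries an error bar of roughly 16 points, so the list of disagreeing pairs is usually more informative than the rate: reading them shows whether the judge is wrong in a pattern.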

The flip rate is not just a slider; you can compute it against your own judge in a few lines. The cell below builds a toy judge that adds a fixed boost to whichever answer it reads first, then scores the same pairs in both orders. The naive eval (A always first) reports a win rate inflated by the boost; swapping and averaging cancels the boost; and the flip rate counts how often the verdict changed under nothing but a swap.

PYTHON · RUNNABLE IN-BROWSER

# LLM-judge position bias: judge the same pair in both orders, count the flips
import numpy as np
rng = np.random.default_rng(0)
N, b, sigma = 2000, 0.6, 0.8           # items, first-slot boost, judging noise
qA = rng.normal(0.15, 1.0, N)          # latent quality of answers A and B, per item
qB = rng.normal(0.00, 1.0, N)

def a_wins(a_is_first):                 # +b goes to whichever answer is shown first
    sa = qA + (b if a_is_first else 0) + sigma * rng.standard_normal(N)
    sb = qB + (b if not a_is_first else 0) + sigma * rng.standard_normal(N)
    return sa > sb

naive = a_wins(True)                    # A always shown first -- the lazy eval
o1 = a_wins(True)                       # order 1: A first
o2 = a_wins(False)                      # order 2: B first
flip = np.mean(o1 != o2)               # the two orders disagree on who won

print(f"naive  win%(A), A-always-first  : {100*naive.mean():.1f}%")
print(f"swapped win%(A), order-averaged  : {100*0.5*(o1.mean()+o2.mean()):.1f}%")
print(f"verdict FLIP rate on order swap  : {100*flip:.1f}%   <- seating + judge noise, not quality")

edits are live — set b = 0 and watch the flip rate survive

Naive reports A winning 66%; order-averaged says 53% — a thirteen-point phantom edge, pure seating. The 36.6% flip rate is the alarm: more than a third of verdicts changed under nothing but a swap. Set b = 0 and the naive and averaged rates converge, but the flip rate stays well above zero — independent re-reads disagree near the boundary regardless. Position bias is the systematic slice of that instability, and it is the slice swapping removes.

You judge $20$ candidate pairs twice — once in each presentation order. On $6$ of the pairs the verdict changes when the order is swapped. What is the flip rate $\Phi$ (EQ P7.3)?

$\Phi = 6 / 20 = 0.30 =$ 30%. Nearly a third of verdicts changed under nothing but a swap, so they reflect seating, judge noise, or both rather than quality. You can measure this flip rate with zero ground truth, just by swapping and re-judging. High $\Phi$ means trust the both-orders protocol, not any single run.

7.3

Prompt versioning & regression: prompts are code

A production prompt is configuration that controls live system behavior — yet teams that would never push code without review and CI routinely edit prompts in a dashboard textbox at 6 p.m. on a Friday. The fix is to grant prompts the full citizenship of code:

# prompts-as-code — the minimum viable discipline
repo:      prompts/support-triage/v3.2.1.md  + CHANGELOG.md
semver:    MAJOR task change · MINOR scaffold/technique change · PATCH wording
pin:       model ID + temperature + max_tokens versioned WITH the prompt —
           a prompt is only reproducible as (text, model, params)
gate:      CI runs the golden set on every prompt diff; merge blocked when
           the score drops by more than the noise floor (EQ P7.1 — know yours)
review:    prompt diffs get human review; "harmless rewording" is how
           load-bearing constraints die
canary:    new version to 5% of traffic; compare online metrics before 100%
re-eval:   every model version bump re-runs ALL prompt evals —
           the prompt didn't change, but its interpreter did

Two details earn their lines. First, the noise floor: before a gate can blame a diff, you must know how much the score wobbles when nothing changes — run the unchanged prompt through the eval twice at your production temperature and record the spread. Gating on movements smaller than that spread generates alarms nobody trusts, and untrusted alarms get deleted. Second, model upgrades are silent prompt regressions. A prompt is an artifact tuned against one model's quirks; swap the model and the tuning is stale — formats drift, refusal boundaries move, the few-shot examples land differently. Pinning model IDs and re-running the full suite on every upgrade is the difference between discovering this in CI and discovering it from customers.
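Both details fit in a few lines of gate logic. The sketch below is an assumption about how a gate might be wired, not a reference implementation: `PromptSpec`, `fingerprint`, and `gate` are hypothetical names, the model ID is a placeholder, and using the range of two reruns as the noise floor is the crudest workable estimate (more reruns give a steadier one).

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromptSpec:
    """A prompt is only reproducible as (text, model, params)."""
    text: str
    model_id: str
    temperature: float
    max_tokens: int

    def fingerprint(self) -> str:
        """Stable ID that changes when the text, the model, or a parameter changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def gate(baseline_scores, candidate_score):
    """Block a merge only when the drop exceeds the measured noise floor.

    baseline_scores: pass rates from re-running the UNCHANGED prompt.
    """
    floor = max(baseline_scores) - min(baseline_scores)
    reference = sum(baseline_scores) / len(baseline_scores)
    drop = reference - candidate_score
    return ("block" if drop > floor else "pass"), drop, floor

# Placeholder values for illustration only.
v1 = PromptSpec("Classify the ticket.", "example-model-v1", 0.2, 256)
v1_new_model = PromptSpec("Classify the ticket.", "example-model-v2", 0.2, 256)

small_move = gate([0.90, 0.86], 0.85)
big_move = gate([0.90, 0.86], 0.78)
print(f"same text, new model, same fingerprint? {v1.fingerprint() == v1_new_model.fingerprint()}")
print(f"candidate at 0.85: {small_move[0]}  (drop {small_move[1]:.2f}, floor {small_move[2]:.2f})")
print(f"candidate at 0.78: {big_move[0]}  (drop {big_move[1]:.2f}, floor {big_move[2]:.2f})")
```

Hashing the model ID and parameters together with the text means a model upgrade produces a new fingerprint even when the prompt file is untouched, which is the trigger for re-running the full suite.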

The regression story is always the same shape: someone tightens a 400-word prompt to 340 because "it was bloated," format compliance quietly falls from 99% to 91%, and three systems downstream of the parser start retrying. With an eval gate, that is a red X on a pull request. Without one, it is an incident review. Same edit, different Tuesday.

7.4

Anti-patterns catalog

Every entry below survives in the wild for one reason: nobody measured it. Each is a real pattern from production prompts, with the failure mechanism and the repair.

Anti-pattern	Specimen	Why it fails	The fix
"World's-best-expert" inflation	"You are the world's greatest marketer with 50 years of experience and 17 industry awards…"	Role conditioning works by selecting a register and vocabulary distribution (Ch 02), not by rank. Superlatives carry zero task information and tilt output toward grandiose prose.	An information-bearing role: domain, seniority, audience. "Senior lifecycle marketer at a B2B SaaS, writing for trial users who stalled at step 2."
Threat & tip folklore	"I will tip you $200 for a perfect answer." · "If you fail, I will lose my job."	Effects were small, model-specific, and unstable even when first reported; current post-training largely normalizes them away. You spend tokens on theater and risk a weird, placating tone.	State the real stakes as usable context — "this summary goes to the CFO unedited" changes behavior because it carries information, not pressure.
The mega-prompt	3,000 accumulated words; the actual task on line 41 of 90; three format rules from three authors, two of them contradictory	Instructions buried mid-context are recalled worst (Ch 01); patch-on-patch prompts accumulate contradictions the model resolves arbitrarily — differently each sample.	Refactor like legacy code: dedupe, delete rules that cite no failure, move reference material into tagged sections, state the task first or last — and keep an eval so the refactor is provably safe.
Negative-only constraints	"Don't be verbose. Don't use jargon. Don't speculate. Don't mention competitors…"	A wall of don'ts says where not to go and nothing about where to go — and a negated concept is still an activated one: "don't mention competitors" raises their salience.	Pair every DON'T with a DO, then show one exemplar of the desired output. An example is worth twenty constraints (Ch 03).
Vague qualifiers	"Be concise but comprehensive, professional yet warm, detailed where it matters."	Unfalsifiable adjective pairs: the model picks the trade-off point arbitrarily, and differently on every sample. You cannot eval compliance with a vibe.	Operationalize: word caps, named structure ("3 bullets + 1 risk"), reading level, or an exemplar that embodies the trade-off. If you can't write the check, the model can't hit the target.

The catalog compresses to one rule: every token must carry information the model can act on. Rank, flattery, threats, and vibes carry none. Context, constraints, examples, and checkable formats carry plenty — and everything that carries information can be measured, which is what the next section is for.
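"If you can't write the check" can be taken literally. The sketch below turns the "3 bullets + 1 risk" structure and a word cap into boolean checks; the `- ` bullet marker, the `Risk:` prefix, the 60-word cap, and both sample outputs are assumptions chosen for illustration.

```python
def check_brief(text: str, max_words: int = 60) -> dict:
    """Boolean compliance checks for a '3 bullets + 1 risk' brief."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith("- ")]
    risks = [ln for ln in lines if ln.lower().startswith("risk:")]
    return {
        "under_word_cap": len(text.split()) <= max_words,
        "three_bullets": len(bullets) == 3,
        "one_risk_line": len(risks) == 1,
    }

# Made-up outputs: one that follows the named structure, one that follows a vibe.
structured = (
    "- Churn fell 2 points in Q3\n"
    "- Trial conversion is flat\n"
    "- Support backlog cleared\n"
    "Risk: pricing change lands in November"
)
vague = "Overall things went well this quarter, with several positive developments across the board."

structured_result = check_brief(structured)
vague_result = check_brief(vague)
print("structured:", structured_result)
print("vague     :", vague_result)
```

Each key is a per-criterion verdict in the sense of the rubric design in §7.1, so the same function doubles as an eval scorer for the prompt that requests this format.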

The broken-prompt diagnostic

The catalog tells you what bad prompts look like; the diagnostic tells you how to find the break in your own. Run these five questions in order — the first NO is usually the whole bug. They are deliberately yes/no: a vibe is not a diagnosis.

Question	What broken looks like	The fix
1 · ROLE defined?	No role at all, or a superlative one ("world's best expert") that selects grandiosity instead of a register.	Name domain, seniority, and audience — the three facts that actually move the output distribution (Ch 02).
2 · CONTEXT named?	The model is asked to act on facts it was never given — reader, stakes, prior history, the actual input — so it invents them.	Supply the real inputs and the real stakes as information, not adjectives ("goes to the CFO unedited").
3 · FORMAT locked?	"Write something good" with no shape — length, sections, schema all left to chance, so every sample differs and nothing parses.	Specify a checkable structure: word cap, named sections, schema, or one exemplar that embodies it (Ch 05).
4 · CONSTRAINTS named & refusal licensed?	No boundaries, or only DON'Ts; and the model is never told it may refuse or flag missing inputs, so it confabulates to comply.	Pair each DON'T with a DO; resolve trade-offs explicitly; license the escape hatch ("if a field is missing, write UNKNOWN — do not guess").
5 · EXAMPLES present?	The desired output is described in prose only; the model matches the description loosely and the label/format discipline drifts.	Show one to three worked exemplars. An example is worth twenty constraints (Ch 03).

Most broken prompts fail question 4 first. Roles and formats are the parts authors remember to write; the unstated constraint and the un-granted refusal license are the parts they forget — and they are the parts that turn a confident wrong answer into an incident. Diagnose in order, but expect the break at four.

INSTRUMENT P7.3 — PROMPT DOCTOR · RUN THE FIVE QUESTIONS ON ONE REAL BROKEN PROMPT

THE PATIENT — A PRODUCTION PROMPT THAT KEEPS CAUSING RETRIES

You are an amazing customer-support assistant. A user has written in
about a problem with their order. Here is their message:

"My order #44812 arrived with the wrong item — I got the blue case,
not the black one I paid for. This is the second time. I need the
right one before Friday or I'm disputing the charge."

Reply to the customer in a friendly, professional tone with a clear
subject line and 2–3 short paragraphs. Don't be defensive and don't
make promises we can't keep.

RUN A DIAGNOSTIC — EACH BUTTON REVEALS ITS VERDICT FOR THIS PROMPT

Click a diagnostic above. Three of the five fail on this prompt — see if you can predict which before revealing.


Two diagnostics pass: the prompt names the CONTEXT (the actual ticket, with stakes) and locks a FORMAT (subject + 2–3 paragraphs). Three fail: the ROLE is a superlative with no register, the CONSTRAINTS are negative-only with no refusal license for the missing replacement-stock fact, and there are no EXAMPLES. As the rule predicts, the load-bearing break is question 4 — nothing tells the model what to do when it doesn't know whether a black case is even in stock, so it will cheerfully promise one. SHOW FIX rewrites all three.

7.5

The Prompt Lab — run the volume's claims live

Everything above assumed you had outputs to score. Time to generate some. The lab below sends two prompts — A, a baseline; B, a technique from this volume — to a real Claude model and shows both outputs side by side. Four preset experiments reproduce the volume's central comparisons; the textareas stay fully editable, so the fifth experiment is yours.

PRIVACY

Bring your own key; keep your own key. Your API key is held in this tab's sessionStorage only — it is never sent anywhere except api.anthropic.com, and requests travel directly from your browser to Anthropic over TLS. This page has no backend and no analytics on the lab. Closing the tab forgets the key. Use a key from console.anthropic.com with a low spend limit; a lab run costs a fraction of a cent.

INSTRUMENT P7.2 — THE PROMPT LAB · BYOK · LIVE A/B AGAINST THE ANTHROPIC API

Inputs: ANTHROPIC API KEY · MODEL · PRESET EXPERIMENTS (each maps to a chapter of this volume) · SYSTEM PROMPT, shared by A and B (the controlled variable) · PROMPT A (baseline) · PROMPT B (technique)

Outputs: OUTPUT A · OUTPUT B

Run a preset, read both outputs against the technique's claim — then run it again. Sampling at the API's default temperature means each run is one draw; a conclusion from a single pair is the $n = 1$ sin §7.1 opened with. And notice your own protocol: B always sits on the right and you know which is which — position bias and experimenter bias, live, in you. For a judgment you'd defend, decide the criterion before reading, and for real evals, blind and randomize (§7.2).
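Blinding and randomizing takes little more than a coin flip per item and a key kept away from the grader. The sketch below is one way to do it offline with saved outputs; the function names and the placeholder output strings are hypothetical, and nothing here calls the lab or the API.

```python
import random

def blind(pairs, seed: int = 0):
    """Randomize which side each prompt's output appears on.

    pairs: (item_id, output_a, output_b) tuples.
    Returns (shown, key): `shown` goes to the grader, `key` stays hidden.
    """
    rng = random.Random(seed)
    shown, key = [], []
    for item_id, out_a, out_b in pairs:
        a_is_left = rng.random() < 0.5
        left, right = (out_a, out_b) if a_is_left else (out_b, out_a)
        shown.append({"item_id": item_id, "left": left, "right": right})
        key.append(a_is_left)
    return shown, key

def a_win_rate(key, picks):
    """Unblind: picks are 'left' or 'right', one per item, in the same order."""
    wins = sum((pick == "left") == a_is_left for pick, a_is_left in zip(picks, key))
    return wins / len(key)

# Placeholder outputs standing in for saved lab runs.
pairs = [(f"item-{i}", f"A-output-{i}", f"B-output-{i}") for i in range(6)]
shown, key = blind(pairs, seed=7)

picks_always_a = ["left" if a_is_left else "right" for a_is_left in key]
picks_always_left = ["left"] * len(key)
print(f"grader who truly prefers A : {a_win_rate(key, picks_always_a):.2f}")
print(f"grader who always picks left: {a_win_rate(key, picks_always_left):.2f}")
```

The second grader is the useful test case: a pure position preference earns A only as many wins as the coin happened to give it, instead of showing up as a result for whichever prompt always sat on one side.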

What the presets test. Scaffold vs bare reruns Ch 02's central claim on your model of choice. Few-shot vs zero-shot (Ch 03) uses a deliberately MIXED-sentiment ticket — watch whether the examples transfer the output format and the label discipline. Critique-then-revise vs single pass (Ch 06) shows all three passes, so you can check whether the critique actually found anything. XML vs prose (Ch 05) feeds the same meeting notes as a run-on mess and as tagged sections — compare which one flags the unassigned action item.

A prompt that survives an eval gate is ready for responsibility. Volume IV hands it tools: the agentic loop — model calls a tool, reads the result, decides what to do next — where every technique in this volume becomes the control surface for software that acts, and every missing eval becomes an incident.
