Evals for prompts: 20 examples beat 0
The standard failure mode of prompt work looks like this: edit the prompt, eyeball one output, decide it "feels better," ship. That is an experiment with \(n = 1\), no control, and a judge — you — who wants the change to work. The difference between prompt tinkering and prompt engineering is not cleverness; it is a number that moves when the prompt improves. Three eval designs cover nearly every case:
| Design | When | Scoring | Output |
|---|---|---|---|
| Golden set | The output is checkable: a label, a number, an extraction, a schema, code that runs | exact match · contains · regex · schema-valid · tests pass | pass rate |
| Pairwise A/B | No single right answer — emails, summaries, explanations | human or LLM judge picks the better of two outputs for the same input | win rate |
| Rubric scoring | Quality has named dimensions you can argue about | per-criterion verdicts: accurate? cited? under length? right register? | per-criterion pass rates |
Build the golden set from real traffic, not invented inputs: pull 20–50 cases, deliberately oversample the weird tail (ambiguous tickets, hostile users, inputs in the wrong language), and write down the expected behavior at collection time — deciding what "correct" means after seeing the model's answer is how wishful grading creeps in. For rubric scoring, prefer many small boolean criteria over one holistic 1–10: "did it cite the source span? Y/N" is stable across graders; "rate the quality" is not.
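A golden set can be as small as a list of records plus a scoring function. The sketch below is one minimal way to wire that up, not a prescribed harness: `run_prompt`, the case fields, and the check names are hypothetical placeholders, and the canned stub stands in for a real model call only so the example runs.

```python
import re
from typing import Callable

# Hypothetical golden cases: the expected behaviour is written down up front.
GOLDEN = [
    {"id": "t1", "input": "Refund not received after 10 days", "check": "exact", "expected": "billing"},
    {"id": "t2", "input": "App crashes when I upload a photo", "check": "exact", "expected": "bug"},
    {"id": "t3", "input": "Order #A-1042 arrived damaged", "check": "regex", "expected": r"A-\d{4}"},
    {"id": "t4", "input": "¿Dónde está mi pedido?", "check": "contains", "expected": "shipping"},
]

# One scoring function per check type named in the table above.
CHECKS: dict[str, Callable[[str, str], bool]] = {
    "exact": lambda out, exp: out.strip().lower() == exp.lower(),
    "contains": lambda out, exp: exp.lower() in out.lower(),
    "regex": lambda out, exp: re.search(exp, out) is not None,
}

def run_prompt(text: str) -> str:
    """Stand-in for a real model call (assumption: returns the raw output string)."""
    canned = {
        "Refund not received after 10 days": "billing",
        "App crashes when I upload a photo": "Bug",
        "Order #A-1042 arrived damaged": "Damaged item, order A-1042",
        "¿Dónde está mi pedido?": "general",
    }
    return canned.get(text, "")

def score(cases, model=run_prompt):
    """Return (pass rate, ids of failing cases)."""
    results = {c["id"]: CHECKS[c["check"]](model(c["input"]), c["expected"]) for c in cases}
    failures = [case_id for case_id, ok in results.items() if not ok]
    return sum(results.values()) / len(results), failures

pass_rate, failures = score(GOLDEN)
print(f"pass rate {100 * pass_rate:.0f}%, failing cases: {failures}")
```

The list of failing ids matters as much as the rate: "this broke t4" is something you can act on.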
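Why does a set as small as 20 matter so much? Because the information gain from \(0 \to 20\) examples is the largest you will ever get: it converts "I think this is better" into "this broke 4 of 20 cases." But binomial noise sets a floor on what small sets can resolve. Under the textbook normal approximation, a pass rate \(p\) measured on \(n\) items carries a 95% half-width of about \(1.96\sqrt{p(1-p)/n}\), so small sets can only see large differences.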
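A minimal sketch of that floor, assuming the textbook normal approximation for a proportion (the ±14-point figure quoted later for 50 pairs is consistent with it). The function name `half_width` is illustrative.

```python
import math

def half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width of a pass rate p measured on n items."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case is p = 0.5; the bar shrinks only with the square root of n.
for n in (20, 50, 200, 1000):
    print(f"n={n:>4}: +/- {100 * half_width(0.5, n):.1f} points at p=0.5")
```

Halving the error bar takes four times as many items, which is why the first 20 examples buy more than the next 20.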
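There is a cheap statistical upgrade most teams skip: run both prompts on the same items and compare per-item, rather than comparing two independent pass rates. Item difficulty is shared noise, and pairing cancels it: items that both prompts pass or both fail drop out, and only the items where they disagree carry evidence about which is better.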
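One way to run that paired comparison is an exact McNemar (sign) test on the discordant items. The sketch below assumes per-item pass/fail verdicts for both prompts; the data is made up for illustration and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb

def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> tuple[int, int, float]:
    """Return (items only A passes, items only B passes, two-sided exact p-value)."""
    a_only = sum(a and not b for a, b in zip(a_pass, b_pass))
    b_only = sum(b and not a for a, b in zip(a_pass, b_pass))
    m = a_only + b_only                      # discordant items: the only ones that count
    if m == 0:
        return a_only, b_only, 1.0           # the prompts never disagreed
    k = min(a_only, b_only)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m   # P(X <= k), X ~ Binomial(m, 0.5)
    return a_only, b_only, min(1.0, 2 * tail)

# Illustrative data: 20 shared items; A passes 14, B passes 18.
a_pass = [True] * 12 + [False] * 6 + [True] * 2
b_pass = [True] * 12 + [True] * 6 + [False] * 2
a_only, b_only, p_value = mcnemar_exact(a_pass, b_pass)
print(f"A-only {a_only}, B-only {b_only}, exact two-sided p = {p_value:.3f}")
```

Here B fixes six cases and breaks two. The twelve items both prompts pass say nothing about which is better, and with only eight discordant items the test does not yet separate the prompts.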
So put a number on it. Forty paired comparisons is a typical first eval; the cell below counts B's wins over A and wraps the rate in a Wilson confidence interval — the small-sample-correct version of EQ P7.1, which (unlike the textbook normal interval) never runs off the end of \([0,1]\) and stays honest near the edges. Read whether the interval clears 50%: if it doesn't, you have not yet earned the right to call B better.
# Paired A/B win rate over 40 comparisons with a Wilson 95% confidence interval
import numpy as np
rng = np.random.default_rng(0)
n, p_true = 40, 0.65 # comparisons, B's true per-item win prob over A
wins = rng.random(n) < p_true # one paired verdict per item (did B beat A?)
k = int(wins.sum()); p = k / n # observed win rate
z = 1.96 # 95%
# Wilson score interval -- correct for small n and rates near the edges
center = p + z*z / (2*n)
half = z * np.sqrt(p*(1-p)/n + z*z / (4*n*n))
lo, hi = (center - half) / (1 + z*z/n), (center + half) / (1 + z*z/n)
naive = z * np.sqrt(p*(1-p)/n) # textbook normal half-width, for contrast
print(f"B won {k}/{n} paired comparisons -> win rate {100*p:.1f}%")
print(f"Wilson 95% CI : [{100*lo:.1f}%, {100*hi:.1f}%] (width {100*(hi-lo):.1f} pts)")
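# report whether 50% falls inside the naive interval instead of hard-coding the verdict
tie = "inside: tie not ruled out" if abs(p - 0.5) <= naive else "outside: tie ruled out"
print(f"naive normal : {100*p:.1f}% +/- {100*naive:.1f} -- 50% is {tie}")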
B wins 23 of 40 — a 57.5% win rate that looks like a result. The Wilson interval is [42.2%, 71.5%]: nearly 30 points wide, and it straddles 50%. With true skill of 65%, forty comparisons still cannot rule out a coin flip. This is EQ P7.1's noise floor in its most common operational form — the wide bar is why a single green run does not close the ticket. Push n to 200 and the interval finally lifts off 50%.
If you iterate against your golden set, you become the overfit. Every time you read a failing case and patch the prompt for it, that case stops measuring generalization. Split even tiny sets — 30 for development, 10 you only touch before shipping — and refresh the holdout from live traffic on a schedule. This is Vol II's eval-decontamination discipline (Vol II · Ch 04) shrunk to prompt scale; the failure mode is identical, only faster.
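A sketch of one way to keep the split honest: assign each case to dev or holdout from a stable hash of its id, so the assignment never depends on who looked at which failure. The id format and the 25% fraction are assumptions for illustration. A hash split hits the target fraction only approximately on tiny sets, so fix the counts by hand if you need exactly 30 and 10.

```python
import hashlib

def split(case_id: str, holdout_fraction: float = 0.25) -> str:
    """Assign a case to 'dev' or 'holdout' from a stable hash of its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64   # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "dev"

case_ids = [f"ticket-{i:03d}" for i in range(40)]          # hypothetical ids
assignment = {cid: split(cid) for cid in case_ids}
n_holdout = sum(v == "holdout" for v in assignment.values())
print(f"dev: {len(assignment) - n_holdout}, holdout: {n_holdout}")
```

Because the assignment is a pure function of the id, new cases pulled from live traffic land in a bucket without reshuffling the old ones.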
LLM-as-judge — and the judge's rap sheet
Pairwise and rubric evals need a grader, and humans do not scale to nightly CI. The fix is to hire a model: show a judge model the input, the two candidate outputs (or one output plus a rubric), and ask for a verdict. The canonical result (MT-Bench, Zheng et al., 2023) is that a frontier judge agrees with human majority preference roughly 80–85% of the time — about as often as humans agree with each other. That made LLM-as-judge the default instrument of applied evaluation. It also imported a defendant's worth of biases:
| Bias | Symptom | Mitigation |
|---|---|---|
| Position bias | The same pair, presented in the other order, gets a different verdict; most judges systematically favor one slot (direction varies by model) | judge every pair in both orders — average the verdicts, or count only wins that survive the swap (EQ P7.3) |
| Length bias | Longer answers win regardless of content; verbosity reads as effort | length-controlled win rates (the AlpacaEval 2 fix), explicit rubric line "do not reward length", compare at matched lengths |
| Self-preference | Judges score their own family's outputs higher — they recognize and reward their own style | judge from a different model family; or a panel of judges across families, majority vote |
| Style over substance | Confident tone, headers, and bullet polish outscore a correct but plain answer; errors inside fluent prose go unnoticed | give the judge a reference answer to compare against; grade correctness as its own criterion, isolated from presentation |
The protocol that survives these biases is boring and effective: fixed rubric, reference answer when one exists, both orders, low temperature, verdict in a parseable field. Asking the judge to justify before deciding makes verdicts easier to audit; evidence that it makes them more accurate is mixed — treat the justification as a debugging artifact, not a guarantee (the faithfulness caveat of Ch 04 applies to judges too).
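The both-orders step fits in a dozen lines. This is a sketch under stated assumptions: the judge is any callable that sees the input and two answers and returns "first" or "second", and `toy_judge` is a stand-in with a built-in first-slot bonus, not a real model call.

```python
from typing import Callable

# (input, answer shown first, answer shown second) -> "first" or "second"
Judge = Callable[[str, str, str], str]

def judge_both_orders(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return 'A' or 'B' only when the verdict survives the swap, else 'tie'."""
    a_wins_when_first = judge(prompt, a, b) == "first"
    a_wins_when_second = judge(prompt, b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"   # the verdict followed the slot, not the answer

# Toy judge: hidden quality scores plus a 0.5 bonus for the first slot.
QUALITY = {"plain but correct": 2.0, "fluent but wrong": 1.0, "roughly as good": 1.8}

def toy_judge(prompt: str, first: str, second: str) -> str:
    return "first" if QUALITY[first] + 0.5 > QUALITY[second] else "second"

clear = judge_both_orders(toy_judge, "q", "plain but correct", "fluent but wrong")
close = judge_both_orders(toy_judge, "q", "plain but correct", "roughly as good")
print(clear, close)
```

Counting only wins that survive the swap is the stricter of the two options in the table. Averaging the two orders keeps more data but lets slot-driven verdicts through at half weight.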
Calibrate the judge itself. Before trusting any judge pipeline, run it on 20–30 pairs you have hand-labeled and check agreement; published judge–human agreement transfers poorly across domains. And reuse the noise-floor logic of EQ P7.1: a judge-scored win rate over 50 pairs carries ±14-point error bars at \(w \approx 0.5\) — wide enough to swallow most prompt tweaks.
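A sketch of the calibration check, assuming you have hand labels and judge verdicts for the same pairs. Raw agreement is what the paragraph above asks for. Cohen's kappa is an addition here, not something the chapter prescribes: it discounts the agreement a judge would reach by chance. The labels are invented for illustration.

```python
from collections import Counter

def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement rate, Cohen's kappa) for two label lists."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if the two graders labelled independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / n ** 2
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Illustrative data: 20 hand-labelled pairs, winner recorded as "A" or "B".
human = ["A"] * 12 + ["B"] * 8
judge = ["A"] * 10 + ["B"] * 2 + ["B"] * 6 + ["A"] * 2
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement {100 * observed:.0f}%, kappa {kappa:.2f}")
```

With 20 pairs the agreement rate itself carries wide error bars, so treat a low number as a reason to stop and a high number as permission to keep checking.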
The flip rate is three lines of arithmetic you can run against your own judge. The cell below builds a toy judge that adds a fixed boost to whichever answer it reads first, then scores the same pairs in both orders. The naive eval (A always first) reports a win rate inflated by the boost; swapping and averaging recovers the truth; and the flip rate names how often the verdict was an artifact of seating.
# LLM-judge position bias: judge the same pair in both orders, count the flips
import numpy as np
rng = np.random.default_rng(0)
N, b, sigma = 2000, 0.6, 0.8 # items, first-slot boost, judging noise
qA = rng.normal(0.15, 1.0, N) # latent quality of answers A and B, per item
qB = rng.normal(0.00, 1.0, N)
def a_wins(a_is_first): # +b goes to whichever answer is shown first
    sa = qA + (b if a_is_first else 0) + sigma * rng.standard_normal(N)
    sb = qB + (b if not a_is_first else 0) + sigma * rng.standard_normal(N)
    return sa > sb
naive = a_wins(True) # A always shown first -- the lazy eval
o1 = a_wins(True) # order 1: A first
o2 = a_wins(False) # order 2: B first
flip = np.mean(o1 != o2) # the two orders disagree on who won
print(f"naive win%(A), A-always-first : {100*naive.mean():.1f}%")
print(f"swapped win%(A), order-averaged : {100*0.5*(o1.mean()+o2.mean()):.1f}%")
print(f"verdict FLIP rate on order swap : {100*flip:.1f}% <- this is bias, not quality")
Naive reports A winning 66%; order-averaged says 53% — a thirteen-point phantom edge, pure seating. The 36.6% flip rate is the alarm: more than a third of verdicts changed under nothing but a swap. Set b = 0 and the naive and averaged rates converge, but the flip rate stays well above zero — independent re-reads disagree near the boundary regardless. Position bias is the systematic slice of that instability, and it is the slice swapping removes.
Prompt versioning & regression: prompts are code
A production prompt is configuration that controls live system behavior — yet teams that would never push code without review and CI routinely edit prompts in a dashboard textbox at 6 p.m. on a Friday. The fix is to grant prompts the full citizenship of code:
# prompts-as-code: the minimum viable discipline
repo:    prompts/support-triage/v3.2.1.md + CHANGELOG.md
semver:  MAJOR task change · MINOR scaffold/technique change · PATCH wording
pin:     model ID + temperature + max_tokens versioned WITH the prompt;
         a prompt is only reproducible as (text, model, params)
gate:    CI runs the golden set on every prompt diff; merge blocked when
         the score drops by more than the noise floor (EQ P7.1; know yours)
review:  prompt diffs get human review; "harmless rewording" is how
         load-bearing constraints die
canary:  new version to 5% of traffic; compare online metrics before 100%
re-eval: every model version bump re-runs ALL prompt evals;
         the prompt didn't change, but its interpreter did
Two details earn their lines. First, the noise floor: before a gate can blame a diff, you must know how much the score wobbles when nothing changes — run the unchanged prompt through the eval twice at your production temperature and record the spread. Gating on movements smaller than that spread generates alarms nobody trusts, and untrusted alarms get deleted. Second, model upgrades are silent prompt regressions. A prompt is an artifact tuned against one model's quirks; swap the model and the tuning is stale — formats drift, refusal boundaries move, the few-shot examples land differently. Pinning model IDs and re-running the full suite on every upgrade is the difference between discovering this in CI and discovering it from customers.
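The first detail can be sketched as a gate that measures its own wobble before blaming a diff. Everything here is illustrative: the function names, the use of max minus min as the spread, and the scores are assumptions, and a real gate would read its scores from the eval run rather than literals.

```python
from statistics import mean

def noise_floor(repeat_scores: list[float]) -> float:
    """Spread of the eval score across repeat runs of the UNCHANGED prompt."""
    if len(repeat_scores) < 2:
        raise ValueError("need at least two repeat runs to measure the spread")
    return max(repeat_scores) - min(repeat_scores)

def gate(baseline_runs: list[float], candidate_score: float) -> tuple[bool, str]:
    """Block the merge only when the drop exceeds the measured noise floor."""
    base = mean(baseline_runs)
    floor = noise_floor(baseline_runs)
    drop = base - candidate_score
    if drop > floor:
        return False, f"BLOCK: drop {100 * drop:.1f} pts exceeds noise floor {100 * floor:.1f} pts"
    return True, f"PASS: drop {100 * drop:.1f} pts is within noise floor {100 * floor:.1f} pts"

baseline_runs = [0.96, 0.94, 0.95]       # same prompt, same model, production temperature
ok_small, msg_small = gate(baseline_runs, 0.94)
ok_large, msg_large = gate(baseline_runs, 0.91)
print(msg_small)
print(msg_large)
```

Two repeat runs are the minimum; more runs give a steadier estimate of the spread at the cost of eval budget.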
The regression story is always the same shape: someone tightens a 400-word prompt to 340 because "it was bloated," format compliance quietly falls from 99% to 91%, and three systems downstream of the parser start retrying. With an eval gate, that is a red X on a pull request. Without one, it is an incident review. Same edit, different Tuesday.
Anti-patterns catalog
Every entry below survives in the wild for one reason: nobody measured it. Each is a real pattern from production prompts, with the failure mechanism and the repair.
| Anti-pattern | Specimen | Why it fails | The fix |
|---|---|---|---|
| "World's-best-expert" inflation | "You are the world's greatest marketer with 50 years of experience and 17 industry awards…" | Role conditioning works by selecting a register and vocabulary distribution (Ch 02), not by rank. Superlatives carry zero task information and tilt output toward grandiose prose. | An information-bearing role: domain, seniority, audience. "Senior lifecycle marketer at a B2B SaaS, writing for trial users who stalled at step 2." |
| Threat & tip folklore | "I will tip you $200 for a perfect answer." · "If you fail, I will lose my job." | Effects were small, model-specific, and unstable even when first reported; current post-training largely normalizes them away. You spend tokens on theater and risk a weird, placating tone. | State the real stakes as usable context — "this summary goes to the CFO unedited" changes behavior because it carries information, not pressure. |
| The mega-prompt | 3,000 accumulated words; the actual task on line 41 of 90; three format rules from three authors, two of them contradictory | Instructions buried mid-context are recalled worst (Ch 01); patch-on-patch prompts accumulate contradictions the model resolves arbitrarily — differently each sample. | Refactor like legacy code: dedupe, delete rules that cite no failure, move reference material into tagged sections, state the task first or last — and keep an eval so the refactor is provably safe. |
| Negative-only constraints | "Don't be verbose. Don't use jargon. Don't speculate. Don't mention competitors…" | A wall of don'ts says where not to go and nothing about where to go — and a negated concept is still an activated one: "don't mention competitors" raises their salience. | Pair every DON'T with a DO, then show one exemplar of the desired output. An example is worth twenty constraints (Ch 03). |
| Vague qualifiers | "Be concise but comprehensive, professional yet warm, detailed where it matters." | Unfalsifiable adjective pairs: the model picks the trade-off point arbitrarily, and differently on every sample. You cannot eval compliance with a vibe. | Operationalize: word caps, named structure ("3 bullets + 1 risk"), reading level, or an exemplar that embodies the trade-off. If you can't write the check, the model can't hit the target. |
The catalog compresses to one rule: every token must carry information the model can act on. Rank, flattery, threats, and vibes carry none. Context, constraints, examples, and checkable formats carry plenty — and everything that carries information can be measured, which is what the next section is for.
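The "if you can't write the check" test from the vague-qualifiers row can be taken literally. The sketch below checks the "3 bullets + 1 risk" structure with a word cap; the `- ` bullet prefix, the `Risk:` line, and the 80-word cap are hypothetical conventions chosen for the example.

```python
def check_format(output: str, max_words: int = 80) -> dict[str, bool]:
    """Per-criterion verdicts for a '3 bullets + 1 risk' summary."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith("- ")]
    risks = [ln for ln in lines if ln.lower().startswith("risk:")]
    return {
        "three_bullets": len(bullets) == 3,
        "one_risk": len(risks) == 1,
        "under_word_cap": len(output.split()) <= max_words,
    }

good = (
    "- Churn fell 2 points\n"
    "- Trial conversions flat\n"
    "- Support backlog cleared\n"
    "Risk: pricing change lands next week"
)
vague = "Things went well overall and we are optimistic."

print(check_format(good))
print(check_format(vague))
```

Each key is a boolean criterion, so the same function doubles as a rubric scorer for format compliance.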
The broken-prompt diagnostic
The catalog tells you what bad prompts look like; the diagnostic tells you how to find the break in your own. Run these five questions in order — the first NO is usually the whole bug. They are deliberately yes/no: a vibe is not a diagnosis.
| Question | What broken looks like | The fix |
|---|---|---|
| 1 · ROLE defined? | No role at all, or a superlative one ("world's best expert") that selects grandiosity instead of a register. | Name domain, seniority, and audience — the three facts that actually move the output distribution (Ch 02). |
| 2 · CONTEXT named? | The model is asked to act on facts it was never given — reader, stakes, prior history, the actual input — so it invents them. | Supply the real inputs and the real stakes as information, not adjectives ("goes to the CFO unedited"). |
| 3 · FORMAT locked? | "Write something good" with no shape — length, sections, schema all left to chance, so every sample differs and nothing parses. | Specify a checkable structure: word cap, named sections, schema, or one exemplar that embodies it (Ch 05). |
| 4 · CONSTRAINTS named & refusal licensed? | No boundaries, or only DON'Ts; and the model is never told it may refuse or flag missing inputs, so it confabulates to comply. | Pair each DON'T with a DO; resolve trade-offs explicitly; license the escape hatch ("if a field is missing, write UNKNOWN — do not guess"). |
| 5 · EXAMPLES present? | The desired output is described in prose only; the model matches the description loosely and the label/format discipline drifts. | Show one to three worked exemplars. An example is worth twenty constraints (Ch 03). |
Most broken prompts fail question 4 first. Roles and formats are the parts authors remember to write; the unstated constraint and the un-granted refusal license are the parts they forget — and they are the parts that turn a confident wrong answer into an incident. Diagnose in order, but expect the break at four.
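The five questions can double as the skeleton of the prompt itself. A minimal sketch, assuming a prompt is stored as five named parts and rendered into tagged sections: `PromptSpec`, its field names, and the tag format are hypothetical. The check only catches a part that is missing outright; whether a part is any good still takes a human or an eval.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class PromptSpec:
    # Field order mirrors the diagnostic order: questions 1 through 5.
    role: str = ""
    context: str = ""
    output_format: str = ""
    constraints: str = ""
    examples: str = ""

    def first_missing(self) -> Optional[str]:
        """Name of the first empty part in diagnostic order, or None."""
        for f in fields(self):
            if not getattr(self, f.name).strip():
                return f.name
        return None

    def render(self) -> str:
        """Join the non-empty parts as tagged sections."""
        parts = [(f.name, getattr(self, f.name).strip()) for f in fields(self)]
        return "\n\n".join(f"<{name}>\n{text}\n</{name}>" for name, text in parts if text)

spec = PromptSpec(
    role="Senior support engineer writing for non-technical customers.",
    context="Ticket text and account tier are provided below.",
    output_format="JSON with keys: category, urgency, reply.",
    examples="Input: 'card declined twice' -> category: billing",
)
print(spec.first_missing())
```

Filling the gap here means adding the constraint and the refusal license together, for example "if a field is missing, write UNKNOWN; do not guess".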
The Prompt Lab — run the volume's claims live
Everything above assumed you had outputs to score. Time to generate some. The lab below sends two prompts — A, a baseline; B, a technique from this volume — to a real Claude model and shows both outputs side by side. Four preset experiments reproduce the volume's central comparisons; the textareas stay fully editable, so the fifth experiment is yours.
Bring your own key; keep your own key. Your API key is held in this tab's sessionStorage only — it is never sent anywhere except api.anthropic.com, and requests travel directly from your browser to Anthropic over TLS. This page has no backend and no analytics on the lab. Closing the tab forgets the key. Use a key from console.anthropic.com with a low spend limit; a lab run costs a fraction of a cent.
What the presets test. Scaffold vs bare reruns Ch 02's central claim on your model of choice. Few-shot vs zero-shot (Ch 03) uses a deliberately MIXED-sentiment ticket — watch whether the examples transfer the output format and the label discipline. Critique-then-revise vs single pass (Ch 06) shows all three passes, so you can check whether the critique actually found anything. XML vs prose (Ch 05) feeds the same meeting notes as a run-on mess and as tagged sections — compare which one flags the unassigned action item.
A prompt that survives an eval gate is ready for responsibility. Volume IV hands it tools: the agentic loop — model calls a tool, reads the result, decides what to do next — where every technique in this volume becomes the control surface for software that acts, and every missing eval becomes an incident.
Further reading
- Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. — the foundational study of LLM judges; documents the position, verbosity, and self-enhancement biases and the both-orders protocol this chapter builds on.
- Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. — the length-bias fix referenced in §7.2; shows verbosity alone can move judged win rates by double digits.
- Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. — the original Wilson score interval used in the §7.1 paired-eval cell; still the correct small-sample interval for a proportion.
- McNemar, Q. (1947). Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. — the paired sign test behind EQ P7.2; the right significance check when two prompts are scored on the same items.
- Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models (HELM). — the case for multi-metric, scenario-based evaluation over a single aggregate score; the discipline §7.3's eval gates operationalize.
- Perez, E., Huang, S., Song, F., et al. (2022). Red Teaming Language Models with Language Models. — methodology for generating adversarial test cases automatically; how to grow the weird-tail golden set §7.1 demands.