Why single-pass output is a draft
An autoregressive model commits to every token as it goes. There is no backspace in the decoding loop: a weak opening sentence constrains everything after it, an early arithmetic slip propagates to the conclusion, and the model's trademark fluency papers over both. Single-pass generation is a first draft produced by a writer who is forbidden from rereading.
What rescues this is an asymmetry the field keeps rediscovering. Ask a model to produce a correct solution and it succeeds with some probability; show it a candidate solution and ask "is this correct?" and it succeeds more often. The canonical early result: on grade-school math, a small model that generates many answers and ranks them with a trained verifier beat a generator 30× its size sampling once (Cobbe et al., 2021). The same asymmetry is why RLHF works at all: humans (and reward models) can rank outputs they could never write (Vol II · EQ 5.2). The intuition is old: checking a proof is easier than finding one.
The gap is not a slogan; it is arithmetic. Suppose a model writes a correct first draft only half the time, but can judge a candidate correctly 85% of the time. A bare draft is a coin flip — but generate, verify, and revise only what the verifier flags, and the effective accuracy climbs well past either number alone. The cell below runs that lifecycle on a toy model and prints the lift; it is the smallest possible version of every pattern in this chapter.
# generator-verifier gap: generate -> verify -> revise lifts effective accuracy
import numpy as np
rng = np.random.default_rng(0)
g, v = 0.50, 0.85 # P(draft correct), P(verifier judges correctly)
N = 200_000
correct = rng.random(N) < g # is the draft actually right?
verifier_right = rng.random(N) < v # does the verifier judge it correctly?
says_ok = np.where(correct, verifier_right, ~verifier_right) # OK iff judged "good"
revised = rng.random(N) < g # flagged drafts get one fresh attempt
final = np.where(says_ok, correct, revised) # keep OK drafts; replace the flagged
analytic = g*v + g*(1-v)*g + (1-g)*v*g # the three ways to end up correct
print(f"raw generator {correct.mean():.3f}")
print(f"verifier flags 'bad' {(~says_ok).mean():.3f} (these get revised)")
print(f"after verify + revise {final.mean():.3f} (analytic {analytic:.3f})")
print(f"lift over raw draft {final.mean() - correct.mean():+.3f}")
Half-right drafts become roughly two-thirds-right answers (0.675), paid for with one verification pass plus a retry on the flagged half; that surplus is the generator-verifier gap, \(\Delta_{\mathrm{GV}}\), spent. Drop the verifier to \(v = 0.5\) (a coin) and the lift collapses to zero: a verifier no better than chance adds no information, which is the §6.8 caution stated as code. Push \(v\) toward 1 and accuracy approaches 0.75, the most a perfect filter plus one retry can reach at \(g = 0.5\).
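The same lifecycle has a closed form, which makes both limits easy to check without sampling. The sketch below assumes exactly the toy model of the cell (one verification, one fresh retry for every flagged draft); `revised_accuracy` is a name introduced here for illustration.

```python
def revised_accuracy(g: float, v: float) -> float:
    """Closed-form accuracy of generate -> verify -> revise in the toy model.

    g: P(a draft is correct); v: P(the verifier judges a draft correctly).
    Flagged drafts get exactly one fresh attempt, correct with probability g.
    """
    kept_correct = g * v                   # correct draft, verifier says OK
    flagged = g * (1 - v) + (1 - g) * v    # every draft the verifier flags
    return kept_correct + flagged * g      # flagged drafts are redrawn once

g = 0.50
for v in (0.50, 0.70, 0.85, 1.00):
    acc = revised_accuracy(g, v)
    print(f"v = {v:.2f}  accuracy = {acc:.4f}  lift = {acc - g:+.4f}")
```

Sweeping `g` as well as `v` is a quick way to see where the loop stops paying for itself in this toy setting.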
Three topologies organize everything that follows: run the check after the draft (sequential — self-critique, Reflexion), run many checks in parallel (the council), or make two copies of the model fight and judge the wreckage (debate). The pre-mortem and red team are sequential patterns wearing armor: the critique arrives dressed as an attacker or a coroner, which turns out to matter enormously.
Self-critique & revise: the three-turn pattern
The minimum viable verification loop is three calls: produce → critique against explicit criteria → revise. Each clause carries weight. Three calls, because critique appended to the generation prompt ("write it, then review your work") collapses into one distribution — the model that just committed to a draft is the model least able to see its flaws, and in practice appends a paragraph of polite self-congratulation. Explicit criteria, because "make it better" licenses cosmetic edits; a rubric converts taste into checkable claims.
Rubric-as-prompt is the load-bearing trick. A good rubric has 3–6 criteria, each phrased so that a verdict can be defended by quotation — the critic must point at failing text, not emit vibes. A worked example, for a status-update paragraph:
| Criterion | Checkable phrasing | Catches |
|---|---|---|
| Specificity | every claim carries a number, date, or named source | "much faster", "significantly" |
| Falsifiability | a skeptic could in principle prove each claim wrong | "better performance overall" |
| Causal clarity | mechanisms stated — X because Y, with Y measured | "caching and other improvements" |
| Reader cost | no sentence makes the reader do the author's work | "the team worked very hard" |
# THE 3-TURN PATTERN — each turn is a separate API call
TURN 1 — AUTHOR
Write the deployment update for the engineering newsletter.
{{task context}}
TURN 2 — CRITIC # fresh context: gets draft + rubric, nothing else
You are reviewing a draft you did not write. Grade it against each
criterion below. For every verdict, QUOTE the text that earns it.
Do not rewrite. Do not praise. Verdicts: PASS / PARTIAL / FAIL.
RUBRIC
1. SPECIFICITY every claim carries a number, date, or named source
2. FALSIFIABILITY a skeptic could in principle prove each claim wrong
3. CAUSAL CLARITY mechanisms stated (X because Y), not adjacency
4. READER COST no sentence makes the reader do the author's work
DRAFT
{{draft}}
TURN 3 — REVISER # gets draft + critique; NOT the critic's context
Rewrite the draft so every FAIL and PARTIAL becomes a PASS.
Change nothing the critique did not flag. Output only the revision.
The reviser's leash — change nothing the critique did not flag — prevents revision drift, where a model "improving" a draft quietly rewrites the parts that were already right. And the critic's quote-to-convict rule is your hallucination filter: a criticism that cannot point at text is usually invented.
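As a rough sketch, the three turns and the quote-to-convict check might be wired together as follows. Everything here is an assumption for illustration: `llm` stands for any stateless prompt-to-text callable (each call is its own fresh context), and the policy of discarding a critique that quotes text absent from the draft is one possible choice, not a prescribed one.

```python
import re
from typing import Callable

RUBRIC = """1. SPECIFICITY: every claim carries a number, date, or named source
2. FALSIFIABILITY: a skeptic could in principle prove each claim wrong
3. CAUSAL CLARITY: mechanisms stated (X because Y), not adjacency
4. READER COST: no sentence makes the reader do the author's work"""

def unsupported_quotes(critique: str, draft: str) -> list[str]:
    """Double-quoted spans in the critique that do not occur in the draft."""
    return [q for q in re.findall(r'"([^"]+)"', critique) if q not in draft]

def three_turn(task: str, llm: Callable[[str], str]) -> str:
    """Run author -> critic -> reviser as three separate, stateless calls."""
    draft = llm(f"Write the following.\n{task}")
    critique = llm(
        "You are reviewing a draft you did not write. Grade it against each "
        "criterion. For every verdict, QUOTE the text that earns it. "
        "Do not rewrite. Do not praise. Verdicts: PASS / PARTIAL / FAIL.\n"
        f"RUBRIC\n{RUBRIC}\nDRAFT\n{draft}"
    )
    if unsupported_quotes(critique, draft):
        # The critic cited text the draft does not contain: treat the
        # critique as unreliable and keep the draft (illustrative policy).
        return draft
    return llm(
        "Rewrite the draft so every FAIL and PARTIAL becomes a PASS. "
        "Change nothing the critique did not flag. Output only the revision.\n"
        f"DRAFT\n{draft}\nCRITIQUE\n{critique}"
    )
```

The check only sees straight double quotes; a critic that quotes with other marks would need a looser pattern.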
The three-pass discipline
In regulated-industry field practice — where a flawed artifact survives to a downstream review that has consequences — the three-turn pattern hardens into a fixed ritual run on every load-bearing document. The shape is the same three calls, but each pass is named, scoped, and given a job it cannot fake its way out of.
Pass 1 — generate. Produce the full scaffold: not an outline, not a sketch, but the complete artifact with every section populated, so the critic has real text to convict rather than intentions to approve. A half-finished draft invites a half-hearted review.
Pass 2 — critique as a named senior reviewer. Fresh context. The model is cast as a specific, senior, skeptical persona — a named role with a reputation to protect — and made to grade the artifact against an explicit, pre-registered checklist, every verdict anchored to quoted text. The persona is load-bearing for the same reason as in §6.4: "review this" returns courtesy; "you are the principal reviewer who signs off on this and owns the failures" returns findings.
Pass 3 — confidence-score every section 1–5 with rationale. The reviewer assigns each section a numeric confidence (1 = would block release, 5 = ship as-is) and a one-line rationale for the score. The enforced rule is the whole point: at least one section must score ≤ 3. A scorecard of straight fives is rejected and the pass re-run — because all fives means the critique never happened.
The forced low score is a direct countermeasure to sycophantic self-review (§6.8). A model grading work — especially work adjacent to its own first draft — drifts toward charitable, "looks good with minor nits" verdicts; left to free-form scoring it will hand out fives to close the task. Mandating that the distribution contain a low number removes the comfortable equilibrium: the model can no longer satisfy the instruction and bless everything, so it is forced to locate the genuinely weakest section and defend a real criticism of it. The constraint does not invent flaws — every artifact has a weakest part — it simply refuses to let the reviewer pretend there isn't one. In practice the section the model is most reluctant to mark down is, more often than not, the one that actually needed the work.
# PASS 2 — SENIOR REVIEWER · fresh context: artifact + checklist only
You are the principal reviewer who signs the release for this artifact
and personally owns every defect that reaches production. You did not
write it. Grade it against the checklist. QUOTE the text behind every
verdict. Do not rewrite. Do not praise. Verdict: PASS / PARTIAL / FAIL.
CHECKLIST
1. {{check 1 — e.g. every claim carries a number, date, or source}}
2. {{check 2 — e.g. no two sections contradict}}
3. {{check 3 — e.g. each stated mechanism is measured, not asserted}}
4. {{check 4 — e.g. nothing here can be quoted out of context to mislead}}
ARTIFACT
{{artifact, full scaffold from Pass 1}}
# PASS 3 — CONFIDENCE SCORECARD · same reviewer, after the critique
Score every section 1-5 (1 = would block release, 5 = ship as-is) with
a one-line rationale per score. HARD RULE: at least one section must
score 3 or below. A scorecard of all fives is invalid — find the
weakest section and defend a real criticism of it.
SECTION SCORE RATIONALE (one line, anchored to text)
{{section 1}} _/5 ...
{{section 2}} _/5 ...
{{section N}} _/5 ...
LOWEST-SCORING SECTION: {{name}} — the one change that raises it.
Feed the lowest-scoring section and its FAIL/PARTIAL verdicts to the §6.2 reviser; leave the fours and fives alone (the reviser's leash). The scorecard is also a cheap audit trail: in a regulated setting, "every section scored, weakest one named and addressed" is a defensible record of having actually checked — which is most of what a downstream reviewer is looking for.
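The hard rule is mechanical enough to enforce in code before a scorecard is accepted. A minimal sketch, assuming the scorecard has already been parsed into a section-to-score mapping; `validate_scorecard` and its `ceiling` parameter are hypothetical names.

```python
def validate_scorecard(scores: dict[str, int], ceiling: int = 3) -> str:
    """Return the lowest-scoring section, or raise if the hard rule is broken.

    scores maps section name -> confidence in 1..5.
    Rule from Pass 3: at least one section must score `ceiling` or below.
    """
    if not scores:
        raise ValueError("empty scorecard")
    for section, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{section}: score {score} is outside 1-5")
    weakest = min(scores, key=scores.get)   # first section wins a tie
    if scores[weakest] > ceiling:
        raise ValueError(f"no section scored {ceiling} or below; re-run Pass 3")
    return weakest
```

A caller would catch the `ValueError`, re-run the scoring pass, and hand only the returned section to the reviser.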
Reflexion loops: carry the critique forward
One critique pass fixes one draft. Reflexion (Shinn et al., 2023) turns the pattern into a loop with memory: attempt the task, fail against an external signal (a unit test, a validator, an environment), then have a reflector distill why it failed into one or two sentences — and feed only those lessons into the next attempt, not the failed transcripts. The critique becomes episodic memory: a verbal gradient step, applied at inference time, with no weights touched.
The design choice that makes it work is what you exclude. Naive retry-with-history stuffs every failed attempt into context, which bloats the prompt and — worse — anchors the model on its own failed approach; models shown their previous wrong answer reproduce its skeleton with cosmetic edits. A distilled lesson ("the regex missed multiline input; anchor with \A…\z") transfers the information without the anchor. Memory should hold conclusions, not transcripts.
# REFLEXION LOOP — repeat until pass or attempts == K
ATTEMPT k # fresh context: task + memory, never previous transcripts
<memory> # one lesson per failed attempt, written by the reflector
- Attempt 1 failed: regex missed multiline input; anchor with \A…\z
- Attempt 2 failed: parsing fixed, but the empty-file case now throws
</memory>
<task>{{task}}</task>
# after each failure, with the attempt + the error/test output:
REFLECTOR
In at most 2 sentences: state why this attempt failed and the one
rule that would have prevented it. Append to <memory>.
Do not apologize. Do not restate the task. Do not propose code.
The honest constraint: Reflexion's published gains (it pushed GPT-4 from ~80% to ~91% pass@1 on HumanEval) lean on a ground-truth signal — tests either pass or they don't. With no external verifier, the reflector grades its own homework and the loop can wander: lessons become confabulated, and accuracy can go down across iterations (§6.8). Reflexion is a technique for tasks with oracles, approximate or exact.
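The control flow of the loop is small enough to sketch directly. This is an illustration of the idea under stated assumptions, not the paper's implementation: `attempt`, `oracle`, and `reflect` are hypothetical callables supplied by the caller, with `oracle` standing in for whatever external signal (tests, a validator) the task provides.

```python
from typing import Callable, Optional

def reflexion(
    task: str,
    attempt: Callable[[str, list[str]], str],   # (task, lessons) -> candidate
    oracle: Callable[[str], Optional[str]],     # candidate -> error text, None on pass
    reflect: Callable[[str, str], str],         # (candidate, error) -> short lesson
    max_attempts: int = 3,
) -> tuple[Optional[str], list[str]]:
    """Retry a task, carrying forward only distilled lessons."""
    lessons: list[str] = []
    for _ in range(max_attempts):
        candidate = attempt(task, lessons)      # sees lessons, never old transcripts
        error = oracle(candidate)               # external signal decides pass/fail
        if error is None:
            return candidate, lessons
        lessons.append(reflect(candidate, error))   # memory holds conclusions only
    return None, lessons                        # budget exhausted without a pass
```

Note that the failed candidate is passed to `reflect` but never to the next `attempt`; that exclusion is the design choice the section argues for.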
Red-team prompting: attack your own output
A critique prompt asks "how good is this?" A red-team prompt asks "how does this fail?" — and the reframing changes what the model retrieves. Assistants are tuned toward helpfulness, which makes their default review charitable: they look for things to fix gently. Casting the model as a hostile reviewer — a security auditor, a competitor's analyst, a paid breaker — relicenses pure negativity, and the findings get sharper and more specific. The persona isn't theater; it's distribution selection (Ch 02).
Structure the attack like a security review, because the taxonomy forces coverage instead of letting the model fixate on its first objection. Four lines of attack, in escalating order of imagination: failure modes (inputs and states where behavior is wrong or undefined), edge cases (empty, enormous, malformed, concurrent, adversarial), the hostile reader (the sentence a critic quotes out of context; the claim a competitor screenshots), and abuse (how a motivated user weaponizes the artifact exactly as written).
RED TEAM # fresh context — the attacker must not have written the artifact
You wrote nothing below. You are a hostile reviewer paid per finding
to identify the fastest ways this artifact fails in production.
ARTIFACT
{{output}}
Attack in order:
1. FAILURE MODES inputs or states where behavior is wrong/undefined
2. EDGE CASES empty · huge · malformed · concurrent · adversarial
3. HOSTILE READER the sentence a critic quotes out of context;
the claim a competitor puts on a slide
4. ABUSE how a motivated user weaponizes this as written
For each finding: severity P0–P3, the exact text or line, and a
concrete trigger that reproduces it. No praise. No summary.
If a category yields nothing real, write "no finding" — do not invent.
Red teams produce candidates, not confirmations. Some findings will be hallucinated — the "concrete trigger" requirement is the triage filter (an attack that can't name its reproduction is noise), and the explicit permission to return "no finding" suppresses the model's urge to fill all four quotas. Feed the surviving P0/P1s back through the §6.2 reviser.
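The triage step can be sketched as a plain filter. The `Finding` record and `triage` function below are hypothetical, and they assume the attacker's output has already been parsed into severity, quoted text, and trigger fields.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str   # "P0".."P3"
    quote: str      # exact artifact text the finding points at
    trigger: str    # concrete reproduction; empty if the attacker gave none

def triage(findings: list[Finding], artifact: str,
           keep: tuple[str, ...] = ("P0", "P1")) -> list[Finding]:
    """Keep findings that are severe, anchored to real text, and reproducible."""
    survivors = []
    for f in findings:
        anchored = bool(f.quote) and f.quote in artifact
        reproducible = bool(f.trigger.strip())
        if f.severity in keep and anchored and reproducible:
            survivors.append(f)
    return sorted(survivors, key=lambda f: f.severity)   # P0 before P1
```

The filter checks that a trigger was named, not that it actually reproduces; running the trigger is still a separate step.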
Pre-mortem: write the post-incident report first
The red team attacks an artifact that exists. The pre-mortem — borrowed from Gary Klein's decision research — attacks a plan before anything is built, which is when objections are cheapest to act on. The move is a tense shift: not "what could go wrong?" (which invites hedged, low-effort maybes) but "it is six months later and this failed — explain". Psychologists call it prospective hindsight: presupposing the outcome measurably increases the number and specificity of causes people generate. The same framing moves an LLM out of its plan-completion groove and into its incident-report groove — a genre it knows deeply, and one whose conventions force a timeline, a root cause, and a missed early signal.
PRE-MORTEM # run at planning time, before resources are committed
It is six months later. The plan below shipped and failed badly
enough to be rolled back. Write the post-incident report.
PLAN
{{plan}}
REPORT FORMAT
- TIMELINE what broke first and what it cascaded into
- ROOT CAUSE the assumption in the plan that turned out false
- EARLY SIGNAL the metric that would have caught this in week 1
- THE FIX the single change to the plan that prevents this
Write three independent reports: the MOST LIKELY failure, the MOST
EXPENSIVE failure, and the MOST EMBARRASSING failure. Different root
causes for each — no overlap.
The deliverable is not anxiety; it's the EARLY SIGNAL lines. Three pre-mortem reports yield three monitoring metrics and usually one genuine plan change — which is a better return than most planning meetings. The most-likely/most-expensive/most-embarrassing split exists to break the model's habit of writing the same failure three ways.
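Since the EARLY SIGNAL lines are the deliverable, it can help to pull them out mechanically. A small sketch, assuming each report follows the REPORT FORMAT above with one labeled line per field; `early_signals` is a name introduced here.

```python
def early_signals(reports: list[str]) -> list[str]:
    """Pull the EARLY SIGNAL line out of each pre-mortem report."""
    signals = []
    for report in reports:
        for line in report.splitlines():
            # Drop the leading bullet, then require the label at line start.
            label, _, rest = line.lstrip("- ").partition("EARLY SIGNAL")
            if label == "" and rest.strip():
                signals.append(rest.strip(" :"))
                break   # one signal per report
    return signals
```

A report that contributes nothing to the list skipped the field and is a reasonable candidate for a re-run.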
Council of judges
A single LLM judge is noisy: rerun it and the verdict flips more often than anyone likes to admit (Ch 07 measures this). The fix is the classical one: ask several and take the majority. If \(n\) judges vote independently and each is right with probability \(p > \tfrac12\), majority error doesn't just shrink, it collapses as \(n\) grows. The majority is right with probability

\[ P(\text{majority correct}) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k} \qquad \text{(EQ P6.2)} \]

EQ P6.2 is easy to state and easy to disbelieve, so run it. The cell below builds an IID council (\(n\) judges, each correct with probability \(a = 0.72\), playing the role of \(p\), and voting independently) and reports majority accuracy from both a seeded simulation and the exact binomial sum, side by side, for every council size 1 through 9.
# council of judges: majority accuracy vs council size (Condorcet, EQ P6.2)
import numpy as np
from math import comb
rng = np.random.default_rng(0)
a = 0.72 # each judge correct w.p. a, independently
sizes = range(1, 10)
trials = 20000
print(f"single-judge accuracy a = {a}")
print(" n simulated exact")
exact_pts = []
for n in sizes:
    votes = rng.random((trials, n)) < a # True = judge votes correctly
    sim = (votes.sum(1) > n / 2).mean() # strict majority correct
    exact = sum(comb(n, k) * a**k * (1 - a)**(n - k) # EQ P6.2, summed tail
                for k in range(n // 2 + 1, n + 1))
    exact_pts.append(exact)
    print(f"{n:2d} {sim:8.3f} {exact:8.3f}")
# plot_xy is not defined in this cell; it is assumed to be a plotting helper
# supplied by the notebook environment, so call it only when it exists
if "plot_xy" in globals():
    plot_xy(list(sizes), exact_pts)
Three readings off the printout. The odd sizes climb steadily, from 72% at n=1 to ~92% at n=9, which is the variance reduction EQ P6.3 promises when \(\rho = 0\). The even sizes dip below the odd ones (a strict majority demands a real lead, and ties count as losses), which is the arithmetic behind "use odd N." And flip \(a\) to 0.45: every added judge now makes the council more confidently wrong, the theorem's cruel symmetry. This is the ceiling a real council never reaches, because its judges are correlated.
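The IID numbers are a ceiling. One way to see how far correlation pulls a council below it is a toy shared-error model, offered here as an assumption rather than anything the text specifies: with probability `rho` every judge copies one shared verdict, otherwise they vote independently. Under that model the pairwise correlation between judges' correctness is exactly `rho`, and majority accuracy has a closed form.

```python
from math import comb

def majority_iid(n: int, a: float) -> float:
    """P(strict majority of n independent judges is correct)."""
    return sum(comb(n, k) * a**k * (1 - a)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def majority_correlated(n: int, a: float, rho: float) -> float:
    """Majority accuracy under the toy shared-error model (an assumption).

    With probability rho all judges copy one shared verdict (right w.p. a);
    otherwise they vote independently.
    """
    return rho * a + (1 - rho) * majority_iid(n, a)

a = 0.72
print(" n  rho=0.0  rho=0.3  rho=0.6")
for n in (1, 3, 5, 9):
    row = "  ".join(f"{majority_correlated(n, a, r):7.3f}" for r in (0.0, 0.3, 0.6))
    print(f"{n:2d}  {row}")
```

At `rho = 1` the council is a single judge however large it is, which is the limiting case the distinct-lens design tries to stay away from.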
Hence the production shape: N judges, each with a distinct lens, in separate contexts, plus an aggregator that sees only verdicts and evidence — never the artifact author's chain of thought. The lenses do double duty: they decorrelate the judges and partition the review surface, so five judges cover five failure classes instead of quintuple-checking grammar. This is also a familiar object wearing new clothes: GRPO scores each sampled response against its group's mean reward (Vol II · EQ 5.6) — the group is the baseline that makes a single noisy score meaningful. A council is the same variance-reduction move executed at inference time, with verdicts instead of rewards; self-consistency (Ch 04) is its degenerate cousin where every "judge" is the same persona resampled at temperature.
# COUNCIL — each judge is a separate call; aggregator sees verdicts only
JUDGE i of N # one lens per judge — assign, don't let them choose
You are one of N independent reviewers. You see only the artifact
and your lens. Verdict first, then at most 3 sentences of evidence,
each anchored to quoted text.
LENS 1 factual accuracy — verify every checkable claim
LENS 2 internal consistency — do any two sections contradict?
LENS 3 completeness against the spec — what is missing?
LENS 4 security and abuse — how does this get misused?
LENS 5 the intended reader — what will they misunderstand first?
VERDICT: ACCEPT | REVISE | REJECT
AGGREGATOR # gets the N verdicts + evidence, not the artifact's history
Tally the verdicts. Quote each judge's strongest piece of evidence.
Where judges disagree, name the disagreement — splits are signal,
not noise to average away. Output: final verdict + minimal revision
list, ordered by how many judges' objections each item resolves.
Practicalities: use odd N; 3–5 judges capture most of the gain (the binomial tail flattens fast); keep the aggregator's job mechanical — tally, quote, surface splits. An aggregator allowed to "weigh holistically" becomes a sixth judge with veto power, and your variance reduction evaporates.
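A mechanical aggregator needs no model call for the tally itself. The sketch below is one possible shape, with hypothetical names; reporting "NO MAJORITY" when no verdict wins more than half the votes is a choice made here, since three verdict options can split without a winner.

```python
from collections import Counter

def aggregate(verdicts: dict[str, str]) -> dict:
    """Tally judge verdicts mechanically and surface any split.

    verdicts maps lens name -> "ACCEPT" | "REVISE" | "REJECT".
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    tally = Counter(verdicts.values())
    top, top_count = tally.most_common(1)[0]
    has_majority = top_count > len(verdicts) / 2
    return {
        "tally": dict(tally),
        "verdict": top if has_majority else "NO MAJORITY",
        "split": len(tally) > 1,                       # any disagreement at all
        "dissent": sorted(lens for lens, v in verdicts.items() if v != top),
    }
```

The `dissent` list names which lenses disagreed, so a split is reported rather than averaged away.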
Debate & the devil's advocate
Councils judge in parallel silence. Debate makes the strongest case for and against collide, on the theory — proposed for AI oversight by Irving, Christiano and Amodei (2018) — that refuting a lie is easier than detecting one: a judge too weak to evaluate a claim directly can still tell which side's evidence survived rebuttal. The empirical record is genuinely encouraging on factual tasks: in the Khan et al. (2024) reading-comprehension setup, non-expert judges (who couldn't see the source text) gained double-digit accuracy from watching expert models debate, and stronger debaters helped judges more. The honest asterisk: debate rewards persuasion, persuasion and truth are correlated but not identical, and on open-ended questions a silver-tongued wrong answer can win rounds.
DEBATE # advocates argue in separate contexts; judge sees transcript only
ROUND 1 — ADVOCATE A: strongest honest case FOR the answer below.
ROUND 1 — ADVOCATE B: strongest honest case AGAINST it.
Cite evidence; invented evidence forfeits the round.
ROUND 2 — each advocate rebuts the other's specific points.
Quote what you rebut. Unanswered points stand.
JUDGE # fresh context — has not seen the original question's solution
You see only the transcript. Score which side's EVIDENCE survived
rebuttal — not which side is better written. List the surviving
points per side. Verdict + the single argument that decided it.
The budget version is the devil's advocate: a single extra instruction at the end of a generation — "before finalizing: state the strongest case that this answer is wrong; if it changes your answer, change it; if not, say in one line why it fails." It runs inside the author's own context, so it inherits the author's blind spots (§6.8) and is the weakest pattern in this chapter — but it costs ~50 tokens, catches the embarrassing class of error, and is the one technique here cheap enough to leave on by default.
Honest costs
Every pattern in this chapter multiplies tokens, latency, or both. The honest ledger:
| Pattern | LLM calls | Token cost | Latency shape | Reach for it when |
|---|---|---|---|---|
| Critique → revise | 3 | ≈ 3× | 3 serial turns | one artifact, quality floor matters |
| Reflexion loop | 2k + 1 | 5–10× | serial × attempts | an external pass/fail signal exists |
| Red team | 2–3 | 2–3× | short serial | anything public-facing or load-bearing |
| Pre-mortem | 1–3 | ≈ 2× | planning time | before resources are committed |
| Council of N | N + 1 | ≈ (N+1)× | ~2 turns (judges run parallel) | high-stakes verdicts, noisy single judge |
| Debate | 5–7 | 4–6× | 3 serial rounds | contested claims, judge weaker than task |
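For budgeting, the rows whose call counts are formulas can be turned into a small calculator. This is a sketch with hypothetical names; the debate formula (two advocates per round plus one judge) is an assumption that happens to reproduce the table's 5 to 7 range for two or three rounds.

```python
def llm_calls(pattern: str, n_judges: int = 5, attempts: int = 3, rounds: int = 2) -> int:
    """Model calls for the patterns whose cost in the table is a formula."""
    if pattern == "critique_revise":
        return 3                      # author, critic, reviser
    if pattern == "reflexion":
        return 2 * attempts + 1       # the table's 2k + 1 for k attempts
    if pattern == "council":
        return n_judges + 1           # N judges plus one aggregator
    if pattern == "debate":
        return 2 * rounds + 1         # assumption: two advocates per round, one judge
    raise ValueError(f"no call formula for pattern {pattern!r}")

for name in ("critique_revise", "reflexion", "council", "debate"):
    print(f"{name:16s} {llm_calls(name):2d} calls")
```

Call counts are only a proxy for cost: token totals also depend on how much of the artifact each call has to reread.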
The subtler cost is sycophantic self-review. Models exhibit self-preference: asked to grade outputs, they systematically favor their own — and recent work suggests part of the mechanism is self-recognition (Panickssery et al., 2024). Worse, a critic running in the author's context inherits the author's framing, retrieves the author's justifications, and converges on "looks good with minor nits." This is why every template above repeats the same clause: the judge is a fresh context. The author's chain of thought is contamination. Same model with a clean context and an adversarial persona is good; a different model entirely is better; a different model with a rubric and quote-to-convict rules is the strongest cheap judge you can build.
And self-correction has a documented failure mode: without an external signal, asking a model to reconsider a correct answer frequently talks it out of that answer — measured as net-negative "intrinsic self-correction" on reasoning benchmarks (Huang et al., 2024). The lesson is not "never critique"; it's that critique needs either an oracle (tests, validators, retrieval) or an asymmetric frame (rubric, attack taxonomy) — a bare "are you sure?" is an invitation to dither, and models accept it.
When single-shot is fine: low stakes, latency-bound UX, taste-driven tasks with no articulable rubric, and anywhere \(\Delta_{\mathrm{GV}} \approx 0\). One more honest note: reasoning models trained with RLVR (Vol II · Ch 05) already run a private draft-check-backtrack loop inside their chain of thought. For them, external critique buys less than it did in 2023, and the compute for a council of reasoning models is often better spent on one longer reasoning budget. Measure, don't assume, which is precisely the subject of the next chapter.
You have been judging by hand; now make it reproducible. Chapter 07: prompt evals, the biases of LLM judges (position, length, self-preference — measured), versioning prompts like code, and the Prompt Lab — every technique in this volume, run live against a real model with your own key.
Further reading
- Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. — Introduces GSM8K and shows a verifier-ranked small generator beating a 30× larger one: the generator–verifier gap, measured.
- Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. — The canonical reflect-on-failure loop with episodic memory; the source of §6.3's distilled-lesson design.
- Irving, G., Christiano, P., & Amodei, D. (2018). AI Safety via Debate. — The original argument that judging adversarial debate is easier than direct evaluation, the foundation of §6.7.
- Khan, A., et al. (2024). Debating with More Persuasive LLMs Leads to More Truthful Answers. — Empirical evidence that weaker judges gain accuracy from watching stronger models debate — and the persuasion caveat.
- Huang, J., et al. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. — Documents net-negative intrinsic self-correction without an external signal; the honest counterweight to this whole chapter.
- Panickssery, A., Bowman, S. R., & Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. — Links the self-preference bias to self-recognition, the mechanism behind §6.8's "fresh context" rule.
- Wang, X., et al. (2023). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. — Majority vote over sampled reasoning paths: the degenerate, single-persona cousin of the council in §6.6.