VOLUME III — PROMPTING · CHAPTER 06 / 07

Self-Critique, Red Teams & Councils

Everything a model emits in one pass is a draft: fluent, confident, and unexamined. The techniques here exploit one asymmetry. Models are measurably better at judging work than at producing it, so a second pass spent checking buys more quality per token than a longer first pass spent generating, provided the judge is never the author still warm in the same context.

LEVEL: ADVANCED | READING TIME: ≈ 24 MIN | BUILDS ON: CH 04–05 · VOL II CH 05 | INSTRUMENTS: CRITIQUE DIFF · COUNCIL SIM
6.1 Why single-pass output is a draft

An autoregressive model commits to every token as it goes. There is no backspace in the decoding loop: a weak opening sentence constrains everything after it, an early arithmetic slip propagates to the conclusion, and the model's trademark fluency papers over both. Single-pass generation is a first draft produced by a writer who is forbidden from rereading.

What rescues this is an asymmetry the field keeps rediscovering. Ask a model to produce a correct solution and it succeeds with some probability; show it a candidate solution and ask "is this correct?" and it succeeds more often. The canonical early result: on grade-school math, a small model that generates many answers and ranks them with a trained verifier beat a generator 30× its size sampling once (Cobbe et al., 2021). The same asymmetry is why RLHF works at all: humans (and reward models) can rank outputs they could never write (Vol II · EQ 5.2). The intuition is old: checking a proof is easier than finding one.

EQ P6.1 · THE GENERATOR–VERIFIER GAP
$$ \Delta_{\mathrm{GV}} \;=\; \underbrace{\Pr\!\big[\,V_\theta(x,\hat y) \,=\, \mathbf{1}[\hat y \text{ solves } x]\,\big]}_{\text{verification accuracy}} \;-\; \underbrace{\Pr_{\hat y\,\sim\,p_\theta(\cdot\mid x)}\!\big[\,\hat y \text{ solves } x\,\big]}_{\text{generation accuracy}} $$
\(V_\theta\) is the same model prompted as a judge. When \(\Delta_{\mathrm{GV}} > 0\), extra compute is better spent checking and selecting than generating longer. The gap is task-dependent and honestly contested: it is large for code-with-tests, math-with-verifiers, and factual claims; it shrinks toward zero for taste, style, and — without an external signal — for the model's own reasoning chains (§6.8). Every pattern in this chapter is a way of spending a positive gap.
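
One way to check whether a task has a positive gap is to measure both terms of EQ P6.1 on a small labeled set. The sketch below assumes you already know, for each problem, whether the sampled draft solved it and whether the judge called it solved. The function name and the toy data are illustrative and do not come from any library.

```python
# Estimate the generator-verifier gap (EQ P6.1) from a small labeled set.
# Assumption: each position holds ground truth for one sampled draft plus
# the judge's verdict on that same draft. All names are illustrative.
from typing import Sequence

def generator_verifier_gap(solved: Sequence[bool], judged_ok: Sequence[bool]) -> float:
    """Return verification accuracy minus generation accuracy."""
    if len(solved) != len(judged_ok) or len(solved) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    n = len(solved)
    generation_acc = sum(solved) / n
    # The judge is right when its verdict matches the ground truth.
    verification_acc = sum(s == j for s, j in zip(solved, judged_ok)) / n
    return verification_acc - generation_acc

# Toy data: 10 drafts, 5 actually correct; the judge misreads 2 of them.
solved    = [True, True, True, True, True, False, False, False, False, False]
judged_ok = [True, True, True, True, False, False, False, False, False, True]
gap = generator_verifier_gap(solved, judged_ok)
print(f"estimated gap: {gap:+.2f}")
```

With only a handful of labeled problems the estimate is noisy, so treat a small positive number as "unclear" rather than as a green light.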

The gap is not a slogan; it is arithmetic. Suppose a model writes a correct first draft only half the time, but can judge a candidate correctly 85% of the time. A bare draft is a coin flip — but generate, verify, and revise only what the verifier flags, and the effective accuracy climbs well past either number alone. The cell below runs that lifecycle on a toy model and prints the lift; it is the smallest possible version of every pattern in this chapter.

PYTHON · RUNNABLE IN-BROWSER
# generator-verifier gap: generate -> verify -> revise lifts effective accuracy
import numpy as np
rng = np.random.default_rng(0)
g, v = 0.50, 0.85                      # P(draft correct), P(verifier judges correctly)
N = 200_000

correct = rng.random(N) < g                    # is the draft actually right?
verifier_right = rng.random(N) < v             # does the verifier judge it correctly?
says_ok = np.where(correct, verifier_right, ~verifier_right)   # OK iff judged "good"
revised = rng.random(N) < g                     # flagged drafts get one fresh attempt
final = np.where(says_ok, correct, revised)     # keep OK drafts; replace the flagged

analytic = g*v + g*(1-v)*g + (1-g)*v*g          # the three ways to end up correct
print(f"raw generator          {correct.mean():.3f}")
print(f"verifier flags 'bad'   {(~says_ok).mean():.3f}   (these get revised)")
print(f"after verify + revise  {final.mean():.3f}   (analytic {analytic:.3f})")
print(f"lift over raw draft    {final.mean() - correct.mean():+.3f}")

Half-right drafts become two-thirds-right answers (0.675 against 0.500), paid for with one verification pass plus a fresh attempt on the flagged half: that surplus is \(\Delta_{\mathrm{GV}}\) spent. Drop the verifier to \(v = 0.5\) (a coin) and the lift collapses to zero: a verifier no better than chance adds no information, which is the §6.8 caution stated as code. Push \(v\) higher and the ceiling rises toward what a perfect filter plus one retry can reach, \(g + (1-g)\,g = 0.75\) at \(g = 0.5\).
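
The analytic line in the cell has a compact closed form: subtracting \(g\) from \(gv + g^2(1-v) + g(1-g)v\) leaves a lift of \(g(1-g)(2v-1)\). The sketch below sweeps \(v\) with that expression. The function names are illustrative, and the model is the same toy as the cell (one retry per flagged draft, retry accuracy equal to \(g\)), not a claim about any real system.

```python
# Closed-form lift for the toy generate -> verify -> revise model.
# Assumes one retry per flagged draft and retry accuracy equal to g.

def lift(g: float, v: float) -> float:
    """Accuracy gained over the raw draft: g * (1 - g) * (2v - 1)."""
    return g * (1 - g) * (2 * v - 1)

def final_accuracy(g: float, v: float) -> float:
    """Sum of the three ways to end up correct, term by term as in the cell."""
    return g * v + g * (1 - v) * g + (1 - g) * v * g

for v in (0.50, 0.60, 0.70, 0.85, 1.00):
    print(f"v={v:.2f}  final={final_accuracy(0.5, v):.3f}  lift={lift(0.5, v):+.3f}")
```

The factor \(2v - 1\) makes the sign explicit: below \(v = 0.5\) the lift is negative, because a worse-than-chance verifier flags good drafts and keeps bad ones.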

A model writes a correct first draft with probability \(0.50\), but judges a candidate solution correctly with probability \(0.85\). What is the generator–verifier gap \(\Delta_{\mathrm{GV}}\) (EQ P6.1)?
\(\Delta_{\mathrm{GV}} = 0.85 - 0.50 =\) 0.35. A positive gap means extra compute is better spent checking-and-selecting than generating longer — the surplus every pattern in this chapter is built to spend.

Three topologies organize everything that follows: run the check after the draft (sequential — self-critique, Reflexion), run many checks in parallel (the council), or make two copies of the model fight and judge the wreckage (debate). The pre-mortem and red team are sequential patterns wearing armor: the critique arrives dressed as an attacker or a coroner, which turns out to matter enormously.

FIG P6.1 · THREE VERIFICATION TOPOLOGIES
[Figure: three panels. SEQUENTIAL (§6.2–6.5): author → draft, critic (fresh ctx), reviser. PARALLEL (§6.6): artifact, judge 1, judge 2, judge N, aggregator. ADVERSARIAL (§6.7): adv. A, adv. B, judge.]
Same gap, three ways to spend it. Sequential patterns trade latency for depth on one artifact; parallel councils trade tokens for variance reduction; adversarial setups make claims earn survival under attack. Mint boxes mark the contexts that must stay fresh — they never see the author's reasoning, only its output.

6.2 Self-critique & revise: the three-turn pattern

The minimum viable verification loop is three calls: produce → critique against explicit criteria → revise. Each clause carries weight. Three calls, because critique appended to the generation prompt ("write it, then review your work") collapses into one distribution — the model that just committed to a draft is the model least able to see its flaws, and in practice appends a paragraph of polite self-congratulation. Explicit criteria, because "make it better" licenses cosmetic edits; a rubric converts taste into checkable claims.

Rubric-as-prompt is the load-bearing trick. A good rubric has 3–6 criteria, each phrased so that a verdict can be defended by quotation — the critic must point at failing text, not emit vibes. A worked example, for a status-update paragraph:

| Criterion | Checkable phrasing | Catches |
|---|---|---|
| Specificity | every claim carries a number, date, or named source | "much faster", "significantly" |
| Falsifiability | a skeptic could in principle prove each claim wrong | "better performance overall" |
| Causal clarity | mechanisms stated: X because Y, with Y measured | "caching and other improvements" |
| Reader cost | no sentence makes the reader do the author's work | "the team worked very hard" |

# THE 3-TURN PATTERN — each turn is a separate API call
TURN 1 — AUTHOR
Write the deployment update for the engineering newsletter.
{{task context}}

TURN 2 — CRITIC   # fresh context: gets draft + rubric, nothing else
You are reviewing a draft you did not write. Grade it against each
criterion below. For every verdict, QUOTE the text that earns it.
Do not rewrite. Do not praise. Verdicts: PASS / PARTIAL / FAIL.

RUBRIC
1. SPECIFICITY     every claim carries a number, date, or named source
2. FALSIFIABILITY  a skeptic could in principle prove each claim wrong
3. CAUSAL CLARITY  mechanisms stated (X because Y), not adjacency
4. READER COST     no sentence makes the reader do the author's work

DRAFT
{{draft}}

TURN 3 — REVISER  # gets draft + critique; NOT the critic's context
Rewrite the draft so every FAIL and PARTIAL becomes a PASS.
Change nothing the critique did not flag. Output only the revision.

The reviser's leash — change nothing the critique did not flag — prevents revision drift, where a model "improving" a draft quietly rewrites the parts that were already right. And the critic's quote-to-convict rule is your hallucination filter: a criticism that cannot point at text is usually invented.
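
A minimal sketch of the three turns as three separate calls, to make the context boundaries concrete. `call_model` is a hypothetical stand-in for whatever client you use (one prompt string in, one completion out, no shared history); the stub at the bottom exists only so the sketch runs offline.

```python
# Sketch of the three-turn pattern as three separate calls.
# `call_model` is a hypothetical prompt -> completion function with no
# shared history between calls; swap in your own client.
from typing import Callable

RUBRIC = """1. SPECIFICITY     every claim carries a number, date, or named source
2. FALSIFIABILITY  a skeptic could in principle prove each claim wrong
3. CAUSAL CLARITY  mechanisms stated (X because Y), not adjacency
4. READER COST     no sentence makes the reader do the author's work"""

def critique_and_revise(task: str, call_model: Callable[[str], str]) -> dict:
    # Turn 1: the author sees only the task.
    draft = call_model(f"Write the deployment update.\n{task}")
    # Turn 2: the critic sees draft + rubric, never the author's prompt.
    critique = call_model(
        "You are reviewing a draft you did not write. Grade it against each "
        "criterion. For every verdict, QUOTE the text that earns it. "
        "Do not rewrite. Do not praise. Verdicts: PASS / PARTIAL / FAIL.\n\n"
        f"RUBRIC\n{RUBRIC}\n\nDRAFT\n{draft}"
    )
    # Turn 3: the reviser sees draft + critique only.
    revision = call_model(
        "Rewrite the draft so every FAIL and PARTIAL becomes a PASS. "
        "Change nothing the critique did not flag. Output only the revision.\n\n"
        f"DRAFT\n{draft}\n\nCRITIQUE\n{critique}"
    )
    return {"draft": draft, "critique": critique, "revision": revision}

# Stub model so the sketch runs offline; it records every prompt it receives.
seen_prompts = []
def stub_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"<output {len(seen_prompts)}>"

result = critique_and_revise("Cache rollout, week 12.", stub_model)
print(result)
```

Passing plain strings between calls, rather than a running message list, is the design choice that keeps the critic's context clean: there is no history object to leak the author's reasoning by accident.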

INSTRUMENT P6.1 · CRITIQUE PASS DIFF · 3 TURNS · RUBRIC OF §6.2
[Interactive instrument; panel labels: DRAFT, RUBRIC, REVISION, PIPELINE TOKEN COST.]
Step through the three turns. The draft is fluent and empty; the critique convicts it by quotation, criterion by criterion; the revision view shows exactly what the critique paid for — deletions struck in red, insertions in mint. Toggle VIEW: CLEAN to read the final text. Token cost is computed from the actual word counts of the three turns — note it lands near 5× here, not the 3× rule of thumb of §6.8: short drafts amortize the rubric badly, long artifacts amortize it well.

The three-pass discipline

In regulated-industry field practice — where a flawed artifact survives to a downstream review that has consequences — the three-turn pattern hardens into a fixed ritual run on every load-bearing document. The shape is the same three calls, but each pass is named, scoped, and given a job it cannot fake its way out of.

Pass 1 — generate. Produce the full scaffold: not an outline, not a sketch, but the complete artifact with every section populated, so the critic has real text to convict rather than intentions to approve. A half-finished draft invites a half-hearted review.

Pass 2 — critique as a named senior reviewer. Fresh context. The model is cast as a specific, senior, skeptical persona — a named role with a reputation to protect — and made to grade the artifact against an explicit, pre-registered checklist, every verdict anchored to quoted text. The persona is load-bearing for the same reason as in §6.4: "review this" returns courtesy; "you are the principal reviewer who signs off on this and owns the failures" returns findings.

Pass 3 — confidence-score every section 1–5 with rationale. The reviewer assigns each section a numeric confidence (1 = would block release, 5 = ship as-is) and a one-line rationale for the score. The enforced rule is the whole point: at least one section must score ≤ 3. A scorecard of straight fives is rejected and the pass re-run — because all fives means the critique never happened.

The forced low score is a direct countermeasure to sycophantic self-review (§6.8). A model grading work — especially work adjacent to its own first draft — drifts toward charitable, "looks good with minor nits" verdicts; left to free-form scoring it will hand out fives to close the task. Mandating that the distribution contain a low number removes the comfortable equilibrium: the model can no longer satisfy the instruction and bless everything, so it is forced to locate the genuinely weakest section and defend a real criticism of it. The constraint does not invent flaws — every artifact has a weakest part — it simply refuses to let the reviewer pretend there isn't one. In practice the section the model is most reluctant to mark down is, more often than not, the one that actually needed the work.

# PASS 2 — SENIOR REVIEWER · fresh context: artifact + checklist only
You are the principal reviewer who signs the release for this artifact
and personally owns every defect that reaches production. You did not
write it. Grade it against the checklist. QUOTE the text behind every
verdict. Do not rewrite. Do not praise. Verdict: PASS / PARTIAL / FAIL.

CHECKLIST
1. {{check 1 — e.g. every claim carries a number, date, or source}}
2. {{check 2 — e.g. no two sections contradict}}
3. {{check 3 — e.g. each stated mechanism is measured, not asserted}}
4. {{check 4 — e.g. nothing here can be quoted out of context to mislead}}

ARTIFACT
{{artifact, full scaffold from Pass 1}}

# PASS 3 — CONFIDENCE SCORECARD · same reviewer, after the critique
Score every section 1-5 (1 = would block release, 5 = ship as-is) with
a one-line rationale per score. HARD RULE: at least one section must
score 3 or below. A scorecard of all fives is invalid — find the
weakest section and defend a real criticism of it.

SECTION                 SCORE   RATIONALE (one line, anchored to text)
{{section 1}}          _/5    ...
{{section 2}}          _/5    ...
{{section N}}          _/5    ...
LOWEST-SCORING SECTION: {{name}} — the one change that raises it.

Feed the lowest-scoring section and its FAIL/PARTIAL verdicts to the §6.2 reviser; leave the fours and fives alone (the reviser's leash). The scorecard is also a cheap audit trail: in a regulated setting, "every section scored, weakest one named and addressed" is a defensible record of having actually checked — which is most of what a downstream reviewer is looking for.
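
The hard rule is easy to enforce mechanically before the scorecard reaches the reviser. The sketch below assumes the reviewer's output has already been parsed into a section-to-score mapping; that shape and the function name are assumptions for illustration.

```python
# Validate a Pass 3 scorecard: every score in 1..5, at least one score <= 3.
# The dict-of-scores shape is an assumption; adapt it to however you parse
# the reviewer's output.

def validate_scorecard(scores: dict[str, int]) -> str:
    """Return the weakest section's name, or raise if the card is invalid."""
    if not scores:
        raise ValueError("empty scorecard")
    for section, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{section}: score {score} outside 1-5")
    weakest = min(scores, key=scores.get)
    if scores[weakest] > 3:
        # Nothing at 3 or below: the hard rule says re-run the pass.
        raise ValueError("no section scored 3 or below; re-run the review")
    return weakest

card = {"Summary": 5, "Rollout plan": 4, "Risk analysis": 2}
print(validate_scorecard(card))
```

A rejected card should trigger a fresh Pass 3 call rather than an in-place edit of the scores, so the low number comes with a rationale the reviewer actually wrote.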

6.3 Reflexion loops: carry the critique forward

One critique pass fixes one draft. Reflexion (Shinn et al., 2023) turns the pattern into a loop with memory: attempt the task, fail against an external signal (a unit test, a validator, an environment), then have a reflector distill why it failed into one or two sentences — and feed only those lessons into the next attempt, not the failed transcripts. The critique becomes episodic memory: a verbal gradient step, applied at inference time, with no weights touched.

The design choice that makes it work is what you exclude. Naive retry-with-history stuffs every failed attempt into context, which bloats the prompt and — worse — anchors the model on its own failed approach; models shown their previous wrong answer reproduce its skeleton with cosmetic edits. A distilled lesson ("the regex missed multiline input; anchor with \A…\z") transfers the information without the anchor. Memory should hold conclusions, not transcripts.

# REFLEXION LOOP — repeat until pass or attempts == K
ATTEMPT k   # fresh context: task + memory, never previous transcripts
<memory>    # one lesson per failed attempt, written by the reflector
- Attempt 1 failed: regex missed multiline input; anchor with \A…\z
- Attempt 2 failed: parsing fixed, but the empty-file case now throws
</memory>
<task>{{task}}</task>

# after each failure, with the attempt + the error/test output:
REFLECTOR
In at most 2 sentences: state why this attempt failed and the one
rule that would have prevented it. Append to <memory>.
Do not apologize. Do not restate the task. Do not propose code.

The honest constraint: Reflexion's published gains (it pushed GPT-4 from ~80% to ~91% pass@1 on HumanEval) lean on a ground-truth signal — tests either pass or they don't. With no external verifier, the reflector grades its own homework and the loop can wander: lessons become confabulated, and accuracy can go down across iterations (§6.8). Reflexion is a technique for tasks with oracles, approximate or exact.
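
The loop can be sketched in a few lines once the three roles are separated. `attempt`, `oracle`, and `reflect` below are hypothetical callables you would supply (two model calls and one external check); the toy stand-ins only make the control flow runnable and are not taken from the Reflexion paper's code.

```python
# Sketch of a Reflexion-style loop with an external oracle.
# `attempt`, `oracle`, and `reflect` are hypothetical callables you supply:
#   attempt(task, lessons)         -> candidate
#   oracle(candidate)              -> (passed, error_text)   # tests, validator
#   reflect(candidate, error_text) -> one short lesson
from typing import Callable, Optional

def reflexion(task: str,
              attempt: Callable[[str, list[str]], str],
              oracle: Callable[[str], tuple[bool, str]],
              reflect: Callable[[str, str], str],
              max_attempts: int = 3) -> tuple[Optional[str], list[str]]:
    lessons: list[str] = []          # memory holds conclusions, not transcripts
    for _ in range(max_attempts):
        candidate = attempt(task, lessons)   # fresh call: task + lessons only
        passed, error = oracle(candidate)
        if passed:
            return candidate, lessons
        lessons.append(reflect(candidate, error))
    return None, lessons             # budget exhausted without a pass

# Toy stand-ins so the loop runs offline: the "model" only handles the
# empty-input case once a lesson mentions it.
def toy_attempt(task, lessons):
    return "handles-empty" if any("empty" in l for l in lessons) else "naive"

def toy_oracle(candidate):
    return (candidate == "handles-empty", "crashed on empty input")

def toy_reflect(candidate, error):
    return f"{candidate} failed: {error}; guard the empty case first"

solution, memory = reflexion("parse the file", toy_attempt, toy_oracle, toy_reflect)
print(solution, memory)
```

Note what `attempt` never receives: the failed candidates themselves. Only the distilled lessons cross from one iteration to the next.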

6.4 Red-team prompting: attack your own output

A critique prompt asks "how good is this?" A red-team prompt asks "how does this fail?" — and the reframing changes what the model retrieves. Assistants are tuned toward helpfulness, which makes their default review charitable: they look for things to fix gently. Casting the model as a hostile reviewer — a security auditor, a competitor's analyst, a paid breaker — relicenses pure negativity, and the findings get sharper and more specific. The persona isn't theater; it's distribution selection (Ch 02).

Structure the attack like a security review, because the taxonomy forces coverage instead of letting the model fixate on its first objection. Four lines of attack, in escalating order of imagination: failure modes (inputs and states where behavior is wrong or undefined), edge cases (empty, enormous, malformed, concurrent, adversarial), the hostile reader (the sentence a critic quotes out of context; the claim a competitor screenshots), and abuse (how a motivated user weaponizes the artifact exactly as written).

RED TEAM   # fresh context — the attacker must not have written the artifact
You wrote nothing below. You are a hostile reviewer paid per finding
to identify the fastest ways this artifact fails in production.

ARTIFACT
{{output}}

Attack in order:
1. FAILURE MODES   inputs or states where behavior is wrong/undefined
2. EDGE CASES      empty · huge · malformed · concurrent · adversarial
3. HOSTILE READER  the sentence a critic quotes out of context;
                   the claim a competitor puts on a slide
4. ABUSE           how a motivated user weaponizes this as written

For each finding: severity P0–P3, the exact text or line, and a
concrete trigger that reproduces it. No praise. No summary.
If a category yields nothing real, write "no finding" — do not invent.

Red teams produce candidates, not confirmations. Some findings will be hallucinated — the "concrete trigger" requirement is the triage filter (an attack that can't name its reproduction is noise), and the explicit permission to return "no finding" suppresses the model's urge to fill all four quotas. Feed the surviving P0/P1s back through the §6.2 reviser.
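
The triage step can be partly mechanical. The sketch below keeps a finding only if its quoted text really appears in the artifact, it names a trigger, and its severity makes the cut. The `Finding` fields are an assumed parse of the red-team output, and the sample findings are invented toy data.

```python
# Triage red-team findings: keep only those that quote real text, name a
# trigger, and meet the severity cut. The Finding fields are an assumed
# parse of the red-team prompt's output.
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str   # "P0".."P3"
    quote: str      # exact text the finding points at
    trigger: str    # concrete input or state that reproduces it

def triage(findings: list[Finding], artifact: str,
           keep: tuple[str, ...] = ("P0", "P1")) -> list[Finding]:
    survivors = []
    for f in findings:
        quoted_exists = bool(f.quote.strip()) and f.quote in artifact
        has_trigger = bool(f.trigger.strip())
        if quoted_exists and has_trigger and f.severity in keep:
            survivors.append(f)
    # Sorting the severity strings puts P0 ahead of P1.
    return sorted(survivors, key=lambda f: f.severity)

artifact = "Retries are unlimited. Tokens never expire."
findings = [
    Finding("P1", "Tokens never expire", "replay a token captured last year"),
    Finding("P0", "Retries are unlimited", "client loops on a 500 response"),
    Finding("P0", "Passwords are stored in plaintext", "dump the users table"),  # quote not in artifact
    Finding("P1", "Tokens never expire", ""),                                   # no trigger
    Finding("P3", "Retries are unlimited", "slow network"),                     # below the cut
]
for f in triage(findings, artifact):
    print(f.severity, "|", f.quote)
```

A verbatim substring check is strict on purpose: a finding that paraphrases instead of quoting gets dropped, which is the quote-to-convict rule applied in code. Whether a surviving trigger actually reproduces still needs a human or a test.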

6.5 Pre-mortem: write the post-incident report first

The red team attacks an artifact that exists. The pre-mortem — borrowed from Gary Klein's decision research — attacks a plan before anything is built, which is when objections are cheapest to act on. The move is a tense shift: not "what could go wrong?" (which invites hedged, low-effort maybes) but "it is six months later and this failed — explain". Psychologists call it prospective hindsight: presupposing the outcome measurably increases the number and specificity of causes people generate. The same framing moves an LLM out of its plan-completion groove and into its incident-report groove — a genre it knows deeply, and one whose conventions force a timeline, a root cause, and a missed early signal.

PRE-MORTEM   # run at planning time, before resources are committed
It is six months later. The plan below shipped and failed badly
enough to be rolled back. Write the post-incident report.

PLAN
{{plan}}

REPORT FORMAT
- TIMELINE       what broke first and what it cascaded into
- ROOT CAUSE     the assumption in the plan that turned out false
- EARLY SIGNAL   the metric that would have caught this in week 1
- THE FIX        the single change to the plan that prevents this

Write three independent reports: the MOST LIKELY failure, the MOST
EXPENSIVE failure, and the MOST EMBARRASSING failure. Different root
causes for each — no overlap.

The deliverable is not anxiety; it's the EARLY SIGNAL lines. Three pre-mortem reports yield three monitoring metrics and usually one genuine plan change — which is a better return than most planning meetings. The most-likely/most-expensive/most-embarrassing split exists to break the model's habit of writing the same failure three ways.
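
Harvesting the deliverable can be scripted. The sketch below pulls the EARLY SIGNAL line from each report and rejects a batch in which two reports state the same root cause. It assumes the model kept the "- LABEL  text" layout of the report format, and exact string equality is only a crude proxy for the no-overlap rule; both are assumptions for illustration.

```python
# Pull the EARLY SIGNAL lines out of pre-mortem reports and check that the
# ROOT CAUSE lines differ. Assumes each report keeps the "- LABEL  text"
# layout of the prompt's REPORT FORMAT; real output may need a looser parser.
import re

def field(report: str, label: str) -> str:
    match = re.search(rf"^-\s*{re.escape(label)}\s+(.+)$", report, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"report is missing {label}")
    return match.group(1).strip()

def early_signals(reports: list[str]) -> list[str]:
    causes = [field(r, "ROOT CAUSE").lower() for r in reports]
    if len(set(causes)) != len(causes):
        raise ValueError("two reports share a root cause; re-run with the no-overlap rule")
    return [field(r, "EARLY SIGNAL") for r in reports]

# Toy reports, invented for the example.
reports = [
    "- TIMELINE  cache stampede at launch\n- ROOT CAUSE  assumed warm cache\n"
    "- EARLY SIGNAL  cold-start hit rate\n- THE FIX  pre-warm before cutover",
    "- TIMELINE  bill tripled in month 2\n- ROOT CAUSE  assumed flat traffic\n"
    "- EARLY SIGNAL  cost per request\n- THE FIX  add a spend alarm",
]
print(early_signals(reports))
```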

6.6 Council of judges

A single LLM judge is noisy: rerun it and the verdict flips more often than anyone likes to admit (Ch 07 measures this). The fix is the classical one: ask several and take the majority. If \(n\) judges vote independently and each is right with probability \(p > \tfrac12\), majority error doesn't just shrink, it collapses:

EQ P6.2 · CONDORCET JURY THEOREM (ODD n, IID JUDGES)
$$ P_{\mathrm{maj}}(n,p) \;=\; \sum_{k=\frac{n+1}{2}}^{n} \binom{n}{k}\, p^{k}\,(1-p)^{\,n-k} \;\;\xrightarrow[\;n\,\to\,\infty\;]{}\;\; \begin{cases} 1 & p > \tfrac12 \\[2pt] 0 & p < \tfrac12 \end{cases} $$
Five judges at \(p = 0.72\) give a majority that is right 86% of the time; nine give 92%. The theorem cuts both ways: a council of below-chance judges converges confidently on the wrong answer. And the whole guarantee rests on the word independently, which is exactly what N samples from one model in one context are not.

EQ P6.3 · THE CORRELATION FLOOR
$$ \operatorname{Var}(\bar s_N) \;=\; \frac{(1-\rho)\,\sigma^2}{N} \;+\; \rho\,\sigma^2 \;\;\xrightarrow[\;N\,\to\,\infty\;]{}\;\; \rho\,\sigma^2 $$
For \(N\) judge scores with pairwise correlation \(\rho\), only the uncorrelated part of the noise averages away. Judges sharing a base model, a prompt phrasing, or a context window share blind spots: \(\rho \gg 0\), and judges 4 through 9 buy almost nothing. Engineering independence (distinct lenses, separate contexts, different phrasings, ideally different models) is worth more than raising N.

Four judge scores have pairwise correlation \(\rho = 0.2\) and per-judge variance \(\sigma^2 = 1\). Using EQ P6.3, what is the variance of their mean \(\bar s_N\) at \(N = 4\)?
\(\operatorname{Var}(\bar s_N) = \dfrac{(1-\rho)\sigma^2}{N} + \rho\sigma^2 = \dfrac{(0.8)(1)}{4} + (0.2)(1) = 0.2 + 0.2 =\) 0.4. Half the variance is the irreducible correlation floor \(\rho\sigma^2 = 0.2\); raising \(N\) shrinks only the other half, which is why decorrelating judges beats adding them.
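
EQ P6.3 can be turned into an "effective council size": setting \(\sigma^2 / N_{\mathrm{eff}}\) equal to the right-hand side gives \(N_{\mathrm{eff}} = N / (1 + (N-1)\rho)\), which saturates at \(1/\rho\). The sketch below tabulates both quantities; the function names are illustrative, and the formula assumes every pair of judges shares the same \(\rho\).

```python
# Variance of the mean of N equally correlated judge scores (EQ P6.3)
# and the number of independent judges that would give the same variance.
# Assumes one shared pairwise correlation rho for every pair of judges.

def var_of_mean(n: int, rho: float, sigma2: float = 1.0) -> float:
    return (1 - rho) * sigma2 / n + rho * sigma2

def effective_judges(n: int, rho: float) -> float:
    # Solve sigma2 / n_eff = var_of_mean(n, rho, sigma2) for n_eff.
    return n / (1 + (n - 1) * rho)

for n in (1, 3, 4, 5, 9, 101):
    print(f"N={n:3d}  var={var_of_mean(n, 0.2):.3f}  n_eff={effective_judges(n, 0.2):.2f}")
```

At \(\rho = 0.2\) no council, however large, is worth more than five independent judges, and four correlated judges are already worth only 2.5.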

EQ P6.2 is easy to state and easy to disbelieve, so run it. The cell below builds an IID council — \(n\) judges each correct with probability \(a = 0.72\), voting independently — and reports majority accuracy from both a seeded simulation and the exact binomial sum, side by side, for every council size 1 through 9.

PYTHON · RUNNABLE IN-BROWSER
# council of judges: majority accuracy vs council size (Condorcet, EQ P6.2)
import numpy as np
from math import comb
rng = np.random.default_rng(0)
a = 0.72                                   # each judge correct w.p. a, independently
sizes = range(1, 10)
trials = 20000
print(f"single-judge accuracy a = {a}")
print(" n   simulated   exact")
exact_pts = []
for n in sizes:
    votes = rng.random((trials, n)) < a               # True = judge votes correctly
    sim = (votes.sum(1) > n / 2).mean()               # strict majority correct
    exact = sum(comb(n, k) * a**k * (1 - a)**(n - k)  # EQ P6.2, summed tail
                for k in range(n // 2 + 1, n + 1))
    exact_pts.append(exact)
    print(f"{n:2d}   {sim:8.3f}   {exact:8.3f}")
if "plot_xy" in globals():                 # helper supplied by the in-browser runtime
    plot_xy(list(sizes), exact_pts)

Three readings off the printout. The odd sizes climb steadily, from 72% at n=1 to ~92% at n=9, which is the variance reduction EQ P6.3 promises when \(\rho = 0\). The even sizes dip below the odd ones (a strict majority demands a real lead, and ties count as losses), which is the arithmetic behind "use odd N." And flip \(a\) to 0.45: every added judge now makes the council more confidently wrong, the theorem's cruel symmetry. This is the ceiling a real council never reaches, because its judges are correlated; Instrument P6.2 below shows the gap.
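
To see how quickly correlation eats the IID ceiling, here is a minimal sketch under one stated toy assumption: with probability \(c\) a case hits a shared blind spot and every judge casts the same vote (right with probability \(a\)); otherwise the judges vote independently. Under this mixture each judge is still right with probability \(a\), and the pairwise correlation of their correctness works out to exactly \(c\). The model and the function names are illustrative, not a description of any real judge ensemble.

```python
# Toy model of a correlated council. With probability c a case hits a
# shared blind spot and every judge casts the same vote (right w.p. a);
# otherwise the judges vote independently. This mixture is an illustrative
# assumption, not a model of any particular judge ensemble.
from math import comb

def iid_majority(n: int, a: float) -> float:
    """EQ P6.2: probability a strict majority of n IID judges is correct."""
    return sum(comb(n, k) * a**k * (1 - a)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def correlated_majority(n: int, a: float, c: float) -> float:
    # Shared-vote cases are right w.p. a no matter how many judges vote.
    return c * a + (1 - c) * iid_majority(n, a)

a = 0.72
print(" n   iid     c=0.3   c=0.6")
for n in (1, 3, 5, 7, 9):
    print(f"{n:2d}   {iid_majority(n, a):.3f}   "
          f"{correlated_majority(n, a, 0.3):.3f}   {correlated_majority(n, a, 0.6):.3f}")
```

As \(n\) grows the correlated council approaches \(c\,a + (1 - c)\) rather than 1: the shared-blind-spot cases are a floor on error that no number of judges removes.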

A council of \(5\) independent judges each votes correctly with probability \(p = 0.72\). Using the Condorcet sum (EQ P6.2), what is the probability the strict majority is correct?
\(P_{\mathrm{maj}} = \binom{5}{3}p^3 q^2 + \binom{5}{4}p^4 q + \binom{5}{5}p^5\) with \(q = 0.28\): \(10(0.3732)(0.0784) + 5(0.2687)(0.28) + 0.1935 = 0.2926 + 0.3762 + 0.1935 \approx\) 0.86. Five judges turn a 72% single verdict into an 86% council verdict — the variance reduction the IID theorem promises.

Hence the production shape: N judges, each with a distinct lens, in separate contexts, plus an aggregator that sees only verdicts and evidence — never the artifact author's chain of thought. The lenses do double duty: they decorrelate the judges and partition the review surface, so five judges cover five failure classes instead of quintuple-checking grammar. This is also a familiar object wearing new clothes: GRPO scores each sampled response against its group's mean reward (Vol II · EQ 5.6) — the group is the baseline that makes a single noisy score meaningful. A council is the same variance-reduction move executed at inference time, with verdicts instead of rewards; self-consistency (Ch 04) is its degenerate cousin where every "judge" is the same persona resampled at temperature.

# COUNCIL — each judge is a separate call; aggregator sees verdicts only
JUDGE i of N   # one lens per judge — assign, don't let them choose
You are one of N independent reviewers. You see only the artifact
and your lens. Verdict first, then at most 3 sentences of evidence,
each anchored to quoted text.

LENS 1  factual accuracy — verify every checkable claim
LENS 2  internal consistency — do any two sections contradict?
LENS 3  completeness against the spec — what is missing?
LENS 4  security and abuse — how does this get misused?
LENS 5  the intended reader — what will they misunderstand first?

VERDICT: ACCEPT | REVISE | REJECT

AGGREGATOR   # gets the N verdicts + evidence, not the artifact's history
Tally the verdicts. Quote each judge's strongest piece of evidence.
Where judges disagree, name the disagreement — splits are signal,
not noise to average away. Output: final verdict + minimal revision
list, ordered by how many judges' objections each item resolves.

INSTRUMENT P6.2 · COUNCIL SIMULATOR · 5 PERSONAS · EXACT POISSON-BINOMIAL MAJORITY
CLAIM UNDER REVIEW: “Moving the session store to Redis will eliminate our p99 latency spikes, because the spikes are caused by row-lock contention in Postgres.”
[Interactive instrument; readouts: VERDICTS (border mint = correct, red = wrong), MAJORITY VERDICT, VOTE SPLIT, P(MAJORITY CORRECT) exact, MEAN SINGLE JUDGE.]
Five toy personas vote on one claim: STRICT (biased to reject), LENIENT (biased to accept), EXPERT (+12 pts accuracy), RANDOM (coin flip), CONTRARIAN (worse than chance). RESAMPLE re-rolls the vote; the curve is the exact majority error rate versus council size — mint for this persona mix, blue dashed for the ideal IID council of EQ P6.2. Watch three lessons: the persona mix never reaches the IID curve (correlation and bad judges are a tax); flipping GROUND TRUTH swaps which biased judge helps you — bias is only "strictness" until the truth changes; and even council sizes kink the curve (ties resolved by coin flip). The personas are toys; the majority math is exact.

Practicalities: use odd N; 3–5 judges capture most of the gain (the binomial tail flattens fast); keep the aggregator's job mechanical — tally, quote, surface splits. An aggregator allowed to "weigh holistically" becomes a sixth judge with veto power, and your variance reduction evaporates.
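
A mechanical aggregator needs no model call at all. The sketch below tallies verdicts, requires a strict majority, and surfaces dissent instead of averaging it away. The `(lens, verdict, evidence)` tuple shape is an assumed parse of each judge's reply, and the votes are invented toy data.

```python
# A mechanical aggregator: tally, require a strict majority, surface splits.
# The (lens, verdict, evidence) tuple shape is an assumption about how you
# parse each judge's reply.
from collections import Counter

def aggregate(votes: list[tuple[str, str, str]]) -> dict:
    tally = Counter(verdict for _, verdict, _ in votes)
    top, top_count = tally.most_common(1)[0]
    has_majority = top_count > len(votes) / 2
    return {
        "verdict": top if has_majority else "NO MAJORITY",
        "tally": dict(tally),
        "split": len(tally) > 1,      # disagreement is signal: report it
        "dissent": [(lens, v, ev) for lens, v, ev in votes if v != top],
    }

# Toy votes, invented for the example.
votes = [
    ("factual accuracy",     "REVISE", '"eliminate" overstates the evidence'),
    ("internal consistency", "ACCEPT", "no contradictions found"),
    ("completeness",         "REVISE", "no rollback plan"),
    ("security and abuse",   "REVISE", "session keys lack a TTL"),
    ("intended reader",      "ACCEPT", "clear to an on-call engineer"),
]
report = aggregate(votes)
print(report["verdict"], report["tally"])
```

Keeping this step in plain code is one way to honor the "mechanical" rule: there is nothing here that can weigh holistically. A model-based aggregator is still useful for writing the revision list, as long as the verdict itself comes from the tally.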

6.7 Debate & the devil's advocate

Councils judge in parallel silence. Debate makes the strongest case for and against collide, on the theory — proposed for AI oversight by Irving, Christiano and Amodei (2018) — that refuting a lie is easier than detecting one: a judge too weak to evaluate a claim directly can still tell which side's evidence survived rebuttal. The empirical record is genuinely encouraging on factual tasks: in the Khan et al. (2024) reading-comprehension setup, non-expert judges (who couldn't see the source text) gained double-digit accuracy from watching expert models debate, and stronger debaters helped judges more. The honest asterisk: debate rewards persuasion, persuasion and truth are correlated but not identical, and on open-ended questions a silver-tongued wrong answer can win rounds.

DEBATE   # advocates argue in separate contexts; judge sees transcript only
ROUND 1 — ADVOCATE A: strongest honest case FOR the answer below.
ROUND 1 — ADVOCATE B: strongest honest case AGAINST it.
          Cite evidence; invented evidence forfeits the round.
ROUND 2 — each advocate rebuts the other's specific points.
          Quote what you rebut. Unanswered points stand.

JUDGE   # fresh context — has not seen the original question's solution
You see only the transcript. Score which side's EVIDENCE survived
rebuttal — not which side is better written. List the surviving
points per side. Verdict + the single argument that decided it.

The budget version is the devil's advocate: a single extra instruction at the end of a generation — "before finalizing: state the strongest case that this answer is wrong; if it changes your answer, change it; if not, say in one line why it fails." It runs inside the author's own context, so it inherits the author's blind spots (§6.8) and is the weakest pattern in this chapter — but it costs ~50 tokens, catches the embarrassing class of error, and is the one technique here cheap enough to leave on by default.
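
The full debate reduces to five calls with carefully limited inputs. In the sketch below `call_model` is a hypothetical prompt-to-completion function with no shared history, and giving each advocate only the opponent's round-1 points in round 2 is a simplification chosen for brevity, not something the template prescribes.

```python
# Sketch of the two-round debate. `call_model` is a hypothetical
# prompt -> completion function; each call is a separate context.
from typing import Callable

def debate(answer: str, call_model: Callable[[str], str]) -> str:
    case = f"ANSWER UNDER DEBATE\n{answer}\n\n"
    # Round 1: each advocate argues without seeing the other.
    a1 = call_model(case + "ADVOCATE A: strongest honest case FOR. Cite evidence.")
    b1 = call_model(case + "ADVOCATE B: strongest honest case AGAINST. Cite evidence.")
    # Round 2: each advocate rebuts the opponent's round-1 points
    # (simplification: it is not shown its own round-1 text again).
    a2 = call_model(case + f"Rebut these points. Quote what you rebut.\n{b1}")
    b2 = call_model(case + f"Rebut these points. Quote what you rebut.\n{a1}")
    transcript = (f"A, ROUND 1\n{a1}\n\nB, ROUND 1\n{b1}\n\n"
                  f"A, ROUND 2\n{a2}\n\nB, ROUND 2\n{b2}")
    # Judge: fresh context, transcript only.
    return call_model(
        "You see only the transcript. Score which side's EVIDENCE survived "
        "rebuttal, not which side is better written. List the surviving "
        "points per side. Verdict + the single argument that decided it.\n\n"
        + transcript
    )

# Stub model so the sketch runs offline; it records every prompt it receives.
calls = []
def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"<reply {len(calls)}>"

print(debate("The spikes come from row-lock contention.", stub_model))
```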

6.8 Honest costs

Every pattern in this chapter multiplies tokens, latency, or both. The honest ledger:

| Pattern | LLM calls | Token cost | Latency shape | Reach for it when |
|---|---|---|---|---|
| Critique → revise | 3 | ≈ 3× | 3 serial turns | one artifact, quality floor matters |
| Reflexion loop | 2k + 1 | 5–10× | serial × attempts | an external pass/fail signal exists |
| Red team | 2–3 | 2–3× | short serial | anything public-facing or load-bearing |
| Pre-mortem | 1–3 | ≈ 2× | planning time | before resources are committed |
| Council of N | N + 1 | ≈ (N+1)× | ~2 turns (judges run parallel) | high-stakes verdicts, noisy single judge |
| Debate | 5–7 | 4–6× | 3 serial rounds | contested claims, judge weaker than task |

The subtler cost is sycophantic self-review. Models exhibit self-preference: asked to grade outputs, they systematically favor their own — and recent work suggests part of the mechanism is self-recognition (Panickssery et al., 2024). Worse, a critic running in the author's context inherits the author's framing, retrieves the author's justifications, and converges on "looks good with minor nits." This is why every template above repeats the same clause: the judge is a fresh context. The author's chain of thought is contamination. Same model with a clean context and an adversarial persona is good; a different model entirely is better; a different model with a rubric and quote-to-convict rules is the strongest cheap judge you can build.

And self-correction has a documented failure mode: without an external signal, asking a model to reconsider a correct answer frequently talks it out of that answer — measured as net-negative "intrinsic self-correction" on reasoning benchmarks (Huang et al., 2024). The lesson is not "never critique"; it's that critique needs either an oracle (tests, validators, retrieval) or an asymmetric frame (rubric, attack taxonomy) — a bare "are you sure?" is an invitation to dither, and models accept it.

When single-shot is fine: low stakes, latency-bound UX, taste-driven tasks with no articulable rubric, and anywhere \(\Delta_{\mathrm{GV}} \approx 0\). One more honest note: reasoning models trained with RLVR (Vol II · Ch 05) already run a private draft-check-backtrack loop inside their chain of thought — for them, external critique buys less than it did in 2023, and a council of reasoning models is often compute better spent as one longer reasoning budget. Measure, don't assume — which is precisely the next chapter.

NEXT

You have been judging by hand; now make it reproducible. Chapter 07: prompt evals, the biases of LLM judges (position, length, self-preference — measured), versioning prompts like code, and the Prompt Lab — every technique in this volume, run live against a real model with your own key.

§

Further reading

  • Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. — Introduces GSM8K and shows a verifier-ranked small generator beating a 30× larger one: the generator–verifier gap, measured.
  • Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. — The canonical reflect-on-failure loop with episodic memory; the source of §6.3's distilled-lesson design.
  • Irving, G., Christiano, P., & Amodei, D. (2018). AI Safety via Debate. — The original argument that judging adversarial debate is easier than direct evaluation, the foundation of §6.7.
  • Khan, A., et al. (2024). Debating with More Persuasive LLMs Leads to More Truthful Answers. — Empirical evidence that weaker judges gain accuracy from watching stronger models debate — and the persuasion caveat.
  • Huang, J., et al. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. — Documents net-negative intrinsic self-correction without an external signal; the honest counterweight to this whole chapter.
  • Panickssery, A., Bowman, S. R., & Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. — Links the self-preference bias to self-recognition, the mechanism behind §6.8's "fresh context" rule.
  • Wang, X., et al. (2023). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. — Majority vote over sampled reasoning paths: the degenerate, single-persona cousin of the council in §6.6.