Why structure beats vibes
Chapter 01 established the mechanics: a prompt is not an incantation, it is conditioning evidence. The model computes \(p_\theta(y \mid x)\) — a probability distribution over continuations given everything in the context window — and your prompt is the entire \(x\). Whatever you leave out, the model fills in from its priors, which means it fills in the statistically average audience, purpose, length, and tone. Average is exactly what a vague prompt gets back.
The scaffold turns that observation into a checklist. Five parts, each answering a question the model cannot answer for you:
| Part | Question it answers | Failure mode when missing |
|---|---|---|
| ROLE | who is producing this? | Default-assistant register: competent, generic, mid-formal |
| TASK | what single deliverable? | The model guesses the verb — and hedges across several |
| CONTEXT | what can't the model know? | Answers calibrated to a reader and situation that don't exist |
| FORMAT | what shape is the output? | Unpredictable length and structure; re-prompting roulette |
| CONSTRAINTS | where are the edges? | Boundary violations and 50/50 splits on every trade-off |
Two honest clarifications before the parts. First, the labels are not magic syntax. Instruction-tuned models parse ROLE: headers, markdown sections, and XML tags about equally well, because their post-training data is full of all three; segmentation helps the model carve the prompt into spans, but the dominant effect of the scaffold is on you — a checklist is hard to half-fill without noticing. Second, the scaffold is a checklist, not a form: a part that adds no information should be omitted, not padded. "You are a helpful assistant" is a row of zeros.
The Gym's prompt katas grade your prompt against exactly these five anchors — role, context, format, constraints, examples — by regex, before any model is called. That grader is small enough to fit in a cell: paste a weak prompt and a strong one, and watch the same PASS/MISS table the kata prints. It catches structure, not truth — but a row of MISS is a row of conditioning you forgot to write.
import re
# experiment: the five-anchor linter — regex-grade a weak vs a strong prompt
ANCHORS = {
"role": r"you are|as an?|acting as|role:",
"context": r"context:|using only|attached|below|the document|reader:|audience:",
"format": r"format:|table|bullet|json|\d+ (?:bullets|words|sentences|paragraphs)|template",
"constraints": r"constraints?:|do not|only|never|if .* (?:conflict|missing|unsure)|not specified",
"examples": r"example|e\.g\.|for instance|like this|input ->|input:.*output:",
}
def lint(name, prompt):
print(f"\n{name}")
score = 0
for anchor, pat in ANCHORS.items():
hit = re.search(pat, prompt, re.IGNORECASE) is not None
score += hit
print(f" {anchor:<12} {'PASS' if hit else 'MISS'}")
print(f" SCORE {score}/5")
return score
weak = "write an email to the client telling them the project is delayed"
strong = ("ROLE: You are a senior account manager. TASK: write an email requesting a "
"two-week extension. CONTEXT: reader is Dana, VP Ops; their API arrived late. "
"FORMAT: under 150 words, subject line + 3 paragraphs. CONSTRAINTS: do not blame "
"their team; if apology and confidence conflict, choose confidence.")
for name, p in [("WEAK PROMPT", weak), ("STRONG PROMPT", strong)]:
lint(name, p)
The weak prompt scores 0/5 — it carries a task verb but trips none of the anchor patterns; the strong one scores 4/5, missing only EXAMPLES (which §2.7 shows is often the right anchor to omit). The linter is deliberately dumb — it counts the presence of conditioning, not its correctness, exactly the limitation EQ P2.1 warns about. A high score means you left the model less to guess; it does not mean you aimed the funnel at the truth. That second job is yours.
"You are a staff engineer. Review the diff for security bugs. Output numbered findings, most severe first. Do not comment on style."
ROLE — conditioning the author
A pre-trained model is a superposition of every author it has read. A useful fiction for what a role line does: it shifts the model's posterior over which latent author is writing.
When roles genuinely help. A role pays rent when it changes what the words mean. "Review this contract as opposing counsel" loads an adversarial reading no instruction list fully captures. "Explain this as a pediatrician talking to a worried parent" sets vocabulary, sentence length, and what to leave out — three dials that would take a paragraph to set explicitly. The best roles smuggle in evaluation criteria: "a staff engineer reviewing for OWASP Top 10" is really a compressed checklist wearing a costume.
When roles are cargo cult. On factual and reasoning accuracy, the evidence is bleak: the largest controlled study to date (162 personas across multiple model families on thousands of factual questions, 2023) found no reliable accuracy gain from adding personas — and occasional unpredictable harm. Stacking flattery ("world-class, award-winning, 30 years of experience") adds little a modern instruction-tuned model doesn't already default to, because post-training has already collapsed \(p_\theta(z \mid x)\) onto a competent-expert prior. The folklore add-ons — tips, threats, emotional appeals — show idiosyncratic, model-specific effects that fail to transfer: automatic prompt search has surfaced absurd winners (including Star Trek framings) that beat polite hand-tuning on one model and evaporate on the next. Treat any role text that would survive a find-and-replace of your task with someone else's as decoration.
Rule of thumb: give the role a function, not a costume. "You are X" earns its tokens when X implies a vocabulary, a set of conventions, an audience posture, or a checklist — and is dead weight when it merely implies "be good at this."
TASK — one verb, one deliverable
The task line is the spine of the prompt: one verb, one deliverable, stated in the first sentence the model will treat as an instruction. The verb sets the depth of transformation — proofread < edit < rewrite; list < summarize < analyze < recommend — and models take it more literally than humans do. Ask for "thoughts on this draft" and you get thoughts; ask for "a unified diff that fixes the three weakest paragraphs" and you get a diff. The deliverable should be a noun with a unit: a 5-bullet summary, a SQL query, a subject line plus three paragraphs — never "some ideas."
The case against compound tasks is arithmetic, not taste:
Decompose or die. "Summarize this report, critique its methodology, and rewrite the executive summary" is three tasks sharing one context window and one attention budget. Run them as three prompts — each verified before its output feeds the next — and you convert one 53% coin flip into three 90% checks with an inspection gate between each. The chain costs more tokens; it buys you the ability to catch a bad summary before it poisons the critique. Single-prompt compounds are defensible only when the subtasks genuinely interlock (the critique must reference the summary's framing) or when latency dominates everything else.
The hidden compound. "Improve this email" looks like one task but is secretly five — fix grammar, sharpen the ask, adjust tone, cut length, restructure — and the model picks its own subset, usually not yours. If you can't say which verb you mean, the model can't either.
CONTEXT — what the model cannot know
Context is the part of the scaffold with the highest information content per token, because it is the only part the model cannot possibly reconstruct. It can guess a plausible format; it cannot guess that your reader is a skeptical CFO, that this is the second extension request on this project, or that the previous draft was rejected for sounding defensive. The standing checklist:
- Audience — who reads the output, what they know, what they decide with it.
- Purpose — the decision or action downstream of the output. "This summary decides whether we renew the vendor" reshapes every sentence.
- Situation — the facts of the case: history, stakes, deadlines, what has already happened.
- Prior attempts — what was tried, and why it was rejected. This is the single highest-leverage item: without it the model will cheerfully regenerate the draft you already hate.
- House truths — internal names, conventions, definitions of success the public internet has never seen.
Context beats cleverness. A plain prompt carrying the five facts that change the answer outperforms an artfully worded prompt carrying none. This is the empirical center of gravity of the whole volume: across the prompting literature, gains from supplying missing information dwarf gains from rephrasing existing information. When a prompt underperforms, the productive question is almost never "how do I word this better?" — it is "what do I know that the model doesn't?"
Relevance, not bulk. Context is not a dumping ground. Chapter 01's mechanics still apply: every token competes for attention, and burying the one decisive fact under ten incidental ones dilutes it. Select the facts that would change a competent human's answer; paste those; stop.
FORMAT — show the shape
Models are pattern completers before they are instruction followers. The most reliable way to get a shape is to show the shape — a skeleton the model fills rather than a description it interprets:
# Instead of "structure your answer clearly":
FORMAT — fill this template exactly:
VERDICT: one sentence
EVIDENCE: 3 bullets, each citing a line number
RISKS: 2 bullets, worst first
CONFIDENCE: high / medium / low + one-line reason
Three rules of format engineering:
- Bound length in structural units, not words. "3 bullets, ≤ 15 words each" is enforceable; "around 200 words" is a suggestion the model tracks only loosely. Tokenization makes exact word counts genuinely hard for models — they don't see words, they see tokens (Vol III · Ch 01). Frontier models have improved, but bounds in bullets, sentences, and paragraphs remain the reliable currency.
- Field order is computation order. Generation is left-to-right, so the template's sequence decides what the model commits to first. "Evidence, then verdict" forces the evidence to exist before the conclusion that must follow from it; "verdict, then evidence" invites a snap judgment followed by motivated reasoning. Format is a cheap lever on the reasoning itself.
- If a machine parses the output, don't rely on the prompt. Prompt-requested JSON holds up in the bulk of cases and fails at the tails — a stray apology before the brace, a trailing comment after it. When format is load-bearing, use the provider's structured-output / constrained-decoding features, which make invalid output impossible rather than unlikely. The prompt states the schema's meaning; the decoder enforces its syntax.
CONSTRAINTS — boundaries, exclusions, tie-breakers
Constraints are the scaffold part that prevents the failure you didn't think to forbid. Three species:
- Boundaries — scope fences: "only modify the function below," "use only facts from the attached document."
- Exclusions — outputs that must not appear: topics, claims, styles, files.
- Tie-breakers — explicit precedence for the conflicts your goals will inevitably have: "if brevity and completeness conflict, cut the least decision-relevant material." Without a tie-breaker, the model splits the difference — and a 50/50 hedge between two goals usually serves neither.
Positive phrasing beats negative phrasing — with caveats. Two mechanisms argue for stating what to do rather than only what to avoid. First, negation is a documented weak spot: benchmark families built around negated questions have shown models — strikingly, sometimes larger models — agreeing with the negated form of statements they correctly reject in positive form. Second, salience: every token in a prohibition becomes attendable context (Ch 01), so "do not mention the outage" places outage squarely in the model's working set — the pink-elephant problem, mechanized. The honest caveats: the worst negation failures were measured on older and base models, and modern instruction-tuned models follow plain prohibitions reasonably well in short contexts. The brittleness returns under pressure — long contexts, many competing instructions, conversational drift. Best practice is therefore not "never say don't" but pair every don't with a do: "do not blame their team; attribute the delay to the access timeline, neutrally" gives the model a road, not just a wall.
The constraint budget. EQ P2.3 applies to constraints too: each one is another requirement in the product, another claim on attention. Ten constraints comply worse than four. Prune to the ones whose violation you'd actually reject the output for — and promote the rest to a review checklist on your side of the table.
The refusal rule — the constraint that kills hallucinations
One constraint earns its place in almost every extraction, summarization, and analysis prompt, and it is the one most people omit: license the model to refuse. Ground the answer in named sources — "use only the policy excerpt below," "answer only from DOC A and DOC B" — and then add the clause the model will not supply on its own: if a fact is not in the source, write "not specified" — never invent. This is the difference between a prompt the model can satisfy honestly and one whose only satisfiable completion is fiction.
The mechanism is worth stating precisely, because it explains why this clause is so much more than politeness. A bare instruction — "extract the contract value" — is an open generation task: the model's job is to produce the most probable continuation, and when the value is absent from the source, the most probable continuation is still a plausible-looking number, because that is what contracts contain. The refusal license converts open generation into constrained extraction: it adds "not specified" to the model's set of acceptable answers, so abstention stops competing with fabrication and starts winning. The model will not abstain unless you license it to — left to its priors, refusing looks like failing the task. You have to make refusal a legal move.
Citation discipline is the second half of the same constraint. Grounding the model is worthless if you cannot check it, so demand traceability: every claim carries the source span it came from. The operating rule for anything load-bearing — numbers without sources are not numbers. They are confident guesses wearing a number's clothes, and they fail silently, which is the worst way to fail. Two cheap habits make the discipline enforceable rather than aspirational. First, sample-check: when the model returns 50 extracted fields, you do not verify all 50 — you verify 5 chosen at random against the source. A single fabricated citation in the sample condemns the whole batch and sends it back. Second, demand an ASSUMPTION: prefix on anything the model inferred rather than read — "ASSUMPTION: contract assumed annual, term not stated." A flagged assumption is a reviewable claim; an unflagged inference is a landmine.
# Copy-ready hardening clauses — append to the CONSTRAINTS block
SOURCE — Answer using ONLY the document(s) provided below.
Treat your own prior knowledge as out of scope for this task.
REFUSAL — If a requested fact is not present in the source,
write exactly "not specified". Do NOT infer, estimate, or invent.
CITATION — After every factual claim, cite the source span it
came from (section, line, or quoted phrase). Numbers without a
source citation are not permitted.
ASSUMPTION — If you must infer to proceed, prefix the line with
"ASSUMPTION:" and state what you assumed and why.
VERIFICATION — Output is sample-checked: 5 of every batch are
verified against the source. One fabricated citation fails the batch.
The six-figure citation that never existed. In regulated-industry field practice the recurring incident is identical across firms: an analyst pastes a contract or a filing into a model, asks for a summary table of key terms, and ships the result into a memo that drives a real decision — a renewal, a reserve, a disclosure. Buried in the table is a clean, specific, entirely fabricated figure: a liability cap, a penalty rate, an effective date the source never stated. The number looks exactly like the real ones around it, which is precisely why it survives review, and the cost of unwinding it once it has moved a decision runs comfortably into six figures before anyone traces it back to a prompt that never licensed the model to say "not specified." The fix costs one clause. The omission costs a remediation project.
Here is the rule in numbers. Take an extraction task over 50 fields where 8 are genuinely absent from the source. Without a refusal license, every absent field is an invitation to fabricate; with the "not specified" clause, the model abstains on most of them. The cell below simulates both regimes and prints the two fabrication rates.
import numpy as np
# experiment: the refusal rule in numbers — fabrication with vs without a license
rng = np.random.default_rng(0)
N, ABSENT = 50, 8 # 50 fields; 8 are genuinely not in the source
present = np.ones(N, dtype=bool)
present[rng.choice(N, ABSENT, replace=False)] = False
# WITHOUT a refusal license: open generation. On an absent field the model
# almost always emits a plausible-looking value (fabricates).
p_fab_unlicensed = 0.88
# WITH "not specified": absent fields are mostly abstained; rare leakage.
p_abstain_licensed = 0.85
absent_idx = np.where(~present)[0]
fab_no_license = rng.random(len(absent_idx)) < p_fab_unlicensed
fab_licensed = rng.random(len(absent_idx)) >= p_abstain_licensed # leaked = fabricated
rate_no = fab_no_license.mean()
rate_yes = fab_licensed.mean()
print(f"absent fields : {ABSENT} of {N}")
print(f"fabrication rate NO LICENSE : {rate_no:.0%} ({fab_no_license.sum()}/{ABSENT} invented)")
print(f"fabrication rate 'not spec.' : {rate_yes:.0%} ({fab_licensed.sum()}/{ABSENT} invented)")
print(f"reduction : {(rate_no - rate_yes) / rate_no:.0%} fewer fabrications")
With no refusal license the model invents on roughly 7 of the 8 absent fields; with the single "not specified" clause it leaks on at most one or two. The clause does not make the model smarter — it makes abstention an allowed answer, and that one change does most of the work. The residual leakage is the reason the citation discipline above is not optional: the constraint cuts the fabrication rate, sample-checking catches what slips through.
Assembled: three case studies
The same five moves, three domains. Each AFTER block is annotated with what the part buys; none of them is clever, all of them are specific.
A — The deadline email
# BEFORE — the prompt most people actually send
write an email to the client telling them the project is delayed
ROLE — You are a senior account manager at a small consultancy.
← register: accountable, first-person-plural, no groveling
TASK — Write an email to our client requesting a two-week extension
on the dashboard delivery.
← one verb, one deliverable, one concrete ask with a number
CONTEXT — Reader: Dana, VP Ops — direct, values plans over apologies.
Cause: their API access arrived 11 days late. Relationship: 3 years,
good. This is the second extension on this project.
← four facts the model cannot guess; each reshapes a sentence
FORMAT — Under 150 words. Subject line + 3 short paragraphs:
situation, revised plan with a date, one question to confirm.
← structural length bound; paragraph order is argument order
CONSTRAINTS — Do not blame their team; attribute the delay to the
access timeline, neutrally. No discounts or scope cuts offered.
If apology and confidence conflict, choose confidence.
← exclusion + paired do/don't + an explicit tie-breaker
The BEFORE version produces a serviceable, slightly servile email of unpredictable length that apologizes too much and asks for nothing specific. The scaffold's biggest single contributor here is CONTEXT — "second extension" and "values plans over apologies" change the email's entire posture — followed by the tie-breaker, which kills the apologetic hedge outright.
B — The code review
# BEFORE
review this code
ROLE — You are a staff engineer reviewing a Python PR for a
payments service.
← "payments" loads a threat model; the role carries a checklist
TASK — Review the diff below for correctness and security issues only.
← one verb, scoped: this is a defect hunt, not a style debate
CONTEXT — The function moves money between accounts and runs inside
a Postgres transaction. Style is already enforced by linters.
Author is mid-level, second week on the team.
← stakes, environment, and the audience for the feedback's tone
FORMAT — Numbered findings, most severe first. For each: line
reference, the risk, a concrete fix. End with a verdict:
APPROVE / REQUEST CHANGES.
← evidence before verdict — field order is computation order
CONSTRAINTS — Skip style and naming; linters own those. Flag anything
that could double-charge as BLOCKER. If unsure an issue is real,
say so rather than inventing one.
← boundary + severity rule + an honesty valve against confabulated findings
BEFORE yields a polite tour of the code: a style nit, a docstring suggestion, a vague "consider adding error handling." The CONSTRAINTS block does the heavy lifting — banning style commentary redirects the entire attention budget to defects, and the honesty valve measurably cuts invented issues, the chronic failure of review prompts.
C — The churn analysis
# BEFORE
analyze this churn data
ROLE — You are a product analyst preparing evidence for a
pricing decision.
← the decision downstream selects which findings matter
TASK — Identify the three strongest correlates of churn in the
table below.
← bounded deliverable: three, ranked — not "insights"
CONTEXT — B2B SaaS, monthly plans. Columns: plan, seats,
tenure_months, support_tickets, churned. Leadership suspects the
Starter plan — check that hypothesis explicitly. Deadline Friday.
← schema + the live hypothesis the analysis must confirm or kill
FORMAT — Per correlate: one sentence of evidence with numbers,
one caveat. Then a 3-sentence summary a non-analyst can read.
← forces quantified claims and pre-allocates space for caveats
CONSTRAINTS — Correlation only — do not claim causation. If the data
cannot answer something, state what is missing instead of guessing.
← the two clauses that separate analysis from confident fiction
BEFORE produces a wandering narrative of every column, with causal language sprinkled freely ("churn is driven by..."). The scaffold's stars here are TASK — "three strongest correlates" converts an open prompt into a ranking problem — and the anti-causation constraint, which is the difference between a memo you can forward and one you have to retract.
The scaffold tells the model what you want; sometimes telling hits a ceiling. Chapter 03: few-shot prompting — when two good examples beat two hundred words of instruction, how models infer the rule from the cases, and the example-selection traps that quietly teach the wrong lesson.
Further reading
- Brown, T. et al. (2020). Language Models are Few-Shot Learners. — establishes that the prompt itself is the conditioning interface, the premise the whole scaffold rests on.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). — explains why instruction-tuned models parse ROLE/TASK headers at all, and why the "competent expert" prior is already baked in.
- Zheng, M. et al. (2024). When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performance. — the controlled persona study behind §2.2's "roles are often cargo cult" claim.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. — the canonical case that grounding answers in supplied sources beats relying on model priors, the engine behind the refusal rule.
- Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. — taxonomy and causes of fabrication, the failure mode the "not specified" license is designed to suppress.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. — demonstrates that output structure and field order (§2.5's "field order is computation order") shape the reasoning, not just the formatting.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. — shows how small, specific instructional phrasing changes behavior, sharpening the case for explicit TASK and CONSTRAINTS over vibes.