05 · Structured Output & Tool-Ready Prompts

5.1

Why structure: code, evals, agents

Three consumers force the issue. Downstream code needs to index into the answer — result["sentiment"] either exists or your pipeline throws at 3 a.m. Eval harnesses need to grade thousands of outputs mechanically; if the answer's location varies, you end up grading the extractor instead of the model. Agents are the extreme case: every tool call is a structured output, and every loop iteration re-parses one. A format that holds 97% of the time feels reliable in a chat window and is a disaster in a chain:

EQ P5.1 — RELIABILITY COMPOUNDS AGAINST YOU $$ \Pr[\text{pipeline survives}] \;=\; \prod_{i=1}^{k} p_i \;\xrightarrow{\;p_i \,=\, p\;}\; p^{\,k}, \qquad 0.97^{20} \approx 0.54 $$

$p_i$ is the probability call $i$ yields parseable, schema-valid output. An agent that makes twenty calls per task at 97% per-call validity fails almost half its runs on formatting alone — before any reasoning error is counted. Structure work is reliability work, not cosmetics.
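The arithmetic in EQ P5.1 is easy to check numerically. The sketch below is a minimal illustration, not library code: the names `pipeline_survival` and `required_per_call` are invented here, and the second function assumes every call shares the same validity $p$ so that the equation can be inverted.

```python
import math

def pipeline_survival(per_call_validity):
    """EQ P5.1: probability that every call yields schema-valid output."""
    return math.prod(per_call_validity)

def required_per_call(target, k):
    """Invert p**k = target, assuming all k calls share the same validity p."""
    return target ** (1.0 / k)

# Twenty calls at 97% per-call validity, as in the text.
print(f"20 calls at 0.97: {pipeline_survival([0.97] * 20):.2f}")

# How reliable must each call be for the whole chain to survive 99% of the time?
print(f"per-call validity for 0.99 over 20 calls: {required_per_call(0.99, 20):.4f}")
```

The first line reproduces the 0.54 figure. The second shows the inverse view: a 99% end-to-end target over twenty calls leaves very little room for per-call format failures, which is the practical argument for the stronger rungs introduced next.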

There is a real tension to keep in view throughout: the tighter you clamp the format, the less room the model has to think en route to the answer (Chapter 04). The professional pattern is to separate the two — free-form reasoning first, clamped answer last — and every technique below composes with that split.

An agent makes $10$ tool calls per task, each yielding schema-valid output with probability $p = 0.95$. Using EQ P5.1, what is the probability the whole pipeline survives on formatting alone?

$\Pr[\text{survives}] = p^{k} = 0.95^{10} = 0.5987\ldots \approx$ 0.60. Four in ten runs die on formatting before a single reasoning error is counted — which is why structure work is reliability work, and why rung 6 (a decoder that cannot emit invalid output) earns its latency cost.

5.2

The toolbox, ranked

Six rungs, ordered by how strong a guarantee you get. Each rung up costs a little flexibility and buys a lot of validity. Most production systems run rung 3 or 4 with the rung-6 safety net where the API offers it.

FIG P5.1 · THE STRUCTURE LADDER — GUARANTEE STRENGTH BY TECHNIQUE

Rungs 1–5 shape a probability distribution; rung 6 truncates it. Everything below rung 6 can still fail, which is why §5.6 exists.

Rung 1 — ask in prose. "Classify the sentiment and give a confidence." The model decides the format per-call: sometimes a sentence, sometimes a bulleted list, sometimes a table. Fine for humans, hostile to parsers.

# Rung 1 — hope as a strategy
Classify the sentiment of this review and give a confidence score.

Rung 2 — show a template. Models imitate far better than they obey. An exact output skeleton in the prompt collapses most of the format variance at the cost of a few input tokens:

# Rung 2 — show, don't describe
Respond in exactly this format, nothing else:

SENTIMENT: <positive | neutral | negative>
CONFIDENCE: <0.00-1.00>
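On the consuming side, a template this flat can be read with one anchored regular expression. The following is a sketch under stated assumptions: the function name `parse_template_reply` is hypothetical, and the parser deliberately rejects anything that strays from the two-line skeleton rather than trying to salvage it.

```python
import re

# Anchored at both ends so preambles and trailing chatter cause a clean rejection.
TEMPLATE_RE = re.compile(
    r"^SENTIMENT:\s*(positive|neutral|negative)\s*\n"
    r"CONFIDENCE:\s*([01](?:\.\d+)?)\s*$"
)

def parse_template_reply(reply):
    """Return (sentiment, confidence), or None if the reply strays from the template."""
    m = TEMPLATE_RE.match(reply.strip())
    if m is None:
        return None
    confidence = float(m.group(2))
    if not 0.0 <= confidence <= 1.0:      # the regex alone would let 1.5 through
        return None
    return m.group(1), confidence

print(parse_template_reply("SENTIMENT: negative\nCONFIDENCE: 0.87"))
print(parse_template_reply("I'd say it's mostly negative, maybe 0.87 confident."))
```

Returning `None` instead of guessing keeps the failure visible, so it can be counted and retried rather than silently absorbed.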

Rung 3 — XML tags. Tags delimit fields without escaping rules: the content between them can contain quotes, newlines, code, even JSON, and a one-line regex still extracts it. Claude-family models are conspicuously good at this rung — §5.3 explains why.

# Rung 3 — tags delimit; content stays free
<analysis>
  <sentiment>negative</sentiment>
  <confidence>0.87</confidence>
  <quote>the hinge snapped after two weeks</quote>
</analysis>
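The "one-line regex" claim can be made concrete. This is a minimal sketch, assuming tags are not nested inside themselves; `extract_tag` is a name invented for this example. Note that the quote field carries double quotes and a newline, neither of which needs escaping.

```python
import re

def extract_tag(text, tag):
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    m = re.search(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

reply = """Here is my analysis.
<analysis>
  <sentiment>negative</sentiment>
  <confidence>0.87</confidence>
  <quote>the "hinge" snapped
after two weeks</quote>
</analysis>"""

print(extract_tag(reply, "sentiment"))
print(float(extract_tag(reply, "confidence")))
print(repr(extract_tag(reply, "quote")))
```

The non-greedy match stops at the first closing tag, which is why a distinctive tag name matters: the only way to break extraction is for the literal closer to appear inside the payload.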

Rung 4 — JSON with a schema in the prompt. When downstream code wants typed data, show the model the actual JSON Schema. Compliance is still probabilistic, but the schema's description strings double as per-field instructions (§5.5):

# Rung 4 — the schema is part of the prompt
Return a single JSON object matching this schema. No markdown fences.
{ "type": "object",
  "properties": {
    "sentiment":  { "type": "string", "enum": ["positive","neutral","negative"] },
    "confidence": { "type": "number", "description": "calibrated, in [0,1]" } },
  "required": ["sentiment","confidence"] }

Rung 5 — prefill the assistant turn. Don't ask for JSON — start writing it yourself and let the model continue. The preamble ("Sure! Here's…") becomes impossible rather than discouraged (§5.4):

# Rung 5 — start the answer yourself
{"role": "user",      "content": "Classify ... Respond with JSON only."}
{"role": "assistant", "content": "{\"sentiment\":"}   ← prefill: the reply MUST continue from here

Rung 6 — constrained decoding / structured-output APIs. The serving stack compiles your schema to a grammar and masks every token that would violate it (§5.7). Invalid JSON is not unlikely; it is unrepresentable:

# Rung 6 — the decoder cannot emit invalid JSON
output_format = { "type": "json_schema", "schema": { ... } }
# or: tools=[...] with the choice forced — the arguments ARE the structured output

Rung	Guarantee	Costs you	Reach for it when
1 · Prose ask	none	your weekend	A human reads the output
2 · Template	weak	~20 input tokens	Simple flat fields, quick scripts
3 · XML tags	moderate	verbosity	Free-text fields; Claude-family; streaming extraction
4 · JSON + schema	moderate+	schema tokens	Typed data, nested objects
5 · Prefill	strong start	no extended thinking	Killing preambles; forcing the first token
6 · Constrained	syntactic certainty	latency, some quality	Anything load-bearing the API supports

5.3

XML tags as attention anchors

Why do tags work so well — and why especially on Claude? Four reasons, in decreasing order of how confident you should be in them:

Tags are rare, distinctive token sequences. <sentiment> appears nowhere in ordinary prose, so it makes an unambiguous key for attention to bind to. The induction-head circuit (Vol II · Ch 03) — find an earlier occurrence of the current pattern, copy what followed — is precisely the machinery that, having seen an opening tag in the instructions, reproduces it and later closes it. A tag is an address the model can attend to exactly, where "the second paragraph of your answer" is not.
Training distribution. Anthropic's own system prompts, tool harnesses, and post-training data lean heavily on XML-style scaffolding, and their documentation has recommended tags since the first Claude. The model has seen millions of examples where tags delimit semantically distinct regions and the structure is always respected. This is behavioral and circumstantial evidence — no public mechanistic study isolates "XML compliance" in the weights — but the effect size in practice is large and stable.
No escaping rules. JSON dies on an unescaped quote or newline inside a string. Tag content is free text; the only collision is the literal closing tag appearing in the payload, which for a name like <verbatim_quote> is essentially never.
Streaming-friendly. A tag block is parseable the moment it closes, mid-generation. JSON is all-or-nothing until the final brace.
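The streaming property in the last item can be sketched with a simulated token stream. Everything here is illustrative: `stream_closed_tags` is a hypothetical helper, the chunks stand in for whatever a streaming API would deliver, and the sketch only handles the first occurrence of each tag.

```python
import re

def stream_closed_tags(chunks, tags):
    """Yield (tag, content) as soon as each tag's closing marker has arrived."""
    buffer, emitted = "", set()
    for chunk in chunks:
        buffer += chunk
        for tag in tags:
            if tag in emitted:
                continue
            m = re.search(rf"<{tag}>(.*?)</{tag}>", buffer, re.DOTALL)
            if m:
                emitted.add(tag)
                yield tag, m.group(1).strip()

# Simulated stream: chunk boundaries fall mid-tag on purpose.
chunks = ["<senti", "ment>neg", "ative</sentiment>\n<quo", "te>the hinge ", "snapped</quote>"]
for tag, content in stream_closed_tags(chunks, ["sentiment", "quote"]):
    print(tag, "->", content)
```

The sentiment field becomes available after the third chunk, before the quote has even started, so downstream code can act on early fields while later ones are still being generated.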

Craft rules: name tags semantically (<diagnosis>, not <output2>) — the name itself is a micro-instruction; refer to tags by name in the instructions ("put your reasoning in <thinking>"); nest shallowly; and keep tag vocabulary consistent across few-shot examples, because the model will imitate your inconsistencies just as faithfully as your structure.

5.4

Prefilling: forcing the first token

Every instruction in the prompt merely tilts the output distribution. Prefilling edits the sample itself: you submit the conversation with a final assistant message already begun, and the model has no choice but to continue from your text. The arithmetic is the autoregressive factorization — there is no step at which "Sure, here's the JSON…" can be emitted, because those positions are already spent:

EQ P5.2 — PREFILL AS HARD CONDITIONING $$ y \;\sim\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid x,\; c,\; y_{<t}\right) $$

$x$ is the prompt, $c$ the prefill — tokens fixed by you, not sampled. Instructions change $p_\theta$'s tilt; the prefill changes the support: every generated token is conditioned on $c$ having already been said. Start $c = \texttt{\{}$ and the preamble has probability zero by construction, not by persuasion.

The standard plays:

Kill the preamble. Prefill { for JSON, or <analysis> for a tag block. Pair with a stop sequence on the matching closer ("}" won't work for nested JSON — but </analysis> works perfectly for tags) and the response is the payload, whole and nothing but.
Skip a rehearsed opening. In extraction loops where the model re-explains its task every call, prefilling past the boilerplate saves output tokens — the expensive kind.
Hold a role. A prefilled in-character first sentence is a stronger anchor against persona drift than another paragraph of system prompt.

Caveats, honestly stated. Prefilling is incompatible with extended-thinking modes on current Anthropic APIs (the model must open its own reasoning block); a prefill ending in trailing whitespace is rejected; and a prefill is a strong start, not a guarantee — the model can close your brace and append commentary. Prefill shapes the head of the sequence; stop sequences guard the tail; the validator (§5.6) catches what slips between.
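The mechanics of a prefill, including the tail problem named in the caveats, can be sketched without calling any API. Assumptions are stated loudly: `fake_model` is a hypothetical stand-in that returns only the continuation (not the prefill), the message shape mirrors the rung-5 example, and `build_messages` and `assemble` are names invented here.

```python
import json

def build_messages(user_prompt, prefill):
    """Conversation whose final assistant turn is already begun (the prefill)."""
    # Trailing whitespace is stripped because, per the caveat above, it is rejected.
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": prefill.rstrip()},
    ]

def fake_model(messages):
    """Hypothetical stand-in for an API call: returns only the continuation."""
    return ' "negative", "confidence": 0.87}\n\nLet me know if you need anything else!'

def assemble(prefill, continuation):
    """Rejoin prefill and continuation, then keep only the first JSON value."""
    text = prefill.rstrip() + continuation
    obj, _end = json.JSONDecoder().raw_decode(text)   # stops at the closing brace
    return obj

prefill = '{"sentiment":'
messages = build_messages("Classify the review. Respond with JSON only.", prefill)
result = assemble(prefill, fake_model(messages))
print(result)
```

The stub closes the brace and then appends commentary, which is exactly the failure the caveat describes. `raw_decode` parses the first complete JSON value and ignores what follows, which guards the tail here; the result should still go through the validator of §5.6.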

5.5

Schemas and function calling

A JSON Schema does double duty. In-prompt (rung 4), it is documentation the model reads and probably follows. API-enforced (OpenAI structured outputs, Anthropic's structured outputs and strict tool use, which first shipped as a beta in late 2025, or open-stack guided decoding), the same schema is compiled into the decoder and compliance stops being the model's decision (§5.7). Either way, the highest-leverage tokens in the schema are the description strings:

Field descriptions are mini-prompts. The description is what the model reads at the moment it fills that field — instruction placed at the exact point of decision, which Chapter 02 taught you is the best real estate in the context. Write them as imperatives with edge cases: not "the quote" but "verbatim quote copied character-for-character from the input; never paraphrase; empty array if none". Teams that A/B their tool descriptions routinely find double-digit accuracy swings from description wording alone — it is the cheapest fine-tuning you will ever do.

Function calling is structured output wearing a dispatcher. A tool definition is a name, a description ("when to call me"), and an input_schema ("how to call me"). The model's tool call is a structured output validated against that schema; the agent loop parses it, executes, and returns a result. Everything in this chapter applies verbatim — a flaky tool-argument format is exactly the compounding failure of EQ P5.1. And one warning carries over with extra force: schema enforcement guarantees shape, not truth. A guaranteed-well-formed "confidence": 0.93 is still a made-up number unless you've done the calibration work of Chapter 04.
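The "structured output wearing a dispatcher" framing can be sketched as a tool registry plus a validating dispatch step. This is an illustration under assumptions, not any vendor's SDK: the `get_weather` tool, the `{"name", "input"}` shape of the tool call, and the tiny `check_args` validator (required, type, and enum only) are all hypothetical.

```python
def get_weather(city, unit="celsius"):
    """Hypothetical tool implementation; returns canned data."""
    return {"city": city, "temp": 21, "unit": unit}

TOOLS = {
    "get_weather": {
        "description": "Call when the user asks about current weather in a named city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
        "fn": get_weather,
    }
}

PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def check_args(args, schema):
    """Tiny subset of JSON Schema: required, type, enum. Returns a list of errors."""
    errors = [f"missing required field '{k}'" for k in schema.get("required", []) if k not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected field '{key}'")
        elif not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"'{key}' should be {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"'{key}' must be one of {spec['enum']}")
    return errors

def dispatch(tool_call):
    """Validate a model-produced tool call, then execute it or return the errors."""
    tool = TOOLS.get(tool_call["name"])
    if tool is None:
        return {"error": [f"unknown tool '{tool_call['name']}'"]}
    errors = check_args(tool_call["input"], tool["input_schema"])
    if errors:
        return {"error": errors}          # fed back to the model as the tool result
    return {"result": tool["fn"](**tool_call["input"])}

print(dispatch({"name": "get_weather", "input": {"city": "Oslo"}}))
print(dispatch({"name": "get_weather", "input": {"city": "Oslo", "unit": "kelvin"}}))
```

The second call is well-formed JSON with a hallucinated enum value; the dispatcher refuses to execute it and returns the error text, which an agent loop can hand back to the model instead of crashing.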

INSTRUMENT P5.1 — SCHEMA→PROMPT BUILDER · FIELDS IN · SCHEMA + XML TEMPLATE + EXAMPLE OUT

[Interactive instrument. Inputs per field: name, type, description (the mini-prompt), enum values (comma-separated, enum type only), required. Output panes: A, JSON Schema (paste into output_format / input_schema); B, prompt section with XML template (rung 3); C, a valid output (what a good call returns). Readouts: field count, required count, schema size (≈ tokens).]

Add a field and watch all three panes regenerate: the machine-facing schema, the model-facing XML prompt section, and a realistic valid output. Note how your DESCRIPTION text lands in both the schema and the per-field rules — write it as an instruction, because that is what it is. Token estimate is chars/4, illustrative.
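For readers without the interactive version, the same idea fits in a short function. This is a rough sketch of what such a builder might do, not the instrument's actual code: the `build` function, the `<result>` wrapper tag, and the example fields are assumptions made for illustration.

```python
import json

def build(fields):
    """fields: list of dicts with name, type, description, optional enum, required."""
    properties, required, rules, template = {}, [], [], ["<result>"]
    for field in fields:
        spec = {"type": field["type"], "description": field["description"]}
        if field.get("enum"):
            spec["enum"] = field["enum"]
        properties[field["name"]] = spec
        if field.get("required", True):
            required.append(field["name"])
        # The same description text becomes a per-field rule in the prompt.
        rules.append(f"- <{field['name']}>: {field['description']}")
        template.append(f"  <{field['name']}>...</{field['name']}>")
    template.append("</result>")
    schema = {"type": "object", "properties": properties, "required": required}
    prompt = ("Respond in exactly this structure:\n" + "\n".join(template)
              + "\n\nField rules:\n" + "\n".join(rules))
    return schema, prompt

fields = [
    {"name": "sentiment", "type": "string", "enum": ["positive", "neutral", "negative"],
     "description": "Overall sentiment of the review; choose exactly one enum value."},
    {"name": "quote", "type": "string", "required": False,
     "description": "Verbatim quote copied character-for-character from the input; never paraphrase."},
]
schema, prompt = build(fields)
print(json.dumps(schema, indent=2))
print(prompt)
print("approx tokens:", len(json.dumps(schema)) // 4)   # chars/4, illustrative
```

Generating both artifacts from one field list keeps the machine-facing schema and the model-facing instructions from drifting apart when a description is edited.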

5.6

Failure modes and defensive parsing

Below rung 6, model output is a probable format, and the tail of that distribution is where pipelines die. The recurring villains: trailing commas (legal in every JavaScript file the model trained on, illegal in JSON), markdown fences wrapping the payload, preamble and postamble prose, hallucinated enum values that parse cleanly and fail silently, Python literals ('single quotes', True, None) from code-heavy training data, and truncation when max_tokens lands mid-string. A production parser is a pipeline, not a call:

# The defensive parsing pipeline — every stage logs what it touched
extract:  take the outermost balanced {...} block  # strips fences + prose
repair:   trailing commas · smart quotes · True/None  # mechanical, logged
parse:    strict JSON.parse — no eval, ever
validate: schema check (types, enums, ranges, required)
retry:    re-prompt with the validator's error message verbatim, ≤ 2 attempts
surface:  persistent failure is a signal, not noise — count it in your evals

INSTRUMENT P5.2 — PARSE ROULETTE · SIX REAL FAILURE MODES · LIVE JSON.parse IN YOUR BROWSER

[Interactive instrument. Readouts per batch: raw outputs surviving parse + schema (out of 6), and outputs surviving after the defensive pipeline (out of 6).]

Each card is a plausible model reply. PARSE RAW runs an actual JSON.parse plus a schema check (enum + types) — the verdict text is the real engine error. DEFEND applies that card's documented fix and re-runs. Note card 4: it parses green and validates red — the failure mode a try/catch alone never catches. And card 6's bracket-balancer "succeeds" by silently amputating data: detection and re-request beat repair.
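Detection of the truncation case can be sketched as a single scan that tracks bracket depth and string state. The helper names below are invented for this example, and the scan is a heuristic over JSON-like text rather than a parser; where an API reports why generation stopped, that signal is the more direct check.

```python
def json_truncation_state(text):
    """Scan JSON-like text; report open brackets and whether a string was cut off."""
    stack, in_string, escaped = [], False, False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False           # this character was escaped; skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()
    return {"open_brackets": len(stack), "inside_string": in_string}

def is_truncated(text):
    """True when the text ends with unclosed brackets or mid-string."""
    state = json_truncation_state(text)
    return state["open_brackets"] > 0 or state["inside_string"]

complete  = '{"label": "negative", "quote": "the hinge {snapped}"}'
cut_short = '{"label": "negative", "quote": "the hinge sna'
print(is_truncated(complete), is_truncated(cut_short))
```

A positive result is a reason to re-request with a larger token budget, not to append the missing closers: balancing the brackets would produce valid JSON around a quote that ends mid-word.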

Here are the first two stages of that pipeline as code you can run and break: extract and repair, applied to six realistic messy replies. Watch the last case fail. Syntax-only repair cannot touch Python literals, which is exactly why the pipeline's honest answer to that one is a retry, not a cleverer regex.

PYTHON · RUNNABLE IN-BROWSER

# defensive JSON repair: strip fences, grab outermost {...}, fix trailing commas, json.loads
import json, re
np = __import__("numpy"); np.random.default_rng(0)            # seeded per house style; logic is deterministic

def repair(s):
    s = re.sub(r"```[a-zA-Z]*", "", s).replace("```", "")    # 1. strip code fences
    a, b = s.find("{"), s.rfind("}")
    if a < 0 or b <= a: return None                          # 2. find outermost {...}
    s = re.sub(r",\s*([}\]])", r"\1", s[a:b+1])              # 3. kill trailing commas
    try: return json.loads(s)                                # 4. strict parse — never eval
    except json.JSONDecodeError: return None

cases = [
    '{"label": "negative", "score": 0.91}',                  # already clean
    '```json\n{"label": "neutral", "score": 0.5}\n```',      # markdown fence
    'Sure! Here you go:\n{"label": "positive", "score": 0.8}\nHope that helps!',  # pre/postamble
    '{"label": "negative", "score": 0.87,}',                 # trailing comma
    '{"label": "positive", "tags": ["a", "b",],}',           # nested trailing commas
    "{'label': 'negative'}",                                 # python literals — syntax repair can't help
]
ok = 0
for i, c in enumerate(cases, 1):
    r = repair(c); ok += r is not None
    print(f"case {i}: {'PARSED' if r is not None else 'FAILED'}  -> {r}")
print(f"\nrepaired {ok}/{len(cases)}; only the Python-literal case defeats syntax-only repair -> retry")


Five of six recover, and the sixth fails loudly — which is the point. The single-quote / True / None dialect from code-heavy training data is genuinely ambiguous (an apostrophe inside a value will break any quote-swapping regex), so the disciplined move is to surface the failure and re-prompt with the parser's error rather than paper over it. Repair what is mechanical; escalate what is semantic.

The defensive parsing pipeline runs over $6$ messy model replies. Mechanical repair (strip fences, fix trailing commas, grab the outermost braces) recovers $5$; the Python-literal case defeats it and gets escalated to a retry. What fraction does the pipeline recover without a retry?

$5 / 6 \approx$ 0.833. Five recover silently; the sixth fails loudly and re-prompts — the discipline is to repair what is mechanical and escalate what is semantic, never to paper over an ambiguous dialect with a cleverer regex.

RULE

Repair syntax mechanically; never repair semantics silently. Stripping a fence loses nothing. Mapping "slightly_negative" to "negative" changes the answer — do it only with a log line, or better, send the validator error back to the model and let it correct itself (Chapter 06 builds this into a full critique loop).
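The validate and retry stages of the pipeline can be sketched the same way as extract and repair. This is a minimal illustration under assumptions: the model is a hypothetical stub that returns a bad enum value first and a corrected one second, and `validate` hand-checks the two fields of the running sentiment example rather than using a schema library.

```python
import json

ALLOWED = {"positive", "neutral", "negative"}

def validate(obj):
    """Return a list of human-readable schema errors (empty list means valid)."""
    if not isinstance(obj, dict):
        return ["top-level value must be an object"]
    errors = []
    if obj.get("sentiment") not in ALLOWED:
        errors.append(f"sentiment must be one of {sorted(ALLOWED)}, got {obj.get('sentiment')!r}")
    conf = obj.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append(f"confidence must be a number in [0, 1], got {conf!r}")
    return errors

def call_with_retry(model, prompt, max_retries=2):
    """Parse and validate; on failure re-prompt with the error text verbatim."""
    log, message = [], prompt
    for attempt in range(1 + max_retries):
        raw = model(message)
        try:
            obj = json.loads(raw)
            errors = validate(obj)
        except json.JSONDecodeError as exc:
            obj, errors = None, [f"invalid JSON: {exc}"]
        if not errors:
            return obj, log
        log.append({"attempt": attempt, "raw": raw, "errors": errors})
        message = f"{prompt}\n\nYour previous reply was rejected:\n" + "\n".join(errors)
    return None, log                      # persistent failure: surface it, count it

# Hypothetical stand-in for a model: wrong enum first, corrected on the retry.
replies = iter(['{"sentiment": "slightly_negative", "confidence": 0.8}',
                '{"sentiment": "negative", "confidence": 0.8}'])
result, log = call_with_retry(lambda _msg: next(replies), "Classify the review.")
print(result)
print(log[0]["errors"])
```

The near-miss enum is never mapped to a legal value by the parser. It is logged, the validator's message goes back to the model, and the correction comes from the model itself, which is the rule above in executable form.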

5.7

Constrained decoding under the hood

Rung 6 is not prompting at all — it is surgery on the sampling step you met in Vol II · Ch 08. The schema (or regex, or context-free grammar: llama.cpp's GBNF, Outlines, XGrammar, vLLM guided decoding) is compiled into an automaton over the tokenizer's vocabulary. At every step the automaton's current state $q_s$ defines the set of tokens that keep the output grammatical; everything else is masked to $-\infty$ before softmax:

EQ P5.3 — LOGIT MASKING BY GRAMMAR $$ \tilde{p}(v \mid s) \;=\; \frac{ p_\theta(v \mid s)\; \mathbf{1}\!\left[v \in \mathcal{A}(q_s)\right] }{ \sum_{u \,\in\, \mathcal{A}(q_s)} p_\theta(u \mid s) } $$

$\mathcal{A}(q_s)$ is the allowed-token set in automaton state $q_s$. The model's distribution is renormalized over the legal moves only, then temperature and top-p (Vol II · EQ 8.2 machinery) apply to the survivors. Validity becomes a property of the decoder, not of the model's cooperation.

The toy below is EQ P5.3 in a few lines of NumPy. A tiny vocabulary holds three legal enum members alongside near-misses the model is fond of (slightly_negative, POSITIVE) and structural debris ({, </s>). We draw seeded random logits that happen to prefer the wrong tokens, then sample 200 times with and without the grammar mask. Free sampling scatters across the vocabulary; masked sampling cannot leave the enum no matter what the model wanted:

PYTHON · RUNNABLE IN-BROWSER

# constrained decoding toy — mask logits to a grammar so only valid enum tokens survive
import numpy as np
rng = np.random.default_rng(0)
vocab   = ["positive","neutral","negative","slightly_negative","POSITIVE","maybe","{","</s>"]
allowed = {"positive","neutral","negative"}                  # the grammar: enum members only
mask    = np.array([1.0 if t in allowed else 0.0 for t in vocab])

def sample(logits, constrain):
    z = np.where(mask > 0, logits, -np.inf) if constrain else logits.copy()  # EQ P5.3
    p = np.exp(z - z.max()); p /= p.sum()                    # softmax over survivors
    return vocab[rng.choice(len(vocab), p=p)]

logits  = rng.normal(0, 2, len(vocab))                       # what the model "wants"
free    = [sample(logits, False) for _ in range(200)]
grammar = [sample(logits, True ) for _ in range(200)]

bad_free = sum(t not in allowed for t in free)
bad_con  = sum(t not in allowed for t in grammar)
print("argmax token (what the model wanted):", vocab[int(logits.argmax())])
print(f"FREE sampling  : {bad_free:3d}/200 invalid   e.g. {sorted(set(free) - allowed)[:3]}")
print(f"GRAMMAR-MASKED : {bad_con:3d}/200 invalid   emitted set = {sorted(set(grammar))}")
print("invalid output is not unlikely under masking — it is unrepresentable")


The model's single most-wanted token here is { — structurally useless for an enum field — yet the masked column emits only the three legal values, 0/200 violations. That is the whole promise of rung 6: validity is a property of the decoder, not of the model's cooperation. Now read the second subtlety below in that light — the mask renormalizes the model's distribution, it does not improve it, so a model that wanted { is being dragged somewhere it assigns low joint probability.

At one decode step the model's softmax puts probability $0.30$ on the legal token "negative" and $0.10$ on the legal token "neutral"; every other token is masked out by the grammar. After the renormalization in EQ P5.3, what probability does "negative" get?

$\tilde p(\texttt{negative}) = \dfrac{0.30}{0.30 + 0.10} = \dfrac{0.30}{0.40} =$ 0.75. The mask keeps the model's relative preference among legal moves and discards the $1 - 0.40 = 0.60$ of mass it wanted to spend on illegal tokens — renormalizing the distribution, not improving it.

Two subtleties separate the good implementations from the slow or broken ones:

Token–grammar misalignment. The grammar is defined over characters, but the model emits tokens — and "true" might be one token or four, with thousands of vocabulary entries spanning any given character boundary. Engines precompute, for each automaton state, which of the ~100K+ tokens are admissible (Outlines' FSM indexing, XGrammar's adaptive token-mask cache), turning a per-step vocabulary scan into a lookup. Done naively, masking dominates decode latency; done well, it is near-free.
Distribution distortion. Masking is greedy with respect to the grammar: each step keeps locally-legal tokens, but the model may be forced down a path it assigns low joint probability — valid JSON it never meant to write, with quality falling where the mask bit hardest. Measured effects on reasoning-heavy tasks are real but contested in size, and the consensus mitigation is the split this chapter keeps returning to: let the model reason unconstrained, then constrain only the final answer — either two calls, or thinking tags followed by an enforced answer block.
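The precomputation described under token–grammar misalignment can be sketched on a toy scale. Everything here is an assumption made for illustration: the grammar accepts only the literals true or false, the vocabulary is invented, and the character-level automaton is a simple prefix trie rather than anything a production engine would use.

```python
LITERALS = ["true", "false"]                       # the toy grammar: one of two literals
VOCAB = ["t", "tr", "true", "ru", "ue", "e", "f", "fal", "false", "se", "a", "ls", "x", "{"]

# Character-level DFA as a prefix trie: each state is the prefix consumed so far.
STATES = {lit[:i] for lit in LITERALS for i in range(len(lit) + 1)}

def step(state, ch):
    """One character transition; None means the grammar rejects."""
    nxt = state + ch
    return nxt if nxt in STATES else None

def walk(state, token):
    """Feed a whole multi-character token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

# Precompute once: state -> {admissible token: state after emitting it}.
MASK_TABLE = {}
for s in sorted(STATES):
    MASK_TABLE[s] = {}
    for tok in VOCAB:
        end = walk(s, tok)
        if end is not None:
            MASK_TABLE[s][tok] = end

print("start state allows:", sorted(MASK_TABLE[""]))
print("after 'tr' allows :", sorted(MASK_TABLE["tr"]))
print("after 'fal' allows:", sorted(MASK_TABLE["fal"]))
```

At decode time the per-step work is a dictionary lookup on the current state instead of a scan of the whole vocabulary. A real engine would also admit an end-of-sequence token in accepting states and handle grammars far larger than a trie; those parts are omitted here.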

So the ladder closes its loop: rung 6 guarantees syntax by reaching into the sampler, and rungs 1–5 remain the art of making the model want what the grammar permits — because a constrained decoder dragging an unwilling distribution through a schema produces exactly the hallucinated-but-well-formed fields that §5.5 warned about.

Structure makes output checkable — now make the model check it. Chapter 06: self-critique loops, rubric-driven revision, red-team prompts that attack your own system, and councils of models that grade each other's work.
