Why structure: code, evals, agents
Three consumers force the issue. Downstream code needs to index into the answer — result["sentiment"] either exists or your pipeline throws at 3 a.m. Eval harnesses need to grade thousands of outputs mechanically; if the answer's location varies, you end up grading the extractor instead of the model. Agents are the extreme case: every tool call is a structured output, and every loop iteration re-parses one. A format that holds 97% of the time feels reliable in a chat window and is a disaster in a chain:
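A minimal sketch of that compounding, assuming each step in the chain produces a parseable output independently and with the same probability p (a simplification; the rates and chain lengths below are illustrative, not measurements). Under that assumption an n-step chain survives with probability p to the power n:

```python
# Sketch: how per-step format reliability compounds across a chain.
# Assumption (not from the text): steps succeed independently with the same probability p.

def chain_success(p: float, n: int) -> float:
    """Probability that all n steps yield a parseable output."""
    return p ** n

for p in (0.97, 0.99, 0.999):
    row = "  ".join(f"n={n}: {chain_success(p, n):.3f}" for n in (1, 5, 20, 50))
    print(f"p={p}: {row}")
```

At 97% per step, a 20-step chain finishes cleanly only a little over half the time, which is why a rate that feels fine in chat does not survive an agent loop.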
There is a real tension to keep in view throughout: the tighter you clamp the format, the less room the model has to think en route to the answer (Chapter 04). The professional pattern is to separate the two — free-form reasoning first, clamped answer last — and every technique below composes with that split.
The toolbox, ranked
Six rungs, ordered by how strong a guarantee you get. Each rung up costs a little flexibility and buys a lot of validity. Most production systems run rung 3 or 4 with the rung-6 safety net where the API offers it.
Rung 1 — ask in prose. "Classify the sentiment and give a confidence." The model decides the format per-call: sometimes a sentence, sometimes a bulleted list, sometimes a table. Fine for humans, hostile to parsers.
```text
# Rung 1: hope as a strategy
Classify the sentiment of this review and give a confidence score.
```
Rung 2 — show a template. Models imitate far better than they obey. An exact output skeleton in the prompt collapses most of the format variance at the cost of a few input tokens:
```text
# Rung 2: show, don't describe
Respond in exactly this format, nothing else:
SENTIMENT: <positive | neutral | negative>
CONFIDENCE: <0.00-1.00>
```
Rung 3 — XML tags. Tags delimit fields without escaping rules: the content between them can contain quotes, newlines, code, even JSON, and a one-line regex still extracts it. Claude-family models are conspicuously good at this rung — §5.3 explains why.
```text
# Rung 3: tags delimit; content stays free
<analysis>
  <sentiment>negative</sentiment>
  <confidence>0.87</confidence>
  <quote>the hinge snapped after two weeks</quote>
</analysis>
```
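The "one-line regex" claim can be sketched as follows. This is an illustrative helper, not a library API: the function name `extract_tag` and the sample reply are assumptions, and the regex only handles flat, non-repeated tags.

```python
import re
from typing import Optional

def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    m = re.search(pattern, text, re.DOTALL)  # DOTALL lets content span newlines
    return m.group(1).strip() if m else None

# Hypothetical model reply: prose preamble, quotes and a newline inside a field.
reply = """Here is my read of the review.
<analysis>
<sentiment>negative</sentiment>
<confidence>0.87</confidence>
<quote>the hinge "snapped" after two weeks,
and support never replied</quote>
</analysis>"""

sentiment = extract_tag(reply, "sentiment")
confidence = float(extract_tag(reply, "confidence"))
quote = extract_tag(reply, "quote")
print(sentiment, confidence, repr(quote))
```

The quote field carries double quotes and a line break that would need escaping in JSON; here they pass through untouched, and the preamble is ignored for free.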
Rung 4 — JSON with a schema in the prompt. When downstream code wants typed data, show the model the actual JSON Schema. Compliance is still probabilistic, but the schema's description strings double as per-field instructions (§5.5):
```text
# Rung 4: the schema is part of the prompt
Return a single JSON object matching this schema. No markdown fences.
{ "type": "object",
  "properties": {
    "sentiment":  { "type": "string", "enum": ["positive","neutral","negative"] },
    "confidence": { "type": "number", "description": "calibrated, in [0,1]" } },
  "required": ["sentiment","confidence"] }
```
Rung 5 — prefill the assistant turn. Don't ask for JSON — start writing it yourself and let the model continue. The preamble ("Sure! Here's…") becomes impossible rather than discouraged (§5.4):
```text
# Rung 5: start the answer yourself
{"role": "user", "content": "Classify ... Respond with JSON only."}
{"role": "assistant", "content": "{\"sentiment\":"}   ← prefill: the reply MUST continue from here
```
Rung 6 — constrained decoding / structured-output APIs. The serving stack compiles your schema to a grammar and masks every token that would violate it (§5.7). Invalid JSON is not unlikely; it is unrepresentable:
```text
# Rung 6: the decoder cannot emit invalid JSON
output_format = { "type": "json_schema", "schema": { ... } }
# or: tools=[...] with the choice forced; the arguments ARE the structured output
```
| Rung | Guarantee | Costs you | Reach for it when |
|---|---|---|---|
| 1 · Prose ask | none | your weekend | A human reads the output |
| 2 · Template | weak | ~20 input tokens | Simple flat fields, quick scripts |
| 3 · XML tags | moderate | verbosity | Free-text fields; Claude-family; streaming extraction |
| 4 · JSON + schema | moderate+ | schema tokens | Typed data, nested objects |
| 5 · Prefill | strong start | no extended thinking | Killing preambles; forcing the first token |
| 6 · Constrained | syntactic certainty | latency, some quality | Anything load-bearing the API supports |
XML tags as attention anchors
Why do tags work so well — and why especially on Claude? Four reasons, in decreasing order of how confident you should be in them:
- Tags are rare, distinctive token sequences. <sentiment> appears nowhere in ordinary prose, so it makes an unambiguous key for attention to bind to. The induction-head circuit (Vol II · Ch 03), which finds an earlier occurrence of the current pattern and copies what followed, is precisely the machinery that, having seen an opening tag in the instructions, reproduces it and later closes it. A tag is an address the model can attend to exactly, where "the second paragraph of your answer" is not.
- Training distribution. Anthropic's own system prompts, tool harnesses, and post-training data lean heavily on XML-style scaffolding, and their documentation has recommended tags since the first Claude. The model has seen millions of examples where tags delimit semantically distinct regions and the structure is always respected. This is behavioral and circumstantial evidence (no public mechanistic study isolates "XML compliance" in the weights), but the effect size in practice is large and stable.
- No escaping rules. JSON dies on an unescaped quote or newline inside a string. Tag content is free text; the only collision is the literal closing tag appearing in the payload, which for a name like <verbatim_quote> is essentially never.
- Streaming-friendly. A tag block is parseable the moment it closes, mid-generation. JSON is all-or-nothing until the final brace.
Craft rules: name tags semantically (<diagnosis>, not <output2>) — the name itself is a micro-instruction; refer to tags by name in the instructions ("put your reasoning in <thinking>"); nest shallowly; and keep tag vocabulary consistent across few-shot examples, because the model will imitate your inconsistencies just as faithfully as your structure.
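The streaming point from the list above can be sketched as a small generator. This is an illustration under stated assumptions: `stream_tag` is a hypothetical helper, the chunk boundaries are invented, and the buffer is never trimmed when no tag arrives, which a production version would want to bound.

```python
from typing import Iterable, Iterator

def stream_tag(chunks: Iterable[str], tag: str) -> Iterator[str]:
    """Yield the content of each <tag>...</tag> block as soon as its closer arrives."""
    open_t, close_t = f"<{tag}>", f"</{tag}>"
    buf = ""
    for chunk in chunks:
        buf += chunk  # tags may be split across chunks, so search the running buffer
        while True:
            start = buf.find(open_t)
            end = buf.find(close_t, start + len(open_t)) if start >= 0 else -1
            if end < 0:
                break  # block not complete yet; wait for more chunks
            yield buf[start + len(open_t):end]
            buf = buf[end + len(close_t):]  # drop consumed text, keep the remainder

# Hypothetical token stream with tags split across chunk boundaries.
chunks = ["<sent", "iment>neg", "ative</senti", "ment><quote>hinge ", "snapped</quote>"]
print(list(stream_tag(chunks, "sentiment")))
print(list(stream_tag(chunks, "quote")))
```

The sentiment value becomes available at the fourth chunk, before the quote has finished generating; an equivalent JSON object would not parse until its final brace.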
Prefilling: forcing the first token
Every instruction in the prompt merely tilts the output distribution. Prefilling edits the sample itself: you submit the conversation with a final assistant message already begun, and the model has no choice but to continue from your text. The arithmetic is the autoregressive factorization — there is no step at which "Sure, here's the JSON…" can be emitted, because those positions are already spent:
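One way to write the factorization this paragraph appeals to, offered as a reconstruction from the surrounding text rather than the author's exact notation. Assumed symbols: \(x\) is the prompt, \(y_{1:k}\) the prefilled tokens, and \(y_{k+1:T}\) the model's continuation.

```latex
\[
p_\theta(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p_\theta(y_t \mid x,\, y_{<t})
\qquad\Longrightarrow\qquad
p_\theta(y_{k+1:T} \mid x,\, y_{1:k}) \;=\; \prod_{t=k+1}^{T} p_\theta(y_t \mid x,\, y_{<t})
\]
```

The prefill fixes \(y_{1:k}\), so the model only samples the factors for \(t > k\); a preamble would have to occupy positions \(1\) through \(k\), and those are already taken.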
The standard plays:
- Kill the preamble. Prefill { for JSON, or <analysis> for a tag block. Pair with a stop sequence on the matching closer ("}" won't work for nested JSON, but </analysis> works perfectly for tags) and the response is the payload, whole and nothing but.
- Skip a rehearsed opening. In extraction loops where the model re-explains its task every call, prefilling past the boilerplate saves output tokens, the expensive kind.
- Hold a role. A prefilled in-character first sentence is a stronger anchor against persona drift than another paragraph of system prompt.
Caveats, honestly stated. Prefilling is incompatible with extended-thinking modes on current Anthropic APIs (the model must open its own reasoning block); a prefill ending in trailing whitespace is rejected; and a prefill is a strong start, not a guarantee — the model can close your brace and append commentary. Prefill shapes the head of the sequence; stop sequences guard the tail; the validator (§5.6) catches what slips between.
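The head-and-tail division of labor can be sketched without committing to any particular client library. Everything named here is an assumption for illustration: `fake_model` stands in for a real API call, and the sketch assumes the API returns only the continuation and omits the matched stop sequence, so the caller reattaches both ends.

```python
from typing import Callable, Dict, List

PREFILL = "<analysis>"   # shapes the head of the reply
STOP = "</analysis>"     # guards the tail

def build_messages(review: str, prefill: str) -> List[Dict[str, str]]:
    """User turn plus a final assistant turn that has already begun."""
    return [
        {"role": "user", "content": f"Analyze this review inside <analysis> tags.\n\n{review}"},
        {"role": "assistant", "content": prefill},
    ]

def complete(call_model: Callable[[List[Dict[str, str]], List[str]], str], review: str) -> str:
    prefill = PREFILL.rstrip()  # a prefill ending in whitespace is rejected
    continuation = call_model(build_messages(review, prefill), [STOP])
    return prefill + continuation + STOP  # reattach head and tail around the payload

def fake_model(messages: List[Dict[str, str]], stop_sequences: List[str]) -> str:
    """Hypothetical stand-in for an API client: continues the prefill, halts at a stop sequence."""
    raw = "\n<sentiment>negative</sentiment>\n</analysis>\nLet me know if you need more!"
    cut = min((raw.find(s) for s in stop_sequences if s in raw), default=len(raw))
    return raw[:cut]

block = complete(fake_model, "The hinge snapped after two weeks.")
print(block)
```

The trailing commentary never reaches the caller because generation halts at the closer; anything malformed inside the block is still the validator's job.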
Schemas and function calling
A JSON Schema does double duty. In-prompt (rung 4), it is documentation the model reads and probably follows. API-enforced — OpenAI structured outputs, Anthropic's structured outputs and strict tool use (GA'd via beta in late 2025), open-stack guided decoding — the same schema is compiled into the decoder and compliance stops being the model's decision (§5.7). Either way, the highest-leverage tokens in the schema are the description strings:
Field descriptions are mini-prompts. The description is what the model reads at the moment it fills that field — instruction placed at the exact point of decision, which Chapter 02 taught you is the best real estate in the context. Write them as imperatives with edge cases: not "the quote" but "verbatim quote copied character-for-character from the input; never paraphrase; empty array if none". Teams that A/B their tool descriptions routinely find double-digit accuracy swings from description wording alone — it is the cheapest fine-tuning you will ever do.
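As a sketch of what imperative, edge-case-aware descriptions can look like in practice, here is a schema rendered into a rung-4 prompt. The field names and the description wording (other than the verbatim-quote example from the text) are illustrative assumptions, and `schema_prompt` is a hypothetical helper.

```python
import json

# Illustrative schema; field names and description wording are assumptions.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "Overall sentiment; use neutral only when positive and negative signals balance.",
        },
        "quotes": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Verbatim quotes copied character-for-character from the input; "
                           "never paraphrase; empty array if none.",
        },
        "confidence": {"type": "number", "description": "Calibrated, in [0,1]."},
    },
    "required": ["sentiment", "quotes", "confidence"],
}

def schema_prompt(schema: dict) -> str:
    """Render the rung-4 instruction with the schema pasted in verbatim."""
    return (
        "Return a single JSON object matching this schema. No markdown fences.\n"
        + json.dumps(schema, indent=2)
    )

prompt = schema_prompt(REVIEW_SCHEMA)
print(prompt)
```

Keeping the schema as one dictionary means the same object can be pasted into a prompt, handed to a validator, or passed to an enforcing API, so the descriptions are written once.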
Function calling is structured output wearing a dispatcher. A tool definition is a name, a description ("when to call me"), and an input_schema ("how to call me"). The model's tool call is a structured output validated against that schema; the agent loop parses it, executes, and returns a result. Everything in this chapter applies verbatim — a flaky tool-argument format is exactly the compounding failure of EQ P5.1. And one warning carries over with extra force: schema enforcement guarantees shape, not truth. A guaranteed-well-formed "confidence": 0.93 is still a made-up number unless you've done the calibration work of Chapter 04.
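A minimal sketch of the parse-validate-execute step of an agent loop, assuming a tool in the name / description / input_schema shape described above. The tool itself, the `{"name", "input"}` shape of the tool call, and the hand-rolled `check_args` validator are all hypothetical; a real system would likely use a full JSON Schema validator.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool definition: name, "when to call me", "how to call me".
LOOKUP_ORDER = {
    "name": "lookup_order",
    "description": "Call when the user asks about the status of a specific order.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID exactly as the user wrote it."},
            "include_history": {"type": "boolean", "description": "True only if the user asks for past events."},
        },
        "required": ["order_id"],
    },
}

PY_TYPES = {"string": str, "boolean": bool}  # only the types this toy schema uses

def check_args(args: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Shape check only: required fields present, no unknown fields, types match."""
    errors = [f"missing required field: {k}" for k in schema.get("required", []) if k not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected field: {key}")
        elif not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors

def dispatch(tool_call: Dict[str, Any], tools: Dict[str, dict], handlers: Dict[str, Callable]) -> Dict[str, Any]:
    """Validate the model's arguments before executing; return errors as a tool result."""
    tool = tools[tool_call["name"]]
    errors = check_args(tool_call["input"], tool["input_schema"])
    if errors:
        return {"is_error": True, "content": "; ".join(errors)}
    return {"is_error": False, "content": handlers[tool_call["name"]](**tool_call["input"])}

def lookup_order(order_id: str, include_history: bool = False) -> str:
    return f"order {order_id}: shipped" + (" (history attached)" if include_history else "")

TOOLS = {LOOKUP_ORDER["name"]: LOOKUP_ORDER}
HANDLERS = {"lookup_order": lookup_order}

good = dispatch({"name": "lookup_order", "input": {"order_id": "A-17"}}, TOOLS, HANDLERS)
bad = dispatch({"name": "lookup_order", "input": {"include_history": "yes"}}, TOOLS, HANDLERS)
print(good)
print(bad)
```

Note what the check does not do: an order ID the model invented passes validation as long as it is a string, which is the shape-versus-truth gap in miniature.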
Failure modes and defensive parsing
Below rung 6, model output is a probable format, and the tail of that distribution is where pipelines die. The recurring villains: trailing commas (legal in every JavaScript file the model trained on, illegal in JSON), markdown fences wrapping the payload, preamble and postamble prose, hallucinated enum values that parse cleanly and fail silently, Python literals ('single quotes', True, None) from code-heavy training data, and truncation when max_tokens lands mid-string. A production parser is a pipeline, not a call:
```text
# The defensive parsing pipeline: every stage logs what it touched
extract:   take the outermost balanced {...} block      # strips fences + prose
repair:    trailing commas · smart quotes · True/None   # mechanical, logged
parse:     strict JSON.parse, no eval, ever
validate:  schema check (types, enums, ranges, required)
retry:     re-prompt with the validator's error message verbatim, ≤ 2 attempts
surface:   persistent failure is a signal, not noise; count it in your evals
```
Here are the first two stages of that pipeline as code you can run and break: extract and repair over six realistic messy replies. Watch the last case fail: syntax-only repair cannot touch Python literals, which is exactly why the pipeline's honest answer to that one is a retry, not a cleverer regex.
```python
# defensive JSON repair: strip fences, take outermost {...}, fix trailing commas, json.loads
import json
import re

import numpy as np

rng = np.random.default_rng(0)  # seeded per house style; unused here, the logic is deterministic


def repair(s):
    """Return the parsed object, or None when syntax-only repair cannot recover it."""
    s = re.sub(r"```[a-zA-Z]*", "", s).replace("```", "")  # 1. strip code fences
    a, b = s.find("{"), s.rfind("}")                       # 2. find outermost {...}
    if a < 0 or b <= a:
        return None
    s = re.sub(r",\s*([}\]])", r"\1", s[a:b + 1])          # 3. kill trailing commas
    try:
        return json.loads(s)                               # 4. strict parse, never eval
    except json.JSONDecodeError:
        return None


cases = [
    '{"label": "negative", "score": 0.91}',                                        # already clean
    '```json\n{"label": "neutral", "score": 0.5}\n```',                            # markdown fence
    'Sure! Here you go:\n{"label": "positive", "score": 0.8}\nHope that helps!',   # pre/postamble
    '{"label": "negative", "score": 0.87,}',                                       # trailing comma
    '{"label": "positive", "tags": ["a", "b",],}',                                 # nested trailing commas
    "{'label': 'negative'}",                                  # python literals: syntax repair can't help
]

ok = 0
for i, c in enumerate(cases, 1):
    r = repair(c)
    ok += r is not None
    print(f"case {i}: {'PARSED' if r is not None else 'FAILED'} -> {r}")
print(f"\nrepaired {ok}/{len(cases)}; only the Python-literal case defeats syntax-only repair -> retry")
```
Five of six recover, and the sixth fails loudly — which is the point. The single-quote / True / None dialect from code-heavy training data is genuinely ambiguous (an apostrophe inside a value will break any quote-swapping regex), so the disciplined move is to surface the failure and re-prompt with the parser's error rather than paper over it. Repair what is mechanical; escalate what is semantic.
Repair syntax mechanically; never repair semantics silently. Stripping a fence loses nothing. Mapping "slightly_negative" to "negative" changes the answer — do it only with a log line, or better, send the validator error back to the model and let it correct itself (Chapter 06 builds this into a full critique loop).
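The validate and retry stages can be sketched in the same spirit. This is a minimal illustration, not a definitive implementation: `validate` hard-codes the two-field sentiment schema from rung 4, `fake_model` is a hypothetical stand-in for an API call, and the re-prompt wording is an assumption. The extract and repair stages shown earlier would run before `json.loads` in a full pipeline.

```python
import json
from typing import Any, Callable, List, Optional, Tuple

ENUM = {"positive", "neutral", "negative"}

def validate(obj: Any) -> List[str]:
    """Schema check for the sentiment object: enum membership and numeric range."""
    if not isinstance(obj, dict):
        return ["top level must be a JSON object"]
    errors = []
    if obj.get("sentiment") not in ENUM:
        errors.append(f"sentiment must be one of {sorted(ENUM)}, got {obj.get('sentiment')!r}")
    conf = obj.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append(f"confidence must be a number in [0, 1], got {conf!r}")
    return errors

def ask_with_retry(call_model: Callable[[str], str], prompt: str,
                   max_retries: int = 2) -> Tuple[Optional[dict], List[List[str]]]:
    """Return (object or None, list of per-attempt errors) so failures can be counted."""
    message, failures = prompt, []
    for _ in range(1 + max_retries):
        raw = call_model(message)
        try:
            obj = json.loads(raw)
            errors = validate(obj)
        except json.JSONDecodeError as exc:
            obj, errors = None, [f"invalid JSON: {exc}"]
        if not errors:
            return obj, failures
        failures.append(errors)
        # Send the validator's own words back; never remap the value silently.
        message = (f"{prompt}\n\nYour previous reply was rejected:\n{raw}\n"
                   "Errors:\n" + "\n".join(errors) + "\nReturn corrected JSON only.")
    return None, failures

# Hypothetical model: a hallucinated enum value first, a corrected reply second.
replies = iter([
    '{"sentiment": "slightly_negative", "confidence": 0.9}',
    '{"sentiment": "negative", "confidence": 0.9}',
])

def fake_model(message: str) -> str:
    return next(replies)

result, failures = ask_with_retry(fake_model, "Classify the review as JSON.")
print(result)
print(failures)
```

Returning the failure log alongside the result keeps the "surface" stage honest: a reply that needed a retry still shows up in the counts.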
Constrained decoding under the hood
Rung 6 is not prompting at all — it is surgery on the sampling step you met in Vol II · Ch 08. The schema (or regex, or context-free grammar: llama.cpp's GBNF, Outlines, XGrammar, vLLM guided decoding) is compiled into an automaton over the tokenizer's vocabulary. At every step the automaton's current state \(q_s\) defines the set of tokens that keep the output grammatical; everything else is masked to \(-\infty\) before softmax:
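A reconstruction of the masked softmax this paragraph describes, written to agree with how EQ P5.3 is used in the toy code that follows (mask to \(-\infty\), then softmax over the survivors). The symbols are assumptions: \(z_v\) is the model's logit for token \(v\), and \(\mathcal{A}(q_s)\) is the set of tokens admissible in automaton state \(q_s\).

```latex
\[
\tilde{z}_v =
\begin{cases}
z_v & \text{if } v \in \mathcal{A}(q_s) \\
-\infty & \text{otherwise}
\end{cases}
\qquad\qquad
p(v \mid q_s) \;=\; \frac{\exp(\tilde{z}_v)}{\sum_{u \in \mathcal{A}(q_s)} \exp(z_u)}
\]
```

Masked tokens contribute \(\exp(-\infty) = 0\), so the legal tokens keep their relative odds and simply share the probability mass the illegal ones gave up.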
The toy below is EQ P5.3 in eight lines. A tiny vocabulary holds three legal enum members alongside near-misses the model is fond of (slightly_negative, POSITIVE) and structural debris ({, </s>). We hand the model logits that prefer the wrong tokens, then sample 200 times with and without the grammar mask. Free sampling scatters across the vocabulary; masked sampling cannot leave the enum no matter what the model wanted:
```python
# constrained decoding toy: mask logits to a grammar so only valid enum tokens survive
import numpy as np

rng = np.random.default_rng(0)
vocab = ["positive", "neutral", "negative", "slightly_negative", "POSITIVE", "maybe", "{", "</s>"]
allowed = {"positive", "neutral", "negative"}           # the grammar: enum members only
mask = np.array([1.0 if t in allowed else 0.0 for t in vocab])


def sample(logits, constrain):
    z = np.where(mask > 0, logits, -np.inf) if constrain else logits.copy()  # EQ P5.3
    p = np.exp(z - z.max())                             # softmax over survivors
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)]


logits = rng.normal(0, 2, len(vocab))                   # what the model "wants"
free = [sample(logits, False) for _ in range(200)]
grammar = [sample(logits, True) for _ in range(200)]
bad_free = sum(t not in allowed for t in free)
bad_con = sum(t not in allowed for t in grammar)
print("argmax token (what the model wanted):", vocab[int(logits.argmax())])
print(f"FREE sampling   : {bad_free:3d}/200 invalid   e.g. {sorted(set(free))[:3]}")
print(f"GRAMMAR-MASKED  : {bad_con:3d}/200 invalid   emitted set = {sorted(set(grammar))}")
print("invalid output is not unlikely under masking; it is unrepresentable")
```
The model's single most-wanted token here is { (structurally useless for an enum field), yet the masked column emits only the three legal values, 0/200 violations. That is the whole promise of rung 6: validity is a property of the decoder, not of the model's cooperation. Now read the second subtlety below in that light: the mask renormalizes the model's distribution, it does not improve it, so a model that wanted { is being dragged somewhere it assigns low joint probability.
"negative" and \(0.10\) on the legal token "neutral"; every other token is masked out by the grammar. After the renormalization in EQ P5.3, what probability does "negative" get?
Two subtleties separate the good implementations from the slow or broken ones:
- Token–grammar misalignment. The grammar is defined over characters, but the model emits tokens, and "true" might be one token or four, with thousands of vocabulary entries spanning any given character boundary. Engines precompute, for each automaton state, which of the ~100K+ tokens are admissible (Outlines' FSM indexing, XGrammar's adaptive token-mask cache), turning a per-step vocabulary scan into a lookup. Done naively, masking dominates decode latency; done well, it is near-free.
- Distribution distortion. Masking is greedy with respect to the grammar: each step keeps locally-legal tokens, but the model may be forced down a path it assigns low joint probability (valid JSON it never meant to write), with quality falling where the mask bit hardest. Measured effects on reasoning-heavy tasks are real but contested in size, and the consensus mitigation is the split this chapter keeps returning to: let the model reason unconstrained, then constrain only the final answer, either with two calls or with thinking tags followed by an enforced answer block.
So the ladder closes its loop: rung 6 guarantees syntax by reaching into the sampler, and rungs 1–5 remain the art of making the model want what the grammar permits — because a constrained decoder dragging an unwilling distribution through a schema produces exactly the hallucinated-but-well-formed fields that §5.5 warned about.
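The precomputation described under token–grammar misalignment can be illustrated with a toy. This is a sketch under loud assumptions: the grammar accepts only the strings "true" and "false", automaton states are represented as valid character prefixes, and the vocabulary is invented. It is not how Outlines or XGrammar are implemented, only the same idea at small scale: walk every token through the character-level automaton once per state, then answer each decode step with a lookup.

```python
from typing import Dict, FrozenSet, List, Optional

ACCEPTED = ["true", "false"]   # toy character-level grammar (assumption)
VOCAB = ["t", "tr", "true", "rue", "ue", "e", "f", "fal", "false", "se", "True", "x"]

# States are the valid prefixes of accepted strings; "" is the start state.
STATES = {s[:i] for s in ACCEPTED for i in range(len(s) + 1)}

def step(state: str, token: str) -> Optional[str]:
    """Advance the automaton by a whole multi-character token; None if it breaks the grammar."""
    nxt = state + token
    return nxt if nxt in STATES else None

def build_mask_table(vocab: List[str]) -> Dict[str, FrozenSet[int]]:
    """One-time cost: for every state, the ids of tokens that keep the output grammatical."""
    return {
        q: frozenset(i for i, tok in enumerate(vocab) if step(q, tok) is not None)
        for q in STATES
    }

TABLE = build_mask_table(VOCAB)

def allowed_tokens(state: str) -> List[str]:
    """Per-step cost during decoding: a dictionary lookup, not a vocabulary scan."""
    return sorted(VOCAB[i] for i in TABLE[state])

print(allowed_tokens(""))     # tokens that can begin "true" or "false"
print(allowed_tokens("tr"))   # only a token that completes "true" from "tr"
print(allowed_tokens("fal"))
```

Note that "true" is reachable as one token, as "tr" + "ue", or as "t" + "rue": the table has to admit every tokenization the characters allow, which is why the scan is over the whole vocabulary and why caching it matters.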
Structure makes output checkable — now make the model check it. Chapter 06: self-critique loops, rubric-driven revision, red-team prompts that attack your own system, and councils of models that grade each other's work.
Further reading
- Crockford, D. (2006). The application/json Media Type for JavaScript Object Notation (JSON), RFC 4627. The canonical grammar this whole chapter's parsers and repairs are defending; note it forbids trailing commas.
- JSON Schema Org. (2020). JSON Schema Draft 2020-12: Core and Validation. The specification that rung 4's prompts paste in and rung 6's decoders compile; the source of the description/enum/required keywords used throughout.
- Willard, B. T. & Louf, R. (2023). Efficient Guided Generation for Large Language Models. The Outlines paper; reframes constrained decoding as FSM indexing over the vocabulary, the mechanism behind EQ P5.3's near-free masking.
- Dong, Y. et al. (2024). XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. Adaptive token-mask caching for context-free grammars; the production answer to token–grammar misalignment.
- Geng, S. et al. (2023). Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. Shows grammar masking guarantees syntax but can distort the joint distribution, motivating §5.7's reason-then-constrain split.
- Anthropic. (2025). Claude Documentation: Tool Use, Structured Outputs, and Prompt Engineering with XML Tags. Primary source for rungs 3–6 in practice: prefilling, XML-tag conventions, strict tool use, and structured outputs.
- OpenAI. (2024). Introducing Structured Outputs in the API. Describes constrained decoding compiled from JSON Schema with guaranteed schema conformance, the commercial instantiation of rung 6.