Chain of thought: the original magic words
In 2022 two papers changed how everyone prompted. Wei et al. showed that putting worked reasoning inside few-shot examples (not just question → answer, but question → derivation → answer) lifted large models from weak to strong performance on math word problems. Months later, Kojima et al. showed you didn't even need the examples: appending “Let's think step by step” to a bare question triggered the same behavior zero-shot. The effect was strongly scale-dependent: small models produced fluent nonsense chains; large models produced chains that actually landed on answers.
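The two prompt shapes can be sketched in a few lines. This is a minimal illustration, not either paper's actual prompt: the exemplar text and the helper names `few_shot_cot_prompt` and `zero_shot_cot_prompt` are assumptions made for the example.

```python
# Two prompt shapes: few-shot CoT (worked derivations as exemplars) and
# zero-shot CoT (a bare question plus a trigger phrase).
# The exemplar and helper names are illustrative, not taken from the papers.

EXEMPLARS = [
    {
        "question": "A shelf holds 3 boxes of 4 books. 5 books are removed. How many remain?",
        "derivation": "3 boxes of 4 books is 3 * 4 = 12 books. Removing 5 leaves 12 - 5 = 7.",
        "answer": "7",
    },
]

TRIGGER = "Let's think step by step."


def few_shot_cot_prompt(question: str, exemplars=EXEMPLARS) -> str:
    """Question -> derivation -> answer for each exemplar, then the new question."""
    blocks = [
        f"Q: {ex['question']}\nA: {ex['derivation']} The answer is {ex['answer']}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question: str) -> str:
    """No exemplars: the trigger phrase alone invites a derivation."""
    return f"Q: {question}\nA: {TRIGGER}"


if __name__ == "__main__":
    q = "A train has 6 cars with 20 seats each. 17 seats are empty. How many are taken?"
    print(few_shot_cot_prompt(q))
    print("---")
    print(zero_shot_cot_prompt(q))
```

The only difference between the two is where the derivation comes from: in the few-shot form the exemplars demonstrate it, in the zero-shot form the trigger phrase asks for it.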
Why does emitting intermediate text help a fixed network? Two complementary explanations, both load-bearing.
1. Tokens are compute. A transformer performs a fixed amount of serial computation per emitted token: one pass through \(L\) layers. Whatever can't be computed in \(L\) sequential steps can't be computed in one token. Theory makes this sharp: constant-depth transformers (under realistic precision assumptions) sit in a weak circuit class and provably cannot solve certain iterative problems — multi-step arithmetic, graph reachability — in a single forward pass. But a model that writes \(T\) intermediate tokens gets \(O(T \cdot L)\) serial steps, and each written token becomes readable working memory for every later step. With a polynomially long chain, transformers can express polynomial-time computation (Merrill & Sabharwal; Feng et al., 2023). The scratchpad is not commentary — it is the computation.
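2. The chain is a latent variable. Statistically, a reasoning path \(z\) is an unobserved route from question \(q\) to answer \(a\), and the model's true answer distribution marginalizes over all of them:
\[ p(a \mid q) \;=\; \sum_{z \in \mathcal{Z}} p(a \mid z, q)\, p(z \mid q) \qquad \text{(EQ P4.1)} \]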
Honesty clause. The emitted chain is a sample from a distribution over plausible rationales, not a printout of the model's internal circuitry. Faithfulness studies (Turpin et al., 2023; Anthropic's 2025 reasoning-faithfulness evals) show models can produce clean-looking chains while their answer was actually driven by a bias planted in the prompt — and the chain never mentions it. CoT reliably buys accuracy; it only sometimes buys a true explanation. Keep the two claims separate.
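The "tokens are compute" argument can be made concrete with a toy analogy. The sketch below is not a transformer: it assumes a worker that may take at most `depth` sequential hops per pass (standing in for one token's \(L\) layers), and compares a single pass against several passes that reread a written frontier. The function names and the path-graph example are invented for illustration.

```python
# Toy analogy for "tokens are compute": a worker that may take at most
# `depth` sequential hops per pass (one "token"), with or without a scratchpad.
# This models the serial-step budget only; it is not a transformer.

def one_pass(adj, frontier, depth):
    """Expand a set of reached nodes by at most `depth` hops."""
    reached = set(frontier)
    for _ in range(depth):
        reached |= {nxt for node in reached for nxt in adj.get(node, ())}
    return reached


def reachable_single_pass(adj, start, target, depth):
    """One token: only `depth` serial steps are available."""
    return target in one_pass(adj, {start}, depth)


def reachable_with_scratchpad(adj, start, target, depth, tokens):
    """T tokens: each pass rereads the written frontier, giving up to T * depth hops."""
    scratchpad = {start}
    for _ in range(tokens):
        scratchpad = one_pass(adj, scratchpad, depth)  # written state feeds the next pass
        if target in scratchpad:
            return True
    return False


if __name__ == "__main__":
    chain = {i: [i + 1] for i in range(10)}  # path graph 0 -> 1 -> ... -> 10
    print(reachable_single_pass(chain, 0, 10, depth=3))                # False
    print(reachable_with_scratchpad(chain, 0, 10, depth=3, tokens=4))  # True
```

Node 10 sits ten hops away, so three hops in one pass cannot reach it, while four passes of three hops can. The written frontier plays the role of the scratchpad: it is the only state carried between passes.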
Decomposition patterns
“Think step by step” leaves the shape of the thinking entirely to the model. The second-generation patterns impose structure on \(z\) — and each one targets a specific failure mode of free-form chains:
| Pattern | The move | Fixes | Shines when… |
|---|---|---|---|
| Least-to-most (Zhou et al., 2022) | decompose into subquestions, solve easiest → hardest, feed each answer forward | chains that tackle the hard part first and collapse | the test problem is harder than any example — compositional generalization |
| Plan-then-solve (Wang et al., 2023) | “first devise a plan, then carry it out step by step” | diving in mid-problem and skipping steps | multi-constraint tasks where missed requirements, not bad arithmetic, kill you |
| Step-back (Zheng et al., 2023) | ask the abstraction first (“what principle governs this?”), then apply it | retrieving the wrong fact or formula and reasoning flawlessly from it | knowledge-heavy domains — physics, law, history — where the bottleneck is recall, not logic |
All three are the same bet placed differently: a chain conditioned on a good skeleton spends its probability mass in a better region of \(\mathcal{Z}\) than a chain improvising its own structure. Least-to-most reorders the work; plan-then-solve separates deciding-what-to-do from doing it; step-back inserts a retrieval step before inference.
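As single-prompt templates, the three skeletons might look like the sketch below. The instruction wording is an illustrative paraphrase rather than the prompts used in the cited papers, and the function names and the `PATTERNS` registry are assumptions of this example.

```python
# Single-prompt templates for the three patterns in the table above.
# The instruction wording is an illustrative paraphrase, not the papers' prompts.

def least_to_most(question: str) -> str:
    return (
        f"Problem: {question}\n"
        "List the subquestions needed to solve this, ordered easiest to hardest.\n"
        "Answer them in that order, reusing each answer in the next.\n"
        "End with a line starting 'Final answer:'."
    )


def plan_then_solve(question: str) -> str:
    return (
        f"Problem: {question}\n"
        "First write a short numbered plan that covers every stated requirement.\n"
        "Then carry out the plan step by step.\n"
        "End with a line starting 'Final answer:'."
    )


def step_back(question: str) -> str:
    return (
        f"Problem: {question}\n"
        "First state the general principle, rule, or formula that governs this problem.\n"
        "Then apply it to the specifics.\n"
        "End with a line starting 'Final answer:'."
    )


PATTERNS = {
    "least_to_most": least_to_most,
    "plan_then_solve": plan_then_solve,
    "step_back": step_back,
}

if __name__ == "__main__":
    q = ("A gas at 300 K is heated at constant volume until its pressure doubles. "
         "What is its final temperature?")
    for name, build in PATTERNS.items():
        print(f"== {name} ==\n{build(q)}\n")
```

Each template fixes the skeleton and leaves the content to the model, which is the bet the paragraph above describes.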
One call or many? Every pattern above runs either inside a single prompt or as a chain of calls — decompose in call one, solve subproblems in calls two through five. Splitting costs latency and plumbing but buys per-step inspection, retries, and the ability to bolt a tool or a retrieval pass between steps. The single-prompt version is the cheap prototype; the multi-call version is what ships in pipelines (Chapter 06 returns to this as context orchestration).
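A multi-call version of least-to-most could be wired roughly as follows. The `Model` callable (prompt in, completion out) is a hypothetical interface, and `scripted_model` is an offline stand-in with canned replies so the sketch runs without any API; neither corresponds to a real library.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def least_to_most_pipeline(question: str, model: Model) -> Tuple[str, List[Tuple[str, str]]]:
    """Call 1 decomposes; each later call solves one subquestion, feeding answers forward."""
    plan = model(f"List the subquestions for: {question}\nOne per line, easiest first.")
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    solved: List[Tuple[str, str]] = []
    for sub in subquestions:
        context = "\n".join(f"{q} -> {a}" for q, a in solved)
        answer = model(
            f"Original problem: {question}\nKnown so far:\n{context}\nAnswer only: {sub}"
        )
        solved.append((sub, answer.strip()))  # a per-step hook: inspect, retry, or call a tool here

    final = solved[-1][1] if solved else ""
    return final, solved


def scripted_model(prompt: str) -> str:
    """Offline stand-in for a real model, with canned replies."""
    if prompt.startswith("List the subquestions"):
        return "How many books are on the shelf at first?\nHow many remain after removing 5?"
    if "How many remain" in prompt.split("Answer only:")[-1]:
        return "7"
    return "12"


if __name__ == "__main__":
    q = "A shelf holds 3 boxes of 4 books. 5 books are removed. How many remain?"
    final, steps = least_to_most_pipeline(q, scripted_model)
    for sub, ans in steps:
        print(f"{sub} -> {ans}")
    print("final:", final)
```

The loop body is where the multi-call form pays for its latency: every intermediate answer is a value in the program that can be checked or regenerated before the next call sees it.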
Self-consistency: vote over the paths
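EQ P4.1 says the model's real answer distribution is a marginal over reasoning paths, yet greedy CoT decoding follows exactly one path and trusts wherever it lands. Self-consistency (Wang et al., 2022) does the obvious-in-hindsight thing: sample \(N\) chains at nonzero temperature, extract each final answer, and take the plurality:
\[ \hat{a} \;=\; \arg\max_{a} \sum_{i=1}^{N} \mathbb{1}[a_i = a], \qquad z_i \sim p(z \mid q) \qquad \text{(EQ P4.2)} \]
Here \(a_i\) is the final answer extracted from chain \(z_i\).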
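In code, the vote is a tally over extracted answers. The sketch below assumes a hypothetical `sample_chain` callable that returns one sampled chain per call (in practice, one nonzero-temperature completion), and it assumes chains end with a phrase like "the answer is 74"; real pipelines need an extraction rule matched to their own output format.

```python
import re
from collections import Counter
from typing import Callable, Optional, Tuple

# Assumed answer format: "... the answer is <number>". Adjust to the real output format.
ANSWER_RE = re.compile(r"answer is\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(chain: str) -> Optional[str]:
    """Return the last 'answer is <number>' in a chain, or None if there is none."""
    matches = ANSWER_RE.findall(chain)
    return matches[-1] if matches else None


def self_consistency(sample_chain: Callable[[], str], n: int) -> Tuple[Optional[str], Counter]:
    """Sample n chains, tally the extracted answers, return (plurality answer, tally)."""
    tally: Counter = Counter()
    for _ in range(n):
        answer = extract_answer(sample_chain())
        if answer is not None:          # chains with no parseable answer cast no vote
            tally[answer] += 1
    winner = tally.most_common(1)[0][0] if tally else None
    return winner, tally


if __name__ == "__main__":
    canned = iter([
        "Sold 37, doubled: the answer is 74.",
        "Sold 5, doubled: the answer is 10",
        "Baked minus left, doubled. Answer is 74",
        "I am not sure how to parse this.",
        "Doubling baked, the answer is 84. Wait, that doubles the wrong number; the answer is 74",
    ])
    winner, tally = self_consistency(lambda: next(canned), n=5)
    print(winner, dict(tally))
```

Taking the last match rather than the first lets a chain that corrects itself mid-stream vote for its final position.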
The classical intuition is Condorcet's jury theorem: if each chain were independently correct with probability \(p\) and every error landed on a single binary alternative, the majority vote (for odd \(N\)) would be right with probability \( \sum_{k=\lceil N/2 \rceil}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k} \), which races to 1 as \(N\) grows whenever \(p > 0.5\), and to 0 when \(p < 0.5\). Reality is kinder than the binary case: wrong chains rarely coordinate on one wrong answer, so the correct answer needs only a plurality, and voting can help even when \(p\) is somewhat below one half. Reality is also crueler: chains come from the same model reading the same prompt, so their errors correlate, and the independence the theorem assumes is exactly what you don't have. Both effects are visible in the instrument below.
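The binary-case formula is easy to evaluate exactly. The helper below is a direct transcription of the sum for odd \(N\); the name `majority_correct` is chosen for this sketch.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(more than half of n independent chains are right), binary alternative, odd n."""
    need = n // 2 + 1  # smallest strict majority; equals ceil(n / 2) when n is odd
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for p in (0.45, 0.55, 0.70):
        row = "  ".join(f"N={n:2d}: {majority_correct(p, n):.3f}" for n in (1, 3, 9, 21, 41))
        print(f"p={p:.2f}  {row}")
```

Running it shows both halves of the theorem: rows with \(p > 0.5\) climb with \(N\), and the row with \(p < 0.5\) falls.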
The instrument animates the curve; the code below is the curve. A barely-above-chance chain (p = 0.55) is no use alone, but voting integrates the path out — and the run prints the diminishing-returns shape EQ P4.2 promises: the climb from N = 1 to N = 9 dwarfs everything past it.
# self-consistency: majority vote over N CoT chains, each correct w.p. p = 0.55
import numpy as np

rng = np.random.default_rng(0)
p, trials = 0.55, 4000
Ns, accs = [1, 3, 5, 9, 15, 25, 41], []
for N in Ns:
    draws = (rng.random((trials, N)) < p).astype(int)  # 1 = chain hit the right answer
    correct = draws.sum(axis=1)                        # votes for the right answer
    distract = np.zeros(trials, dtype=int)             # vote count of the top distractor
    for i in range(trials):
        w = N - correct[i]                             # wrong chains in this trial
        if w:
            # wrong chains scatter uniformly over 6 distractors; keep the largest pile
            distract[i] = np.bincount(rng.integers(0, 6, w).astype(np.intc), minlength=6).max()
    # the right answer wins outright, or wins a coin flip when tied with the top distractor
    win = (correct > distract) | ((correct == distract) & (rng.random(trials) < 0.5))
    accs.append(win.mean())
    print(f"N={N:2d} vote acc = {win.mean():.3f}")
print(f"single chain {accs[0]:.3f} -> N=41 {accs[-1]:.3f} (gain +{accs[-1]-accs[0]:.3f})")
plot_xy(Ns, accs)  # plotting helper assumed to come from the page's runtime; omit when running standalone
Try p = 0.45 — below the Condorcet half — and watch the curve still rise: a single chain that is wrong more often than right can still vote its way to a correct plurality, because its errors disperse over six distractors while its correct answers all pile on one value. That is the gap between the binary jury theorem and the multi-way reality the chapter flags. Then set every wrong chain to the same distractor (delete the scatter) and the gift evaporates — correlated errors are the failure mode voting cannot fix.
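The correlated-error failure mode can be isolated in a smaller simulation. This is a sketch under a deliberately crude assumption: on a fraction `rho` of questions every chain copies one common draw (fully correlated), and on the rest the chains are independent, with a single binary alternative. The mixing parameter `rho` and the function name are inventions of this example, not a model of how real chains correlate.

```python
import numpy as np


def vote_accuracy(p, n, rho, trials=4000, seed=0):
    """Majority-vote accuracy when a fraction `rho` of questions are shared-fate.

    On a shared-fate question every chain copies one common draw (fully
    correlated); otherwise chains are independent. Binary right/wrong only.
    """
    rng = np.random.default_rng(seed)
    shared = rng.random(trials) < rho        # which questions are shared-fate
    common = rng.random(trials) < p          # the single draw all chains copy
    indep = rng.random((trials, n)) < p      # independent per-chain draws
    hits = np.where(shared[:, None], common[:, None], indep)
    return float((hits.sum(axis=1) > n / 2).mean())


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        row = "  ".join(f"N={n}: {vote_accuracy(0.6, n, rho):.3f}" for n in (1, 9, 41))
        print(f"rho={rho:.1f}  {row}")
```

At `rho = 1.0` the vote is exactly as accurate as one chain no matter how large \(N\) gets, because every ballot is a copy of the same draw; only the independent share of the questions benefits from more samples.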
What the simulator's independence assumption hides, the next instrument shows: five concrete chains for one problem, including where the wrong ones leave the road.
Pull EQ P4.1 down to arithmetic. Take the five chains above as the only draws from \(p(z \mid q)\), read off each one's answer \(a\), and the marginal \(p(a \mid q)\) is just the histogram. A single sample is wrong two times in five — yet the mode of the histogram is the right answer. That is the whole trick on one line of np.unique:
# CoT as marginalization: 5 hand-written reasoning paths, majority recovers the answer
import numpy as np

rng = np.random.default_rng(0)
# bakery problem (INSTRUMENT P4.2): true answer is 74. 3 paths land on 74, 2 derail.
paths = [("parse boxes, subtract, double", 74),
         ("fuse 'baked - left', double", 74),
         ("misread 'all but 5' as 'sold 5'", 10),   # language slip
         ("restate, double the sales", 74),
         ("double BAKED, not SOLD", 84)]            # wrong operand
answers = np.array([a for _, a in paths])
for route, a in paths:
    print(f" a = {a:>2} via {route}")
vals, counts = np.unique(answers, return_counts=True)  # the marginal p(a | q)
winner = vals[np.argmax(counts)]
print("tally: " + ", ".join(f"{v}:{c}" for v, c in zip(vals, counts)))
draws = rng.choice(answers, size=2000)                 # sample one path at a time
print(f"single random path correct: {(draws==74).mean():.0%} plurality vote: a = {winner}")
print("=> EQ P4.1 marginal concentrates on 74, though any single sample may miss")
The tally line is the chapter's thesis in miniature: 10:1, 74:3, 84:1 (np.unique returns the values in sorted order). No single path is trustworthy: sample one and you are right about 60% of the time, three paths in five, but the mode of the path-marginal is correct. Change one of the three correct paths to a new wrong value and 74 still wins on a plurality, two votes against one each; change it to match one of the existing wrong answers and the vote ties at two apiece, where np.argmax simply returns the first tied value in sorted order, which need not be the right one. Truth concentrates, scattered error doesn't, correlated error does, exactly what §4.3 argues in prose.
Where this sits in 2026. Self-consistency is the verifier-free baseline of a whole family: best-of-\(N\) with a reward model or verifier picking instead of counting, weighted votes, and tree search over partial chains (Tree-of-Thoughts) all spend parallel samples to buy accuracy. Frontier evals still report cons@64 for exactly this reason. The economics rarely change, though: gains typically flatten around \(N \approx 10\)–\(20\) while cost stays linear in \(N\), so past the knee each extra sample costs the same and buys almost nothing.
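The three selection rules in that family differ only in how they use a score. The sketch below assumes each sample arrives as an `(answer, score)` pair where the score comes from some verifier or reward model; the scores in the demo are made up, and the function names are chosen for this example.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

Sample = Tuple[str, float]  # (final answer, verifier score in [0, 1]); scores below are made up


def plurality(samples: List[Sample]) -> str:
    """Self-consistency: count answers, ignore scores."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]


def weighted_vote(samples: List[Sample]) -> str:
    """Sum verifier scores per answer and pick the heaviest answer."""
    mass = defaultdict(float)
    for answer, score in samples:
        mass[answer] += score
    return max(mass, key=mass.get)


def best_of_n(samples: List[Sample]) -> str:
    """Trust the verifier outright: return the single top-scored sample."""
    return max(samples, key=lambda s: s[1])[0]


if __name__ == "__main__":
    samples = [("74", 0.6), ("74", 0.5), ("10", 0.2), ("84", 0.95), ("74", 0.7)]
    print("plurality    :", plurality(samples))
    print("weighted vote:", weighted_vote(samples))
    print("best-of-N    :", best_of_n(samples))
```

With these invented scores the two voting rules agree on 74 while best-of-\(N\) follows the verifier to 84, which illustrates the trade: a verifier can rescue a correct minority answer, but a single overconfident score can also override a correct plurality.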
The reasoning-model plot twist
September 2024: OpenAI's o1 ships with accuracy that climbs as it is allowed to think longer. January 2025: DeepSeek-R1 publishes the recipe — reinforcement learning with verifiable rewards (Vol II · Ch 05): sample chains, grade only the final answer against a checkable target, and reinforce whatever reasoning led there. Out of pure outcome pressure, the models learned to emit long internal chains — with backtracking, self-checks, and “wait, let me reconsider” moves nobody wrote into a prompt. Chain of thought stopped being a prompting trick and became a trained behavior, usually hidden inside think-tags or a private reasoning channel.
The consequence for prompt engineers was abrupt: “think step by step” became redundant on reasoning-class models — the model was going to think anyway — and vendor guidance now explicitly advises against manual CoT instructions for them. At best you pay twice for the same behavior; at worst your hand-rolled procedure fights the reasoning policy RL actually optimized, and quality drops. What replaced the magic words is a control surface in the API, not the prompt:
| Surface | Where | Shape |
|---|---|---|
| Effort dial | OpenAI o-series / GPT-5: reasoning_effort | categorical — minimal · low · medium · high; the model budgets its own tokens per tier |
| Thinking budget | Anthropic: budget_tokens · Gemini: thinking_budget | explicit token ceiling for the thinking block; 0 ≈ off, or dynamic |
| Hybrid toggle | open models (Qwen3, R1 distills): enable_thinking, /think | template-level switch — one checkpoint serves both fast and thinking modes |
Parameter names drift across providers and versions; the shape is stable — a scalar dial trading thinking tokens for accuracy. On verifiable domains the published curves rise roughly log-linearly with thinking tokens (illustrative shape, not a law): each doubling of budget buys a similar increment, until the task's ceiling. This is serial test-time compute; self-consistency (§4.3) is parallel test-time compute. They compose — R1-style evals vote over 64 long-thinking chains — and the dial is almost always the cheaper first lever, because one chain of \(2T\) tokens shares state across its whole length while two chains of \(T\) start from scratch.
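"Roughly log-linear" means a constant gain per doubling of the thinking budget, clipped at the task's ceiling. The toy curve below makes that shape explicit; every constant in it (base accuracy, gain per doubling, ceiling) is a hypothetical number chosen for illustration, not a measurement from any model.

```python
import math


def toy_accuracy(tokens: float,
                 base_tokens: float = 1024,
                 base_acc: float = 0.40,
                 gain_per_doubling: float = 0.06,
                 ceiling: float = 0.85) -> float:
    """Illustrative log-linear curve: constant gain per doubling, clipped at a ceiling.

    All default constants are hypothetical; they only fix the shape of the curve.
    """
    doublings = math.log2(tokens / base_tokens)
    return min(ceiling, base_acc + gain_per_doubling * doublings)


if __name__ == "__main__":
    budget = 1024
    while budget <= 1024 * 2**9:
        print(f"{budget:>7d} thinking tokens -> toy accuracy {toy_accuracy(budget):.2f}")
        budget *= 2
```

Each doubling costs twice as many tokens as the last for the same increment, so the cost per point of accuracy grows geometrically even before the ceiling is reached.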
The deeper reframe: EQ P4.1's latent sum did not go away — RLVR reshaped \(p(z \mid q)\) so that high-probability paths are the productive ones, and the dial controls how far into \(\mathcal{Z}\) the model is allowed to wander before committing. You stopped steering the path and started budgeting it.
Modern guidance: when to still prompt for reasoning
“CoT prompting is dead” overshoots. What died is the incantation — content-free instructions to think. Four situations still reward explicit reasoning prompts:
| Situation | Reach for | Why it still works |
|---|---|---|
| Non-reasoning models | zero/few-shot CoT · §4.2 decomposition | Small, cheap, and most open instruct models never internalized the behavior — the 2022 results still hold for them, often worth double-digit points |
| Structured intermediate outputs | quote → analysis → answer field ordering | You want the intermediate work as an artifact: cited spans for extraction, per-criterion notes for rubric scoring. The reasoning is product, not just compute — and ordering analysis before answer forces computation before commitment (Chapter 05) |
| Audits & review | “show the derivation” + human-legible format | Reviewers, regulators, and graders need a rationale they can read — with §4.1's faithfulness caveat attached in writing, not assumed away |
| Domain checklists | content-specific checks, even on reasoning models | “Verify the units; check n = 0; reconcile the dates against the calendar” steers what gets verified. Generic “think carefully” adds nothing; specific checklists still move accuracy because they carry information the model lacks |
The operating rule: on a reasoning model, prompt the outcome, dial the effort. Describe the task, the constraints, and what a correct answer must satisfy; let the trained policy choose the path; raise effort or the thinking budget when the task is hard and verifiable. Reserve procedural reasoning instructions for models that need them or outputs where the procedure itself is the deliverable.
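The operating rule can be written as a small prompt builder. This is a sketch of one way to encode it, with a hypothetical `build_prompt` helper: the task, the success criteria, and any domain checklist always go in, while the generic reasoning trigger is added only for models that lack trained reasoning.

```python
from typing import Sequence


def build_prompt(task: str,
                 must_satisfy: Sequence[str],
                 checklist: Sequence[str] = (),
                 reasoning_model: bool = True) -> str:
    """Prompt the outcome; add a generic CoT trigger only for non-reasoning models."""
    parts = [f"Task: {task}"]
    if must_satisfy:
        parts.append("A correct answer must satisfy:\n" + "\n".join(f"- {c}" for c in must_satisfy))
    if checklist:
        # Content-specific checks carry information, so they stay on every model class.
        parts.append("Before answering, check:\n" + "\n".join(f"- {c}" for c in checklist))
    if not reasoning_model:
        parts.append("Let's think step by step.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    kwargs = dict(
        task="Compute the total shipping cost for the order below.",
        must_satisfy=["Total is in USD", "Every line item is accounted for"],
        checklist=["Verify the units", "Check the zero-quantity case"],
    )
    print(build_prompt(**kwargs, reasoning_model=True))
    print("=====")
    print(build_prompt(**kwargs, reasoning_model=False))
```

The effort dial or thinking budget is set on the API request rather than in this string, which is why it does not appear in the builder.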
Verification prompts: ask for the check, not just the answer
Generating a solution and checking one are different computations — and checking is usually the cheaper, more reliable of the two. A model that confidently mis-multiplies will often catch the error when asked to substitute the result back. The pattern is to make the check an explicit, separate demand:
# Verification skeleton — works on reasoning and non-reasoning models
solve:  produce the answer with whatever reasoning the task needs
verify: substitute the answer back into the original constraints;
        recompute the key quantity by an independent route;
        state PASS or FAIL per check, with the arithmetic shown
revise: if any check fails, fix the answer and re-run the checks
emit:   final answer + the completed check log (machine-parseable)
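When the checks can be executed rather than narrated, the skeleton becomes a short loop. In the sketch below, `propose` is a hypothetical stand-in for the model call (it takes the attempt number and returns a candidate answer), and the checks are ordinary Python predicates for the toy equation \(3x + 7 = 25\); the deliberately wrong first attempt is scripted to show the revise step.

```python
import json
import math
from typing import Callable, Dict, List, Optional, Tuple

Check = Callable[[float], bool]


def solve_verify_revise(propose: Callable[[int], float],
                        checks: Dict[str, Check],
                        max_rounds: int = 3) -> Tuple[Optional[float], List[dict]]:
    """Propose an answer, run every grounded check, retry until all pass or rounds run out."""
    log: List[dict] = []
    for attempt in range(max_rounds):
        answer = propose(attempt)
        results = {name: ("PASS" if check(answer) else "FAIL") for name, check in checks.items()}
        log.append({"attempt": attempt, "answer": answer, "checks": results})
        if all(status == "PASS" for status in results.values()):
            return answer, log
    return None, log  # no answer survived the checks


if __name__ == "__main__":
    # Toy problem: 3x + 7 = 25. The scripted proposer slips on its first attempt.
    scripted = [5.0, 6.0]
    checks = {
        "substitute_back": lambda x: math.isclose(3 * x + 7, 25),
        "independent_route": lambda x: math.isclose(x, (25 - 7) / 3),
    }
    answer, log = solve_verify_revise(lambda i: scripted[min(i, len(scripted) - 1)], checks)
    print(json.dumps({"final_answer": answer, "check_log": log}, indent=2))
```

The emitted JSON is the machine-parseable check log from the last line of the skeleton: a downstream program can refuse any answer whose log contains a FAIL without reading the reasoning at all.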
Chain-of-Verification (Dhuliawala et al., 2023) hardens this for factual claims: draft an answer, generate verification questions about it, answer those questions independently — without the draft in context — then revise. The independence is the load-bearing detail: a model shown its own draft tends to confirm it; a fresh context answers the sub-questions on their merits.
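The four stages and the independence constraint can be sketched as a pipeline. The `Model` callable is a hypothetical prompt-in, completion-out interface, the prompt wording is illustrative rather than the paper's, and `make_scripted_model` is an offline stand-in that records every prompt so the independence property can be inspected.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def chain_of_verification(question: str, model: Model) -> Tuple[str, List[Tuple[str, str]]]:
    """Draft, plan verification questions, answer them without the draft, then revise."""
    draft = model(f"Answer the question: {question}")
    plan = model(f"List verification questions, one per line.\nQuestion: {question}\nDraft: {draft}")
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Independence: each check is sent on its own, so the draft cannot bias its answer.
    findings = [(check, model(check)) for check in checks]
    evidence = "\n".join(f"{q} -> {a}" for q, a in findings)
    final = model(
        f"Revise the draft to agree with the verified facts.\nDraft: {draft}\nVerified facts:\n{evidence}"
    )
    return final, findings


def make_scripted_model(calls: List[str]) -> Model:
    """Offline stand-in that records every prompt it receives."""
    def model(prompt: str) -> str:
        calls.append(prompt)
        if prompt.startswith("Answer the question:"):
            return "The Eiffel Tower was completed in 1887."  # deliberately wrong draft
        if prompt.startswith("List verification questions"):
            return "In what year was the Eiffel Tower completed?"
        if prompt.startswith("Revise the draft"):
            return "The Eiffel Tower was completed in 1889."
        return "1889"  # answer to the standalone verification question
    return model


if __name__ == "__main__":
    calls: List[str] = []
    final, findings = chain_of_verification("When was the Eiffel Tower completed?", make_scripted_model(calls))
    print(final)
    print("draft leaked into verification call:", "1887" in calls[2])
```

The design choice that matters is which variables reach which prompt: `draft` appears in the planning and revision prompts but never in the verification prompts, which is the fresh-context property the paragraph above calls load-bearing.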
Self-correction without an external signal is weak. Asking a model to “review your answer and fix any mistakes,” with no new information, frequently flips correct answers to wrong ones (Huang et al., 2024). Verification prompts earn their keep when the check is grounded: substitute-back arithmetic the model can actually compute, code that gets executed, a schema that gets validated, a retrieval pass that confronts the claim with a source. The best verification prompt produces a machine-checkable artifact — which is precisely where this volume goes next.
A verified answer still has to arrive in a shape software can consume. Chapter 05: structured output — schemas and constrained decoding, why field order is a reasoning control in disguise, and how to design outputs that feed tools without a parsing layer of duct tape.
Further reading
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. — the few-shot CoT result that opened §4.1; shows worked rationales in exemplars unlock multi-step reasoning at scale.
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. — the “Let's think step by step” paper; the magic words themselves, with the scale dependence §4.1 stresses.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. — the source of EQ P4.2 and both instruments: sample many chains, vote on the answer.
- Feng, G. et al. (2023). Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. — formalizes “tokens are compute”: bounded-depth transformers gain expressivity with a long enough scratchpad.
- Turpin, M. et al. (2023). Language Models Don't Always Say What They Think. — the faithfulness caveat in §4.1: a clean chain can rationalize an answer actually driven by an unmentioned prompt bias.
- Dhuliawala, S. et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. — the §4.6 verification pattern; the load-bearing trick is answering check questions in a fresh context.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. — the §4.4 plot twist: outcome-graded RL trains long internal chains, turning CoT from a prompt into a policy.