04 · Reasoning Controls: CoT to Effort Dials

4.1

Chain of thought: the original magic words

In 2022 two papers changed how everyone prompted. Wei et al. showed that putting worked reasoning inside few-shot examples — not just question → answer, but question → derivation → answer — lifted large models from near-chance to strong performance on math word problems. Months later, Kojima et al. showed you didn't even need the examples: appending “Let's think step by step” to a bare question triggered the same behavior zero-shot. The effect was strongly scale-dependent — small models produced fluent nonsense chains; large models produced chains that actually landed on answers.

Why does emitting intermediate text help a fixed network? Two complementary explanations, both load-bearing.

1. Tokens are compute. A transformer performs a fixed amount of serial computation per emitted token: one pass through $L$ layers. Whatever can't be computed in $L$ sequential steps can't be computed in one token. Theory makes this sharp: constant-depth transformers (under realistic precision assumptions) sit in a weak circuit class and provably cannot solve certain iterative problems — multi-step arithmetic, graph reachability — in a single forward pass. But a model that writes $T$ intermediate tokens gets $O(T \cdot L)$ serial steps, and each written token becomes readable working memory for every later step. With a polynomially long chain, transformers can express polynomial-time computation (Merrill & Sabharwal; Feng et al., 2023). The scratchpad is not commentary — it is the computation.

2. The chain is a latent variable. Statistically, a reasoning path $z$ is an unobserved route from question to answer, and the model's true answer distribution marginalizes over all of them:

EQ P4.1 — REASONING AS A LATENT VARIABLE

$$ p(a \mid q) \;=\; \sum_{z \,\in\, \mathcal{Z}} p(z \mid q)\; p(a \mid q, z) $$

Here $q$ is the question, $a$ the final answer, and $z$ a reasoning path: a token sequence the model may emit between them. Direct answering forces the model to compress this entire sum into one forward pass. CoT prompting instead samples one $z$ explicitly and conditions on it: regions of $\mathcal{Z}$ where $p(a \mid q, z)$ is sharp and accurate get visited rather than averaged away. Any single sampled chain is one draw from $p(z \mid q)$, the distribution over paths given the question, which is exactly the loose thread §4.3 pulls: why settle for one draw?
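EQ P4.1 can be checked by hand on a toy table. The sketch below assumes three hypothetical reasoning paths and two candidate answers; every probability in it is made up for illustration and does not come from any real model. The marginal is then a single vector-matrix product.

```python
import numpy as np

# Hypothetical toy numbers, chosen only to illustrate EQ P4.1.
# Three reasoning paths z0..z2, two candidate answers: a0 (right) and a1 (wrong).
p_z_given_q = np.array([0.5, 0.3, 0.2])       # p(z | q), sums to 1
p_a_given_qz = np.array([[0.9, 0.1],          # p(a | q, z0): a sound path
                         [0.8, 0.2],          # p(a | q, z1): another sound path
                         [0.1, 0.9]])         # p(a | q, z2): a path that derails

# EQ P4.1: p(a | q) = sum over z of p(z | q) * p(a | q, z)
p_a_given_q = p_z_given_q @ p_a_given_qz

print("p(a | q) =", p_a_given_q.round(3))     # [0.71 0.29]
```

With these numbers the marginal puts 0.71 on the right answer, even though one draw in five lands on the derailing path and then answers wrongly nine times in ten.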

Honesty clause. The emitted chain is a sample from a distribution over plausible rationales, not a printout of the model's internal circuitry. Faithfulness studies (Turpin et al., 2023; Anthropic's 2025 reasoning-faithfulness evals) show models can produce clean-looking chains while their answer was actually driven by a bias planted in the prompt — and the chain never mentions it. CoT reliably buys accuracy; it only sometimes buys a true explanation. Keep the two claims separate.

4.2

Decomposition patterns

“Think step by step” leaves the shape of the thinking entirely to the model. The second-generation patterns impose structure on $z$ — and each one targets a specific failure mode of free-form chains:

Pattern	The move	Fixes	Shines when…
Least-to-most (Zhou et al., 2022)	decompose into subquestions, solve easiest → hardest, feed each answer forward	chains that tackle the hard part first and collapse	the test problem is harder than any example — compositional generalization
Plan-then-solve (Wang et al., 2023)	“first devise a plan, then carry it out step by step”	diving in mid-problem and skipping steps	multi-constraint tasks where missed requirements, not bad arithmetic, kill you
Step-back (Zheng et al., 2023)	ask the abstraction first (“what principle governs this?”), then apply it	retrieving the wrong fact or formula and reasoning flawlessly from it	knowledge-heavy domains — physics, law, history — where the bottleneck is recall, not logic

All three are the same bet placed differently: a chain conditioned on a good skeleton spends its probability mass in a better region of $\mathcal{Z}$ than a chain improvising its own structure. Least-to-most reorders the work; plan-then-solve separates deciding-what-to-do from doing it; step-back inserts a retrieval step before inference.

One call or many? Every pattern above runs either inside a single prompt or as a chain of calls — decompose in call one, solve subproblems in calls two through five. Splitting costs latency and plumbing but buys per-step inspection, retries, and the ability to bolt a tool or a retrieval pass between steps. The single-prompt version is the cheap prototype; the multi-call version is what ships in pipelines (Chapter 06 returns to this as context orchestration).
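A minimal sketch of the multi-call shape, using least-to-most as the example. The `decompose` and `solve` callables stand in for model calls and are hypothetical; here they are hard-coded fakes for a toy problem (7 boxes of 6 muffins baked, all but 5 sold, the next day's batch is twice the number sold) so the control flow can run without any API.

```python
from typing import Callable, List

def least_to_most(question: str,
                  decompose: Callable[[str], List[str]],
                  solve: Callable[[str, List[str]], str]) -> List[str]:
    """Call 1 decomposes; each later call solves one subquestion
    and sees the answers to all earlier subquestions."""
    subquestions = decompose(question)         # expected order: easiest first
    answers: List[str] = []
    for sub in subquestions:
        answers.append(solve(sub, answers))    # feed earlier answers forward
    return answers

# Hard-coded stand-ins for model calls (hypothetical, for illustration only).
def fake_decompose(question: str) -> List[str]:
    return ["How many muffins are baked on day one?",
            "How many muffins are sold on day one?",
            "How many muffins are baked on day two?"]

def fake_solve(sub: str, earlier: List[str]) -> str:
    if not earlier:
        return str(7 * 6)                      # 7 boxes of 6
    if len(earlier) == 1:
        return str(int(earlier[0]) - 5)        # all but 5 are sold
    return str(2 * int(earlier[1]))            # twice the number sold

steps = least_to_most("toy bakery problem", fake_decompose, fake_solve)
print(steps)                                   # ['42', '37', '74']
```

Because each subanswer is a separate return value, a real pipeline could inspect, retry, or tool-check any step before the next call runs, which is the benefit the paragraph above attributes to splitting.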

4.3

Self-consistency: vote over the paths

EQ P4.1 says the model's real answer distribution is a marginal over reasoning paths, yet greedy CoT decoding follows exactly one path and trusts wherever it lands. Self-consistency (Wang et al., 2022) does the obvious-in-hindsight thing: sample $N$ chains at nonzero temperature, extract each final answer, and take the plurality:

EQ P4.2 — MAJORITY VOTE OVER SAMPLED CHAINS

$$ \hat{a}_{\mathrm{SC}} \;=\; \arg\max_{a} \sum_{i=1}^{N} \mathbb{1}\!\left[\, a_i = a \,\right], \qquad z_i \sim p(z \mid q), \quad a_i \sim p(a \mid q, z_i) $$

A Monte-Carlo estimate of $\arg\max_a p(a \mid q)$ from EQ P4.1: paths that derail scatter their answers across many wrong values, while correct paths — however different their routes — agree on the same $a$. Voting integrates out the path. It only works where answers are short and extractable (a number, a choice, a name); free-form prose has no vote to count.
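The extract-then-count step of EQ P4.2 can be sketched in a few lines. The chains below are hand-written stand-ins for sampled model outputs, and "take the last number in the text" is a simplifying assumption about the answer format, not a general-purpose extractor.

```python
import re
from collections import Counter
from typing import List, Optional

def extract_answer(chain: str) -> Optional[str]:
    """Return the last number in a chain, or None if it contains no number.
    Assumes the final answer is the last number written."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def self_consistency(chains: List[str]) -> Optional[str]:
    """EQ P4.2: plurality vote over the extracted final answers."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Hand-written example chains (hypothetical model outputs).
chains = [
    "7 boxes of 6 is 42. 42 - 5 = 37 sold. Twice that is 74.",
    "Sold 42 - 5 = 37, so Tuesday is 2 * 37 = 74.",
    "It sold 5 muffins, so Tuesday is 2 * 5 = 10.",
    "Baked 42 on Monday, so Tuesday is 2 * 42 = 84.",
    "37 sold on Monday; doubling gives 74.",
]
print(self_consistency(chains))   # 74
```

Chains with no extractable answer are dropped before counting, which is one concrete form of the "short and extractable" requirement: if extraction fails often, the effective $N$ shrinks.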

The classical intuition is Condorcet's jury theorem: if each chain is independently correct with probability $p$ and errors were a single binary alternative, the vote would be right with probability $ \sum_{k=\lfloor N/2 \rfloor + 1}^{N} \binom{N}{k}\, p^{k} (1-p)^{N-k} $ (the chance that a strict majority of chains is correct), which races to 1 as $N$ grows whenever $p > 0.5$, and to 0 when $p < 0.5$. Reality is kinder than the binary case: wrong chains rarely coordinate on one wrong answer, so the correct answer needs only a plurality, and voting can help even when $p$ is somewhat below one half. Reality is also crueler: chains come from the same model reading the same prompt, so their errors correlate, and the independence the theorem assumes is exactly what you don't have. Both effects are visible in the instrument below.

Three reasoning chains each land on the correct answer independently with probability $p = 0.7$. Treat errors as binary, as in Condorcet's setup: every wrong chain lands on the same wrong answer, so the correct answer needs a strict majority. Using the Condorcet sum, what is the probability the $3$-chain majority vote is correct?

$P = \binom{3}{2}p^2(1-p) + \binom{3}{3}p^3 = 3(0.7)^2(0.3) + (0.7)^3 = 3(0.49)(0.3) + 0.343 = 0.441 + 0.343 = 0.784 \approx$ 0.78. Voting lifts a single 70% chain to ~78% — the variance-reduction EQ P4.2 buys, here in its cleanest binary form.
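The Condorcet sum is easy to evaluate directly. The sketch below is a plain binomial tail under the theorem's own assumptions (independent chains, binary errors, strict majority required); the function name is illustrative.

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """Probability that strictly more than half of n independent chains are correct."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 9, 25):
    print(f"N={n:2d}   p=0.7: {majority_correct(n, 0.7):.3f}   "
          f"p=0.4: {majority_correct(n, 0.4):.3f}")
```

For $N = 3$ it reproduces the hand calculation: 0.784 at $p = 0.7$, and 0.352 at $p = 0.4$, where the binary vote is already worse than a single chain.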

INSTRUMENT P4.1 — SELF-CONSISTENCY SIM · SEEDED MONTE CARLO · 2,500 TRIALS PER POINT · EQ P4.2

[Interactive instrument: controls for per-chain accuracy p (shown at 0.70) and sampled chains N (shown at 9); readouts for vote accuracy at N, Δ vs. single chain, and output tokens (≈180 per chain).]

Each chain is correct with probability p; wrong chains scatter over six distractor answers (weighted — some wrong answers are more attractive than others); the plurality wins, ties broken at random. Watch three things: the steep early climb (most of the gain arrives by N ≈ 5–9), the saw-tooth (even N invites ties — papers sample odd N for a reason), and p = 0.40: below the Condorcet threshold, yet voting still helps, because errors disperse while truth concentrates. The token readout is the bill — accuracy saturates, cost stays linear.

The instrument animates the curve; the code below is the curve. A barely-above-chance chain (p = 0.55) is no use alone, but voting integrates the path out — and the run prints the diminishing-returns shape EQ P4.2 promises: the climb from N = 1 to N = 9 dwarfs everything past it.

PYTHON · RUNNABLE IN-BROWSER

# self-consistency: plurality vote over N CoT chains, each correct w.p. p = 0.55
import numpy as np
rng = np.random.default_rng(0)
p, trials = 0.55, 4000
Ns, accs = [1, 3, 5, 9, 15, 25, 41], []
for N in Ns:
    draws = (rng.random((trials, N)) < p).astype(int)   # 1 = chain hit the right answer
    correct = draws.sum(axis=1)                         # votes for the right answer, per trial
    distract = np.zeros(trials, dtype=int)              # size of the largest wrong-answer bloc
    for i in range(trials):
        w = N - correct[i]                              # number of wrong chains in this trial
        if w:
            # wrong chains scatter uniformly over 6 distractors; keep the biggest bloc
            distract[i] = np.bincount(rng.integers(0, 6, w).astype(np.intc), minlength=6).max()
    # the right answer wins outright, or on a coin flip when tied with the top distractor
    win = (correct > distract) | ((correct == distract) & (rng.random(trials) < 0.5))
    accs.append(win.mean())
    print(f"N={N:2d}   vote acc = {win.mean():.3f}")
print(f"single chain {accs[0]:.3f}  ->  N=41 {accs[-1]:.3f}   (gain +{accs[-1]-accs[0]:.3f})")
plot_xy(Ns, accs)   # plot_xy is assumed to be supplied by the in-browser runtime, not NumPy


Try p = 0.45 — below the Condorcet half — and watch the curve still rise: a single chain that is wrong more often than right can still vote its way to a correct plurality, because its errors disperse over six distractors while its correct answers all pile on one value. That is the gap between the binary jury theorem and the multi-way reality the chapter flags. Then set every wrong chain to the same distractor (delete the scatter) and the gift evaporates — correlated errors are the failure mode voting cannot fix.

What the simulator's independence assumption hides, the next instrument shows: five concrete chains for one problem, including where the wrong ones leave the road.

INSTRUMENT P4.2 — PATHS VISUALIZER · 5 SAMPLED CHAINS · ONE WORD PROBLEM

PROBLEM q

A bakery packs muffins six to a box. On Monday it bakes 7 boxes and sells all but 5 muffins. On Tuesday it bakes twice as many muffins as it sold on Monday. How many muffins does it bake on Tuesday?

VOTE TALLY — EQ P4.2 AT N = 5

Three chains reach 74 by different routes — distinct z, same a, the agreement EQ P4.2 counts on. The two failures derail at marked steps and scatter (10 and 84), so 74 wins 3–1–1. The fragility is visible too: had both wrong chains made the same misreading — a correlated error — the vote would stand 3–2 and one more bad sample flips it. Voting fixes scattered errors, not shared ones.
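The three answers in the tally can be reproduced with plain arithmetic. The sketch below computes the correct route and the two derailments; the variable names are illustrative, and the two slips modeled are reading "all but 5" as "sold 5" and doubling the number baked instead of the number sold.

```python
BOX_SIZE, BOXES, UNSOLD = 6, 7, 5

baked_monday = BOXES * BOX_SIZE          # 7 boxes of 6 -> 42
sold_monday = baked_monday - UNSOLD      # "all but 5"  -> 37
correct = 2 * sold_monday                # twice Monday's sales -> 74

misread_sold = 2 * UNSOLD                # reads "all but 5" as "sold 5" -> 10
wrong_operand = 2 * baked_monday         # doubles baked instead of sold -> 84

print(correct, misread_sold, wrong_operand)   # 74 10 84
```

Each wrong value comes from a different single slip, which is why the two failures land on different numbers and cannot outvote 74.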

Pull EQ P4.1 down to arithmetic. Take the five chains above as the only draws from $p(z \mid q)$, read off each one's answer $a$, and the marginal $p(a \mid q)$ is just the histogram. A single sample is wrong two times in five — yet the mode of the histogram is the right answer. That is the whole trick on one line of np.unique:

PYTHON · RUNNABLE IN-BROWSER

# CoT as marginalization: 5 hand-written reasoning paths, majority recovers the answer
import numpy as np
rng = np.random.default_rng(0)
# bakery problem (INSTRUMENT P4.2): true answer is 74. 3 paths land on 74, 2 derail.
paths = [("parse boxes, subtract, double", 74),
         ("fuse 'baked - left', double", 74),
         ("misread 'all but 5' as 'sold 5'", 10),   # language slip
         ("restate, double the sales", 74),
         ("double BAKED, not SOLD", 84)]            # wrong operand
answers = np.array([a for _, a in paths])
for route, a in paths:
    print(f"  a = {a:>2}   via {route}")
vals, counts = np.unique(answers, return_counts=True)   # the marginal p(a | q)
winner = vals[np.argmax(counts)]
print("tally: " + ", ".join(f"{v}:{c}" for v, c in zip(vals, counts)))
draws = rng.choice(answers, size=2000)                  # sample one path at a time
print(f"single random path correct: {(draws==74).mean():.0%}   plurality vote: a = {winner}")
print("=> EQ P4.1 marginal concentrates on 74, though any single sample may miss")


The output is the chapter's thesis in five lines, with the tally reading 10:1, 74:3, 84:1. No single path is trustworthy (sample one and you are right about 60% of the time, three paths in five), but the mode of the path-marginal is correct. Flip one of the three correct paths to a wrong-but-distinct value and 74 still wins on a plurality; flip it to match one of the existing wrong answers and the vote ties. Truth concentrates, scattered error doesn't, correlated error does: exactly what §4.3 argues in prose.

Five sampled chains return these final answers: $74,\ 84,\ 74,\ 10,\ 74$. Under self-consistency (EQ P4.2), what answer does the plurality vote select?

Tally: $74$ appears 3 times, $84$ once, $10$ once. The plurality is 74 (3 of 5). Two chains derailed to different wrong values, so their errors scattered and the correct answer still won — the scattered-error gift voting depends on.

Where this sits in 2026. Self-consistency is the verifier-free baseline of a whole family: best-of-$N$ with a reward model or verifier picking instead of counting, weighted votes, and tree search over partial chains (Tree-of-Thoughts) all spend parallel samples to buy accuracy. Frontier evals still report cons@64 for exactly this reason. The economics never change, though: gains saturate around $N \approx 10$–$20$ while cost stays linear in $N$ — past the knee you are buying noise.
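The family members differ only in how they aggregate the same $N$ samples. The sketch below contrasts plain plurality, a score-weighted vote, and best-of-$N$; the verifier scores are invented numbers standing in for a reward model's output, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, float]   # (extracted answer, hypothetical verifier score in [0, 1])

def plurality(samples: List[Sample]) -> str:
    """Self-consistency: every sample counts as one vote."""
    counts: Dict[str, float] = defaultdict(float)
    for answer, _ in samples:
        counts[answer] += 1.0
    return max(counts, key=counts.get)

def weighted_vote(samples: List[Sample]) -> str:
    """Each sample votes with its verifier score instead of 1."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

def best_of_n(samples: List[Sample]) -> str:
    """Ignore counts; return the single highest-scored sample's answer."""
    return max(samples, key=lambda s: s[1])[0]

# Made-up case where two low-scored chains share the same wrong answer.
samples = [("10", 0.3), ("10", 0.2), ("74", 0.95)]
print(plurality(samples), weighted_vote(samples), best_of_n(samples))   # 10 74 74
```

In this constructed case counting alone picks the shared wrong answer, while either use of the scores recovers 74. That advantage holds only as far as the scores are trustworthy, which is why the verifier-free vote remains the baseline.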

You run self-consistency with $N = 9$ sampled chains, each emitting about $180$ output tokens. Roughly how many output tokens does the vote cost in total?

$N \times 180 = 9 \times 180 =$ 1620 output tokens. Accuracy saturates around $N \approx 10\text{–}20$ but the bill stays strictly linear in $N$ — which is why the token readout climbs forever while the accuracy curve flattens.

4.4

The reasoning-model plot twist

September 2024: OpenAI's o1 ships with accuracy that climbs as it is allowed to think longer. January 2025: DeepSeek-R1 publishes the recipe — reinforcement learning with verifiable rewards (Vol II · Ch 05): sample chains, grade only the final answer against a checkable target, and reinforce whatever reasoning led there. Out of pure outcome pressure, the models learned to emit long internal chains — with backtracking, self-checks, and “wait, let me reconsider” moves nobody wrote into a prompt. Chain of thought stopped being a prompting trick and became a trained behavior, usually hidden inside think-tags or a private reasoning channel.

The consequence for prompt engineers was abrupt: “think step by step” became redundant on reasoning-class models — the model was going to think anyway — and vendor guidance now explicitly advises against manual CoT instructions for them. At best you pay twice for the same behavior; at worst your hand-rolled procedure fights the reasoning policy RL actually optimized, and quality drops. What replaced the magic words is a control surface in the API, not the prompt:

Surface	Where	Shape
Effort dial	OpenAI o-series / GPT-5: reasoning_effort	categorical — minimal · low · medium · high; the model budgets its own tokens per tier
Thinking budget	Anthropic: budget_tokens · Gemini: thinking_budget	explicit token ceiling for the thinking block; 0 ≈ off, or dynamic
Hybrid toggle	open models (Qwen3, R1 distills): enable_thinking, /think	template-level switch — one checkpoint serves both fast and thinking modes

Parameter names drift across providers and versions; the shape is stable — a scalar dial trading thinking tokens for accuracy. On verifiable domains the published curves rise roughly log-linearly with thinking tokens (illustrative shape, not a law): each doubling of budget buys a similar increment, until the task's ceiling. This is serial test-time compute; self-consistency (§4.3) is parallel test-time compute. They compose — R1-style evals vote over 64 long-thinking chains — and the dial is almost always the cheaper first lever, because one chain of $2T$ tokens shares state across its whole length while two chains of $T$ start from scratch.
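One way to keep application code insulated from that drift is a small adapter that maps a single effort tier to each provider's request fields. The sketch below only builds dictionaries and calls no API. The field names are the ones in the table above; the nesting of those fields and the token budgets per tier are assumptions to check against current provider documentation.

```python
from typing import Any, Dict

# Hypothetical mapping from a coarse tier to a thinking-token ceiling.
BUDGET_BY_TIER = {"low": 1024, "medium": 4096, "high": 16384}

def reasoning_kwargs(provider: str, tier: str) -> Dict[str, Any]:
    """Translate one effort tier into provider-shaped request fields.
    The nesting shown is an assumption, not a verified API contract."""
    if tier not in BUDGET_BY_TIER:
        raise ValueError(f"unknown tier: {tier}")
    budget = BUDGET_BY_TIER[tier]
    if provider == "openai":        # categorical effort dial
        return {"reasoning_effort": tier}
    if provider == "anthropic":     # explicit thinking-token budget
        return {"thinking": {"type": "enabled", "budget_tokens": budget}}
    if provider == "gemini":        # explicit thinking-token budget
        return {"thinking_config": {"thinking_budget": budget}}
    if provider == "qwen3":         # template-level on/off switch, no budget
        return {"enable_thinking": True}
    raise ValueError(f"unknown provider: {provider}")

for name in ("openai", "anthropic", "gemini", "qwen3"):
    print(name, reasoning_kwargs(name, "high"))
```

The adapter makes the table's point in code: three different shapes, one underlying scalar.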

The deeper reframe: EQ P4.1's latent sum did not go away — RLVR reshaped $p(z \mid q)$ so that high-probability paths are the productive ones, and the dial controls how far into $\mathcal{Z}$ the model is allowed to wander before committing. You stopped steering the path and started budgeting it.

4.5

Modern guidance: when to still prompt for reasoning

“CoT prompting is dead” overshoots. What died is the incantation — content-free instructions to think. Four situations still reward explicit reasoning prompts:

Situation	Reach for	Why it still works
Non-reasoning models	zero/few-shot CoT · §4.2 decomposition	Small, cheap, and most open instruct models never internalized the behavior — the 2022 results still hold for them, often worth double-digit points
Structured intermediate outputs	quote → analysis → answer field ordering	You want the intermediate work as an artifact: cited spans for extraction, per-criterion notes for rubric scoring. The reasoning is product, not just compute — and ordering analysis before answer forces computation before commitment (Chapter 05)
Audits & review	“show the derivation” + human-legible format	Reviewers, regulators, and graders need a rationale they can read — with §4.1's faithfulness caveat attached in writing, not assumed away
Domain checklists	content-specific checks, even on reasoning models	“Verify the units; check n = 0; reconcile the dates against the calendar” steers what gets verified. Generic “think carefully” adds nothing; specific checklists still move accuracy because they carry information the model lacks

The operating rule: on a reasoning model, prompt the outcome, dial the effort. Describe the task, the constraints, and what a correct answer must satisfy; let the trained policy choose the path; raise effort or the thinking budget when the task is hard and verifiable. Reserve procedural reasoning instructions for models that need them or outputs where the procedure itself is the deliverable.
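The operating rule can be sketched as a prompt builder. The function below is illustrative: it always states the task and the success criteria, and appends a zero-shot CoT cue only when the target is flagged as a non-reasoning model. The `reasoning_model` flag is a hypothetical input that the caller is assumed to know.

```python
from typing import List

COT_SUFFIX = "Let's think step by step."

def build_prompt(task: str, constraints: List[str], reasoning_model: bool) -> str:
    """Prompt the outcome for every model; add a procedural cue
    only for models that have not been trained to reason on their own."""
    lines = [task, "", "A correct answer must satisfy:"]
    lines += [f"- {c}" for c in constraints]
    if not reasoning_model:
        lines += ["", COT_SUFFIX]
    return "\n".join(lines)

task = "Compute the total shipping cost for the order below."
constraints = ["amounts are in USD", "the result is rounded to two decimals"]

print(build_prompt(task, constraints, reasoning_model=True))
print("---")
print(build_prompt(task, constraints, reasoning_model=False))
```

On a reasoning model the effort or thinking budget would be set in the request parameters rather than in this text.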

4.6

Verification prompts: ask for the check, not just the answer

Generating a solution and checking one are different computations — and checking is usually the cheaper, more reliable of the two. A model that confidently mis-multiplies will often catch the error when asked to substitute the result back. The pattern is to make the check an explicit, separate demand:

# Verification skeleton — works on reasoning and non-reasoning models
solve:    produce the answer with whatever reasoning the task needs
verify:   substitute the answer back into the original constraints;
          recompute the key quantity by an independent route;
          state PASS or FAIL per check, with the arithmetic shown
revise:   if any check fails, fix the answer and re-run the checks
emit:     final answer + the completed check log (machine-parseable)
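The skeleton above can be sketched as a loop around grounded checks. In this toy version the "model" is a fixed list of candidate answers (a first attempt and a revision, both hypothetical), and the task is to find $x$ with $3x + 7 = 25$, so both checks are arithmetic the program can actually run.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Check = Tuple[str, Callable[[int], bool]]

def solve_verify_revise(candidates: Iterable[int],
                        checks: List[Check]) -> Tuple[Optional[int], List[str]]:
    """Try candidates in order; return the first one that passes every check,
    plus a PASS/FAIL log line per check per attempt."""
    log: List[str] = []
    for attempt, answer in enumerate(candidates, start=1):
        results = [(name, check(answer)) for name, check in checks]
        for name, ok in results:
            log.append(f"attempt {attempt} answer={answer} {name}: "
                       f"{'PASS' if ok else 'FAIL'}")
        if all(ok for _, ok in results):
            return answer, log
    return None, log            # no candidate survived the checks

checks: List[Check] = [
    ("substitute back", lambda x: 3 * x + 7 == 25),        # plug into the constraint
    ("independent route", lambda x: (25 - 7) / 3 == x),    # recompute by inversion
]

answer, log = solve_verify_revise([5, 6], checks)   # 5 = first attempt, 6 = revision
print(answer)                                       # 6
print("\n".join(log))
```

The log is plain text with one line per check, so a downstream program can parse it and reject any answer whose log contains a FAIL.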

Chain-of-Verification (Dhuliawala et al., 2023) hardens this for factual claims: draft an answer, generate verification questions about it, answer those questions independently — without the draft in context — then revise. The independence is the load-bearing detail: a model shown its own draft tends to confirm it; a fresh context answers the sub-questions on their merits.
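A minimal sketch of that four-step flow, assuming a hypothetical `ask` callable that wraps one model call (prompt in, text out). The prompts are illustrative wording rather than the paper's templates. What the sketch makes concrete is the independence detail: the verification questions are sent on their own, with no draft in the prompt.

```python
from typing import Callable, List

Ask = Callable[[str], str]   # hypothetical interface: one model call

def chain_of_verification(question: str, ask: Ask) -> str:
    draft = ask(f"Answer the question.\nQ: {question}")
    plan = ask("List verification questions, one per line, for this answer.\n"
               f"Q: {question}\nDraft: {draft}")
    checks = [q.strip() for q in plan.splitlines() if q.strip()]
    # Independence: each verification question is asked WITHOUT the draft in context.
    findings = [f"{q} -> {ask(q)}" for q in checks]
    return ask("Revise the draft using the findings.\n"
               f"Q: {question}\nDraft: {draft}\nFindings:\n" + "\n".join(findings))

# Recording fake in place of a real model, so the flow can run offline.
calls: List[str] = []

def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("Answer the question"):
        return "DRAFT-ANSWER"
    if prompt.startswith("List verification"):
        return "Check fact A?\nCheck fact B?"
    if prompt.startswith("Revise"):
        return "REVISED-ANSWER"
    return "finding"

result = chain_of_verification("toy question", fake_ask)
print(result)        # REVISED-ANSWER
print(len(calls))    # 5 calls: draft, plan, two checks, revision
```

In a real deployment each verification call would also start from a fresh conversation, so that no earlier turn leaks the draft back in.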

CAVEAT

Self-correction without an external signal is weak. Asking a model to “review your answer and fix any mistakes,” with no new information, frequently flips correct answers to wrong ones (Huang et al., 2024). Verification prompts earn their keep when the check is grounded: substitute-back arithmetic the model can actually compute, code that gets executed, a schema that gets validated, a retrieval pass that confronts the claim with a source. The best verification prompt produces a machine-checkable artifact — which is precisely where this volume goes next.

A verified answer still has to arrive in a shape software can consume. Chapter 05: structured output — schemas and constrained decoding, why field order is a reasoning control in disguise, and how to design outputs that feed tools without a parsing layer of duct tape.

§