Inference-Time Compute Scaling

12.1

Why spending compute at inference raises accuracy

Everything earlier in this volume spent its compute before deployment: pre-training (Chapter 04), alignment (Chapter 05), the one forward pass per token at serving time (Chapter 08). Inference-time scaling adds a second budget. Instead of accepting the model's first greedy answer, you let it run longer or more times and then pick — or build — a better answer from what it produced. The accuracy gains are real and, on reasoning benchmarks, often larger per FLOP than the same FLOPs spent enlarging the model.

The reason it works at all is that a language model is a distribution, not a function. Sample it many times on a hard problem and you get a spread of answers. Two facts about that spread make extra compute pay off:

Coverage rises with samples. Even if any single sample is often wrong, the probability that at least one of $N$ samples is correct climbs fast. This is the ceiling that best-of-$N$ and search chase.
Correct answers cluster. On problems with a single right answer, the model's mistakes tend to scatter while its correct reasoning paths agree. Taking the most common answer (majority vote) therefore beats taking a random one — no extra model required.

The cleanest quantity is coverage, sometimes written pass@$N$: the chance that the best of $N$ independent samples is correct, given a per-sample success probability $p$. If draws are independent, the only way to fail is for all $N$ to be wrong:

EQ 12.1 — COVERAGE: P(AT LEAST ONE CORRECT IN N) $$ \text{pass@}N \;=\; 1 - (1-p)^{N} $$

$p$ is the per-sample probability the model produces a correct answer; $N$ is the number of independent samples. The failure event is "every one of the $N$ draws is wrong", probability $(1-p)^N$; coverage is its complement. The curve is concave: the first few samples buy the most, then returns diminish geometrically. Coverage is an upper bound on what any selection method can deliver — you cannot pick a right answer that was never sampled. The whole rest of the chapter is about how close to this ceiling a real selector can get.

Two warnings before we build on EQ 12.1. First, independence is an idealization: real samples from one model on one prompt are correlated, so empirical coverage grows a little slower than $1-(1-p)^N$, though Brown et al. (2024) found the log-coverage-versus-log-$N$ relationship is strikingly clean across orders of magnitude. Second, coverage is not accuracy. Knowing a correct answer is somewhere in the $N$ samples is worthless unless you can identify it. Closing the gap between coverage and delivered accuracy is the job of voting (§12.2) and verifiers (§12.3).

A model solves a problem on any single try with probability $p = 0.3$. You draw $N = 5$ independent samples. Using EQ 12.1, what is the coverage — the probability that at least one sample is correct? (Give a probability in $[0,1]$.)

$\text{pass@}5 = 1 - (1-0.3)^5 = 1 - 0.7^5 = 1 - 0.16807 = 0.83193$. So five tries lift a 30%-per-try model to about 0.832 coverage — provided you can pick the right one out of the five.

12.2

Sampling-based scaling: best-of-N and majority vote

The simplest way to turn coverage into accuracy needs no extra model. Sample $N$ answers and aggregate them. Three variants, in rising order of what they assume:

Best-of-$N$ (oracle / pass@$N$). Keep the sample that is actually correct. This is only achievable with a perfect checker (a unit test that compiles, a numeric answer you can verify). Its accuracy equals coverage, EQ 12.1 — the ceiling.
Self-consistency / majority vote. Sample $N$ chains of thought, extract each final answer, and return the most frequent one. No verifier, no labels. It works because correct reasoning paths tend to converge on the same answer while wrong ones disperse.
Weighted voting. Same as majority vote but each answer's tally is weighted by a confidence score — the model's own sequence probability, or a verifier's score (§12.3). It strictly generalizes plain voting and usually edges it out.

Self-consistency (Wang et al., 2022) is the workhorse. Formally, for an answer value $a$ and sampled chains $c_1,\dots,c_N$ with extracted answers $\hat a_i$:

EQ 12.2 — SELF-CONSISTENCY (MAJORITY / WEIGHTED VOTE) $$ \hat a^{\star} \;=\; \arg\max_{a}\; \sum_{i=1}^{N} w_i \,\mathbb{1}\!\left[\hat a_i = a\right], \qquad w_i = \begin{cases} 1 & \text{plain majority} \\ s(c_i) & \text{weighted} \end{cases} $$

$\mathbb{1}[\cdot]$ is the indicator, $s(c_i)$ a confidence or verifier score for chain $i$. Plain voting sets every weight to 1 and just counts. The key property: marginalizing over reasoning paths recovers the answer the model "believes" most, which is far more reliable than any single greedy decode. Voting helps only when correct answers are modal — when the single most likely answer is the right one more often than not. On tasks where the model is confidently and consistently wrong, more votes entrench the error.

How well does voting track the oracle ceiling? Under independence and a single dominant correct answer, majority accuracy is governed by the binomial: if $p > \tfrac12$ for a two-way choice, the probability the majority is correct rises toward 1 with $N$ (the Condorcet jury theorem). Reasoning problems are not binary, but the intuition holds: when the correct answer is the most probable single answer, voting converges to it. When $p$ is low or split across plausible distractors, voting can saturate well below the best-of-$N$ oracle — that gap is exactly what a verifier is for.

You sample 5 answers to an arithmetic problem and read off these final values: 13, 13, 7, 13, 9. A unit test is unavailable, so you use plain self-consistency (majority vote, EQ 12.2). What final answer does it return?

Tally the votes: 13 appears 3 times, 7 once, 9 once. The mode is 13, with 3 of 5 votes. Majority vote returns the most frequent extracted answer regardless of any single chain's probability — here three independent chains agreeing on 13 outweighs the two scattered wrong answers.

PYTHON · RUNNABLE IN-BROWSER

# Self-consistency: sample noisy answers, take the mode, plot accuracy vs N (EQ 12.2)
import numpy as np
rng = np.random.default_rng(0)

correct = 7                              # the true answer to one hard problem
p = 0.42                                 # per-sample chance of getting it right
distractors = [3, 5, 9, 11, 13]          # wrong answers the model scatters across
Ns = [1, 3, 5, 11, 21, 41, 81]
trials = 4000

def one_sample():
    if rng.random() < p: return correct
    return distractors[rng.integers(len(distractors))]

accs = []
for N in Ns:
    wins = 0
    for _ in range(trials):
        votes = [one_sample() for _ in range(N)]
        vals, counts = np.unique(votes, return_counts=True)
        if vals[np.argmax(counts)] == correct:   # majority vote (ties -> lowest value)
            wins += 1
    accs.append(wins / trials)
    print(f"N={N:>3}  majority-vote accuracy = {accs[-1]:.3f}   (single-sample p={p})")

print("\nvoting lifts a p<0.5 model well above p because correct answers are modal;")
print("it plateaus once the mode is locked in -- diminishing returns past ~N=40.")
plot_xy(Ns, accs)                        # accuracy rises then saturates

edits are live — break it on purpose

INSTRUMENT 12.1 — BEST-OF-N vs MAJORITY VOTEACCURACY AS N GROWS · EQ 12.1 / 12.2

PER-SAMPLE p(correct) 0.35

WRONG-ANSWER SPREAD k 4

SAMPLES N 16

COVERAGE pass@N (ORACLE)

—

MAJORITY-VOTE ACC

—

VOTE / ORACLE GAP

—

The red curve is the best-of-$N$ oracle (coverage, EQ 12.1) — what a perfect verifier could reach. The mint curve is plain majority vote (no verifier), modelled as: each wrong sample lands on one of $k$ distractors uniformly. Drag $p$: below 0.5 with few distractors voting still climbs, but the gap to the oracle widens — that gap is the value a verifier (§12.3) can recover. Push the spread $k$ up and voting degrades while coverage barely moves, because scattered wrongs never form a false majority but also never get picked without a checker.

12.3

Verifier-guided selection: outcome vs process rewards

Majority vote throws away the model's confidence. A verifier — a learned model that scores candidate solutions — does better, and it is the bridge from coverage to delivered accuracy. There are two kinds, differing in what they score.

Outcome reward model (ORM). Scores the final answer of a complete solution: one scalar per candidate, "is this likely correct?". Trained on (solution, correct/incorrect) labels. Cheap to label, but it cannot localize where a wrong solution went off the rails, and it can be fooled by a wrong chain that happens to reach a right-looking answer.
Process reward model (PRM). Scores each reasoning step, producing a per-step correctness signal. The solution's score is then an aggregate of step scores (often the product, or the minimum). Lightman et al. (2023) showed step-level supervision substantially outperforms outcome-only on hard math, because it rewards valid reasoning, not just lucky answers, and gives search (§12.4) a dense signal to climb.

Best-of-$N$ with a verifier replaces "pick the correct one" (oracle) and "pick the most common one" (voting) with "pick the highest-scored one":

EQ 12.3 — VERIFIER-WEIGHTED SELECTION (ORM & PRM) $$ \hat a^{\star} = \arg\max_{a}\; \sum_{i:\,\hat a_i = a} V(c_i), \qquad V_{\text{ORM}}(c) = r_\theta(c), \quad V_{\text{PRM}}(c) = \prod_{t=1}^{T} r_\theta\!\big(c_{1:t}\big) $$

$V(\cdot)$ is the verifier's score for a chain. The ORM emits a single score $r_\theta(c)$ for the whole solution; the PRM scores each prefix $c_{1:t}$ and aggregates (here a product, so one bad step tanks the chain). Setting $V \equiv 1$ recovers plain majority vote (EQ 12.2), and using a perfect $V$ recovers the oracle (EQ 12.1) — so verifier quality is the dial between the two. Weighted best-of-$N$ (sum scores per answer) usually beats both naive best-of-$N$ and plain voting, because it combines agreement with confidence.

The verifier is the bottleneck, and an honest one. Every gain above the voting line is borrowed against the verifier's reliability, and verifiers are imperfect in ways that bite exactly when you lean on them hardest. Push $N$ high enough and best-of-$N$ starts reward-hacking: it selects samples that score well not because they are correct but because they exploit the verifier's blind spots. So accuracy-versus-$N$ under a learned verifier is often hump-shaped — it rises, peaks, then declines as the search finds adversarial high-scoring wrong answers. PRMs are also expensive to train (step labels are costly, whether human-annotated as in Lightman et al. or estimated by rollouts) and can be miscalibrated on out-of-distribution problems. A verifier is only as trustworthy as its weakest covered case.

A PRM scores a 3-step solution with per-step correctness probabilities $0.9,\ 0.7,\ 0.667$. Using the product aggregation $V_{\text{PRM}} = \prod_t r_\theta(c_{1:t})$ from EQ 12.3, what is the chain's overall verifier score? (Round to two decimals.)

Product of the step scores: $0.9 \times 0.7 \times 0.667 = 0.4202$. Rounded, 0.42. Product aggregation is deliberately strict — a single weak step (here the last, 0.667) drags the whole-chain score down, which is why PRMs penalize a single broken step far more harshly than an ORM scoring only the final answer.

PYTHON · RUNNABLE IN-BROWSER

# Best-of-N with a TOY NOISY VERIFIER: oracle vs verifier vs vote (EQ 12.3)
import numpy as np
rng = np.random.default_rng(1)

p = 0.40                                  # per-sample chance the answer is correct
sigma = 0.30                              # verifier noise: higher = less reliable
Ns = [1, 2, 4, 8, 16, 32, 64]
trials = 3000

def run(N):
    orc = ver = 0
    for _ in range(trials):
        correct = rng.random(N) < p                      # which samples are truly right
        # verifier score: correct ~ N(1,sigma), wrong ~ N(0,sigma), clipped to [0,1]
        score = np.where(correct, 1.0, 0.0) + rng.normal(0, sigma, N)
        score = np.clip(score, 0, 1)
        orc += correct.any()                             # oracle: a correct sample exists?
        ver += correct[int(np.argmax(score))]            # pick the top-scored sample
    return orc / trials, ver / trials

print(f"{'N':>3} {'oracle pass@N':>14} {'verifier acc':>13}")
for N in Ns:
    o, v = run(N)
    print(f"{N:>3} {o:>14.3f} {v:>13.3f}")

print(f"\nverifier (sigma={sigma}) trails the oracle and the gap GROWS with N:")
print("more samples => more chances for a noisy-high-scoring WRONG answer to win.")
print("set sigma=0 for a perfect verifier (matches the oracle); raise it to see hacking.")

edits are live — break it on purpose

12.4

Search: beam, lookahead, tree-of-thought, MCTS

Sampling treats each solution as atomic: generate the whole chain, then judge it. Search instead expands the solution incrementally, scoring partial states and steering generation toward promising ones. A step-level verifier (a PRM) is what makes this possible — it scores prefixes, so you can prune dead branches before paying to finish them. The same test-time budget buys more when it is spent guiding generation rather than just filtering finished outputs.

Beam search over reasoning steps. Keep the $B$ highest-PRM-scored partial solutions at each step, expand each into several continuations, re-score, and prune back to $B$. It concentrates compute on the live frontier instead of spreading it across independent full samples.
Lookahead / step-level rollouts. Before committing to a step, roll the model forward a few steps to estimate where that branch is heading, then score the branch by its lookahead. More accurate per node, more expensive per node.
Tree-of-Thought (ToT). Treat reasoning as an explicit tree of partial thoughts, expanding, evaluating, and backtracking with breadth- or depth-first control. Useful when a problem benefits from exploring and abandoning alternatives (puzzles, planning) rather than one linear chain.
MCTS-style search. Monte-Carlo tree search balances exploring new branches against exploiting known-good ones (a UCT-style bonus), using rollouts or a value model to estimate each node. It is the most sample-efficient on the hardest problems and the most engineering-heavy to run.

Snell et al. (2024) compared these against plain best-of-$N$ at matched compute and found the honest, useful answer: the best test-time strategy depends on the difficulty of the question. On easier problems, where the model's first attempts are usually close, cheap sequential refinement and small $N$ win. On harder problems, where the model needs to find a rare correct path, search and verifier-guided selection with more compute pull ahead. There is no single dominant method — matching the strategy to the question is itself part of being compute-optimal (§12.6).

Search inherits the verifier's flaws and adds its own. A search that climbs a learned PRM will, given enough budget, find states the PRM over-rates — the reward-hacking of §12.3, now actively optimized for. Beam and MCTS can also collapse diversity, repeatedly refining one flawed line of attack instead of exploring genuinely different approaches, which can leave search below simple sampling on problems whose only solution requires a different opening move. More search is not monotonically better.

Method	Compute axis	Needs verifier?	Best when…
Majority vote	parallel	no	Correct answer is modal; no checker available
Best-of-N + ORM	parallel	outcome	Cheap final-answer scoring; moderate N
Beam / lookahead (PRM)	parallel + step	process	Hard problems; dense step signal exists
Tree-of-Thought / MCTS	tree	process/value	Backtracking helps; budget is large
Long-CoT (§12.5)	sequential	internalized	Model trained to self-correct in one chain

12.5

Thinking longer: long chain-of-thought & budget forcing

Every method so far spends compute in parallel — many samples, scored or searched. The 2024–2025 shift was to spend it sequentially: train a single model to produce one very long chain of thought that plans, explores, checks itself, and backtracks within the same generation, before emitting an answer. This is the o1 / DeepSeek-R1 paradigm, and it changed what "test-time compute" means.

The defining empirical result, established by OpenAI's o1 and reproduced openly by DeepSeek-AI's R1 (2025), is two coupled scaling curves: accuracy rises smoothly both with training-time RL compute and with test-time thinking length. R1 demonstrated that large-scale reinforcement learning with simple, verifiable rewards (correct answer / valid format) — without a learned PRM in the loop — can induce long-CoT behaviors on its own: the model learns to allocate more reasoning tokens to harder problems, to re-examine its steps, and to recover from mistakes mid-chain. The "verifier" here is internalized into the policy by RL rather than run as a separate selector at inference.

EQ 12.4 — TEST-TIME SCALING AS A FUNCTION OF THINKING TOKENS $$ \text{Acc}(B) \;\approx\; a + b\,\log B, \qquad B = \text{thinking-token budget}, \quad \text{cost} \propto B $$

Across a wide band, accuracy on reasoning benchmarks grows roughly linearly in the log of the thinking budget $B$ — doubling the tokens you let the model spend buys a roughly constant accuracy increment. Because cost is linear in $B$, each additional point of accuracy costs exponentially more compute: the same diminishing-returns geometry as EQ 12.1, now along the sequential axis. The relation is empirical and saturates: beyond some budget the model loops, second-guesses correct answers, or drifts, and accuracy flattens or dips.

Budget forcing (Muennighoff et al., 2025, "s1") is the clean control knob for the sequential axis. To make a model think more, suppress its end-of-thinking token and append a continuation cue (literally "Wait") so it keeps reasoning; to make it think less, force the end-of-thinking token after a token cap and elicit the answer. With a tiny curated dataset and this one inference-time trick, s1 reproduced a clear test-time scaling curve, showing the effect is a property you can steer at decode time, not only a property baked in by massive RL.

Sequential is not strictly better than parallel. Long-CoT models excel where self-correction within one chain pays off, but they have their own failure modes: overthinking easy problems (spending thousands of tokens to confirm an answer reached in the first sentence), latency that can be tens of seconds per query, and degradation past a sweet-spot budget. The strongest 2026 systems combine axes — a long-CoT policy, sampled a few times, with the final answers voted or verified — because parallel and sequential scaling attack partly independent sources of error.

A model's test-time scaling fits $\text{Acc}(B) = a + b\log_2 B$ (EQ 12.4) with $a = 0.50$ and $b = 0.02$ per doubling of the thinking-token budget $B$. What accuracy does it reach at a budget of $B = 4096$ tokens?

$\log_2 4096 = 12$, so the budget is 12 doublings above one token. $\text{Acc} = 0.50 + 0.02 \times 12 = 0.50 + 0.24 = 0.74$. The answer is 0.74. Each doubling of $B$ adds only the fixed increment $b = 0.02$, so the accuracy grows linearly in $\log B$ while compute grows linearly in $B$ itself — the EQ 12.4 diminishing-returns law along the sequential axis.

12.6

Compute-optimal scaling and the cost of thinking

The strategic question is not "does test-time compute help" — it does — but "given a fixed total FLOP budget, where should it go?" You can spend FLOPs three ways: a bigger model (more training and more per-token inference cost), more test-time samples/search on a smaller model, or longer thinking per query. Snell et al. (2024) framed this as a compute-optimal allocation problem and reached a nuanced conclusion worth stating precisely.

EQ 12.5 — COMPUTE-OPTIMAL TEST-TIME ALLOCATION $$ \theta^{\star}(q, C) \;=\; \arg\max_{\theta}\; \mathbb{E}\big[\text{Acc} \mid q,\ \theta,\ C\big], \qquad C \approx \underbrace{2 P}_{\text{model FLOPs/token}} \times \underbrace{N \cdot L}_{\text{tokens generated}} $$

$\theta$ is the test-time strategy (samples $N$, search width, thinking length $L$), chosen per prompt $q$ under a FLOP budget $C$. A forward pass over a $P$-parameter dense model costs about $2P$ FLOPs per token (Vol II · Ch 08), so total inference FLOPs scale with $P \times N \times L$. The optimum is not a constant: it depends on prompt difficulty $q$. Easy prompts want little extra compute; hard prompts want a lot. Allocating compute per-difficulty rather than uniformly was worth several-fold efficiency in Snell et al.

The headline findings, stated honestly with their boundaries:

When test-time compute wins. On easy-to-moderate problems and at modest total budgets, a smaller model plus optimally allocated test-time compute can match or beat a model ~14× larger run once. The small-model-plus-thinking configuration is cheaper to serve and easier to update.
When a bigger model wins. On the hardest problems, or when you face many queries (amortizing the bigger model's training over huge inference volume), pretraining a larger model is the better spend. Test-time compute cannot manufacture capability the base model fundamentally lacks: if coverage (EQ 12.1) is near zero because the model never samples a correct path, no amount of voting, verifying, or searching helps.
Diminishing returns are real on both axes. Coverage (EQ 12.1), voting (§12.2), and thinking length (EQ 12.4) all bend over. Past their knees you pay exponentially more FLOPs, dollars, and latency for vanishing accuracy — and with learned verifiers, over-spending can actively lower accuracy via reward hacking (§12.3).

The economics are blunt. Parallel scaling multiplies your dollar cost per query by $N$ (you pay for every sample) but can be done concurrently, so latency need not rise. Sequential scaling multiplies latency by the thinking length (the user waits through the whole chain) and the dollar cost with it. A reasoning model emitting tens of thousands of thinking tokens can cost orders of magnitude more per answered query than a single greedy decode, which is why production systems gate long thinking behind difficulty routing — cheap path for easy queries, expensive path only when it earns its keep.

A sample costs 1 unit. With per-sample $p = 0.5$ you draw $N = 4$ samples and have a perfect verifier (so you solve the task iff at least one of the four is correct). What is the expected cost per solved task, $\dfrac{N \times \text{cost-per-sample}}{\text{pass@}N}$? (Give the cost in units, to two decimals.)

Success rate: $\text{pass@}4 = 1 - (1-0.5)^4 = 1 - 0.5^4 = 1 - 0.0625 = 0.9375$. You always pay for all 4 samples, so cost per attempt is 4 units. Cost per solved task $= 4 / 0.9375 = 4.2667$, i.e. 4.27 units. The lesson: as per-sample $p$ falls, coverage falls and the wasted samples on unsolved attempts inflate cost-per-solve super-linearly — low-$p$ problems are expensive even with a perfect checker.

INSTRUMENT 12.2 — COMPUTE-OPTIMAL ALLOCATIONACCURACY SURFACE OVER {MODEL SIZE × SAMPLES} AT FIXED FLOPs · EQ 12.5

FIXED FLOP BUDGET 1.0e15

PROBLEM DIFFICULTY 0.50

OPTIMAL MODEL SIZE

—

OPTIMAL SAMPLES N

—

PEAK ACCURACY

—

A fixed FLOP budget $C \approx 2P\cdot N\cdot L$ (EQ 12.5) buys an iso-cost frontier: spend it on more parameters $P$ (bigger, smarter per sample, but fewer tries) or more samples $N$ (smaller model, more shots). The curve plots best-of-$N$ accuracy along that frontier; the marker is the optimum. Drag difficulty down (easy problems): the optimum slides toward many samples on a small model — test-time compute wins. Drag it up (hard problems): the optimum slides toward a bigger model, because a tiny model's per-sample $p$ is too low for sampling to rescue. There is no universal best split — it moves with the problem.

The synthesis for 2026: inference-time scaling is now a first-class lever, co-equal with model size and data. The skill is no longer "make the model bigger" but "spend each FLOP where it returns the most accuracy" — sample when answers cluster, verify when you have a trustworthy checker, search when steps can be scored, think longer when the model was trained to self-correct, and route by difficulty so you never overspend on easy queries. And throughout, respect the two hard ceilings: you cannot select an answer the model never sampled (coverage), and you cannot trust selection beyond the reliability of whatever does the selecting (the verifier).

We have been treating the model as a black box that emits distributions we sample, score, and search. The next chapter cracks it open. Mechanistic interpretability and sparse autoencoders (SAEs) ask what the activations actually represent — the features, circuits, and directions inside the residual stream — turning "the model believes X" from a behavioral guess into something we can locate, read off, and intervene on.

12.R

References

Snell, C., Lee, J., Xu, K. & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. The compute-optimal allocation and difficulty-dependent strategy of EQ 12.5 / §12.6.
Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 — majority / weighted vote over sampled chains (EQ 12.2, §12.2).
Brown, B. et al. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Coverage / pass@N scaling with repeated sampling (EQ 12.1, §12.1–12.2).
Lightman, H. et al. (2023). Let's Verify Step by Step. OpenAI — process reward models (PRM) vs outcome supervision (EQ 12.3, §12.3).
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Open long-CoT reasoning via RL with verifiable rewards (EQ 12.4, §12.5).
Muennighoff, N. et al. (2025). s1: Simple Test-Time Scaling. Budget forcing as a decode-time control on thinking length (§12.5).
Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023 — explicit search over reasoning trees (§12.4).
Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems (GSM8K). OpenAI — outcome verifiers + best-of-N selection on math (§12.2–12.3).