Why spending compute at inference raises accuracy
Everything earlier in this volume spent its compute before deployment: pre-training (Chapter 04), alignment (Chapter 05), the one forward pass per token at serving time (Chapter 08). Inference-time scaling adds a second budget. Instead of accepting the model's first greedy answer, you let it run longer or more times and then pick — or build — a better answer from what it produced. The accuracy gains are real and, on reasoning benchmarks, often larger per FLOP than the same FLOPs spent enlarging the model.
The reason it works at all is that a language model is a distribution, not a function. Sample it many times on a hard problem and you get a spread of answers. Two facts about that spread make extra compute pay off:
- Coverage rises with samples. Even if any single sample is often wrong, the probability that at least one of \(N\) samples is correct climbs fast. This is the ceiling that best-of-\(N\) and search chase.
- Correct answers cluster. On problems with a single right answer, the model's mistakes tend to scatter while its correct reasoning paths agree. Taking the most common answer (majority vote) therefore beats taking a random one — no extra model required.
The cleanest quantity is coverage, sometimes written pass@\(N\): the chance that the best of \(N\) independent samples is correct, given a per-sample success probability \(p\). If draws are independent, the only way to fail is for all \(N\) to be wrong:
Two warnings before we build on EQ 12.1. First, independence is an idealization: real samples from one model on one prompt are correlated, so empirical coverage grows a little slower than \(1-(1-p)^N\), though Brown et al. (2024) found the log-coverage-versus-log-\(N\) relationship is strikingly clean across orders of magnitude. Second, coverage is not accuracy. Knowing a correct answer is somewhere in the \(N\) samples is worthless unless you can identify it. Closing the gap between coverage and delivered accuracy is the job of voting (§12.2) and verifiers (§12.3).
Sampling-based scaling: best-of-N and majority vote
The simplest way to turn coverage into accuracy needs no extra model. Sample \(N\) answers and aggregate them. Three variants, in rising order of what they assume:
- Best-of-\(N\) (oracle / pass@\(N\)). Keep the sample that is actually correct. This is only achievable with a perfect checker (a unit test that compiles, a numeric answer you can verify). Its accuracy equals coverage, EQ 12.1 — the ceiling.
- Self-consistency / majority vote. Sample \(N\) chains of thought, extract each final answer, and return the most frequent one. No verifier, no labels. It works because correct reasoning paths tend to converge on the same answer while wrong ones disperse.
- Weighted voting. Same as majority vote but each answer's tally is weighted by a confidence score — the model's own sequence probability, or a verifier's score (§12.3). It strictly generalizes plain voting and usually edges it out.
Self-consistency (Wang et al., 2022) is the workhorse. Formally, for an answer value \(a\) and sampled chains \(c_1,\dots,c_N\) with extracted answers \(\hat a_i\):
How well does voting track the oracle ceiling? Under independence and a single dominant correct answer, majority accuracy is governed by the binomial: if \(p > \tfrac12\) for a two-way choice, the probability the majority is correct rises toward 1 with \(N\) (the Condorcet jury theorem). Reasoning problems are not binary, but the intuition holds: when the correct answer is the most probable single answer, voting converges to it. When \(p\) is low or split across plausible distractors, voting can saturate well below the best-of-\(N\) oracle — that gap is exactly what a verifier is for.
# Self-consistency: sample noisy answers, take the mode, plot accuracy vs N (EQ 12.2)
import numpy as np
rng = np.random.default_rng(0)
correct = 7 # the true answer to one hard problem
p = 0.42 # per-sample chance of getting it right
distractors = [3, 5, 9, 11, 13] # wrong answers the model scatters across
Ns = [1, 3, 5, 11, 21, 41, 81]
trials = 4000
def one_sample():
if rng.random() < p: return correct
return distractors[rng.integers(len(distractors))]
accs = []
for N in Ns:
wins = 0
for _ in range(trials):
votes = [one_sample() for _ in range(N)]
vals, counts = np.unique(votes, return_counts=True)
if vals[np.argmax(counts)] == correct: # majority vote (ties -> lowest value)
wins += 1
accs.append(wins / trials)
print(f"N={N:>3} majority-vote accuracy = {accs[-1]:.3f} (single-sample p={p})")
print("\nvoting lifts a p<0.5 model well above p because correct answers are modal;")
print("it plateaus once the mode is locked in -- diminishing returns past ~N=40.")
plot_xy(Ns, accs) # accuracy rises then saturates
Verifier-guided selection: outcome vs process rewards
Majority vote throws away the model's confidence. A verifier — a learned model that scores candidate solutions — does better, and it is the bridge from coverage to delivered accuracy. There are two kinds, differing in what they score.
- Outcome reward model (ORM). Scores the final answer of a complete solution: one scalar per candidate, "is this likely correct?". Trained on (solution, correct/incorrect) labels. Cheap to label, but it cannot localize where a wrong solution went off the rails, and it can be fooled by a wrong chain that happens to reach a right-looking answer.
- Process reward model (PRM). Scores each reasoning step, producing a per-step correctness signal. The solution's score is then an aggregate of step scores (often the product, or the minimum). Lightman et al. (2023) showed step-level supervision substantially outperforms outcome-only on hard math, because it rewards valid reasoning, not just lucky answers, and gives search (§12.4) a dense signal to climb.
Best-of-\(N\) with a verifier replaces "pick the correct one" (oracle) and "pick the most common one" (voting) with "pick the highest-scored one":
The verifier is the bottleneck, and an honest one. Every gain above the voting line is borrowed against the verifier's reliability, and verifiers are imperfect in ways that bite exactly when you lean on them hardest. Push \(N\) high enough and best-of-\(N\) starts reward-hacking: it selects samples that score well not because they are correct but because they exploit the verifier's blind spots. So accuracy-versus-\(N\) under a learned verifier is often hump-shaped — it rises, peaks, then declines as the search finds adversarial high-scoring wrong answers. PRMs are also expensive to train (step labels are costly, whether human-annotated as in Lightman et al. or estimated by rollouts) and can be miscalibrated on out-of-distribution problems. A verifier is only as trustworthy as its weakest covered case.
# Best-of-N with a TOY NOISY VERIFIER: oracle vs verifier vs vote (EQ 12.3)
import numpy as np
rng = np.random.default_rng(1)
p = 0.40 # per-sample chance the answer is correct
sigma = 0.30 # verifier noise: higher = less reliable
Ns = [1, 2, 4, 8, 16, 32, 64]
trials = 3000
def run(N):
orc = ver = 0
for _ in range(trials):
correct = rng.random(N) < p # which samples are truly right
# verifier score: correct ~ N(1,sigma), wrong ~ N(0,sigma), clipped to [0,1]
score = np.where(correct, 1.0, 0.0) + rng.normal(0, sigma, N)
score = np.clip(score, 0, 1)
orc += correct.any() # oracle: a correct sample exists?
ver += correct[int(np.argmax(score))] # pick the top-scored sample
return orc / trials, ver / trials
print(f"{'N':>3} {'oracle pass@N':>14} {'verifier acc':>13}")
for N in Ns:
o, v = run(N)
print(f"{N:>3} {o:>14.3f} {v:>13.3f}")
print(f"\nverifier (sigma={sigma}) trails the oracle and the gap GROWS with N:")
print("more samples => more chances for a noisy-high-scoring WRONG answer to win.")
print("set sigma=0 for a perfect verifier (matches the oracle); raise it to see hacking.")
Search: beam, lookahead, tree-of-thought, MCTS
Sampling treats each solution as atomic: generate the whole chain, then judge it. Search instead expands the solution incrementally, scoring partial states and steering generation toward promising ones. A step-level verifier (a PRM) is what makes this possible — it scores prefixes, so you can prune dead branches before paying to finish them. The same test-time budget buys more when it is spent guiding generation rather than just filtering finished outputs.
- Beam search over reasoning steps. Keep the \(B\) highest-PRM-scored partial solutions at each step, expand each into several continuations, re-score, and prune back to \(B\). It concentrates compute on the live frontier instead of spreading it across independent full samples.
- Lookahead / step-level rollouts. Before committing to a step, roll the model forward a few steps to estimate where that branch is heading, then score the branch by its lookahead. More accurate per node, more expensive per node.
- Tree-of-Thought (ToT). Treat reasoning as an explicit tree of partial thoughts, expanding, evaluating, and backtracking with breadth- or depth-first control. Useful when a problem benefits from exploring and abandoning alternatives (puzzles, planning) rather than one linear chain.
- MCTS-style search. Monte-Carlo tree search balances exploring new branches against exploiting known-good ones (a UCT-style bonus), using rollouts or a value model to estimate each node. It is the most sample-efficient on the hardest problems and the most engineering-heavy to run.
Snell et al. (2024) compared these against plain best-of-\(N\) at matched compute and found the honest, useful answer: the best test-time strategy depends on the difficulty of the question. On easier problems, where the model's first attempts are usually close, cheap sequential refinement and small \(N\) win. On harder problems, where the model needs to find a rare correct path, search and verifier-guided selection with more compute pull ahead. There is no single dominant method — matching the strategy to the question is itself part of being compute-optimal (§12.6).
Search inherits the verifier's flaws and adds its own. A search that climbs a learned PRM will, given enough budget, find states the PRM over-rates — the reward-hacking of §12.3, now actively optimized for. Beam and MCTS can also collapse diversity, repeatedly refining one flawed line of attack instead of exploring genuinely different approaches, which can leave search below simple sampling on problems whose only solution requires a different opening move. More search is not monotonically better.
| Method | Compute axis | Needs verifier? | Best when… |
|---|---|---|---|
| Majority vote | parallel | no | Correct answer is modal; no checker available |
| Best-of-N + ORM | parallel | outcome | Cheap final-answer scoring; moderate N |
| Beam / lookahead (PRM) | parallel + step | process | Hard problems; dense step signal exists |
| Tree-of-Thought / MCTS | tree | process/value | Backtracking helps; budget is large |
| Long-CoT (§12.5) | sequential | internalized | Model trained to self-correct in one chain |
Thinking longer: long chain-of-thought & budget forcing
Every method so far spends compute in parallel — many samples, scored or searched. The 2024–2025 shift was to spend it sequentially: train a single model to produce one very long chain of thought that plans, explores, checks itself, and backtracks within the same generation, before emitting an answer. This is the o1 / DeepSeek-R1 paradigm, and it changed what "test-time compute" means.
The defining empirical result, established by OpenAI's o1 and reproduced openly by DeepSeek-AI's R1 (2025), is two coupled scaling curves: accuracy rises smoothly both with training-time RL compute and with test-time thinking length. R1 demonstrated that large-scale reinforcement learning with simple, verifiable rewards (correct answer / valid format) — without a learned PRM in the loop — can induce long-CoT behaviors on its own: the model learns to allocate more reasoning tokens to harder problems, to re-examine its steps, and to recover from mistakes mid-chain. The "verifier" here is internalized into the policy by RL rather than run as a separate selector at inference.
Budget forcing (Muennighoff et al., 2025, "s1") is the clean control knob for the sequential axis. To make a model think more, suppress its end-of-thinking token and append a continuation cue (literally "Wait") so it keeps reasoning; to make it think less, force the end-of-thinking token after a token cap and elicit the answer. With a tiny curated dataset and this one inference-time trick, s1 reproduced a clear test-time scaling curve, showing the effect is a property you can steer at decode time, not only a property baked in by massive RL.
Sequential is not strictly better than parallel. Long-CoT models excel where self-correction within one chain pays off, but they have their own failure modes: overthinking easy problems (spending thousands of tokens to confirm an answer reached in the first sentence), latency that can be tens of seconds per query, and degradation past a sweet-spot budget. The strongest 2026 systems combine axes — a long-CoT policy, sampled a few times, with the final answers voted or verified — because parallel and sequential scaling attack partly independent sources of error.
Compute-optimal scaling and the cost of thinking
The strategic question is not "does test-time compute help" — it does — but "given a fixed total FLOP budget, where should it go?" You can spend FLOPs three ways: a bigger model (more training and more per-token inference cost), more test-time samples/search on a smaller model, or longer thinking per query. Snell et al. (2024) framed this as a compute-optimal allocation problem and reached a nuanced conclusion worth stating precisely.
The headline findings, stated honestly with their boundaries:
- When test-time compute wins. On easy-to-moderate problems and at modest total budgets, a smaller model plus optimally allocated test-time compute can match or beat a model ~14× larger run once. The small-model-plus-thinking configuration is cheaper to serve and easier to update.
- When a bigger model wins. On the hardest problems, or when you face many queries (amortizing the bigger model's training over huge inference volume), pretraining a larger model is the better spend. Test-time compute cannot manufacture capability the base model fundamentally lacks: if coverage (EQ 12.1) is near zero because the model never samples a correct path, no amount of voting, verifying, or searching helps.
- Diminishing returns are real on both axes. Coverage (EQ 12.1), voting (§12.2), and thinking length (EQ 12.4) all bend over. Past their knees you pay exponentially more FLOPs, dollars, and latency for vanishing accuracy — and with learned verifiers, over-spending can actively lower accuracy via reward hacking (§12.3).
The economics are blunt. Parallel scaling multiplies your dollar cost per query by \(N\) (you pay for every sample) but can be done concurrently, so latency need not rise. Sequential scaling multiplies latency by the thinking length (the user waits through the whole chain) and the dollar cost with it. A reasoning model emitting tens of thousands of thinking tokens can cost orders of magnitude more per answered query than a single greedy decode, which is why production systems gate long thinking behind difficulty routing — cheap path for easy queries, expensive path only when it earns its keep.
The synthesis for 2026: inference-time scaling is now a first-class lever, co-equal with model size and data. The skill is no longer "make the model bigger" but "spend each FLOP where it returns the most accuracy" — sample when answers cluster, verify when you have a trustworthy checker, search when steps can be scored, think longer when the model was trained to self-correct, and route by difficulty so you never overspend on easy queries. And throughout, respect the two hard ceilings: you cannot select an answer the model never sampled (coverage), and you cannot trust selection beyond the reliability of whatever does the selecting (the verifier).
We have been treating the model as a black box that emits distributions we sample, score, and search. The next chapter cracks it open. Mechanistic interpretability and sparse autoencoders (SAEs) ask what the activations actually represent — the features, circuits, and directions inside the residual stream — turning "the model believes X" from a behavioral guess into something we can locate, read off, and intervene on.
References
- Snell, C., Lee, J., Xu, K. & Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
- Brown, B. et al. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.
- Lightman, H. et al. (2023). Let's Verify Step by Step.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
- Muennighoff, N. et al. (2025). s1: Simple Test-Time Scaling.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models.
- Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems (GSM8K).