03 · Show, Don't Tell: Few-Shot & Examples

3.1

Why examples work: in-context learning, mechanically

Chapter 01 framed a prompt as conditioning: every token you place in context reshapes the distribution over what comes next. Examples are the most aggressive form of conditioning available, because they exploit a circuit the model already runs on every forward pass. Vol II · Chapter 03 introduced induction heads — attention heads that find an earlier occurrence of the current pattern and copy what followed it. A few-shot prompt is, structurally, bait for exactly that circuit: input → output, input → output, input → ? is the repeated-pattern format induction heads were discovered completing. No weights change. The "learning" in in-context learning is pattern-matching over the residual stream, executed fresh on every call.

Two findings sharpen the picture, and both should change how you write prompts:

Examples mostly tell the model which task, not how to do it. Min et al. (2022) showed that on many classification benchmarks, replacing the gold labels in few-shot examples with random labels barely dents accuracy. What carried the performance was the input distribution, the label space, and the format — the demonstration's shape. The model already knew how to classify sentiment; the examples told it that sentiment classification, in this exact format, is what's happening here. This is often called task recognition as opposed to task learning.
But bigger models do read the labels. Wei et al. (2023) found that as scale grows, models increasingly override their semantic priors and follow flipped or arbitrary label mappings in the examples. Frontier models genuinely extract input→output rules from context — a capability some theoretical work models as implicit regression or gradient-descent-like updating inside the forward pass. That account remains contested; the empirical part is not.

CONSEQUENCE

Both regimes reward the same practice: examples are doing format and task-boundary work first, rule-induction work second. So you optimize examples for coverage, consistency, and format fidelity (§3.3, §3.5) before you optimize them for cleverness — and you never assume the model "understood the rule" just because it matched three demonstrations.
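The random-label control behind the Min et al. finding is easy to set up for your own task. A minimal sketch, assuming a plain Input/Label rendering; the demos, the template, and the helper names render_prompt and randomize_labels are invented here for illustration and are not taken from the paper:

```python
import random

def render_prompt(demos, query):
    """Render demos as Input/Label blocks, ending on the open query."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

def randomize_labels(demos, label_space, seed=0):
    """Keep inputs and format; replace each gold label with a uniform random draw."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_space)) for text, _ in demos]

# Hypothetical demos, invented for illustration.
demos = [
    ("The plot dragged but the ending landed.", "positive"),
    ("Two hours I will never get back.", "negative"),
    ("Sharp writing, great cast.", "positive"),
    ("The sequel nobody asked for.", "negative"),
]
labels = ["positive", "negative"]
query = "It's fine, I guess."

gold_prompt = render_prompt(demos, query)
control_prompt = render_prompt(randomize_labels(demos, labels), query)
print(control_prompt)
```

Run both prompts over the same eval set. If accuracy barely moves, the demos are doing task-recognition work; a large drop suggests the model is reading the labels.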

3.2

How many shots: the saturation curve

Since the GPT-3 paper plotted accuracy against shot count in 2020, the same shape has recurred across tasks and model generations: a steep rise that flattens fast. The single biggest jump is zero → one — the first example resolves the format, the label space, and most of the task ambiguity at once. Each additional example refines boundaries with diminishing returns. A saturating exponential captures the shape well enough to reason with:

EQ P3.1 — THE SHOT-COUNT CURVE (CONCEPTUAL) $$ \mathrm{acc}(k) \;\approx\; a_{\infty} - \left( a_{\infty} - a_{0} \right) e^{-k/\kappa} $$

$a_0$ is zero-shot accuracy, $a_\infty$ the few-shot ceiling, and $\kappa$ the task's saturation constant — how many examples it takes to close $63\%$ of the remaining gap. This is a shape, not a law: it summarizes the typical empirical curve, it is not derived from anything. Its useful predictions: the marginal value of example $k{+}1$ decays geometrically, and the gap $a_\infty - a_0$ — how much examples can help at all — varies enormously by task type.
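The geometric decay falls straight out of EQ P3.1. Subtracting consecutive terms (a short derivation added here, using only the symbols defined above):

$$ \Delta(k) \;=\; \mathrm{acc}(k{+}1) - \mathrm{acc}(k) \;=\; \left( a_{\infty} - a_{0} \right) e^{-k/\kappa} \left( 1 - e^{-1/\kappa} \right), \qquad \frac{\Delta(k{+}1)}{\Delta(k)} \;=\; e^{-1/\kappa} $$

Each example is worth a fixed fraction $e^{-1/\kappa}$ of the one before it. For $\kappa = 2$ that fraction is $e^{-0.5} \approx 0.61$; for any $\kappa$ below 1 it drops under $0.37$, which is why a fast-saturating task gets so little from its second shot.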

The parameters cluster by task family. Format-following tasks (emit this JSON, this tag style, this report skeleton) have a huge $a_\infty - a_0$ gap and tiny $\kappa$: one or two examples and you're done, because format is precisely what demonstrations transmit best. Classification and extraction saturate more slowly — around 4–8 shots — since later examples still sharpen category boundaries. Reasoning-heavy tasks barely move: a worked example changes the style of the solution trace, not the model's ability to solve (Chapter 04 picks up that thread). Explore the three regimes:

INSTRUMENT P3.1 — SHOT-COUNT EXPLORER · HAND-BUILT CURVES · ILLUSTRATIVE · EQ P3.1

[Interactive instrument. Controls: shots k (default 4), tokens per example (default 120). Readouts: accuracy for format-following, classification, and reasoning; prompt overhead.]

Curves are hand-built to match the shapes reported across the few-shot literature — they are illustrative, not measurements. Slide k from 0 to 1 and watch where each curve makes its largest jump; then note that by k = 2 the format curve has nothing left to gain, while every added example keeps costing tokens on every single call. Cost readout assumes an indicative $3 / 1M input tokens.

The same exponential, in code. Each task family gets its own $(a_\infty, a_\infty{-}a_0, \kappa)$, with tau as the code's name for $\kappa$. The table makes the diminishing marginal return concrete: read down any column and watch the per-shot gain collapse. Edit tau for reasoning and watch its curve refuse to move regardless, because its gap $a_\infty - a_0$ is only 0.05:

PYTHON · RUNNABLE IN-BROWSER

# k-shot accuracy curve (ILLUSTRATIVE) — acc(k) = ceil - gap*exp(-k/tau)
import numpy as np
k = np.arange(0, 9)
tasks = {  # (ceiling, gap, tau) — hand-built per EQ P3.1, NOT measured
    "format": (0.97, 0.55, 0.7),
    "classif": (0.88, 0.33, 3.2),
    "reason":  (0.66, 0.05, 4.0),
}
print("ILLUSTRATIVE — hand-built shapes, not measurements")
print("  k " + "".join(f"{n:>9}" for n in tasks))
for kk in k:
    row = [ceil - gap*np.exp(-kk/tau) for ceil, gap, tau in tasks.values()]
    print(f"{kk:>3} " + "".join(f"{a:>9.3f}" for a in row))
fmt = tasks["format"]
print("\nformat marginal gain, shot 1 vs shot 2:")
g1 = (fmt[0]-fmt[1]*np.exp(-1/fmt[2])) - (fmt[0]-fmt[1])
g2 = (fmt[0]-fmt[1]*np.exp(-2/fmt[2])) - (fmt[0]-fmt[1]*np.exp(-1/fmt[2]))
print(f"  +{g1*100:5.1f} pts then +{g2*100:5.1f} pts  (geometric decay)")
acc_fmt = [fmt[0]-fmt[1]*np.exp(-kk/fmt[2]) for kk in k]
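try:
    plot_xy(k.tolist(), acc_fmt)  # helper supplied by the in-browser runtime
except NameError:
    pass  # running outside the in-browser runtime: skip the plot, keep the table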


The printed table is the same EQ P3.1 the instrument above draws — the value of this view is the marginal-gain line: format-following banks most of its lift on the first shot and almost nothing after the second, exactly the geometric decay the equation predicts. The reasoning column barely leaves its starting value, the algebraic face of "examples don't teach a model to reason."

A task has zero-shot accuracy $a_0 = 0.5$, few-shot ceiling $a_\infty = 0.9$, and saturation constant $\kappa = 2$. Using the shot-count curve (EQ P3.1), what accuracy does $k = 2$ examples predict?

$\mathrm{acc}(2) = a_\infty - (a_\infty - a_0)\,e^{-k/\kappa} = 0.9 - (0.9-0.5)\,e^{-2/2} = 0.9 - 0.4\cdot e^{-1} = 0.9 - 0.4(0.3679) = 0.9 - 0.1472 \approx$ 0.75. Two shots already close most of the $a_\infty - a_0$ gap; the curve flattens fast from here.

The long-context caveat. "Many-shot" in-context learning (Agarwal et al., 2024) showed that with hundreds to thousands of examples — feasible once contexts crossed 100K tokens — some tasks keep improving well past where the classic curve flattens, occasionally approaching fine-tuning quality. The exponential above describes the 0–32 shot regime where almost all practical prompting lives; treat the far tail as a separate tool with fine-tuning-like economics (Vol II · Chapter 06), amortized only if you cache the prefix.

3.3

Example selection: cover the edges, not the center

The instinct is to pick your prettiest, most typical examples — three clean inputs with three clean outputs. That teaches the model a task narrower than yours. Production inputs are mostly edge cases wearing a trench coat: the empty field, the two-languages-in-one-sentence ticket, the review that praises the product while demanding a refund. Since examples define task boundaries (§3.1), an example spent on the happy path is a wasted boundary — the model already assumed the happy path.

Principle	Practice	Failure it prevents
Edge cases over prototypes	1 typical case, rest spent on boundaries: ambiguous, malformed, "none of the above"	Confident misclassification of anything atypical
Diversity beats quantity	8 examples spanning input clusters > 16 near-duplicates	Redundant shots that buy tokens, not coverage
Show the null action	include an input where the right output is "no match" / empty list / escalate	The model inventing an answer because every demo had one
Balance the label space	roughly even labels across shots (classification)	Majority-label bias — skew toward whichever label dominates the demos
Real over idealized	lightly cleaned production inputs, typos intact	A model calibrated to inputs that never occur
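Two of the rows above, label balance and the null action, can be checked mechanically before a demo set ships. A minimal sketch, assuming demos are (input, label) pairs and that the null action is spelled "none"; the tickets, the labels, and the audit_demos helper are hypothetical:

```python
from collections import Counter

def audit_demos(demos, null_labels=("none",)):
    """Report label counts, the majority label's share, and whether a null-action demo exists."""
    counts = Counter(label for _, label in demos)
    total = sum(counts.values())
    return {
        "counts": dict(counts),
        "majority_share": max(counts.values()) / total,
        "has_null": any(label in null_labels for _, label in demos),
    }

# Hypothetical support-ticket demos, invented for illustration.
demos = [
    ("I was charged twice this month.", "billing"),
    ("Why is my invoice higher than last time?", "billing"),
    ("Please refund the duplicate payment.", "billing"),
    ("My parcel has not moved in a week.", "shipping"),
]
report = audit_demos(demos)
print(report)
```

Here the report shows a 75% majority label and no null-action demo, which are the two conditions the table warns about. Coverage of input clusters and edge cases still needs a human read.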

Dynamic selection. When you have a pool of candidate examples, retrieving the nearest neighbors of the current input — embed the query, embed the pool, take top-$k$ by cosine similarity — reliably beats a fixed example set (the KATE result, Liu et al. 2021, since reproduced broadly). It is the same move as RAG, aimed at demonstrations instead of facts. Two cautions. First, similarity retrieval quietly destroys diversity: five neighbors of an unambiguous input are five near-identical demos, so production systems usually blend retrieved neighbors with a fixed diverse core. Second, retrieval changes the prompt prefix per request, which invalidates prefix caching (Vol II · Chapter 08) — at scale, the static-set discount is real money.
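The retrieve-then-blend pattern can be sketched in a few lines. This assumes the embeddings already exist as NumPy vectors (the 2-D vectors below are toy values, not real embeddings), and the helper names top_k_by_cosine and select_demos are invented for illustration:

```python
import numpy as np

def top_k_by_cosine(query_vec, pool_vecs, k):
    """Indices of the k pool vectors most similar to the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims)[:k].tolist()

def select_demos(query_vec, pool_vecs, core_ids, k_retrieved):
    """Fixed diverse core first, then the nearest neighbors not already in the core."""
    ranked = top_k_by_cosine(query_vec, pool_vecs, len(pool_vecs))
    retrieved = [i for i in ranked if i not in core_ids][:k_retrieved]
    return list(core_ids) + retrieved

# Toy 2-D "embeddings" for a pool of five candidate demos.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
core = [2, 4]  # hand-picked diverse demos, always included

print(top_k_by_cosine(query, pool, 2))
print(select_demos(query, pool, core, k_retrieved=2))
```

Placing the core first is a design choice in this sketch, not something the text prescribes: it keeps the leading part of the prompt identical across requests, which may preserve part of the prefix-cache benefit, and it puts the most query-relevant demos nearest the test input.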

3.4

Ordering and recency bias

Few-shot prompts are not sets; they are sequences, and the model reads them with position-dependent attention. Zhao et al. (2021) measured the damage on GPT-3: across permutations of the same four examples, SST-2 sentiment accuracy ranged from near-chance to near state-of-the-art. They isolated three biases: majority-label bias (predictions drift toward the most frequent label in the demos), recency bias (the last example's label bleeds into the prediction most), and common-token bias (predictions favor label tokens that are frequent in the pretraining data). Recency is the one ordering controls. A minimal model of the skew:

EQ P3.2 — RECENCY-WEIGHTED LABEL PRIOR (TOY MODEL) $$ \tilde{p}(y) \;=\; \frac{\sum_{i=1}^{k} w_i \,\mathbb{1}\!\left[ y_i = y \right]}{\sum_{i=1}^{k} w_i}, \qquad w_i = e^{\beta i},\quad \beta > 0 $$

Position $i$ runs from the earliest demo (1) to the last ($k$); $\beta$ sets how steeply late examples dominate. $\beta = 0$ recovers pure majority-label bias; $\beta > 0$ adds recency. On an ambiguous input, this prior — not the input — decides the prediction. A toy, but it reproduces the qualitative finding: with perfectly balanced labels, ordering alone manufactures a skewed prior.

INSTRUMENT P3.2 — ORDER SHUFFLER · 4 DEMOS · 2 LABELS · TOY MODEL (EQ P3.2, β = 0.65) · ILLUSTRATIVE

[Interactive instrument. Shows a four-demo few-shot block (positions 1 → 4) and the deliberately ambiguous test input "It's fine, I guess."; a control permutes the same four examples. Readouts: P(POSITIVE), P(NEGATIVE), the final example's label, and the skew toward the final label.]

The labels are perfectly balanced — two POSITIVE, two NEGATIVE — yet every shuffle produces a skewed prior, always toward the final example's class, strongest when the last two demos share a label. The numbers come from EQ P3.2, not from a model, but the phenomenon is the one Zhao et al. measured: same examples, different order, different answer.

The instrument shows one ordering at a time; the cell below samples 2,000 random orderings of the same balanced four-demo block, applies EQ P3.2, and reports how often the recency-weighted prior lands on the final example's label. With perfectly balanced labels a position-blind model would sit at 50%:

PYTHON · RUNNABLE IN-BROWSER

# recency bias over orderings (ILLUSTRATIVE) — EQ P3.2, beta>0
import numpy as np
from itertools import permutations
rng = np.random.default_rng(0)
labels = np.array([1, 1, 0, 0])    # 2 POSITIVE (1), 2 NEGATIVE (0) — balanced
beta = 0.65
w = np.exp(beta * np.arange(1, 5))  # recency weights, last demo heaviest
perms = list(permutations(range(4)))  # all 24 orderings of the four positions
agree, mass = [], []                # match + prior mass on the LAST label's class
for _ in range(2000):
    order = perms[rng.integers(len(perms))]
    lab = labels[list(order)]
    p_pos = (w * lab).sum() / w.sum()
    p_last = p_pos if lab[-1] == 1 else 1 - p_pos
    agree.append(int((p_pos >= 0.5) == (lab[-1] == 1)))
    mass.append(p_last)
agree, mass = np.array(agree), np.array(mass)
print("balanced labels, position-blind baseline: 50.0%")
print(f"shuffles simulated         : {agree.size}")
print(f"prior matches LAST label   : {100*agree.mean():.1f}%  (recency skew)")
print(f"mean prior mass on LAST cls : {100*mass.mean():.1f}%  (vs 50% if blind)")


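Set beta = 0 and every ordering yields a prior of exactly 0.5; the code breaks that tie toward POSITIVE, so the match rate falls to the coin-flip baseline. That is majority-label bias alone, which balanced labels neutralize. Any $\beta > 0$ pushes the prior toward whatever label sits last, which is why "end on your most representative example" is the single cheapest ordering fix. With four balanced demos the match rate is 100% for every $\beta > 0$, because the last weight plus the smallest weight always outweighs the middle two; what $\beta$ changes is how much prior mass lands on the final label. This is the toy model, not a transformer; what it reproduces is the direction and the order-sensitivity, not a specific model's magnitude.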

Three demos sit at positions 1→3 with recency weights $w = (1,\, 2,\, 4)$ (latest heaviest, per EQ P3.2). Their labels are POSITIVE, NEGATIVE, POSITIVE. What is the recency-weighted prior $\tilde p(\text{POSITIVE})$?

Weights on POSITIVE demos: $w_1 + w_3 = 1 + 4 = 5$. Total weight: $1 + 2 + 4 = 7$. $\tilde p(\text{POSITIVE}) = 5/7 \approx$ 0.714. The labels are not evenly split (2 POSITIVE vs 1 NEGATIVE), so the unweighted majority prior is already $2/3 \approx 0.667$; recency lifts it to 0.714 because the heaviest, last demo is POSITIVE. Both biases point the same way here, and neither reads the input.

What to do about it. Four mitigations, in increasing order of effort: (1) balance labels and end on the most representative example, since the last slot leaks hardest; (2) never sort examples by label — alternate or randomize within the balanced set; (3) for evaluation, average over several orders rather than trusting one (Chapter 07); (4) contextual calibration — measure the model's output on a content-free input like "N/A" and divide it out — recovers most of the lost accuracy when you control the decoding stack. Modern instruct models are meaningfully better calibrated than the GPT-3 these biases were measured on, but the bleed has not gone to zero — it has gone subtle, which is worse for debugging.
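Mitigation (4) reduces to one division and a renormalization. A minimal sketch, assuming the decoding stack exposes a probability per label; the numbers and the calibrate helper are hypothetical, and this is the simplest reading of "divide it out" rather than a full reproduction of the published method:

```python
import numpy as np

def calibrate(p_input, p_content_free):
    """Divide out the label prior measured on a content-free input, then renormalize."""
    scaled = np.asarray(p_input, dtype=float) / np.asarray(p_content_free, dtype=float)
    return scaled / scaled.sum()

# Label order: [POSITIVE, NEGATIVE]. All probabilities are hypothetical.
p_content_free = [0.70, 0.30]  # model output on "N/A": the prompt's built-in skew
p_input = [0.60, 0.40]         # model output on the real, ambiguous input

calibrated = calibrate(p_input, p_content_free)
print(calibrated.round(3))
```

The raw output favors POSITIVE, but less strongly than the prompt does on an input with no content at all; after dividing out that skew, the calibrated prediction is NEGATIVE.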

3.5

Format leakage as a feature

Everything in your examples leaks into the output: the casing of keys, the order of fields, trailing punctuation, whether lists end with a period, the average response length. Most discussions treat this as a hazard. It is also the most reliable format-specification mechanism that exists — more reliable than describing the format in prose, because a description must be parsed and interpreted while a demonstration is simply continued. Compare:

# DESCRIBED — the model must translate prose into structure
Return JSON with keys "sentiment" (one of positive|negative|mixed),
"confidence" (a float between 0 and 1, two decimals), and "evidence"
(an array of verbatim quotes, at most two).

# DEMONSTRATED — the model continues the pattern
Input:  "Battery life is superb but the hinge broke in a week."
Output: {"sentiment": "mixed", "confidence": 0.86, "evidence": ["superb", "broke in a week"]}

One example pins down a dozen micro-decisions the description left open: key order, float precision, quote style, whether evidence is verbatim or paraphrased. The production pattern is describe once, demonstrate twice — a short prose spec for the rules that examples can't carry (ranges, enums, fallbacks), then two examples that settle everything else. Chapter 05 replaces this with constrained decoding where available; few-shot formatting remains the portable fallback that works on every model.
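The prose half of "describe once, demonstrate twice" doubles as a test spec for outputs. A minimal sketch of a checker for the format above, assuming the model returns a bare JSON string; the check_output helper and the failing sample are invented for illustration, and the two-decimal rule is not checked because it does not survive JSON parsing:

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "mixed"}
EXPECTED_KEYS = ["sentiment", "confidence", "evidence"]

def check_output(raw, source_text):
    """Return a list of rule violations for one model output (empty list means it passes)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(obj, dict):
        return ["not a JSON object"]
    problems = []
    if list(obj) != EXPECTED_KEYS:
        problems.append("keys or key order differ from the demonstration")
    if obj.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside the allowed enum")
    conf = obj.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        problems.append("confidence is not a number in [0, 1]")
    evidence = obj.get("evidence")
    if not isinstance(evidence, list) or len(evidence) > 2:
        problems.append("evidence must be a list of at most two quotes")
    elif any(not isinstance(q, str) or q not in source_text for q in evidence):
        problems.append("evidence quote is not verbatim from the input")
    return problems

source = "Battery life is superb but the hinge broke in a week."
good = '{"sentiment": "mixed", "confidence": 0.86, "evidence": ["superb", "broke in a week"]}'
bad = '{"sentiment": "angry", "confidence": 1.2, "evidence": ["fell apart"]}'

print(check_output(good, source))
print(check_output(bad, source))
```

The demonstrated output passes cleanly; the second sample breaks the enum, the range, and the verbatim rule, which are exactly the constraints the examples alone cannot carry.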

LEAK

The leak does not discriminate. The model copies your examples' flaws with the same fidelity as their format: one demo with a trailing comma teaches trailing commas; demos that are all ~40 tokens teach 40-token answers even when the right answer needs 400; an inconsistent pair of examples teaches that the format is negotiable. Audit examples the way you audit code — they are executable.
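Auditing demos "the way you audit code" can start with a lint pass. A minimal sketch, assuming each demo output is meant to be a JSON object; the sample outputs and the audit_demo_outputs helper are hypothetical:

```python
import json

def audit_demo_outputs(outputs):
    """Flag demo outputs that fail to parse or that disagree on key order."""
    issues, key_orders = [], {}
    for i, raw in enumerate(outputs):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            issues.append(f"demo {i}: not valid JSON")
            continue
        if not isinstance(obj, dict):
            issues.append(f"demo {i}: not a JSON object")
            continue
        # Group demo indices by the exact order in which their keys appear.
        key_orders.setdefault(tuple(obj), []).append(i)
    if len(key_orders) > 1:
        issues.append(f"inconsistent key order across demos: {sorted(key_orders.values())}")
    return issues

# Hypothetical demo outputs: one reorders keys, one has a trailing comma.
demo_outputs = [
    '{"sentiment": "mixed", "confidence": 0.86, "evidence": ["superb"]}',
    '{"confidence": 0.91, "sentiment": "negative", "evidence": ["broke"]}',
    '{"sentiment": "positive", "confidence": 0.95, "evidence": ["love it"],}',
]
issues = audit_demo_outputs(demo_outputs)
print(issues)
```

Both flagged problems are the kind that leak: a trailing comma the model will reproduce, and a key order that tells it the format is negotiable.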

3.6

Contrastive examples: good vs bad, with the why

Positive examples define the target; they say nothing about the boundary. When the failure mode is the model doing something almost right — summaries that editorialize, refusals that over-trigger, SQL that works but scans the whole table — the fastest fix is a contrastive pair: the same input with a good output, a bad output, and an explicit label on each saying why. The WHY annotation is what separates this from merely doubling your shot count: it converts an instance into a rule the model can apply to unseen cases.

# Contrastive pair for a support-summary task
Input: [47-message thread about a delayed refund]

BAD:   "Frustrated customer has been chasing an overdue refund for weeks
       while support repeatedly dropped the ball."
// WHY BAD: editorializes ("dropped the ball"), drops dates, asserts blame

GOOD:  "Customer requested refund 12 May; agent escalated 19 May;
       refund pending finance approval. Customer contacted support 4×."
// WHY GOOD: only verifiable facts, dates preserved, no sentiment language

Three rules keep contrastive prompts from backfiring. Label loudly — the GOOD/BAD markers must be unmissable, because an unlabeled bad output is just another demonstration and will be imitated. Make the why specific — "too informal" teaches less than "uses sentiment adjectives instead of dates". And end on good: §3.4's recency bias applies to quality exactly as it applies to labels, so the last thing in the example block should always be behavior you want continued. Contrastive pairs are also the natural home for near-misses harvested from production — every bad output your evals catch (Chapter 07) is a free BAD half waiting for its annotation.
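The three rules can be baked into the code that renders the pair, so nobody has to remember them. A minimal sketch; the render_contrastive helper and its layout are assumptions made here, modeled on the annotated pair in this section:

```python
def render_contrastive(input_text, good, why_good, bad, why_bad):
    """Render one contrastive pair: loud labels, a WHY on each, BAD first and GOOD last."""
    return "\n".join([
        f"Input: {input_text}",
        "",
        f"BAD:  {bad}",
        f"// WHY BAD: {why_bad}",
        "",
        f"GOOD: {good}",
        f"// WHY GOOD: {why_good}",
    ])

block = render_contrastive(
    input_text="[47-message thread about a delayed refund]",
    good='"Customer requested refund 12 May; agent escalated 19 May."',
    why_good="only verifiable facts, dates preserved, no sentiment language",
    bad='"Frustrated customer has been chasing an overdue refund for weeks."',
    why_bad="editorializes, drops dates",
)
print(block)
```

Because the GOOD half is always emitted last, the block ends on behavior worth continuing regardless of the order in which the pair was harvested.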

3.7

When few-shot hurts

Examples are a lever, not a ritual. Three situations where adding them subtracts value:

Situation	What goes wrong	Do instead
Strong instruct model, simple task	Zero-shot is already near ceiling; your examples drag the model off its native — often better — style, and anchor length, tone, and structure to your demos	Start zero-shot; add examples only when evals show a gap they would close
Reasoning models	Few-shot CoT exemplars interfere with the model's own reasoning trace; vendor guidance (o1-class onward) is explicit that minimal prompts often beat shot-heavy ones	State the task and constraints; control effort with dials, not demos (Chapter 04)
Example overfitting	The model latches onto surface artifacts — your demos' entities reappear in outputs, answer lengths mimic demo lengths, one weird demo skews everything	Diversify demos (§3.3), check outputs for demo-bleed, rotate example sets in evals
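Checking outputs for demo-bleed can begin with a crude overlap test. A minimal sketch, assuming word trigrams are a good enough proxy for "surface artifacts"; the texts and the demo_bleed helper are hypothetical, and real entity matching would need something sturdier:

```python
def demo_bleed(output, demos, current_input, n=3):
    """Word n-grams the output shares with the demos but not with the current input."""
    def ngrams(text):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    demo_grams = set().union(*(ngrams(d) for d in demos))
    return sorted((ngrams(output) & demo_grams) - ngrams(current_input))

# Hypothetical texts, invented for illustration.
demos = ["Refund issued to Acme Corp on 12 May."]
current_input = "Customer at Globex asks about a late invoice."
leaky = "Refund issued to Acme Corp after the invoice was reviewed."
clean = "Invoice reviewed and resent to the customer."

print(demo_bleed(leaky, demos, current_input))
print(demo_bleed(clean, demos, current_input))
```

Phrases that also occur in the current input are excluded, so legitimate overlap with the request does not count as bleed.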

A fixed few-shot block holds $8$ examples, each $125$ tokens long. How many tokens of overhead does that block add to every single call?

$8 \times 125 =$ 1000 tokens per call. Rounding error on one request — but every shot is paid on every call forever, so at scale this is the line item prefix caching exists to discount.

And always, the unglamorous one: cost. Every shot is paid on every call, forever. Eight 120-token examples are ~960 tokens of overhead. Per request, that is rounding error; at ten million requests a month it is roughly 9.6 billion input tokens spent re-teaching a model the same handful of boundary cases. Prefix caching (Vol II · Chapter 08) discounts a static example block substantially, which is a real argument for fixed sets over per-query retrieval at high volume. When the example block stops fitting the budget at all, the escalation path is the one Vol II · Chapter 06 opened: distill the behavior into the weights and delete the demos.
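The arithmetic is worth keeping as a function you can rerun when traffic or pricing changes. A minimal sketch using the indicative $3 / 1M input tokens from §3.2; the 90% cache discount is a hypothetical placeholder, not a quoted vendor rate:

```python
def monthly_overhead(n_examples, tokens_per_example, requests_per_month,
                     usd_per_million_tokens, cached_discount=0.0):
    """Tokens and USD spent per month on the few-shot block alone."""
    tokens = n_examples * tokens_per_example * requests_per_month
    usd = tokens / 1_000_000 * usd_per_million_tokens * (1 - cached_discount)
    return tokens, usd

tokens, usd = monthly_overhead(8, 120, 10_000_000, usd_per_million_tokens=3.0)
_, usd_cached = monthly_overhead(8, 120, 10_000_000, 3.0, cached_discount=0.9)  # hypothetical discount

print(f"{tokens:,} tokens/month")
print(f"${usd:,.0f} uncached, ${usd_cached:,.0f} with the assumed cache discount")
```

At these assumed rates the block costs about $28,800 a month uncached, which is the figure a static, cacheable example set is competing against.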

PITFALLS

The four classic few-shot failures: (1) happy-path demos — every example typical, every edge case unguarded; (2) sorted labels — all positives then all negatives, manufacturing both majority and recency bias at once; (3) the inconsistent demo — one example formatted differently, teaching that format is optional; (4) fossilized examples — the demo set written on day one, never revisited after the task, the model, or the traffic changed.

Examples shape what the model produces; the next lever shapes how long it thinks before producing it. Chapter 04: chain of thought and its descendants — decomposition, self-consistency, effort dials — and an honest account of which of those techniques reasoning models quietly made obsolete.
