A prompt conditions a distribution
Strip away the chat window, the typing indicator, the first-person voice. What remains is a frozen function: a network with parameters \(\theta\) that maps a token sequence to a probability distribution over the next token (Vol II · Ch 01). Generation is that function applied repeatedly. Everything prompting will ever do is contained in one equation:
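Written out under the standard autoregressive factorization the paragraph describes, the equation takes the form below. The notation is an assumption: \(x\) is the prompt's token sequence and \(y = (y_1, \dots, y_T)\) is the reply.

```latex
\[
p_\theta(y \mid x) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid x,\, y_{<t}\right)
\tag{EQ P1.1}
\]
```

Prompting changes only \(x\); \(\theta\) stays fixed on both sides.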
This is why wording matters in a way that feels disproportionate. "Please" is not politeness to the model; it is two tokens of evidence that shift the response toward registers where "please" appeared in training data. A typo, a code fence, ALL-CAPS, the choice between : and — — each perturbs the condition, and the distribution moves. There is no separate channel for "what you meant." The tokens are the entire input.
A useful — and honest — second lens treats the model as inferring what kind of text this is before continuing it. Pre-training data is a mixture of countless documents, each written under some latent context: an author, a genre, a task, a level of care. One way to describe trained behavior:
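A common way to formalize this lens, following the implicit-Bayesian-inference reading of in-context learning, is a mixture over latent contexts. The symbol \(c\) for the latent context is assumed notation, and the approximation sign is deliberate: this is a description of behavior, not a claim about the network's internals.

```latex
\[
p_\theta(y \mid x) \;\approx\; \sum_{c} p(y \mid x, c)\, p(c \mid x)
\tag{EQ P1.2}
\]
```

The prompt \(x\) acts as evidence about \(c\): each edit reweights \(p(c \mid x)\), and the reply distribution follows.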
You will watch EQ P1.1 happen in Instrument P1.2 below: four one-line edits to a prompt, each visibly dragging probability mass across eight candidate replies — with the weights untouched throughout.
Before the instrument, run the mechanism yourself. The cell below is EQ P1.1 in eight candidates: a hand-set logit vector is the frozen model's neutral verdict; a role/constraint clause is one additive evidence term over the same vocabulary (the latent-task lens of EQ P1.2). Softmax both, and watch wording — not weights — relocate the mass:
# prompt-as-conditioning: one role/constraint term reweights an 8-candidate reply distribution
import numpy as np

cands = ["I'm sorry -", "Hi there!", "Sure, let's", "Unfortunately",
         "Per policy,", '{"action"', "REFUND-101:", "Certainly."]
base = np.array([2.1, 1.3, 0.9, 0.4, -0.7, -2.6, -3.2, 0.6])  # frozen model's neutral logits

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# a role/constraint is additive evidence over the SAME vocabulary (EQ P1.2) - theta never moves
role = np.array([-1.2, -1.9, -1.3, 0.4, 3.3, -0.2, 0.0, -0.4])  # "be a terse, policy-exact specialist"

p0 = softmax(base)         # before: bare task
p1 = softmax(base + role)  # after: same weights, conditioned prompt

# header widths match the row format below
print(f"{'candidate':13s} {'before':>8s} {'after':>8s} {'delta':>7s}")
for c, a, b in zip(cands, p0, p1):
    print(f"{c:13s} {a*100:7.1f}% {b*100:7.1f}% {(b-a)*100:+7.1f}")

moved = 0.5 * np.abs(p1 - p0).sum()  # total variation = fraction of mass relocated
print(f"\ntop before: {cands[p0.argmax()]!r} top after: {cands[p1.argmax()]!r}")
print(f"one role line relocated {moved*100:.1f} of 100 points of probability mass")
The anatomy of a real request
You send an API call with a messages array of system, user, and assistant turns. The model never sees that JSON. A chat template serializes the array into one flat token stream, stitched together with special tokens: reserved vocabulary items that a correctly configured tokenizer never produces from ordinary text. Here is what a two-message request actually looks like by the time the transformer reads it:
# messages=[{system}, {user}] → the literal token stream (Llama-3-family template)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a concise geography tutor.<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of Australia?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
← generation starts here; it ends when the model itself emits <|eot_id|>
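The serialization above can be sketched in a few lines. This is a minimal illustration, not a library API: `render_chat` and its `add_generation_prompt` flag are hypothetical names, and the blank line after each header follows the published Llama 3 prompt format. In practice the template that ships with the model's tokenizer should be used instead of a hand-rolled string.

```python
# Hypothetical helper: flatten a messages array into one prompt string,
# following the Llama-3-family layout shown above.
BOS = "<|begin_of_text|>"
SOH, EOH, EOT = "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"

def render_chat(messages, add_generation_prompt=True):
    """Serialize [{'role': ..., 'content': ...}, ...] into a single string."""
    parts = [BOS]
    for m in messages:
        # Each turn: a header naming the role, a blank line, the content, end-of-turn.
        parts.append(f"{SOH}{m['role']}{EOH}\n\n{m['content']}{EOT}")
    if add_generation_prompt:
        # The trailing, unclosed assistant header is the generation cue.
        parts.append(f"{SOH}assistant{EOH}\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise geography tutor."},
    {"role": "user", "content": "What is the capital of Australia?"},
]
stream = render_chat(messages)
print(stream)
```

Roles appear only as text between header tokens; nothing in the output separates the turns into channels.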
Four mechanical facts fall out of this picture, and each one is load-bearing for the rest of the volume:
- There is one flat stream. "System", "user" and "assistant" are token conventions, not channels. The model attends across all of it with the same machinery (Vol II · Ch 03).
- System privilege is trained, not architectural. Post-training taught the model to weight the system region heavily; nothing in attention enforces it. That gap between convention and enforcement is exactly why prompt injection is possible at all (Vol IV).
- The trailing assistant header is the generation cue. The template ends mid-conversation, on purpose: the highest-probability continuation is the assistant's reply. Pre-loading text after that cue is prefilling — Chapter 05's favorite trick.
- Wrong template, silent failure. A model fine-tuned on one template and served with another still answers — just measurably worse, with degraded formatting and instruction-following. Vol II · §6.5 calls the mismatched chat template "silent killer #1" for good reason.
Past assistant turns deserve a special mention: they are just more context. The model has no memory of "having said" them — re-send a conversation with an edited assistant message and the model will treat the edit as its own words. History is a document, and you are its editor.
Each segment in the instrument is a different way of supplying evidence to EQ P1.1 — and EQ P1.2 says different segments should move the distribution differently. Watch it happen. The logits below are hand-crafted to mimic typical model behavior (this page calls no API), but the softmax, entropy and KL divergence are computed live from them:
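The three readouts can also be sketched offline. The logits here are made-up illustrative values, not the instrument's, and the direction of the KL term (after relative to before) is an assumption about how the readout is defined.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over four candidate replies, before and after one prompt edit.
before = softmax([2.0, 1.0, 0.5, -1.0])
after = softmax([0.5, 1.0, 3.0, -1.0])

print(f"entropy before      : {entropy_bits(before):.3f} bits")
print(f"entropy after       : {entropy_bits(after):.3f} bits")
print(f"KL(after || before) : {kl_bits(after, before):.3f} bits")
```

Entropy says how spread out each distribution is; KL says how far the edit moved it.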
What the model attends to
Attention is content-based addressing: information flows wherever query meets key, regardless of distance (Vol II · EQ 3.1). In principle, position 1 and position 100,000 are equally reachable. In practice, trained models carry strong positional priors, and three of them shape how you should lay out a prompt:
- Primacy. The opening of the context is disproportionately influential. Part of this is training data (documents front-load their framing); part is mechanical: early tokens double as attention sinks, accumulating attention weight that softmax has to park somewhere (Vol II · §3.7).
- Recency. The end of the context is closest to the tokens being generated, and models are trained on data where the most recent text is the most relevant. The final tokens before the generation cue punch far above their weight.
- Lost in the middle. Liu et al. (2023) measured retrieval accuracy as a function of where in a long context the answer-bearing document sat, and found a U: strong at the start, strong at the end, a trough in the middle — sometimes below the model's closed-book score.
The engineering consequences are blunt. Put the task last, after everything it depends on — the question should be the freshest thing in the model's window when generation begins. Keep binding constraints near the task, not 40,000 tokens upstream. And when you must ship a huge context (a codebase, a contract, a transcript), state the instructions before the dump and restate them after it — buying both primacy and recency for the price of a few dozen tokens. A long context window is not uniformly readable; it is a stage with bright edges and a dim center.
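The layout advice above can be sketched as a small prompt builder. `build_long_prompt`, the `[Document i]` labels, and the "Reminder:" prefix are illustrative choices, not a prescribed format.

```python
def build_long_prompt(instructions, documents, question):
    """Sandwich a large context between two copies of the instructions, task last."""
    doc_block = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return "\n\n".join([
        instructions,                 # stated up front: primacy
        doc_block,                    # the bulk lands in the dim middle
        "Reminder: " + instructions,  # restated after the dump: recency
        question,                     # the task is the last thing in the window
    ])

prompt = build_long_prompt(
    instructions="Answer only from the documents.",
    documents=["First contract clause.", "Second contract clause."],
    question="Which clause covers termination?",
)
print(prompt)
```

The restated instruction costs a handful of tokens regardless of how large the document block grows.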
Caveat worth keeping: position effects are model- and version-specific, and frontier long-context models have closed much of the gap on needle-in-a-haystack benchmarks — which are easier than real multi-fact reasoning over long inputs. The U is weakest exactly where benchmarks are strongest. Measure on your own task (§1.5) before trusting any curve, including FIG P1.1.
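One way to measure the position effect on your own task is to build prompts that differ only in where the answer-bearing passage sits. The sketch below constructs such a sweep; `place_needle` and `position_sweep` are hypothetical helpers, the filler text is synthetic, and scoring the model's replies is left out.

```python
def place_needle(filler_docs, needle, depth):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler docs."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(filler_docs))
    return filler_docs[:idx] + [needle] + filler_docs[idx:]

def position_sweep(filler_docs, needle, question, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Yield (depth, prompt) pairs that differ only in the needle's position."""
    for d in depths:
        docs = place_needle(filler_docs, needle, d)
        yield d, "\n\n".join(docs + [question])

filler = [f"Filler passage {i}." for i in range(8)]
needle = "The access code is 7421."
for depth, prompt in position_sweep(filler, needle, "What is the access code?"):
    # Report which slot the needle landed in for each depth.
    print(depth, prompt.split("\n\n").index(needle))
```

Accuracy plotted against depth, for your model and your documents, is the curve that matters.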
Tokens are the unit of cost and attention
The model does not read words, lines, or pages — it reads tokens (Vol II · Ch 01), and tokens are the currency of all three budgets you are spending: money (APIs price per token, input and output separately), latency (time-to-first-token scales with prefill length; every prompt token is paid for on every call unless prefix caching saves you — Vol II · Ch 08), and attention (a 200K window sounds infinite until thirty retrieved documents at 4K tokens each eat 120K of it, most of which lands in the dim middle of FIG P1.1).
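The attention-budget arithmetic in the last parenthesis is worth running for your own numbers. A minimal sketch, where `window_budget` is a hypothetical helper and the defaults reproduce the 200K / thirty documents / 4K example:

```python
def window_budget(window, n_docs, tokens_per_doc, prompt_tokens=0, reserve_output=0):
    """Return tokens used, tokens free, and the used share of the context window."""
    used = n_docs * tokens_per_doc + prompt_tokens + reserve_output
    return {"used": used, "free": window - used, "share": used / window}

b = window_budget(window=200_000, n_docs=30, tokens_per_doc=4_000)
print(f"used {b['used']:,} of 200,000 tokens ({b['share']:.0%}); {b['free']:,} free")
```

Adding the instruction tokens and a reserve for the reply (`prompt_tokens`, `reserve_output`) shrinks the free share further.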
Less obviously, formatting is not free styling — it changes the token stream, and therefore the condition in EQ P1.1. The same content, serialized differently, is a different prompt:
| Choice | Token-level effect | Behavioral effect |
|---|---|---|
| Headers, delimiters, XML tags | a few extra structure tokens | Anchors for attention; sections become addressable. Usually the cheapest reliability upgrade available. |
| JSON vs YAML vs prose | JSON spends heavily on quotes, braces, escapes; YAML is often leaner | Shifts the latent register toward "config file" / "API payload"; affects compliance, verbosity, and what the model thinks it is writing. |
| Trailing whitespace | BPE merges leading spaces into words; "is:" and "is: " end in different tokens | A trailing space can strand the model off its preferred token boundary and measurably degrade the completion. End prompts cleanly. |
| ALL CAPS, typos, sloppy text | fragments into rarer, longer token sequences | Evidence (EQ P1.2) that this is low-care text — distributions drift toward the registers where such text lived in training. |
| Repeated boilerplate per call | thousands of identical prefill tokens | Pure cost and latency unless served with prefix caching; also crowds the window the task actually needs. |
Do not micro-optimize tokens at the expense of clarity: a clear 400-token instruction almost always beats a cryptic 150-token one, and the table's effects are second-order next to what you actually say. The point is narrower: format choices are real inputs with real consequences, not decoration. Budget them like you budget words.
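To make the JSON-versus-leaner-formats row concrete, the sketch below serializes one record three ways and compares sizes. The record is invented for illustration, the "yaml-like" rendering is hand-rolled rather than produced by a YAML library, and character count is only a rough proxy for token count.

```python
import json

# Hypothetical record, used only to compare serializations of identical content.
record = {
    "order_id": "A-1042",
    "status": "refund_requested",
    "items": ["mug", "poster"],
    "total_usd": 31.5,
}

as_json = json.dumps(record, indent=2)
as_yaml_like = "\n".join(
    f"{k}: {', '.join(v) if isinstance(v, list) else v}" for k, v in record.items()
)
as_prose = "Order A-1042 (mug, poster; $31.50 total) has a refund requested."

for name, text in [("json", as_json), ("yaml-like", as_yaml_like), ("prose", as_prose)]:
    print(f"{name:10s} {len(text):4d} chars")
```

The size gap is the cheap half of the comparison; the register each format signals still has to be measured against model behavior.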
You do not need a real tokenizer to develop budget instincts — the len/4 rule of thumb (≈ 4 characters per token for English) is close enough to feel the difference. The cell below estimates a bloated, over-polite prompt against the same instruction tightened, prints both counts and the percentage saved, then projects the cost gap across a day of traffic — because every input token is paid for on every call:
# token-budget: len/4 heuristic on a bloated vs tightened prompt, plus % and cost saved
bloated = (
    "Hello there! I was hoping that you might possibly be able to help me out "
    "with something today, if that is at all okay with you. What I would really "
    "like for you to do, if you would be so kind, is to take the following "
    "customer message and let me know whether or not it sounds like the person "
    "is feeling positive, negative, or somewhere neutral in between, and then "
    "kindly explain your reasoning to me in a few sentences. Thank you so much!"
)
tight = "Classify the sentiment of the message below as positive, negative, or neutral.\nMessage:"

def est_tokens(s):
    return max(1, round(len(s) / 4))  # ~4 chars/token rule of thumb

tb, tt = est_tokens(bloated), est_tokens(tight)
saved = (tb - tt) / tb * 100
print(f"bloated prompt : {len(bloated):4d} chars ~{tb:4d} tokens")
print(f"tight prompt : {len(tight):4d} chars ~{tt:4d} tokens")
print(f"tokens saved : {tb - tt:4d} ({saved:.0f}% smaller)")

calls = 100_000  # every call re-pays the prompt (no prefix cache)
print(f"at {calls:,} calls/day, $3 / 1M input tok: "
      f"${tb*calls/1e6*3:.2f}/day bloated -> ${tt*calls/1e6*3:.2f}/day tight")
The empirical mindset
You cannot inspect \(p_\theta(y \mid x)\) directly — no gradients, no documentation, billions of opaque parameters. Prompting is therefore an experimental science run against a black box: form a hypothesis, change one variable, hold everything else fixed, and measure. The discipline matters because the noise is vicious — at any temperature above zero a single run is an anecdote, and even "deterministic" settings wobble under batched serving and expert routing on modern stacks. The protocol that survives contact with this:
# The prompt experiment, minimum viable rigor
baseline: current prompt, frozen — including its chat template
change: ONE edit (role line, example, ordering, format) per variant
decoding: fix temperature / top-p across variants; n ≥ 20 samples each
metric: programmatic check > rubric scored blind > vibes (never vibes)
record: prompt version, model ID, date — served models drift under you
decide: keep the edit only if the gain holds on a second, held-out set
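The protocol above can be sketched as a tiny harness. Everything model-facing here is a stand-in: `fake_model` is a seeded stub with made-up pass probabilities, not an API call, and `run_variant` is a hypothetical helper. The shape is what matters: one edit, fixed sampling settings, a programmatic check, and a pass rate reported with its standard error.

```python
import json
import math
import random

def run_variant(sample_fn, prompt, check, n=20, seed=0):
    """Sample n replies for one prompt; return (pass rate, standard error)."""
    rng = random.Random(seed)
    passes = sum(check(sample_fn(prompt, rng)) for _ in range(n))
    rate = passes / n
    return rate, math.sqrt(rate * (1 - rate) / n)

def fake_model(prompt, rng):
    """Stub standing in for a sampled model reply (assumed behavior, not measured)."""
    p_json = 0.9 if "Reply in JSON" in prompt else 0.4
    return '{"label": "positive"}' if rng.random() < p_json else "I think it is positive."

def is_json_object(reply):
    """Programmatic metric: does the reply parse as a JSON object?"""
    try:
        return isinstance(json.loads(reply), dict)
    except json.JSONDecodeError:
        return False

baseline = "Classify the sentiment of the message."
variant = baseline + " Reply in JSON."  # ONE edit relative to the baseline

results = {
    name: run_variant(fake_model, prompt, is_json_object, n=200, seed=1)
    for name, prompt in [("baseline", baseline), ("variant", variant)]
}
for name, (rate, se) in results.items():
    print(f"{name:8s} pass rate {rate:.2f} +/- {se:.2f}")
```

With a real model, `sample_fn` would wrap the API call at fixed temperature and top-p, and the winning variant would be re-run on a held-out set before being kept.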
Run Instrument P1.2 again with this lens: each button is a one-variable experiment, and the KL readout is the measured effect size. That is the whole methodology in miniature — the rest of this volume is a catalog of which edits are worth testing first.
Two honest limits. First, incantations ("take a deep breath", offering tips, threats) show real but small, model-specific effects that routinely evaporate across model versions; the folklore survives because single runs are anecdotes and confirmation bias does the rest. Test them like anything else; expect them to lose to one good example. Second, the hard ceiling: prompting selects among behaviors the frozen \(\theta\) can already express; it cannot add capability. If the model never succeeds even at best-of-50 sampling under your strongest prompt, more rephrasing is unlikely to fix it; you need a stronger model, tools, or fine-tuning (Vol II · §6.1's escalation ladder). Knowing which side of that line you are on is the most valuable prompt skill there is.
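The best-of-50 test can be made quantitative. The sketch below uses the standard unbiased pass@k estimator from code-generation evaluation, a technique brought in here for illustration rather than one this chapter defines: given n samples of which c pass a check, it estimates the chance that at least one of k samples would pass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes) from n samples with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=50, c=0, k=50))  # no sample ever passed
print(pass_at_k(n=50, c=1, k=1))   # one pass in fifty, single draw
print(pass_at_k(n=50, c=1, k=50))  # one pass in fifty, best-of-50
```

A nonzero pass@50 paired with a near-zero pass@1 suggests the capability exists and prompting has room to raise it; a zero pass@50 points toward the escalation ladder.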
You now know why prompts work: they condition a frozen distribution, read through a template, with bright edges and a priced-by-the-token interior. Chapter 02 turns mechanics into method — the five-part scaffold of Role · Task · Context · Format · Constraints, assembled live with before/after pairs.
Further reading
- Radford, Wu, Child, Luan, Amodei & Sutskever (2019). Language Models are Unsupervised Multitask Learners. — the GPT-2 report; shows tasks can be elicited from raw text conditioning alone, the seed of all prompting.
- Brown et al. (2020). Language Models are Few-Shot Learners. — the GPT-3 paper that established in-context learning: examples in the prompt steer behavior with the weights frozen (EQ P1.1).
- Xie, Raghunathan, Liang & Ma (2021). An Explanation of In-context Learning as Implicit Bayesian Inference. — the latent-task lens behind EQ P1.2: the prompt as evidence about which "document type" to continue.
- Ouyang et al. (2022). Training Language Models to Follow Instructions with Human Feedback. — InstructGPT; explains why instruction- and system-prompt deference is installed by post-training, not architecture.
- Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. — the measured U-shaped position effect behind FIG P1.1 and the "put the task last" rule.
- Sclar, Choi, Tsvetkov & Suhr (2024). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design. — FormatSpread; shows formatting choices alone swing accuracy widely, grounding §1.4's "format is a real input."