01 · How Models Read Prompts

1.1

A prompt conditions a distribution

Strip away the chat window, the typing indicator, the first-person voice. What remains is a frozen function: a network with parameters $\theta$ that maps a token sequence to a probability distribution over the next token (Vol II · Ch 01). Generation is that function applied repeatedly. Everything prompting will ever do is contained in one equation:

EQ P1.1 — THE CONDITIONAL $$ p_\theta\!\left(y \mid x\right) \;=\; \prod_{t=1}^{|y|} p_\theta\!\left(y_t \,\middle|\, x,\; y_{<t}\right) $$

$x$ is your prompt, $y$ the response, $\theta$ the weights — frozen. No gradient flows at inference; nothing you type changes a single parameter. What you control is $x$, and through attention (Vol II · EQ 3.1) every token of $x$ is visible to every step of $y$. Change one token and you are indexing into a different conditional distribution — sometimes imperceptibly, sometimes catastrophically. Prompting is the art of choosing the condition.

This is why wording matters in a way that feels disproportionate. "Please" is not politeness to the model; it is a token or two of evidence that shifts the response toward registers where "please" appeared in training data. A typo, a code fence, ALL-CAPS, the choice between : and — — each perturbs the condition, and the distribution moves. There is no separate channel for "what you meant." The tokens are the entire input.
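The effect of a single token on EQ P1.1 can be checked by hand on a toy model. The sketch below is a stand-in, not a real network: `toy_next_token` is a hypothetical frozen function with hand-written rules over a four-word vocabulary, and `sequence_logprob` multiplies the per-step conditionals (as a sum of logs). The same response is scored under two prompts that differ by one token, while the function itself never changes.

```python
import math

def toy_next_token(context):
    """HYPOTHETICAL frozen 'model': next-token distribution given the context tokens.
    The rules are invented for illustration; a real network computes this from theta."""
    logits = {"yes": 0.0, "no": 0.0, "maybe": 0.0, "<eos>": -1.0}
    if "please" in context:
        logits["yes"] += 1.5          # made-up rule: polite prompts lean affirmative
    if context and context[-1] in ("yes", "no", "maybe"):
        logits["<eos>"] += 3.0        # after an answer word, stopping becomes likely
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def sequence_logprob(x, y):
    """log p(y | x) = sum over t of log p(y_t | x, y_<t), i.e. EQ P1.1 in log space."""
    total = 0.0
    for t, tok in enumerate(y):
        dist = toy_next_token(x + y[:t])   # condition on the prompt plus the response so far
        total += math.log(dist[tok])
    return total

y = ["yes", "<eos>"]
lp_plain = sequence_logprob(["can", "you", "help"], y)
lp_polite = sequence_logprob(["can", "you", "help", "please"], y)
print(f"p(y | plain prompt)  = {math.exp(lp_plain):.3f}")
print(f"p(y | polite prompt) = {math.exp(lp_polite):.3f}")
```

The two numbers differ only because the condition differs; nothing inside `toy_next_token` was edited between the calls.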

A useful — and honest — second lens treats the model as inferring what kind of text this is before continuing it. Pre-training data is a mixture of countless documents, each written under some latent context: an author, a genre, a task, a level of care. One way to describe trained behavior:

EQ P1.2 — THE LATENT-TASK LENS $$ p_\theta\!\left(y \mid x\right) \;\approx\; \sum_{\tau} p_\theta\!\left(y \mid \tau, x\right)\, p_\theta\!\left(\tau \mid x\right) $$

$\tau$ ranges over latent "document types" — task, persona, register, quality tier. The prompt acts as evidence about which $\tau$ you are in; roles and examples sharpen the posterior $p(\tau \mid x)$. To be clear: no such sum is computed anywhere in the network — this is a behavioral description (the implicit-Bayesian-inference account of in-context learning, Xie et al. 2021), not an architecture fact. But it predicts real phenomena: why examples often beat instructions, why personas shift style more than skill, why sloppy prompts beget sloppy answers.
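The latent-task lens can be made concrete with a two-type toy mixture. Everything numeric below is invented for illustration: two hypothetical document types, a prior over them, per-token likelihoods, and a reply distribution per type. As a simplification, the reply depends on the prompt only through $\tau$, which is stricter than EQ P1.2 itself.

```python
import numpy as np

tasks = ["casual chat", "policy memo"]       # two latent document types (tau)
prior = np.array([0.7, 0.3])                 # p(tau) before any prompt evidence

# HYPOTHETICAL likelihoods p(token | tau) for tokens that may appear in a prompt
likelihood = {
    "hey":      np.array([0.30, 0.02]),
    "refund":   np.array([0.05, 0.20]),
    "pursuant": np.array([0.01, 0.25]),
}
# HYPOTHETICAL p(reply opening | tau); columns are "Sure!" and "Per policy,"
reply_given_task = np.array([[0.9, 0.1],
                             [0.2, 0.8]])

def posterior(prompt_tokens):
    """Bayes rule: p(tau | x) is proportional to p(tau) times the product of p(x_i | tau)."""
    post = prior.copy()
    for tok in prompt_tokens:
        post = post * likelihood[tok]
    return post / post.sum()

def predict(prompt_tokens):
    """EQ P1.2 with y depending on x only via tau: sum over tau of p(y | tau) p(tau | x)."""
    return posterior(prompt_tokens) @ reply_given_task

for prompt in (["hey", "refund"], ["pursuant", "refund"]):
    print(prompt, "posterior:", posterior(prompt).round(3), "reply dist:", predict(prompt).round(3))
```

Swapping one prompt token flips the posterior over document types, and the reply distribution follows. No network computes this sum explicitly; the sketch only mirrors the behavioral description.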

WHAT PROMPTING MOVES

p(y | x)

Every token of the condition re-weights the entire output distribution.

WHAT IT NEVER TOUCHES

θ

The weights are frozen. Capability is fixed before you type a word.

WHAT THAT IMPLIES

selection, not creation

A prompt elicits behaviors the network can already express — it cannot add new ones.

You will watch EQ P1.1 happen in Instrument P1.2 below: four one-line edits to a prompt, each visibly dragging probability mass across eight candidate replies — with the weights untouched throughout.

Before the instrument, run the mechanism yourself. The cell below is EQ P1.1 in eight candidates: a hand-set logit vector is the frozen model's neutral verdict; a role/constraint clause is one additive evidence term over the same vocabulary (the latent-task lens of EQ P1.2). Softmax both, and watch wording — not weights — relocate the mass:

PYTHON · RUNNABLE IN-BROWSER

# prompt-as-conditioning: one role/constraint term reweights an 8-candidate reply distribution
import numpy as np

cands = ["I'm sorry -", "Hi there!", "Sure, let's", "Unfortunately",
         "Per policy,", '{"action"', "REFUND-101:", "Certainly."]
base = np.array([2.1, 1.3, 0.9, 0.4, -0.7, -2.6, -3.2, 0.6])   # frozen model's neutral logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# a role/constraint is additive evidence over the SAME vocabulary (EQ P1.2) - theta never moves
role = np.array([-1.2, -1.9, -1.3, 0.4, 3.3, -0.2, 0.0, -0.4])  # "be a terse, policy-exact specialist"
p0 = softmax(base)            # before: bare task
p1 = softmax(base + role)     # after:  same weights, conditioned prompt

print("candidate        before    after    delta")
for c, a, b in zip(cands, p0, p1):
    print(f"{c:13s} {a*100:7.1f}% {b*100:7.1f}% {(b-a)*100:+7.1f}")
moved = 0.5 * np.abs(p1 - p0).sum()     # total variation = fraction of mass relocated
print(f"\ntop before: {cands[p0.argmax()]!r}   top after: {cands[p1.argmax()]!r}")
print(f"one role line relocated {moved*100:.1f} of 100 points of probability mass")


A frozen model's neutral verdict puts logit $2.0$ on candidate A and $1.0$ on candidate B (just two candidates). Apply softmax (temperature $\tau = 1$). What probability does the model assign to candidate A?

Softmax: $p_A = \frac{e^{2.0}}{e^{2.0}+e^{1.0}} = \frac{7.389}{7.389+2.718} = \frac{7.389}{10.107} \approx$ 0.73. EQ P1.1 in two candidates: the logit gap of 1.0 nat becomes a 73/27 split.

1.2

The anatomy of a real request

You send an API call with a messages array — system, user, assistant turns. The model never sees that JSON. A chat template serializes the array into one flat token stream, stitched together with special tokens: reserved vocabulary items that ordinary text is not supposed to tokenize into (a correctly configured serving stack escapes them if they show up in user content). Here is what a two-message request actually looks like by the time the transformer reads it:

# messages=[{system}, {user}] → the literal token stream (Llama-3-family template)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a concise geography tutor.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of Australia?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

← generation starts here; it ends when the model itself emits <|eot_id|>

Four mechanical facts fall out of this picture, and each one is load-bearing for the rest of the volume:

There is one flat stream. "System", "user" and "assistant" are token conventions, not channels. The model attends across all of it with the same machinery (Vol II · Ch 03).
System privilege is trained, not architectural. Post-training taught the model to weight the system region heavily; nothing in attention enforces it. That gap between convention and enforcement is exactly why prompt injection is possible at all (Vol IV).
The trailing assistant header is the generation cue. The template ends mid-conversation, on purpose: the highest-probability continuation is the assistant's reply. Pre-loading text after that cue is prefilling — Chapter 05's favorite trick.
Wrong template, silent failure. A model fine-tuned on one template and served with another still answers — just measurably worse, with degraded formatting and instruction-following. Vol II · §6.5 calls the mismatched chat template "silent killer #1" for good reason.

Past assistant turns deserve a special mention: they are just more context. The model has no memory of "having said" them — re-send a conversation with an edited assistant message and the model will treat the edit as its own words. History is a document, and you are its editor.
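The serialization step is plain string assembly, which a few lines can imitate. The function below is a hand-rolled sketch of the Llama-3-family layout shown above, not a library API; in practice the template ships with the model's tokenizer and should be applied through it. The name `render_llama3_style` and its `prefill` parameter are assumptions made for this example.

```python
def render_llama3_style(messages, prefill=""):
    """Flatten a messages array into one string, following the layout shown above."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # The trailing assistant header is the generation cue; prefill is text the model must continue.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n" + prefill
    return out

messages = [
    {"role": "system", "content": "You are a concise geography tutor."},
    {"role": "user", "content": "What is the capital of Australia?"},
]
stream = render_llama3_style(messages)
print(stream)

# A past assistant turn is rendered like any other segment, edited or not.
history = messages + [
    {"role": "assistant", "content": "Sydney."},      # an edited (and wrong) past turn
    {"role": "user", "content": "Are you sure?"},
]
print(render_llama3_style(history, prefill="Correction:"))
```

Nothing in the second stream marks the "Sydney." turn as edited: it sits between the same header and end-of-turn tokens as a genuine reply would.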

INSTRUMENT P1.1 — PROMPT ANATOMY · HOVER A SEGMENT · TOGGLE THE TEMPLATE VIEW


Hover (or tap) each segment to see its mechanical job and its failure mode when dropped. Then toggle to the chat-template view and notice the reshuffle: FORMAT rides inside the system turn, the EXAMPLE becomes a fake conversation turn the model cannot distinguish from real history, and grey machinery — the special tokens — holds it all together. Hover the machinery too.

Each segment in the instrument is a different way of supplying evidence to EQ P1.1 — and EQ P1.2 says different segments should move the distribution differently. Watch it happen. The logits below are hand-crafted to mimic typical model behavior (this page calls no API), but the softmax, entropy and KL divergence are computed live from them:

INSTRUMENT P1.2 — MASS SHIFTER · ILLUSTRATIVE LOGITS · LIVE SOFTMAX · EQ P1.1

[Interactive controls: prompt variant (one edit at a time) and sampling temperature τ (default 1.00). Live readouts: top candidate, entropy of the distribution, KL from neutral.]

Solid bars are the current variant; ghost outlines are the NEUTRAL baseline at the same temperature. One role line moves ~60 points of probability mass onto the policy-voiced opening; one worked example moves ~85 points onto its format. Now sweep τ: temperature flattens or sharpens the distribution but never reorders it — sampling settings rescale the conditional, only the prompt can rewrite it.
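The three readouts and the "never reorders" claim can be reproduced offline. The sketch below reuses the illustrative logits from the runnable cell in §1.1 (hand-set values, not model output) and assumes entropy and KL are reported in nats; the instrument's own units are not stated here.

```python
import numpy as np

def softmax(z, temp=1.0):
    z = np.asarray(z, dtype=float) / temp      # temperature divides every logit by the same constant
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())       # Shannon entropy in nats

def kl(p, q):
    return float((p * np.log(p / q)).sum())    # KL(p || q) in nats

neutral = np.array([2.1, 1.3, 0.9, 0.4, -0.7, -2.6, -3.2, 0.6])
role = np.array([-1.2, -1.9, -1.3, 0.4, 3.3, -0.2, 0.0, -0.4])

rank_at_1 = np.argsort(-(neutral + role))      # candidate order implied by the logits
for temp in (0.5, 1.0, 2.0):
    q = softmax(neutral, temp)
    p = softmax(neutral + role, temp)
    same = np.array_equal(np.argsort(-p), rank_at_1)
    print(f"temp={temp:.1f}  top={p.argmax()}  entropy={entropy(p):.2f} nats  "
          f"KL from neutral={kl(p, q):.2f} nats  order unchanged: {same}")
```

Dividing all logits by a positive constant preserves their order, so the ranking survives every temperature, while entropy rises as the temperature does. Only the added `role` vector changes which candidate is on top.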

Across two reply candidates, the neutral prompt gives the distribution $q = (0.6,\, 0.4)$; after one role line the model gives $p = (0.2,\, 0.8)$. The mass relocated is the total variation $\tfrac{1}{2}\sum_i |p_i - q_i|$. What fraction of the probability mass moved?

$\tfrac{1}{2}\big(|0.2-0.6| + |0.8-0.4|\big) = \tfrac{1}{2}(0.4 + 0.4) = \tfrac{1}{2}(0.8) =$ 0.4. The weights never moved (EQ P1.1); one role line dragged 40% of the mass across — that is the readout the Mass Shifter calls "mass relocated".

1.3

What the model attends to

Attention is content-based addressing: information flows wherever query meets key, regardless of distance (Vol II · EQ 3.1). In principle, position 1 and position 100,000 are equally reachable. In practice, trained models carry strong positional priors, and three of them shape how you should lay out a prompt:

Primacy. The opening of the context is disproportionately influential. Part of this is training data (documents front-load their framing); part is mechanical: early tokens double as attention sinks, absorbing the attention weight that softmax must place somewhere because each row of weights sums to one (Vol II · §3.7).
Recency. The end of the context is closest to the tokens being generated, and models are trained on data where the most recent text is the most relevant. The final tokens before the generation cue punch far above their weight.
Lost in the middle. Liu et al. (2023) measured retrieval accuracy as a function of where in a long context the answer-bearing document sat, and found a U: strong at the start, strong at the end, a trough in the middle — sometimes below the model's closed-book score.

FIG P1.1 · POSITION OF THE RELEVANT FACT vs RETRIEVAL ACCURACY — ILLUSTRATIVE

Illustrative curves, after Liu et al., "Lost in the Middle" (2023). Newer long-context models flatten the U substantially — mostly via training on synthetic long-range retrieval data, not architectural change — but the trough has grown shallower, not vanished. Plan as if the middle of a long context is the least valuable real estate you own.

The engineering consequences are blunt. Put the task last, after everything it depends on — the question should be the freshest thing in the model's window when generation begins. Keep binding constraints near the task, not 40,000 tokens upstream. And when you must ship a huge context (a codebase, a contract, a transcript), state the instructions before the dump and restate them after it — buying both primacy and recency for the price of a few dozen tokens. A long context window is not uniformly readable; it is a stage with bright edges and a dim center.
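The layout advice above reduces to a small assembly function. This is one possible sketch: the tag names (`instructions`, `document`, `reminder`, `task`) and the function name `sandwich_prompt` are assumptions chosen for the example, not a required format.

```python
def sandwich_prompt(instructions, documents, task):
    """Instructions before the dump, restated after it, and the task last."""
    parts = [f"<instructions>\n{instructions}\n</instructions>"]
    for i, doc in enumerate(documents, start=1):
        parts.append(f'<document index="{i}">\n{doc}\n</document>')
    parts.append(f"<reminder>\n{instructions}\n</reminder>")   # buys recency for the constraints
    parts.append(f"<task>\n{task}\n</task>")                   # freshest text before generation
    return "\n\n".join(parts)

prompt = sandwich_prompt(
    instructions="Answer only from the documents. Cite the document index.",
    documents=["(long contract text)", "(long email thread)"],
    task="Which party may terminate early, and under what notice period?",
)
print(prompt)
```

The restated instructions cost a few dozen tokens regardless of how large the documents are, which is why the trade is usually worth testing on long inputs.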

Caveat worth keeping: position effects are model- and version-specific, and frontier long-context models have closed much of the gap on needle-in-a-haystack benchmarks — which are easier than real multi-fact reasoning over long inputs. The U is weakest exactly where benchmarks are strongest. Measure on your own task (§1.5) before trusting any curve, including FIG P1.1.
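Measuring position effects on your own task needs only a probe builder and a loop. In the sketch below, `ask_model` is a hypothetical callable (prompt in, reply text out) that you would replace with a real model call. The `edge_reader` stub exists only so the harness runs offline: it reads the first and last 120 characters by construction, so the U it produces is built in and is not a measurement.

```python
def build_probe(filler_docs, needle, depth):
    """Place the answer-bearing doc at a relative depth in [0, 1] among the filler docs."""
    k = round(depth * len(filler_docs))
    return "\n\n".join(filler_docs[:k] + [needle] + filler_docs[k:])

def position_sweep(ask_model, filler_docs, needle, question, expected, depths):
    """Return {depth: bool}, True when the expected string appears in the reply."""
    results = {}
    for d in depths:
        context = build_probe(filler_docs, needle, d)
        reply = ask_model(f"{context}\n\nQuestion: {question}")
        results[d] = expected.lower() in reply.lower()
    return results

def edge_reader(prompt, window=120):
    """STUB, not a model: only 'sees' the first and last `window` characters."""
    visible = prompt[:window] + prompt[-window:]
    return "Canberra" if "Canberra" in visible else "I don't know"

fillers = [f"Filler note {i}: nothing relevant here." for i in range(20)]
results = position_sweep(
    edge_reader, fillers,
    needle="The capital of Australia is Canberra.",
    question="What is the capital of Australia?",
    expected="Canberra",
    depths=(0.0, 0.25, 0.5, 0.75, 1.0),
)
for depth, ok in results.items():
    print(f"depth {depth:.2f}: {'hit' if ok else 'miss'}")
```

With a real `ask_model`, repeat each depth over several needles and samples before reading anything into the curve.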

1.4

Tokens are the unit of cost and attention

The model does not read words, lines, or pages — it reads tokens (Vol II · Ch 01), and tokens are the currency of all three budgets you are spending: money (APIs price per token, input and output separately), latency (time-to-first-token scales with prefill length; every prompt token is paid for on every call unless prefix caching saves you — Vol II · Ch 08), and attention (a 200K window sounds infinite until thirty retrieved documents at 4K tokens each eat 120K of it, most of which lands in the dim middle of FIG P1.1).

Less obviously, formatting is not free styling — it changes the token stream, and therefore the condition in EQ P1.1. The same content, serialized differently, is a different prompt:

Choice	Token-level effect	Behavioral effect
Headers, delimiters, XML tags	a few extra structure tokens	Anchors for attention; sections become addressable. Usually the cheapest reliability upgrade available.
JSON vs YAML vs prose	JSON spends heavily on quotes, braces, escapes; YAML is often leaner	Shifts the latent register toward "config file" / "API payload"; affects compliance, verbosity, and what the model thinks it is writing.
Trailing whitespace	BPE merges leading spaces into words; `"is:"` and `"is: "` end in different tokens	A trailing space can strand the model off its preferred token boundary and measurably degrade the completion. End prompts cleanly.
ALL CAPS, typos, sloppy text	fragments into rarer, longer token sequences	Evidence (EQ P1.2) that this is low-care text — distributions drift toward the registers where such text lived in training.
Repeated boilerplate per call	thousands of identical prefill tokens	Pure cost and latency unless served with prefix caching; also crowds the window the task actually needs.

Do not micro-optimize tokens at the expense of clarity — a clear 400-token instruction beats a cryptic 150-token one every time, and the table's effects are second-order next to what you actually say. The point is narrower: format choices are real inputs with real consequences, not decoration. Budget them like you budget words.
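The table's JSON row can be felt without a tokenizer by counting characters spent on structure rather than content. The record below is made up, and the "key: value" rendering is a hand-rolled, YAML-like layout rather than output from a YAML library. Character counts are only a proxy; real token counts depend on the target model's tokenizer.

```python
import json

# Hypothetical record, invented for the comparison
record = {
    "customer": "Dana Lee",
    "order_id": "A-1042",
    "issue": 'Box arrived "crushed"',
    "priority": "high",
}

as_json = json.dumps(record, indent=2)
as_lines = "\n".join(f"{k}: {v}" for k, v in record.items())   # lean key: value layout

payload = sum(len(k) + len(v) for k, v in record.items())      # characters of actual content
for name, text in (("json", as_json), ("key: value", as_lines)):
    overhead = len(text) - payload
    print(f"{name:11s} {len(text):4d} chars total, {overhead:3d} chars of structure")
```

The JSON form pays for quotes, braces, commas, indentation, and the escaped inner quotes. Whether that price is worth it depends on whether you want the "API payload" register it signals.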

You do not need a real tokenizer to develop budget instincts — the len/4 rule of thumb (≈ 4 characters per token for English) is close enough to feel the difference. The cell below estimates a bloated, over-polite prompt against the same instruction tightened, prints both counts and the percentage saved, then projects the cost gap across a day of traffic — because every input token is paid for on every call:

PYTHON · RUNNABLE IN-BROWSER

# token-budget: len/4 heuristic on a bloated vs tightened prompt, plus % and cost saved
bloated = (
    "Hello there! I was hoping that you might possibly be able to help me out "
    "with something today, if that is at all okay with you. What I would really "
    "like for you to do, if you would be so kind, is to take the following "
    "customer message and let me know whether or not it sounds like the person "
    "is feeling positive, negative, or somewhere neutral in between, and then "
    "kindly explain your reasoning to me in a few sentences. Thank you so much!"
)
tight = "Classify the sentiment of the message below as positive, negative, or neutral.\nMessage:"

def est_tokens(s):
    return max(1, round(len(s) / 4))          # ~4 chars/token rule of thumb

tb, tt = est_tokens(bloated), est_tokens(tight)
saved = (tb - tt) / tb * 100
print(f"bloated prompt : {len(bloated):4d} chars  ~{tb:4d} tokens")
print(f"tight prompt   : {len(tight):4d} chars  ~{tt:4d} tokens")
print(f"tokens saved   : {tb - tt:4d}  ({saved:.0f}% smaller)")

calls = 100_000                               # every call re-pays the prompt (no prefix cache)
print(f"at {calls:,} calls/day, $3 / 1M input tok: "
      f"${tb*calls/1e6*3:.2f}/day bloated  ->  ${tt*calls/1e6*3:.2f}/day tight")


Using the $\text{len}/4$ heuristic (≈ 4 characters per English token), estimate the token count of a system prompt that is $600$ characters long.

$600 / 4 =$ 150 tokens. The rule of thumb is rough, but it is close enough to build budget instincts — and every one of these 150 tokens is re-paid on every call without prefix caching.

That $150$-token prompt is sent on $100{,}000$ calls per day with no prefix cache, at $\$3$ per million input tokens. What is the daily input-token cost, in dollars?

Total input tokens $= 150 \times 100{,}000 = 15{,}000{,}000 = 15$ MTok. Cost $= 15 \times \$3 =$ \$45 per day. Trimming the prompt by a third saves \$15/day — the §1.4 argument that format choices are real inputs with real bills.

1.5

The empirical mindset

You cannot inspect $p_\theta(y \mid x)$ directly — no gradients, no documentation, billions of opaque parameters. Prompting is therefore an experimental science run against a black box: form a hypothesis, change one variable, hold everything else fixed, and measure. The discipline matters because the noise is vicious — at any temperature above zero a single run is an anecdote, and even "deterministic" settings wobble under batched serving and expert routing on modern stacks. The protocol that survives contact with this:

# The prompt experiment, minimum viable rigor
baseline:   current prompt, frozen — including its chat template
change:     ONE edit (role line, example, ordering, format) per variant
decoding:   fix temperature / top-p across variants; n ≥ 20 samples each
metric:     programmatic check > rubric scored blind > vibes (never vibes)
record:     prompt version, model ID, date — served models drift under you
decide:     keep the edit only if the gain holds on a second, held-out set

Run Instrument P1.2 again with this lens: each button is a one-variable experiment, and the KL readout is the measured effect size. That is the whole methodology in miniature — the rest of this volume is a catalog of which edits are worth testing first.
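The protocol above fits in a short harness. The sketch below is offline and illustrative: `fake_generate` is a hypothetical stand-in for a model call with invented pass rates, the seeded RNG stands in for fixed decoding settings, and the two tiny input lists stand in for a development set and a held-out set.

```python
import json
import random

def is_json_object(reply):
    """Programmatic check: does the reply parse as a JSON object?"""
    try:
        return isinstance(json.loads(reply), dict)
    except json.JSONDecodeError:
        return False

def fake_generate(prompt, rng):
    """HYPOTHETICAL stand-in for a model call; the pass rates are invented."""
    p_json = 0.9 if "Reply with JSON only." in prompt else 0.4
    return '{"label": "positive"}' if rng.random() < p_json else "I think it is positive."

def pass_rate(generate, template, inputs, check, n=20, seed=0):
    """n samples per input under one seeded RNG, scored by a programmatic metric."""
    rng = random.Random(seed)
    trials = [check(generate(template.format(message=m), rng)) for m in inputs for _ in range(n)]
    return sum(trials) / len(trials)

baseline = "Classify the sentiment.\nMessage: {message}"
variant = baseline + "\nReply with JSON only."          # ONE edit per variant

dev_set = ["Great service!", "Never again."]
held_out = ["It was fine.", "The package arrived late."]

gains = {}
for name, inputs, seed in (("dev", dev_set, 0), ("held-out", held_out, 1)):
    b = pass_rate(fake_generate, baseline, inputs, is_json_object, seed=seed)
    v = pass_rate(fake_generate, variant, inputs, is_json_object, seed=seed)
    gains[name] = v - b
    print(f"{name:8s} baseline={b:.2f}  variant={v:.2f}  gain={v - b:+.2f}")

print("keep the edit" if all(g > 0 for g in gains.values()) else "discard the edit")
```

Swapping `fake_generate` for a real call, and logging the prompt version, model ID, and date next to each result, turns this into the minimum viable record the protocol asks for.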

NO MAGIC WORDS

Two honest limits. First: incantations — "take a deep breath", offering tips, threats — show real but small, model-specific effects that routinely evaporate across model versions; the folklore survives because single runs are anecdotes and confirmation bias does the rest. Test them like anything else; expect them to lose to one good example. Second, the hard ceiling: prompting selects among behaviors the frozen $\theta$ can already express — it cannot add capability. If the model never succeeds across many samples (say, 0 of 50) under several well-formed prompts, further rephrasing is unlikely to fix it; you need a stronger model, tools, or fine-tuning (Vol II · §6.1's escalation ladder). Knowing which side of that line you are on is the most valuable prompt skill there is.
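A run of straight failures can be turned into a rough number. Assuming independent samples, the sketch below solves $(1-p)^n = 1 - c$ for the largest per-sample success rate $p$ still consistent with $n$ failures at confidence $c$. The function name is made up for this example, and the independence assumption is a simplification.

```python
def upper_bound_success_rate(n_failures, confidence=0.95):
    """Largest per-sample success rate still consistent with n straight failures."""
    return 1 - (1 - confidence) ** (1 / n_failures)

for n in (5, 20, 50):
    bound = upper_bound_success_rate(n)
    print(f"0 successes in {n:2d} samples: success rate could still be up to {bound:.1%}")
```

Five failures say almost nothing; fifty put the success rate below roughly six percent for that prompt, which is the point where escalating usually beats rephrasing.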

You now know why prompts work: they condition a frozen distribution, read through a template, with bright edges and a priced-by-the-token interior. Chapter 02 turns mechanics into method — the five-part scaffold of Role · Task · Context · Format · Constraints, assembled live with before/after pairs.

§