02 · Context Engineering — AI Encyclopedia

2.1

The window is a budget

Every call an agent makes is assembled from the same six components, and they all draw on one account. The system prompt establishes identity and invariants. Tool definitions describe what the agent can do — schemas the model must re-read on every single call. Memory carries what previous sessions learned. Retrieval injects evidence for the current question. History is the transcript so far — turns, tool calls, tool results. The scratchpad holds the agent's own working notes. Their sum must fit the window, and in practice it must fit well under it:

EQ A2.1 — THE CONTEXT BUDGET $$ \underbrace{T_{\text{sys}} + T_{\text{tool}} + T_{\text{mem}}}_{\text{stable prefix}} \;+\; \underbrace{T_{\text{ret}} + T_{\text{hist}} + T_{\text{pad}}}_{\text{per-step dynamics}} \;\le\; \rho \, L_{\max}, \qquad \rho \approx 0.5\text{–}0.7 $$

The six terms are system prompt, tool definitions, memory, retrieval, history, and scratchpad; $L_{\max}$ is the advertised window. The factor $\rho$ is the honest part: you budget against an effective limit well below the advertised one, partly to leave headroom for the next tool result, partly because attention quality degrades long before the hard wall. The grouping into stable prefix and per-step dynamics is not cosmetic — it is the entire basis of §2.6.

PYTHON · RUNNABLE IN-BROWSER

# the context budget: six components against a 200K window (EQ A2.1, A2.2)
parts = {"system": 4000, "tools": 12000, "memory": 2000,
         "history": 60000, "retrieval": 30000, "scratchpad": 8000}
LIMIT, PRICE, KAPPA = 200_000, 3.00, 0.1   # window, $/Mtok input, cache discount

total = sum(parts.values())
for name, tok in parts.items():
    bar = "#" * round(40 * tok / LIMIT)
    print(f"{name:10s} {tok:7,d}  {tok/total:6.1%}  {bar}")
print(f"{'TOTAL':10s} {total:7,d}  -> {total/LIMIT:.0%} of the 200K window")

# EQ A2.2: P is the cache-hit span; only the tail D is billed as fresh.
# P here also counts history: append-only turns stay cached once sent.
P = parts["system"] + parts["tools"] + parts["memory"] + parts["history"]
D = total - P
cold = total * PRICE / 1e6
warm = (KAPPA * P + D) * PRICE / 1e6
print(f"\ncold call (no cache): ${cold:.3f}     warm call: ${warm:.3f}")
print(f"warm/cold = (kP+D)/(P+D) = {warm/cold:.2f}  ->  {cold/warm:.1f}x cheaper per step")
print("history dominates the budget, but append-only history is cache-hit —")
print("the expensive tokens are the ones you change, not the ones you keep")


A coding agent assembles its context from six components: system 4,000 · tools 12,000 · memory 2,000 · history 60,000 · retrieval 30,000 · scratchpad 8,000 tokens. Against a 200K window, what percentage of the window does the total occupy?

Total $= 4{,}000 + 12{,}000 + 2{,}000 + 60{,}000 + 30{,}000 + 8{,}000 = 116{,}000$ tokens. As a share of 200,000: $116{,}000 / 200{,}000 = 0.58 = 58\%$. That is already inside the $\rho \approx 0.5$–0.7 band, so this agent has little or no headroom left before attention quality starts to slide. The answer is 58.

Why is $\rho$ well below 1? Because the window is a physical limit but attention is a budget of its own. A transformer relates $T$ tokens through $T^2$ pairwise scores (Vol II · EQ 3.1), softmax spreads a fixed unit of probability mass over an ever-longer row, and training data contains far fewer million-token dependency patterns than thousand-token ones. The result is context rot: needle-in-a-haystack benchmarks saturate near 100%, while realistic tasks (multi-fact reasoning, instructions stated once at turn 3 and needed at turn 47, relevant passages buried mid-window among plausible distractors) degrade measurably as the window fills. Frontier models in 2026 hold up far better than the 2023 generation that made “lost in the middle” a famous phrase, but none are flat, and the degradation profile varies by model, by task, and by where the needle sits (Vol II · Ch 09). Treat the advertised window as an engineering maximum, not an operating point.
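The softmax point can be made concrete with a toy attention row: one relevant token whose logit sits `delta` above `T - 1` identical distractors. This is only a sketch under that simplifying assumption (the helper name `needle_weight` and the margin `delta = 8.0` are hypothetical choices), not a model of any real attention head.

```python
import math

def needle_weight(T: int, delta: float) -> float:
    """Softmax weight on one token whose logit is `delta` above T-1 equal distractors."""
    # softmax over [delta, 0, 0, ..., 0] reduces to e^delta / (e^delta + (T - 1))
    return math.exp(delta) / (math.exp(delta) + (T - 1))

# same logit margin, longer and longer rows
for T in (1_000, 10_000, 100_000):
    w = needle_weight(T, delta=8.0)
    print(f"T={T:>7,d}  weight on the relevant token = {w:.3f}")
```

With the margin held fixed, the weight on the relevant token shrinks roughly as $1/T$ once the distractors dominate the denominator. Real models learn sharper margins than this toy assumes, so read the trend, not the numbers.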

INSTRUMENT A2.1 — CONTEXT BUDGET COMPOSER · EQ A2.1 · 200K WINDOW · ILLUSTRATIVE PRICES

[Interactive instrument. Control: PRESET. Readouts: TOTAL CONTEXT · COST / REQUEST · TIME-TO-FIRST-TOKEN · ATTENTION QUALITY]

Cycle the presets, then drag HISTORY toward its maximum and watch all three readouts move against you. Honest footnote: the attention-quality dial is an illustrative curve, not a measurement — real degradation depends on model, task, and where the relevant facts sit. Prices and prefill rate are illustrative too ($3/MTok input, 10× cache-read discount on the stable prefix, 8K tok/s prefill). The lesson survives the caveats: cost and latency grow linearly with what you stuff into the window; quality does not.
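The two linear readouts can be reproduced by hand. The sketch below uses the footnote's illustrative figures ($3/MTok input, cache reads at one tenth, 8K tok/s prefill) and assumes, as a simplification, that time-to-first-token is prefill time only and that cached prefix tokens skip prefill entirely. `request_cost` and `ttft_seconds` are hypothetical helper names.

```python
PRICE_PER_MTOK = 3.00       # illustrative $/MTok input (from the footnote)
CACHE_DISCOUNT = 0.1        # illustrative: cache reads at one tenth of fresh input
PREFILL_TOK_PER_S = 8_000   # illustrative prefill rate

def request_cost(prefix: int, fresh: int, cached: bool) -> float:
    """Input cost in dollars; the prefix is discounted only on a cache hit."""
    rate = CACHE_DISCOUNT if cached else 1.0
    return (rate * prefix + fresh) * PRICE_PER_MTOK / 1e6

def ttft_seconds(prefix: int, fresh: int, cached: bool) -> float:
    """Prefill time only, assuming a cached prefix is skipped entirely."""
    to_prefill = fresh if cached else prefix + fresh
    return to_prefill / PREFILL_TOK_PER_S

PREFIX = 18_000        # system + tools + memory from the §2.1 example
OTHER_FRESH = 38_000   # retrieval + scratchpad from the same example
for history in (20_000, 60_000, 120_000):
    fresh = history + OTHER_FRESH
    print(f"history={history:>7,d}  "
          f"cost=${request_cost(PREFIX, fresh, cached=False):.3f}  "
          f"ttft={ttft_seconds(PREFIX, fresh, cached=False):.1f}s")
```

Both functions are straight lines in the token count, which is the point of the paragraph: nothing in this arithmetic captures the quality dial, and that one does not degrade linearly.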

2.2

What earns its place

Adding context is never free, even far from the limit. Every token competes for the same attention mass; every irrelevant passage is a distractor the model must actively rule out at every subsequent step. The working metric is signal-to-token ratio: of the tokens you are about to add, what fraction changes what the model will do? A 3,000-token file dump whose only relevant content is one function signature has a signal-to-token ratio near zero — and unlike money, badly spent context keeps charging you, because it rides along in all future calls until something removes it.
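A minimal sketch of the metric, using the file-dump example above. The 40-token size of the one useful signature and the 50 remaining steps are assumed figures for illustration; `signal_to_token` and `carried_tokens` are hypothetical names.

```python
def signal_to_token(relevant_tokens: int, added_tokens: int) -> float:
    """Fraction of the added tokens that can change what the model does."""
    return relevant_tokens / added_tokens

def carried_tokens(added_tokens: int, remaining_steps: int) -> int:
    """Tokens re-sent because the addition rides along on every later call."""
    return added_tokens * remaining_steps

DUMP, SIGNATURE, STEPS = 3_000, 40, 50   # SIGNATURE and STEPS are assumptions

print(f"signal-to-token ratio: {signal_to_token(SIGNATURE, DUMP):.3f}")
print(f"whole dump carried for {STEPS} steps: {carried_tokens(DUMP, STEPS):,d} tokens")
print(f"signature only, same steps:    {carried_tokens(SIGNATURE, STEPS):,d} tokens")
```

The second number is the "keeps charging you" part: the cost of a poor addition scales with how long it stays in the window, not just with its size.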

Curation has a characteristic failure on each side. System prompts drift too rigid: after every incident someone appends another if-then rule, until the prompt is a brittle 4,000-token legal code the model follows to the letter and the spirit of nothing. Or they stay too vague: “be helpful and thorough” — a row of zeros that assumes shared context the model does not have. The right altitude is in between: identity, hard invariants, heuristics with reasons, and a small number of canonical examples that show rather than enumerate.

Component	Earns its place when…	Typical bloat
System	identity · invariants · heuristics	Edge-case rules patched in after every incident
Tools	each tool distinct & necessary	40 overlapping tools whose schemas ride along on every call
Memory	distilled decisions & preferences	Raw transcripts pasted forward as “memory”
Retrieval	passages that answer the live question	Top-k padding; whole files when one signature suffices
History	recent turns verbatim, older compacted	Every raw tool dump since turn 1
Scratchpad	plans & notes the agent actually rereads	Stale reasoning that no later step ever reads

A useful discipline: before any component is added, name the future step that will read it. If you cannot, it is not context — it is sediment.

2.3

Retrieval vs long context

When the corpus is much larger than the window, there is no debate. A 10-million-document knowledge base at ~500 tokens each is 5B tokens against a 200K window — a factor of 25,000. Retrieval-augmented generation exists because selection is forced: an index (embeddings, BM25, or both) narrows the corpus to a handful of candidates, and only those candidates spend context. The engineering then lives in retrieval quality — chunking, hybrid lexical + semantic search (embeddings famously miss exact identifiers like error codes and function names that keyword search catches trivially), and reranking.
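One way to combine a lexical and a semantic ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat RRF here as a swapped-in illustration; the document ids and both input rankings are hypothetical, and `k = 60` is a commonly used default rather than a tuned value.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# hypothetical results for a query containing the exact error code "E4012"
lexical = ["err_E4012_doc", "retry_policy", "timeouts"]     # keyword search order
semantic = ["timeouts", "retry_policy", "backoff_guide"]    # embedding search order

fused = rrf([lexical, semantic])
print(fused)
```

The document that only keyword search found (`err_E4012_doc`) survives into the fused list instead of being lost to the embedding ranking, which is the failure hybrid search is meant to prevent. A reranker would then reorder this short candidate list before anything spends context.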

When the corpus fits, the trade is genuinely contested. Stuffing the full corpus into context often beats RAG on answer quality — no retriever to miss the relevant passage — and for one-shot questions over a small document set it is frequently the right call. But the costs recur on every request: you pay tokens and prefill latency for the whole corpus each time (softened, not eliminated, by caching — §2.6), and you spend the very attention budget that §2.1 showed degrading past half-fill. Long context and retrieval are not rivals; they are a price curve, and the crossover moves with corpus size, query volume, and how often the corpus changes.

The agentic turn added a third option that has largely won for tool-rich domains: just-in-time retrieval. Instead of front-loading top-k passages, keep lightweight references in context — file paths, schema names, document titles — and give the agent tools to fetch full content on demand: grep, open_file, a search API. A coding agent that navigates with search-and-open routinely beats one fed pre-embedded chunks of the same repository, because each fetch is targeted by the agent's current hypothesis rather than by a similarity score computed before the task began. This is progressive disclosure: context holds the map, tools fetch the territory. The honest cost is latency — every just-in-time fetch is a round trip — so production systems hybridize: pre-load what is almost certainly needed (the map, the conventions file), fetch the rest as the task reveals it.
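Progressive disclosure can be sketched with a toy repository held in a dictionary. The tool names `grep` and `open_file` come from the paragraph above; the `REPO` contents, `file_map`, and the `context` dictionary are hypothetical stand-ins, not any real agent framework's API.

```python
REPO = {   # hypothetical in-memory stand-in for a repository
    "src/auth.py": "def login(user, pw):\n    return check(user, pw)\n",
    "src/db.py": "def connect(url):\n    ...\n",
    "README.md": "Project conventions live here.\n",
}

def file_map() -> list[str]:
    """The map kept in context: paths only, no contents."""
    return sorted(REPO)

def grep(pattern: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every line containing `pattern`."""
    hits = []
    for path in sorted(REPO):
        for n, line in enumerate(REPO[path].splitlines(), start=1):
            if pattern in line:
                hits.append((path, n, line))
    return hits

def open_file(path: str) -> str:
    """Fetch full content on demand; this is the round trip that costs latency."""
    return REPO[path]

context = {"map": file_map()}                 # pre-loaded: cheap references
hits = grep("def login")                      # targeted by the current hypothesis
context["fetched"] = open_file(hits[0][0])    # only now does file content spend tokens
print(context["map"], hits)
```

Only one file's content ever enters `context`; the other two cost a path string each.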

A knowledge base holds 4 million documents averaging 600 tokens each. Against a 200,000-token window, by what factor does the corpus exceed the window? (This is why retrieval is forced.)

Corpus $= 4{,}000{,}000 \times 600 = 2.4 \times 10^{9}$ tokens. Factor over the window $= (2.4\times10^{9}) / (2\times10^{5}) = 12{,}000$. When the corpus is 12,000× the window, “just stuff it all in” is not on the table; selection is mandatory. The answer is 12000.

2.4

Memory architectures

Everything in the window dies when the session ends. Memory is the set of structures that survive — and agents use three tiers, distinguished by scope and lifetime. The scratchpad is task-scoped: a todo list, a running plan, intermediate results, maintained inside or alongside the current context so the agent can re-anchor after long tool outputs push the original goal thousands of tokens upstream. The persistent memory file is project-scoped: a curated document (the CLAUDE.md / MEMORY.md pattern) of conventions, decisions, and preferences, loaded into the stable prefix of every session. Episodic summaries are history-scoped: compressed records of what previous sessions did, retrievable when relevant rather than always loaded.

Tier	Scope · lifetime	Written	Characteristic failure
Scratchpad	one task · minutes–hours	continuously, by the agent	Notes written but never reread; plan drift
Memory file	one project · weeks–months	on decision, with review	Stale facts treated as live truth
Episodic summaries	across sessions · indefinite	at session boundaries	Summary-of-summary blur; contamination

What separates working memory systems from decorative ones is write-back discipline. Memory that is only ever read decays into fiction: the project migrated databases in March, the memory file still says Postgres, and the agent confidently writes against the wrong schema. The rules that hold up in practice: write on decisions and constraints, not on chatter (“user prefers tabs” earns a write; a transcript of the debate about tabs does not); keep entries small, structured, and dated; and route writes through review — either a human glance or a separate validation pass — because an agent that can write its own memory can also poison it, persisting a hallucination that every future session will inherit as ground truth. Memory is the one context component with compound interest, in both directions.
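The write-back rules above (small, structured, dated entries; writes routed through review) can be sketched as a two-stage store. Everything here is an assumption for illustration: the `MemoryEntry` fields, the `propose`/`review` method names, the dates, and "SQLite" as the database the project migrated to.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class MemoryEntry:
    key: str        # what the entry is about
    value: str      # the decision or constraint itself
    reason: str     # why, so it is not silently relitigated
    written: date   # dated, so staleness is visible

@dataclass
class MemoryFile:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)
    pending: list[MemoryEntry] = field(default_factory=list)

    def propose(self, entry: MemoryEntry) -> None:
        """Agent-side write: queued, not yet visible to future sessions."""
        self.pending.append(entry)

    def review(self, approve) -> None:
        """Run `approve(entry) -> bool` (a human glance or a validation pass) on each pending write."""
        for entry in self.pending:
            if approve(entry):
                self.entries[entry.key] = entry   # a newer decision replaces the stale one
        self.pending.clear()

mem = MemoryFile()
mem.propose(MemoryEntry("database", "Postgres", "initial choice", date(2025, 1, 10)))
mem.propose(MemoryEntry("database", "SQLite", "migrated in March", date(2025, 3, 14)))
mem.propose(MemoryEntry("indent", "", "unverified guess", date(2025, 3, 14)))
mem.review(lambda e: bool(e.value))   # toy validator: reject empty values
print(mem.entries["database"])
```

The design choice worth noting is the separation of `pending` from `entries`: nothing the agent writes becomes ground truth for later sessions until something other than the writing agent has looked at it.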

2.5

Compaction: summarize and continue

A long-running agent will hit the budget no matter how disciplined the curation. Compaction is the standard escape: when fill crosses a threshold (typically 70–90% of the effective budget), replace the oldest span of history with a structured summary and keep the recent tail verbatim. The session continues; the transcript does not.
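The trigger itself is small. The sketch below assumes history is a list of `(text, token_count)` pairs and that the caller supplies a `summarize` function; `stub_summarize` is a placeholder where a real system would call a model, and the 0.8 threshold is one value from the range quoted above.

```python
Message = tuple[str, int]   # (text, token_count); counts are supplied by the caller

def maybe_compact(history: list[Message], budget: int, summarize,
                  threshold: float = 0.8, keep_tail: int = 2) -> list[Message]:
    """Once fill crosses the threshold, replace all but the last `keep_tail` messages with one summary."""
    used = sum(tokens for _, tokens in history)
    if used < threshold * budget or len(history) <= keep_tail:
        return history                          # under threshold: leave the transcript alone
    split = len(history) - keep_tail
    old, tail = history[:split], history[split:]
    return [summarize(old)] + tail              # summary first, recent tail verbatim

def stub_summarize(old: list[Message]) -> Message:
    # placeholder: a real system would prompt a model for a structured summary
    return (f"[summary of {len(old)} messages]", 900)

history = [(f"turn {i}", 1_400) for i in range(1, 11)]   # 14,000 tokens in total
compacted = maybe_compact(history, budget=16_000, summarize=stub_summarize)
print(len(compacted), sum(tokens for _, tokens in compacted))
```

Under the threshold the function returns the original list untouched, which matters for §2.6: leaving the transcript alone is what keeps the cached prefix valid.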

PYTHON · RUNNABLE IN-BROWSER

# compaction vs monotone growth: context size across a 60-turn session
BASE = 6_000        # stable prefix: system + tools + memory
PER_TURN = 1_400    # average tokens one turn adds (action + observation)
SUMMARY = 900       # what a structured compaction leaves behind
EVERY = 10          # compact every N turns, keep a 2-turn tail verbatim

turns, raw, compacted = [], [], []
ctx_r = ctx_c = BASE
for t in range(1, 61):
    ctx_r += PER_TURN
    ctx_c += PER_TURN
    if t % EVERY == 0:
        ctx_c = BASE + SUMMARY + 2 * PER_TURN
    turns.append(t); raw.append(ctx_r); compacted.append(ctx_c)

print(f"turn 60 without compaction: {raw[-1]:6,d} tokens "
      f"({raw[-1]/200_000:.0%} of a 200K window, still climbing)")
print(f"turn 60 with compaction   : peaks at {max(compacted):6,d}, "
      f"resets to {min(compacted[9:]):5,d}")
print(f"tokens re-sent on the next call: {raw[-1]/compacted[-1]:.0f}x more without it")
print("the sawtooth is the win; what the summary DROPS is the risk (Instrument A2.2)")
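try:
    plot_xy(turns, raw)         # mint: monotone growth
    plot_xy(turns, compacted)   # blue: the compaction sawtooth
except NameError:
    pass  # plot_xy is provided by the in-browser runtime; skip plotting elsewhere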


A transcript holds 5,342 tokens. Compaction replaces the oldest span — 4,369 tokens — with a structured summary of 240 tokens, keeping the rest verbatim. By what percentage does the context shrink?

New total $= 5{,}342 - 4{,}369 + 240 = 1{,}213$ tokens. Drop $= 1 - 1{,}213 / 5{,}342 = 1 - 0.227 = 0.773 = 77\%$. The percentage is the easy part; whether the 240-token summary kept every constraint is the hard part the instrument below tests. The answer is 77.

Compaction is a lossy codec, and the entire craft is choosing the loss function. The loss is brutally asymmetric: dropping a pleasantry costs nothing; dropping a constraint costs the task. What must survive, in rough priority order: the goal as currently understood; every constraint, stated once and never repeated; decisions with their reasons (so they are not silently relitigated); exact identifiers — file paths, function names, ids, commands; and unresolved state — what failed, what was tried, what is pending. What can die: greetings and acknowledgments, superseded drafts, and raw tool payloads whose conclusions have already been distilled into a decision. A summary that reads like a friendly recap and tests like amnesia is the most common failure in production agents — which is exactly what the instrument below lets you reproduce.
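A retention check makes the asymmetry testable. The sketch below is deliberately naive: it looks for each required fact as a verbatim substring, which is a crude proxy for "the fact survived". The five facts and both summaries are hypothetical, chosen to mirror the priority list above.

```python
REQUIRED = {   # hypothetical facts, one per category in the priority list
    "goal": "migrate billing to v2 API",
    "constraint": "do not modify public interfaces",
    "decision": "use feature flag BILLING_V2",
    "identifier": "services/billing/client.py",
    "unresolved": "test_refund_rounding still failing",
}

def retained(summary: str, required: dict[str, str]) -> dict[str, bool]:
    """Which required facts appear verbatim (case-insensitive) in the summary?"""
    text = summary.lower()
    return {name: fact.lower() in text for name, fact in required.items()}

good = ("Goal: migrate billing to v2 API. Constraint: do not modify public interfaces. "
        "Decision: use feature flag BILLING_V2 (allows rollback). "
        "Files: services/billing/client.py. Open: test_refund_rounding still failing.")
bad = "We had a productive session and made good progress on the billing migration."

for label, summary in (("structured", good), ("friendly recap", bad)):
    kept = retained(summary, REQUIRED)
    print(f"{label:15s} {sum(kept.values())}/{len(kept)} facts preserved")
```

Both summaries are short and readable; only the checklist distinguishes them. A production check would need something more forgiving than substring matching, but the shape is the same: score the summary against what later steps need, not against how much it compressed.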

INSTRUMENT A2.2 — COMPACTION SIM · 12-MESSAGE TRANSCRIPT · TWO SUMMARY POLICIES

[Interactive instrument. Control: ACTION. Readouts: CONTEXT TOKENS · FACTS PRESERVED · VERDICT · FACTS CHECKLIST]

COMPACT replaces the nine oldest messages with a structured summary — tokens fall 77%, all five facts survive. BAD COMPACT compresses just as hard (78%) while preserving the mood and losing the constraints. Compression ratio is not the metric; retention of decisions and constraints is. Both summaries would look fine to a casual reader — that is the trap.

2.6

Cache-aware context design

Prompt caching (Vol II · Ch 08) stores the computed KV state of a prompt prefix so the next request that shares it skips that prefill entirely. Two consequences define how agent context must be laid out. First, matching is exact-prefix: one changed byte at position $i$ invalidates everything from $i$ onward — there is no partial credit. Second, the savings are large enough to dominate architecture: cache reads are billed at roughly a tenth of fresh input across the major providers, and the skipped prefill is most of your time-to-first-token on long contexts.

EQ A2.2 — CACHED VS UNCACHED COST $$ \frac{\$_{\text{warm}}}{\$_{\text{cold}}} \;=\; \frac{\kappa P + D}{P + D}, \qquad \kappa \approx 0.1 $$

$P$ = tokens in the stable, cache-hit prefix; $D$ = dynamic tokens after the first changed byte; $\kappa$ = cache-read discount. A coding agent at 100K context with a 90K stable prefix pays $(0.1 \times 90 + 10)/100 = 0.19$ of the cold price, about 5.3× cheaper per step, and only the 10K dynamic tail still needs prefill. The fine print: the first request pays a small write surcharge (≈1.25× on the cached span), so caching breaks even after roughly one subsequent hit. The entire equation collapses to this rule: order context by stability, and never touch what you've already sent.

By EQ A2.2, a warm call costs $\frac{\kappa P + D}{P + D}$ of a cold one. With a stable cache-hit prefix $P = 80\text{K}$, dynamic tail $D = 20\text{K}$, and cache discount $\kappa = 0.1$, what is the warm/cold cost ratio?

$\frac{\kappa P + D}{P + D} = \frac{0.1 \times 80 + 20}{80 + 20} = \frac{8 + 20}{100} = \frac{28}{100} = 0.28$ — the warm call is 0.28× the cold price, i.e. ≈3.6× cheaper per step. Push more tokens into the stable prefix (raise $P$, shrink $D$) and the ratio falls further. The answer is 0.28.
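The ratio and the break-even claim can both be checked in a few lines. This sketch assumes $P$ and $D$ stay constant across calls and uses the ≈1.25× write surcharge quoted above; costs are in units of one fresh token's price, and the function names are hypothetical.

```python
def warm_over_cold(P: float, D: float, kappa: float = 0.1) -> float:
    """EQ A2.2: cost of a warm call relative to a cold one."""
    return (kappa * P + D) / (P + D)

def cumulative_cost(n_calls: int, P: float, D: float, kappa: float = 0.1,
                    write_surcharge: float = 1.25, cached: bool = True) -> float:
    """Total input cost over n_calls, in units of one fresh token's price."""
    if n_calls <= 0:
        return 0.0
    if not cached:
        return n_calls * (P + D)
    first = write_surcharge * P + D            # the first request writes the cache
    rest = (n_calls - 1) * (kappa * P + D)     # later calls read the prefix from cache
    return first + rest

print(f"90K/10K ratio: {warm_over_cold(90, 10):.2f}   80K/20K ratio: {warm_over_cold(80, 20):.2f}")
for n in (1, 2, 50):
    with_cache = cumulative_cost(n, 90, 10)
    without = cumulative_cost(n, 90, 10, cached=False)
    print(f"n={n:2d}  cached={with_cache:7.1f}  uncached={without:7.1f}")
```

At one call the cached path is more expensive because of the write surcharge; by the second call it is already ahead, which is the "one subsequent hit" break-even.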

In a fifty-step loop the same prefix is replayed fifty times, so the layout rules are unforgiving:

# cache-aware assembly — most stable first
1 system:   identity, invariants — changes never
2 tools:    full schemas — changes per deploy, not per step
3 memory:   project file — changes per session
─── cache breakpoint ───
4 history:  append-only — new turns go at the END; never rewrite,
            reorder, or re-render earlier turns
5 dynamics: fresh retrieval & scratchpad — changes every step
# cache killers: a timestamp in the system prompt · mutating the
#   tool list mid-session · non-deterministic JSON serialization

Append-only history is why compaction (§2.5) is scheduled, not casual: a compaction necessarily rewrites the transcript and takes the cold-prefill hit once, on purpose, at a moment of your choosing — instead of a timestamp doing it silently on every single call.
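Exact-prefix matching and two of the cache killers can be demonstrated on plain strings. This is a sketch under a loud assumption: real caches match on tokens and provider-defined breakpoints, so character-level prefix length is only a proxy. `render` and `shared_prefix_len` are hypothetical helpers; `json.dumps(..., sort_keys=True)` is the standard-library way to get deterministic key order.

```python
import json

def render(system: str, tools: list[dict], history: list[str]) -> str:
    """Assemble most-stable-first; sort_keys keeps the tool JSON byte-stable."""
    tool_block = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return "\n".join([system, tool_block, *history])

def shared_prefix_len(a: str, b: str) -> int:
    """Leading characters two prompts share (a proxy for the cache-hit span)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

tools = [{"name": "grep", "args": {"pattern": "string"}}]
step1 = ["user: fix the bug"]
step2 = step1 + ["assistant: looking at src/auth.py"]   # append-only

stable_a = render("You are a coding agent.", tools, step1)
stable_b = render("You are a coding agent.", tools, step2)
stamped_a = render("You are a coding agent. Now: 12:00:01", tools, step1)
stamped_b = render("You are a coding agent. Now: 12:00:02", tools, step2)

print("stable :", shared_prefix_len(stable_a, stable_b), "of", len(stable_a))
print("stamped:", shared_prefix_len(stamped_a, stamped_b), "of", len(stamped_a))
```

With a stable system prompt the whole of the first request is a prefix of the second. With a timestamp the shared span ends inside the system prompt, so the tool schemas and history behind it are recomputed on every call.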

2.7

Sub-agents as context partitioning

The final tool is architectural: when one window cannot hold a task, split the task, not the window. A sub-agent is a fresh context dedicated to one concern — search this codebase, audit this contract, verify this claim — spawned with a self-contained brief, run to completion, and discarded. The orchestrator's window holds the plan and the results; each worker's window absorbs the noise of its own exploration and dies with it. The contract that makes this work: results flow back, never transcripts. A sub-agent that reads forty files and burns 80K tokens doing it returns a 300-token report; the orchestrator pays 300, not 80,000. Each spawn is a deliberate compression boundary — sharper than compaction, because the summary is written while the full evidence is still in (the sub-agent's) context.

FIG A2.1 · RESULTS FLOW BACK — TRANSCRIPTS DON'T

Three concerns, three fresh windows. The orchestrator's context grows by the size of three reports, not three explorations — each spawn is a compression boundary enforced by architecture rather than discipline.
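The accounting behind the figure, as a minimal sketch. `run_subagent` is a hypothetical stand-in for a real worker, and applying the 80K-explored, 300-token-report figures from the text to all three workers is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass
class SubAgentResult:
    report: str
    report_tokens: int      # what flows back to the orchestrator
    explored_tokens: int    # spent inside the worker's own window, then discarded

def run_subagent(brief: str, explored_tokens: int, report_tokens: int) -> SubAgentResult:
    """Stand-in for a real worker: the transcript never leaves this function."""
    transcript = [f"(exploration for: {brief})"]   # the noise of the worker's search
    del transcript                                 # dies with the worker
    return SubAgentResult(f"report: {brief}", report_tokens, explored_tokens)

briefs = ["search this codebase", "audit this contract", "verify this claim"]
results = [run_subagent(b, explored_tokens=80_000, report_tokens=300) for b in briefs]

orchestrator_growth = sum(r.report_tokens for r in results)
total_spend = sum(r.explored_tokens + r.report_tokens for r in results)
print(f"orchestrator context grows by {orchestrator_growth:,d} tokens")
print(f"total tokens spent across workers: {total_spend:,d}")
```

The two totals are the two sides of the trade: the orchestrator's window stays small, while overall token spend does not.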

The honest ledger, because sub-agents are currently fashionable enough to be over-applied. They multiply total token spend — multi-agent research systems burn several times the tokens of a single-agent run on the same task, which only pays off when the work is read-heavy and parallelizable. They add latency per spawn. And they reintroduce the oldest distributed-systems bug as a prompt problem: the telephone game. A sub-agent only knows what its brief says — it cannot see the conversation that produced the brief — so an under-specified brief yields a confident answer to the wrong question. Worse, two sub-agents editing shared state will collide, which is why the stable pattern is read-heavy fan-out (search, audit, verify in parallel) feeding a single writer that holds the plan. Partition concerns, not sentences.

Context decides what the agent sees; tools decide what it can do. Chapter 03: designing tool interfaces a model can actually wield — naming, schemas, error surfaces, token-efficient outputs — and MCP, the protocol that turned tool integration from an N×M matrix into a standard.
