The window is a budget
Every call an agent makes is assembled from the same six components, and they all draw on one account. The system prompt establishes identity and invariants. Tool definitions describe what the agent can do — schemas the model must re-read on every single call. Memory carries what previous sessions learned. Retrieval injects evidence for the current question. History is the transcript so far — turns, tool calls, tool results. The scratchpad holds the agent's own working notes. Their sum must fit the window, and in practice it must fit well under it:
```python
# the context budget: six components against a 200K window (EQ A2.1, A2.2)
parts = {"system": 4000, "tools": 12000, "memory": 2000,
         "history": 60000, "retrieval": 30000, "scratchpad": 8000}
# window (tokens), $/Mtok fresh input, cached-read price as a fraction of fresh
LIMIT, PRICE, KAPPA = 200_000, 3.00, 0.1

total = sum(parts.values())
for name, tok in parts.items():
    bar = "#" * round(40 * tok / LIMIT)
    print(f"{name:10s} {tok:7,d} {tok/total:6.1%} {bar}")
print(f"{'TOTAL':10s} {total:7,d} -> {total/LIMIT:.0%} of the 200K window")

# EQ A2.2: the stable prefix P is served from cache; only the tail D is fresh
P = parts["system"] + parts["tools"] + parts["memory"] + parts["history"]
D = total - P  # retrieval + scratchpad
cold = total * PRICE / 1e6
warm = (KAPPA * P + D) * PRICE / 1e6
print(f"\ncold call (no cache): ${cold:.3f} warm call: ${warm:.3f}")
print(f"warm/cold = (kP+D)/(P+D) = {warm/cold:.2f} -> {cold/warm:.1f}x cheaper per step")
print("history dominates the budget, but append-only history is cache-hit —")
print("the expensive tokens are the ones you change, not the ones you keep")
```
Why \(\rho \ll 1\)? Because the window is a physical limit but attention is a budget of its own. A transformer relates \(T\) tokens through \(T^2\) pairwise scores (Vol II · EQ 3.1), softmax spreads a fixed unit of probability mass over an ever-longer row, and training data contains far fewer million-token dependency patterns than thousand-token ones. The result is context rot: needle-in-a-haystack benchmarks saturate near 100%, while realistic tasks — multi-fact reasoning, instructions stated once at turn 3 and needed at turn 47, relevant passages buried mid-window among plausible distractors — degrade measurably as the window fills. Frontier models in 2026 hold up far better than the 2023 generation that made “lost in the middle” a famous phrase, but none are flat, and the degradation profile varies by model, by task, and by where the needle sits (Vol II · Ch 09). Treat the advertised window as an engineering maximum, not an operating point.
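The softmax-dilution part of this argument can be made concrete with a toy calculation. This is a minimal sketch under a strong simplifying assumption that is not from the text: one relevant token whose logit exceeds every one of the other `T - 1` tokens by a fixed margin `delta`. Real attention rows are not this uniform, and the margin of 5 is illustrative, not measured from any model.

```python
import math

def needle_weight(T: int, delta: float) -> float:
    """Softmax weight on one token whose logit is `delta` above T-1 equal distractors."""
    return math.exp(delta) / (math.exp(delta) + (T - 1))

# Assumed margin of 5 nats; chosen for illustration only.
for T in (1_000, 10_000, 100_000, 1_000_000):
    print(f"T={T:>9,d}  weight on the needle = {needle_weight(T, 5.0):.4f}")
```

With the margin held fixed, the weight falls roughly as 1/T once T dominates `exp(delta)`, which is one way to read the claim that attention is a budget of its own.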
What earns its place
Adding context is never free, even far from the limit. Every token competes for the same attention mass; every irrelevant passage is a distractor the model must actively rule out at every subsequent step. The working metric is signal-to-token ratio: of the tokens you are about to add, what fraction changes what the model will do? A 3,000-token file dump whose only relevant content is one function signature has a signal-to-token ratio near zero — and unlike money, badly spent context keeps charging you, because it rides along in all future calls until something removes it.
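The two quantities in this paragraph, the ratio itself and the cost of carrying tokens forward, are simple enough to write down. A minimal sketch: the function names are hypothetical, and the 40-token signature and 30 remaining calls are assumed numbers layered on the 3,000-token file dump from the text.

```python
def signal_to_token(relevant_tokens: int, added_tokens: int) -> float:
    """Fraction of the added tokens that can change what the model does next."""
    return relevant_tokens / added_tokens if added_tokens else 0.0

def carry_cost(added_tokens: int, remaining_calls: int) -> int:
    """Input tokens re-sent over the rest of the session if nothing removes them."""
    return added_tokens * remaining_calls

# The file-dump example from the text; signature size and call count are assumed.
dump, signature, calls_left = 3_000, 40, 30
print(f"signal-to-token: {signal_to_token(signature, dump):.1%}")
print(f"whole file carried {calls_left} calls: {carry_cost(dump, calls_left):,d} tokens")
print(f"signature only carried:      {carry_cost(signature, calls_left):,d} tokens")
```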
Curation has a characteristic failure on each side. System prompts drift too rigid: after every incident someone appends another if-then rule, until the prompt is a brittle 4,000-token legal code that the model follows to the letter while missing its intent. Or they stay too vague: “be helpful and thorough” carries almost no signal and assumes shared context the model does not have. The right altitude is in between: identity, hard invariants, heuristics with reasons, and a small number of canonical examples that show rather than enumerate.
| Component | Earns its place when… | Typical bloat |
|---|---|---|
| System | identity · invariants · heuristics | Edge-case rules patched in after every incident |
| Tools | each tool distinct & necessary | 40 overlapping tools whose schemas ride along on every call |
| Memory | distilled decisions & preferences | Raw transcripts pasted forward as “memory” |
| Retrieval | passages that answer the live question | Top-k padding; whole files when one signature suffices |
| History | recent turns verbatim, older compacted | Every raw tool dump since turn 1 |
| Scratchpad | plans & notes the agent actually rereads | Stale reasoning that no later step ever reads |
A useful discipline: before any component is added, name the future step that will read it. If you cannot, it is not context — it is sediment.
Retrieval vs long context
When the corpus is much larger than the window, there is no debate. A 10-million-document knowledge base at ~500 tokens each is 5B tokens against a 200K window — a factor of 25,000. Retrieval-augmented generation exists because selection is forced: an index (embeddings, BM25, or both) narrows the corpus to a handful of candidates, and only those candidates spend context. The engineering then lives in retrieval quality — chunking, hybrid lexical + semantic search (embeddings famously miss exact identifiers like error codes and function names that keyword search catches trivially), and reranking.
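One common way to combine a lexical ranking with a semantic one is reciprocal rank fusion. The text does not name a fusion method, so treat this as a swapped-in illustration rather than the chapter's own technique. The document ids and both rankings are hypothetical, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the query "E1042 timeout": the keyword index finds
# the exact error code first, the embedding index ranks it last.
lexical = ["err_E1042", "timeouts", "retry_policy"]
semantic = ["timeouts", "retry_policy", "networking", "err_E1042"]
fused = rrf([lexical, semantic])
print(fused)
```

In this toy case the exact-identifier page that the embedding ranking put last ends up second after fusion, which is the behaviour hybrid search is meant to buy. A reranker would then reorder the fused candidates before they spend context.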
When the corpus fits, the trade is genuinely contested. Stuffing the full corpus into context often beats RAG on answer quality — no retriever to miss the relevant passage — and for one-shot questions over a small document set it is frequently the right call. But the costs recur on every request: you pay tokens and prefill latency for the whole corpus each time (softened, not eliminated, by caching — §2.6), and you spend the very attention budget that §2.1 showed degrading past half-fill. Long context and retrieval are not rivals; they are a price curve, and the crossover moves with corpus size, query volume, and how often the corpus changes.
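The price curve can be sketched with the same two constants used for the budget listing in §2.1. This is a rough model under stated assumptions: it ignores any cache-write premium, cache expiry, index-building cost, and answer quality, and the workload numbers (a 150K-token corpus, 200-token questions, 8 chunks of 500 tokens) are invented for illustration.

```python
PRICE = 3.00   # $/Mtok fresh input, the figure used in the budget listing
KAPPA = 0.1    # cached-read price as a fraction of fresh input

def stuffing_cost(corpus_tok: int, question_tok: int, queries: int,
                  corpus_versions: int = 1) -> float:
    """Whole corpus in context: one cold prefill per corpus version, cached reads after."""
    cold = min(corpus_versions, queries)
    warm = queries - cold
    tokens = cold * corpus_tok + warm * KAPPA * corpus_tok + queries * question_tok
    return tokens * PRICE / 1e6

def rag_cost(k: int, chunk_tok: int, question_tok: int, queries: int) -> float:
    """Top-k retrieval: only the selected chunks spend context (index cost ignored)."""
    return queries * (k * chunk_tok + question_tok) * PRICE / 1e6

# Assumed workload, for illustration only.
for q in (1, 10, 100, 1_000):
    s = stuffing_cost(150_000, 200, q)
    r = rag_cost(8, 500, 200, q)
    print(f"{q:5d} queries  stuffing ${s:8.2f}   rag ${r:8.2f}")
```

Raising `corpus_versions` models a corpus that changes between queries: each change forces another cold prefill, which is why update frequency moves the crossover.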
The agentic turn added a third option that has largely won for tool-rich domains: just-in-time retrieval. Instead of front-loading top-k passages, keep lightweight references in context — file paths, schema names, document titles — and give the agent tools to fetch full content on demand: grep, open_file, a search API. A coding agent that navigates with search-and-open routinely beats one fed pre-embedded chunks of the same repository, because each fetch is targeted by the agent's current hypothesis rather than by a similarity score computed before the task began. This is progressive disclosure: context holds the map, tools fetch the territory. The honest cost is latency — every just-in-time fetch is a round trip — so production systems hybridize: pre-load what is almost certainly needed (the map, the conventions file), fetch the rest as the task reveals it.
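Progressive disclosure can be sketched in a few lines. Everything here is hypothetical: `REPO` is an in-memory stand-in for a filesystem, and `repo_map`, `grep`, and `open_file` are toy versions of the tools the paragraph names, not any real agent framework's API.

```python
import re

# Hypothetical in-memory repository standing in for a real filesystem.
REPO = {
    "src/billing/invoice.py": "def total(items):\n    return sum(i.price for i in items)\n",
    "src/billing/tax.py": "RATE = 0.2\n\ndef apply_tax(amount):\n    return amount * (1 + RATE)\n",
    "docs/CONVENTIONS.md": "Amounts are integers in cents.\n",
}

def repo_map() -> str:
    """The cheap part that stays in context: paths only, no contents."""
    return "\n".join(sorted(REPO))

def grep(pattern: str) -> list[tuple[str, int, str]]:
    """Tool: return (path, line number, line) for every matching line."""
    rx = re.compile(pattern)
    return [(path, n, line)
            for path, text in sorted(REPO.items())
            for n, line in enumerate(text.splitlines(), start=1)
            if rx.search(line)]

def open_file(path: str) -> str:
    """Tool: fetch full content only when the agent asks for it."""
    return REPO[path]

context = repo_map()             # pre-loaded: the map
hits = grep(r"def apply_tax")    # fetched: driven by the agent's current hypothesis
body = open_file(hits[0][0])     # fetched: the one file that matters
print(context, hits, body, sep="\n---\n")
```

The context starts with three paths and ends with one file body; the other two files never spend a token.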
Memory architectures
Everything in the window dies when the session ends. Memory is the set of structures that survive — and agents use three tiers, distinguished by scope and lifetime. The scratchpad is task-scoped: a todo list, a running plan, intermediate results, maintained inside or alongside the current context so the agent can re-anchor after long tool outputs push the original goal thousands of tokens upstream. The persistent memory file is project-scoped: a curated document (the CLAUDE.md / MEMORY.md pattern) of conventions, decisions, and preferences, loaded into the stable prefix of every session. Episodic summaries are history-scoped: compressed records of what previous sessions did, retrievable when relevant rather than always loaded.
| Tier | Scope · lifetime | Written | Characteristic failure |
|---|---|---|---|
| Scratchpad | one task · minutes–hours | continuously, by the agent | Notes written but never reread; plan drift |
| Memory file | one project · weeks–months | on decision, with review | Stale facts treated as live truth |
| Episodic summaries | across sessions · indefinite | at session boundaries | Summary-of-summary blur; contamination |
What separates working memory systems from decorative ones is write-back discipline. Memory that is only ever read decays into fiction: the project migrated databases in March, the memory file still says Postgres, and the agent confidently writes against the wrong schema. The rules that hold up in practice: write on decisions and constraints, not on chatter (“user prefers tabs” earns a write; a transcript of the debate about tabs does not); keep entries small, structured, and dated; and route writes through review — either a human glance or a separate validation pass — because an agent that can write its own memory can also poison it, persisting a hallucination that every future session will inherit as ground truth. Memory is the one context component with compound interest, in both directions.
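The write-back rules above (small, structured, dated entries, with writes routed through review) can be sketched as a two-stage store. The class names, fields, dates, and the MySQL migration target are all hypothetical; the validator is a stand-in for a human glance or a separate validation pass.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class MemoryEntry:
    key: str        # what the fact is about, e.g. "database"
    value: str      # the decision or constraint itself
    reason: str     # why, so it is not silently relitigated
    written: date   # entries are dated so staleness is visible

@dataclass
class MemoryFile:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)
    pending: list[MemoryEntry] = field(default_factory=list)

    def propose(self, entry: MemoryEntry) -> None:
        """The agent may only propose; nothing is persisted yet."""
        self.pending.append(entry)

    def review(self, approve) -> None:
        """A human or a separate validation pass decides what is written."""
        for entry in self.pending:
            if approve(entry):
                self.entries[entry.key] = entry  # a newer decision replaces the stale one
        self.pending.clear()

# Hypothetical values throughout.
mem = MemoryFile()
mem.propose(MemoryEntry("database", "Postgres", "initial choice", date(2025, 11, 3)))
mem.review(lambda e: True)
mem.propose(MemoryEntry("database", "MySQL", "migrated in March", date(2026, 3, 14)))
mem.propose(MemoryEntry("indent", "", "unverified guess", date(2026, 3, 14)))
mem.review(lambda e: bool(e.value))  # stand-in validator: reject empty claims
print(mem.entries["database"].value, sorted(mem.entries))
```

Keying entries by subject means the migration overwrites the Postgres fact instead of sitting beside it, and the empty claim never reaches the file that future sessions load.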
Compaction: summarize and continue
A long-running agent will hit the budget no matter how disciplined the curation. Compaction is the standard escape: when fill crosses a threshold (typically 70–90% of the effective budget), replace the oldest span of history with a structured summary and keep the recent tail verbatim. The session continues; the transcript does not.
```python
# compaction vs monotone growth: context size across a 60-turn session
BASE = 6_000      # stable prefix: system + tools + memory
PER_TURN = 1_400  # average tokens one turn adds (action + observation)
SUMMARY = 900     # what a structured compaction leaves behind
EVERY = 10        # compact every N turns, keep a 2-turn tail verbatim

turns, raw, compacted = [], [], []
ctx_r = ctx_c = BASE
for t in range(1, 61):
    ctx_r += PER_TURN
    ctx_c += PER_TURN
    if t % EVERY == 0:
        # sizes are recorded after the reset, so the reported peak is the turn before it
        ctx_c = BASE + SUMMARY + 2 * PER_TURN
    turns.append(t)
    raw.append(ctx_r)
    compacted.append(ctx_c)

print(f"turn 60 without compaction: {raw[-1]:6,d} tokens "
      f"({raw[-1]/200_000:.0%} of a 200K window, still climbing)")
print(f"turn 60 with compaction : peaks at {max(compacted):6,d}, "
      f"resets to {min(compacted[9:]):5,d}")
print(f"tokens re-sent on the next call: {raw[-1]/compacted[-1]:.0f}x more without it")
print("the sawtooth is the win; what the summary DROPS is the risk (Instrument A2.2)")

# plot_xy is not defined in this listing; it is assumed to be a plotting helper
# supplied by the host page
plot_xy(turns, raw)        # mint: monotone growth
plot_xy(turns, compacted)  # blue: the compaction sawtooth
```
Compaction is a lossy codec, and the entire craft is choosing the loss function. The loss is brutally asymmetric: dropping a pleasantry costs nothing; dropping a constraint costs the task. What must survive, in rough priority order: the goal as currently understood; every constraint, stated once and never repeated; decisions with their reasons (so they are not silently relitigated); exact identifiers — file paths, function names, ids, commands; and unresolved state — what failed, what was tried, what is pending. What can die: greetings and acknowledgments, superseded drafts, and raw tool payloads whose conclusions have already been distilled into a decision. A summary that reads like a friendly recap and tests like amnesia is the most common failure in production agents — which is exactly what the instrument below lets you reproduce.
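The survival list can be turned into a structure and a test. A minimal sketch, assuming the summary fields are filled in by some upstream step (in practice a model call, hand-written here); `Summary`, `compact`, and `amnesia_check` are hypothetical names, and the history is invented.

```python
from dataclasses import dataclass, field

@dataclass
class Summary:
    goal: str
    constraints: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)    # each with its reason
    identifiers: list[str] = field(default_factory=list)  # paths, ids, commands
    unresolved: list[str] = field(default_factory=list)

    def render(self) -> str:
        parts = [f"GOAL: {self.goal}"]
        for label, items in (("CONSTRAINT", self.constraints), ("DECISION", self.decisions),
                             ("ID", self.identifiers), ("OPEN", self.unresolved)):
            parts += [f"{label}: {item}" for item in items]
        return "\n".join(parts)

def compact(history: list[str], summary: Summary, keep_tail: int = 2) -> list[str]:
    """Replace everything but the last `keep_tail` turns with the rendered summary."""
    tail = history[-keep_tail:] if keep_tail else []
    return [summary.render()] + tail

def amnesia_check(compacted: list[str], must_survive: list[str]) -> list[str]:
    """Return the required strings that no longer appear anywhere in context."""
    text = "\n".join(compacted)
    return [s for s in must_survive if s not in text]

# Invented session for illustration.
history = ["hi!", "never touch the prod db", "edited src/api/auth.py",
           "tests fail on login", "retrying"]
must = ["never touch the prod db", "src/api/auth.py"]
friendly = Summary(goal="fix the login bug")  # reads fine, drops what matters
careful = Summary(goal="fix the login bug", constraints=["never touch the prod db"],
                  identifiers=["src/api/auth.py"], unresolved=["tests fail on login"])
print("friendly recap lost:", amnesia_check(compact(history, friendly), must))
print("careful summary lost:", amnesia_check(compact(history, careful), must))
```

Exact-substring matching is deliberately strict: identifiers and constraints are the items where a paraphrase is as bad as a deletion.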
Cache-aware context design
Prompt caching (Vol II · Ch 08) stores the computed KV state of a prompt prefix so the next request that shares it skips that prefill entirely. Two consequences define how agent context must be laid out. First, matching is exact-prefix: one changed byte at position \(i\) invalidates everything from \(i\) onward — there is no partial credit. Second, the savings are large enough to dominate architecture: cache reads are billed at roughly a tenth of fresh input across the major providers, and the skipped prefill is most of your time-to-first-token on long contexts.
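Exact-prefix matching is easy to demonstrate on toy data. This sketch compares lists of strings standing in for tokens; real caches match on the provider's own tokenization and may impose minimum lengths or coarser granularity, none of which is modelled here. The function name and token lists are hypothetical.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading tokens shared exactly; everything after is fresh prefill."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n

# Toy token lists, for illustration only.
stable = ["<sys>", "you", "are", "an", "agent", "<tools>", "grep", "open_file"]
step1 = stable + ["turn1"]
step2 = stable + ["turn1", "turn2"]                                # append-only
stamped = ["<sys>", "12:00:01"] + stable[1:] + ["turn1", "turn2"]  # timestamp near the front

print(cached_prefix_len(step1, step2), "of", len(step2), "tokens reused")
print(cached_prefix_len(step1, stamped), "of", len(stamped), "tokens reused")
```

Appending a turn reuses everything already sent; a timestamp in the second position leaves one reusable token, even though nearly all of the remaining content is identical.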
In a fifty-step loop the same prefix is replayed fifty times, so the layout rules are unforgiving:
```text
# cache-aware assembly — most stable first
1 system:   identity, invariants — changes never
2 tools:    full schemas — changes per deploy, not per step
3 memory:   project file — changes per session
─── cache breakpoint ───
4 history:  append-only — new turns go at the END; never rewrite,
            reorder, or re-render earlier turns
5 dynamics: fresh retrieval & scratchpad — changes every step

# cache killers: a timestamp in the system prompt · mutating the
# tool list mid-session · non-deterministic JSON serialization
```
Append-only history is why compaction (§2.5) is scheduled, not casual: a compaction necessarily rewrites the transcript, so everything from the rewritten span onward is prefilled cold. You take that hit once, on purpose, at a moment of your choosing, instead of letting a timestamp impose it silently on every single call.
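The layout rule and the serialization pitfall can both be checked mechanically. A minimal sketch, assuming prompts are assembled as plain strings: `render_tools` and `assemble` are hypothetical helpers, the tool schemas are invented, and the SHA-256 fingerprint is only a local way to detect that the stable prefix changed, not part of any provider's API.

```python
import hashlib
import json

def render_tools(tools: list[dict]) -> str:
    """Serialize tool schemas identically every time: fixed order, sorted keys."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

def assemble(system: str, tools: list[dict], memory: str,
             history: list[str], dynamics: str) -> tuple[str, str]:
    """Return (prompt, fingerprint of the stable prefix), most stable part first."""
    prefix = "\n".join([system, render_tools(tools), memory])
    prompt = "\n".join([prefix, *history, dynamics])
    return prompt, hashlib.sha256(prefix.encode()).hexdigest()

# Hypothetical schemas: the same two tools, arriving in a different list and key order.
a = [{"name": "grep", "params": {"pattern": "str"}},
     {"name": "open_file", "params": {"path": "str"}}]
b = [{"params": {"path": "str"}, "name": "open_file"},
     {"params": {"pattern": "str"}, "name": "grep"}]

p1, fp1 = assemble("You are a coding agent.", a, "Use tabs.", ["turn 1"], "scratch: step 1")
p2, fp2 = assemble("You are a coding agent.", b, "Use tabs.", ["turn 1", "turn 2"], "scratch: step 2")
print("naive dumps identical: ", json.dumps(a) == json.dumps(b))
print("stable prefix identical:", fp1 == fp2)
```

Logging the fingerprint on every call is a cheap way to catch a cache killer: if it changes between two steps of the same session, something upstream of the breakpoint moved.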
Sub-agents as context partitioning
The final tool is architectural: when one window cannot hold a task, split the task, not the window. A sub-agent is a fresh context dedicated to one concern — search this codebase, audit this contract, verify this claim — spawned with a self-contained brief, run to completion, and discarded. The orchestrator's window holds the plan and the results; each worker's window absorbs the noise of its own exploration and dies with it. The contract that makes this work: results flow back, never transcripts. A sub-agent that reads forty files and burns 80K tokens doing it returns a 300-token report; the orchestrator pays 300, not 80,000. Each spawn is a deliberate compression boundary — sharper than compaction, because the summary is written while the full evidence is still in (the sub-agent's) context.
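The results-not-transcripts contract can be sketched with simple token accounting. Everything here is a stand-in: `count_tokens` is a crude whitespace counter rather than a real tokenizer, `run_subagent` scans strings instead of calling a model, and the forty files are synthetic.

```python
from dataclasses import dataclass

def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())

@dataclass
class Report:
    text: str           # the only thing that crosses back to the orchestrator
    worker_tokens: int  # what the exploration cost inside the worker's own window

def run_subagent(brief: str, files: dict[str, str]) -> Report:
    """Hypothetical worker: reads everything in its own context, returns a short finding."""
    transcript = [brief]
    hits = []
    for path, body in files.items():
        transcript.append(body)  # the noise stays in the worker
        if "TODO" in body:
            hits.append(path)
    report = f"{len(hits)} of {len(files)} files contain TODO: {', '.join(sorted(hits))}"
    return Report(report, sum(count_tokens(t) for t in transcript))

# Synthetic repository: forty files, four of which carry a TODO marker.
files = {f"mod_{i}.py": "x = 1 " * 500 + ("# TODO fix" if i % 10 == 0 else "")
         for i in range(40)}
orchestrator_ctx = ["plan: audit the repo for TODOs"]
result = run_subagent("List files that contain TODO markers. Report paths only.", files)
orchestrator_ctx.append(result.text)  # results flow back, never transcripts

print(f"worker spent {result.worker_tokens:,d} tokens; "
      f"orchestrator paid {count_tokens(result.text)}")
```

The brief is written to stand alone, because it is the only thing the worker ever sees of the orchestrator's conversation.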
The honest ledger, because sub-agents are currently fashionable enough to be over-applied. They multiply total token spend — multi-agent research systems burn several times the tokens of a single-agent run on the same task, which only pays off when the work is read-heavy and parallelizable. They add latency per spawn. And they reintroduce the oldest distributed-systems bug as a prompt problem: the telephone game. A sub-agent only knows what its brief says — it cannot see the conversation that produced the brief — so an under-specified brief yields a confident answer to the wrong question. Worse, two sub-agents editing shared state will collide, which is why the stable pattern is read-heavy fan-out (search, audit, verify in parallel) feeding a single writer that holds the plan. Partition concerns, not sentences.
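The read-heavy fan-out feeding a single writer might look like the following. This is a sketch under assumptions: `reader` is a placeholder for a model call made with a self-contained brief, the briefs are invented, and a thread pool stands in for whatever concurrency a real orchestrator uses.

```python
from concurrent.futures import ThreadPoolExecutor

def reader(brief: str) -> str:
    """Hypothetical read-only worker; a real one would call a model with this brief."""
    return f"finding for [{brief}]"

def fan_out(briefs: list[str]) -> list[str]:
    """Run read-only workers in parallel; results come back in brief order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(reader, briefs))

shared_state: list[str] = []  # only the writer below ever mutates this

def writer(plan: str, findings: list[str]) -> None:
    """The single writer holds the plan and applies findings one at a time."""
    shared_state.append(f"plan: {plan}")
    shared_state.extend(findings)

# Invented briefs, one concern each.
briefs = ["search: where is auth configured?",
          "audit: which endpoints skip rate limiting?",
          "verify: does the retry policy cap at 3 attempts?"]
writer("harden the API", fan_out(briefs))
print(*shared_state, sep="\n")
```

Because the workers only read and only the writer touches `shared_state`, no locking is needed, and the collision the paragraph warns about cannot occur by construction.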
Context decides what the agent sees; tools decide what it can do. Chapter 03: designing tool interfaces a model can actually wield — naming, schemas, error surfaces, token-efficient outputs — and MCP, the protocol that turned tool integration from an N×M matrix into a standard.
Further reading
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. — the foundational RAG paper behind retrieval-vs-long-context tradeoffs.
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. — empirical evidence that position in the window changes what a model can use.
- Beltagy, I., Peters, M. & Cohan, A. (2020). Longformer: The Long-Document Transformer. — a seminal approach to scaling attention past the fixed window.
- Park, J. S. et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. — introduces a memory stream with retrieval, reflection, and decay for long-lived agents.
- Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. — frames context as tiered memory paged in and out, an influential precedent for compaction architectures.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. — the dense-embedding retrieval method underpinning modern context assembly.