07 · Retrieval-Augmented Generation

7.1

Why retrieve at all

A pretrained model stores knowledge in its weights. That parametric memory has four chronic problems, and retrieval addresses each one directly:

Freshness. Weights freeze at the training cutoff. Anything that changed afterward — a price, a policy, last quarter's numbers — is invisible to the model and cannot be patched without retraining. A retrieval corpus is updated by writing a row.
Private and proprietary knowledge. The model never saw your wiki, your contracts, or this customer's ticket history. You cannot prompt your way to facts the model was never shown; you have to put them in the context.
Grounding and citations. A model asked a factual question will produce fluent text whether or not it knows the answer. Supplying the source passages lets the model quote them and lets you show the user where an answer came from. Grounding does not eliminate hallucination, but it converts an unanswerable "is this true?" into a checkable "does this follow from these passages?"
Cost and context economy. The naive alternative — paste the entire knowledge base into every prompt — is quadratic in attention cost (Vol II · EQ 3.1) and bumps into the window limit. Retrieval sends the model a few hundred relevant tokens instead of a few hundred thousand, which is cheaper, faster, and frequently more accurate because the model is not distracted by irrelevant text.

The original framing (Lewis et al., 2020) treated retrieval as a differentiable component trained jointly with the generator. Production RAG in 2026 is almost always the simpler, modular version: a frozen retriever feeds a frozen generator, and you engineer the seam between them. That seam — chunk, embed, index, retrieve, rerank, assemble — is the subject of this chapter, and every stage is a place where recall quietly leaks.

RAG is not a model; it is a systems discipline wrapped around one. The failure surface looks like the agent failure surface from Ch 06: most "the model is wrong" complaints turn out to be "the right passage was never retrieved," which is a recall bug, not a reasoning bug. Diagnosing which one you have is half the job.

7.2

The pipeline, end to end

RAG has two phases. Indexing runs offline, once per corpus version: documents are split into chunks, each chunk is embedded into a vector, and the vectors are loaded into a searchable index. Query time runs per request: the question is embedded, the index returns the nearest chunks, an optional reranker reorders them, and the top survivors are assembled into a prompt alongside the question.

FIG A7.A · THE RAG PIPELINE: INDEX OFFLINE, RETRIEVE ONLINE

Two design constraints fall straight out of the diagram. First, the query and the documents must be embedded into the same vector space — usually the same encoder — or nearness is meaningless. Second, every chunk must carry its provenance (document id, section, offsets) all the way through, because the answer's value is only as good as the citation you can attach to it. Treat the chunk's metadata as load-bearing, not decorative.

The rest of the chapter walks the stages in order. Keep one bound in mind throughout: end-to-end answer quality is bounded above by retrieval quality. If the right passage is not in the top-k that reaches the model, no amount of prompting recovers it. That is why we spend more pages on retrieval than on generation.

7.3

Chunking: the unglamorous decision that dominates

An embedding represents a whole chunk as one vector, so the chunk is the atomic unit of retrieval. You can only ever retrieve a chunk, never half of one, which makes the splitting policy more consequential than practitioners expect. The trade-off is a tension you cannot escape:

Small chunks (one or two sentences) are semantically focused: the embedding is a tight summary, so a matching query scores high and precision per chunk is good. But a small chunk often lacks the surrounding context the model needs to answer, and the fact you want may straddle two chunks.
Large chunks (a page or more) carry full context and rarely split a fact, helping recall. But one vector now averages many topics, diluting the signal, so the right chunk scores lower and may not crack the top-k. Large chunks also waste context budget on irrelevant neighbouring text.

The standard knobs: a target chunk size (commonly 200–500 tokens for dense embeddings, bounded by the encoder's max sequence length), a token overlap between adjacent chunks (50–100 tokens) so a fact spanning a boundary survives in at least one chunk, and a splitter that respects structure — split on paragraphs, headings, or markdown sections rather than a blind character count, so chunks coincide with semantic units. More elaborate schemes exist: semantic chunking places boundaries where adjacent-sentence embedding similarity drops; hierarchical / parent-document retrieval indexes small chunks for precise matching but feeds the model the larger parent passage for context. There is no universal best; the size is an evaluated hyperparameter, tuned against the metrics of §7.8 on your own corpus.

EQ A7.1 — CHUNK COUNT WITH OVERLAP $$ N_{\text{chunks}} \;=\; \left\lceil \frac{L_{\text{doc}} - o}{\,c - o\,} \right\rceil, \qquad \text{stride} \;=\; c - o $$

$L_{\text{doc}}$ document length in tokens, $c$ chunk size, $o$ overlap ($0 \le o < c$). Each new chunk advances by the stride $c-o$, not by $c$, because the last $o$ tokens are repeated in the next chunk. Overlap is pure storage cost: doubling overlap roughly doubles the redundant tokens you embed and store, so it buys boundary safety at a real index-size price. At $o \to c$ the stride collapses and the chunk count diverges — a useful sanity check that you have not set overlap too high.

A document is $L_{\text{doc}} = 6000$ tokens. With chunk size $c = 500$ and overlap $o = 50$, how many chunks does EQ A7.1 produce? (Stride $= c - o$; take the ceiling.)

Stride $= c - o = 500 - 50 = 450$. EQ A7.1 gives $\lceil (L_{\text{doc}} - o)/(c - o) \rceil = \lceil (6000 - 50)/450 \rceil = \lceil 5950/450 \rceil = \lceil 13.22 \rceil$. The ceiling rounds up to cover the trailing partial chunk, so the count is 14. Note that without overlap ($o = 0$) the same document tiles into exactly $6000/500 = 12$ chunks: overlap added two extra chunks, the storage price of boundary safety. The answer is 14.
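The stride arithmetic in EQ A7.1 can be checked against an actual sliding-window splitter. The sketch below is a minimal illustration, not a production chunker: it assumes the document is already tokenized into a plain list, and `chunk_tokens` and `expected_chunks` are hypothetical helper names introduced here.

```python
import math

def chunk_tokens(tokens, c, o):
    """Split a token list into windows of size c that overlap by o tokens."""
    if not 0 <= o < c:
        raise ValueError("need 0 <= o < c")
    stride = c - o                      # each new chunk advances by c - o
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + c])
        if start + c >= len(tokens):    # this window reached the end of the document
            break
        start += stride
    return chunks

def expected_chunks(L_doc, c, o):
    """Chunk count predicted by EQ A7.1 (at least one chunk for a short document)."""
    return max(1, math.ceil((L_doc - o) / (c - o)))

tokens = list(range(6000))              # stand-in for a 6000-token document
chunks = chunk_tokens(tokens, c=500, o=50)

print("chunks produced :", len(chunks))
print("EQ A7.1 predicts:", expected_chunks(6000, 500, 50))
print("last chunk size :", len(chunks[-1]))
print("boundary shared :", chunks[0][-50:] == chunks[1][:50])
```

The final chunk is a partial one, which is what the ceiling in EQ A7.1 accounts for. Whether to keep, pad, or merge that trailing fragment is a design choice this sketch leaves open.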

INSTRUMENT A7.2 · CHUNK-SIZE TRADE-OFF: PRECISION VS RECALL VS INDEX COST (ILLUSTRATIVE)

Controls: chunk size c (tokens, shown at 350), overlap o (tokens, shown at 60), corpus size (shown at 2.0M tokens). Readouts: estimated precision@k, estimated recall@k, chunks in index.

The curves are an illustrative model of the real tension, not a fit to any benchmark: tiny chunks score high precision but lose facts that span boundaries (recall sags); huge chunks recover every fact but dilute the embedding (precision sags). The sweet spot is the crossover, typically a few hundred tokens. Raise overlap and watch recall recover at the boundaries while the index-size readout climbs — overlap is recall you buy with storage. Push corpus size to feel how the chunk count scales linearly with it.

7.4

Embeddings, similarity, and the index

An embedding model maps a chunk of text to a fixed-length vector $\mathbf{e} \in \mathbb{R}^{d}$ (typical $d$: 384 to 3072) such that semantically similar text lands nearby. These are the same dense representations whose geometry we met in attention, but trained with a contrastive objective so that "how do I reset my password" and "I forgot my login" map to neighbours even with no shared words. Nearness is measured by a normalized inner product, the cosine similarity:

EQ A7.2 — COSINE SIMILARITY $$ \cos(\mathbf{q}, \mathbf{d}) \;=\; \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert\, \lVert \mathbf{d} \rVert} \;=\; \frac{\sum_{i=1}^{d} q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}} \;\in\; [-1, 1] $$

The cosine of the angle between query and document vectors: 1 means identical direction, 0 orthogonal (unrelated), negative means opposed. It ignores magnitude and compares only direction, which is what you want when document length should not dominate relevance. If every vector is L2-normalized to unit length, cosine similarity equals the plain dot product $\mathbf{q}\cdot\mathbf{d}$, and maximizing cosine is equivalent to minimizing squared Euclidean distance. This is why production pipelines normalize at index time and then use the cheaper dot product everywhere downstream.
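The equivalence follows from expanding the squared norm: for unit vectors, $\lVert \mathbf{q} - \mathbf{d} \rVert^2 = \lVert \mathbf{q} \rVert^2 - 2\,\mathbf{q}\cdot\mathbf{d} + \lVert \mathbf{d} \rVert^2 = 2 - 2\cos(\mathbf{q}, \mathbf{d})$, so distance falls exactly as cosine rises. A quick numeric check, as a sketch with NumPy and two arbitrary example vectors chosen for illustration:

```python
import numpy as np

def unit(x):
    """Scale a vector to unit L2 length."""
    return x / np.linalg.norm(x)

q = unit(np.array([1.0, 2.0, 2.0]))
d = unit(np.array([2.0, 3.0, 1.0]))

cosine = float(q @ d)                    # dot product of unit vectors is the cosine
sq_dist = float(np.sum((q - d) ** 2))    # squared Euclidean distance

print(f"cosine      = {cosine:.4f}")
print(f"sq distance = {sq_dist:.4f}")
print(f"2 - 2*cos   = {2 - 2 * cosine:.4f}")
```

The last two printed values agree, so ranking by largest dot product and ranking by smallest Euclidean distance return the same neighbours once vectors are normalized.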

Retrieval is then k-nearest-neighbour search: return the $k$ document vectors with the highest cosine to the query vector. Exact kNN scans every vector, an $O(Nd)$ cost that is fine for thousands of chunks and ruinous for hundreds of millions. Production systems use a vector index that answers approximate nearest-neighbour (ANN) queries in sub-linear time by trading a little recall for a lot of speed. The two dominant families:

HNSW (Hierarchical Navigable Small World): a multi-layer proximity graph you greedily traverse from a coarse top layer down to a dense bottom layer. Excellent recall–latency curve, the default for in-memory indexes; memory-heavy because it stores the graph alongside the vectors.
IVF (inverted file, often with product quantization, IVF-PQ): cluster vectors into cells, search only the few cells nearest the query, and compress vectors so billions fit in RAM. Cheaper memory, slightly lower recall, the workhorse for very large corpora.

Both expose a knob (HNSW's efSearch, IVF's nprobe) that trades recall for latency. The number that matters in the eval (§7.8) is index recall against an exact-kNN ground truth: an ANN index that silently returns the 9th-best instead of the best neighbour is a recall leak you will mistake for a model failure.
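To make "index recall against an exact-kNN ground truth" concrete, here is a toy IVF-style index in NumPy. It is a sketch under loud assumptions: the vectors are random, the cell centroids are randomly picked documents rather than k-means centres, nothing is compressed, and the names `ivf_topk` and `index_recall` are hypothetical. Only the `nprobe` idea mirrors the knob described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n_cells, k = 2000, 16, 20, 10

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

docs = l2norm(rng.normal(size=(N, d)))        # random unit "embeddings"
queries = l2norm(rng.normal(size=(50, d)))

# Coarse quantizer (toy): random documents act as cell centroids.
centroids = docs[rng.choice(N, size=n_cells, replace=False)]
cell_of = np.argmax(docs @ centroids.T, axis=1)   # nearest centroid per document

def exact_topk(q):
    """Ground truth: scan every vector."""
    return np.argsort(-(docs @ q))[:k]

def ivf_topk(q, nprobe):
    """Approximate search: scan only the nprobe cells nearest the query."""
    cells = np.argsort(-(centroids @ q))[:nprobe]
    cand = np.flatnonzero(np.isin(cell_of, cells))
    return cand[np.argsort(-(docs[cand] @ q))[:k]]

def index_recall(nprobe):
    """Fraction of exact top-k neighbours that the approximate search also returns."""
    hits = sum(len(set(exact_topk(q)) & set(ivf_topk(q, nprobe))) for q in queries)
    return hits / (k * len(queries))

for nprobe in (1, 5, n_cells):
    print(f"nprobe={nprobe:>2d}  index recall@{k} = {index_recall(nprobe):.3f}")
```

Probing every cell degenerates to an exact scan, so recall reaches its maximum there; smaller `nprobe` values scan fewer vectors and lose some true neighbours. Measuring this curve on your own index is how you separate an ANN leak from a model failure.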

Query vector $\mathbf{q} = (1, 2, 2)$ and document vector $\mathbf{d} = (2, 3, 1)$. Compute their cosine similarity $\dfrac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}$ (round to 4 decimals).

Dot product $\mathbf{q}\cdot\mathbf{d} = 1{\cdot}2 + 2{\cdot}3 + 2{\cdot}1 = 2 + 6 + 2 = 10$. Norms: $\lVert\mathbf{q}\rVert = \sqrt{1+4+4} = \sqrt{9} = 3$; $\lVert\mathbf{d}\rVert = \sqrt{4+9+1} = \sqrt{14} = 3.7417$. Cosine $= 10/(3 \times 3.7417) = 10/11.2250 = $ 0.8909. Both vectors point in a similar direction, so they are strongly related. The answer is 0.8909.

PYTHON · RUNNABLE IN-BROWSER

# dense retrieval, end to end: embed toy docs, L2-normalize, cosine top-k
import numpy as np

# toy "embeddings" for 5 docs and 1 query (already in some vector space)
docs = np.array([
    [0.9, 0.1, 0.0, 0.2],   # 0: password reset
    [0.8, 0.2, 0.1, 0.1],   # 1: forgot login
    [0.1, 0.9, 0.2, 0.0],   # 2: billing invoice
    [0.0, 0.1, 0.95, 0.1],  # 3: ship a package
    [0.85,0.15,0.05,0.25],  # 4: account recovery
], dtype=float)
labels = ["password reset","forgot login","billing","shipping","recovery"]
q = np.array([0.88, 0.12, 0.0, 0.15])

def l2norm(x):                       # unit vectors: cosine becomes dot product
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

D, qn = l2norm(docs), l2norm(q)
scores = D @ qn                      # cosine of each doc with the query
order = np.argsort(-scores)          # descending

print("rank  cosine  document")
for r, i in enumerate(order):
    mark = "  <- query is a login problem" if r == 0 else ""
    print(f"{r+1:>4d}{scores[i]:8.3f}  {labels[i]}{mark}")
print("\ntop-3 retrieved:", [labels[i] for i in order[:3]])


INSTRUMENT A7.1 · COSINE RETRIEVAL PLAYGROUND (EQ A7.2): QUERY VS DOC SET IN 2-D, TOP-k

Controls: query angle θ (shown at 30°), top-k retrieved (shown at 3), similarity measure (COSINE or DOT). Readouts: top match, top score, retrieved set.

Six documents sit at fixed angles; the query is the bright vector you steer. Highlighted dots are the current top-k by similarity. Sweep the query angle and watch the ranking re-sort. The instructive part: switch from COSINE to DOT and the long red vector — a high-magnitude but off-angle document — can jump the ranking, because the dot product rewards magnitude while cosine ignores it. That is exactly why pipelines normalize before indexing: so a verbose document cannot buy relevance with length.

7.5

Sparse, dense, and hybrid retrieval

Dense embeddings are not the only way to rank documents, and on their own they are not the best. There are two complementary regimes:

Sparse / lexical retrieval scores documents by exact term overlap, weighting each term by how rare it is across the corpus and how often it occurs in the document. The canonical scorer is BM25 (Robertson & Zaragoza, 2009), the refined heir to TF-IDF. It is hard to beat at one thing: matching exact tokens (product codes, error strings, proper nouns, acronyms) that a dense model may smear into a nearby-but-wrong region of the vector space.
Dense / semantic retrieval scores by embedding similarity (§7.4), as in Dense Passage Retrieval (Karpukhin et al., 2020). It captures paraphrase and meaning, matching "car won't start" to "engine fails to crank" with zero shared content words — exactly where BM25 returns nothing.

EQ A7.3 — BM25 SCORE $$ \text{BM25}(q, D) \;=\; \sum_{t \in q} \text{IDF}(t)\,\cdot\, \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\dfrac{|D|}{\overline{|D|}}\right)} $$

Sum over query terms $t$. $f(t,D)$ is term $t$'s frequency in document $D$; $\text{IDF}(t)$ up-weights rare terms; $|D|/\overline{|D|}$ is the document length divided by the average, so $b$ ($\approx 0.75$) penalizes long documents and $k_1$ ($\approx 1.2\text{–}2.0$) controls term-frequency saturation. The saturation is the whole point: the tenth occurrence of a term adds far less than the first, so keyword stuffing cannot run away with the score. BM25 needs no training and no GPU; it is the baseline every dense retriever must beat, and frequently does not, alone.
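The saturation claim is easy to see numerically. The sketch below isolates the term-frequency factor of EQ A7.3 for a document of exactly average length, so the length normalizer equals 1; the value $k_1 = 1.5$ is an assumed setting inside the typical range, and `tf_factor` is a hypothetical helper name.

```python
k1 = 1.5  # assumed value inside the typical 1.2 to 2.0 range

def tf_factor(f, k1=k1):
    """Term-frequency factor of EQ A7.3 when |D| equals the average length."""
    return f * (k1 + 1) / (f + k1)

for f in (1, 2, 5, 10, 100):
    print(f"f={f:>3d}  factor={tf_factor(f):.3f}")

first_gain = tf_factor(1) - tf_factor(0)    # what the first occurrence adds
tenth_gain = tf_factor(10) - tf_factor(9)   # what the tenth occurrence adds
print(f"first occurrence adds {first_gain:.3f}, tenth adds {tenth_gain:.3f}")
print(f"upper limit as f grows: k1 + 1 = {k1 + 1}")
```

The factor climbs quickly for the first few occurrences and then flattens toward $k_1 + 1$, which is why repeating a keyword many times buys almost nothing.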

Because their failure modes are disjoint, the production answer is hybrid retrieval: run both, then fuse the two ranked lists. The simplest robust fusion does not even need the raw scores — which live on incomparable scales — only the ranks. Reciprocal Rank Fusion sums a discounted contribution from each list:

EQ A7.4 — RECIPROCAL RANK FUSION $$ \text{RRF}(d) \;=\; \sum_{r \in R} \frac{1}{\,K + \text{rank}_r(d)\,} $$

For document $d$, sum over each ranker $r$ (e.g. BM25 and dense) the reciprocal of its rank in that list, softened by a constant $K$ (commonly 60). A document ranked 1st by BM25 and 2nd by the dense retriever scores $1/61 + 1/62$. RRF is scale-free: it never compares BM25 scores to cosine values, only positions, which is why it is a common default fusion in production search and is often competitive with hand-tuned score weighting. $K$ sets how strongly the top positions dominate: a large $K$ flattens the contribution curve, a small $K$ lets the top of each list dominate.

With $K = 60$, a document is ranked 1st by BM25 and 3rd by the dense retriever. By EQ A7.4 its RRF score is $\frac{1}{60+1} + \frac{1}{60+3}$. Compute it (round to 4 decimals).

$\frac{1}{61} = 0.016393$ and $\frac{1}{63} = 0.015873$. Sum $= 0.016393 + 0.015873 = 0.032266 \approx$ 0.0323. A document that both rankers like rises above one that only a single ranker liked, with no score-scale calibration needed. The answer is 0.0323.

PYTHON · RUNNABLE IN-BROWSER

# a tiny BM25 over toy documents (EQ A7.3), then fuse with a dense list (EQ A7.4)
import numpy as np

docs = [
    "reset your password from the account settings page",
    "i forgot my login and cannot sign in",
    "billing invoice and payment refund policy",
    "track a shipped package with the tracking id",
]
def toks(s): return s.lower().split()
corpus = [toks(d) for d in docs]
avgdl = np.mean([len(d) for d in corpus])
N = len(corpus)
k1, b = 1.5, 0.75

def idf(term):                        # BM25 idf with the +0.5 smoothing
    nq = sum(term in d for d in corpus)
    return np.log((N - nq + 0.5) / (nq + 0.5) + 1)

def bm25(query):
    q = toks(query); out = []
    for d in corpus:
        s = 0.0
        for t in set(q):
            f = d.count(t)
            if f == 0: continue
            s += idf(t) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        out.append(s)
    return np.array(out)

scores = bm25("forgot password login")
rank_bm25 = np.argsort(-scores)       # best first
# pretend a dense retriever returned this order (doc indices, best first):
rank_dense = np.array([0, 1, 3, 2])

def rrf(ranks_lists, K=60):
    fused = np.zeros(N)
    for ranks in ranks_lists:
        for pos, doc in enumerate(ranks):
            fused[doc] += 1.0 / (K + pos + 1)   # rank is 1-based
    return fused

fused = rrf([rank_bm25, rank_dense])
print("BM25 scores :", scores.round(3))
print("BM25 order  :", rank_bm25.tolist())
print("dense order :", rank_dense.tolist())
print("RRF fused   :", fused.round(4).tolist())
print("hybrid top-2:", np.argsort(-fused)[:2].tolist(),
      "->", [docs[i][:24] for i in np.argsort(-fused)[:2]])


7.6

Reranking and query transformation

The retriever's job is recall under a tight latency budget: surface a generous candidate set (say top-50) fast. That set is usually mis-ordered — the truly best passage may sit at rank 11. A reranker takes the small candidate set and re-scores it with a much more expensive, much more accurate model, then keeps the top few.

The architectural distinction is the whole story. The first-stage retriever is a bi-encoder: it embeds the query and each document separately, so document vectors can be precomputed and indexed, but the query never interacts with the document text — only their final vectors meet, at a dot product. A reranker is a cross-encoder: it feeds the query and one candidate document together through a transformer, so every query token can attend to every document token (Vol II · EQ 3.1). That cross-attention is far more discriminating — it sees the actual word-level interactions a single dot product cannot — but it cannot be precomputed, so it costs one full forward pass per candidate. The two-stage pattern resolves the tension: cheap bi-encoder narrows millions to dozens, expensive cross-encoder perfects the order of those dozens.

	Bi-encoder (retriever)	Cross-encoder (reranker)
Query–doc interaction	none until the dot product	full cross-attention, all token pairs
Precompute docs?	yes — index offline	no — must run at query time
Cost	one ANN lookup	one forward pass per candidate
Role	recall over millions, fast	precision over dozens, accurate
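The two-stage pattern can be sketched without any neural model. In the toy below, the "bi-encoder" is a bag-of-words vector computed for each text in isolation, and the "cross-encoder" is a function that reads query and document together and counts query bigrams that appear in order. Both are deliberately crude stand-ins, and all names (`embed`, `retrieve`, `joint_score`, `rerank`) are hypothetical; the point is only the division of labour.

```python
import numpy as np

docs = [
    "man bites dog",
    "dog bites man near the station",
    "cat sleeps all day",
    "station opens at nine",
]
query = "dog bites man"

vocab = sorted({w for text in docs + [query] for w in text.split()})
index_of = {w: i for i, w in enumerate(vocab)}

def embed(text):
    """Toy bi-encoder: unit bag-of-words vector, built without seeing the other text."""
    v = np.zeros(len(vocab))
    for w in text.split():
        v[index_of[w]] += 1.0
    return v / np.linalg.norm(v)

doc_vecs = np.stack([embed(t) for t in docs])   # precomputed offline, once

def retrieve(q, k):
    """First stage: one dot product per document, word order invisible."""
    scores = doc_vecs @ embed(q)
    return [int(i) for i in np.argsort(-scores)[:k]]

def joint_score(q, doc):
    """Toy cross-encoder: reads both texts and counts query bigrams found in the doc."""
    qw, dw = q.split(), doc.split()
    doc_bigrams = set(zip(dw, dw[1:]))
    return sum(bg in doc_bigrams for bg in zip(qw, qw[1:]))

def rerank(q, candidates, keep):
    """Second stage: one joint scoring call per candidate, then keep the best."""
    return sorted(candidates, key=lambda i: -joint_score(q, docs[i]))[:keep]

candidates = retrieve(query, k=3)
final = rerank(query, candidates, keep=2)
print("retriever order:", [docs[i] for i in candidates])
print("after rerank   :", [docs[i] for i in final])
```

The first stage prefers "man bites dog" because it contains exactly the query's words; the second stage, which sees word order, promotes the passage where the dog does the biting. A real cross-encoder learns far richer interactions, but it pays the same per-candidate cost.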

Query transformation attacks the same recall problem from the other end — by fixing the query before it ever hits the index. Three techniques earn their keep:

Query rewriting / expansion. Have an LLM rephrase a terse or conversational query into a retrieval-friendly form, resolve pronouns from the conversation ("it" → "the refund policy"), and add synonyms. Cheap, and it fixes the most common failure: a user query that is not phrased like the documents.
HyDE (Hypothetical Document Embeddings). Instead of embedding the question, ask the LLM to write a hypothetical answer and embed that. A fabricated answer, even if factually wrong, lives near real answers in embedding space — it shares their vocabulary and shape — so it retrieves better than the bare question, which is phrased nothing like the documents it should match.
Multi-query. Generate several paraphrases of the question, retrieve for each, and union (or RRF-fuse) the results. This widens recall for ambiguous queries at the cost of more retrieval calls.
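Multi-query retrieval with rank fusion can be sketched as follows. The assumptions are stated plainly: the paraphrases are hand-written stand-ins for what an LLM would generate, the retriever is a toy token-overlap ranker rather than a real index, and `retrieve` and `rrf` are hypothetical helper names.

```python
docs = [
    "reset your password from the account settings page",
    "i forgot my login and cannot sign in",
    "billing invoice and payment refund policy",
    "track a shipped package with the tracking id",
]

def retrieve(query, k=2):
    """Toy lexical retriever: rank docs by shared tokens, drop docs with no overlap."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.split())), i) for i, d in enumerate(docs)]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [i for _, i in ranked[:k]]

def rrf(rank_lists, K=60):
    """Reciprocal Rank Fusion over several ranked lists of doc indices."""
    fused = {}
    for ranks in rank_lists:
        for pos, doc in enumerate(ranks, start=1):   # rank is 1-based
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (K + pos)
    return sorted(fused, key=lambda doc: -fused[doc])

# Hand-written paraphrases standing in for LLM-generated ones.
paraphrases = [
    "cannot sign in",                      # the user's original wording
    "forgot my password",
    "reset login from account settings",
]

single = retrieve(paraphrases[0])
multi = rrf([retrieve(p) for p in paraphrases])
print("single query :", [docs[i] for i in single])
print("multi-query  :", [docs[i] for i in multi])
```

The original wording reaches only the login document; the paraphrases also surface the password-reset page, and fusion keeps the document that several phrasings agree on at the top. The price is three retrieval calls instead of one.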

INSTRUMENT A7.3 · RERANKER ORDER SHIFT: BI-ENCODER CANDIDATES → CROSS-ENCODER ORDER

Controls: stage (RETRIEVER ORDER or AFTER RERANK), keep top-k (shown at 3). Readouts: relevant in top-k (retriever), relevant in top-k (reranked), precision@k lift.

Eight candidates the bi-encoder returned, mint = truly relevant. In RETRIEVER ORDER two relevant passages languish below the top-3 cut line — the dot product mis-ranked them. Toggle AFTER RERANK and the cross-encoder pulls them up, because it actually read the query against each document. The lift readout is the precision@k gain. Slide k to see why reranking matters most at small k: the tighter the context budget, the more the order of the first few passages decides the answer.

7.7

Context assembly and lost-in-the-middle

You now have a short, reranked list of passages. How you arrange them in the prompt is not cosmetic. Liu et al. (2023) documented a robust position effect: when relevant information is placed in the middle of a long context, models use it markedly less reliably than when the same information sits at the very start or the very end. Accuracy as a function of the gold passage's position is U-shaped, not flat — a model with a 100K window does not attend uniformly across it.

The engineering consequences are concrete:

Order by relevance toward the edges. Put the most relevant reranked passages first and last, not buried in the middle. Some pipelines deliberately interleave to keep top passages at both ends.
Fewer, better passages beat more, noisier ones. Padding the context with marginal chunks does not help and can hurt: it adds distractors (§7.8) and pushes the gold passage toward the lost-in-the-middle zone. This is the quantitative case for aggressive reranking — context budget spent on rank-20 noise is worse than wasted.
Cite explicitly. Number the passages and instruct the model to ground each claim in a numbered source. This makes citations checkable and discourages the model from answering from parametric memory when the passages do not support a claim.

The instruction also matters: a prompt that says "answer only from the provided context; if the answer is not present, say so" measurably reduces ungrounded answers compared to a prompt that merely provides context. Grounding is partly retrieval and partly instruction.
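One way to combine these points, edge ordering, numbered sources, and a context-only instruction, is sketched below. It assumes the passages arrive sorted best-first from the reranker; the `doc_id` values, the passage texts, and the function names are all hypothetical, and the alternating layout is just one simple interleaving scheme.

```python
def order_for_edges(passages):
    """Alternate best-first passages between front and back so the strongest sit at the edges."""
    front, back = [], []
    for rank, passage in enumerate(passages):
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]          # rank 1 first, rank 2 last, weakest in the middle

def build_prompt(question, passages):
    """Number each passage with its provenance and prepend a grounding instruction."""
    lines = [
        "Answer only from the provided context; if the answer is not present, say so.",
        "Cite the passage number in square brackets after each claim.",
        "",
    ]
    for n, p in enumerate(order_for_edges(passages), start=1):
        lines.append(f"[{n}] ({p['doc_id']}) {p['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Toy passages, already sorted best-first by a reranker.
passages = [
    {"doc_id": "policy.md", "text": "rank 1 passage"},
    {"doc_id": "faq.md", "text": "rank 2 passage"},
    {"doc_id": "policy.md", "text": "rank 3 passage"},
    {"doc_id": "notes.md", "text": "rank 4 passage"},
    {"doc_id": "faq.md", "text": "rank 5 passage"},
]

prompt = build_prompt("What is the refund window?", passages)
print(prompt)
```

Because each numbered line carries its `doc_id`, a citation such as `[2]` in the answer can be traced back to a source document and checked.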

Lost-in-the-middle is the retrieval-side mirror of the long-context argument in §7.9. A bigger window does not make position effects disappear; it enlarges the middle where attention is weakest. Retrieving fewer, well-ordered passages is often more accurate than dumping a hundred candidates into a long context and hoping the model finds the needle.

7.8

Evaluation and failure modes

RAG has two stages to evaluate, and conflating them is the most common measurement mistake. Evaluate the retriever and the generator separately, because a bad answer can come from either, and the fix differs entirely.

Retrieval metrics need a labelled set: queries with their known-relevant chunks. The two workhorses:

EQ A7.5 — RECALL@k AND PRECISION@k $$ \text{Recall@}k \;=\; \frac{|\,\text{relevant} \cap \text{top-}k\,|}{|\,\text{relevant}\,|}, \qquad \text{Precision@}k \;=\; \frac{|\,\text{relevant} \cap \text{top-}k\,|}{k} $$

Recall@k: of all the chunks that should have been found, what fraction made the top-$k$. Precision@k: of the $k$ you returned, what fraction were relevant. For RAG, recall@k is the metric that bounds answer quality — a passage the retriever never surfaces is a passage the model can never use, so recall is the ceiling on everything downstream. Precision@k matters because every irrelevant passage spends context budget and risks distracting the generator (lost-in-the-middle, §7.7). Rank-aware variants (MRR, nDCG) additionally reward putting the relevant chunk near the top.

Generation metrics ask whether the answer is supported by the retrieved context, which is the property RAG actually promises:

Faithfulness / groundedness. Is every claim in the answer entailed by the retrieved passages? An unfaithful answer hallucinated beyond its sources even though the sources were present. Typically scored by an LLM judge (Vol IV Ch 06) decomposing the answer into claims and checking each against the context — with all the judge caveats from that chapter.
Answer relevance. Does the answer actually address the question, regardless of grounding?
Context relevance / precision. Were the retrieved passages actually needed, or did the retriever pad the context with noise?

The diagnostic table that turns a vague "RAG is wrong" into an owned bug:

Failure mode	Symptom	Where it lives	First fix
Stale index	confidently cites outdated facts	indexing	Re-index on source change; track corpus version in the trace
Chunk boundary split	retrieves half a fact; answer truncated mid-reasoning	chunking (7.3)	Increase overlap; structure-aware splitter; parent-document retrieval
Recall miss	right passage exists but never reaches the model	retrieval (7.4–7.5)	Add hybrid (BM25 + dense); raise efSearch/nprobe; query rewrite / HyDE
Distractors	a near-miss passage misleads the answer	ranking (7.6)	Add a cross-encoder reranker; shrink k; tighten context-only prompt
Lost in the middle	relevant passage retrieved but ignored	assembly (7.7)	Reorder top passages to the edges; fewer, better chunks
Ungrounded answer	good context, answer drifts beyond it	generation	Stronger grounding instruction; faithfulness eval gate; citation check

A query has 5 relevant chunks in the corpus. The retriever's top-10 contains 3 of them. By EQ A7.5, what is recall@10? (Answer as a decimal.)

Recall@10 $= \dfrac{|\text{relevant} \cap \text{top-}10|}{|\text{relevant}|} = \dfrac{3}{5} = $ 0.6. Two relevant chunks were missed entirely, so the model can answer from at most 3 of 5 sources — recall@k is the ceiling on what the generator can possibly use. The answer is 0.6.
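EQ A7.5 translates directly into a few lines of Python. The sketch below reproduces the worked example with made-up chunk ids (`c*` relevant, `x*` irrelevant); the function names are hypothetical, and the reciprocal rank shown is the per-query quantity that MRR averages over a labelled set.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k that is relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for pos, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / pos
    return 0.0

relevant = {"c1", "c2", "c3", "c4", "c5"}            # 5 relevant chunks in the corpus
retrieved = ["x1", "c2", "x2", "c5", "x3",           # top-10 holding 3 of them
             "x4", "c1", "x5", "x6", "x7"]

print("recall@10      :", recall_at_k(retrieved, relevant, 10))
print("precision@10   :", precision_at_k(retrieved, relevant, 10))
print("reciprocal rank:", reciprocal_rank(retrieved, relevant))
```

Recall and precision use the same numerator and differ only in the denominator, which is why raising $k$ tends to push them in opposite directions.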

7.9

Agentic RAG, and the honest decision

Everything above describes naive RAG: one retrieval, one generation, fixed pipeline. It is the right default and solves most cases. It fails on questions that need more than one hop — "compare our 2024 and 2025 refund policies" needs two distinct retrievals — or where the first retrieval reveals what the second query should be.

Agentic / iterative RAG hands retrieval to the agent loop of Ch 01. Retrieval becomes a tool the model calls (Ch 03): the agent decides whether to retrieve, formulates its own query, reads the results, and decides whether to retrieve again, refine, or answer. This subsumes the techniques above — query rewriting becomes the agent choosing a better query; multi-query becomes the agent issuing several searches; reranking becomes the agent reading and discarding. It is strictly more capable and strictly more expensive (every hop is a model call plus a retrieval), and it inherits every agent failure mode from Ch 06: loops, premature termination, hallucinated tool args. Reach for it when single-shot recall is provably insufficient, not by default.
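The control flow of an iterative retrieval loop can be sketched with the model replaced by a stub. Everything here is an assumption for illustration: `plan_next_query` is a hard-coded policy standing in for the model's decision, the corpus is two placeholder passages, and the hop budget and repeated-query guard are one simple defence against the looping failure mode mentioned above.

```python
def agentic_retrieve(question, plan_next_query, retrieve, max_hops=3):
    """Ask the policy for queries until it returns None, repeats itself, or the budget ends."""
    evidence, seen_queries = [], set()
    for _ in range(max_hops):
        query = plan_next_query(question, evidence)
        if query is None or query in seen_queries:   # done, or stuck in a loop
            break
        seen_queries.add(query)
        evidence.extend(retrieve(query))
    return evidence

# Toy corpus keyed by query string; the passage texts are placeholders.
corpus = {
    "2024 refund policy": ["2024: placeholder passage for the 2024 policy"],
    "2025 refund policy": ["2025: placeholder passage for the 2025 policy"],
}

def retrieve(query):
    return corpus.get(query, [])

def plan_next_query(question, evidence):
    """Stub policy standing in for the model: fetch each year that is still missing."""
    for year in ("2024", "2025"):
        if not any(passage.startswith(year) for passage in evidence):
            return f"{year} refund policy"
    return None   # enough evidence gathered, ready to answer

evidence = agentic_retrieve("compare our 2024 and 2025 refund policies",
                            plan_next_query, retrieve)
print(evidence)
```

The two-hop comparison question ends with both passages in hand after two retrievals. If a retrieval comes back empty, the stub would request the same query again and the guard stops the loop instead of spending the remaining budget.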

Finally, the question that should precede all of this engineering. RAG is one of three ways to give a model knowledge it did not have, and they are not interchangeable:

	RAG	Long context	Fine-tuning
Best for	large, changing, citable knowledge bases	a few documents, used once, per request	teaching style, format, or a skill
Freshness	update by writing a row	whatever you paste in	frozen at training time
Citations	native — points to chunks	possible but unindexed	none
Per-query cost	low (small context)	high (whole corpus in context)	low
Upfront cost	build & maintain a pipeline	none	a training run + data curation
Main failure	recall misses	lost-in-the-middle; price	catastrophic forgetting; staleness

The honest reading: they are complementary, not competing. Fine-tuning teaches the model how to behave (a domain's tone, an output schema, a reasoning style); RAG supplies what is true right now; long context handles the one-off document you will never query again. A common production stack fine-tunes for format and behaviour, then uses RAG for facts. "Just use a long context window" is tempting as windows grow, but it pays the full context price on every query, offers no indexed citations, and walks straight into lost-in-the-middle at the scales where it would matter. Use the cheapest mechanism that meets the requirement, and reach for RAG precisely when the knowledge is large, private, changing, or must be cited.

A long-context approach costs $\$0.90$ per query and answers correctly 80% of the time; a RAG pipeline costs $\$0.45$ per query and answers correctly 70% of the time. Using the cost-per-resolved-task idea from Ch 06 ($C/\Pr[\text{correct}]$), what is the cost per correct answer of the cheaper-per-correct-answer option (round to 4 decimals)?

Long context: $0.90/0.80 = \$1.125$ per correct answer. RAG: $0.45/0.70 = \$0.6429$ per correct answer. RAG is cheaper per correct answer despite the lower raw accuracy, because it is half the price per query — exactly the lesson of Ch 06: compare on cost per resolved task, never on accuracy or raw cost alone. The cheaper option costs 0.6429.
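The same comparison as a small calculation, using the figures from the exercise; `cost_per_correct` is a hypothetical helper name for the $C/\Pr[\text{correct}]$ ratio.

```python
def cost_per_correct(cost_per_query, p_correct):
    """Expected spend per correct answer: C / Pr[correct]."""
    if not 0 < p_correct <= 1:
        raise ValueError("p_correct must be in (0, 1]")
    return cost_per_query / p_correct

# (cost per query in dollars, probability of a correct answer)
options = {"long context": (0.90, 0.80), "RAG": (0.45, 0.70)}

costs = {name: cost_per_correct(c, p) for name, (c, p) in options.items()}
cheapest = min(costs, key=costs.get)

for name, value in costs.items():
    print(f"{name:<12s} ${value:.4f} per correct answer")
print("cheapest per correct answer:", cheapest)
```

Swapping in your own measured costs and accuracies makes the break-even visible: in this example RAG would have to fall to 40% accuracy before long context matched it at $1.125 per correct answer.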

That closes Volume IV. Across seven chapters you went from the bare agentic loop to context engineering, tools and MCP, harness and loop design, evaluation and cost, and now retrieval, the discipline that grounds an agent in knowledge it was never trained on. The four volumes together cover the stack, from foundations through prompting to agent engineering. Return to the full contents to revisit a thread, or carry these patterns into the Gym and put them under load.

References

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 · the paper that named RAG and trained retriever + generator jointly
Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 · the dual-encoder dense retriever behind modern semantic search
Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 · the U-shaped position effect that governs context assembly
Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 · the naive / advanced / modular RAG taxonomy used here
Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. DOI:10.1561/1500000019 · the canonical derivation of the BM25 scorer