Why retrieve at all
A pretrained model stores knowledge in its weights. That parametric memory has four chronic problems, and retrieval addresses each one directly:
- Freshness. Weights freeze at the training cutoff. Anything that changed afterward — a price, a policy, last quarter's numbers — is invisible to the model and cannot be patched without retraining. A retrieval corpus is updated by writing a row.
- Private and proprietary knowledge. The model never saw your wiki, your contracts, or this customer's ticket history. You cannot prompt your way to facts the model was never shown; you have to put them in the context.
- Grounding and citations. A model asked a factual question will produce fluent text whether or not it knows the answer. Supplying the source passages lets the model quote them and lets you show the user where an answer came from. Grounding does not eliminate hallucination, but it converts an unanswerable "is this true?" into a checkable "does this follow from these passages?"
- Cost and context economy. The naive alternative — paste the entire knowledge base into every prompt — is quadratic in attention cost (Vol II · EQ 3.1) and bumps into the window limit. Retrieval sends the model a few hundred relevant tokens instead of a few hundred thousand, which is cheaper, faster, and frequently more accurate because the model is not distracted by irrelevant text.
The original framing (Lewis et al., 2020) treated retrieval as a differentiable component trained jointly with the generator. Production RAG in 2026 is almost always the simpler, modular version: a frozen retriever feeds a frozen generator, and you engineer the seam between them. That seam — chunk, embed, index, retrieve, rerank, assemble — is the subject of this chapter, and every stage is a place where recall quietly leaks.
RAG is not a model; it is a systems discipline wrapped around one. The failure surface looks like the agent failure surface from Ch 06: most "the model is wrong" complaints turn out to be "the right passage was never retrieved," which is a recall bug, not a reasoning bug. Diagnosing which one you have is half the job.
The pipeline, end to end
RAG has two phases. Indexing runs offline, once per corpus version: documents are split into chunks, each chunk is embedded into a vector, and the vectors are loaded into a searchable index. Query time runs per request: the question is embedded, the index returns the nearest chunks, an optional reranker reorders them, and the top survivors are assembled into a prompt alongside the question.
Two design constraints fall straight out of this pipeline. First, the query and the documents must be embedded into the same vector space, usually by the same encoder, or nearness is meaningless. Second, every chunk must carry its provenance (document id, section, offsets) all the way through, because the answer's value is only as good as the citation you can attach to it. Treat the chunk's metadata as load-bearing, not decorative.
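One way to make provenance load-bearing is to build it into the chunk type itself, so that no stage can pass text along without its source. The sketch below is only an illustration: the `Chunk` fields and the `cite` helper are hypothetical names chosen here, not part of any particular library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    """A retrievable unit that carries its provenance everywhere it goes."""
    doc_id: str    # which source document the text came from
    section: str   # heading or section label inside that document
    start: int     # character offset where the chunk begins
    end: int       # character offset where the chunk ends (exclusive)
    text: str      # the chunk text itself

def cite(chunk: Chunk) -> str:
    """Render a human-checkable citation from the chunk's metadata."""
    return f"{chunk.doc_id} §{chunk.section} [{chunk.start}:{chunk.end}]"

# Toy source text, invented for this example.
source = "Refunds are issued within 14 days. Store credit never expires."
first = Chunk("policy-2025", "refunds", 0, 34, source[0:34])
print(cite(first), "->", first.text)
```

Because the dataclass is frozen, a later stage cannot silently rewrite the offsets or the document id after indexing.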
The rest of the chapter walks the stages in order. Keep one principle in mind throughout: end-to-end answer quality is bounded above by retrieval quality. If the right passage is not in the top-k that reaches the model, no amount of prompting recovers it. That is why we spend more pages on retrieval than on generation.
Chunking: the unglamorous decision that dominates
An embedding represents a whole chunk as one vector, so the chunk is the atomic unit of retrieval. You can only ever retrieve a chunk, never half of one, which makes the splitting policy more consequential than practitioners expect. The trade-off is a tension you cannot escape:
- Small chunks (one or two sentences) are semantically focused: the embedding is a tight summary, so a matching query scores high and precision per chunk is good. But a small chunk often lacks the surrounding context the model needs to answer, and the fact you want may straddle two chunks.
- Large chunks (a page or more) carry full context and rarely split a fact, helping recall. But one vector now averages many topics, diluting the signal, so the right chunk scores lower and may not crack the top-k. Large chunks also waste context budget on irrelevant neighbouring text.
The standard knobs: a target chunk size (commonly 200–500 tokens for dense embeddings, bounded by the encoder's max sequence length), a token overlap between adjacent chunks (50–100 tokens) so a fact spanning a boundary survives in at least one chunk, and a splitter that respects structure — split on paragraphs, headings, or markdown sections rather than a blind character count, so chunks coincide with semantic units. More elaborate schemes exist: semantic chunking places boundaries where adjacent-sentence embedding similarity drops; hierarchical / parent-document retrieval indexes small chunks for precise matching but feeds the model the larger parent passage for context. There is no universal best; the size is an evaluated hyperparameter, tuned against the metrics of §7.8 on your own corpus.
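The size and overlap knobs can be sketched as a sliding window over tokens. This is a minimal illustration, assuming a plain list of tokens stands in for the encoder's tokenizer output; `chunk_tokens` is a hypothetical helper, and the tiny sizes are chosen only so the output is readable. A structure-aware splitter would cut on headings or paragraphs first and apply a window like this inside each unit.

```python
def chunk_tokens(tokens, size=200, overlap=50):
    """Split a token list into windows of up to `size` tokens.

    Adjacent windows share `overlap` tokens, so a fact that straddles a
    boundary survives intact in at least one window.
    Returns a list of (start_offset, window) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):   # this window already reaches the end
            break
    return chunks

tokens = [f"t{i}" for i in range(23)]
chunks = chunk_tokens(tokens, size=10, overlap=3)
for start, window in chunks:
    print(start, window[0], "...", window[-1], f"({len(window)} tokens)")
```

Keeping the start offset with each window is what later lets a chunk point back to its exact place in the source document.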
Embeddings, similarity, and the index
An embedding model maps a chunk of text to a fixed-length vector \(\mathbf{e} \in \mathbb{R}^{d}\) (typical \(d\): 384 to 3072) such that semantically similar text lands nearby. These are the same dense representations whose geometry we met in attention, but trained with a contrastive objective so that "how do I reset my password" and "I forgot my login" map to neighbours even with no shared words. Nearness is measured by an inner product: in practice, cosine similarity, which is the inner product of the two vectors after L2 normalization.
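Written out, assuming cosine similarity is the measure intended here (it is what the retrieval step ranks by), for a query vector \(\mathbf{q}\) and a document vector \(\mathbf{d}\):

\[
\operatorname{sim}(\mathbf{q}, \mathbf{d}) = \cos\theta = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
\]

When both vectors are scaled to unit length beforehand, the denominator is 1 and the score reduces to the plain dot product \(\mathbf{q} \cdot \mathbf{d}\), which is why indexes typically store normalized vectors.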
Retrieval is then k-nearest-neighbour search: return the \(k\) document vectors with the highest cosine to the query vector. Exact kNN scans every vector, an \(O(Nd)\) cost that is fine for thousands of chunks and ruinous for hundreds of millions. Production systems use a vector index that answers approximate nearest-neighbour (ANN) queries in sub-linear time by trading a little recall for a lot of speed. The two dominant families:
- HNSW (Hierarchical Navigable Small World): a multi-layer proximity graph you greedily traverse from a coarse top layer down to a dense bottom layer. Excellent recall–latency curve, the default for in-memory indexes; memory-heavy because it stores the graph alongside the vectors.
- IVF (inverted file, often with product quantization, IVF-PQ): cluster vectors into cells, search only the few cells nearest the query, and compress vectors so billions fit in RAM. Cheaper memory, slightly lower recall, the workhorse for very large corpora.
Both expose a knob (HNSW's efSearch, IVF's nprobe) that trades recall for latency. The number that matters in the eval (§7.8) is index recall against an exact-kNN ground truth: an ANN index that silently returns the 9th-best instead of the best neighbour is a recall leak you will mistake for a model failure.
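Index recall against an exact-kNN ground truth can be measured directly. The sketch below builds a deliberately crude IVF-style index in NumPy to show the shape of that measurement. Everything in it is an assumption made for illustration: the vectors are random, the cell centroids are randomly chosen documents rather than k-means centres, and `ivf_topk` and `index_recall` are hypothetical names. A real system would run the same comparison against its actual ANN library.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n_cells, k = 2000, 16, 20, 10

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

docs = l2norm(rng.normal(size=(N, d)))        # random stand-in embeddings
queries = l2norm(rng.normal(size=(50, d)))

# Toy IVF: random documents act as cell centroids (real systems use k-means).
centroids = docs[rng.choice(N, size=n_cells, replace=False)]
cell_of = np.argmax(docs @ centroids.T, axis=1)   # each doc joins its nearest centroid

def exact_topk(q):
    """Ground truth: scan every vector."""
    return np.argsort(-(docs @ q))[:k]

def ivf_topk(q, nprobe):
    """Approximate search: score only docs in the nprobe cells nearest the query."""
    probe = np.argsort(-(centroids @ q))[:nprobe]
    cand = np.flatnonzero(np.isin(cell_of, probe))
    return cand[np.argsort(-(docs[cand] @ q))[:k]]

def index_recall(nprobe):
    """Fraction of the exact top-k that the approximate search also returns."""
    hits = [len(set(exact_topk(q)) & set(ivf_topk(q, nprobe))) for q in queries]
    return sum(hits) / (k * len(queries))

recalls = {nprobe: index_recall(nprobe) for nprobe in (1, 4, 20)}
for nprobe, r in recalls.items():
    print(f"nprobe={nprobe:>2d}  recall@{k} vs exact = {r:.3f}")
```

Probing every cell reproduces the exact result, and probing fewer cells can only lose true neighbours, which is the recall-for-latency trade described above.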
```python
# dense retrieval, end to end: embed toy docs, L2-normalize, cosine top-k
import numpy as np

# toy "embeddings" for 5 docs and 1 query (already in some vector space)
docs = np.array([
    [0.9,  0.1,  0.0,  0.2],    # 0: password reset
    [0.8,  0.2,  0.1,  0.1],    # 1: forgot login
    [0.1,  0.9,  0.2,  0.0],    # 2: billing invoice
    [0.0,  0.1,  0.95, 0.1],    # 3: ship a package
    [0.85, 0.15, 0.05, 0.25],   # 4: account recovery
], dtype=float)
labels = ["password reset", "forgot login", "billing", "shipping", "recovery"]
q = np.array([0.88, 0.12, 0.0, 0.15])

def l2norm(x):  # unit vectors: cosine becomes dot product
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

D, qn = l2norm(docs), l2norm(q)
scores = D @ qn                # cosine of each doc with the query
order = np.argsort(-scores)    # descending

print("rank cosine document")
for r, i in enumerate(order):
    mark = " <- query is a login problem" if r == 0 else ""
    print(f"{r+1:>4d}{scores[i]:8.3f} {labels[i]}{mark}")
print("\ntop-3 retrieved:", [labels[i] for i in order[:3]])
```
Sparse, dense, and hybrid retrieval
Dense embeddings are not the only way to rank documents, and on their own they are not the best. There are two complementary regimes:
- Sparse / lexical retrieval scores documents by exact term overlap, weighting each matching term by how rare it is across the corpus and how often it occurs in the document. The canonical scorer is BM25 (Robertson & Zaragoza, 2009), the refined heir to TF-IDF. It is hard to beat at one thing: matching exact tokens (product codes, error strings, proper nouns, acronyms) that a dense model may smear into a nearby-but-wrong region of the vector space.
- Dense / semantic retrieval scores by embedding similarity (§7.4), as in Dense Passage Retrieval (Karpukhin et al., 2020). It captures paraphrase and meaning, matching "car won't start" to "engine fails to crank" with zero shared content words — exactly where BM25 returns nothing.
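For reference, one common form of the BM25 score, and the variant the toy listing in this section computes, is given below. The symbols are introduced here for illustration: \(f(t,d)\) is the count of term \(t\) in document \(d\), \(|d|\) is the document length in tokens, \(\mathrm{avgdl}\) is the mean document length, \(N\) is the number of documents, and \(n_t\) is the number of documents containing \(t\).

\[
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t)\,
\frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)
\]

The parameter \(k_1\) controls how quickly repeated occurrences of a term saturate, and \(b\) controls how strongly long documents are penalized. Other idf variants exist; this one adds 1 inside the logarithm so the weight never goes negative.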
Because their failure modes are disjoint, the production answer is hybrid retrieval: run both, then fuse the two ranked lists. The simplest robust fusion does not even need the raw scores — which live on incomparable scales — only the ranks. Reciprocal Rank Fusion sums a discounted contribution from each list:
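In its usual formulation, with \(R\) the set of ranked lists being fused and \(\operatorname{rank}_r(d)\) the 1-based position of document \(d\) in list \(r\) (a document absent from a list contributes nothing for that list):

\[
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{K + \operatorname{rank}_r(d)}
\]

The constant \(K\) damps the advantage of the very top ranks. \(K = 60\) is a commonly used default and is the value the listing uses.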
```python
# a tiny BM25 over toy documents (EQ A7.3), then fuse with a dense list (EQ A7.4)
import numpy as np

docs = [
    "reset your password from the account settings page",
    "i forgot my login and cannot sign in",
    "billing invoice and payment refund policy",
    "track a shipped package with the tracking id",
]

def toks(s):
    return s.lower().split()

corpus = [toks(d) for d in docs]
avgdl = np.mean([len(d) for d in corpus])
N = len(corpus)
k1, b = 1.5, 0.75

def idf(term):  # BM25 idf with the +0.5 smoothing
    nq = sum(term in d for d in corpus)
    return np.log((N - nq + 0.5) / (nq + 0.5) + 1)

def bm25(query):
    q = toks(query)
    out = []
    for d in corpus:
        s = 0.0
        for t in set(q):
            f = d.count(t)
            if f == 0:
                continue
            s += idf(t) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        out.append(s)
    return np.array(out)

scores = bm25("forgot password login")
rank_bm25 = np.argsort(-scores)  # best first

# pretend a dense retriever returned this order (doc indices, best first):
rank_dense = np.array([0, 1, 3, 2])

def rrf(ranks_lists, K=60):
    fused = np.zeros(N)
    for ranks in ranks_lists:
        for pos, doc in enumerate(ranks):
            fused[doc] += 1.0 / (K + pos + 1)  # rank is 1-based
    return fused

fused = rrf([rank_bm25, rank_dense])
print("BM25 scores :", scores.round(3))
print("BM25 order :", rank_bm25.tolist())
print("dense order :", rank_dense.tolist())
print("RRF fused :", fused.round(4).tolist())
print("hybrid top-2:", np.argsort(-fused)[:2].tolist(),
      "->", [docs[i][:24] for i in np.argsort(-fused)[:2]])
```
Reranking and query transformation
The retriever's job is recall under a tight latency budget: surface a generous candidate set (say top-50) fast. That set is usually mis-ordered — the truly best passage may sit at rank 11. A reranker takes the small candidate set and re-scores it with a much more expensive, much more accurate model, then keeps the top few.
The architectural distinction is the whole story. The first-stage retriever is a bi-encoder: it embeds the query and each document separately, so document vectors can be precomputed and indexed, but the query never interacts with the document text — only their final vectors meet, at a dot product. A reranker is a cross-encoder: it feeds the query and one candidate document together through a transformer, so every query token can attend to every document token (Vol II · EQ 3.1). That cross-attention is far more discriminating — it sees the actual word-level interactions a single dot product cannot — but it cannot be precomputed, so it costs one full forward pass per candidate. The two-stage pattern resolves the tension: cheap bi-encoder narrows millions to dozens, expensive cross-encoder perfects the order of those dozens.
| | Bi-encoder (retriever) | Cross-encoder (reranker) |
|---|---|---|
| Query–doc interaction | none until the dot product | full cross-attention, all token pairs |
| Precompute docs? | yes — index offline | no — must run at query time |
| Cost | one ANN lookup | one forward pass per candidate |
| Role | recall over millions, fast | precision over dozens, accurate |
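The two-stage control flow can be sketched with stand-in scorers. Neither scorer below is a learned model, and both are assumptions made purely for illustration: `embed` is a normalized bag-of-words vector playing the bi-encoder's role, and `pair_score` is a toy function that looks at the query and document together, playing the cross-encoder's role. What the sketch does show faithfully is the shape: document vectors are computed once, the first stage is a dot product, and the second stage makes one joint scoring call per surviving candidate.

```python
import numpy as np

docs = [
    "password rules and password length for every password",
    "how to reset password when you forgot it and cannot sign in to the account",
    "track a shipped package with the tracking id",
]
vocab = sorted({w for d in docs for w in d.split()})
index_of = {w: i for i, w in enumerate(vocab)}

def embed(text):
    """Stand-in bi-encoder: normalized bag-of-words vector (not a learned model)."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in index_of:
            v[index_of[w]] += 1.0
    return v / (np.linalg.norm(v) + 1e-12)

doc_vecs = np.stack([embed(d) for d in docs])   # precomputed offline, query-independent

def pair_score(query, doc):
    """Stand-in cross-encoder: sees query and document together and counts
    shared adjacent word pairs, an interaction the bag-of-words dot product misses."""
    q, t = query.lower().split(), doc.lower().split()
    q_bigrams = set(zip(q, q[1:]))
    return sum(bigram in q_bigrams for bigram in zip(t, t[1:]))

def search(query, n_candidates=2, keep=1):
    sims = doc_vecs @ embed(query)                        # stage 1: cheap and wide
    candidates = [int(i) for i in np.argsort(-sims)[:n_candidates]]
    reranked = sorted(candidates,                          # stage 2: one joint call each
                      key=lambda i: pair_score(query, docs[i]), reverse=True)
    return candidates, reranked[:keep]

candidates, final = search("reset password")
print("first stage :", candidates)
print("after rerank:", final)
```

In this toy run the first stage prefers document 0 because the word "password" dominates its vector, and the joint scorer promotes document 1, which actually contains the phrase.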
Query transformation attacks the same recall problem from the other end — by fixing the query before it ever hits the index. Three techniques earn their keep:
- Query rewriting / expansion. Have an LLM rephrase a terse or conversational query into a retrieval-friendly form, resolve pronouns from the conversation ("it" → "the refund policy"), and add synonyms. Cheap, and it fixes the most common failure: a user query that is not phrased like the documents.
- HyDE (Hypothetical Document Embeddings). Instead of embedding the question, ask the LLM to write a hypothetical answer and embed that. A fabricated answer, even if factually wrong, lives near real answers in embedding space — it shares their vocabulary and shape — so it retrieves better than the bare question, which is phrased nothing like the documents it should match.
- Multi-query. Generate several paraphrases of the question, retrieve for each, and union (or RRF-fuse) the results. This widens recall for ambiguous queries at the cost of more retrieval calls.
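Multi-query retrieval with rank fusion can be sketched in a few lines. The assumptions are stated up front: the paraphrases are hard-coded where a real pipeline would ask an LLM to generate them, `retrieve` is a hypothetical word-overlap retriever standing in for the real index, and the documents are invented.

```python
from collections import defaultdict

docs = {
    "d1": "refund policy for cancelled orders",
    "d2": "how to get your money back after a return",
    "d3": "shipping times for international orders",
}

def retrieve(query, k=2):
    """Stand-in retriever: rank documents by number of words shared with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.split())), doc_id) for doc_id, text in docs.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]

def rrf_fuse(ranked_lists, K=60):
    """Reciprocal Rank Fusion over lists of doc ids (best first)."""
    fused = defaultdict(float)
    for ranked in ranked_lists:
        for pos, doc_id in enumerate(ranked):
            fused[doc_id] += 1.0 / (K + pos + 1)   # rank is 1-based
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hard-coded paraphrases of "can I get a refund?"; an LLM would produce these.
paraphrases = ["refund policy", "get money back", "return cancelled orders"]
per_query = [retrieve(p) for p in paraphrases]
fused = rrf_fuse(per_query)

print("per-query results:", per_query)
print("fused order      :", fused)
```

The first paraphrase alone surfaces only `d1`; the second reaches `d2`, which shares no words with it. That is the recall gain, paid for with three retrieval calls instead of one.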
Context assembly and lost-in-the-middle
You now have a short, reranked list of passages. How you arrange them in the prompt is not cosmetic. Liu et al. (2023) documented a robust position effect: when relevant information is placed in the middle of a long context, models use it markedly less reliably than when the same information sits at the very start or the very end. Accuracy as a function of the gold passage's position is U-shaped, not flat — a model with a 100K window does not attend uniformly across it.
The engineering consequences are concrete:
- Order by relevance toward the edges. Put the most relevant reranked passages first and last, not buried in the middle. Some pipelines deliberately interleave to keep top passages at both ends.
- Fewer, better passages beat more, noisier ones. Padding the context with marginal chunks does not help and can hurt: it adds distractors (§7.8) and pushes the gold passage toward the lost-in-the-middle zone. This is the quantitative case for aggressive reranking — context budget spent on rank-20 noise is worse than wasted.
- Cite explicitly. Number the passages and instruct the model to ground each claim in a numbered source. This makes citations checkable and discourages the model from answering from parametric memory when the passages do not support a claim.
The instruction also matters: a prompt that says "answer only from the provided context; if the answer is not present, say so" measurably reduces ungrounded answers compared to a prompt that merely provides context. Grounding is partly retrieval and partly instruction.
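These assembly rules can be combined in a small helper. The sketch below is one possible arrangement, not a prescribed one: `order_for_edges` and `build_prompt` are hypothetical names, and the alternating scheme is just one simple way to keep the strongest passages at both ends of the context.

```python
def order_for_edges(passages):
    """Given passages sorted best-first, alternate them between the front and
    the back of the context so the weakest ones end up in the middle."""
    front, back = [], []
    for rank, passage in enumerate(passages):
        (front if rank % 2 == 0 else back).append(passage)
    return front + back[::-1]

def build_prompt(question, passages):
    """Number each passage and attach a context-only grounding instruction."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer only from the provided context. "
        "If the answer is not present, say so. "
        "Cite the passage number for each claim.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

ranked = ["P1 (best)", "P2", "P3", "P4", "P5 (weakest)"]
arranged = order_for_edges(ranked)
prompt = build_prompt("What is the refund window?", arranged)
print(prompt)
```

The best passage lands first, the second-best lands last, and the weakest sits in the middle, where a miss costs the least.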
Lost-in-the-middle is the retrieval-side mirror of the long-context argument in §7.9. A bigger window does not make position effects disappear; it enlarges the middle where attention is weakest. Retrieving fewer, well-ordered passages is often more accurate than dumping a hundred candidates into a long context and hoping the model finds the needle.
Evaluation and failure modes
RAG has two stages to evaluate, and conflating them is the most common measurement mistake. Evaluate the retriever and the generator separately, because a bad answer can come from either, and the fix differs entirely.
Retrieval metrics need a labelled set: queries with their known-relevant chunks. The two workhorses:
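Assuming the two metrics are recall@k and mean reciprocal rank (MRR), two common choices for a labelled retrieval set, a minimal sketch looks like this. The chunk ids and labels are invented for the example.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the known-relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant chunk, or 0 if none was retrieved."""
    for pos, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / pos
    return 0.0

# Each entry: (retriever output, best first; set of known-relevant chunk ids)
labelled = [
    (["c7", "c2", "c9"], {"c2"}),
    (["c1", "c4", "c5"], {"c5", "c8"}),
    (["c3", "c6", "c0"], {"c3"}),
]

mean_recall = sum(recall_at_k(r, rel, 3) for r, rel in labelled) / len(labelled)
mrr = sum(reciprocal_rank(r, rel) for r, rel in labelled) / len(labelled)
print(f"recall@3 = {mean_recall:.3f}   MRR = {mrr:.3f}")
```

Recall@k says whether the right chunk reached the model at all; MRR says how close to the top it landed, which matters once context order and budget come into play.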
Generation metrics ask whether the answer is supported by the retrieved context, which is the property RAG actually promises:
- Faithfulness / groundedness. Is every claim in the answer entailed by the retrieved passages? An unfaithful answer hallucinated beyond its sources even though the sources were present. Typically scored by an LLM judge (Vol IV Ch 06) decomposing the answer into claims and checking each against the context — with all the judge caveats from that chapter.
- Answer relevance. Does the answer actually address the question, regardless of grounding?
- Context relevance / precision. Were the retrieved passages actually needed, or did the retriever pad the context with noise?
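The claim-by-claim structure of a faithfulness score can be sketched as below. The judge here is a deliberately crude lexical stand-in, an assumption made so the example runs without a model. In practice `judge` would be an LLM entailment check, and the answer would first be decomposed into claims by a model rather than by hand.

```python
def lexical_judge(claim, passages):
    """Crude stand-in for an LLM judge: a claim counts as supported only if
    every one of its words appears in a single passage."""
    words = set(claim.lower().split())
    return any(words <= set(p.lower().split()) for p in passages)

def faithfulness(claims, passages, judge=lexical_judge):
    """Fraction of the answer's claims that the judge finds supported."""
    if not claims:
        raise ValueError("need at least one claim to score")
    return sum(judge(c, passages) for c in claims) / len(claims)

# Invented passages and hand-decomposed claims.
passages = [
    "refunds are issued within 14 days of purchase",
    "store credit never expires",
]
claims = [
    "refunds are issued within 14 days",   # supported by passage 1
    "store credit never expires",          # supported by passage 2
    "refunds require a receipt",           # not in any passage
]
score = faithfulness(claims, passages)
print(f"faithfulness = {score:.2f}")
```

Passing the judge in as a parameter keeps the scoring loop unchanged when the lexical stand-in is swapped for a real model call.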
The diagnostic table that turns a vague "RAG is wrong" into an owned bug:
| Failure mode | Symptom | Where it lives | First fix |
|---|---|---|---|
| Stale index | confidently cites outdated facts | indexing | Re-index on source change; track corpus version in the trace |
| Chunk boundary split | retrieves half a fact; answer truncated mid-reasoning | chunking (§7.3) | Increase overlap; structure-aware splitter; parent-document retrieval |
| Recall miss | right passage exists but never reaches the model | retrieval (§7.4–7.5) | Add hybrid (BM25 + dense); raise efSearch/nprobe; query rewrite / HyDE |
| Distractors | a near-miss passage misleads the answer | ranking (§7.6) | Add a cross-encoder reranker; shrink k; tighten context-only prompt |
| Lost in the middle | relevant passage retrieved but ignored | assembly (§7.7) | Reorder top passages to the edges; fewer, better chunks |
| Ungrounded answer | good context, answer drifts beyond it | generation | Stronger grounding instruction; faithfulness eval gate; citation check |
Agentic RAG, and the honest decision
Everything above describes naive RAG: one retrieval, one generation, fixed pipeline. It is the right default and solves most cases. It fails on questions that need more than one hop — "compare our 2024 and 2025 refund policies" needs two distinct retrievals — or where the first retrieval reveals what the second query should be.
Agentic / iterative RAG hands retrieval to the agent loop of Ch 01. Retrieval becomes a tool the model calls (Ch 03): the agent decides whether to retrieve, formulates its own query, reads the results, and decides whether to retrieve again, refine, or answer. This subsumes the techniques above — query rewriting becomes the agent choosing a better query; multi-query becomes the agent issuing several searches; reranking becomes the agent reading and discarding. It is strictly more capable and strictly more expensive (every hop is a model call plus a retrieval), and it inherits every agent failure mode from Ch 06: loops, premature termination, hallucinated tool args. Reach for it when single-shot recall is provably insufficient, not by default.
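The loop structure can be sketched with a scripted policy in place of the model. All names below are hypothetical, the corpus is invented, and `scripted_policy` only imitates what an LLM decision step might do for the two-hop refund comparison. The part worth noticing is the hard hop budget, which guards against the looping failure mode.

```python
corpus = {   # invented toy corpus, keyed by exact query string
    "refund policy 2024": "2024: refunds within 30 days.",
    "refund policy 2025": "2025: refunds within 14 days.",
}

def search_tool(query):
    """Retrieval exposed as a tool; exact-key lookup stands in for a real retriever."""
    return corpus.get(query)

def scripted_policy(question, notes):
    """Stand-in for the model's decision step: request whichever year is still
    missing from the notes, then answer once both are present."""
    for year in ("2024", "2025"):
        if not any(note.startswith(year) for note in notes):
            return ("search", f"refund policy {year}")
    return ("answer", " | ".join(notes))

def agentic_rag(question, policy, max_hops=4):
    notes, trace = [], []
    for _ in range(max_hops):            # hard cap guards against endless loops
        action, arg = policy(question, notes)
        trace.append((action, arg))
        if action == "answer":
            return arg, trace
        result = search_tool(arg)
        if result is not None:
            notes.append(result)
    return None, trace                   # budget exhausted without an answer

answer, trace = agentic_rag("compare our 2024 and 2025 refund policies", scripted_policy)
print(answer)
print([action for action, _ in trace])
```

Each entry in `trace` corresponds to one decision step, which is where the extra cost comes from: two searches and one answer are three model calls where single-shot RAG makes one.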
Finally, the question that should precede all of this engineering. RAG is one of three ways to give a model knowledge it did not have, and they are not interchangeable:
| | RAG | Long context | Fine-tuning |
|---|---|---|---|
| Best for | large, changing, citable knowledge bases | a few documents, used once, per request | teaching style, format, or a skill |
| Freshness | update by writing a row | whatever you paste in | frozen at training time |
| Citations | native — points to chunks | possible but unindexed | none |
| Per-query cost | low (small context) | high (whole corpus in context) | low |
| Upfront cost | build & maintain a pipeline | none | a training run + data curation |
| Main failure | recall misses | lost-in-the-middle; price | catastrophic forgetting; staleness |
The honest reading: they are complementary, not competing. Fine-tuning teaches the model how to behave (a domain's tone, an output schema, a reasoning style); RAG supplies what is true right now; long context handles the one-off document you will never query again. A common production stack fine-tunes for format and behaviour, then uses RAG for facts. "Just use a long context window" is tempting as windows grow, but it pays the full context price on every query, offers only unindexed citations, and walks straight into lost-in-the-middle at the scales where it would matter. Use the cheapest mechanism that meets the requirement, and reach for RAG precisely when the knowledge is large, private, changing, or must be cited.
That closes Volume IV. Across seven chapters you went from the bare agentic loop to context engineering, tools and MCP, harness and loop design, evaluation and cost, and now retrieval, the discipline that grounds an agent in knowledge it was never trained on. The four volumes together cover the stack, from foundations through prompting to agent engineering. Return to the full contents to revisit a thread, or carry these patterns into the Gym and put them under load.
References
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering.
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey.
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.