RAG Architecture Patterns: Retrieval & Context Windows

RAG is the most over-discussed and under-engineered pattern in production AI. Most teams implement it the same way: chunk documents, embed them, dump the top-k into the prompt, ship. That implementation works for demos and breaks for users. The breakage isn't dramatic — the model produces confident, plausible answers grounded in the wrong chunks, and nobody notices for weeks because the failure mode is silent. This essay covers what production-grade RAG actually looks like, where it diverges from the demo version, and the three governance constraints that determine whether the system is one you can ship to customers or one you can only show to investors.

What RAG is for, and what it isn't

RAG exists to do exactly two things: ground responses in a defined corpus, and produce a citation trail. If your system needs neither, you don't need RAG. Long-context models with the documents pasted into the prompt will outperform a poorly engineered RAG pipeline on most tasks. The reason to build RAG anyway is one of three: the corpus is too large to fit in context, the corpus changes faster than retraining can keep up, or the citations are themselves part of the product (compliance, audit, attribution).

Building RAG without one of those three motivations is a footgun. The latency overhead, the embedding storage, the index maintenance, the retrieval failures — all of it has to be paid for, and "we should use RAG because everyone uses RAG" doesn't pay the bill.

The three tradeoff axes

Every RAG architecture is a point on three intersecting axes: latency, relevance, and governance. You cannot optimize all three. Production deployments live or die by which one you sacrifice — and which one you sacrifice should be a deliberate choice, not an accident of implementation order.

Latency

Every RAG hop adds time. Embedding the query, searching the index, fetching the chunks, re-ranking, and formatting the context can easily take 200–800ms before the model has even started. For interactive products, that overhead is a large share of the latency budget. For batch products it doesn't matter. Knowing which one you are is the first design decision.
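
A budget is only useful if you can see where it goes. Here is a minimal sketch of per-stage timing, assuming nothing beyond the Python standard library. The stage names and the `time.sleep` calls are placeholders for real pipeline steps, not measurements from any system.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed milliseconds

@contextmanager
def stage(name):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

# Placeholder work standing in for real retrieval steps.
with stage("embed_query"):
    time.sleep(0.01)
with stage("search_index"):
    time.sleep(0.02)

retrieval_ms = sum(timings.values())
print(timings, retrieval_ms)
```

Logging these numbers per request makes it obvious which stage to cut when the interactive budget is blown.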

Relevance

The right chunks have to come back. Naive top-k cosine similarity works on benchmarks and degrades on real corpora because real corpora are heterogeneous: different document types, different chunk granularities, mixed languages, structured and unstructured content side by side. Improvements to relevance — re-ranking, hybrid retrieval, query rewriting, recursive expansion — all cost latency.

Governance

The output has to be traceable to its sources, and the sources have to be authorized for the user asking. This is where most RAG systems quietly violate enterprise constraints: the embedding store doesn't know about ACLs, so the retrieval returns chunks the user shouldn't see. The model dutifully synthesizes them into a clean answer. The leak is silent. The audit finding is loud.

Five patterns and when to use them

Pattern 1: Naive RAG (chunk → embed → top-k → prompt)

The starting point. Fastest to build, lowest quality at scale. Use it for: prototypes, internal tools where latency matters more than nuance, demos. Stop using it for: anything customer-facing, anything compliance-touching, anything where the corpus is heterogeneous.
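
For reference, the whole retrieval step of the naive pattern fits in a few lines. This is a toy sketch, assuming embeddings already exist as plain lists of floats. `cosine`, `top_k`, and the three-entry index are illustrative names and data, not any vector store's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the ids of the k chunks most similar to the query.

    `index` maps chunk id -> embedding vector (a stand-in for a vector store).
    """
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Made-up 3-dimensional embeddings for illustration only.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-auth": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], index, k=2))
```

Nothing here knows about document type, permissions, or exact-match terms, which is why it stops being enough.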

Pattern 2: Hybrid retrieval (dense + sparse)

Combine vector similarity with traditional keyword search (BM25 or equivalent). Hybrid retrieval handles the failure mode where dense embeddings miss exact-match queries — a part number, a function name, a legal clause reference. Recall improves materially. Latency cost: minimal. This should be the default for any RAG system above prototype stage.
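
The two result lists still have to be merged. One common way to do it is reciprocal rank fusion (RRF), sketched below. The essay does not prescribe a fusion method, so treat RRF as one reasonable choice; the chunk ids and both input rankings are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense = ["c7", "c2", "c9"]   # hypothetical vector-search ranking
sparse = ["c4", "c7", "c2"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

RRF only looks at ranks, so it sidesteps the problem that cosine scores and BM25 scores live on different scales.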

Pattern 3: Re-ranked retrieval

Retrieve more candidates than you'll use (top 50), then re-rank them with a cross-encoder model that scores query-document pairs directly. Pick the best 5–10. Re-ranking dramatically improves precision; the latency cost is real but bounded (typically 100–300ms for the re-rank). The right pattern for systems where quality matters more than the last 200ms of latency.
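
The control flow is simple enough to sketch. The version below takes the retriever and the pair scorer as plain functions, so it assumes nothing about which cross-encoder you use. `toy_candidates` and `toy_pair_score` are crude stand-ins (word overlap, not a model) that exist only to make the example run.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    n_candidates: int = 50,
    n_keep: int = 5,
) -> List[Tuple[str, float]]:
    """Over-retrieve, score each (query, passage) pair, keep the best n_keep."""
    candidates = retrieve(query, n_candidates)
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_keep]

def toy_candidates(query, n):
    # Stand-in for first-stage retrieval.
    corpus = [
        "Resetting a password requires email verification.",
        "Password rotation policy for service accounts.",
        "Office parking permits are renewed yearly.",
    ]
    return corpus[:n]

def toy_pair_score(query, text):
    # Stand-in for a cross-encoder: fraction of query words found in the text.
    q = set(query.lower().split())
    t = set(text.lower().replace(".", "").split())
    return len(q & t) / len(q)

best = retrieve_then_rerank("password rotation policy", toy_candidates, toy_pair_score, n_keep=2)
print(best)
```

The re-rank cost scales with `n_candidates`, so that parameter is the knob to turn when the latency budget tightens.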

Pattern 4: Query rewriting / decomposition

Before retrieval, a small model rewrites the user's query into one or more retrieval-optimized queries, decomposing complex questions into sub-questions. The retrieval runs on the rewrites, not the raw query. This handles the failure mode where users ask compound questions that no single retrieval can answer. Latency cost: another model call. Quality improvement on complex queries: substantial.
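
A minimal sketch of the fan-out and merge, assuming the rewriter and the retriever are passed in as functions. `toy_rewrite` splits on the word "and" as a stand-in for the small model, and `TOY_INDEX` is invented lookup data.

```python
from typing import Callable, List

def retrieve_with_rewrites(
    query: str,
    rewrite: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for each rewritten sub-query; merge in order, dropping duplicates."""
    seen = set()
    merged = []
    for sub_query in rewrite(query):
        for chunk_id in retrieve(sub_query):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

def toy_rewrite(query):
    # Stand-in for the rewriting model: split a compound question in two.
    return [part.strip(" ?") for part in query.split(" and ")]

TOY_INDEX = {
    "what is the refund window": ["c1", "c2"],
    "who approves exceptions": ["c2", "c5"],
}

def toy_lookup(sub_query):
    # Stand-in for retrieval: exact-match lookup on the lowercased sub-query.
    return TOY_INDEX.get(sub_query.lower(), [])

chunks = retrieve_with_rewrites(
    "What is the refund window and who approves exceptions?", toy_rewrite, toy_lookup
)
print(chunks)
```

In a real system the sub-queries can run in parallel, which keeps the added latency close to one extra model call plus one retrieval.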

Pattern 5: Recursive / multi-hop retrieval

For questions that require synthesis across multiple documents, the system retrieves, generates a partial answer, identifies gaps, retrieves again to fill the gaps, and synthesizes. This is the heaviest pattern and the one most often deployed prematurely. Use it only when the task genuinely requires multi-step reasoning over the corpus and the latency budget can absorb 2–5 retrieval rounds.
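
The loop is worth seeing with its round cap in place, because the cap is what protects the latency budget. This is a sketch under stated assumptions: `find_gaps` stands in for the model's gap analysis, and `FACTS`, `hop_retrieve`, and `hop_gaps` are invented toy data and helpers.

```python
from typing import Callable, List

def multi_hop(
    question: str,
    retrieve: Callable[[str], List[str]],
    find_gaps: Callable[[str, List[str]], List[str]],
    max_rounds: int = 3,
) -> List[str]:
    """Retrieve, ask what is still missing, retrieve again.

    Stops when no gaps remain or the round budget is spent.
    """
    context: List[str] = []
    pending = [question]
    for _ in range(max_rounds):
        if not pending:
            break
        for q in pending:
            for chunk in retrieve(q):
                if chunk not in context:
                    context.append(chunk)
        pending = find_gaps(question, context)
    return context

# Invented facts for illustration only.
FACTS = {
    "who runs the company that acquired acme": ["Acme was acquired by Globex."],
    "who is the ceo of globex": ["The CEO of Globex is Dana Reyes."],
}

def hop_retrieve(q):
    return FACTS.get(q.lower(), [])

def hop_gaps(question, context):
    # Stand-in for the model deciding what follow-up query is needed.
    joined = " ".join(context)
    if "Globex" in joined and "CEO" not in joined:
        return ["who is the ceo of globex"]
    return []

ctx = multi_hop("Who runs the company that acquired Acme", hop_retrieve, hop_gaps)
print(ctx)
```

With `max_rounds=1` the same call returns only the first hop, which is a quick way to measure what the extra rounds actually buy.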

Chunking is half the system

The single design decision that has the largest effect on RAG quality is chunking strategy, and it is the decision that gets the least attention because it isn't glamorous. Bad chunks doom retrieval no matter how good the embedding model is.

Three principles I've found load-bearing:

  1. Chunk along semantic boundaries, not character counts. Sections, paragraphs, list items, code blocks. Splitting mid-sentence to hit a 500-token target produces unretrievable chunks.
  2. Preserve metadata at the chunk level. Every chunk should carry its source document, section heading, ACL identifier, and timestamp. The model never sees this metadata, but the retrieval pipeline does — and the citation system depends on it.
  3. Index multiple representations of the same content. A chunk plus a summary of that chunk plus a list of questions the chunk answers. Different query types match different representations. Storage is cheap; retrieval failures are not.
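
The first two principles can be sketched together. The example below assumes a simple input convention that is mine, not a standard: paragraphs are separated by blank lines, and a block starting with "# " is a section heading. The `Chunk` fields mirror the metadata listed above; the document text and ACL value are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    source: str      # source document
    section: str     # nearest preceding heading
    acl: str         # ACL identifier inherited from the document
    timestamp: str

def chunk_by_paragraph(doc_text: str, source: str, acl: str, timestamp: str) -> List[Chunk]:
    """Split on blank lines, never mid-sentence; attach metadata to every chunk."""
    chunks = []
    section = ""
    for block in doc_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):
            section = block[2:].strip()  # remember the heading, do not index it alone
            continue
        chunks.append(Chunk(block, source, section, acl, timestamp))
    return chunks

doc = (
    "# Refunds\n\nRefunds are issued within 14 days.\n\n"
    "Exceptions need manager approval.\n\n"
    "# Shipping\n\nOrders ship in 2 business days."
)
chunks = chunk_by_paragraph(doc, source="policy.md", acl="support-team", timestamp="2024-01-01")
for c in chunks:
    print(c.section, "|", c.text)
```

Real documents need a format-aware splitter (HTML, PDF, code), but the shape of the output, text plus metadata on every chunk, stays the same.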

The governance gap

The fastest way to ship a compliance incident is to build RAG without ACL propagation. The pattern: documents go into the corpus with mixed permission levels. The embedding store doesn't model permissions. Retrieval returns chunks regardless of who's asking. The user gets answers they shouldn't have access to.

The fix is non-trivial because the embedding index doesn't naturally express "this chunk is visible to user X but not user Y." Options that work in production:

  • Separate indexes per permission tier. Cleanest but expensive at scale; works when the tier count is small.
  • Metadata-filtered retrieval. Index the ACL on each chunk, filter at retrieval time. Requires a vector store that supports efficient filtered search; not all do.
  • Post-retrieval enforcement. Retrieve broadly, drop unauthorized chunks before passing to the model. Simplest to bolt on; introduces a class of bugs where the filter removes the chunks that would have answered the question, and the model answers anyway from whatever is left, with no valid citation.
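
To make the metadata-filtered option concrete, here is a toy in-memory sketch. The point it shows is ordering: the ACL filter runs before the top-k cut, so the user still gets k results they are allowed to see. `INDEX`, the group names, and the precomputed `score` values (pretend similarity scores for a single query) are all invented; a real vector store would apply the filter inside its own search call.

```python
from typing import Dict, List, Set

INDEX: List[Dict] = [
    {"id": "c1", "acl": frozenset({"hr"}), "score": 0.92},
    {"id": "c2", "acl": frozenset({"all-staff"}), "score": 0.81},
    {"id": "c3", "acl": frozenset({"hr", "legal"}), "score": 0.77},
    {"id": "c4", "acl": frozenset({"all-staff"}), "score": 0.40},
]

def acl_filtered_top_k(index: List[Dict], user_groups: Set[str], k: int = 2) -> List[str]:
    """Drop chunks the user cannot see, then rank and cut to k."""
    visible = [c for c in index if c["acl"] & user_groups]
    visible.sort(key=lambda c: c["score"], reverse=True)
    return [c["id"] for c in visible[:k]]

print(acl_filtered_top_k(INDEX, {"all-staff"}))  # a regular employee
print(acl_filtered_top_k(INDEX, {"hr"}))         # an HR user
```

Cutting to k first and filtering second is the post-retrieval variant, and for the regular employee it would return one chunk instead of two.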

Whichever option you pick, write it down explicitly. Treat ACL propagation as a first-class component, not as something to figure out later. Later is when the audit happens.

Citation, not summary

Customers don't trust ungrounded model output, and they shouldn't. The distinction that matters in production is not whether the model "uses" the retrieved context — it almost always does to some degree. The distinction is whether the output preserves a verifiable citation trail to specific source passages, such that a user (or an auditor) can click through and verify.

The technical pattern is straightforward and underused: have the model produce structured output that includes inline citation tokens referencing chunk identifiers. Validate post-generation that every claim in the output traces to at least one cited chunk. Reject outputs that contain claims with no citation. Yes, this constrains what the model can say. That is the point.
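
A rough sketch of the validation step, with two assumptions of mine: citations look like `[c12]`, and they appear before the sentence's closing punctuation. It treats each sentence as one claim, which is a crude proxy, and it checks only that a citation exists and points at a retrieved chunk, not that the chunk actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(c\d+)\]")  # assumed token format, e.g. [c12]

def validate_citations(answer: str, retrieved_ids: set) -> list:
    """Return a list of problems; an empty list means the answer passes."""
    problems = []
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            problems.append(f"uncited: {sentence}")
        for cid in cited:
            if cid not in retrieved_ids:
                problems.append(f"unknown chunk {cid}: {sentence}")
    return problems

answer = "Refunds are issued within 14 days [c1]. Exceptions are never granted."
print(validate_citations(answer, {"c1", "c2"}))
```

An answer that fails this check gets rejected or regenerated. Checking that the cited chunk supports the sentence is a separate, harder step.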

Evaluation that survives drift

Most RAG evaluations are point-in-time: build a labeled set, score the system, declare victory. This breaks within weeks because the corpus drifts, the embedding model gets updated, the user query distribution shifts, and the evaluation set becomes unrepresentative.

What works in practice: a small synthetic eval that runs on every change (catches regressions fast), a stratified production sample reviewed weekly (catches distribution shift), and an explicit "unanswerable" class in the eval (catches the model confidently inventing answers when retrieval failed). Without the third one, your eval will tell you the system is performing better than it is — because confident wrong answers and confident right answers look the same on accuracy metrics until you look at the citations.
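
Scoring the "unanswerable" class separately is a small change. The sketch below assumes each eval result has been labeled with three booleans; the field names and the sample results are hypothetical.

```python
def score_eval(results):
    """Score answerable and unanswerable questions separately.

    Each result is a dict with:
      answerable: the corpus contains the answer
      abstained:  the system declined to answer
      correct:    the system's answer matched the label
    """
    answerable = [r for r in results if r["answerable"]]
    unanswerable = [r for r in results if not r["answerable"]]

    def rate(items, key):
        return sum(1 for r in items if r[key]) / len(items) if items else None

    return {
        "accuracy_on_answerable": rate(answerable, "correct"),
        "abstention_on_unanswerable": rate(unanswerable, "abstained"),
    }

# Invented results: 4 answerable questions, 2 unanswerable ones.
results = [
    {"answerable": True, "abstained": False, "correct": True},
    {"answerable": True, "abstained": False, "correct": True},
    {"answerable": True, "abstained": False, "correct": True},
    {"answerable": True, "abstained": False, "correct": False},
    {"answerable": False, "abstained": True, "correct": False},
    {"answerable": False, "abstained": False, "correct": False},
]
print(score_eval(results))
```

A system can post strong accuracy on the first number while failing half the time on the second, and only the second tells you how often it invents an answer when retrieval came back empty.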

Closing

RAG is not a product feature; it is an architectural commitment. The teams getting good results from it have decided which axis to sacrifice, are deliberate about chunking, treat governance as a first-class component, and run evaluations that catch silent drift. The teams getting mediocre results are running naive top-k against a hopeful chunker and wondering why the answers feel off.

None of the patterns above are exotic. They are the difference between a demo that compiles and a system that survives the customers using it.