Kensink Labs
RAG architectures · Direct LLM · no framework · Production grade
RAG · ARCHITECTURES

RAG architectures. Sketched, benchmarked, ranked by when each ships.

Each pattern, full-width, with its architecture sketch, the stack we'd deploy, the accuracy work that gates it, and an honest read on when it earns the build. Five are the patterns we reach for in most production conversations. The rest earn the build in narrower shapes of corpus or query.

pgvector · Cohere · OpenAI · Anthropic · Eval pipelines
Treatment: Sketch · stack · deploy · evals
Default: Advanced RAG (hybrid + rerank)
Cycle: Sprint or program · sized to corpus
Discipline: Citations + eval gates
[EVERY PATTERN, FULL TREATMENT]

Each one sketched, ranked, with a playbook.

Primary patterns are colour-coded by brand gradient. Specialised patterns sit on neutral cards. Every card links to the full playbook: deployment, stack, accuracy work, community verdict, and our take.

01 · Specialised · BASELINE

Naive RAG

Embed the query, fetch top-K, stuff them in the prompt. No rewriting, no fusion, no reranking.

When it earns the build

Single-domain FAQ, internal docs search, narrow-scope chatbots where queries are short and the corpus is well-keyed. The fastest thing that can possibly work.

When it doesn't

Anything with ambiguity, jargon mismatch, multi-hop reasoning, or a corpus larger than ~50k chunks. Recall falls off and the LLM hallucinates the rest.
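
A minimal sketch of the embed, fetch top-K, stuff-the-prompt flow described in this card. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `CHUNKS`, `top_k`, and `build_prompt` are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the best k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    # No rewriting, fusion, or reranking: the top-K chunks go straight into the prompt.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(top_k(query, chunks, k), start=1))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

# Hypothetical corpus for illustration.
CHUNKS = [
    "Returns are accepted within 30 days of delivery.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded.",
]
```

The numbered `[i]` markers in the prompt are one simple way to give the model something to cite.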

02 · Specialised · CONVERSATIONAL

Simple RAG with memory

Naive RAG plus a conversation buffer. The model can resolve pronouns and follow-up references because it sees prior turns.

When it earns the build

Customer support chat, tutoring bots, personal assistants where the second question depends on the first. Cheaper than rebuilding context every turn.

When it doesn't

When privacy or retention requirements rule out persistent state, or when the conversation has no useful continuity (one-shot search).
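
One way the conversation buffer could look, assuming a fixed window of recent turns is enough. `ConversationBuffer` and `build_chat_prompt` are hypothetical names; retrieval itself is left out and passed in as a list of strings.

```python
from collections import deque

class ConversationBuffer:
    def __init__(self, max_turns: int = 4) -> None:
        # A bounded deque drops the oldest turn automatically once full.
        self.turns: deque = deque(maxlen=max_turns)

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def render(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

def build_chat_prompt(buffer: ConversationBuffer, retrieved: list[str], question: str) -> str:
    # Prior turns come first so the model can resolve "it", "that one", and similar references.
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        f"Conversation so far:\n{buffer.render()}\n\n"
        f"Context:\n{context}\n\n"
        f"User: {question}"
    )
```

Capping the window keeps prompt size predictable; a stricter retention policy could shrink `max_turns` or skip persistence entirely.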

03 · Specialised · ARCHITECTURE STYLE

Modular RAG

Compose the pipeline from swappable parts (retriever, reranker, generator). Not a pattern itself; an engineering posture.

When it earns the build

Every production build. Module boundaries let you swap a reranker or change the embedding model without rewriting the whole thing. The 2026 default engineering posture.

When it doesn't

Prototypes where shipping in two days matters more than swapping a component in two months.
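
A sketch of what swappable parts can mean in code, assuming three narrow interfaces. All class names here are hypothetical, and the stub components exist only to show that a reranker can be swapped without touching the rest.

```python
from dataclasses import dataclass
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[str]) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, docs: list[str]) -> str: ...

@dataclass
class Pipeline:
    retriever: Retriever
    reranker: Reranker
    generator: Generator

    def answer(self, query: str, k: int = 10, keep: int = 3) -> str:
        candidates = self.retriever.retrieve(query, k)
        best = self.reranker.rerank(query, candidates)[:keep]
        return self.generator.generate(query, best)

# Stub components, for illustration only.
class KeywordRetriever:
    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
    def retrieve(self, query: str, k: int) -> list[str]:
        terms = set(query.lower().split())
        return [d for d in self.docs if terms & set(d.lower().split())][:k]

class IdentityReranker:
    def rerank(self, query: str, docs: list[str]) -> list[str]:
        return list(docs)

class ShortestFirstReranker:
    def rerank(self, query: str, docs: list[str]) -> list[str]:
        return sorted(docs, key=len)

class EchoGenerator:
    def generate(self, query: str, docs: list[str]) -> str:
        return docs[0] if docs else "no context"
```

Swapping `IdentityReranker` for `ShortestFirstReranker` changes the answer without any change to `Pipeline`, which is the property the module boundaries are there to protect.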

04 · Primary · OUR DEFAULT

Advanced RAG

Query rewriting + hybrid (vector + BM25) + reciprocal-rank fusion + cross-encoder rerank + citation discipline. The 2026 production consensus.

When it earns the build

Most production RAG. Reranking alone lifts Recall@5 by ~17 points; hybrid catches the exact-term matches embeddings miss. Worth the extra ~30ms p95 on almost every build.

When it doesn't

Throughput-extreme workloads where you cannot pay the rerank latency tax, or corpora so well-keyed that hybrid + rerank doesn't move the needle.
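
The fusion step can be sketched in a few lines. This is reciprocal-rank fusion over two ranked lists of document ids, assuming the commonly used constant k = 60; the ids and the two hit lists are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) for every document it contains.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the vector index and from BM25.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["d1", "d3", "d9", "d7"]
```

Documents that appear in both lists (`d1`, `d3`) rise to the top, and only ranks are used, so the vector and BM25 scores never need to be put on a common scale. The fused list is what a cross-encoder would then rerank.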

05 · Primary · TOOL-USING

Agentic RAG

LLM decides what to retrieve, from which source, and whether the result is good enough. Multiple retrieval rounds, source-specific agents, validation step.

When it earns the build

Heterogeneous corpora, multi-source research (legal + financial + internal), complex queries that need decomposition. Where one shot of retrieval was always wrong.

When it doesn't

Cost-sensitive volume traffic. Each query costs multiple LLM calls. Latency is unpredictable.
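
A rough sketch of the control loop, with the LLM decisions replaced by plain callables. `pick_source` and `good_enough` stand in for model calls, and the source names are hypothetical; a real build would put prompts and tool schemas behind them.

```python
from typing import Callable, Optional

Source = Callable[[str], list[str]]

def agentic_retrieve(
    question: str,
    sources: dict[str, Source],
    pick_source: Callable[[str, list[str], set[str]], Optional[str]],
    good_enough: Callable[[str, list[str]], bool],
    max_rounds: int = 3,
) -> tuple[list[str], list[str]]:
    evidence: list[str] = []
    trace: list[str] = []   # which sources were consulted, in order
    tried: set[str] = set()
    for _ in range(max_rounds):
        name = pick_source(question, evidence, tried)  # "which source next?"
        if name is None:
            break
        tried.add(name)
        trace.append(name)
        evidence.extend(sources[name](question))
        if good_enough(question, evidence):            # validation step
            break
    return evidence, trace

# Stand-ins for the LLM decisions and the source-specific agents.
SOURCES: dict[str, Source] = {
    "internal": lambda q: [f"internal memo on {q}"],
    "legal": lambda q: [f"statute on {q}"],
}

def pick_source(question: str, evidence: list[str], tried: set[str]) -> Optional[str]:
    remaining = [s for s in ("internal", "legal") if s not in tried]
    return remaining[0] if remaining else None

def good_enough(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2
```

The `max_rounds` cap is what bounds cost and latency here; every extra round is at least one more model call.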

06 · Primary · MULTI-HOP

Graph RAG

Build a knowledge graph over entities + relationships. Retrieval becomes graph traversal. Reasoning becomes path-following.

When it earns the build

Multi-hop reasoning across linked facts: clinical decision support, regulatory analysis, complex case files, investigative journalism. Published numbers report 81%+ accuracy in specialised domains, +6.8 pts over flat RAG.

When it doesn't

Flat document collections, fast-changing data, small corpora where the graph build cost dwarfs the retrieval gain.
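
To make "retrieval becomes graph traversal" concrete, here is a small breadth-first search over an adjacency map of (relation, entity) edges. The entities and relations are invented for illustration, and a production graph would live in a graph store rather than a dict.

```python
from collections import deque
from typing import Optional

Triple = tuple[str, str, str]  # (subject, relation, object)

# Hypothetical knowledge graph: entity -> outgoing (relation, entity) edges.
GRAPH: dict[str, list[tuple[str, str]]] = {
    "DrugA": [("inhibits", "EnzymeX")],
    "EnzymeX": [("metabolises", "DrugB")],
    "DrugB": [("treats", "ConditionY")],
}

def find_path(graph: dict, start: str, goal: str, max_hops: int = 3) -> Optional[list[Triple]]:
    # Breadth-first search returns a shortest chain of facts linking start to goal.
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None
```

The returned list of triples is the path the generator would then reason over, for example "DrugA inhibits EnzymeX, which metabolises DrugB".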

07 · Primary · SELF-CRITIQUE

Self-RAG

Model evaluates its own retrieval and answer. Decides if it needs to retrieve again, with what query, before responding.

When it earns the build

Vague or under-specified queries, domains where a confidently wrong answer is more expensive than a slow one. Catches bad retrieval before the user sees it.

When it doesn't

Latency-critical interactive use. Self-critique adds round-trips. Can also refuse to answer too often when uncertainty is high.
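
A sketch of the retrieve, answer, critique loop as plain control flow. It covers only the loop described in this card, with the critique supplied as a callable; `Critique`, `self_rag`, and the refusal message are hypothetical, and the stubs below replace what would be model calls.

```python
from typing import Callable, NamedTuple, Optional

class Critique(NamedTuple):
    supported: bool            # is the answer backed by the retrieved docs?
    next_query: Optional[str]  # a better query to try, or None to give up

def self_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    critique: Callable[[str, list[str], str], Critique],
    max_rounds: int = 3,
) -> tuple[str, int]:
    query, rounds = question, 0
    while rounds < max_rounds:
        rounds += 1
        docs = retrieve(query)
        answer = generate(question, docs)
        verdict = critique(question, docs, answer)
        if verdict.supported:
            return answer, rounds
        if verdict.next_query is None:
            break
        query = verdict.next_query  # retrieve again with the revised query
    return "Not enough support to answer.", rounds

# Stubs standing in for the retriever and the model.
KB = {"paid time off policy": ["Employees accrue 1.5 days of leave per month."]}
retrieve = lambda q: KB.get(q, [])
generate = lambda question, docs: docs[0] if docs else "unknown"
critique = lambda question, docs, answer: (
    Critique(True, None) if docs else Critique(False, "paid time off policy")
)
```

The refusal path is explicit in the return value, which makes the "refuses too often" failure mode easy to count in evals.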

08 · Specialised · PARALLEL EXPLORATION

Branched RAG

Run multiple interpretations of the query in parallel, score each, pick or merge the best answer.

When it earns the build

Open-ended research queries, comparative analysis (this product vs that), domains where the question has multiple legitimate framings.

When it doesn't

Cost-sensitive workloads, since branching multiplies LLM + retrieval calls. Can overwhelm the user if results aren't filtered.
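
A small sketch of the fan-out, assuming each interpretation can be answered independently. `answer_one` and `score` stand in for a full retrieve-and-generate call and an answer scorer; the canned answers and the citation-counting score are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def branched_answer(
    interpretations: list[str],
    answer_one: Callable[[str], str],
    score: Callable[[str], float],
    max_workers: int = 4,
) -> tuple[str, str]:
    if not interpretations:
        raise ValueError("need at least one interpretation")
    # Run every interpretation in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(answer_one, interpretations))
    # Pick the (interpretation, answer) pair whose answer scores highest.
    return max(zip(interpretations, answers), key=lambda pair: score(pair[1]))

# Canned answers standing in for per-branch RAG calls.
CANNED = {
    "compare on price": "Product A is cheaper.",
    "compare on features": "Product B adds SSO [1] and audit logs [2].",
}
answer_one = CANNED.__getitem__
score = lambda answer: answer.count("[")  # toy score: number of citation markers
```

This picks a single winner; merging branches would replace the final `max` with a synthesis step, at the cost of one more model call.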

09 · Specialised · ROUTING

Adaptive RAG

Classify the query (simple / complex / broad / narrow) and route to a matched retrieval strategy. Simple queries get fast pipelines; complex ones get the agentic shape.

When it earns the build

Mixed-shape traffic: public-facing assistants, support bots, internal tools that see everything from "what's our return policy?" to "compare these three contracts".

When it doesn't

Single-shape workloads where every query is the same depth. Adds classifier latency without earning it back.
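
The routing idea in miniature, assuming a keyword-and-length heuristic as the classifier. A production router would more likely be a trained classifier or a small model call; the labels, thresholds, and pipeline stubs below are all hypothetical.

```python
from typing import Callable

def classify(query: str) -> str:
    # Hypothetical heuristic: comparison words signal a complex query,
    # short queries are simple, everything else is treated as broad.
    q = f" {query.lower()} "
    if any(marker in q for marker in (" compare ", " versus ", " vs ")):
        return "complex"
    if len(q.split()) <= 8:
        return "simple"
    return "broad"

# Stub pipelines; each would wrap a real retrieval strategy.
ROUTES: dict[str, Callable[[str], str]] = {
    "simple": lambda q: f"fast pipeline: {q}",
    "complex": lambda q: f"agentic pipeline: {q}",
    "broad": lambda q: f"hybrid pipeline: {q}",
}

def route(query: str) -> tuple[str, str]:
    label = classify(query)
    return label, ROUTES[label](query)
```

With this heuristic, "what's our return policy?" takes the fast path and "compare these three contracts" goes to the agentic one. Logging the label alongside the answer makes it possible to measure whether the classifier is earning its latency.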

10 · Specialised · PRE-FETCH

Speculative RAG

Predict the next likely query while answering the current one. Pre-fetch retrieval for the predicted follow-up.

When it earns the build

Latency-critical interactive use where the conversation has predictable shape. Autocomplete-style search, support flows with well-known follow-ups.

When it doesn't

Open-domain or one-shot use. Wrong predictions waste compute and (worse) prime the cache with irrelevant context.
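
A sketch of the pre-fetch cache, written synchronously for clarity; in a real system the prefetch would presumably run in the background while the current answer streams. `PrefetchingRetriever`, `predict_next`, and the follow-up table are hypothetical.

```python
from typing import Callable, Optional

class PrefetchingRetriever:
    def __init__(
        self,
        retrieve: Callable[[str], list[str]],
        predict_next: Callable[[str], Optional[str]],
    ) -> None:
        self.retrieve = retrieve
        self.predict_next = predict_next
        self.cache: dict[str, list[str]] = {}
        self.hits = 0

    def ask(self, query: str) -> list[str]:
        if query in self.cache:
            self.hits += 1
            docs = self.cache[query]       # the prediction was right
        else:
            docs = self.retrieve(query)    # the prediction missed, or there was none
        self.cache.clear()                 # stale guesses must not linger
        follow_up = self.predict_next(query)
        if follow_up is not None:
            self.cache[follow_up] = self.retrieve(follow_up)
        return docs

# Stubs: a call-counting retriever and a table of well-known follow-ups.
calls: list[str] = []
def retrieve(query: str) -> list[str]:
    calls.append(query)
    return [f"doc about {query}"]

FOLLOW_UPS = {"reset password": "password requirements"}
speculative = PrefetchingRetriever(retrieve, FOLLOW_UPS.get)
```

Clearing the cache on every turn is one conservative answer to the wrong-prediction problem: a missed guess costs one wasted retrieval but never leaks into a later turn.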

11 · Primary · POST-CHECK

Corrective RAG (CRAG)

After retrieval, evaluate document quality. If weak, fall back to web search or query rewrite before generation.

When it earns the build

High-stakes accuracy contexts like legal research, academic writing, and policy analysis, where catching a bad retrieval before generation is worth the extra latency.

When it doesn't

Volume traffic. Quality checks add cost on every query, including the ones that would have been fine.
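
The post-retrieval check can be sketched as a grading gate with two fallbacks. The term-overlap grader is a toy stand-in for a learned relevance evaluator, and `corrective_retrieve`, the 0.5 threshold, and the stub sources are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

def overlap_grade(query: str, doc: str) -> float:
    # Toy relevance grade: fraction of query terms that appear in the document.
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    web_search: Callable[[str], list[str]],
    rewrite: Optional[Callable[[str], str]] = None,
    grade: Callable[[str, str], float] = overlap_grade,
    threshold: float = 0.5,
) -> tuple[list[str], str]:
    # Try the original query, then (optionally) one rewritten query.
    attempts = [query] + ([rewrite(query)] if rewrite else [])
    for attempt in attempts:
        kept = [d for d in retrieve(attempt) if grade(attempt, d) >= threshold]
        if kept:
            return kept, "corpus"
    # Nothing in the corpus passed the gate: fall back before generation.
    return web_search(query), "web_fallback"

# Stub corpus and web search, for illustration only.
CORPUS = ["Office hours are 9 to 5 on weekdays."]
retrieve = lambda q: CORPUS
web_search = lambda q: [f"web result for: {q}"]
```

Returning the route label alongside the documents lets the eval pipeline track how often the fallback fires, which is the number that tells you whether the gate is paying for itself.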

12 · Specialised · QUERY EXPANSION

HyDE (Hypothetical Document Embedding)

LLM writes a hypothetical answer to the query, embeds the answer, retrieves documents semantically similar to it. Semantic matching, not term matching.

When it earns the build

Technical or specialist domains where the query and the document use different vocabulary. Medical, legal, academic. When BM25 misses and embeddings need the right anchor.

When it doesn't

Domains where the LLM is likely to hallucinate the hypothetical. The bad guess pulls retrieval down a wrong path. Always pair with rerank.
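
A toy demonstration of the vocabulary gap HyDE is meant to bridge. The bag-of-words `embed` is a stand-in for a real embedding model, `write_hypothetical` replaces the LLM call with a canned string, and the documents are invented; the point is only that the hypothetical answer shares terms with the right document when the query does not.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(vector: Counter, docs: list[str]) -> str:
    return max(docs, key=lambda d: cosine(vector, embed(d)))

def hyde_retrieve(query: str, docs: list[str], write_hypothetical: Callable[[str], str]) -> str:
    # Embed the hypothetical answer instead of the raw query.
    return nearest(embed(write_hypothetical(query)), docs)

DOCS = [
    "Acute myocardial infarction management: aspirin, reperfusion, beta blockers.",
    "Treatment of ankle sprain: rest, ice, compression.",
]
# Canned stand-in for an LLM-written hypothetical answer.
write_hypothetical = lambda q: "Myocardial infarction is managed with aspirin and reperfusion therapy."

query = "heart attack treatment"
direct = nearest(embed(query), DOCS)                  # matches on "treatment" only
hyde = hyde_retrieve(query, DOCS, write_hypothetical)
```

Here the direct lookup lands on the ankle-sprain document while the HyDE lookup lands on the cardiology one. If the canned hypothetical were wrong, retrieval would follow it just as faithfully, which is the case the rerank step is there to catch.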

13 · Specialised · BEYOND TEXT

Multimodal RAG

Text + images + tables + audio in one retrieval surface. Vision LLMs and multimodal retrievers such as ColPali (alongside text embedders such as BGE-M3) make documents understandable, not just searchable.

When it earns the build

PDFs with tables and figures (legal, financial, technical), visual catalogs, scanned archives, medical imaging notes. See our /llm/rag/multimodal/ playbook for the full build.

When it doesn't

Plain-text corpora. The extra extraction step costs latency and complexity without retrieval gain.

14 · Specialised · REASONING LOOPS

Iterative / multi-step RAG

Generate, retrieve based on the partial answer, generate again. Used inside agentic and chain-of-thought workflows when one retrieval pass isn't enough.

When it earns the build

Long-form synthesis, structured report writing, queries that decompose into sub-questions. Often a composition pattern inside agentic systems rather than a standalone build.

When it doesn't

Single-turn lookup. Iteration just adds latency.
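
The generate, retrieve, generate loop can be sketched as follows, assuming the partial answer is simply appended to the next retrieval query. The fact table, the substring retriever, and the concatenating generator are toy stand-ins, and the names in them are fictional.

```python
from typing import Callable

def iterative_rag(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    max_steps: int = 3,
) -> tuple[str, list[str]]:
    draft = ""
    evidence: list[str] = []
    for _ in range(max_steps):
        # The partial answer feeds the next retrieval query.
        query = f"{question} {draft}".strip()
        new_docs = [d for d in retrieve(query) if d not in evidence]
        if not new_docs:
            break  # nothing new surfaced, so further passes only add latency
        evidence.extend(new_docs)
        draft = generate(question, evidence)
    return draft, evidence

# Fictional facts keyed by the entity that must appear in the query.
FACTS = {
    "Acme": "Acme was founded by Jane Doe.",
    "Jane Doe": "Jane Doe was born in Leeds.",
}
retrieve = lambda query: [fact for key, fact in FACTS.items() if key in query]
generate = lambda question, evidence: " ".join(evidence)

draft, evidence = iterative_rag("Where was the founder of Acme born?", retrieve, generate)
```

The second fact is only reachable once the first pass has put "Jane Doe" into the draft, which is the sub-question decomposition this card describes.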

[WHAT YOU GET]

What we leave on every RAG build.

14: Patterns considered, one named
Hybrid: Vector + BM25 by default
Reranked: Cross-encoder on top-K
Cited: Every claim, every answer
[COMMON QUESTIONS]

What buyers ask before they sign.

If you had to ship one pattern tomorrow, which one?
Advanced RAG: hybrid retrieval (pgvector + BM25 fused with RRF) + cross-encoder rerank (Cohere Rerank v3) + citation discipline. It's the 2026 production consensus, and the +17 pts of Recall@5 from reranking alone almost always justifies the latency. Promote to Agentic or GraphRAG only when the corpus or query shape demands it.
When is GraphRAG worth the build cost?
When answer quality depends on multi-hop reasoning across linked entities: clinical decision support, regulatory analysis, complex case files, multi-document Q&A in regulated domains. Microsoft and follow-on research report 6-8 point accuracy gains over flat RAG, hitting 81%+ in specialised domains. The cost is building and maintaining the graph, which is non-trivial.
Aren't Self-RAG and Corrective RAG basically the same?
Close but not identical. Self-RAG critiques its own answer and decides whether to retrieve again. Corrective RAG evaluates retrieval quality before generation, and falls back to web search or query rewrite if the retrieved docs are weak. In practice we often combine them with eval gates at both stages.
Is HyDE still useful with modern embedding models?
Yes, in specialist domains where the query vocabulary diverges from the document vocabulary: medical, legal, code. Always pair with rerank: HyDE expands the candidate set; rerank cleans it up. Without rerank, hallucinated hypotheses can pull retrieval off course.
Do you ever ship Naive RAG to production?
Only for very narrow, well-keyed corpora: a single product's FAQ, internal docs with consistent vocabulary, narrow-domain chatbots. Even then, we add reranking and citations on the second iteration. The cost of that upgrade is almost always lower than the cost of running a system that is quietly wrong.
DIRECT RAG · APPLIED K

Pick the pattern. Bring the corpus.

We will sketch the pipeline against your real data, name the trade, and ship a measured build. Sized to the work: sprint, program, or ongoing partnership.