★ RAG · Direct LLM · no framework · Production grade

RAG · RETRIEVAL-AUGMENTED GENERATION

Production RAG. Fourteen patterns, seven vector DBs, one discipline.

Most RAG demos pass the screenshot test and fail the corpus. We ship the retrieval pipeline as the load-bearing part of the system: hybrid search, reranking, citations on every answer, and an eval suite gating every change. The model is the smaller engineering problem.

PostgreSQL · pgvector · Cohere · OpenAI · TypeScript · Eval pipelines

Start a conversation →

All RAG topics →

Cycle

Sprint or program · sized to corpus

Stack

Postgres · pgvector · Qdrant · Cohere rerank

Output

Retrieval + citation surface + eval suite

Discipline

Citations on every answer

[THE STACK WE BUILD]

Five layers. Every layer measured.

Production RAG is a pipeline, not a model call. Each layer has a published 2026 best practice; we name the choice and the trade-off for every one.

Corpus + chunking

We characterise the source before we pick a chunker. PDFs with tables, code with comments, transcripts with timestamps, contracts with clauses. Each gets a different strategy. Late chunking and contextual retrieval (Anthropic) where they earn it.

Embedding + index

Embedding model picked from your real query distribution, not a leaderboard. Index lives in pgvector by default, in Qdrant or Milvus when scale demands. HNSW everywhere.
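A minimal sketch of what a pgvector-backed HNSW index can look like, written as SQL strings held in TypeScript. The table and column names (`chunks`, `content`, `embedding`), the index name, and the 1024-dimension column are illustrative assumptions, not details of any specific build; `m` and `ef_construction` are shown at pgvector's documented defaults.

```typescript
// Assumed schema: a `chunks` table with a 1024-dimension embedding column.
// Match the dimension to whichever embedding model you actually pick.
const createSchemaSql = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
  id        bigserial PRIMARY KEY,
  content   text NOT NULL,
  embedding vector(1024) NOT NULL
);

-- HNSW index over cosine distance; m and ef_construction at pgvector defaults.
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
  ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
`;

// <=> is pgvector's cosine-distance operator; smaller means closer.
const nearestChunksSql = `
SELECT id, content, embedding <=> $1::vector AS distance
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT $2;
`;

// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
function toVectorLiteral(values: number[]): string {
  if (values.some((v) => !Number.isFinite(v))) {
    throw new Error("embedding contains a non-finite value");
  }
  return `[${values.join(",")}]`;
}

console.log(toVectorLiteral([0.1, 0.2, 0.3])); // prints "[0.1,0.2,0.3]"
```

The query would be run through whatever Postgres client the project already uses, passing `toVectorLiteral(queryEmbedding)` as `$1` and the candidate count as `$2`.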

Hybrid retrieval

Dense (vector) + sparse (BM25) merged with reciprocal-rank fusion. Catches the exact-term matches embeddings miss and the semantic matches keywords miss. The 2026 default.
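Reciprocal-rank fusion itself is small enough to sketch. The version below assumes each retriever returns an ordered list of chunk IDs and uses the commonly cited constant k = 60; both the constant and the document IDs are illustrative assumptions rather than values from a specific build.

```typescript
// Reciprocal-rank fusion (RRF): merge ranked ID lists from several retrievers.
// Each list contributes 1 / (k + rank) per document, with rank starting at 1.
// k = 60 is a commonly used constant and an assumption here.
type Fused = { id: string; score: number };

function reciprocalRankFusion(rankings: string[][], k = 60): Fused[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical result lists: one from vector search, one from BM25.
const dense = ["doc-a", "doc-b", "doc-c"];
const sparse = ["doc-c", "doc-a", "doc-d"];
const fused = reciprocalRankFusion([dense, sparse]);
console.log(fused.map((r) => r.id)); // ["doc-a", "doc-c", "doc-b", "doc-d"]
```

Because RRF works on rank positions rather than raw scores, the dense and sparse retrievers do not need comparable score scales.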

Reranking

Cross-encoder or late-interaction rerank (Cohere Rerank v3, BGE reranker, ColBERT) on the top-K candidates. Lifts Recall@5 from ~0.69 to ~0.82 on published benchmarks. Two stages, one consistent answer.
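The two-stage shape can be sketched without committing to a vendor. In the example below the scorer is an injected function; `overlapScorer` is a toy stand-in invented for illustration and is not how Cohere Rerank, BGE, or ColBERT score pairs. A real reranker would usually be an asynchronous, batched model call.

```typescript
type Candidate = { id: string; text: string };
// Hypothetical scorer signature: higher score means more relevant.
type Scorer = (query: string, text: string) => number;

// Stage two: rescore the first-stage candidates and keep the best topN.
function rerank(
  query: string,
  candidates: Candidate[],
  score: Scorer,
  topN: number,
): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, score: score(query, candidate.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((entry) => entry.candidate);
}

// Toy stand-in scorer: fraction of query terms that appear in the text.
const overlapScorer: Scorer = (query, text) => {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const words = new Set(text.toLowerCase().split(/\W+/));
  const hits = terms.filter((term) => words.has(term)).length;
  return terms.length === 0 ? 0 : hits / terms.length;
};

const firstStage: Candidate[] = [
  { id: "a", text: "BM25 is a sparse ranking function" },
  { id: "b", text: "pgvector supports an HNSW index" },
  { id: "c", text: "An index speeds up lookups" },
];
const top = rerank("pgvector hnsw index", firstStage, overlapScorer, 2);
console.log(top.map((c) => c.id)); // ["b", "c"]
```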

Citation + eval

Generation prompt requires citations per claim. Eval suite over your real queries gates every prompt + model change. Drift watched in production.
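One way to enforce per-claim citations is a post-generation check. The sketch below assumes the model is prompted to mark citations as `[n]` before the sentence terminator, where `n` is the 1-based index of a retrieved source; that marker format and the naive sentence splitter are assumptions for illustration.

```typescript
type CitationReport = {
  ok: boolean;
  uncited: string[]; // sentences with no [n] marker
  unknownRefs: number[]; // markers that point at no retrieved source
};

function checkCitations(answer: string, sourceCount: number): CitationReport {
  // Naive split: a sentence ends at ., ! or ? followed by whitespace.
  // Abbreviations such as "e.g." would need a proper sentence tokenizer.
  const sentences = answer
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const uncited: string[] = [];
  const unknownRefs: number[] = [];
  for (const sentence of sentences) {
    const markers = sentence.match(/\[\d+\]/g) ?? [];
    if (markers.length === 0) uncited.push(sentence);
    for (const marker of markers) {
      const ref = Number(marker.slice(1, -1));
      if (ref < 1 || ref > sourceCount) unknownRefs.push(ref);
    }
  }
  return {
    ok: uncited.length === 0 && unknownRefs.length === 0,
    uncited,
    unknownRefs,
  };
}

const report = checkCitations(
  "pgvector supports HNSW indexes [1]. Reranking reorders the candidates. Fusion uses rank positions [3].",
  2,
);
console.log(report.ok); // false
console.log(report.uncited); // ["Reranking reorders the candidates."]
console.log(report.unknownRefs); // [3]
```

A check like this only confirms that markers exist and resolve to a retrieved source; whether the cited source actually supports the claim is a faithfulness question for the eval suite.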

[GO DEEP]

Five specialty topics.

Pick the angle that matches the question you came in with.

01 · RAG

RAG architectures

Naive, Advanced, Modular, Agentic, GraphRAG, CRAG, Self-RAG. Seven named patterns with the decision tree for picking one.

Vector databases

pgvector, Qdrant, Milvus, Weaviate, Vespa, LanceDB, Pinecone. Honest 2026 comparison and our default.

Retrieval pipeline

Embeddings, chunking, hybrid search, reranking. The four layers retrieval quality lives or dies in.

RAG by corpus scale

Proven designs from under 100k chunks to over 1B. The architecture changes with the scale.

Multimodal RAG

PDFs with tables and figures. Vision LLM extraction, ColPali, BGE-M3, court-ready citations.

[WHICH PATTERN FITS]

The 90-second pattern picker.

Answer up to three questions about your corpus and queries. The picker filters down to a single recommended pattern with a link to the playbook. Back up or start over at any time.

Pick by question shape, not by hype cycle.

If two answers both feel right for a question, choose the one that describes the build you actually want. The picker recommends the larger pattern by design.

Question 1

Does answer quality depend on multi-hop reasoning across linked facts (entities, relationships)?

[THE PIPELINE]

Query in, citation out.

The shape every production RAG follows in 2026. The retrieval-pipeline page goes layer by layer.

What we run on every query.

Hybrid retrieval + cross-encoder rerank is the 2026 default. It adds ~12-30 ms of p95 latency and ~17 points of Recall@5.

User query

Optional: rewrite or expand (HyDE)

Embed

Cohere v3 / OpenAI 3-large / BGE

Hybrid retrieve

pgvector + BM25 → RRF

Rerank

Cohere Rerank v3 or BGE-reranker

Generate + cite

Claude / GPT with citation discipline
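The flow above can be sketched as one function with injected stages. Every name here (`Stages`, `answerQuery`, the stage signatures) and the defaults `k = 50` and `topN = 5` are hypothetical; each stage stands in for whichever embedding model, index, reranker, and LLM a given build uses.

```typescript
type Chunk = { id: string; text: string };
type Answer = { text: string; citations: string[] }; // citations are chunk IDs

// Hypothetical stage contracts; each would wrap a real service call.
interface Stages {
  rewrite?: (query: string) => Promise<string>; // optional, e.g. HyDE
  embed: (text: string) => Promise<number[]>;
  retrieve: (query: string, embedding: number[], k: number) => Promise<Chunk[]>;
  rerank: (query: string, chunks: Chunk[], topN: number) => Promise<Chunk[]>;
  generate: (query: string, context: Chunk[]) => Promise<Answer>;
}

async function answerQuery(
  query: string,
  stages: Stages,
  k = 50, // assumed first-stage candidate count
  topN = 5, // assumed context size after reranking
): Promise<Answer> {
  const effectiveQuery = stages.rewrite ? await stages.rewrite(query) : query;
  const embedding = await stages.embed(effectiveQuery);
  const candidates = await stages.retrieve(effectiveQuery, embedding, k);
  const context = await stages.rerank(effectiveQuery, candidates, topN);
  const answer = await stages.generate(query, context);

  // Citation discipline: every citation must point at a chunk in the context.
  const known = new Set(context.map((chunk) => chunk.id));
  const stray = answer.citations.filter((id) => !known.has(id));
  if (answer.citations.length === 0 || stray.length > 0) {
    throw new Error("answer is missing citations or cites unknown chunks");
  }
  return answer;
}
```

Keeping the stages behind plain function types makes it straightforward to swap a model or index and rerun the same eval suite against the new combination.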

[WHAT YOU GET]

What's live at handoff.

Citations

On every answer, every claim

Hybrid

Vector + BM25, fused with RRF

+17 pts

Recall@5 lift from hybrid retrieval + reranking

Eval-gated

Every prompt + model change tested

[COMMON QUESTIONS]

What buyers ask before they sign.

Should I use RAG or fine-tune?

Almost always RAG for facts that change, fine-tune (or post-train) for behaviour that doesn't. Most enterprise cases are RAG. We have a deeper take on the blog (RAG vs fine-tuning).

Do I need a dedicated vector database?

Usually not. pgvector with HNSW matches or beats dedicated vector databases up to ~1M vectors on equivalent compute (Supabase 2026 benchmarks). We move to Qdrant / Milvus / Vespa when scale, query throughput, or workload genuinely demands it. See the vector-databases page for the decision matrix.

How do you measure RAG quality?

Golden set of representative queries with expected citations + a separate eval set with adversarial cases. Retrieval metrics (Recall@K, MRR) gate the index. End-to-end metrics (faithfulness, answer quality via LLM-as-judge) gate the prompt + model. See /llm/evaluation for the broader eval discipline.

What about GraphRAG?

Microsoft's GraphRAG is the right answer for domains where multi-hop reasoning across linked entities is the question: clinical research, regulatory analysis, complex case files. Cost: building and maintaining the graph. See architectures for when it earns it.

How long does a RAG engagement run?

First production-grade build is typically an eight-week sprint. Enterprise programs (multi-corpus, GraphRAG, multimodal) run as quarterly phases. Ongoing partnerships make sense post-launch for eval ops, model migrations, and corpus drift monitoring.
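The retrieval metrics named in the measurement answer above (Recall@K, MRR) are simple to compute over a golden set. The sketch below assumes each golden case pairs the retrieved chunk IDs, in rank order, with the set of IDs judged relevant; the `GoldenCase` shape, the example IDs, and the 0.8 gate threshold are all illustrative assumptions.

```typescript
type GoldenCase = { retrieved: string[]; relevant: string[] };

// Recall@K: share of relevant IDs that appear in the top K results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// Reciprocal rank: 1 / position of the first relevant result, 0 if none.
function reciprocalRank(retrieved: string[], relevant: string[]): number {
  const relevantSet = new Set(relevant);
  const index = retrieved.findIndex((id) => relevantSet.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

function evaluate(cases: GoldenCase[], k: number): { recall: number; mrr: number } {
  if (cases.length === 0) return { recall: 0, mrr: 0 };
  let recallSum = 0;
  let rrSum = 0;
  for (const c of cases) {
    recallSum += recallAtK(c.retrieved, c.relevant, k);
    rrSum += reciprocalRank(c.retrieved, c.relevant);
  }
  return { recall: recallSum / cases.length, mrr: rrSum / cases.length };
}

// Hypothetical golden set with two queries.
const golden: GoldenCase[] = [
  { retrieved: ["a", "b", "c", "d", "e"], relevant: ["b", "z"] },
  { retrieved: ["x", "y", "q"], relevant: ["x"] },
];
const metrics = evaluate(golden, 5);
console.log(metrics); // { recall: 0.75, mrr: 0.75 }

// Assumed gate: block the index change when Recall@5 falls below 0.8.
const passesGate = metrics.recall >= 0.8;
console.log(passesGate); // false
```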

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.
