Kensink Labs
RAGDirect LLM · no frameworkProduction grade
RAG · RETRIEVAL-AUGMENTED GENERATION

Production RAG. Fourteen patterns, seven vector DBs, one discipline.

Most RAG demos pass the screenshot test and fail the corpus. We ship the retrieval pipeline as the load-bearing part of the system: hybrid search, reranking, citations on every answer, eval suite gating every change. The model is the smaller engineering problem.

PostgreSQLpgvectorCohereOpenAITypeScriptEval pipelines
Cycle
Sprint or program · sized to corpus
Stack
Postgres · pgvector · Qdrant · Cohere rerank
Output
Retrieval + citation surface + eval suite
Discipline
Citations on every answer
[THE STACK WE BUILD]

Five layers. Every layer measured.

Production RAG is a pipeline, not a model call. Each layer has a published 2026 best practice; we name the choice and the trade for every one.

01

Corpus + chunking

We characterise the source before we pick a chunker. PDFs with tables, code with comments, transcripts with timestamps, contracts with clauses. Each gets a different strategy. Late chunking and contextual retrieval (Anthropic) where they earn it.

02

Embedding + index

Embedding model picked from your real query distribution, not a leaderboard. Index lives in pgvector by default, in Qdrant or Milvus when scale demands. HNSW everywhere.

03

Hybrid retrieval

Dense (vector) + sparse (BM25) merged with reciprocal-rank fusion. Catches the exact-term matches embeddings miss and the semantic matches keywords miss. The 2026 default.

04

Reranking

Cross-encoder rerank (Cohere Rerank v3, BGE, ColBERT) on the top-K. Lifts Recall@5 from ~0.69 to ~0.82 on published benchmarks. Two stages, one consistent answer.

05

Citation + eval

Generation prompt requires citations per claim. Eval suite over your real queries gates every prompt + model change. Drift watched in production.

[WHICH PATTERN FITS]

The 90-second pattern picker.

Answer up to three questions about your corpus and queries. The picker filters down to a single recommended pattern with a link to the playbook. Back up or start over at any time.

Pick by question shape, not by hype cycle.

If two answers both feel right for a question, choose the one that describes the build you actually want. The wizard recommends the bigger pattern by design.

Step 1
No answers yet
Question 1

Does answer quality depend on multi-hop reasoning across linked facts (entities, relationships)?

[THE PIPELINE]

Query in, citation out.

The shape every production RAG follows in 2026. The retrieval-pipeline page goes layer by layer.

What we run on every query.

Hybrid retrieval + cross-encoder rerank is the 2026 default. Adds ~12-30ms p95 and ~17 points of Recall@5.

01

User query

Optional: rewrite or expand (HyDE)

02

Embed

Cohere v3 / OpenAI 3-large / BGE

03

Hybrid retrieve

pgvector + BM25 → RRF

04

Rerank

Cohere Rerank v3 or BGE-reranker

05

Generate + cite

Claude / GPT with citation discipline

[WHAT YOU GET]

What's live at handoff.

Citations
On every answer, every claim
Hybrid
Vector + BM25, fused with RRF
+17 pts
Recall@5 lift from reranking
Eval-gated
Every prompt + model change tested
[COMMON QUESTIONS]

What buyers ask before they sign.

Should I use RAG or fine-tune?
Almost always RAG for facts that change, fine-tune (or post-train) for behaviour that doesn't. Most enterprise cases are RAG. We have a deeper take on the blog (RAG vs fine-tuning).
Do I need a dedicated vector database?
Usually not. pgvector with HNSW matches or beats dedicated vector databases up to ~1M vectors on equivalent compute (Supabase 2026 benchmarks). We move to Qdrant / Milvus / Vespa when scale, query throughput, or workload genuinely demands it. See the vector-databases page for the decision matrix.
How do you measure RAG quality?
Golden set of representative queries with expected citations + a separate eval set with adversarial cases. Retrieval metrics (Recall@K, MRR) gate the index. End-to-end metrics (faithfulness, answer quality via LLM-as-judge) gate the prompt + model. See /llm/evaluation for the broader eval discipline.
What about GraphRAG?
Microsoft's GraphRAG is the right answer for domains where multi-hop reasoning across linked entities is the question: clinical research, regulatory analysis, complex case files. Cost: building and maintaining the graph. See architectures for when it earns it.
How long does a RAG engagement run?
First production-grade build is typically an eight-week sprint. Enterprise programs (multi-corpus, GraphRAG, multimodal) run as quarterly phases. Ongoing partnerships make sense post-launch for eval ops, model migrations, and corpus drift monitoring.
DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.