★ Multimodal RAG · Direct LLM · no framework · Production grade

RAG · MULTIMODAL · BEYOND TEXT

Multimodal RAG. Tables, figures, scans, audio. Citation-grade.

Plain-text RAG breaks on the documents that matter most in regulated industries. Legal contracts have signatures and redlines. Financial reports have charts. Medical records have scans. Court-ready RAG needs a vision-aware extraction layer, multimodal embeddings, and a citation surface that tracks back to the page region. This is the shape we built for Affidavit Mapp.

Claude vision · GPT vision · ColPali · BGE-M3 · Unstructured · pgvector


Inputs

PDF · image · scanned · audio

Extract

Claude vision · GPT vision · Unstructured

Embed

ColPali · BGE-M3 · CLIP (legacy)

Reference

Affidavit Mapp (court-ready)

[AT A GLANCE]

Best for: PDFs with tables and figures. Court-ready legal evidence. Financial reports with charts. Technical manuals with schematics. Medical imaging plus text. Anywhere the layout, not just the words, carries meaning.

Origin

ColPali · BGE-M3 · vision LLM convergence (2024-2025)

Year

2024-2026 production consensus

Complexity

Complex · multi-component pipeline

Production stage

Mature for text + tables · emerging for video / audio

[THE PDF PROBLEM]

Four ways naive RAG breaks on real documents.

Each problem has a named solution in 2026. None of them are the fastest path. All of them earn the build when the documents are regulated, complex, or both.

Tables

Row/column structure carries the meaning. Naive text extraction loses the relational structure, and the LLM hallucinates the missing relationships. Solution: table-aware extraction (Unstructured / Camelot) plus structured chunking that preserves header context.
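A minimal sketch of header-preserving table chunking, assuming the extractor has already returned the table as a header row plus data rows. The name `chunk_table` and the `rows_per_chunk` parameter are illustrative and not part of Unstructured or Camelot.

```python
from typing import Dict, List


def chunk_table(header: List[str], rows: List[List[str]],
                rows_per_chunk: int = 2) -> List[Dict[str, object]]:
    """Split a table into chunks; every cell is rendered next to its column name."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        # Pair each cell with its header so the row/column relation survives chunking.
        lines = [
            "; ".join(f"{col}: {cell}" for col, cell in zip(header, row))
            for row in group
        ]
        chunks.append({
            "text": "\n".join(lines),
            "row_range": (start, start + len(group) - 1),  # inclusive row indices
        })
    return chunks


# Hypothetical example table.
header = ["Quarter", "Revenue", "Margin"]
rows = [["Q1", "120", "31%"], ["Q2", "135", "33%"], ["Q3", "128", "30%"]]
table_chunks = chunk_table(header, rows)
```

Each chunk stays readable on its own, so a retrieved row never reaches the LLM without its column names.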

Figures + charts

Bar charts, schematics, diagrams. Embeddable as images; retrievable as captions plus images. Solution: vision LLM caption + the original image both indexed, retrieved together at query time.

Scanned pages

OCR is no longer enough. A scan of a hand-annotated contract has signatures, redlines, stamps that matter. Solution: vision LLM does the read directly (Claude vision / GPT vision), no OCR intermediate.

Layout + reading order

Multi-column PDFs, footnotes, sidebars, headers. Reading order breaks every naive extraction. Solution: layout-aware parsers (Unstructured) plus a layout-respecting chunker, often with a vision LLM as the layout judge for ambiguous pages.
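As a rough illustration of why reading order needs layout information, here is a deliberately simple column-first ordering heuristic. It assumes the parser has already produced text blocks with bounding boxes and that the page has evenly split columns; `Block` and `reading_order` are hypothetical names, and real layout parsers handle far more cases (full-width headers, sidebars, footnotes).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (smaller means higher on the page)
    x1: float  # right edge


def reading_order(blocks: List[Block], page_width: float, columns: int = 2) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    col_width = page_width / columns

    def column_of(block: Block) -> int:
        center = (block.x0 + block.x1) / 2
        return min(int(center // col_width), columns - 1)

    return sorted(blocks, key=lambda b: (column_of(b), b.y0))


# Blocks arrive in arbitrary extraction order.
page = [
    Block("right top", 320, 50, 580),
    Block("left bottom", 30, 400, 290),
    Block("left top", 30, 50, 290),
    Block("right bottom", 320, 400, 580),
]
ordered = [b.text for b in reading_order(page, page_width=612)]
```

A plain top-to-bottom sort would interleave the two columns; sorting by column first keeps each column's passage contiguous.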

[EXTRACTION FLOW]

From PDF to indexed chunks.

The shape every multimodal RAG ingestion runs through.

Ingestion pipeline (per document).

Each stage emits provenance. The chunk-level metadata that lands in the index includes source URL, page, region, extraction model, and timestamp.

01 PDF / image

Lands in object storage

02 Type router

Vision LLM vs Unstructured

03 Extract + region

Text + media + bbox

04 Chunk + caption

Layout-respecting

05 Embed

ColPali / BGE-M3 / Cohere

06 Index

pgvector + provenance
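The chunk-level provenance named above (source URL, page, region, extraction model, timestamp) could be carried as a small record like the following. This is a sketch under assumptions: the class name, field names, and the example values are illustrative, not the schema of any particular build.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkProvenance:
    source_url: str
    page: int
    # (x0, y0, x1, y1) in page coordinates; None when the chunk covers the whole page.
    region: Optional[Tuple[float, float, float, float]]
    extraction_model: str
    extracted_at: str  # ISO 8601 timestamp in UTC


def make_provenance(source_url: str, page: int,
                    region: Optional[Tuple[float, float, float, float]],
                    extraction_model: str) -> ChunkProvenance:
    """Stamp a chunk with where it came from and which model extracted it."""
    return ChunkProvenance(
        source_url=source_url,
        page=page,
        region=region,
        extraction_model=extraction_model,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )


# Placeholder values for illustration only.
prov = make_provenance("s3://example-bucket/exhibit-12.pdf", 4,
                       (72.0, 140.0, 540.0, 320.0), "vision-llm-placeholder")
metadata = asdict(prov)  # plain dict, ready to store next to the embedding
```

Freezing the dataclass keeps the provenance immutable once a stage has emitted it, which suits an audit trail.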

[RETRIEVAL ARCHITECTURE]

Text and visual, retrieved together.

At query time, the index serves text chunks, image embeddings, table chunks, and audio transcripts through one unified retrieval call. Rerank tightens the top-K; the LLM generates the answer with citations back to the source media.

[FIVE-LAYER STACK]

The components, named.

Document ingestion

PDFs, images, audio, structured docs land in object storage. A worker enqueues them by type. Each type gets its own extraction strategy. Provenance is recorded (source URL, hash, ingestion timestamp) so every chunk can cite its origin.

Extraction

Vision LLM for documents with layout / tables / figures (Claude vision is our 2026 default, GPT vision for fallback). Unstructured for clean PDFs with simple layout. Whisper for audio. The extraction layer outputs both the text AND the original media reference.
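The routing decision described here can be sketched as a small function. The `DocProfile` fields and the 20% threshold are assumptions made for illustration; a real router would be tuned against the profiled corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Extractor(Enum):
    VISION_LLM = "vision_llm"
    UNSTRUCTURED = "unstructured"
    WHISPER = "whisper"


@dataclass
class DocProfile:
    media_type: str             # "pdf", "image", or "audio"
    has_text_layer: bool        # False for scanned pages
    table_or_figure_pages: int  # pages that contain a table or figure
    total_pages: int


def route(profile: DocProfile, visual_ratio_threshold: float = 0.2) -> Extractor:
    """Pick an extractor: audio to Whisper, scans and layout-heavy docs to a vision LLM."""
    if profile.media_type == "audio":
        return Extractor.WHISPER
    if profile.media_type == "image" or not profile.has_text_layer:
        return Extractor.VISION_LLM
    visual_ratio = profile.table_or_figure_pages / max(profile.total_pages, 1)
    if visual_ratio >= visual_ratio_threshold:
        return Extractor.VISION_LLM
    return Extractor.UNSTRUCTURED  # clean born-digital PDF with simple layout


scanned_contract = route(DocProfile("pdf", False, 0, 12))
clean_brief = route(DocProfile("pdf", True, 1, 40))
```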

Embedding

Text-only: Cohere Embed v3 or OpenAI text-embedding-3-large, or BGE-M3 for dense + sparse + multi-vector text retrieval in one model. Multimodal: ColPali for direct image retrieval (no text intermediate). CLIP for legacy multimodal where the corpus and queries don't need 2024-era retrieval gains.
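ColPali scores a query against a page with late interaction over many vectors per page rather than a single vector per chunk. The scoring rule (often called MaxSim) is sketched below in plain Python with toy two-dimensional vectors; the vectors are made up, and a real deployment would use the model's own embeddings and a vectorized implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """For each query vector, take its best-matching page vector, then sum the maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data: two query token vectors, two pages with two patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query vectors
page_b = [[0.9, 0.1], [0.8, 0.2]]  # matches only the first query vector well
scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
```

Because every query vector looks for its own best match, a page has to cover all parts of the query to score well, which is also why the index is heavier than a single-vector one.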

Unified index

Text chunks, image embeddings, table chunks, and audio transcripts all in one pgvector or Qdrant index, separated by metadata facets. Retrieval can be filtered by media type, or merged across all four.
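A toy in-memory version of faceted retrieval, to show the filter-or-merge behavior. It assumes every record shares one embedding space and carries a `media_type` facet; the records and vectors are invented. In pgvector or Qdrant the same idea would be expressed as a metadata filter on the query rather than a Python list comprehension.

```python
import math
from typing import Dict, List, Optional, Set


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def search(index: List[Dict], query_vec: List[float], k: int = 3,
           media_types: Optional[Set[str]] = None) -> List[str]:
    """Return the ids of the top-k records, optionally restricted to some media types."""
    candidates = [r for r in index
                  if media_types is None or r["media_type"] in media_types]
    ranked = sorted(candidates,
                    key=lambda r: cosine(query_vec, r["embedding"]),
                    reverse=True)
    return [r["id"] for r in ranked[:k]]


index = [
    {"id": "p3-text", "media_type": "text", "embedding": [0.9, 0.1]},
    {"id": "p3-table", "media_type": "table", "embedding": [0.7, 0.7]},
    {"id": "p5-figure", "media_type": "image", "embedding": [0.1, 0.9]},
    {"id": "call-01", "media_type": "audio", "embedding": [0.8, 0.3]},
]
merged = search(index, [1.0, 0.0], k=2)                            # across all media
tables_only = search(index, [1.0, 0.0], k=2, media_types={"table"})  # one facet
```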

Citation surface

Every claim in the answer points back to its source: page number, table cell, audio timestamp, image region. The UI renders the citation inline; the source can be opened, verified, and audited. Court-ready by design.
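One way to enforce "every claim points back to its source" is to validate the model's structured output before rendering it. The sketch below assumes the answer arrives as a list of claims, each with a list of cited chunk ids; that shape and the function name are assumptions for illustration.

```python
from typing import Dict, List


def validate_citations(answer: List[Dict], retrieved: Dict[str, Dict]) -> List[str]:
    """Return a list of problems; an empty list means the citation map is valid."""
    problems = []
    for i, claim in enumerate(answer):
        cited = claim.get("citations", [])
        if not cited:
            problems.append(f"claim {i} has no citation")
        for chunk_id in cited:
            # A citation is only acceptable if it points at a chunk we actually retrieved.
            if chunk_id not in retrieved:
                problems.append(f"claim {i} cites unknown chunk {chunk_id}")
    return problems


# Hypothetical retrieved chunks, keyed by chunk id, with their regions.
retrieved = {
    "c1": {"page": 4, "bbox": (72, 140, 540, 320)},
    "c2": {"page": 9, "bbox": (72, 400, 540, 610)},
}
answer = [
    {"text": "The transfer occurred on 3 March.", "citations": ["c1"]},
    {"text": "The balance was disputed.", "citations": ["c5"]},
    {"text": "The account was later closed.", "citations": []},
]
problems = validate_citations(answer, retrieved)
```

An answer with a non-empty problem list can be rejected or regenerated instead of being shown with an unverifiable citation.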

[HOW WE DEPLOY]

Day one to indexed corpus.

An eight-week sprint for the first multimodal RAG build over a defined corpus. Longer programs phase the same shape across additional corpora and regulatory reviews.

01
Document profiling
We sample the corpus. Scanned vs born-digital, table density, figure load, layout consistency, page count distribution. Each shape gets a different extraction strategy; we name them up front.
02
Type router + extraction layer
A type router classifies each incoming document and dispatches to the right extractor. Vision LLM for layout-heavy and scanned. Unstructured for clean born-digital. Whisper for audio. Each extractor emits text plus the source media reference plus a region (page plus bounding box, or a timestamp range for audio).
03
Layout-respecting chunking
Multi-column reading order resolved. Tables kept as structural units (header row plus rows). Figures captioned and indexed with their captions. Footnotes attached to the parent passage. Chunk size tuned per document type, not a single global default.
04
Multimodal embedding + index
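Text embedding for prose chunks (Cohere Embed v3 or OpenAI text-embedding-3-large). ColPali for page images and visually-rich pages; BGE-M3 where the extracted text and captions benefit from dense + sparse + multi-vector retrieval. Image embeddings co-stored with text in pgvector or Qdrant, separated by media type for facetable retrieval.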
05
Citation surface
Every chunk carries provenance: source URL, page, bounding box (for visual chunks), timestamp range (for audio), extraction model identifier, ingestion timestamp. The citation surface is the index contract; the UI renders citations inline at query time.
06
Hybrid retrieval + rerank
Same Advanced RAG shape as text-only: vector + BM25 fused with RRF, then a cross-encoder rerank (Cohere Rerank v3) on top-50. For visually-rich queries, ColPali's late-interaction ranking can replace the reranker for image-first results.
07
Generation with cite-to-region
Citation-required prompt. Model cites text-to-passage and image-to-region. Structured output validates the citation map. The UI renders inline citations that open the source PDF at the right page and bounding box.
08
Eval suite + audit trail
Golden set of 100-300 representative queries with hand-graded expected citations, including layout precision. The eval suite gates every release; the audit trail of every model call is shipped to the SIEM, with PII scrubbed at the proxy.
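The fusion in the hybrid retrieval step above (vector + BM25 combined with RRF) can be sketched in a few lines. Reciprocal rank fusion only needs each retriever's ranked ids; `k = 60` is the commonly used constant, and the chunk ids below are made up.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


vector_hits = ["c7", "c2", "c9"]  # ranked by embedding similarity
bm25_hits = ["c2", "c4", "c7"]    # ranked by keyword match
fused = rrf_fuse([vector_hits, bm25_hits])
```

Chunks that both retrievers surface rise to the top without the two score scales ever being compared directly; the fused list is what the cross-encoder rerank would then trim.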

[ACCURACY + BENCHMARKS]

What the numbers say.

Multimodal RAG benchmarks are workload-specific. Published numbers on ColPali and BGE-M3 hold up in our engagements; citation precision is the metric our buyers care about most.

+10-20%

Retrieval accuracy on visually-rich corpora

ColPali published numbers

99.7%

Data integrity on Affidavit Mapp

Production engagement

Days→min

Processing turnaround on legal corpora

Affidavit Mapp

Region

Citation precision on cite-to-region

Our default

Our eval methodology

Our eval suite for multimodal RAG grades retrieval and citation precision separately. Retrieval quality is standard Recall@K plus media-type recall (did we surface the right chart, not just the right text). Citation precision adds region-level grading: did the cited bounding box actually contain the supporting evidence? Both metrics gate the release.
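The two gates can be sketched as follows. This is an illustration under assumptions, not the grading code itself: regions are taken to be `(page, x0, y0, x1, y1)` tuples, and the small `tolerance` for box edges is an invented parameter.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, float, float, float, float]  # (page, x0, y0, x1, y1)


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def region_contains(cited: Region, evidence: Region, tolerance: float = 2.0) -> bool:
    """True when the cited box is on the same page and encloses the evidence box."""
    page_c, cx0, cy0, cx1, cy1 = cited
    page_e, ex0, ey0, ex1, ey1 = evidence
    return (
        page_c == page_e
        and cx0 - tolerance <= ex0
        and cy0 - tolerance <= ey0
        and cx1 + tolerance >= ex1
        and cy1 + tolerance >= ey1
    )


def citation_precision(pairs: List[Tuple[Region, Region]]) -> float:
    """Share of (cited, hand-graded evidence) pairs where the citation holds."""
    if not pairs:
        return 0.0
    return sum(region_contains(c, e) for c, e in pairs) / len(pairs)


recall = recall_at_k(["a", "b", "c", "d"], ["b", "d", "z"], k=3)
precision = citation_precision([
    ((4, 70, 100, 500, 300), (4, 72, 140, 480, 280)),  # encloses the evidence
    ((4, 70, 100, 500, 300), (5, 72, 140, 480, 280)),  # wrong page
])
```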

[COMMUNITY FEEDBACK]

What practitioners report.

Multimodal RAG went from emerging to mature between 2024 and 2026. The papers are recent; the production shape is settled.

Multimodal RAG has moved from research to production fast. ColPali (Faysse et al., 2024) shifted the conversation: late-interaction multimodal embeddings retrieve directly against page images, often beating text-only RAG on visually-rich corpora by 10-20 points. BGE-M3 (Chen et al., 2024) consolidated dense plus sparse plus multi-vector in one embedding model, simplifying the stack. The two together became the 2025-2026 production default for visually-rich corpora. Anthropic's vision API and OpenAI's GPT vision delivered the extraction half of the pipeline at frontier-LLM quality.

[COMMON PITFALLS]

Treating multimodal RAG as text RAG with an images bolt-on. The extraction layer is the bottleneck; build for layout first.
Skipping cite-to-region citations. On regulated builds, the source must be auditable down to the page or cell.
Single chunker for every document type. Tables, prose, scans, and figures want different strategies; one global setting fails all of them.
Cost surprises on vision LLM extraction. Vision token bills can dwarf text bills on large corpora; set per-document budget caps and record provenance everywhere.

[KENSINK LABS EVALUATION]

Our honest take.

Multimodal RAG earns the build when the documents are not plain text. When figures, tables, or scans carry meaning, the multimodal stack repays its complexity.

We have shipped multimodal RAG most consistently in legal and financial work where the documents are scanned, tabular, and regulated. The Affidavit Mapp build is the reference we point buyers at: vision LLM extraction, layout-respecting chunking, hybrid retrieval, cite-to-region citations, 99.7% data integrity vs ground truth, days-to-minutes processing. The pattern is mature; the engineering is the work.

[WHEN WE REACH FOR IT]

Legal evidence, court-ready citation surfaces, regulated document review.
Financial filings, prospectuses, annual reports with charts and tables that matter.
Technical and medical documentation where schematics and images carry meaning.
Customer-facing assistants over product manuals where figures answer faster than prose.

What we'd substitute

Text-only Advanced RAG on text-heavy corpora where figures are decorative rather than load-bearing. Heavyweight document understanding services (Amazon Textract, Google Document AI) for buyers who would rather pay for a managed extraction layer than operate vision LLMs.

[REFERENCE BUILD]

Affidavit Mapp.

Court-ready legal RAG. Bank statements, contracts, scanned exhibits go in as PDFs. The pipeline produces a court-admissible report with citations down to the page-and-region level.

Read the case study →

99.7%

Data integrity vs ground truth

Days→min

Processing turnaround

Court-ready

Citation surface to region

Chain

Of custody preserved

[WHAT YOU GET]

What we ship on a multimodal RAG.

Vision-aware

Extraction, not OCR

Multimodal

ColPali / BGE-M3 / Cohere

Cite-to-region

Page + bounding box per claim

Auditable

Provenance chain on every chunk

[COMMON QUESTIONS]

What buyers ask before they sign.

Do I need a multimodal embedding model, or is text extraction enough?

Depends on the corpus. For text-heavy PDFs (contracts, legal briefs, policy documents) where the figures are illustrative, not load-bearing, high-quality text extraction plus a text embedding (Cohere Embed v3) is enough. For visually-rich documents (financial reports with charts, technical manuals with schematics, medical images), ColPali over the page images, with BGE-M3 handling the extracted text and captions, keeps the original image in the retrieval surface and delivers significantly better retrieval quality.

What is ColPali and why does it matter?

ColPali is a multimodal embedding model that retrieves directly against page images, skipping the text extraction step entirely. Published numbers show it outperforming text-only RAG on visually-rich corpora by 10-20 points of retrieval accuracy. The cost is heavier embeddings and a more expensive index, but for documents where layout IS the content, it's the right answer.

How do you handle citations in multimodal RAG?

The citation surface tracks source identity at extraction time. For PDFs: page number, plus an optional bounding box for the chunk. For tables: page number plus table row/cell. For images: page number plus image region. For audio: timestamp range. The prompt requires the model to cite per claim, and the UI renders the source inline. This is the Affidavit Mapp shape, where court-ready citations were the engagement constraint, not a nice-to-have.

Can I use a single LLM for both extraction and generation?

Yes, and we often do. Claude or GPT vision can extract the document content into structured chunks at ingestion, then the same family of model generates the answer at query time. The decoupling we DO keep: the extraction step writes to the index, and the generation step reads from it, so we can swap one without re-extracting the corpus.

What does the Affidavit Mapp build look like?

Legal documents (bank statements, court filings, financial records) come in as PDFs with mixed layouts including tables, scanned pages, and hand-annotated sections. A vision LLM extracts each into chunks tagged with page + bounding box. Hybrid retrieval (pgvector + BM25 + RRF) finds the relevant chunks. Cohere Rerank tightens to top-20. Generation produces a court-admissible report with citations down to the page-and-region level. Days-to-minutes processing, 99.7% data integrity vs ground truth. See the case study for the full architecture.

[RELATED RAG TOPICS]

Worth a look next.

01 · RAG

Bring the documents. We will sketch the extraction.

Send a sample PDF or describe the corpus. We will name the extraction strategy, the multimodal embedding choice, and the citation surface, sized to your regulatory bar.



RAG architectures

Vector databases

Retrieval pipeline

RAG by corpus scale
