Plain-text RAG breaks on the documents that matter most in regulated industries. Legal contracts have signatures and redlines. Financial reports have charts. Medical records have scans. Court-ready RAG needs a vision-aware extraction layer, multimodal embeddings, and a citation surface that traces back to the page region. This is the shape we built for Affidavit Mapp.
Best for: PDFs with tables and figures. Court-ready legal evidence. Financial reports with charts. Technical manuals with schematics. Medical imaging plus text. Anywhere the layout, not just the words, carries meaning.
Each problem has a named solution in 2026. None of them are the fastest path. All of them earn the build when the documents are regulated, complex, or both.
Tables. Row and column structure carries the meaning. Naive text extraction loses the relational structure, and the LLM hallucinates it back. Solution: table-aware extraction (Unstructured / Camelot) plus structured chunking that preserves header context.
Bar charts, schematics, diagrams. Embeddable as images; retrievable as captions plus images. Solution: vision LLM caption + the original image both indexed, retrieved together at query time.
OCR is no longer enough. A scan of a hand-annotated contract has signatures, redlines, and stamps that matter. Solution: a vision LLM reads the page directly (Claude vision / GPT vision), with no OCR intermediate.
Multi-column PDFs, footnotes, sidebars, headers. Reading order breaks every naive extraction. Solution: layout-aware parsers (Unstructured) plus a layout-respecting chunker, often with a vision LLM as the layout judge for ambiguous pages.
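As a minimal sketch of "structured chunking that preserves header context" for tables: each chunk repeats the header row, so a retrieved chunk is readable on its own. The function name `chunk_table`, the `rows_per_chunk` parameter, and the sample rows are illustrative assumptions, not part of any library or of the Affidavit Mapp build.

```python
from typing import Dict, List


def chunk_table(header: List[str], rows: List[List[str]],
                rows_per_chunk: int = 2) -> List[Dict]:
    """Split a table into chunks that each repeat the header row."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        # Header first, then the rows in this group, pipe-delimited.
        lines = [" | ".join(header)] + [" | ".join(r) for r in group]
        chunks.append({
            "text": "\n".join(lines),
            # Inclusive row indices, kept so a citation can point at rows.
            "row_range": (start, start + len(group) - 1),
        })
    return chunks


# Made-up sample data for illustration only.
header = ["Date", "Payee", "Amount"]
rows = [
    ["2024-01-03", "Acme Ltd", "1,200.00"],
    ["2024-01-09", "Rent", "950.00"],
    ["2024-01-15", "Payroll", "4,310.50"],
]
table_chunks = chunk_table(header, rows, rows_per_chunk=2)
```

Repeating the header costs a few tokens per chunk but means no chunk is a bare list of values without column names.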
The shape every multimodal RAG ingestion runs through.
Each stage emits provenance. The chunk-level metadata that lands in the index includes source URL, page, region, extraction model, and timestamp.
Pipeline stages, in order: lands in object storage; vision LLM vs Unstructured; text + media + bbox; layout-respecting; ColPali / BGE-M3 / Cohere; pgvector + provenance.
At query time, the index serves text chunks, image embeddings, table chunks, and audio transcripts through one unified retrieval call. Rerank tightens the top-K; the LLM generates the answer with citations back to the source media.
PDFs, images, audio, and structured docs land in object storage. A worker enqueues them by type. Each type gets its own extraction strategy. Provenance is recorded (source URL, hash, ingestion timestamp) so every chunk can cite its origin.
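The provenance recorded at ingestion can be sketched as a small immutable record. This is an illustration under assumptions: the `IngestRecord` fields, the choice of SHA-256 as the hash, and the example URL are all hypothetical, since the text only names "source URL, hash, ingestion timestamp".

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestRecord:
    source_url: str
    sha256: str        # content hash (SHA-256 is an assumed choice)
    ingested_at: str   # ISO 8601 timestamp in UTC
    media_type: str


def make_ingest_record(source_url: str, content: bytes,
                       media_type: str) -> IngestRecord:
    """Hash the raw bytes and stamp the ingestion time."""
    digest = hashlib.sha256(content).hexdigest()
    now = datetime.now(timezone.utc).isoformat()
    return IngestRecord(source_url, digest, now, media_type)


# Hypothetical object-storage URL and placeholder bytes.
record = make_ingest_record(
    "s3://example-bucket/exhibit-a.pdf",
    b"%PDF-1.7 example bytes",
    "application/pdf",
)
```

Hashing the raw bytes before any extraction runs lets a later audit confirm that the cited source is the file that was actually ingested.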
Vision LLM for documents with layout / tables / figures (Claude vision is our 2026 default, GPT vision for fallback). Unstructured for clean PDFs with simple layout. Whisper for audio. The extraction layer outputs both the text AND the original media reference.
Text-only: Cohere Embed v3 or OpenAI text-embedding-3-large. Multimodal: ColPali for direct image retrieval (no text intermediate), BGE-M3 for dense + sparse + multi-vector in one model. CLIP for legacy multimodal where the corpus and queries don't need 2024-era retrieval gains.
Text chunks, image embeddings, table chunks, and audio transcripts all in one pgvector or Qdrant index, separated by metadata facets. Retrieval can be filtered by media type, or merged across all four.
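A toy in-memory sketch of retrieval that is either merged across media types or filtered by a metadata facet. It stands in for a pgvector or Qdrant query and is not their API. It also assumes all embeddings live in one shared vector space, which in practice requires a multimodal embedding model. The two-dimensional vectors and chunk IDs are invented for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(index: List[Dict], query_vec: Sequence[float], k: int = 3,
           media_types: Optional[Set[str]] = None) -> List[Dict]:
    """Rank chunks by cosine similarity, optionally restricted by facet."""
    candidates = [c for c in index
                  if media_types is None or c["media_type"] in media_types]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]


# Invented toy index: one chunk per media type.
index = [
    {"id": "txt-1", "media_type": "text",  "embedding": [1.0, 0.0]},
    {"id": "img-1", "media_type": "image", "embedding": [0.6, 0.8]},
    {"id": "tbl-1", "media_type": "table", "embedding": [0.0, 1.0]},
    {"id": "aud-1", "media_type": "audio", "embedding": [-1.0, 0.0]},
]
query = [0.8, 0.6]

merged = [c["id"] for c in search(index, query, k=2)]
tables_and_audio = [c["id"] for c in
                    search(index, query, k=2, media_types={"table", "audio"})]
```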
Every claim in the answer points back to its source: page number, table cell, audio timestamp, image region. The UI renders the citation inline; the source can be opened, verified, and audited. Court-ready by design.
An eight-week sprint for the first multimodal RAG build over a defined corpus. Longer programs phase the same shape across additional corpora and regulatory reviews.
We sample the corpus. Scanned vs born-digital, table density, figure load, layout consistency, page count distribution. Each shape gets a different extraction strategy; we name them up front.
A type router classifies each incoming document and dispatches to the right extractor. Vision LLM for layout-heavy and scanned. Unstructured for clean born-digital. Whisper for audio. Each extractor emits text plus the source media reference plus a region (page + bounding box, timestamp).
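The routing rule described above can be sketched as a small dispatch table. Everything here is a placeholder: the extractor functions are stubs that return a label rather than calling a vision LLM, Unstructured, or Whisper, and the document flags (`is_scanned`, `has_tables`, `has_figures`) are assumed to come from an upstream classifier.

```python
from typing import Callable, Dict


# Stub extractors: real ones would call the respective model or library.
def extract_with_vision_llm(doc: Dict) -> Dict:
    return {"extractor": "vision_llm", "source": doc["path"]}


def extract_with_unstructured(doc: Dict) -> Dict:
    return {"extractor": "unstructured", "source": doc["path"]}


def extract_with_whisper(doc: Dict) -> Dict:
    return {"extractor": "whisper", "source": doc["path"]}


EXTRACTORS: Dict[str, Callable[[Dict], Dict]] = {
    "vision_llm": extract_with_vision_llm,
    "unstructured": extract_with_unstructured,
    "whisper": extract_with_whisper,
}


def route(doc: Dict) -> str:
    """Pick an extractor name from coarse document traits."""
    if doc["media_type"].startswith("audio/"):
        return "whisper"
    if doc.get("is_scanned") or doc.get("has_tables") or doc.get("has_figures"):
        return "vision_llm"
    return "unstructured"  # clean born-digital documents


def extract(doc: Dict) -> Dict:
    return EXTRACTORS[route(doc)](doc)


scanned = {"path": "exhibit.pdf", "media_type": "application/pdf", "is_scanned": True}
clean = {"path": "memo.pdf", "media_type": "application/pdf"}
call = {"path": "call.mp3", "media_type": "audio/mpeg"}
```

Keeping `route` separate from the dispatch table makes the classification rule testable without invoking any model.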
Multi-column reading order resolved. Tables kept as structural units (header row plus rows). Figures captioned and indexed with their captions. Footnotes attached to the parent passage. Chunk size tuned per document type, not a single global default.
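A deliberately simple sketch of resolving reading order on a two-column page: assign each block to a column by its horizontal centre, then read each column top to bottom. It assumes exactly two columns, no blocks spanning both, and a coordinate origin at the top-left of the page. Real pages break these assumptions, which is where a layout-aware parser or a vision LLM comes in. `Block` and `reading_order` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    x0: float
    y0: float  # top edge; smaller y means higher on the page
    x1: float
    y1: float


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order blocks left column first, then right, each top to bottom."""
    mid = page_width / 2

    def key(b: Block):
        column = 0 if (b.x0 + b.x1) / 2 < mid else 1
        return (column, b.y0)

    return sorted(blocks, key=key)


# Blocks in the order a naive top-to-bottom extraction would emit them.
naive = [
    Block("left top", 50, 100, 280, 200),
    Block("right top", 320, 100, 550, 200),
    Block("left bottom", 50, 220, 280, 320),
    Block("right bottom", 320, 220, 550, 320),
]
ordered = [b.text for b in reading_order(naive, page_width=600)]
```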
Text embedding for prose chunks (Cohere Embed v3 or OpenAI text-embedding-3-large). ColPali or BGE-M3 for image and visually-rich pages. Image embeddings co-stored with text in pgvector or Qdrant, separated by media type for facetable retrieval.
Every chunk carries provenance: source URL, page, bounding box (for visual chunks), timestamp range (for audio), extraction model identifier, ingestion timestamp. The citation surface is the index contract; the UI renders citations inline at query time.
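One way to treat this provenance as an index contract is to make the chunk record refuse to exist without the fields its media type requires. The sketch below is an assumption about how such a contract could look. The field names, the media-type labels, and the model identifier string are illustrative, not the schema of any real index.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkProvenance:
    source_url: str
    media_type: str                       # "text", "table", "image", "audio"
    extraction_model: str
    ingested_at: str                      # ISO 8601
    page: Optional[int] = None
    bbox: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    time_range_s: Optional[Tuple[float, float]] = None        # audio only

    def __post_init__(self) -> None:
        # Enforce the contract at construction time.
        if self.media_type in {"image", "table"} and self.bbox is None:
            raise ValueError("visual chunks need a bounding box")
        if self.media_type == "audio" and self.time_range_s is None:
            raise ValueError("audio chunks need a timestamp range")


figure = ChunkProvenance(
    source_url="s3://example-bucket/report.pdf",   # hypothetical URL
    media_type="image",
    extraction_model="example-vision-model",       # placeholder identifier
    ingested_at="2026-01-15T09:00:00+00:00",
    page=4,
    bbox=(72.0, 150.0, 540.0, 420.0),
)
```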
Same Advanced RAG shape as text-only: vector + BM25 fused with RRF, then a cross-encoder rerank (Cohere Rerank v3) on top-50. For visually-rich queries, ColPali's late-interaction ranking can replace the reranker for image-first results.
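The fusion step can be illustrated with reciprocal rank fusion (RRF): each document scores the sum of 1 / (k + rank) over the rankings it appears in. The sketch below uses k = 60, a commonly used constant, and invented chunk IDs. The cross-encoder rerank that would follow is not shown.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c1", "c2", "c3"]   # invented vector-search ranking
bm25_hits = ["c3", "c1", "c4"]     # invented BM25 ranking
fused = rrf([vector_hits, bm25_hits])
```

RRF uses only ranks, so the vector and BM25 scores never need to be put on a common scale before fusing.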
Citation-required prompt. Model cites text-to-passage and image-to-region. Structured output validates the citation map. The UI renders inline citations that open the source PDF at the right page and bounding box.
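A minimal sketch of validating a citation map, assuming the model returns structured output shaped like `{"claims": [{"text": ..., "citations": [{"chunk_id": ..., "page": ...}]}]}`. That shape, the function name, and the sample data are assumptions for illustration. The check rejects claims without citations and citations that point at chunks that were never retrieved or at the wrong page.

```python
from typing import Dict, List


def validate_citations(answer: Dict, retrieved: Dict[str, Dict]) -> List[str]:
    """Return a list of problems; an empty list means the map is valid."""
    errors: List[str] = []
    for i, claim in enumerate(answer.get("claims", [])):
        citations = claim.get("citations", [])
        if not citations:
            errors.append(f"claim {i}: no citation")
        for cite in citations:
            chunk = retrieved.get(cite.get("chunk_id"))
            if chunk is None:
                errors.append(f"claim {i}: unknown chunk {cite.get('chunk_id')}")
            elif cite.get("page") != chunk["page"]:
                errors.append(f"claim {i}: page mismatch for {cite['chunk_id']}")
    return errors


# Invented retrieved set and model answer.
retrieved = {"c1": {"page": 3}, "c2": {"page": 7}}
answer = {
    "claims": [
        {"text": "Rent was paid on 9 January.",
         "citations": [{"chunk_id": "c1", "page": 3}]},
        {"text": "The balance was disputed.", "citations": []},
        {"text": "A stamp appears on the exhibit.",
         "citations": [{"chunk_id": "c9", "page": 1}]},
    ]
}
problems = validate_citations(answer, retrieved)
```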
Golden set of 100-300 representative queries with hand-graded expected citations, including layout precision. The eval gates shipping; the audit trail of every model call goes to the SIEM, with PII scrubbed at the proxy.
Multimodal RAG benchmarks are workload-specific. Published numbers on ColPali and BGE-M3 hold up in our engagements; citation precision is the metric our buyers care about most.
Our eval suite for multimodal RAG grades retrieval and citation precision separately. Retrieval precision is standard Recall@K plus media-type recall (did we surface the right chart, not just the right text). Citation precision adds region-level grading: did the cited bounding box actually contain the supporting evidence? Both metrics gate ship.
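The two metrics can be sketched as follows. Recall@K is standard. For region-level grading, this sketch counts a citation as correct when it is on the right page and its bounding box fully contains a hand-graded evidence box. Strict containment is one plausible reading of "contain the supporting evidence", and an overlap threshold would be another. All names and sample values are illustrative.

```python
from typing import List, Sequence, Set, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def contains(outer: BBox, inner: BBox) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def citation_precision(cited: List[Tuple[int, BBox]],
                       gold: List[Tuple[int, BBox]]) -> float:
    """Share of cited (page, bbox) regions that contain a gold evidence box."""
    if not cited:
        return 0.0
    correct = sum(
        1 for page, box in cited
        if any(page == gold_page and contains(box, gold_box)
               for gold_page, gold_box in gold)
    )
    return correct / len(cited)


recall = recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)
gold = [(3, (100.0, 100.0, 200.0, 150.0))]
cited = [
    (3, (90.0, 90.0, 210.0, 160.0)),    # right page, contains the evidence
    (3, (300.0, 300.0, 400.0, 350.0)),  # right page, wrong region
]
precision = citation_precision(cited, gold)
```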
Multimodal RAG went from emerging to mature between 2024 and 2026. The papers are recent; the production shape is settled.
Multimodal RAG has moved from research to production fast. ColPali (Faysse et al., 2024) shifted the conversation: late-interaction multimodal embeddings retrieve directly against page images, often beating text-only RAG on visually-rich corpora by 10-20 points. BGE-M3 (Chen et al., 2024) consolidated dense plus sparse plus multi-vector in one embedding model, simplifying the stack. The two together became the 2025-2026 production default for visually-rich corpora. Anthropic's vision API and OpenAI's GPT vision delivered the extraction half of the pipeline at frontier-LLM quality.
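Late interaction, the scoring scheme ColPali inherits from ColBERT, can be sketched in a few lines: each query token vector takes its best-matching page patch vector, and those maxima are summed (often called MaxSim). The vectors below are toy two-dimensional values chosen for illustration. Real models use many more dimensions and learned embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum over query tokens of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy query token embeddings
page_a = [[1.0, 0.0], [0.0, 0.5]]         # toy patch embeddings, page A
page_b = [[0.2, 0.2]]                     # toy patch embeddings, page B

score_a = maxsim_score(query_tokens, page_a)
score_b = maxsim_score(query_tokens, page_b)
best_page = "A" if score_a > score_b else "B"
```

Because every query token is matched against every patch, a page can score well when different regions of it answer different parts of the query.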
Multimodal RAG earns the build when the documents are not plain text. When figures, tables, or scans carry meaning, the multimodal stack repays its complexity.
We have shipped multimodal RAG most consistently in legal and financial work where the documents are scanned, tabular, and regulated. The Affidavit Mapp build is the reference we point buyers at: vision LLM extraction, layout-respecting chunking, hybrid retrieval, cite-to-region citations, 99.7% data integrity vs ground truth, days-to-minutes processing. The pattern is mature; the engineering is the work.
Text-only Advanced RAG on text-heavy corpora where figures are decorative rather than load-bearing. Heavyweight document understanding services (AWS Textract, Google DocumentAI) for buyers who would rather pay for a managed extraction layer than operate vision LLMs.
Court-ready legal RAG. Bank statements, contracts, scanned exhibits go in as PDFs. The pipeline produces a court-admissible report with citations down to the page-and-region level.