A user asks the corporate assistant: "How do I enable SSO for Product B?". The bot replies: "Click the button below and pick SAML". The problem is that there's no button "below" in the answer. In the source documentation it was on a screenshot. The Registrar dutifully brought the page into the system, but the Indexer threw the image away as "not text". The index got the paragraph that says "click here", but not the part that says where to click.
This is a textbook RAG failure. From the outside it looks like the LLM messed up. In reality the model didn't malfunction – it was handed a truncated version of reality.
There's a second class of failures: money. In a public write-up, an engineer described how an embedding pipeline burned tens of thousands of dollars, and how the bill dropped to hundreds of dollars a month after a switch to a hybrid strategy. The story sticks not because the number is huge, but because the cause is banal: ingestion was running unsupervised, re-embedding too much, with too expensive a model, and without any real cost control.
This article is about the RAG Indexer: the layer that turns "the system has accepted this document" into "the document can be found, cited, and safely used". The Registrar answers which documents exist, who can see them, and what lifecycle stage they're in. The Indexer answers something else:
- what specifically from the document goes into the index;
- how the document will be parsed;
- how it will be cut up;
- which embeddings will be produced;
- which metadata will travel alongside;
- which vector indexes will be built;
- how to bind text, tables, images, and permissions together;
- how to survive an embedder migration without a silent recall drop;
- how to delete data under GDPR when the user invokes "forget me".
A good Indexer is not "a parser + chunk_size=512 + embeddings". It's a production system with its own contracts, metrics, cost model, migrations, and security boundary.
§ 01 The Indexer contract: what it receives and what it must produce
The Indexer doesn't receive a file. It receives a document, with context from the Registrar:
{
"doc_id": "doc_123",
"canonical_id": "product_b_sso_setup",
"source": "confluence",
"scope": "product_b",
"language": "en",
"lifecycle_status": "active",
"acl_principals": ["group:support", "group:solutions"],
"source_trust": 0.9,
"last_modified": "2026-04-10T12:30:00Z",
"content_hash": "sha256:...",
"requires_multimodal": true
}
On the way out, the Retriever should not get "an array of vectors" – it should get a proper search surface:
{
"chunk_id": "chk_789",
"parent_doc_id": "doc_123",
"retrieval_unit_id": "unit_456",
"text": "To enable SSO, go to Settings → Security...",
"section_path": "Product B > Admin > SSO > SAML setup",
"page_number": 12,
"linked_image_ids": ["img_12_01"],
"chunk_type": "original",
"structural_role": "body",
"contains_image": true,
"contains_table": false,
"parser_version": "mineru-2.5-pro",
"chunker_version": "structural-v3",
"embedding_model": "text-embedding-3-small",
"embedding_dim": 1536,
"embedding_space_id": "openai-3-small-v1",
"acl_principals": ["group:support", "group:solutions"],
"subject_ids": [],
"untrusted_layer": false,
"chunk_hash": "sha256:..."
}
The Indexer designs how a document is represented for search.
It decides not only "how to get the text", but also "what counts as a fact", "what counts as context", "what counts as an image", "what can be shown to this specific user", and "how to later prove where the answer came from".
§ 02 Parsing isn't "extract the text" – it's preserve the structure
The most common mistake is to think that parsing means "pull out the strings". For RAG that's not enough. A good parser has to preserve:
- text;
- headings and hierarchy;
- tables;
- formulas;
- lists;
- captions;
- links;
- page numbers;
- the link between text and images;
- visible vs hidden text;
- confidence at the page or block level.
PDF is particularly nasty: it's a format for visual rendering, not for structured knowledge. That's why a single "parser for everything" almost always loses to a router approach.
Production parser router
A document arrives.
1. Detect the format:
- HTML / Markdown / DOCX / Confluence / Notion / GitBook
- digital-born PDF
- scanned PDF
- PPTX / pitch deck
- XLSX / spreadsheets
- screenshots / dashboards
2. Pick the cheapest parser that preserves the structure you need:
- HTML/Markdown/DOCX → source-native parser
- simple digital-born → PyMuPDF / pdfplumber
- complex PDF / scan → VLM parser
- chart-heavy → vision-first index in parallel
3. If confidence is low:
- escalate to a more expensive parser
- don't lose the whole document over one page
- keep an error marker and page-level status
Cheap parser for 80% of the corpus, expensive parser for the 20% of hard documents, premium fallback for critical pages.
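The escalation logic above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the parser callables, the `PageResult` shape, and the 0.8 confidence threshold are all hypothetical, and real parsers would be wrapped to return a `(text, confidence)` pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical parser signature: (page_bytes, page_number) -> (text, confidence).
Parser = Callable[[bytes, int], Tuple[str, float]]


@dataclass
class PageResult:
    page: int
    text: str
    confidence: float
    parser: str
    status: str = "ok"  # "ok" | "low_confidence" | "failed"


def parse_page(page_bytes: bytes, page_no: int,
               tiers: List[Tuple[str, Parser]],
               min_confidence: float = 0.8) -> PageResult:
    """Try parsers from cheapest to most expensive; stop at the first good one."""
    last = None
    for name, parser in tiers:
        try:
            text, confidence = parser(page_bytes, page_no)
        except Exception:
            continue  # a crashing parser must not lose the page: try the next tier
        last = PageResult(page_no, text, confidence, name)
        if confidence >= min_confidence:
            return last
    if last is None:
        # Every tier crashed: keep an error marker instead of dropping the page.
        return PageResult(page_no, "", 0.0, "none", status="failed")
    last.status = "low_confidence"
    return last


def parse_document(pages: List[bytes], tiers: List[Tuple[str, Parser]],
                   min_confidence: float = 0.8) -> List[PageResult]:
    """Page-level status: one bad page never fails the whole document."""
    return [parse_page(p, i + 1, tiers, min_confidence)
            for i, p in enumerate(pages)]
```

The point of the design is that status lives at the page level, so a single unreadable scan shows up as one `low_confidence` or `failed` record rather than a missing document.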
§ 03 Vision-first: when meaning lives in the picture
There are documents you can't honestly turn into plain text without losing meaning:
- pitch decks;
- BI dashboards;
- financial reports with charts;
- instructions with screenshots;
- "click here" pages;
- UI documentation;
- tables that exist only as images;
- hand‑drawn or scanned diagrams.
For these documents, a second path opens up: instead of parsing the page into text, index the page as an image. By 2026 this is no longer experimental. Jina Embeddings v4 is positioned as a multimodal/multilingual embedding model for complex document retrieval with charts, tables, and illustrations; it supports both single‑vector and late‑interaction multi‑vector modes. Cohere Embed 4 also handles text, images, and interleaved text+image in a unified representation, including screenshots of PDFs, slides, tables, and figures. Voyage multimodal models support text, images, and video, and voyage‑multimodal‑3.5 has 32K context and 1024 default dimensions.
If a user's question depends on a picture, chart, diagram, or UI screenshot, a text‑only index isn't enough anymore.
There are three levels of multimodal binding.
Level 1: caption + OCR augmentation
A VLM generates a caption for the image:
Image caption:
Screenshot of the Settings → Security page.
The "Enable SSO" button is visible in the top right corner.
That caption is appended to the neighbouring text chunk. It's a cheap fallback, and it covers a lot of product documentation.
Level 2: text vector + image vector
The text and the image are indexed separately, but linked through a shared retrieval_unit_id:
{
"retrieval_unit_id": "unit_456",
"text_chunk_id": "chk_789",
"image_ids": ["img_12_01"],
"binding_type": "instruction_screenshot"
}
If the Retriever finds the text, it must pull in the linked image. If it finds the image, it must pull in the neighbouring text.
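A small sketch of that rule, assuming binding records shaped like the JSON above. The function names and the in-memory lookup are hypothetical; a real system would resolve the binding from the metadata store.

```python
def build_member_index(units):
    """Map every member id (text chunk or image) to its retrieval unit."""
    index = {}
    for unit in units:
        index[unit["text_chunk_id"]] = unit
        for image_id in unit["image_ids"]:
            index[image_id] = unit
    return index


def expand_hits(hit_ids, member_index):
    """Replace each hit with its whole retrieval unit, without duplicates."""
    seen = set()
    expanded = []
    for hit_id in hit_ids:
        unit = member_index.get(hit_id)
        if unit is None or unit["retrieval_unit_id"] in seen:
            continue
        seen.add(unit["retrieval_unit_id"])
        expanded.append(unit)
    return expanded
```

Whether the search matched `chk_789` or `img_12_01`, the caller gets the same unit back exactly once, with both the text and the screenshot attached.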
Level 3: vision-first page retrieval
For dashboards, charts, slide decks, and visual instructions, the page is indexed as an image‑first unit. It costs more, but otherwise the system ends up saying "click the button below" with no button.
§ 04 Chunking is not chunk_size – it's where one thought ends
Chunking isn't a choice of chunk_size. It's the decision about where one thought in a document ends and another begins. Bad chunking produces four types of problem:
- The chunk is too small – the LLM doesn't have enough context.
- The chunk is too big – retrieval finds a "vaguely relevant" slab, but the actual fact drowns inside it.
- The chunk cut a table, code block, or formula in half.
- The chunk lost its heading, and "click Save" no longer means anything.
Decision tree
Is there explicit structure?
Markdown / HTML / DOCX / Confluence headings?
→ structural chunking by heading
→ section_path is mandatory
Is it an XLSX or a table?
→ sheet → table region → row groups
→ keep header rows and units
Is it a Slack/email thread?
→ thread as the base unit
→ split only if > the limit
Is it a transcript/log?
→ fixed-size 256–512 tokens + 10–20% overlap
→ timestamp metadata
Is it a long-form legal/book/technical doc?
→ structural + hierarchical parent/child
→ possibly RAPTOR / summary nodes
Lots of anaphora: "this feature", "as mentioned above"?
→ late chunking or contextual augmentation
Are there screenshots/charts?
→ text chunk + linked image
→ possibly a vision-first index
Structural chunking – the foundation for 80% of cases
If the document has headings, use them. The author has done the work for you. A chunk should not be stored as:
To enable SSO, click Enable and upload your metadata XML.
It should be stored like this:
# Product B > Admin Settings > SSO > SAML setup
To enable SSO, click Enable and upload your metadata XML.
section_path is not display sugar. It's a retrieval signal. Without it, identical phrases from different products end up almost identical in embedding space.
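As an illustration, here is a minimal heading-based chunker for Markdown that prefixes every chunk with its section_path. It is a sketch under simplifying assumptions: only ATX headings (`#` to `######`) are recognised, and a `#` line inside a fenced code block would be misread as a heading, so a production version needs a real Markdown parser. The function name and output shape are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown, doc_title):
    """Split on headings and prepend the full section_path to every chunk."""
    chunks = []
    path = []  # stack of (level, title) for the current heading hierarchy
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            section_path = " > ".join([doc_title] + [t for _, t in path])
            chunks.append({
                "section_path": section_path,
                "text": f"# {section_path}\n{text}",
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Two sections that both say "click Save" now embed differently, because the product and section names are part of the embedded text.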
Fixed-size – an honest baseline
When there's no structure – logs, transcripts, badly laid out PDFs – fixed‑size is still a strong baseline. A reasonable starting point:
chunk_size: 512 tokens
overlap: 64 tokens
For short support snippets, 256 is fine. For long reasoning‑heavy documents, 768–1024, if the embedder and the Retriever can handle it. The important part: overlap must be a tail of the previous chunk, not a random prefix.
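A minimal sketch of fixed-size chunking where the overlap is exactly the tail of the previous chunk. It operates on a pre-tokenised list; which tokenizer produces that list is an assumption left to the caller (ideally the same one the embedder uses).

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

With the defaults, each chunk begins with the last 64 tokens of the one before it, so a sentence cut at a boundary still appears whole in one of the two chunks.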
Semantic chunking – not a default
Semantic chunking sounds great on paper: compute sentence embeddings, find semantic breakpoints, cut where similarity drops. The problem is that end‑to‑end RAG isn't an isolated benchmark. A chunk has to be not only "semantically clean", but also useful for an answer. Semantic chunks that are too small can yield good recall@k but poor answer quality, because the LLM gets shards without context.
Semantic chunking is something you test against your own golden set, not something you turn on by default.
§ 05 Augmentation: section_path beats switching embedders
Chunking answers "where are the boundaries?". Augmentation answers "what context do I add so the chunk becomes searchable?". Very often a simple section_path augmentation buys you more than swapping the embedder.
5.1. Title / section prefix
The minimum must‑have:
# Product B > Admin > SSO > SAML setup
To enable SSO, click Enable and upload your metadata XML.
It's almost free, and it's especially important for product docs, API references, legal clauses, internal wikis, and troubleshooting guides.
5.2. Document summary vector
For each document, build a separate summary chunk:
This document explains how administrators configure SSO for Product B,
including SAML metadata upload, certificate rotation, and IdP troubleshooting.
It's there for high‑level questions: "Do we have SSO documentation?", "Where is certificate rotation described?", "Which documents cover SAML?". The summary chunk doesn't replace leaf chunks. It helps the Retriever locate the right area, then drill deeper.
5.3. Contextual retrieval
Before each chunk, an LLM generates 50–100 tokens of context explaining what this fragment is in the scope of the document. A regular chunk:
Click Enable and upload the metadata XML.
Contextual version:
This chunk is from the SAML setup section of Product B admin documentation.
It describes the step where an administrator enables SSO and uploads IdP metadata.
Click Enable and upload the metadata XML.
This is especially useful for product docs, internal knowledge bases, legal docs, medical/regulatory corpora, and long documents with repetitive phrasing.
5.4. Synthetic questions
For each chunk, generate 3–5 questions it answers:
How do I enable SSO in Product B?
Where do I upload SAML metadata XML?
How do I configure IdP metadata for Product B?
The questions are indexed as separate vectors that all point back to the source chunk. This closes the query‑document gap: users don't ask in the same words the docs were written in. The downside: the index inflates. So synthetic questions are best enabled on hot or high‑value docs, not the entire corpus.
§ 06 One embedder for everything is a mistake
The big mistake in the embeddings sections of most RAG articles is that they only talk about text embeddings. A production corpus is rarely just text. Inside one RAG you can have plain text, markdown/docs, code, tables, PDF pages, screenshots, diagrams, audio/video, tickets/chats, entities/graphs, metadata, sparse lexical signals.
"Which embedder is best?" is the wrong question. The right one: what data types do I have, and what retrieval relationships am I trying to search?
6.1. Dense text embeddings
The base case. Fits product docs, help centres, wikis, policies, blog posts, email bodies, legal text, support articles. Example models: OpenAI text-embedding-3-small / 3-large; Cohere Embed; Voyage; BGE-M3; Qwen3-Embedding; Jina Embeddings; E5. OpenAI documents 1536 dimensions for text-embedding-3-small, 3072 for 3-large, a maximum input of 8192 tokens, and the ability to shrink dimensionality via the dimensions parameter.
For MVP: text-embedding-3-small / BGE-M3 / Qwen3-Embedding-0.6B
For production: pick by golden set, languages, latency, storage, compliance
6.2. Sparse embeddings and lexical search
Dense embeddings capture meaning but struggle with exact identifiers, error codes, API names, SKUs, legal clause numbers, function names, rare acronyms, exact product names. That's why production RAG almost always ends up hybrid: dense vector search + sparse / BM25 / SPLADE / learned sparse.
Query: "ERR_AUTH_SAML_042 metadata upload failed"
Dense might find an article about SSO.
Sparse has to find the exact error code.
Pinecone documents two approaches to hybrid search: a single hybrid index or separate dense/sparse indexes. A single index is operationally simpler; separate indexes give you more flexibility but require maintaining the linkage between dense and sparse vectors.
If your corpus has identifiers, error codes, API names, legal refs – hybrid search is mandatory.
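The article does not prescribe how to merge the dense and sparse result lists. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the recommended method; the constant k=60 is the value usually quoted for RRF, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: score(id) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["sso_overview", "saml_setup", "err_042_article"]
sparse_hits = ["err_042_article", "cert_rotation"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only needs ranks, not scores, which sidesteps the problem that cosine similarity and BM25 scores live on different scales. Here the article containing the exact error code rises to the top because both retrievers returned it.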
6.3. Code embeddings
You can't index code like prose. Code has different retrieval relationships: natural language → code; code → similar code; stack trace → function; API usage → examples; bug report → relevant module; migration guide → changed symbols.
For code retrieval it helps to add symbol_path, language, repo, branch, file_path, function_name, class_name, imports, call_graph_neighbors, docstring. A code chunk:
# repo: billing-service
# file: src/invoices/parser.ts
# symbol: parseInvoicePayload
# language: TypeScript
function parseInvoicePayload(...) { ... }
In 2025 Jina released jina‑code‑embeddings, designed specifically for natural language → code retrieval, technical QA, and finding similar code snippets across languages. Voyage's voyage‑code‑3 is also positioned for code retrieval; the Pinecone model docs list configurable dimensions of 256/512/1024/2048.
For codebase RAG, don't index only raw code. Index raw code + docstring augmentation + symbol metadata.
6.4. Table embeddings: you can't just flatten a table
Tables are their own data type. The bad version:
Q1 4.2M Q2 4.7M Q3 4.1M Q4 5.0M
Better:
Table: Revenue by quarter
Currency: USD
Period: FY2025
| Quarter | Revenue |
|---|---:|
| Q1 | $4.2M |
| Q2 | $4.7M |
| Q3 | $4.1M |
| Q4 | $5.0M |
Better still: keep multiple representations – raw table / HTML / markdown for display; row‑group text for retrieval; a table summary vector; a SQL‑accessible structured table; a visual table image for complex layouts.
The TableRAG paper states the problem plainly: flattening tables and conventional chunking strategies disrupt intrinsic tabular structure, cause information loss, and degrade reasoning on multi‑hop/global queries. The authors propose a SQL‑based framework. For complex tables with merged cells, irregular alignment, and embedded images, there's a case for visual table retrieval: TaR‑ViR argues for treating tables as images, because visual representations preserve structure and content without error‑prone text conversion.
Small table: markdown serialisation + header rows in every chunk. Big table: table summary vector + row‑group chunks + SQL access. Visually complex: add an image/table‑page embedding on top.
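A minimal sketch of the "header rows in every chunk" rule for row-group chunks. The function name, the `notes` parameter, and the default group size are hypothetical; cell values are assumed not to contain the `|` character.

```python
def table_row_group_chunks(title, header, rows, rows_per_chunk=20, notes=()):
    """Serialise a table into markdown chunks that each repeat title and header."""
    head = "| " + " | ".join(header) + " |"
    separator = "|" + "|".join("---" for _ in header) + "|"
    preamble = [f"Table: {title}", *notes]
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        body = ["| " + " | ".join(str(cell) for cell in row) + " |"
                for row in group]
        chunks.append("\n".join([*preamble, head, separator, *body]))
    return chunks


revenue_chunks = table_row_group_chunks(
    "Revenue by quarter",
    ["Quarter", "Revenue"],
    [["Q1", "$4.2M"], ["Q2", "$4.7M"], ["Q3", "$4.1M"], ["Q4", "$5.0M"]],
    rows_per_chunk=2,
    notes=["Currency: USD", "Period: FY2025"],
)
```

Every chunk carries the table title, the units, and the column names, so a row group retrieved on its own still says what its numbers mean.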
6.5. Image/page embeddings
For images there are two distinct scenarios.
Scenario A: image search. The user asks for "a diagram with the OAuth token refresh flow". The system has to find the picture. This needs image/text shared embeddings: Jina Embeddings v4; Cohere Embed 4; Voyage multimodal; Gemini Embedding 2; CLIP‑like models; ColPali/ColQwen‑style page embeddings.
Scenario B: document page retrieval. The user is after a fact that lives on a PDF page: "What does the chart say about churn in Q3?". Here it's better to index not a single image but the page as a retrieval unit: page_image_embedding + page_text_embedding + layout metadata + linked chunks. Gemini Embedding 2, for example, is officially described as a model that accepts text, images, audio, video, and PDFs and produces 3072‑dimensional vectors in a unified semantic space.
If the document is visually rich, text embeddings alone are not enough.
6.6. Video/audio embeddings
Video and audio can be indexed in three ways.
Approach 1: transcript‑first. video → ASR transcript → text chunks → text embeddings. Good for lectures, meetings, podcasts, interviews. Bad for visual actions, screenshare, product demos, "find the moment where the person points at the button".
Approach 2: frame/scene embeddings. video → keyframes/scenes → image embeddings. Good for visual search, bad for the meaning of speech.
Approach 3: native video embeddings. video/audio/text/image → shared embedding space. TwelveLabs Marengo 3.0 is described as a video embedding model that analyses visuals, audio, and text, with support for fine‑grained search, motion search, audio comprehension, and composed text+image search. Gemini Embedding 2 also supports text, images, video, audio, and documents in a unified embedding space.
For meeting RAG, transcript‑first is enough. For training videos / product demos / surveillance / sports / media search, you need native video or a frame+transcript hybrid.
6.7. Graph/entity embeddings
Not everything has to be turned into a dense vector. In enterprise RAG, entity relations matter: Customer → Contract → Product → Feature → Ticket → Incident.
Options: entity metadata as filters; knowledge graph traversal before vector search; entity embeddings; hybrid graph + vector retrieval.
Query:
"Which customers are affected by the SAML bug after release 2.4?"
Pipeline:
graph lookup: release 2.4 → affected feature SAML → customers using SAML
vector search: docs/tickets/incidents within that scope
If a relationship is exact and structural, don't rely on embeddings. Use a graph/SQL/filter, and use embeddings only for fuzzy semantic match.
6.8. Multi-vector embeddings
A chunk doesn't have to have a single vector. One retrieval unit can carry several representations:
{
"retrieval_unit_id": "unit_456",
"vectors": {
"dense_text": "...",
"sparse_text": "...",
"section_summary": "...",
"hypothetical_questions": ["...", "..."],
"image": "...",
"code": "...",
"table": "..."
}
}
This is especially useful for ColBERT / late interaction; ColPali‑style page retrieval; synthetic questions; summary vectors; multimodal docs; code + docstring; table + table summary.
One retrieval unit can carry many vectors, but the user should receive one coherent evidence unit. Don't hand the LLM five independent "fragments" of what is, semantically, one object.
6.9. Decision tree: which embedding strategy to pick
Plain text docs?
→ dense text embedding + BM25
Product docs / API docs?
→ dense text + sparse + section_path prefix
→ synthetic questions for hot docs
Codebase?
→ code embedding + docstring augmentation
→ metadata: repo/file/symbol/language
Tables?
→ markdown/HTML serialization
→ row-group vectors
→ table summary vector
→ SQL access for computation
→ visual embedding for complex layouts
PDF with charts/screenshots?
→ text parser + page image embedding
→ linked text-image retrieval units
Slides / dashboards?
→ vision-first or multimodal embedding
→ caption/OCR as fallback
Video/audio?
→ transcript-first for speech-heavy
→ video embeddings for visual/action-heavy
Knowledge graph / entities?
→ graph/SQL filters first
→ vectors only for fuzzy semantic expansion
§ 07 Metadata: four roles, without which ACL is cosmetic
Metadata plays four roles.
| Role | Purpose |
|---|---|
| Filter | Hard-cut before vector search: tenant_id, scope, language, acl_principals, lifecycle_status, doc_type. |
| Boost | Softly influence ranking: source_trust, last_verified_at, liveness_score, recency. |
| Display | Show the source to the user: section_path, page_number, original_url, image_id. |
| Lineage | Migrations, audit, GDPR: parser_version, chunker_version, embedding_model, chunk_hash, subject_ids. |
A minimum production schema:
{
"chunk_id": "chk_789",
"parent_doc_id": "doc_123",
"canonical_id": "product_b_sso_setup",
"retrieval_unit_id": "unit_456",
"tenant_id": "acme",
"scope": "product_b",
"language": "en",
"doc_type": "product_docs",
"lifecycle_status": "active",
"acl_principals": ["group:support"],
"section_path": "Product B > Admin > SSO > SAML setup",
"page_number": 12,
"position": 34,
"structural_role": "body",
"chunk_type": "original",
"contains_table": false,
"contains_code": false,
"contains_image": true,
"linked_image_ids": ["img_12_01"],
"requires_multimodal": true,
"source_trust": 0.9,
"last_verified_at": "2026-04-10T12:30:00Z",
"liveness_score": 1.0,
"parser_version": "mineru-2.5-pro",
"chunker_version": "structural-v3",
"embedding_model": "text-embedding-3-small",
"embedding_dim": 1536,
"embedding_space_id": "openai-3-small-v1",
"content_hash": "sha256:...",
"chunk_hash": "sha256:...",
"subject_ids": [],
"untrusted_layer": false
}
A good pipeline: pre‑filter (tenant + scope + language + lifecycle + ACL) → vector/hybrid search → rerank → post‑rank boosts (trust + recency + liveness) → answer with citations.
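The difference between the Filter and Boost roles can be made concrete with a short sketch. The shape of the `user` dict and the two boost weights are assumptions for illustration; the chunk fields follow the schema above.

```python
def passes_prefilter(chunk, user):
    """Hard cut: a chunk that fails any check never reaches vector search."""
    return (
        chunk["tenant_id"] == user["tenant_id"]
        and chunk["scope"] in user["scopes"]
        and chunk["language"] in user["languages"]
        and chunk["lifecycle_status"] == "active"
        and bool(set(chunk["acl_principals"]) & set(user["principals"]))
    )


def apply_boosts(similarity, chunk, trust_weight=0.10, liveness_weight=0.05):
    """Soft influence: nudge the reranked score, never override the filter."""
    return similarity * (
        1.0
        + trust_weight * chunk["source_trust"]
        + liveness_weight * chunk["liveness_score"]
    )
```

The filter returns a boolean and runs before search; the boost returns a score and runs after rerank. Mixing the two, for example by treating ACL as a ranking penalty, is how restricted chunks leak into answers.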
§ 08 A vector DB isn't storage – it's a set of failure modes
A vector DB doesn't just store float arrays. It owns approximate nearest neighbour search; metadata filtering; hybrid dense+sparse search; payload indexing; multi‑tenancy; insert/update/delete; snapshots/backups; quantisation; sharding; replication; latency under filters; migration; cost at scale.
The big mistake: "Pick any vector DB, they're all the same". They are not. They differ not only in speed, but in which failure modes you inherit in production.
8.1. Exact kNN vs ANN/aKNN
Exact kNN. Compare the query vector against every vector in the index and return the exact nearest neighbours. Pros: 100% recall; simple logic; great on small collections. Cons: O(N); doesn't scale to millions/billions.
ANN / approximate kNN. Build an index structure, only search promising candidates, return approximate neighbours. Pros: low latency; works on large corpora. Cons: recall < 100%; parameters to tune; filters can break quality; index build costs time and memory; silent degradation is possible. The Milvus docs frame the tradeoff well: an index speeds up search, but it incurs preprocessing time, space, RAM during search, and can lower recall.
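Exact kNN is also what you measure an ANN index against. A minimal pure-Python sketch with cosine similarity follows; it is meant for small samples and ground-truth generation, and the dict-of-lists storage is an assumption made for readability.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_knn(query, vectors, k):
    """Brute force: score every vector, O(N) per query, 100% recall."""
    scored = [(cosine(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [vec_id for _, vec_id in scored[:k]]
```

Running the same queries through the ANN index and through `exact_knn` on a sample of the corpus gives a direct estimate of how much recall the index parameters are costing you.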
8.2. The main ANN index families
Flat / brute force. Scan everything. Use for small collections (e.g. fewer than 100,000 vectors), dev/test, per‑tenant tiny indexes, when you need exact recall, when inserts are very frequent and latency isn't critical. Weaviate explicitly recommends flat where objects per index are low.
HNSW. A graph‑based index: a navigable small‑world graph, searched via shortcuts. Use for interactive RAG, low latency, high recall, incremental inserts, a hot index, millions to tens of millions of vectors. Cons: memory overhead; build time; harder under strict filters; quantisation is often necessary at scale. Base parameters:
M – how many neighbors a node stores; higher → recall ↑, memory ↑
ef_construct – search width during build; higher → better graph, longer build
ef_search – search width per query; higher → recall ↑, latency ↑
A practical start:
M = 16 or 32
ef_construct = 100–300
ef_search = 64–256
Don't copy these values blindly. Tune them against recall@k and p95 latency.
IVF / IVF_FLAT. Cluster vectors into centroids, search only the nearest clusters at query time. Use for batch‑built indexes, lower RAM than HNSW, corpora that don't update every second, periodic rebuild/recluster, cold/warm search. Parameters: nlist (number of clusters), nprobe (clusters searched per query).
PQ / SQ / binary quantisation. Compress vectors. SQ8 – scalar 8‑bit; PQ – product quantisation, splits the vector into subvectors; binary – aggressive compression, fast, higher quality risk. Use when storage/RAM are expensive, 10M+ vectors, cold index, acceptable small recall loss. Weaviate's best practices recommend vector quantisation to cut memory footprint on large datasets; for HNSW, rotational quantisation is a reasonable starting point.
DiskANN / Vamana. A disk‑based graph index. Some of it lives on SSD, not all of it in RAM. Use for hundreds of millions to billions of vectors, a RAM‑bound budget, cold/archival retrieval, SSD cheaper than RAM. The Milvus docs describe DISKANN as a disk‑based approach for large‑scale scenarios; DISKANN combines a Vamana graph with PQ.
8.3. ANN decision tree
How many vectors?
<100K:
Flat or pgvector HNSW – don't overcomplicate
100K–10M:
HNSW default
hybrid search if there are exact terms
payload indexes for filters
10M–100M:
HNSW + quantization
or IVF/SQ/PQ for cold
shard by tenant/scope
separate hot/cold layout
100M+:
DiskANN / Vamana / tiered storage
Milvus/Zilliz, Vespa, Turbopuffer, LanceDB-like lakehouse
benchmark on your own workload
Frequent updates? HNSW beats IVF
Batch-built archival? IVF/PQ/DiskANN
Many strict filters? filter-aware ANN, payload indexes
Need SQL + transactions? pgvector
Need zero-ops managed? Pinecone / Zilliz / Weaviate Cloud / Qdrant Cloud
Cheapest cold storage? object-storage-backed (Turbopuffer)
8.4. Filters break naive ANN
In RAG, filters are nearly always present: tenant_id, scope, language, ACL, lifecycle_status, doc_type, date_range. The naive approach: HNSW finds nearest neighbours, then the application filters by tenant/ACL. The problem: the top-50 neighbours may all get filtered out, so the user gets an empty answer even though the relevant chunk was sitting at rank 500. You need either pre-filter before ANN, or filter-aware ANN, or overfetch + rerank + fallback.
The Qdrant docs explain the issue plainly: under strict filters the HNSW graph can fall apart; Qdrant fixes this with extra edges based on indexed payload values, and payload indexes are recommended right after collection creation, before ingest, so the graph is optimised for filtered search.
Metadata fields used in filters must be indexed in the vector DB. Don't dump everything into a JSON payload "for tidiness".
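The rank-500 failure described in this section is easy to reproduce with a toy example. Below, a plain sort stands in for the ANN search and precomputed `score` values stand in for similarity; both are simplifications for illustration.

```python
def top_k(candidates, k):
    """Stand-in for a vector search: highest score first."""
    return sorted(candidates, key=lambda c: -c["score"])[:k]


def post_filter_search(chunks, tenant_id, k):
    """Naive: take the global top-k, then filter in the application."""
    return [c for c in top_k(chunks, k) if c["tenant_id"] == tenant_id]


def pre_filter_search(chunks, tenant_id, k):
    """Filter first, then rank only what the caller is allowed to see."""
    return top_k([c for c in chunks if c["tenant_id"] == tenant_id], k)


# 499 chunks from another tenant outrank the single relevant "acme" chunk.
corpus = [
    {"id": f"other_{i}", "tenant_id": "other", "score": 0.9 - i * 0.0001}
    for i in range(499)
] + [{"id": "acme_1", "tenant_id": "acme", "score": 0.5}]

naive = post_filter_search(corpus, "acme", k=50)
filtered = pre_filter_search(corpus, "acme", k=50)
```

`naive` comes back empty while `filtered` contains the chunk the user needed. Raising k only moves the cliff: the fix is to make the filter part of the search.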
8.5. Hot/cold architecture
Not all of the corpus matters equally. Typically the top 10% of docs serves 80% of queries, and the long tail gets almost no traffic. So:
hot_index:
HNSW
high recall
low latency
full metadata filtering
premium embeddings / augmentation
cold_index:
compressed
IVF/PQ/DiskANN/object-storage-backed
lower cost
slower acceptable latency
Retrieval plan:
1. search hot_index
2. if confidence low → search cold_index
3. merge + rerank
This way you don't pay HNSW memory cost for documents almost no one queries.
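The retrieval plan above, sketched with injected search functions. The `search_hot` / `search_cold` callables, the hit shape, and the 0.75 confidence threshold are hypothetical; the sketch also assumes hits arrive sorted by score and that scores from both tiers are comparable, which in practice is what the rerank step is for.

```python
def tiered_search(query, search_hot, search_cold, k=5, min_score=0.75):
    """Search the hot tier first; fall back to the cold tier on low confidence."""
    hits = search_hot(query, k)
    if not hits or hits[0]["score"] < min_score:
        hits = hits + search_cold(query, k)
    # Merge: keep the best-scoring hit per chunk, then rank again.
    best = {}
    for hit in hits:
        current = best.get(hit["chunk_id"])
        if current is None or hit["score"] > current["score"]:
            best[hit["chunk_id"]] = hit
    return sorted(best.values(), key=lambda h: -h["score"])[:k]
```

The cold tier is only touched when the hot tier cannot answer confidently, so its higher latency is paid on a minority of queries.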
8.6. Multi-index architecture
One corpus can have several indexes: dense_text_index, sparse_text_index, image_page_index, code_index, table_index, summary_index, question_index.
Important: this must not fragment the evidence. Every vector should reference a single retrieval_unit_id. The Retriever may search across indexes, but the user receives one coherent context unit.
§ 09 There is no best vector DB without a workload
There's no best DB without a workload. Recent benchmarking papers and ANN reviews agree on one thing: performance depends on dataset, filters, vector dimension, update rate, hardware, and target recall. The filtered ANN benchmark puts it neatly: algorithms fall into tree/hash/graph/quantization categories, and no algorithm dominates across datasets.
9.1. When to pick which DB
| DB / engine | When to use it | Strong point |
|---|---|---|
| pgvector | you already have Postgres, low/mid scale, you need joins/transactions | minimum new infrastructure |
| Qdrant | self‑host/managed, strong filters, payload indexing, Rust | a good default for production RAG |
| Milvus / Zilliz | big scale, many index types, GPU/index tuning, billion‑scale | scale and a rich choice of indexes |
| Weaviate | AI‑native DB, hybrid/BM25, modules, schema, managed/self‑host | developer experience + hybrid |
| Pinecone | zero‑ops managed, serverless, multi‑tenant SaaS, predictable API | managed simplicity |
| Vespa | search platform: ranking, hybrid, structured + vector, large‑scale retrieval | search/ranking engine, not just a vector store |
| Elasticsearch / OpenSearch | you already have a search stack, you need hybrid text+vector | easy to add vector to existing search |
| LanceDB | data lake / local / embedded / multimodal datasets | lakehouse‑style vector storage |
| Chroma | prototyping, local dev, simple RAG demos | fast to get started |
| Turbopuffer | cold/large scale, object‑storage‑backed, cost‑sensitive, hybrid search | cheap serverless vector/full‑text on object storage |
Turbopuffer describes itself as a serverless vector and full‑text search database, built from first principles on object storage. Pinecone pricing shows a managed/serverless model with dense and sparse indexes, storage pricing, read/write units, and namespaces. Qdrant Cloud pricing is based on resource usage rather than per‑query units; the free tier includes 1GB RAM and 4GB disk. Zilliz Cloud is managed Milvus, with serverless/dedicated/BYOC options and a 5GB free tier.
9.2. Which benchmarks to look at
ANN‑Benchmarks. A benchmark environment for approximate nearest neighbour algorithms with current results across datasets and algorithms. Useful to compare HNSW vs IVF vs ScaNN vs Annoy vs Faiss variants, the recall/QPS tradeoff, and dataset sensitivity. Not useful when you need to choose a managed DB with filters, ACL, hybrid search, and writes.
VectorDBBench / VDBBench. The best practical tool for a POC. A benchmark for mainstream vector databases and cloud services, with a UI, comparative reports, and cost‑effectiveness reports. Use it like this: take your own embeddings, your own filters, your own top_k, your own concurrency, and measure p50/p95/p99 latency, recall@k, indexing throughput, memory/storage, and cost.
Vendor benchmarks. Qdrant's benchmarks, for example, openly state that they use the same machines, affordable hardware, and open‑sourced benchmarks – but it's still a vendor benchmark, so read it as a signal, not a verdict.
Academic / independent papers. Useful for architectural tradeoffs. The storage‑based ANN paper covering Milvus/Qdrant/Weaviate/LanceDB shows that index method strongly affects throughput; in their setup, DiskANN reached 0.93–0.98 accuracy versus 0.90–0.91 for HNSW/IVF.
9.3. A minimum benchmark plan
Dataset: 100K / 1M / 10M vectors from your real corpus
Queries: 500–5000 real or golden queries
Ground truth: human-labeled relevant chunks
or exact kNN on a smaller subset
or a QA golden set
Scenarios: no filter / tenant / ACL / language / lifecycle / combined strict
Metrics: recall@5, recall@10, MRR, nDCG
p50 / p95 / p99 latency
indexing throughput
update / delete latency
memory, storage, monthly cost
Workloads: read-heavy / write-heavy / mixed / nightly backfill / hot/cold
A benchmark without filters is almost useless for enterprise RAG.
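Two of the metrics in the plan, recall@k and MRR, fit in a few lines. This sketch assumes each benchmark run is a dict with a ranked `retrieved` list and a `relevant` set of chunk ids; that shape is an illustration, not a standard format.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Average recall@k and MRR over a list of golden-query runs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r["retrieved"], r["relevant"], k)
                           for r in runs) / n,
        "mrr": sum(reciprocal_rank(r["retrieved"], r["relevant"])
                   for r in runs) / n,
    }
```

Run `evaluate` once per scenario (no filter, tenant, ACL, combined strict) and compare the rows: a drop that only appears under filters is the signal this section is about.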
§ 10 The Indexer must be idempotent
10.1. Collection design
Bad design: one giant collection for everything; metadata as unindexed JSON; ACL post‑filter in app; all embeddings mixed across versions. Good design: collections/namespaces by tenant boundary, embedding_space_id, modality, hot/cold tier. Example:
acme_text_openai3small_hot
acme_text_openai3small_cold
acme_image_jina4_pages
acme_code_voyagecode3
But don't spin up a separate collection for every tiny document. Too many tiny collections become an operational burden of their own.
10.2. Payload indexes before ingestion
If the DB supports payload/scalar indexes, create them before ingestion for fields that actually participate in filters: tenant_id, scope, language, acl_hash, lifecycle_status, doc_type, section_path, created_at / last_modified.
In Qdrant this matters especially, because filterable HNSW extra edges can be built on top of indexed payload values; the documentation recommends creating payload indices immediately after collection creation, before ingesting data.
10.3. Sharding strategy
Sharding axes: tenant_id / scope / embedding_space_id / time / hot-cold / modality
B2B SaaS: shard/namespace by tenant_id
Public docs: shard by product/version/language
Support tickets: shard by tenant_id + time bucket
Legal archive: shard by client/matter_id
Shard by fields that are almost always present in filters. Do not shard by fields users rarely filter by.
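As an illustration of the "support tickets" row, here is a sketch of a tenant + time-bucket shard key. The monthly bucket size and the `tenant:YYYY-MM` key format are assumptions made for the example; the useful property is that a query filtered by tenant and date range touches only a handful of shards.

```python
from datetime import datetime, timezone


def shard_key(tenant_id: str, created_at: datetime) -> str:
    """Shard key for one document; expects a timezone-aware datetime."""
    ts = created_at.astimezone(timezone.utc)
    return f"{tenant_id}:{ts.year:04d}-{ts.month:02d}"


def shards_for_range(tenant_id: str, start: datetime, end: datetime) -> list[str]:
    """All shard keys a tenant-scoped, time-ranged query has to visit."""
    start, end = start.astimezone(timezone.utc), end.astimezone(timezone.utc)
    year, month = start.year, start.month
    keys = []
    while (year, month) <= (end.year, end.month):
        keys.append(f"{tenant_id}:{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys
```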
10.4. Upserts and idempotency
The Indexer must be idempotent: same doc_id + same content_hash + same chunker_version → same chunk_ids. Otherwise every reindex creates duplicates, and retrieval starts returning five copies of the same boilerplate.
Write protocol:
1. parse document
2. create chunks
3. compute chunk_hash
4. compare with existing chunk_hashes
5. write only changed chunks
6. tombstone deleted chunks
7. update doc_index_state
8. emit audit event
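Steps 3–6 of the protocol can be sketched as follows. The assumption here is that a chunk id is derived from `doc_id`, `chunker_version`, the chunk's content hash, and an occurrence counter for repeated chunks, so "compare with existing chunk_hashes" reduces to a set difference on ids. The function names are illustrative.

```python
import hashlib
from collections import Counter


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_ids(doc_id: str, chunker_version: str, chunks: list[str]) -> dict[str, str]:
    """Map deterministic chunk_id -> chunk text for one document version."""
    seen = Counter()
    out = {}
    for text in chunks:
        h = chunk_hash(text)
        occurrence = seen[h]      # disambiguates identical chunks in one doc
        seen[h] += 1
        key = f"{doc_id}|{chunker_version}|{h}|{occurrence}"
        out[hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]] = text
    return out


def plan_writes(new: dict[str, str], existing_ids: set[str]):
    """Return (chunks to upsert, chunk ids to tombstone)."""
    to_write = {cid: text for cid, text in new.items() if cid not in existing_ids}
    to_tombstone = existing_ids - set(new)
    return to_write, to_tombstone
```

Re-running the same document through the same chunker produces the same ids, so the plan is empty and nothing is re-embedded. Changing `chunker_version` intentionally changes every id, which is what you want when the chunk boundaries themselves have moved.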
10.5. Deletion
Deletion is not "delete the file from S3". You have to delete the document record, the chunks, dense vectors, sparse vectors, image vectors, summary vectors, synthetic question vectors, cache entries, and retrieval logs where policy requires it.
For GDPR / right‑to‑erasure you need a subject_id → chunk_ids → vector_ids mapping. Without it, "remove Smith from the index" becomes an incident.
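A minimal in-memory sketch of that mapping, assuming the Indexer registers every chunk that mentions a data subject at write time. The `ErasureIndex` class and its method names are hypothetical; in production this would be a table next to the document registry.

```python
from dataclasses import dataclass, field


@dataclass
class ErasureIndex:
    chunks_by_subject: dict[str, set[str]] = field(default_factory=dict)
    vectors_by_chunk: dict[str, set[str]] = field(default_factory=dict)

    def register(self, subject_id: str, chunk_id: str, vector_ids: list[str]) -> None:
        """Record that a chunk mentions a subject and which vectors it produced."""
        self.chunks_by_subject.setdefault(subject_id, set()).add(chunk_id)
        self.vectors_by_chunk.setdefault(chunk_id, set()).update(vector_ids)

    def erasure_plan(self, subject_id: str) -> dict[str, list[str]]:
        """Everything that must be deleted to forget one subject."""
        chunk_ids = self.chunks_by_subject.get(subject_id, set())
        vector_ids = set().union(*(self.vectors_by_chunk.get(c, set()) for c in chunk_ids))
        return {"chunk_ids": sorted(chunk_ids), "vector_ids": sorted(vector_ids)}
```

Note that a chunk can mention more than one subject. Whether you delete the whole chunk or re-chunk with the subject redacted is a policy decision this sketch does not make.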
§ 11Embeddings are not hashes: a vector store with PII is a PII store
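One of the most dangerous misconceptions: "a vector is just numbers, you can't recover PII from it". That's not true. Published embedding-inversion attacks have reconstructed short inputs largely verbatim from their vectors, and even without inversion, anyone who can run similarity queries can probe whether a specific name or phrase is in the store.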
A vector store with PII is a PII store.
Which means: encrypt at rest; per‑tenant isolation; KMS keys; strict ACL pre‑filter; audit logs; no public read‑only vector dumps; PII redaction before embedding where possible; a subject_id → chunk_id → vector_id mapping for deletion; monitoring for unusual similarity‑search patterns.
§ 12Parser security: hidden text is a prompt injection surface
A parser can bring more than visible text into the index. It can bring in a hidden PDF layer, white‑on‑white text, alt text, speaker notes, XMP metadata, comments, tracked changes, invisible Unicode, zero‑width characters.
Parser output should be typed. Not everything the parser saw deserves to become first‑class retrieval text.
A good schema:
{
"visible_text": "...",
"alt_text": "...",
"speaker_notes": "...",
"metadata_text": "...",
"hidden_text": "...",
"untrusted_layers": ["alt_text", "speaker_notes", "hidden_text"]
}
From there your options: don't index hidden text; index speaker notes only for internal roles; index alt text with untrusted_layer=true; strip invisible Unicode; flag prompt‑like strings with a risk marker; have retrieval downweight untrusted chunks.
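Those options can be expressed as a small index-time policy over the typed parser output. This is a sketch under stated assumptions: the `audience` parameter ("internal" or anything else), the `risk` flag, and the two regex patterns are illustrative, and a real deployment would maintain a much broader pattern list.

```python
import re

# Illustrative patterns only, not a complete prompt-injection detector.
PROMPT_LIKE = re.compile(
    r"ignore (all |any )?(previous|prior|above) instructions|system prompt",
    re.IGNORECASE,
)

LAYERS = ("visible_text", "alt_text", "speaker_notes", "metadata_text", "hidden_text")


def indexable_layers(parsed: dict, audience: str) -> list[dict]:
    """Turn typed parser output into index records, applying the layer policy."""
    untrusted = set(parsed.get("untrusted_layers", []))
    records = []
    for layer in LAYERS:
        text = parsed.get(layer) or ""
        if not text.strip():
            continue
        if layer == "hidden_text":
            continue                      # hidden text is never indexed
        if layer == "speaker_notes" and audience != "internal":
            continue                      # speaker notes only for internal roles
        records.append({
            "layer": layer,
            "text": text,
            "untrusted_layer": layer in untrusted,    # retrieval can downweight
            "risk": bool(PROMPT_LIKE.search(text)),   # prompt-like string marker
        })
    return records
```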
Normalisation is mandatory: NFKC; ftfy / mojibake repair; strip zero‑width chars; remove soft hyphen; collapse whitespace; normalise repeated punctuation. Lemmatisation before dense embeddings is usually not needed. For multilingual subword models it often adds noise. Lemmatisation can be useful in the BM25/sparse branch, but not as preprocessing for a dense embedder.
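The normalisation steps fit in a few lines of standard-library Python. This sketch skips the ftfy / mojibake repair step (it needs the third-party `ftfy` package) and interprets "normalise repeated punctuation" as collapsing runs of `!` and `?`, which is one possible reading. It is meant for per-chunk text, since it flattens all whitespace including newlines.

```python
import re
import unicodedata

# Zero-width space, ZWNJ, ZWJ, word joiner, BOM, and the soft hyphen.
# Caveat: ZWNJ / ZWJ carry meaning in some scripts and emoji sequences,
# so narrow this set if your corpus depends on them.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))


def normalise(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # ligatures, full-width forms, NBSP
    text = text.translate(INVISIBLE)             # strip zero-width chars and soft hyphens
    text = re.sub(r"([!?])\1+", r"\1", text)     # "!!!" -> "!"
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text
```

Run this before hashing chunks as well as before embedding. Otherwise two visually identical chunks can produce different hashes and defeat deduplication.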
§ 13Embedding migrations: silent failure
The danger of an embedding migration is that the system may not fall over. It just starts retrieving worse. A classic scenario:
old_index: ada-002 embeddings
new ingestion: text-embedding-3-large
query embedder: text-embedding-3-large
If old and new vectors share one logical space without explicit routing, retrieval becomes a mix of incompatible representations. Even if the dimensionality matches, the embedding spaces may not. Just because the vector DB accepts the write doesn't mean searching across them is meaningful.
Migration done right: blue/green
Phase 1 – shadow write:
- new ingestion writes to old_index and new_index
- production retrieval stays on old_index
Phase 2 – backfill:
- the historical corpus is re-embedded into new_index
- live queries are shadow-evaluated against both indexes
Phase 3 – compare:
- recall@k / MRR / answer correctness
- citation correctness, latency, cost
Phase 4 – flip:
- a feature flag switches retrieval to new_index
- old_index is kept as a rollback target for another 7–14 days
Phase 5 – cleanup:
- old index is deleted after the observability window
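The phases boil down to two routing questions (where do writes go, where do reads go) plus a gate before the flip. A sketch, with assumptions called out: `old_index` keeps receiving writes until cleanup so that rollback stays meaningful, and the `ready_to_flip` thresholds and metric keys (`recall@10`, `p95_ms`) are placeholders you would set from your own evaluation.

```python
from enum import Enum


class Phase(Enum):
    SHADOW_WRITE = 1
    BACKFILL = 2
    COMPARE = 3
    FLIP = 4
    CLEANUP = 5


OLD, NEW = "old_index", "new_index"


def write_targets(phase: Phase) -> list[str]:
    # Assumption: dual-write continues through FLIP so old_index stays a valid rollback.
    return [NEW] if phase is Phase.CLEANUP else [OLD, NEW]


def read_target(phase: Phase) -> str:
    """Index that serves production retrieval."""
    return NEW if phase in (Phase.FLIP, Phase.CLEANUP) else OLD


def shadow_targets(phase: Phase) -> list[str]:
    """Indexes that live queries are shadow-evaluated against."""
    return [OLD, NEW] if phase in (Phase.BACKFILL, Phase.COMPARE) else []


def ready_to_flip(old: dict, new: dict,
                  max_recall_drop: float = 0.01,
                  max_latency_ratio: float = 1.2) -> bool:
    """Gate for Phase 4; thresholds are illustrative."""
    return (new["recall@10"] >= old["recall@10"] - max_recall_drop
            and new["p95_ms"] <= old["p95_ms"] * max_latency_ratio)
```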
Mandatory fields:
{
"embedding_model": "text-embedding-3-small",
"embedding_dim": 1536,
"embedding_space_id": "openai-3-small-2024-01",
"embedded_at": "2026-04-28T10:00:00Z"
}
Without embedding_model_version and embedding_space_id, you're not migrating. You're guessing.
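Once those fields are stored on every collection, the retriever can refuse to search an index with the wrong query embedder instead of silently returning noise. A sketch of such a guard; the `EmbeddingSpec` type, the exception name, and the `"openai-ada-002"` space id are illustrative.

```python
from dataclasses import dataclass


class EmbeddingSpaceMismatch(Exception):
    pass


@dataclass(frozen=True)
class EmbeddingSpec:
    embedding_model: str
    embedding_dim: int
    embedding_space_id: str


def assert_compatible(query: EmbeddingSpec, index: EmbeddingSpec) -> None:
    """Raise before searching if the query embedder and the index disagree."""
    if query.embedding_space_id != index.embedding_space_id:
        raise EmbeddingSpaceMismatch(
            f"query space {query.embedding_space_id!r} != "
            f"index space {index.embedding_space_id!r}"
        )
    if query.embedding_dim != index.embedding_dim:
        raise EmbeddingSpaceMismatch(
            f"dim {query.embedding_dim} != {index.embedding_dim}"
        )
```

The space id is checked first on purpose: two models can share a dimensionality and still live in unrelated spaces, so a dimension check alone would let the bad query through.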
§ 14Cost engineering: you optimised the wrong line
RAG ingestion cost has several parts:
cost =
parsing
+ chunking
+ LLM augmentation
+ embeddings
+ vector DB writes
+ storage
+ reindexing
+ evaluation
+ failed jobs / retries
Teams often optimise only embeddings, when the real money is in VLM parsing or vector DB storage.
Hot-doc augmentation
Don't augment everything. The best production pattern:
90% of documents:
  structural chunking + title prefix
top 10% hot documents:
  contextual retrieval
  synthetic questions
  image captions
  summary vectors
Auto-promote:
  document lands in the top traffic bucket
  → a nightly job adds augmentation
  → the index is updated
LLM augmentation is justified when expected_queries × value_of_accuracy_gain > augmentation_cost + reindex_cost. If a document lives for a year and gets thousands of queries, contextual augmentation almost always pays for itself. If it's a Slack thread that gets read twice and forgotten, the LLM pass is wasted money.
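The inequality and the "top traffic bucket" rule are both easy to turn into a nightly job. A sketch, assuming all four inputs are in the same currency unit and that traffic is a simple per-document query count; `hot_documents` and its `top_percent` parameter are illustrative names.

```python
def augmentation_pays_off(expected_queries: float,
                          value_of_accuracy_gain: float,
                          augmentation_cost: float,
                          reindex_cost: float) -> bool:
    """The break-even rule: expected benefit must exceed the one-off cost."""
    return expected_queries * value_of_accuracy_gain > augmentation_cost + reindex_cost


def hot_documents(query_counts: dict[str, int], top_percent: int = 10) -> list[str]:
    """Doc ids in the top traffic bucket, most-queried first."""
    ranked = sorted(query_counts, key=lambda doc_id: (-query_counts[doc_id], doc_id))
    n = -(-len(ranked) * top_percent // 100)   # ceiling division, integer-only
    return ranked[:n]
```

The hard part is not the code but `value_of_accuracy_gain`: it has to come from your own evaluation, for example the measured lift in answer correctness times what a correct answer is worth.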
Vector DB cost rule
vector_db_cost =
raw vector storage
+ index overhead
+ payload storage
+ metadata indexes
+ replicas
+ backups
+ read operations
+ write operations
+ reindex jobs
Don't count only raw vectors. Example: 10M vectors × 1536 dims × 4 bytes (float32) ≈ 61 GB of raw vectors. Production storage ends up much larger once you add the HNSW graph, payload, indexes, replication, snapshots, and metadata.
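The arithmetic is worth scripting so you can rerun it per embedding model. The first function reproduces the raw-vector figure exactly. The second is a rough order-of-magnitude model only: the HNSW link estimate (about 2 × M four-byte neighbour ids per vector at the base layer), the payload size, and the replica count are all assumptions to replace with numbers from your own cluster.

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Storage for the vectors alone (float32 by default)."""
    return n_vectors * dims * bytes_per_value


def estimated_storage_bytes(n_vectors: int, dims: int,
                            hnsw_m: int = 16,                    # assumed graph degree
                            payload_bytes_per_vector: int = 1024,  # assumed payload size
                            replicas: int = 2) -> int:           # assumed replication
    raw = raw_vector_bytes(n_vectors, dims)
    graph = n_vectors * hnsw_m * 2 * 4       # base-layer links only, rough estimate
    payload = n_vectors * payload_bytes_per_vector
    return (raw + graph + payload) * replicas


raw = raw_vector_bytes(10_000_000, 1536)
print(f"raw: {raw / 1e9:.2f} GB")            # raw: 61.44 GB (about 57.2 GiB)
```

Even this deliberately small model more than doubles the raw figure, and it still leaves out metadata indexes, snapshots, and backups.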
A good Indexer doesn't ask "Which embedding model should I take?". It asks "Which retrieval units exist in my product, which data types live inside them, which vectors and metadata do I need, and which vector DB can search this quickly, safely, and cheaply?".
That's where RAG stops being a demo and becomes a system.