Part 1 · RAG infra deep rewrite

Data Registrar.

Apr 28, 2026

§ 01 Hook: Copilot doesn't break your ACLs – it shows they were already broken

Copilot isn't violating your permissions. It does something more uncomfortable: it shows that humans violated those permissions a long time ago. Microsoft 365 Copilot answers based on data the user can already reach through Microsoft Graph and Microsoft 365 permissions. So if your SharePoint has been quietly holding Everyone except external users for years, Copilot isn't "breaking into" the board deck – it just finds, faster than a human would, what was technically already accessible.

Copilot doesn't break your ACLs. It does worse: it instantly shows that your ACLs have been broken for a long time.

Per the Concentric AI Data Risk Report 2025, 16% of business‑critical Microsoft 365 data is overshared, and the average organisation has roughly 802,000 files at risk. One publicly described vendor case walks through a Copilot rollout to tens of thousands of employees, after which an ordinary sales manager pulled a summary of M&A documents that had been sitting in SharePoint with the wrong permissions. Even if you read that as an anonymised vendor anecdote, the class of problem is real: an LLM doesn't create oversharing – it makes existing oversharing instantly exploitable. Knostic separately documents the "lag between permission changes and Copilot sync" class of failure, which only makes things worse.

And this isn't a model problem. It's a problem with the layer that typical RAG architecture diagrams squash into a single arrow labelled "ingestion". That arrow needs to be cut into at least three services: Registrar → Indexer → Retriever. Today is about the Registrar.

§ 02 Ingestion is not one arrow on a diagram

The Registrar answers exactly one question, and it's a treacherous one: "which documents exist, in which versions, with which permissions, and which of them are you allowed to index right now?". Not "how to chunk it" or "what to embed it with" – that's the Indexer's job. And not "how to search it" – that's the Retriever. The Registrar owns the registry, and only the registry.

The Registrar is a change feed + permission gateway. It is not a content analyser.

This framing has to be stated plainly, because people break it on a regular basis. The Registrar says exactly one thing: "here is document X at version v17, this is its content_hash, this is its ACL, this is its source, we just observed/updated/removed it". And it emits events whenever any of that changes. That's the whole job.

The Registrar does not say: "this document duplicates that one", "this section is dead", "this paragraph needs an image", "this fact contradicts that one". Any conclusion of that kind requires content analysis – and content analysis is the Indexer (see Part 2). If you try to make the Registrar do it "while it's there anyway", you end up with a service that conflates change‑feed responsibilities with content processing, and it quietly breaks both contracts.

Formally, the Registrar owns four things:

What the Registrar does not do: parse PDFs, compute embeddings, chunk, run Trafilatura/Talon/KenLM filters, compute MinHash signatures over parsed body text, or build HNSW indexes. That's all the Indexer's territory. The Registrar runs only a thin quality pass: MIME/size detection, language ID (lid.176), content_hash, oversized/quarantine status, and input‑side dedup (exact on content_hash, near‑duplicate on a document‑level MinHash signature). Heavy quality filters (Trafilatura, Talon, KenLM, Gopher heuristics) and semantic dedup (SemDeDup, D4) belong squarely to the Indexer.

Picture this: there are 10 million tokens in your index – a normal number for a mid‑sized company. Your model is answering users based on those documents.

What if a document just changed? What if it was deleted? What if a new one was added – and the change is critical? What if its permissions changed? What if different parts of the same tree have different permissions?

Each question on its own is simple, right? But there are a lot of them. And the real question is: how do you stay confident that you've closed them, and keep closing them, every release?

Industry guides love to name data cleaning and ingestion as one of the top reasons RAG projects fail, but for a production architect the percentage matters less than the concrete failure modes: a document is missing, a version is stale, an ACL is rotten, a tombstone never made it to the index. We haven't even gotten to chunking and retrieval yet, and we already have enough risk to retire the "ingestion is one arrow" diagram.

§ 03 The Registrar contract: state machine and events

For the Registrar to be an engineering contract rather than "a folder with metadata", you have to describe it through the document lifecycle.

discovered → fetched → active
     → acl_only_changed   ──► metadata rewrite, no re-embedding
     → content_changed    ──► Indexer reprocess
     → quarantine         ──► policy violation / instruction-like payload
     → deprecated         ──► admin flag: no new versions, facts still valid
     → sunset             ──► sunset-date in metadata: document retires in N
     → superseded_by      ──► redirect metadata: replaced by another document
     → legacy_archive     ──► archive flag: kept for historical reference
     → tombstone_pending  ──► confirmed deletion at the source
     → purged             ──► hard delete + audit
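
The lifecycle above can be sketched as an explicit transition table. This is a minimal illustration, not the article's implementation: the `Status` enum, the `ALLOWED` table and `transition` are hypothetical names, only the edges drawn in the diagram are encoded, and two choices are assumptions of the sketch: `acl_only_changed` and `content_changed` are treated as events on an active document rather than resting states, and `purged` is reachable only through `tombstone_pending`.

```python
from enum import Enum


class Status(str, Enum):
    DISCOVERED = "discovered"
    FETCHED = "fetched"
    ACTIVE = "active"
    QUARANTINE = "quarantine"
    DEPRECATED = "deprecated"
    SUNSET = "sunset"
    SUPERSEDED_BY = "superseded_by"
    LEGACY_ARCHIVE = "legacy_archive"
    TOMBSTONE_PENDING = "tombstone_pending"
    PURGED = "purged"


# Only the edges shown in the diagram. Anything else is rejected,
# so a new edge has to be added here deliberately (assumption of this sketch).
ALLOWED = {
    Status.DISCOVERED: {Status.FETCHED},
    Status.FETCHED: {Status.ACTIVE},
    Status.ACTIVE: {
        Status.QUARANTINE,
        Status.DEPRECATED,
        Status.SUNSET,
        Status.SUPERSEDED_BY,
        Status.LEGACY_ARCHIVE,
        Status.TOMBSTONE_PENDING,
    },
    Status.TOMBSTONE_PENDING: {Status.PURGED},
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the edge is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data means the Registrar never "decides" a status: it only checks that a decision arriving from outside is a legal move.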

Transition triggers:

The lifecycle spectrum – an explicit decision, not a Registrar inference

The state machine above is deliberately broader than the simple version: active → tombstone_pending → purged is a binary picture, and real product corpora don't live inside it. Between "current" and "deleted" there are usually four more statuses, and each of them is an explicit decision arriving from the source or from an admin tool. The Registrar doesn't infer them; it carries them as fact.

Status | Where it comes from and what it means
active | the default – a current document.
deprecated | admin flag in the source system: no new versions are being added, but the facts are still valid.
sunset | a product decision, sunset‑date in metadata: the document or section will be retired in N.
superseded_by | redirect metadata from the source: replaced by another document, with a canonical link.
legacy_archive | archive flag in the source: for historical reference, not for default retrieval.
tombstone_pending / purged | as in the state machine above – confirmed absence from the source.

The Registrar stores four fields alongside lifecycle_status: lifecycle_decision_by, lifecycle_decision_at, lifecycle_reason, superseded_by_doc_id (where applicable). All of it arrived from outside – the Registrar just recorded it and propagated it downstream.

What does the Registrar do with these statuses at retrieval time? Nothing. Decisions like "show deprecated with a warning", "don't show legacy_archive without opt‑in", "penalise sunset in ranking" – those all belong to the Retriever (see Part 3).

Event contract

Downstream services (Indexer, Retriever, audit) subscribe to an event bus. The minimum event set:

{
  "event_type": "DocumentContentChanged",
  "doc_id": "uuid",
  "source": "sharepoint",
  "source_id": "driveItem:abc",
  "version": 18,
  "content_hash": "sha256:...",
  "acl_hash": "sha256:...",
  "occurred_at": "2026-04-21T14:32:00Z",
  "fetched_at": "2026-04-21T14:34:12Z"
}

Event categories:

This is the engineering contract: everything the Registrar promises downstream is expressed through these nine events. If it reliably emits this set, it has done its job. Everything else is derived – semantics, conflict detection, section liveness, multimodal binding – and lives in the Indexer and the Retriever.

§ 04 Document registry: schema and idempotency

The document registry is a separate store (usually Postgres + a Kafka topic as audit log) that keeps, for every document:

doc_id                  UUID
source                  enum (confluence|drive|sharepoint|s3|...)
source_id               text
content_hash            sha256
normalized_hash         sha256
version                 int
modified_at             timestamptz
fetched_at              timestamptz
acl_hash                sha256
principals_allow        text[]
principals_deny         text[]
lifecycle_status        enum (active|deprecated|sunset|superseded_by|
                              legacy_archive|stale|quarantine|
                              tombstone_pending|purged)
lifecycle_decision_by   text         -- who made the call (user/system)
lifecycle_decision_at   timestamptz  -- when
lifecycle_reason        text         -- why (free-form audit)
superseded_by_doc_id    UUID nullable
canonical_id            UUID nullable
source_trust            enum (kb|docs|release_notes|forum|...)  -- policy
lineage                 jsonb (delivery_mode, latency_ms, retry_count)
quarantine_flags        text[]

The upsert idempotency key is (source, source_id, content_hash). If the same payload arrives again, do nothing (no re‑embedding). If only acl_hash changes, rewrite metadata in the index, without re‑embedding. If content_hash changes, re‑index.
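
The three branches above fit into one small decision function. This is a sketch under assumptions: `Snapshot` and `plan_upsert` are hypothetical names, and the returned strings stand in for whatever commands the Registrar would actually send downstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    source: str
    source_id: str
    content_hash: str
    acl_hash: str


def plan_upsert(stored: Optional[Snapshot], incoming: Snapshot) -> str:
    """Decide what an incoming payload costs us, using hashes only."""
    if stored is None:
        return "index"  # first sighting of this (source, source_id)
    if (stored.source, stored.source_id) != (incoming.source, incoming.source_id):
        raise ValueError("snapshots describe different documents")
    if stored.content_hash != incoming.content_hash:
        return "reindex"  # content changed: the Indexer must reprocess
    if stored.acl_hash != incoming.acl_hash:
        return "rewrite_metadata"  # ACL-only change: no re-embedding
    return "noop"  # same payload again: do nothing
```

The content check comes first on purpose: when both hashes change, a re-index rewrites the metadata anyway, so one action covers both.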

This pattern is critical: 80% of "updates" in Confluence/Notion don't actually change content – someone clicked through, swapped tabs for spaces, hit Save. Without a hash compare, you'll regenerate millions of vectors for nothing.

Idempotency saves real money. One public case: a team migrating from text-embedding-ada-002 to text-embedding-3-small skipped diffed upserts and ended up with a five‑figure bill, where a properly executed full re‑embedding of 1M documents would have cost two figures. At 5–10M documents the gap becomes six figures. Idempotency is a mandatory MVP requirement, not a production‑only luxury: even with one source and 10k documents, without a (source, source_id, content_hash) key you're already fighting your own pipeline from release one.

Exact dedup on content_hash and near‑duplicate MinHash at the front door is the Registrar's job. Semantic dedup (SemDeDup, D4) and quality filters (Trafilatura, Talon, KenLM, Gopher heuristics) belong to the Indexer: they operate on already‑parsed text, shingles and embeddings, and "which chunk is canonical" can only be answered after chunking. The Registrar keeps a thin pass based on hashes and signatures.

Source_trust as a policy field

The schema above has a source_trust field. It's not the result of analysis and not a Registrar inference – it's a source‑level configuration, set by the product: KB > developer docs > release notes > forum, or whatever ranking makes sense in your context. The Registrar reads the value from a policy config and propagates it as document metadata. That's it.

Trust lives in the Registrar precisely because it comes from config, not from content. Using trust is somebody else's problem: the Indexer stores it as chunk metadata, the Retriever applies it at the rerank stage and when resolving conflicts between sources (see Part 3). The Registrar's job is just to record "this document is from source X with trust = Y" and to update that whenever a product owner changes the config.

Same logic as the lifecycle statuses: an explicit decision arriving from the outside is recorded in the registry and emitted downstream. No heuristics, no "let's look at the text and guess".

§ 05 Freshness modes: 6 levels of latency × cost

When people ask "how often should we sync?", they usually mean one of six modes. They differ fundamentally in architecture and in money.

Level | Source→index latency | Mechanics | When to use
Manual / one-off backfill | hours–days | one‑shot dump via bulk API | initial load, embedding‑model migration
Scheduled full scan | 6–24 hours | cron + content_hash compare | legacy without delta API, safety net
Incremental polling | 15 min – 1 hour | modified_after=<cursor> | Confluence, Notion, older SaaS
Webhook + fetch | seconds–minutes | webhook = signal, fetch = truth | Slack, Linear, GitHub, Notion
Delta API / change feed | 1–5 minutes | provider-managed cursor | Drive, Graph (SharePoint/OneDrive), Atlassian
CDC / streaming | ms–seconds | Debezium / Flink / Pathway | fraud, trading, ops, e-discovery

5.1. Manual / one-off backfill

What it physically does. A one‑shot dump of all documents via bulk API or export. Triggered manually, runs for hours or days.

Pros. Simple, predictable, no subscription infrastructure required. Ideal for the first load and for migrations between embedding models.

Cons. Zero freshness. The moment the backfill ends, the index starts going stale – any ingest without a delta mechanism turns the corpus into a historical snapshot.

When to use. At the start of a RAG system, for one‑off archive imports, when changing the embedding model on an existing corpus. Never leave it as the only mechanism.

5.2. Scheduled full scan

What it physically does. A cron job, every N hours, walks every document in the source and compares content_hash against the registry. Effectively, batch reconciliation.

Pros. Catches drift and gaps that a delta mechanism missed. Simple to implement – you only need listing and hash compare. Works well as a safety net.

Cons. Expensive in API quota and compute, latency starts at 6 hours. On large corpora a full scan may not fit its window – you'll have to partition by space/site.

When to use. For legacy sources without a usable delta API; as a safety net alongside polling/webhook to catch dropped events. Never as the primary mechanism for a large enterprise corpus.
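
The hash compare at the heart of a full scan is a set difference over two listings. A minimal sketch, assuming both sides can be reduced to a `{source_id: content_hash}` mapping; `ScanDiff` and `diff_scan` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class ScanDiff(NamedTuple):
    new: Set[str]
    changed: Set[str]
    missing: Set[str]


def diff_scan(source: Dict[str, str], registry: Dict[str, str]) -> ScanDiff:
    """Compare {source_id: content_hash} from a full listing with the registry."""
    shared = source.keys() & registry.keys()
    return ScanDiff(
        new=set(source.keys() - registry.keys()),
        changed={sid for sid in shared if source[sid] != registry[sid]},
        # Absent from one listing is not yet a deletion: these ids should go
        # to a recheck, not straight to a tombstone.
        missing=set(registry.keys() - source.keys()),
    )
```

On a large corpus the same function can run per space or site, which is one way to partition a scan that no longer fits its window.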

5.3. Incremental polling

What it physically does. A worker, ticking every 5–60 minutes, polls the source with modified_after=<cursor> or a delta token, pulls only what changed, and advances the cursor.

Pros. Simple, cheap, works almost everywhere. Fully controllable load. Enough for most enterprise RAG cases.

Cons. Latency of 15–60 minutes. On large sources, Confluence Cloud rate limits and pagination can eat real quota – this isn't always "pennies", especially with 50k+ pages and a 5‑minute tick.

When to use. Confluence, Notion, older SaaS without push, the default for any enterprise setup where a 30‑minute SLA doesn't block the product.
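
One polling tick might look like the sketch below. It is illustrative only: `poll_once`, `fetch_changes` and `handle` are hypothetical names, the source is assumed to accept a "modified after" timestamp, and the two-minute overlap is an arbitrary example value. The overlap re-reads a small window behind the cursor, so `handle` has to be idempotent.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable


def poll_once(
    fetch_changes: Callable[[datetime], Iterable[dict]],
    cursor: datetime,
    handle: Callable[[dict], None],
    overlap: timedelta = timedelta(minutes=2),
) -> datetime:
    """Run one tick and return the cursor to persist for the next one.

    fetch_changes(after) is assumed to yield dicts with "source_id" and
    "modified_at" for everything modified after the given timestamp.
    """
    newest = cursor
    # Re-read a small window behind the cursor to absorb clock skew and
    # late writes; duplicates are expected and must be harmless.
    for change in fetch_changes(cursor - overlap):
        handle(change)
        newest = max(newest, change["modified_at"])
    # The caller persists the returned cursor only after every change in
    # this tick has been handled, so a crash replays rather than skips.
    return newest
```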

5.4. Webhook + fetch

What it physically does. The source emits a push event ("something changed for doc X"). A worker enqueues (doc_id, source, received_at) and goes off to do GET document + GET permissions.

Pros. Seconds‑to‑minutes latency. No constant polling load on the source.

Cons. Webhooks lie (see §6). Delivery guarantees vary wildly between providers. You need fetch‑after‑signal, idempotency, and a reconciliation loop.

When to use. When the source ships reliable webhooks and a 5‑minute SLA already creates product value. Always paired with a reconciliation loop, never as a single source of truth.

5.5. Delta API / change feed

What it physically does. The provider keeps an ordered change log itself (Drive changes.list, Microsoft Graph delta, Atlassian Activity Stream). The client holds a cursor and pulls the next page.

Pros. Ordered, complete, 1–5 minute latency. Works as a primary mechanism, no webhooks needed.

Cons. Not every source has a usable change feed. Cursor state has to be preserved and replayable with a safety overlap.

When to use. Google Drive, SharePoint/OneDrive via Graph API, Atlassian. For these, this is the default and is better than webhooks: you get delivery guarantees plus order.

5.6. CDC / streaming

What it physically does. Debezium reads the database WAL, pushes changes into Kafka, then Flink/Pathway/RisingWave enriches the event with an embedding and writes it into a vector DB. Latency is measured in milliseconds.

Pros. Real real‑time freshness, one pipeline shared across consumers.

Cons. Hard to operate, expensive, requires access to source WAL. Overkill for most enterprise RAG.

When to use. Fraud detection, trading, ops alerts, e‑discovery with deadlines – wherever "10 minutes behind" means "money lost".

Headline takeaway: don't pay for streaming when your risk is already covered by a delta API. The Registrar picks a freshness mode based on risk/SLA, not on stack fashion.


§ 06 Webhook ≠ truth

The most damaging belief in production RAG: "we have webhooks, so freshness is guaranteed". A webhook is not truth. It's a "go and recheck" signal.

The specific things webhooks will quietly do to you:

Webhook = signal, not truth. That's the central thesis. The correct pattern: a webhook puts (doc_id, source, received_at) on a "needs recheck" queue, and a separate worker, rate‑limited and idempotent, goes off to GET document and GET permissions. In parallel a reconciliation loop runs every 6 hours, walking the delta API and checking the registry against the source for "events that never made it". Without reconciliation, sooner or later you'll find 0.5% of documents in your index living two versions behind – and you'll learn about it from a user.
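
The signal-then-fetch pattern can be sketched as a coalescing queue plus a worker. Everything here is hypothetical naming (`RecheckQueue`, `drain`, and the `fetch_document` / `fetch_permissions` / `apply` callables); rate limiting and the 6-hour reconciliation loop are left out to keep the sketch short.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class RecheckQueue:
    """Coalesces webhook signals: one pending entry per (source, doc_id)."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[Tuple[str, str], float]" = OrderedDict()

    def signal(self, source: str, doc_id: str, received_at: float) -> None:
        # Ten webhooks for one document still mean one fetch. The earliest
        # received_at is kept so lag is measured from the first signal.
        self._pending.setdefault((source, doc_id), received_at)

    def pop(self) -> Optional[Tuple[str, str, float]]:
        if not self._pending:
            return None
        (source, doc_id), received_at = self._pending.popitem(last=False)
        return source, doc_id, received_at

    def __len__(self) -> int:
        return len(self._pending)


def drain(
    queue: RecheckQueue,
    fetch_document: Callable[[str, str], object],
    fetch_permissions: Callable[[str, str], object],
    apply: Callable[[str, object, object], None],
) -> int:
    """Fetch the truth for every pending signal; returns how many were handled."""
    processed = 0
    while (item := queue.pop()) is not None:
        source, doc_id, _received_at = item
        # The webhook payload is never trusted: document and permissions
        # are both re-read from the source.
        apply(doc_id, fetch_document(source, doc_id), fetch_permissions(source, doc_id))
        processed += 1
    return processed
```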

§ 07 ACL – the most expensive ingestion mistake

Back to oversharing. Technically there are three root problems.

First, broken inheritance in SharePoint. The default is for permissions to inherit from the site down through the hierarchy. Any object can "break" inheritance and stand up its own set. According to Microsoft Learn, a library can hold up to 50,000 unique ACLs (≤5,000 is recommended in practice), and inheritance can't be broken at all once a folder hits 100,000+ items. Real corporate sites live in the hot zone between 5k and 50k scopes. On top of that, SharePoint Advanced Management and Copilot governance now actively flag the baseline grants Everyone except external users, People in your organization, Anyone – exactly the ones that most reliably turn into Copilot oversharing.

Second, Confluence permission context. In Confluence, the danger isn't just an explicit grant: restrictions inherit from the parent, and copying, moving or restructuring a tree can accidentally change effective visibility. Atlassian explicitly warns that moving pages can expose child pages when inherited restrictions no longer apply. The Registrar has to read effective permissions rather than try to derive access from one API field.

Third, Google Drive's "anyone with the link". The Drive API exposes this as a type=anyone permission, often with allowFileDiscovery=false: the document isn't indexed as public by search engines, but for RAG access control it's still "available to anyone with the link". A manager who shared a salary file with one colleague back in 2019 forgot about it – your RAG didn't.
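
A hedged sketch of how a connector might normalise Drive permissions so that link sharing is never silently dropped. The input dicts follow the Drive API v3 permission resource fields (`type`, `emailAddress`, `domain`, `allowFileDiscovery`); the principal labels such as `link:anyone` and the function names are hypothetical.

```python
from typing import Iterable, Set

ANYONE_WITH_LINK = "link:anyone"  # hypothetical label in the principal namespace


def drive_principals(permissions: Iterable[dict]) -> Set[str]:
    """Map Drive permission entries to principals for access control."""
    principals: Set[str] = set()
    for perm in permissions:
        kind = perm.get("type")
        if kind == "anyone":
            # allowFileDiscovery=false only hides the file from search;
            # anyone holding the link can still open it.
            principals.add(ANYONE_WITH_LINK)
        elif kind in ("user", "group"):
            principals.add(f"{kind}:{perm['emailAddress']}")
        elif kind == "domain":
            principals.add(f"domain:{perm['domain']}")
    return principals


def is_link_shared(permissions: Iterable[dict]) -> bool:
    """True if any entry grants access to anyone with the link."""
    return any(perm.get("type") == "anyone" for perm in permissions)
```

Whether a link-shared file is indexed at all, or quarantined for review, is a policy decision; the Registrar's part is only to record the fact.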

Effective access model, not raw ACL

Real‑world ACL is not a single principals_allow array. The Registrar has to deal with:

The Registrar stores not "raw ACL", but an effective access model, normalised to a principal namespace the Retriever understands.

ACL capture pattern

  1. A full pass of permissions.list on every document change (not on every ACL change – ACL webhooks aren't reliable).
  2. Group expansion down to leaves (group_id → user_ids) with a 4–6 hour TTL.
  3. Keep three fields in the index: principals_allow, principals_deny, acl_hash (sha256 of the sorted contents).
  4. Enforcement at retrieval time, not at ingestion time. This is harder, but it protects you from the race condition "fired at 14:00, documents reindex at 14:15, the user got their question in at 14:05".
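
Steps 2 to 4 can be sketched as follows. The names (`acl_hash`, `GroupCache`, `can_read`) are hypothetical, the canonical JSON form is just one stable serialisation, and "deny overrides allow" is an assumption of this sketch rather than a rule every source follows.

```python
import hashlib
import json
import time
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple


def acl_hash(principals_allow: Iterable[str], principals_deny: Iterable[str]) -> str:
    """sha256 over a sorted, de-duplicated form, so list order never matters."""
    canonical = json.dumps(
        {"allow": sorted(set(principals_allow)), "deny": sorted(set(principals_deny))},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GroupCache:
    """Caches group_id -> member ids for a TTL (4 hours by default)."""

    def __init__(
        self,
        resolve: Callable[[str], Iterable[str]],
        ttl_seconds: float = 4 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def members(self, group_id: str) -> FrozenSet[str]:
        now = self._clock()
        hit = self._entries.get(group_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        members = frozenset(self._resolve(group_id))
        self._entries[group_id] = (now, members)
        return members


def can_read(user_principals: Set[str], allow: Iterable[str], deny: Iterable[str]) -> bool:
    """Retrieval-time check; deny wins over allow (assumption of this sketch)."""
    if user_principals & set(deny):
        return False
    return bool(user_principals & set(allow))
```

The check runs against the user's principals as resolved at query time, which is what closes the 14:00 / 14:05 / 14:15 race described in step 4.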

Platforms like Glean publicly describe this approach: real‑time permission checks on the query, with the index storing only a snapshot. The open‑source world has small examples like voitta‑rag (March 2026), which walks the SharePoint Graph API and persists permissions next to chunks.

§ 08 Tombstones, erasure, and source unavailable ≠ deleted

A document that lands in your RAG turns into dozens of vectors, a copy in blob storage, a cache entry in the reranker, a snapshot in analytics. GDPR Article 17 requires the controller to erase a data subject's personal data when the right to erasure applies (with the conditions and exceptions described in the article). Soft delete on its own doesn't satisfy the right to erasure if the data remains retrievable/processable; for AI systems, that creates legal exposure, because embeddings and Retriever caches keep returning the "deleted" document.

What the Registrar needs for an erasure flow

  1. Reverse mapping subject_id → document_ids → chunk_ids → vector_ids. Built at ingestion time: every time we detect a name/email/phone number, we write an entry in this index.
  2. Hard delete on request: vectors out of the vector DB, chunks out of blob storage, the source document out of object storage, retrieval and rerank caches invalidated.
  3. Audit log: subject_id, operation, vector_id, namespace, threshold, result count, timestamp. Retention is set by privacy/legal policy, limitation periods, and legal hold; the audit log itself is also minimised.
  4. TTL on collections – automatic retention without manual scripts.
  5. Reindex strategies: for FAISS IVF, periodic rebuilds after large deletes; for HNSW, deletes are often implemented via tombstones/mark‑delete, and periodic compaction/rebuild may be required depending on the engine.
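
The reverse mapping from step 1 can be sketched as three small indexes that are walked in order when an erasure request arrives. `ReverseIndex` and `ErasurePlan` are hypothetical names, and the sketch assumes that every document mentioning the subject is deleted whole; a real policy may redact instead.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Set


class ErasurePlan(NamedTuple):
    doc_ids: Set[str]
    chunk_ids: Set[str]
    vector_ids: Set[str]


class ReverseIndex:
    """subject_id -> document_ids -> chunk_ids -> vector_ids, filled at ingestion."""

    def __init__(self) -> None:
        self._docs: Dict[str, Set[str]] = defaultdict(set)     # subject -> docs
        self._chunks: Dict[str, Set[str]] = defaultdict(set)   # doc -> chunks
        self._vectors: Dict[str, Set[str]] = defaultdict(set)  # chunk -> vectors

    def link_subject(self, subject_id: str, doc_id: str) -> None:
        self._docs[subject_id].add(doc_id)

    def link_chunk(self, doc_id: str, chunk_id: str, vector_id: str) -> None:
        self._chunks[doc_id].add(chunk_id)
        self._vectors[chunk_id].add(vector_id)

    def plan(self, subject_id: str) -> ErasurePlan:
        """Everything that has to be hard-deleted for this subject."""
        doc_ids = set(self._docs.get(subject_id, ()))
        chunk_ids = {c for d in doc_ids for c in self._chunks.get(d, ())}
        vector_ids = {v for c in chunk_ids for v in self._vectors.get(c, ())}
        return ErasurePlan(doc_ids, chunk_ids, vector_ids)
```

The plan is only the lookup half: the deletes themselves, the cache invalidation and the audit entry from steps 2 and 3 still have to run against each store.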

Source unavailable ≠ deleted

If the Confluence/SharePoint API returns 404/403/timeout, that is not a tombstone. Real causes include:

The Registrar has to distinguish between states:

not_found_confirmed_by_source   → tombstone
not_found_once                  → pending_recheck
source_unavailable              → stale (NOT tombstone)
connector_permission_lost       → incident, not a deletion

Without this, a connector outage can wipe half your index by accident. The only valid trigger for a tombstone is confirmed absence from the source, via a consistent delta API or multi‑attempt verification (at least 2–3 attempts, spaced over time, with an independent connector health check).
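
The four states above can be told apart with a small classifier over failed probe results. This is a sketch under stated assumptions: `classify` and `Verdict` are hypothetical names, the input is a list of HTTP statuses from failed probes spaced over time (`None` for a timeout), 401/403 is read as lost connector permission, and three consecutive 404s with a healthy connector count as confirmation.

```python
from enum import Enum
from typing import Optional, Sequence


class Verdict(str, Enum):
    TOMBSTONE = "tombstone"
    PENDING_RECHECK = "pending_recheck"
    STALE = "stale"
    INCIDENT = "incident"


def classify(
    statuses: Sequence[Optional[int]],
    connector_healthy: bool,
    min_confirmations: int = 3,
) -> Verdict:
    """Classify a run of failed probes for one document, oldest first.

    connector_healthy is the result of an independent health check.
    """
    if not statuses:
        return Verdict.PENDING_RECHECK
    latest = statuses[-1]
    if latest in (401, 403):
        return Verdict.INCIDENT  # connector_permission_lost, not a deletion
    if latest != 404 or not connector_healthy:
        return Verdict.STALE  # timeout, throttling, 5xx, or a sick connector
    # Count consecutive 404s at the end of the run.
    trailing = 0
    for status in reversed(statuses):
        if status != 404:
            break
        trailing += 1
    if trailing >= min_confirmations:
        return Verdict.TOMBSTONE  # not_found_confirmed_by_source
    return Verdict.PENDING_RECHECK  # not_found_once
```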

Legal hold and retention conflict

A classic conflict, and one worth preparing for in advance:

In this scenario the Registrar moves the document into quarantine (retrieval disabled) with legal_hold=true, defers physical deletion until the hold is lifted, and the audit log captures both events – the user request and the retrieval block. This isn't a theoretical scenario: any regulated industry – finance, healthcare, legal – runs into exactly this conflict eventually.
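
A minimal sketch of that flow, with hypothetical names (`RegistryDoc`, `request_erasure`, `lift_hold`) and string return values standing in for real commands. It only shows the ordering: both audit entries are written, retrieval is blocked, and the purge waits for the hold to be lifted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RegistryDoc:
    doc_id: str
    lifecycle_status: str = "active"
    legal_hold: bool = False
    purge_deferred: bool = False
    audit: List[Tuple[str, str]] = field(default_factory=list)


def request_erasure(doc: RegistryDoc, requested_by: str) -> str:
    """Returns "purge" if deletion can proceed now, "deferred" if a hold blocks it."""
    doc.audit.append(("erasure_requested", requested_by))
    if doc.legal_hold:
        doc.lifecycle_status = "quarantine"  # retrieval disabled
        doc.purge_deferred = True
        doc.audit.append(("retrieval_blocked", "legal_hold"))
        return "deferred"
    return "purge"


def lift_hold(doc: RegistryDoc, lifted_by: str) -> str:
    """Lifting the hold releases any purge that was waiting behind it."""
    doc.legal_hold = False
    doc.audit.append(("legal_hold_lifted", lifted_by))
    return "purge" if doc.purge_deferred else "none"
```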

Why this isn't an academic concern

A short fines round‑up. OpenAI vs Garante – €15M in December 2024; in March 2026 the Court of Rome annulled the Garante's decision (Wilson Sonsini), but parallel investigations in Germany, France and Spain haven't gone away. Total GDPR fines in 2025 reached €1.15 billion per the EDPB Annual Report. EDPB Opinion 28/2024 covers anonymisation in AI models and the legitimate‑interest assessment before deployment – which directly affects embedding providers and any ingestion pipeline with personal data.

§ 09 Observability: dashboard and SLOs for the Registrar

The Registrar is almost always invisible in dashboards. On the retrieval side you see p95 search latency, recall@10, faithfulness; on the answer side, generation cost; on ingestion – well, sometimes a count of chunks per hour. But "how many documents have been stale for an hour or more", "how many had ACL changes without reindexing", "how many deletion requests closed within SLA" – none of that is there. That's the blind spot the 802,000 files live in.

The minimum metric set, below which the Registrar is not production‑ready:

Metric | What it catches
registrar_staleness_p95 | documents not reconciled with the source for a long time
acl_drift_lag_p95 | delay between an ACL change and the effective ACL update in the index
tombstone_propagation_p95 | time from tombstone to deletion across all downstream
orphan_chunks_count | chunks without an active registry record
acl_hash_mismatch_count | index holds a stale ACL relative to the registry
source_429_rate | source API throttling
reconciliation_misses_total | webhook drop / delta gap caught by reconciliation
index_metadata_rewrite_lag | latency of ACL‑only updates in the index
purge_failures_total | erasure / hard‑delete failures
lifecycle_status_distribution | breakdown across active / deprecated / sunset / superseded / archive – the Registrar genuinely knows this
lifecycle_transition_rate | lifecycle transitions over a window – a spike signals a mass migration or a sunset campaign
source_trust_distribution | document distribution by trust level – a sanity check on the policy config

Each metric surfaces its own class of bugs: acl_drift_lag – a fired employee still sees documents; tombstone_propagation – a deleted document is still in the index; orphan_chunks – garbage in the vector DB; reconciliation_misses – webhooks dying quietly; lifecycle_transition_rate – a mass migration/sunset that downstream must react to. Without these metrics the Registrar degrades silently, and you'll learn about it from a user or a regulator.
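
Two of these metrics are cheap to compute straight from the registry, as the sketch below suggests. The function names and input shapes are assumptions: timestamps are plain seconds, `last_reconciled_at` maps doc_id to the last successful reconcile, and the percentile uses the nearest-rank method.

```python
from typing import Dict, Sequence, Set


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; 0.0 for an empty input."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def staleness_p95(last_reconciled_at: Dict[str, float], now: float) -> float:
    """registrar_staleness_p95: age in seconds since each doc was last reconciled."""
    return p95([now - ts for ts in last_reconciled_at.values()])


def orphan_chunks(chunk_to_doc: Dict[str, str], active_doc_ids: Set[str]) -> int:
    """orphan_chunks_count: chunks whose document has no active registry record."""
    return sum(1 for doc_id in chunk_to_doc.values() if doc_id not in active_doc_ids)
```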

What's deliberately missing from this table: conflict_count, multimodal_anchor_misses, dead_section_ratio, same_title_collisions. Those are Indexer‑side quality metrics – they show up in Part 2, because they require content analysis, not registry analysis.

§ 10 MVP vs Production Registrar

Component | MVP | Production
Idempotency | (source, source_id, content_hash) – mandatory | same + normalized_hash + canonical_id
Exact dedup | SHA-256 on content | SHA-256 raw + normalised; near-dup MinHash at the front door
Tombstones | Hard delete on confirmed deletion | reverse mapping subject_id→vectors + audit log + legal hold
ACL | principals_allow | deny rules + group expansion + dynamic groups + retrieval‑time enforcement
Freshness | polling + scheduled scan | delta API / webhook + reconciliation loop
Time‑to‑index SLA | best effort | P95 ≤ 10 minutes, with a dedicated staleness metric
Observability | counter of processed documents | full SLO dashboard from §9
Quarantine | flag + retrieval skip | flags + provenance + instruction‑like payload detection
Lifecycle states | active / deleted | active + deprecated + sunset + superseded_by + legacy_archive + audit (lifecycle_decision_by/at/reason)
Source_trust | all sources equal | policy config with priorities (KB > docs > release notes > forum), a field in the registry
Lifecycle events | | LifecycleStatusChanged for downstream consumers (Indexer + Retriever)

This MVP→Production jump is not a one‑week task. It's 2–4 months for a team of 2–3 engineers, and it's painful to do retroactively on 5M documents. Carve the Registrar out as a separate service from day one – even if at MVP stage that's just three modules and a Postgres.

§ 11 What's next

The Registrar is the layer where 80% of enterprise RAG projects collapse, because you can't paper over it with a "good vector DB" or hide it behind a "smart retriever". The contract is simple: document registry + effective ACL + lifecycle + lineage + tombstones, expressed through 9 events and 12 metrics. If the Registrar reliably emits that, it has done its job.

In part 2 – the Indexer: chunking, quality filters (Trafilatura, Talon, KenLM, Gopher/C4), near‑dup and semantic dedup (MinHash/LSH, SemDeDup, D4), multimodal parsing, instruction sanitisation as a content‑processing step. That's also where everything that requires content analysis but keeps getting smuggled into the Registrar belongs: conflict detection across documents, section‑level liveness (dead sections, broken anchors), multimodal binding (deictic markers next to an image), same‑title collisions, chunk‑level scope tags. Anything about "how to cut it up and what to throw away" lives there.

In part 3 – the Retriever: hybrid retrieval, reranking, retrieval‑time ACL enforcement, cascades and cost, protection against prompt injection in the rendering layer, CSP and markdown guardrails. Also: lifecycle penalty in ranking (how to penalise sunset, how to allow opt‑in for legacy_archive), conflict resolution in the final answer, and using source_trust at rerank time.

If you finished this piece thinking "but our diagram shows ingestion as a single arrow…", the article did its job.