Data Registrar – RAG infra · rootwise papers

§ 01 Hook: Copilot doesn't break your ACLs – it shows they were already broken

Copilot isn't violating your permissions. It does something more uncomfortable: it shows that humans violated those permissions a long time ago. Microsoft 365 Copilot answers based on data the user can already reach through Microsoft Graph and Microsoft 365 permissions. So if your SharePoint has been quietly holding Everyone except external users for years, Copilot isn't "breaking into" the board deck – it just finds, faster than a human would, what was technically already accessible.

Copilot doesn't break your ACLs. It does worse: it instantly shows that your ACLs have been broken for a long time.

Per the Concentric AI Data Risk Report 2025, 16% of business‑critical Microsoft 365 data is overshared, and the average organisation has roughly 802,000 files at risk. One publicly described vendor case walks through a Copilot rollout to tens of thousands of employees, after which an ordinary sales manager pulled a summary of M&A documents that had been sitting in SharePoint with the wrong permissions. Even if you read that as an anonymised vendor anecdote, the class of problem is real: an LLM doesn't create oversharing – it makes existing oversharing instantly exploitable. Knostic separately documents the "lag between permission changes and Copilot sync" class of failure, which only makes things worse.

And this isn't a model problem. It's a problem with the layer that typical RAG architecture diagrams squash into a single arrow labelled "ingestion". That arrow needs to be cut into at least three services: Registrar → Indexer → Retriever. Today is about the Registrar.

§ 02 Ingestion is not one arrow on a diagram

The Registrar answers exactly one question, and it's a treacherous one: "which documents exist, in which versions, with which permissions, and which of them are you allowed to index right now?". Not "how to chunk it" or "what to embed it with" – that's the Indexer's job. And not "how to search it" – that's the Retriever. The Registrar owns the registry, and only the registry.

The Registrar is a change feed + permission gateway. It is not a content analyser.

This framing has to be stated plainly, because people break it on a regular basis. The Registrar says exactly one thing: "here is document X at version v17, this is its content_hash, this is its ACL, this is its source, we just observed/updated/removed it". And it emits events whenever any of that changes. That's the whole job.

The Registrar does not say: "this document duplicates that one", "this section is dead", "this paragraph needs an image", "this fact contradicts that one". Any conclusion of that kind requires content analysis – and content analysis is the Indexer (see Part 2). If you try to make the Registrar do it "while it's there anyway", you end up with a service that conflates change‑feed responsibilities with content processing, and it quietly breaks both contracts.

Formally, the Registrar owns four things:

Document registry: the canonical record – "document X exists, version v17, source Confluence space ENG, last modified 2026‑04‑21 14:32 UTC, content hash sha256:…".
Effective ACL snapshot: "who actually has access to this document right now, with groups expanded, deny rules applied, external guests resolved and broken inheritance accounted for, normalised to a principal namespace the Retriever understands".
Lineage: "where it came from, by what mechanism – webhook, polling, CDC – and what the delivery latency was".
Tombstones and deletion log: "these documents or versions must be deleted, including from every downstream index".

What the Registrar does not do: parse PDFs, compute embeddings, chunk, run Trafilatura/Talon/KenLM filters, compute chunk‑level MinHash signatures over parsed body text, or build HNSW. That's all the Indexer's territory. The Registrar runs only a thin quality pass: MIME/size detection, language ID (lid.176), content_hash, oversized/quarantine status, and input‑side dedup, meaning exact matches on content_hash plus near‑duplicates on a document‑level MinHash signature. Heavy quality filters (Trafilatura, Talon, KenLM, Gopher heuristics, semantic dedup like SemDeDup/D4) belong squarely to the Indexer.

Picture this: there are 10 million tokens in your index – a normal number for a mid‑sized company. Your model is answering users based on those documents.

What if a document just changed? What if it was deleted? What if a new one was added – and the change is critical? What if its permissions changed? What if different parts of the same tree have different permissions?

Each question on its own is simple, right? But there are a lot of them. And the real question is: how do you stay confident that you've closed them, and keep closing them, every release?

Industry guides love to name data cleaning and ingestion as one of the top reasons RAG projects fail, but for a production architect the percentage matters less than the concrete failure modes: a document is missing, a version is stale, an ACL is rotten, a tombstone never made it to the index. We haven't even gotten to chunking and retrieval yet, and we already have enough risk to retire the "ingestion is one arrow" diagram.

§ 03 The Registrar contract: state machine and events

For the Registrar to be an engineering contract rather than "a folder with metadata", you have to describe it through the document lifecycle.

discovered
  → fetched
  → active
     → acl_only_changed   ──► metadata rewrite, no re-embedding
     → content_changed    ──► Indexer reprocess
     → quarantine         ──► policy violation / instruction-like payload
     → deprecated         ──► admin flag: no new versions, facts still valid
     → sunset             ──► sunset-date in metadata: document retires on that date
     → superseded_by      ──► redirect metadata: replaced by another document
     → legacy_archive     ──► archive flag: kept for historical reference
     → tombstone_pending  ──► confirmed deletion at the source
     → purged             ──► hard delete + audit

Transition triggers:

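discovered → fetched: in discovered, the registry has learned that the document exists at the source (via delta API/webhook/scan) but has not yet pulled its content; the transition fires once content and permissions have been fetched.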
fetched → active: the document has been fetched, content_hash and acl_hash have been computed, the Indexer has been notified.
active → acl_only_changed: acl_hash changed, content_hash is the same. Index metadata is rewritten without re‑embedding.
active → content_changed: content_hash changed. The Indexer re‑parses and re‑embeds.
active/* → quarantine: policy violation, suspicious payload, signs of instruction injection, or a source trust breach.
active → deprecated / sunset / superseded_by / legacy_archive: an explicit decision from the source or an admin tool – not "it aged out by itself" but something that arrived from outside (see below).
active → tombstone_pending: the source has confirmed deletion via a consistent delta API or multi‑attempt verification (see §8). Only confirmed absence is grounds for a tombstone.
tombstone_pending → purged: hard delete completed, audit log closed, downstream services confirmed invalidation.
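
The transition list above can be sketched as a small table‑driven state machine. This is a minimal illustration, not the author's implementation: the names `State`, `ALLOWED`, `can_transition` and `transition` are hypothetical, the reading of "active/*" as "any state that is not already quarantined or purged" is an assumption, and return paths (for example from content_changed back to active) are left out because the text does not spell them out.

```python
from enum import Enum


class State(str, Enum):
    DISCOVERED = "discovered"
    FETCHED = "fetched"
    ACTIVE = "active"
    ACL_ONLY_CHANGED = "acl_only_changed"
    CONTENT_CHANGED = "content_changed"
    QUARANTINE = "quarantine"
    DEPRECATED = "deprecated"
    SUNSET = "sunset"
    SUPERSEDED_BY = "superseded_by"
    LEGACY_ARCHIVE = "legacy_archive"
    TOMBSTONE_PENDING = "tombstone_pending"
    PURGED = "purged"


# Only the transitions named in the text; anything else is rejected.
ALLOWED = {
    State.DISCOVERED: {State.FETCHED},
    State.FETCHED: {State.ACTIVE},
    State.ACTIVE: {
        State.ACL_ONLY_CHANGED, State.CONTENT_CHANGED, State.DEPRECATED,
        State.SUNSET, State.SUPERSEDED_BY, State.LEGACY_ARCHIVE,
        State.TOMBSTONE_PENDING,
    },
    State.TOMBSTONE_PENDING: {State.PURGED},
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the lifecycle allows moving from src to dst."""
    if dst is State.QUARANTINE:
        # Assumption: "active/*" means any state that is not already
        # quarantined or purged.
        return src not in (State.QUARANTINE, State.PURGED)
    return dst in ALLOWED.get(src, set())


def transition(src: State, dst: State) -> State:
    """Apply a transition or raise ValueError if it is not allowed."""
    if not can_transition(src, dst):
        raise ValueError(f"illegal transition: {src.value} -> {dst.value}")
    return dst
```

Keeping the table explicit makes "purged without tombstone_pending" a rejected request rather than a silent write.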

The lifecycle spectrum – an explicit decision, not a Registrar inference

The state machine above is deliberately broader than the simple version: active → tombstone_pending → purged is a binary picture, and real product corpora don't live inside it. Between "current" and "deleted" there are usually four more statuses, and each of them is an explicit decision arriving from the source or from an admin tool. The Registrar doesn't infer them; it carries them as fact.

Status	Where it comes from and what it means
`active`	the default – a current document.
`deprecated`	admin flag in the source system: no new versions are being added, but the facts are still valid.
`sunset`	a product decision, sunset‑date in metadata: the document or section will be retired on that date.
`superseded_by`	redirect metadata from the source: replaced by another document, with a canonical link.
`legacy_archive`	archive flag in the source: for historical reference, not for default retrieval.
`tombstone_pending` / `purged`	as in the state machine above – confirmed absence from the source.

The Registrar stores four fields alongside lifecycle_status: lifecycle_decision_by, lifecycle_decision_at, lifecycle_reason, superseded_by_doc_id (where applicable). All of it arrived from outside – the Registrar just recorded it and propagated it downstream.

What does the Registrar do with these statuses at retrieval time? Nothing. Decisions like "show deprecated with a warning", "don't show legacy_archive without opt‑in", "penalise sunset in ranking" – those all belong to the Retriever (see Part 3).

Event contract

Downstream services (Indexer, Retriever, audit) subscribe to an event bus. The minimum event set:

{
  "event_type": "DocumentContentChanged",
  "doc_id": "uuid",
  "source": "sharepoint",
  "source_id": "driveItem:abc",
  "version": 18,
  "content_hash": "sha256:...",
  "acl_hash": "sha256:...",
  "occurred_at": "2026-04-21T14:32:00Z",
  "fetched_at": "2026-04-21T14:34:12Z"
}

Event categories:

DocumentDiscovered – the registry learned that a document exists at the source.
DocumentContentChanged – content_hash changed.
DocumentAclChanged – acl_hash changed, content is the same.
DocumentTombstoned – confirmed deletion.
DocumentQuarantined – policy/security flag.
DocumentPurged – hard delete completed.
LifecycleStatusChanged – {doc_id, from, to, by, at, reason}: explicit decision from the source or an admin tool (deprecated / sunset / superseded_by / legacy_archive).
SourceCursorAdvanced – delta‑API/CDC checkpoint advanced.
ReconciliationGapDetected – reconciliation found a mismatch between the registry and the source.

This is the engineering contract: everything the Registrar promises downstream is expressed through these nine events. If it reliably emits this set, it has done its job. Everything else is derived – semantics, conflict detection, section liveness, multimodal binding – and lives in the Indexer and the Retriever.
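
One way to keep the contract honest is to make the event envelope a typed object that refuses unknown event types. The sketch below mirrors the JSON example and the nine event names from the list above; the class name `RegistrarEvent`, the `EVENT_TYPES` constant and the `to_json` helper are hypothetical, and a real bus would likely add a schema registry on top.

```python
import json
from dataclasses import asdict, dataclass

# The nine event types listed in the text.
EVENT_TYPES = frozenset({
    "DocumentDiscovered", "DocumentContentChanged", "DocumentAclChanged",
    "DocumentTombstoned", "DocumentQuarantined", "DocumentPurged",
    "LifecycleStatusChanged", "SourceCursorAdvanced",
    "ReconciliationGapDetected",
})


@dataclass(frozen=True)
class RegistrarEvent:
    event_type: str
    doc_id: str
    source: str
    source_id: str
    version: int
    content_hash: str
    acl_hash: str
    occurred_at: str  # ISO 8601: when the change happened at the source
    fetched_at: str   # ISO 8601: when the Registrar observed it

    def __post_init__(self) -> None:
        # Reject anything outside the published contract.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

    def to_json(self) -> str:
        """Serialise with sorted keys so identical events produce identical bytes."""
        return json.dumps(asdict(self), sort_keys=True)
```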

§ 04 Document registry: schema and idempotency

The document registry is a separate store (usually Postgres + a Kafka topic as audit log) that keeps, for every document:

doc_id                  UUID
source                  enum (confluence|drive|sharepoint|s3|...)
source_id               text
content_hash            sha256
normalized_hash         sha256
version                 int
modified_at             timestamptz
fetched_at              timestamptz
acl_hash                sha256
principals_allow        text[]
principals_deny         text[]
lifecycle_status        enum (active|deprecated|sunset|superseded_by|
                              legacy_archive|stale|quarantine|
                              tombstone_pending|purged)
lifecycle_decision_by   text         -- who made the call (user/system)
lifecycle_decision_at   timestamptz  -- when
lifecycle_reason        text         -- why (free-form audit)
superseded_by_doc_id    UUID nullable
canonical_id            UUID nullable
source_trust            enum (kb|docs|release_notes|forum|...)  -- policy
lineage                 jsonb (delivery_mode, latency_ms, retry_count)
quarantine_flags        text[]

The upsert idempotency key is (source, source_id, content_hash). If the same payload arrives again, do nothing (no re‑embedding). If only acl_hash changes, rewrite metadata in the index, without re‑embedding. If content_hash changes, re‑index.
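
The three branches above fit in one function. This is a minimal in‑memory sketch under stated assumptions: the registry is a plain dict keyed by (source, source_id) and compared on content_hash, which has the same effect as the (source, source_id, content_hash) key; the `Record` type, the `upsert` function and the returned action strings are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Record:
    content_hash: str
    acl_hash: str
    version: int


Registry = Dict[Tuple[str, str], Record]


def upsert(registry: Registry, source: str, source_id: str,
           content_hash: str, acl_hash: str) -> str:
    """Record an observation and return the action downstream should take."""
    key = (source, source_id)
    current = registry.get(key)
    if current is None:
        registry[key] = Record(content_hash, acl_hash, version=1)
        return "index"
    if current.content_hash != content_hash:
        # Content changed: new version, Indexer re-parses and re-embeds.
        registry[key] = Record(content_hash, acl_hash, current.version + 1)
        return "reindex"
    if current.acl_hash != acl_hash:
        # Same content, new permissions: metadata rewrite only.
        current.acl_hash = acl_hash
        return "rewrite_metadata"
    return "noop"  # identical payload replayed: do nothing
```

Replaying the same webhook or delta page any number of times then costs a dictionary lookup, not an embedding call.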

This pattern is critical: 80% of "updates" in Confluence/Notion don't actually change content – someone clicked through, swapped tabs for spaces, hit Save. Without a hash compare, you'll regenerate millions of vectors for nothing.
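
A whitespace‑only edit still changes the raw bytes, which is where the schema's normalized_hash earns its place. The sketch below is one possible normalisation, chosen for illustration: Unicode NFC plus whitespace collapsing. The exact rules are an assumption and would need tuning per source, since collapsing whitespace is wrong for content where indentation carries meaning.

```python
import hashlib
import re
import unicodedata


def content_hash(raw: bytes) -> str:
    """Hash of the payload exactly as fetched."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def normalized_hash(text: str) -> str:
    """Hash that ignores Unicode composition and whitespace-only differences."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With both hashes stored, a save that only swapped tabs for spaces shows up as a changed content_hash and an unchanged normalized_hash, which a pipeline can treat as a version bump without re‑embedding.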

Idempotency saves real money. One public case: a team migrating from text-embedding-ada-002 to text-embedding-3-small skipped diffed upserts and ran up five‑figure bills, where a properly executed full re‑embedding of 1M documents would have cost two figures. At 5–10M documents the gap becomes six figures. Idempotency is a mandatory MVP requirement, not a production‑only luxury: even with one source and 10k documents, without a (source, source_id, content_hash) key you're already fighting your own pipeline from release one.

Exact dedup on content_hash and near‑duplicate MinHash at the front door is the Registrar's job. Semantic dedup (SemDeDup, D4) and quality filters (Trafilatura, Talon, KenLM, Gopher heuristics) belong to the Indexer: they operate on already‑parsed text, shingles and embeddings, and "which chunk is canonical" can only be answered after chunking. The Registrar keeps a thin pass based on hashes and signatures.

Source_trust as a policy field

The schema above has a source_trust field. It's not the result of analysis and not a Registrar inference – it's a source‑level configuration, set by the product: KB > developer docs > release notes > forum, or whatever ranking makes sense in your context. The Registrar reads the value from a policy config and propagates it as document metadata. That's it.

Trust lives in the Registrar precisely because it comes from config, not from content. Using trust is somebody else's problem: the Indexer stores it as chunk metadata, the Retriever applies it at the rerank stage and when resolving conflicts between sources (see Part 3). The Registrar's job is just to record "this document is from source X with trust = Y" and to update that whenever a product owner changes the config.

Same logic as the lifecycle statuses: an explicit decision arriving from the outside is recorded in the registry and emitted downstream. No heuristics, no "let's look at the text and guess".

§ 05 Freshness modes: 6 levels of latency × cost

When people ask "how often should we sync?", they usually mean one of six modes. They differ fundamentally in architecture and in money.

Level	Source→index latency	Mechanics	When to use
Manual / one-off backfill	hours–days	one‑shot dump via bulk API	initial load, embedding‑model migration
Scheduled full scan	6–24 hours	cron + content_hash compare	legacy without delta API, safety net
Incremental polling	15 min – 1 hour	`modified_after=<cursor>`	Confluence, Notion, older SaaS
Webhook + fetch	seconds–minutes	webhook = signal, fetch = truth	Slack, Linear, GitHub, Notion
Delta API / change feed	1–5 minutes	provider-managed cursor	Drive, Graph (SharePoint/OneDrive), Atlassian
CDC / streaming	ms–seconds	Debezium / Flink / Pathway	fraud, trading, ops, e-discovery

5.1. Manual / one-off backfill

What it physically does. A one‑shot dump of all documents via bulk API or export. Triggered manually, runs for hours or days.

Pros. Simple, predictable, no subscription infrastructure required. Ideal for the first load and for migrations between embedding models.

Cons. Zero freshness. The moment the backfill ends, the index starts going stale – any ingest without a delta mechanism turns the corpus into a historical snapshot.

When to use. At the start of a RAG system, for one‑off archive imports, when changing the embedding model on an existing corpus. Never leave it as the only mechanism.

5.2. Scheduled full scan

What it physically does. A cron job, every N hours, walks every document in the source and compares content_hash against the registry. Effectively, batch reconciliation.

Pros. Catches drift and gaps that a delta mechanism missed. Simple to implement – you only need listing and hash compare. Works well as a safety net.

Cons. Expensive in API quota and compute, latency starts at 6 hours. On large corpora a full scan may not fit its window – you'll have to partition by space/site.

When to use. For legacy sources without a usable delta API; as a safety net alongside polling/webhook to catch dropped events. Never as the primary mechanism for a large enterprise corpus.

5.3. Incremental polling

What it physically does. A worker, ticking every 5–60 minutes, polls the source with modified_after=<cursor> or a delta token, pulls only what changed, and advances the cursor.

Pros. Simple, cheap, works almost everywhere. Fully controllable load. Enough for most enterprise RAG cases.

Cons. Latency of 15–60 minutes. On large sources, Confluence Cloud rate limits and pagination can eat real quota – this isn't always "pennies", especially with 50k+ pages and a 5‑minute tick.

When to use. Confluence, Notion, older SaaS without push, the default for any enterprise setup where a 30‑minute SLA doesn't block the product.
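
A single polling tick can be sketched as a pure function: read everything modified since the cursor, then advance the cursor. The `list_changes` callable stands in for a source call such as `modified_after=<cursor>` and is hypothetical, as are `poll_once` and the `overlap_s` safety window; timestamps are epoch seconds for simplicity. The overlap deliberately re‑reads a short window, on the assumption that the idempotent upsert absorbs the duplicates.

```python
from typing import Callable, Iterable, List, Tuple

Change = Tuple[str, int]  # (doc_id, modified_at in epoch seconds)


def poll_once(list_changes: Callable[[int], Iterable[Change]],
              cursor: int, overlap_s: int = 60) -> Tuple[List[Change], int]:
    """Fetch changes since the cursor (minus an overlap) and advance it."""
    since = max(0, cursor - overlap_s)
    changes = sorted(list_changes(since), key=lambda change: change[1])
    # Never move the cursor backwards, even on an empty page.
    new_cursor = max([cursor] + [modified_at for _, modified_at in changes])
    return changes, new_cursor
```

The new cursor should be persisted only after the returned changes have been written to the registry, so a crash mid‑tick replays the page instead of skipping it.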

5.4. Webhook + fetch

What it physically does. The source emits a push event ("something changed for doc X"). A worker enqueues (doc_id, source, received_at) and goes off to do GET document + GET permissions.

Pros. Seconds‑to‑minutes latency. No constant polling load on the source.

Cons. Webhooks lie (see §6). Delivery guarantees vary wildly between providers. You need fetch‑after‑signal, idempotency, and a reconciliation loop.

When to use. When the source ships reliable webhooks and a 5‑minute SLA already creates product value. Always paired with a reconciliation loop, never as a single source of truth.

5.5. Delta API / change feed

What it physically does. The provider keeps an ordered change log itself (Drive changes.list, Microsoft Graph delta, Atlassian Activity Stream). The client holds a cursor and pulls the next page.

Pros. Ordered, complete, 1–5 minute latency. Works as a primary mechanism, no webhooks needed.

Cons. Not every source has a usable change feed. Cursor state has to be preserved and replayable with a safety overlap.

When to use. Google Drive, SharePoint/OneDrive via Graph API, Atlassian. For these, this is the default and is better than webhooks: you get delivery guarantees plus order.

5.6. CDC / streaming

What it physically does. Debezium reads the database WAL, pushes changes into Kafka, then Flink/Pathway/RisingWave enriches the event with an embedding and writes it into a vector DB. Latency is measured in milliseconds to seconds.

Pros. Real real‑time freshness, one pipeline shared across consumers.

Cons. Hard to operate, expensive, requires access to source WAL. Overkill for most enterprise RAG.

When to use. Fraud detection, trading, ops alerts, e‑discovery with deadlines – wherever "10 minutes behind" means "money lost".

Headline takeaway: don't pay for streaming when your risk is already covered by a delta API. The Registrar picks a freshness mode based on risk/SLA, not on stack fashion.

Sidebar · Streaming embeddings landscape 2025

┌────────────┐   CDC     ┌────────────┐  embed   ┌─────────────┐
│  Postgres  │──────────▶│   Kafka    │─────────▶│  Vector DB  │
└────────────┘ Debezium  └─────┬──────┘  Flink   └─────────────┘
                               │         Pathway
                               │         RisingWave
                               ▼
                         Materialized
                         views / search

Debezium 3.1 (April 2025) shipped the FieldToEmbedding SMT: a Kafka Connect pipeline can now fetch a vector inline (MiniLM locally or Ollama via LangChain4j) and write it into io.debezium.data.FloatVector. Debezium 3.1.1 patched the edge cases – the NPE on delete records and a crash when the source field name was a substring of the embedding field name (release notes 3.1.1). Debezium 3.2 (July 2025) added ollama.operation.timeout.ms (release notes 3.2). It's a clean reference pattern where a CDC event is enriched with an embedding inline, with no separate ingestion service.

RisingWave v2.6 (October 2025) introduced a vector(n) type and experimental HNSW compatible with the pgvector API (Highlights v2.6) – one database holding both materialised views and similarity search via the <-> distance operator. Role: a streaming DB in which you can store chunks and search them at the same time.

Pathway (Rust + Differential Dataflow) is a streaming/RAG framework with public success stories, including NATO. Its own benchmarks place it at the top of the streaming‑engine pack, but specific numbers like "90× faster than Flink" are best avoided without a precise scenario (pathway/pathway). Role: a compute engine for real‑time RAG pipelines with incremental joins.

Confluent Flink SQL. Model resources in Flink SQL appeared in Confluent Cloud in 2024. In 2025 the headline practical addition was the Create Embeddings action: turn a Kafka stream into a stream of embeddings and ship it onwards into Pinecone/Milvus/Qdrant/Mongo/Weaviate. Role: a SQL‑level option for streaming embeddings in a managed environment.

The headline takeaway from this sidebar matches the body of the article: the streaming stack is impressive, but it's a tool for a specific risk/SLA – not a default.

§ 06 Webhook ≠ truth

The most damaging belief in production RAG: "we have webhooks, so freshness is guaranteed". A webhook is not truth. It's a "go and recheck" signal.

The specific things webhooks will quietly do to you:

Different retry guarantees. Notion documents up to 8 retries; Atlassian Jira retries on 408/409/425/429/5xx and connection failures; SharePoint/Confluence drop events under load. There's no universal "503 → the event will come back" rule – plan for "delivery may be late, duplicated, or lost".
Out‑of‑order delivery. Update v17 can arrive before update v16, especially through AWS API Gateway with retries.
Stale payload. The body has a title and id, but the actual content may have changed again by the time you go to read it.
Security. Webhooks usually don't carry ACLs – you get "document X updated", but who can see it now is something you have to go and ask separately.

Webhook = signal, not truth. That's the central thesis. The correct pattern: a webhook puts (doc_id, source, received_at) on a "needs recheck" queue, and a separate worker, rate‑limited and idempotent, goes off to GET document and GET permissions. In parallel a reconciliation loop runs every 6 hours, walking the delta API and checking the registry against the source for "events that never made it". Without reconciliation, sooner or later you'll find 0.5% of documents in your index living two versions behind – and you'll learn about it from a user.
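
The signal/fetch split can be sketched with a tiny in‑process queue. Everything here is illustrative: `RecheckQueue`, `on_webhook` and `drain` are hypothetical names, the `fetch` callable stands in for the GET document + GET permissions pair, and rate limiting and persistence are left out. The point is that the webhook body is never trusted, and a burst of signals for one document collapses into one fetch.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set, Tuple

Signal = Tuple[str, str, float]  # (doc_id, source, received_at)


class RecheckQueue:
    """Webhooks enqueue a 'needs recheck' signal; a worker fetches the truth."""

    def __init__(self) -> None:
        self._queue: Deque[Signal] = deque()
        self._pending: Set[Tuple[str, str]] = set()

    def on_webhook(self, doc_id: str, source: str, received_at: float) -> bool:
        """Enqueue a signal; return False if one is already pending."""
        key = (source, doc_id)
        if key in self._pending:
            return False
        self._pending.add(key)
        self._queue.append((doc_id, source, received_at))
        return True

    def drain(self, fetch: Callable[[str, str], Dict]) -> List[Dict]:
        """Process all pending signals by fetching current state from the source."""
        results = []
        while self._queue:
            doc_id, source, _ = self._queue.popleft()
            self._pending.discard((source, doc_id))
            results.append(fetch(source, doc_id))
        return results
```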

§ 07 ACL – the most expensive ingestion mistake

Back to oversharing. Technically there are three root problems.

First, broken inheritance in SharePoint. By default, permissions inherit from the site down through the hierarchy. Any object can "break" inheritance and carry its own permission set. According to Microsoft Learn, a list or library supports up to 50,000 unique permission scopes (staying at or below 5,000 is the practical recommendation), and inheritance can't be broken at all on a list, library, or folder that holds more than 100,000 items. Real corporate sites live in the hot zone between 5k and 50k scopes. On top of that, SharePoint Advanced Management and Copilot governance now actively flag the baseline grants Everyone except external users, People in your organization, Anyone – exactly the ones that most reliably turn into Copilot oversharing.

Second, Confluence permission context. In Confluence, the danger isn't just an explicit grant: restrictions inherit from the parent, and copying, moving or restructuring a tree can accidentally change effective visibility. Atlassian explicitly warns that moving pages can expose child pages when inherited restrictions no longer apply. The Registrar has to read effective permissions rather than try to derive access from one API field.

Third, Google Drive's "anyone with the link". The Drive API exposes this as a type=anyone permission, often with allowFileDiscovery=false: the document isn't indexed as public by search engines, but for RAG access control it's still "available to anyone with the link". A manager who shared a salary file with one colleague back in 2019 forgot about it – your RAG didn't.

Effective access model, not raw ACL

Real‑world ACL is not a single principals_allow array. The Registrar has to deal with:

explicit deny – deny outranks allow, and it must not be lost during normalisation;
inherited allow / deny from the parent (site, space, folder);
external guests and domain‑wide grants;
"anyone with link" as a separate principal with an allowFileDiscovery flag;
Everyone except external users / People in your organization / Anyone as a baseline pattern – its own risk class;
group nesting – expanding the group hierarchy down to leaves, with a TTL;
dynamic groups (Entra ID dynamic membership, Okta group rules) that change membership without an explicit event;
deleted users – a deleted user must not stay in principals_allow;
SCIM / Entra ID delays – the directory provider itself is eventually consistent, and propagation can take minutes to hours.

The Registrar stores not "raw ACL", but an effective access model, normalised to a principal namespace the Retriever understands.

ACL capture pattern

A full pass of permissions.list on every document change (not on every ACL change – ACL webhooks aren't reliable).
Group expansion down to leaves (group_id → user_ids) with a 4–6 hour TTL.
Keep three fields in the index: principals_allow, principals_deny, acl_hash (sha256 of the sorted contents).
Enforcement at retrieval time, not at ingestion time. This is harder, but it protects you from the race condition "fired at 14:00, documents reindex at 14:15, the user got their question in at 14:05".
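
The capture pattern above can be sketched in a few functions: expand groups to leaves, let deny outrank allow, hash the sorted result, and check the user again at query time. The principal format (`u:`/`g:` prefixes), the `groups` mapping and all function names are assumptions for illustration; the group‑expansion TTL, dynamic groups and "anyone with link" principals are not modelled.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def expand(principals: Iterable[str], groups: Dict[str, Set[str]]) -> Set[str]:
    """Expand nested groups down to leaf principals; safe against cycles."""
    leaves: Set[str] = set()
    seen: Set[str] = set()
    stack = list(principals)
    while stack:
        principal = stack.pop()
        if principal in seen:
            continue
        seen.add(principal)
        if principal in groups:
            stack.extend(groups[principal])
        else:
            leaves.add(principal)
    return leaves


def effective_acl(allow: Iterable[str], deny: Iterable[str],
                  groups: Dict[str, Set[str]]) -> Tuple[List[str], List[str], str]:
    """Return (principals_allow, principals_deny, acl_hash); deny outranks allow."""
    denied = expand(deny, groups)
    allowed = expand(allow, groups) - denied
    principals_allow = sorted(allowed)
    principals_deny = sorted(denied)
    # Hash the sorted contents so input order never changes acl_hash.
    payload = "allow:" + ",".join(principals_allow) + "|deny:" + ",".join(principals_deny)
    acl_hash = "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return principals_allow, principals_deny, acl_hash


def can_read(user: str, principals_allow: List[str], principals_deny: List[str]) -> bool:
    """Retrieval-time check against the stored snapshot."""
    return user in principals_allow and user not in principals_deny
```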

Platforms like Glean publicly describe this approach: real‑time permission checks on the query, with the index storing only a snapshot. The open‑source world has small examples like voitta‑rag (March 2026), which walks the SharePoint Graph API and persists permissions next to chunks.

§ 08 Tombstones, erasure, and source unavailable ≠ deleted

A document that lands in your RAG turns into dozens of vectors, a copy in blob storage, a cache entry in the reranker, a snapshot in analytics. GDPR Article 17 requires the controller to delete a subject's data when the right to erasure applies (with the conditions and exceptions described in the article). Soft delete on its own doesn't close the right to erasure if the data remains retrievable/processable; for AI systems, that creates legal exposure, because embeddings and Retriever caches keep returning the "deleted" document.

What the Registrar needs for an erasure flow

Reverse mapping subject_id → document_ids → chunk_ids → vector_ids. Built at ingestion time: every time we detect a name/email/phone number, we write an entry in this index.
Hard delete on request: vectors out of the vector DB, chunks out of blob storage, the source document out of object storage, retrieval and rerank caches invalidated.
Audit log: subject_id, operation, vector_id, namespace, threshold, result count, timestamp. Retention is set by privacy/legal policy, limitation periods, and legal hold; the audit log itself is also minimised.
TTL on collections – automatic retention without manual scripts.
Reindex strategies: for faiss IVF, periodic rebuilds after large deletes; for HNSW, deletes are often implemented via tombstones/mark‑delete, and periodic compaction/rebuild may be required depending on the engine.
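
The reverse mapping from the first item can be sketched as three lookups chained together. The dictionaries below stand in for whatever store holds the mapping, and `purge_plan` is a hypothetical name; the function only computes what must be deleted, leaving the actual deletes, cache invalidation and audit writes to the caller.

```python
from typing import Dict, List, Set


def purge_plan(subject_id: str,
               subject_docs: Dict[str, Set[str]],
               doc_chunks: Dict[str, List[str]],
               chunk_vectors: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Walk subject_id -> document_ids -> chunk_ids -> vector_ids."""
    documents = sorted(subject_docs.get(subject_id, set()))
    chunks = [c for d in documents for c in doc_chunks.get(d, [])]
    vectors = [v for c in chunks for v in chunk_vectors.get(c, [])]
    return {"documents": documents, "chunks": chunks, "vectors": vectors}
```

Producing the full plan before deleting anything also gives the audit log one record that can be checked against the post‑delete state.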

Source unavailable ≠ deleted

If the Confluence/SharePoint API returns 404/403/timeout, that is not a tombstone. Real causes include:

source API outage;
expired credentials;
connector lost permission;
item moved;
tenant throttling;
transient 5xx.

The Registrar has to distinguish between states:

not_found_confirmed_by_source   → tombstone
not_found_once                  → pending_recheck
source_unavailable              → stale (NOT tombstone)
connector_permission_lost       → incident, not a deletion

Without this, a connector outage can wipe half your index by accident. The only valid trigger for a tombstone is confirmed absence from the source, via a consistent delta API or multi‑attempt verification (at least 2–3 attempts, spaced over time, with an independent connector health check).
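
The four states above map onto a small classifier. This is a sketch under assumptions: the mapping of HTTP codes to states (401/403 as an incident, 429/5xx/timeout as stale), the `classify` signature and the default of three confirmations are illustrative choices, and the requirement that attempts be spaced over time is left to the scheduler that calls it.

```python
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    OK = "ok"
    PENDING_RECHECK = "pending_recheck"
    STALE = "stale"
    INCIDENT = "incident"
    TOMBSTONE = "tombstone"


def classify(status: Optional[int], connector_healthy: bool,
             not_found_attempts: int, required_confirmations: int = 3) -> Outcome:
    """Map a fetch result to a registry outcome.

    status is the HTTP status, or None for a timeout. not_found_attempts
    counts consecutive 404s for this document, including the current one.
    """
    if status is None or status == 429 or status >= 500:
        return Outcome.STALE          # source unavailable, NOT a tombstone
    if status in (401, 403):
        return Outcome.INCIDENT       # credentials or connector permission lost
    if status == 404:
        if not connector_healthy:
            return Outcome.STALE      # cannot trust a 404 from a sick connector
        if not_found_attempts >= required_confirmations:
            return Outcome.TOMBSTONE  # confirmed absence
        return Outcome.PENDING_RECHECK
    if 200 <= status < 300:
        return Outcome.OK
    return Outcome.STALE              # anything unexpected: keep the document
```

The default branch errs on the side of keeping data: an unknown response marks the document stale instead of deleting it.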

Legal hold and retention conflict

A classic conflict, and one worth preparing for in advance:

a user requests erasure;
the document is under legal hold (investigation, litigation hold);
the audit log must record the deletion operation;
downstream must stop retrieval, even though the source object can't be physically deleted right away.

In this scenario the Registrar moves the document into quarantine (retrieval disabled) with legal_hold=true, defers physical deletion until the hold is lifted, and the audit log captures both events – the user request and the retrieval block. This isn't a theoretical scenario: any regulated industry – finance, healthcare, legal – runs into exactly this conflict eventually.
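
The hold‑versus‑erasure branch can be sketched as a single handler. The `Doc` type, the `handle_erasure` function, the audit event names and the use of tombstone_pending for the non‑hold path are all assumptions made for illustration; the behaviour that follows the text is that a held document is quarantined, not deleted, and that both the request and the retrieval block are logged.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Doc:
    doc_id: str
    legal_hold: bool = False
    lifecycle_status: str = "active"


def handle_erasure(doc: Doc, audit: List[Dict[str, str]], requested_at: str) -> str:
    """Process an erasure request; return 'deferred' or 'purge'."""
    audit.append({"doc_id": doc.doc_id, "event": "erasure_requested", "at": requested_at})
    if doc.legal_hold:
        # Retrieval stops now; physical deletion waits until the hold is lifted.
        doc.lifecycle_status = "quarantine"
        audit.append({"doc_id": doc.doc_id, "event": "retrieval_blocked", "at": requested_at})
        return "deferred"
    doc.lifecycle_status = "tombstone_pending"
    audit.append({"doc_id": doc.doc_id, "event": "purge_scheduled", "at": requested_at})
    return "purge"
```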

Why this isn't an academic concern

A short fines round‑up. OpenAI vs Garante – €15M in December 2024; in March 2026 the Court of Rome annulled the Garante's decision (Wilson Sonsini), but parallel investigations in Germany, France and Spain haven't gone away. Total GDPR fines in 2025 reached €1.15 billion per the EDPB Annual Report. EDPB Opinion 28/2024 covers anonymisation in AI models and the legitimate‑interest assessment before deployment – which directly affects embedding providers and any ingestion pipeline with personal data.

Post‑mortem · EchoLeak (CVE‑2025‑32711, CVSS 9.3)

January 2025 – Aim Labs reports privately to Microsoft. May 2025 – server‑side fix. June 2025 – public disclosure on Patch Tuesday (Hacker News, arXiv 2509.10540). The first documented zero‑click exploit against a production LLM application. Email with a markdown payload → Copilot ingests it as the user works → payload bypasses the XPIA classifier → reference‑style markdown bypasses link redaction → auto‑loaded images via the Teams proxy (whitelisted in CSP) exfiltrate the data.

Lesson for the Registrar: store provenance, source trust, quarantine flags, and signs of instruction‑like payloads at ingest. Ingest‑time trust evaluation lives here. Lesson for Retriever/Responder: strict isolation of retrieved content from system instructions, CSP, markdown/image rendering guardrails. Don't blur the layers: the Registrar doesn't do CSP, the Retriever/Responder doesn't recompute source trust.

§ 09 Observability: dashboard and SLOs for the Registrar

The Registrar is almost always invisible in dashboards. On the retrieval side you see p95 search latency, recall@10, faithfulness; on the answer side, generation cost; on ingestion – well, sometimes a count of chunks per hour. But "how many documents have been stale for an hour or more", "how many had ACL changes without reindexing", "how many deletion requests closed within SLA" – none of that is there. That's the blind spot the 802,000 files live in.

The minimum metric set, below which the Registrar is not production‑ready:

Metric	What it catches
`registrar_staleness_p95`	documents not reconciled with the source for a long time
`acl_drift_lag_p95`	delay between an ACL change and the effective ACL update in the index
`tombstone_propagation_p95`	time from tombstone to deletion across all downstream
`orphan_chunks_count`	chunks without an active registry record
`acl_hash_mismatch_count`	index holds a stale ACL relative to the registry
`source_429_rate`	source API throttling
`reconciliation_misses_total`	webhook drop / delta gap caught by reconciliation
`index_metadata_rewrite_lag`	latency of ACL‑only updates in the index
`purge_failures_total`	erasure / hard‑delete failures
`lifecycle_status_distribution`	breakdown across active / deprecated / sunset / superseded / archive – the Registrar genuinely knows this
`lifecycle_transition_rate`	lifecycle transitions over a window – a spike signals a mass migration or a sunset campaign
`source_trust_distribution`	document distribution by trust level – a sanity check on the policy config

Each metric surfaces its own class of bugs: acl_drift_lag – a fired employee still sees documents; tombstone_propagation – a deleted document is still in the index; orphan_chunks – garbage in the vector DB; reconciliation_misses – webhooks dying quietly; lifecycle_transition_rate – a mass migration/sunset that downstream must react to. Without these metrics the Registrar degrades silently, and you'll learn about it from a user or a regulator.
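
Two of these metrics are simple enough to compute directly from registry state. The sketch below uses a nearest‑rank p95 and plain dictionaries as inputs; the function signatures and input shapes are hypothetical, and a production setup would more likely export histograms to a metrics backend than compute percentiles in process.

```python
from typing import Dict, Iterable, Set


def p95(values: Iterable[float]) -> float:
    """Nearest-rank 95th percentile; 0.0 for an empty input."""
    ordered = sorted(values)
    if not ordered:
        return 0.0
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def registrar_staleness_p95(last_reconciled: Dict[str, float], now: float) -> float:
    """p95 of seconds since each document was last reconciled with its source."""
    return p95(now - ts for ts in last_reconciled.values())


def orphan_chunks_count(chunk_to_doc: Dict[str, str], active_doc_ids: Set[str]) -> int:
    """Chunks whose document has no active registry record."""
    return sum(1 for doc_id in chunk_to_doc.values() if doc_id not in active_doc_ids)
```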

What's deliberately missing from this table: `conflict_count`, `multimodal_anchor_misses`, `dead_section_ratio`, `same_title_collisions`. Those are Indexer‑side quality metrics. They show up in part 2, because they require content analysis, not registry analysis.

Post‑mortem · Cloudflare, 18 November 2025

For around 5.5 hours, ChatGPT, Claude, Perplexity, and a chunk of AI tooling went down at the same time (Cloudflare post‑mortem). Root cause: a permissions change in one of the databases caused a query to return duplicate rows, doubling the size of the Bot Management feature file – from ~60 features to >200 – which exceeded a hard‑coded limit in the core proxy. The file propagated across the network and took machines down with it.

Practical takeaway for the Registrar: the embedding/backfill pipeline needs a queue, exponential backoff, a circuit breaker, and a rate‑limit budget. Otherwise, when a provider recovers you get a thundering herd on 429s, the embedding rate‑limit budget is eaten in hours, and recovery time stretches out. Backpressure and observability are how you avoid turning someone else's outage into your own.
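Two of those four pieces, exponential backoff and a circuit breaker, can be sketched as follows. This is a single-threaded illustration under stated assumptions: `RateLimited` stands in for whatever error your provider client raises on a 429, and the thresholds, delays, and class names are hypothetical. The queue and the rate-limit budget are left out.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error raised by the provider client on HTTP 429."""


class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are allowed again as probes;
        # a failing probe re-opens the circuit via record_failure().
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def call_with_backoff(fn: Callable[[], object], breaker: CircuitBreaker,
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    for attempt in range(max_attempts):
        if not breaker.allow():
            # Do not hammer a struggling provider; the job stays queued.
            raise CircuitOpen("provider circuit is open")
        try:
            result = fn()
        except RateLimited:
            breaker.record_failure()
            sleep(backoff_delay(attempt))
            continue
        breaker.record_success()
        return result
    raise RateLimited(f"gave up after {max_attempts} attempts")
```

The jitter is what blunts the thundering herd: without it, every worker that failed at the same moment retries at the same moment.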

§ 10MVP vs Production Registrar

Component	MVP	Production
Idempotency	`(source, source_id, content_hash)` – mandatory	same + normalized_hash + canonical_id
Exact dedup	SHA-256 on content	SHA-256 raw + normalised; near-dup MinHash at the front door
Tombstones	Hard delete on confirmed deletion	reverse mapping subject_id→vectors + audit log + legal hold
ACL	`principals_allow`	deny rules + group expansion + dynamic groups + retrieval‑time enforcement
Freshness	polling + scheduled scan	delta API / webhook + reconciliation loop
Time‑to‑index SLA	best effort	P95 ≤ 10 minutes, with a dedicated staleness metric
Observability	counter of processed documents	full SLO dashboard from §9
Quarantine	flag + retrieval skip	flags + provenance + instruction‑like payload detection
Lifecycle states	active / deleted	active + deprecated + sunset + superseded_by + legacy_archive + audit (`lifecycle_decision_by/at/reason`)
Source_trust	all sources equal	policy config with priorities (KB > docs > release notes > forum), a field in the registry
Lifecycle events	–	`LifecycleStatusChanged` for downstream consumers (Indexer + Retriever)

This MVP→Production jump is not a one‑week task. Expect 2–4 months for a team of 2–3 engineers, and it is painful to retrofit onto 5M documents. Carve the Registrar out as a separate service from day one, even if at MVP stage that is just three modules and a Postgres database.
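To make the first two rows of the table concrete, here is a minimal sketch of the MVP idempotency key `(source, source_id, content_hash)` alongside a normalised hash of the kind the Production column mentions. The normalisation rules (NFKC, lowercasing, whitespace collapsing) and the in-memory `Registry` class are assumptions for illustration; a real Registrar would back this with a unique constraint in its database.

```python
import hashlib
import unicodedata
from typing import Set, Tuple


def content_hash(raw: bytes) -> str:
    """SHA-256 over the raw bytes: exact dedup."""
    return hashlib.sha256(raw).hexdigest()


def normalized_hash(text: str) -> str:
    """SHA-256 over normalised text.
    Assumed normalisation: NFKC, lowercase, collapsed whitespace."""
    normalised = " ".join(unicodedata.normalize("NFKC", text).lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def idempotency_key(source: str, source_id: str, raw: bytes) -> Tuple[str, str, str]:
    """The MVP key from the table: (source, source_id, content_hash)."""
    return (source, source_id, content_hash(raw))


class Registry:
    """Hypothetical in-memory stand-in for the registry table."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str, str]] = set()

    def register(self, source: str, source_id: str, raw: bytes) -> bool:
        """Return True if this version is new, False if it was already registered."""
        key = idempotency_key(source, source_id, raw)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this shape, replaying the same webhook or rescanning the same folder becomes a no-op, while a changed document produces a new key and therefore a new version.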

§ 11What's next

The Registrar is the layer where 80% of enterprise RAG projects collapse, because you can't paper over it with a "good vector DB" or hide it behind a "smart retriever". The contract is simple: document registry + effective ACL + lifecycle + lineage + tombstones, expressed through 9 events and 12 metrics. If the Registrar reliably emits that, it has done its job.

In part 2 – the Indexer: chunking, quality filters (Trafilatura, Talon, KenLM, Gopher/C4), near‑dup and semantic dedup (MinHash/LSH, SemDeDup, D4), multimodal parsing, instruction sanitisation as a content‑processing step. That's also where everything that requires content analysis but keeps getting smuggled into the Registrar belongs: conflict detection across documents, section‑level liveness (dead sections, broken anchors), multimodal binding (deictic markers next to an image), same‑title collisions, chunk‑level scope tags. Anything about "how to cut it up and what to throw away" lives there.

In part 3 – the Retriever: hybrid retrieval, reranking, retrieval‑time ACL enforcement, cascades and cost, protection against prompt injection in the rendering layer, CSP and markdown guardrails. Also: lifecycle penalty in ranking (how to penalise sunset, how to allow opt‑in for legacy_archive), conflict resolution in the final answer, and using source_trust at rerank time.

If you finished this piece thinking "but our diagram shows ingestion as a single arrow…", the article did its job.
