RAG infrastructure is more than a vector database

The vector database is not the system

When teams talk about building RAG, the conversation usually centres on which vector database to use — Pinecone, Weaviate, Qdrant, pgvector. That is the wrong starting point. The vector database is one component in a system that has at least six distinct infrastructure layers, and it is rarely the one that causes production failures.

Enterprise RAG that works reliably under real conditions requires careful design of each layer: ingestion, chunking, embedding, storage, retrieval, and the observability that monitors all of them.

Ingestion: where quality is determined

The ceiling on retrieval quality is set at ingestion time. If the content entering the vector store is poorly chunked, stale or missing metadata, no amount of query-time retrieval tuning will fix it.

Production ingestion needs to solve:

Document parsing: PDFs, Word documents, HTML, Confluence pages and Notion exports all require different parsing strategies. A generic parser that works for plain text will silently lose tables, lists and structured sections from formatted documents.
Chunking strategy: Chunks that are too large dilute relevance. Chunks that are too small lose context. Sentence-window chunking (storing surrounding sentences alongside each chunk) often outperforms naive fixed-size chunking for enterprise documents.
Metadata extraction: Every chunk should carry document-level metadata: source system, document type, author, date, department, access tier. This metadata enables filtered retrieval — a query from the legal team can exclude engineering documents — and audit trails.
Incremental updates: Documents change. The ingestion pipeline needs to detect updates, re-embed modified content and delete embeddings for documents that have been removed from the source system.
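
To make the chunking and metadata points concrete, here is a minimal sketch of sentence-window chunking. It assumes a naive regex sentence splitter and a hypothetical SentenceChunk record; neither comes from a specific library, and a production pipeline would likely use a proper sentence tokeniser.

```python
import re
from dataclasses import dataclass


@dataclass
class SentenceChunk:
    """Hypothetical chunk record used for illustration only."""
    text: str       # the single sentence that would be embedded
    window: str     # the sentence plus its neighbours, handed to the LLM
    metadata: dict  # document-level metadata copied onto every chunk


def sentence_window_chunks(document: str, metadata: dict,
                           window_size: int = 1) -> list[SentenceChunk]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    chunks = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        chunks.append(SentenceChunk(
            text=sentence,
            window=" ".join(sentences[start:end]),
            metadata=dict(metadata),  # copy so chunks do not share one dict
        ))
    return chunks


chunks = sentence_window_chunks(
    "Leave requests go to HR. Approval takes two days. Appeals go to your manager.",
    {"source_system": "confluence", "department": "hr", "access_tier": "all-staff"},
)
```

The idea is that the short sentence is what gets matched at query time, while the wider window is what the model reads, so precision and context do not have to be traded against each other.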

A RAG system that ingested documents six months ago and has not updated them since is not a knowledge base. It is a snapshot.
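
One common way to keep an index from becoming a snapshot is to store a content hash per document at ingestion and compare it on each sync. The sketch below assumes documents are available as an id-to-text mapping and that the hashes from the last run are stored somewhere; plan_sync and content_hash are hypothetical names.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs: dict[str, str],
              indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current source documents with the hashes stored at last ingestion."""
    to_add, to_update = [], []
    for doc_id, text in source_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)       # new document: embed it
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)    # changed document: re-embed it
    # Anything indexed but no longer in the source must have its embeddings deleted.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return {"add": sorted(to_add), "update": sorted(to_update), "delete": sorted(to_delete)}
```

Unchanged documents fall through without being re-embedded, which keeps the embedding cost of each sync proportional to what actually changed.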

Retrieval: beyond cosine similarity

Semantic search via embedding similarity is a good baseline but not sufficient for enterprise knowledge retrieval. Production systems typically layer multiple retrieval strategies:

Hybrid search: Combining dense (embedding) retrieval with sparse (BM25 keyword) retrieval, then merging the two ranked lists with reciprocal rank fusion. Hybrid search often outperforms pure semantic search for queries that include specific names, codes or product identifiers, because exact tokens like these are what keyword matching handles well.
Metadata filtering: Pre-filtering by document type, date range, department or access tier before semantic search reduces result noise and improves precision.
Re-ranking: A cross-encoder re-ranker applied to the top-k candidates before passing them to the LLM significantly improves the relevance of final context. This is often the highest-ROI retrieval improvement available.
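
Reciprocal rank fusion itself is small enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of document ids; the constant k=60 is the value commonly used in practice, and the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked highly
            # by several retrievers accumulate the largest totals.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_results = ["a", "b", "c"]   # e.g. from embedding search
sparse_results = ["c", "a", "d"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Because it uses only ranks, this approach avoids having to normalise cosine similarities against BM25 scores, which live on different scales.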

Test retrieval quality separately from answer quality. A system that produces correct answers from the wrong documents is fragile; it works only until those documents change.
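
A retrieval-only test can be as simple as recall@k over a small hand-labelled set of queries. This is a sketch under the assumption that you have such a set, mapping each query to the ids of documents known to be relevant; the retrieve callable stands in for whatever retrieval function your system exposes.

```python
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the known-relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each evaluation query needs at least one relevant document")
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(eval_set: dict[str, set[str]],
                     retrieve: Callable[[str], list[str]],
                     k: int = 5) -> float:
    """Average recall@k across all labelled queries; no LLM is involved."""
    scores = [recall_at_k(retrieve(query), relevant, k)
              for query, relevant in eval_set.items()]
    return sum(scores) / len(scores)
```

Running this in CI against a fixed evaluation set gives an early signal when a chunking or embedding change degrades retrieval, before answer quality visibly suffers.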

Security and access control

This is the layer most enterprise RAG deployments get wrong. If a user without access to HR records can phrase a question that retrieves HR document chunks, the RAG system is a data leak vector.

Access control for RAG requires:

Storing access tier or permission group metadata on every chunk at ingestion time
Filtering retrieval by the requesting user's permissions before returning results to the LLM
Logging every retrieval event with user identity, query and documents accessed — this is an audit requirement in regulated industries
Regular testing of retrieval boundaries: can a standard employee query retrieve executive or HR content?

Some vector databases support native access control at the namespace or collection level. Others require application-layer filtering. Know which approach your chosen database supports before you design the permission model.
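
For the application-layer case, a minimal sketch of permission filtering with audit logging might look like the following. It assumes each chunk carries the set of groups allowed to read it, and that the user's groups come from your identity provider; Chunk and retrieve_for_user are hypothetical names. Where the database supports metadata filters, pushing the same condition into the query itself is usually preferable to filtering afterwards.

```python
import json
import logging
from dataclasses import dataclass

audit_log = logging.getLogger("rag.audit")


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # written at ingestion time


def filter_by_permissions(chunks: list[Chunk], user_groups: frozenset[str]) -> list[Chunk]:
    # Keep a chunk only if the user shares at least one group with it.
    return [c for c in chunks if c.allowed_groups & user_groups]


def retrieve_for_user(user_id: str, user_groups: frozenset[str],
                      query: str, candidates: list[Chunk]) -> list[Chunk]:
    permitted = filter_by_permissions(candidates, user_groups)
    # Audit record: who asked what, and which chunks were released to the LLM.
    audit_log.info(json.dumps({
        "user_id": user_id,
        "query": query,
        "chunk_ids": [c.chunk_id for c in permitted],
    }))
    return permitted
```

The boundary tests described above then become ordinary unit tests: call retrieve_for_user as a standard employee and assert that no HR-only chunk comes back.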

Observability for RAG systems

RAG failures are subtle. The system continues to return answers — they are just wrong, outdated or hallucinated. Standard uptime monitoring does not catch this. You need metrics specific to retrieval and generation quality:

Retrieval hit rate: What percentage of queries return at least one relevant document above a similarity threshold?
Context utilisation: Is the LLM using the retrieved content or ignoring it and generating from parametric knowledge?
Answer confidence signals: For use cases where hallucination is high-risk, add a self-consistency check or a lightweight classifier that flags low-confidence responses for human review.
Index freshness: Track the age of the most recently ingested document per source. Alert when a source has gone longer than expected without an update.
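
Two of these metrics are straightforward to compute from data most pipelines already log. The sketch below assumes you record the similarity scores returned for each query and the timestamp of the last ingested document per source. It uses the similarity threshold as a proxy for relevance, which is an approximation, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def retrieval_hit_rate(scores_per_query: list[list[float]], threshold: float) -> float:
    """Share of queries with at least one retrieved document at or above the threshold."""
    if not scores_per_query:
        return 0.0
    hits = sum(1 for scores in scores_per_query
               if any(score >= threshold for score in scores))
    return hits / len(scores_per_query)


def stale_sources(last_ingested: dict[str, datetime],
                  max_age: timedelta,
                  now: datetime) -> list[str]:
    """Sources whose newest ingested document is older than max_age."""
    return sorted(source for source, ts in last_ingested.items() if now - ts > max_age)


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last_ingested = {
    "confluence": datetime(2024, 1, 9, tzinfo=timezone.utc),
    "sharepoint": datetime(2023, 12, 1, tzinfo=timezone.utc),
}
overdue = stale_sources(last_ingested, timedelta(days=7), now)
```

In practice the acceptable age would probably differ per source, so a per-source threshold mapping may be a better fit than the single max_age used here.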

Infrastructure choices that matter

For most enterprise RAG deployments, the practical infrastructure stack looks like this:

Ingestion: Cloud-native queue (SQS, Pub/Sub) feeding an ingestion worker (Lambda, Cloud Run) with dead-letter handling
Vector store: Managed service (Pinecone, Weaviate Cloud, pgvector on RDS) — avoid self-hosting unless you have dedicated SRE capacity
Orchestration: LangChain, LlamaIndex or a custom retrieval service — keep this layer thin and testable
Observability: LangSmith, Helicone or a custom logging layer pushing to your existing observability stack
Cache: Redis or DynamoDB for caching frequently repeated query results to reduce vector DB load and model API cost
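
One detail worth getting right in the cache layer is the key. The sketch below is an assumption about how such a key could be built, not a prescribed design: it includes the user's permission groups so that a cached result is never served to someone with different access, and an index version so entries can be invalidated after re-ingestion. The "rag:" prefix and the function name are hypothetical.

```python
import hashlib
import json


def cache_key(query: str, user_groups: frozenset[str], index_version: str) -> str:
    # Normalise case and whitespace so trivially different queries share an entry.
    normalised = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalised, "groups": sorted(user_groups), "v": index_version},
        sort_keys=True,
    )
    return "rag:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The resulting string can be used directly as a Redis key or a DynamoDB partition key.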

The teams that get RAG right in production treat it as a data engineering problem first and a model problem second. The retrieval pipeline is the product; the LLM is the interface.