Building Production RAG Systems: Lessons from Enterprise Deployments
Key architectural decisions, retrieval optimization, and evaluation frameworks for RAG at scale.
Introduction
After deploying RAG systems across multiple enterprise contexts at HTC Global Services, I've learned that the gap between a working demo and a production-ready system is substantial. This post shares key lessons from building a multimodal RAG system that achieved 95% first-query resolution across 1,000+ documents.
The Architecture That Worked
Our final architecture used pgvector for vector storage combined with LangChain for orchestration. Here's why this combination proved effective:
Vector Database Selection
We evaluated several options:
- Pinecone: Great managed service, but cost scaled linearly with document count
- Weaviate: Feature-rich but added operational complexity
- pgvector: Won due to existing PostgreSQL infrastructure and surprisingly competitive performance
The key insight: for most enterprise use cases, pgvector's performance is sufficient, and the operational simplicity of staying within PostgreSQL outweighs marginal performance gains from specialized vector DBs.
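Part of that simplicity is that a similarity search in pgvector is just SQL. As a sketch (the table and column names here are illustrative, not our actual schema), a top-k lookup uses pgvector's `<=>` cosine-distance operator:

```python
def top_k_sql(k: int = 5) -> str:
    """Build a pgvector nearest-neighbor query.

    `<=>` is pgvector's cosine-distance operator; the query embedding is
    passed as a bind parameter and cast to the vector type server-side.
    """
    return (
        "SELECT id, content FROM chunks "
        f"ORDER BY embedding <=> %(query_embedding)s::vector LIMIT {k}"
    )
```

Executed through any PostgreSQL driver (e.g. `cur.execute(top_k_sql(), {"query_embedding": str(embedding)})`), this keeps retrieval inside the same database that already holds the documents' relational metadata.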
Chunking Strategy
Our chunking evolved through three iterations:
1. Fixed-size chunks (512 tokens): Simple, but split text mid-sentence and broke context
2. Semantic chunking: Better coherence but inconsistent chunk sizes
3. Hybrid approach: Semantic splitting with size guardrails (256-1024 tokens)
The hybrid approach gave us the best retrieval accuracy while maintaining predictable token usage.
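A minimal sketch of the hybrid strategy, using sentence boundaries as the semantic split points and word counts as a crude stand-in for tokens (the production system used a real tokenizer and richer boundary detection):

```python
import re

def hybrid_chunk(text: str, min_tokens: int = 256, max_tokens: int = 1024) -> list[str]:
    """Semantic splitting with size guardrails.

    Splits on sentence boundaries, packs sentences into chunks of at most
    `max_tokens`, and folds an undersized final chunk into its predecessor.
    Word count stands in for token count here.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token proxy
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        tail = " ".join(current)
        # Guardrail: merge an undersized tail into the previous chunk
        if chunks and count < min_tokens:
            chunks[-1] += " " + tail
        else:
            chunks.append(tail)
    return chunks
```

Because chunks never exceed the ceiling, token usage per retrieved chunk stays predictable, while the sentence-boundary splits preserve local coherence.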
Evaluation Framework
The most impactful decision was creating a gold-question set before optimizing anything else.
Building the Gold Set
We collected 200 real user queries from stakeholders and manually annotated each with:
- The ideal retrieved documents
- The expected answer
- Edge cases and ambiguous queries
This investment paid dividends—every optimization could be measured against ground truth.
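A gold-set entry can be as simple as a small record type. The fields below mirror what we annotated; the names and the sample content are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class GoldQuestion:
    query: str
    ideal_doc_ids: set[str]      # documents that should be retrieved
    expected_answer: str
    is_edge_case: bool = False   # flags ambiguous or tricky queries

# Hypothetical entry for illustration only
gold_set = [
    GoldQuestion(
        query="What is the PTO carryover policy?",
        ideal_doc_ids={"hr-policy-2023", "pto-faq"},
        expected_answer="Up to 5 unused days carry over to the next year.",
    ),
]
```

Keeping the annotations in plain structured records like this makes every later experiment a simple loop: retrieve for each `query`, compare against `ideal_doc_ids`, score.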
Metrics That Mattered
- First-query resolution rate: Did users get their answer without reformulating?
- Retrieval precision@5: Were the top 5 chunks relevant?
- Answer faithfulness: Did the generated answer align with retrieved context?
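Of these, precision@5 is the easiest to automate against the gold set. A sketch (one common convention is to divide by the number of results actually returned when fewer than k come back, which is the choice made here):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks annotated as relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```

First-query resolution and faithfulness required human or LLM-assisted judgment, but this one metric alone was enough to compare chunking and retrieval variants quickly.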
Key Optimizations
1. Query Expansion
Single queries often miss relevant documents due to vocabulary mismatch. We implemented HyDE (Hypothetical Document Embeddings):
```python
def expand_query(query: str) -> list[str]:
    # Generate hypothetical answer
    hypothetical = llm.generate(
        f"Write a paragraph that would answer: {query}"
    )
    return [query, hypothetical]
```
This improved retrieval precision by 23%.
2. Reranking
Initial retrieval with embedding similarity is fast but imprecise. We added a cross-encoder reranking step:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
reranked = reranker.rank(query, initial_results, top_k=5)
```
3. Metadata Filtering
For enterprise documents, metadata (department, date, document type) proved crucial. Pre-filtering by metadata before vector search reduced noise significantly.
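The filter-then-search pattern is easiest to see in a small in-memory sketch (in pgvector, the same idea becomes a `WHERE` clause ahead of the vector `ORDER BY`; the dict layout below is illustrative):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks, query_embedding, filters, top_k=5):
    """Pre-filter by metadata, then rank only the survivors by similarity.

    `chunks` is a list of dicts with "embedding" and "metadata" keys.
    """
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    return sorted(
        candidates,
        key=lambda c: cosine(c["embedding"], query_embedding),
        reverse=True,
    )[:top_k]
```

Filtering first means a semantically similar but out-of-scope document (wrong department, superseded version) never competes for the top-k slots in the first place.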
Lessons Learned
- Start with evaluation: Build your gold set before optimizing
- Hybrid retrieval wins: Combine semantic and keyword search
- Chunking matters more than embedding models: Spend time on chunking strategy
- Monitor in production: User reformulation rate is your north star metric
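The hybrid-retrieval lesson is commonly implemented with reciprocal rank fusion (RRF), which merges a keyword ranking and a vector ranking without needing their scores on a shared scale. A sketch (`k=60` is the conventional RRF constant, not something specific to our system):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids with reciprocal rank fusion.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked well by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feeding the fused list into the cross-encoder reranker described earlier gives the full hybrid pipeline: keyword and semantic recall, then precise ordering.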
Results
After these optimizations:
- 95% first-query resolution (up from 67%)
- 40% reduction in discovery time across 10+ teams
- Deployed across 1,000+ documents with consistent performance
The system now handles natural-language queries across technical documentation, policies, and historical records—significantly reducing the time employees spend searching for information.