GenAI · January 15, 2025 · 8 min read

Building Production RAG Systems: Lessons from Enterprise Deployments

Key architectural decisions, retrieval optimization, and evaluation frameworks for RAG at scale.

Introduction

After deploying RAG systems across multiple enterprise contexts at HTC Global Services, I've learned that the gap between a working demo and a production-ready system is substantial. This post shares key lessons from building a multimodal RAG system that achieved 95% first-query resolution across 1,000+ documents.

The Architecture That Worked

Our final architecture used pgvector for vector storage combined with LangChain for orchestration. Here's why this combination proved effective:

Vector Database Selection

We evaluated several options:

  • Pinecone: Great managed service, but cost scaled linearly with document count
  • Weaviate: Feature-rich but added operational complexity
  • pgvector: Won due to existing PostgreSQL infrastructure and surprisingly competitive performance

The key insight: for most enterprise use cases, pgvector's performance is sufficient, and the operational simplicity of staying within PostgreSQL outweighs marginal performance gains from specialized vector DBs.
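For reference, a pgvector similarity search is a single SQL statement. A sketch of the query shape, assuming a `chunks` table with an `embedding` column (table and column names are illustrative, not our exact schema):

```python
# `<=>` is pgvector's cosine-distance operator; the `%s` placeholders are
# filled with the query embedding and result limit by the DB driver
# (e.g. psycopg).
SIMILARITY_QUERY = """
    SELECT id, content, embedding <=> %s AS distance
    FROM chunks
    ORDER BY embedding <=> %s
    LIMIT %s;
"""
```

With an IVFFlat or HNSW index on `embedding`, this runs inside the same PostgreSQL instance as the rest of the application data, which is exactly the operational simplicity argument above.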

Chunking Strategy

Our chunking evolved through three iterations:

  1. Fixed-size chunks (512 tokens): Simple but broke context at sentence boundaries
  2. Semantic chunking: Better coherence but inconsistent chunk sizes
  3. Hybrid approach: Semantic splitting with size guardrails (256-1024 tokens)

The hybrid approach gave us the best retrieval accuracy while maintaining predictable token usage.
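The guardrail logic can be sketched in a few lines. This is a minimal illustration, not our production splitter: paragraph boundaries stand in for semantic boundaries, and sizes are in characters rather than tokens:

```python
def hybrid_chunk(text: str, min_size: int = 256, max_size: int = 1024) -> list[str]:
    """Split on semantic boundaries, then enforce size guardrails."""
    chunks: list[str] = []
    for segment in text.split("\n\n"):  # stand-in for a semantic splitter
        segment = segment.strip()
        if not segment:
            continue
        # Oversized segment: hard-split at max_size.
        while len(segment) > max_size:
            chunks.append(segment[:max_size])
            segment = segment[max_size:]
        # Undersized segment: merge into the previous chunk if it fits.
        if chunks and len(segment) < min_size and len(chunks[-1]) + len(segment) <= max_size:
            chunks[-1] += "\n\n" + segment
        else:
            chunks.append(segment)
    return chunks
```

The merge step is what keeps token usage predictable: tiny semantic fragments get absorbed into a neighbor instead of producing near-empty chunks.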

Evaluation Framework

The most impactful decision was creating a gold-question set before optimizing anything else.

Building the Gold Set

We collected 200 real user queries from stakeholders and manually annotated:

  • The ideal retrieved documents
  • The expected answer
  • Edge cases and ambiguous queries

This investment paid dividends—every optimization could be measured against ground truth.
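Concretely, each gold-set entry was a small structured record covering those three annotations. A sketch of the shape (field names and the sample entry are illustrative, not our exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class GoldQuestion:
    query: str                          # real user query, verbatim
    ideal_doc_ids: list[str]            # documents a perfect retriever would return
    expected_answer: str                # reference answer for faithfulness checks
    is_edge_case: bool = False          # ambiguous or adversarial query
    notes: str = ""                     # annotator comments

gold_set = [
    GoldQuestion(
        query="What is the PTO carryover policy?",
        ideal_doc_ids=["hr-policy-2024"],
        expected_answer="Up to five unused days carry over to the next year.",
    ),
]
```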

Metrics That Mattered

  • First-query resolution rate: Did users get their answer without reformulating?
  • Retrieval precision@5: Were the top 5 chunks relevant?
  • Answer faithfulness: Did the generated answer align with retrieved context?
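Of the three, precision@5 is the easiest to compute directly against the gold set; a minimal sketch:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are in the gold relevant set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```

For example, if two of the top five retrieved chunks appear in the gold annotations for a query, precision@5 is 0.4; averaging over the full gold set gives the system-level number.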

Key Optimizations

1. Query Expansion

Single queries often miss relevant documents due to vocabulary mismatch. We implemented HyDE (Hypothetical Document Embeddings):

def expand_query(query: str) -> list[str]:
    # `llm` is the application's LLM client; any text-generation API works here.
    # Generate a hypothetical answer, then embed it alongside the original
    # query so retrieval can match on answer-side vocabulary.
    hypothetical = llm.generate(
        f"Write a paragraph that would answer: {query}"
    )
    return [query, hypothetical]

This improved retrieval precision by 23%.

2. Reranking

Initial retrieval with embedding similarity is fast but imprecise. We added a cross-encoder reranking step:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly -- more accurate
# than bi-encoder similarity, but slower, so we apply it only to the initial
# candidate set rather than the whole corpus.
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
reranked = reranker.rank(query, initial_results, top_k=5)

3. Metadata Filtering

For enterprise documents, metadata (department, date, document type) proved crucial. Pre-filtering by metadata before vector search reduced noise significantly.
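The pre-filtering step is conceptually simple: restrict the candidate set by metadata first, then run vector search only over what remains. A pure-Python sketch of the idea (in production this is a SQL `WHERE` clause alongside the pgvector search; the field names here are illustrative):

```python
def prefilter(docs: list[dict], **criteria) -> list[dict]:
    """Keep only documents whose metadata matches every criterion."""
    return [
        d for d in docs
        if all(d.get("metadata", {}).get(key) == value
               for key, value in criteria.items())
    ]

docs = [
    {"id": "a", "metadata": {"department": "HR", "doc_type": "policy"}},
    {"id": "b", "metadata": {"department": "Eng", "doc_type": "design"}},
]
candidates = prefilter(docs, department="HR")
# Vector search then runs only over `candidates`.
```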

Lessons Learned

  1. Start with evaluation: Build your gold set before optimizing
  2. Hybrid retrieval wins: Combine semantic and keyword search
  3. Chunking matters more than embedding models: Spend time on chunking strategy
  4. Monitor in production: User reformulation rate is your north star metric
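On the second lesson: "hybrid retrieval" means fusing keyword (e.g. BM25) and vector rankings. Reciprocal rank fusion is one simple, robust way to combine them; a sketch (the constant 60 is the conventional RRF smoothing value, not something we tuned):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d1", "d3", "d2"]   # e.g. from BM25
vector_hits = ["d2", "d1", "d4"]    # e.g. from pgvector
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.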

Results

After these optimizations:

  • 95% first-query resolution (up from 67%)
  • 40% reduction in discovery time across 10+ teams
  • Deployed across 1,000+ documents with consistent performance

The system now handles natural-language queries across technical documentation, policies, and historical records—significantly reducing the time employees spend searching for information.