Building Reliable RAG Pipelines in Production: Lessons from Real Deployments

Retrieval-Augmented Generation looked straightforward in the tutorial. Embed your docs, store in a vector DB, retrieve top-k at query time, stuff into the prompt, done. Ship it.

Then you hit production. Users ask questions in ways you didn’t anticipate. Retrieval returns the wrong chunks. The model confidently answers from outdated context. Latency spikes. Costs balloon.

This post is about making RAG actually work — the patterns that separate demos from production systems.

The Core Problem: Naive Retrieval Fails on Real Queries

The default RAG setup — embed query, cosine similarity search, take top-5 — works well when:

Queries closely match the surface language of documents
Questions are self-contained
Document corpus is small and homogeneous

It fails when:

Users ask vague, conversational questions
Multi-hop questions require combining multiple chunks
The corpus spans many domains with different terminology
Query intent doesn’t match document vocabulary (lexical gap)

Let’s work through the main failure modes and their fixes.

Failure Mode 1: Vocabulary Mismatch

A user asks “how do I cancel my subscription?” Your docs say “terminate your membership” and “account deactivation.” Dense vector search should handle this — but doesn’t always, especially for domain-specific or product-specific terms.

Fix: Hybrid Search (Dense + Sparse)

Combine dense vector search with BM25 (sparse, keyword-based) using a reciprocal rank fusion (RRF) merge:

from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, chunks, embeddings, embed_fn, alpha=0.5):
        """
        alpha: weight for dense scores (0 = pure BM25, 1 = pure dense)
        """
        self.chunks = chunks
        self.embeddings = embeddings
        self.embed_fn = embed_fn
        self.alpha = alpha
        
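        # BM25 index (lowercased whitespace tokens; swap in a real tokenizer as needed)
        tokenized = [chunk.lower().split() for chunk in chunks]
        self.bm25 = BM25Okapi(tokenized)
    
    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        # Dense scores (dot product equals cosine similarity only if embeddings are L2-normalized)
        query_emb = self.embed_fn(query)
        dense_scores = np.dot(self.embeddings, query_emb)
        dense_ranks = np.argsort(-dense_scores)
        
        # Sparse scores (tokenize the query the same way as the index)
        sparse_scores = self.bm25.get_scores(query.lower().split())
        sparse_ranks = np.argsort(-sparse_scores)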
        
        # RRF merge
        rrf_scores = {}
        k = 60  # RRF constant
        
        for rank, idx in enumerate(dense_ranks):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + self.alpha / (k + rank + 1)
        
        for rank, idx in enumerate(sparse_ranks):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - self.alpha) / (k + rank + 1)
        
        sorted_indices = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
        return [self.chunks[i] for i in sorted_indices[:top_k]]

In practice, hybrid search with alpha around 0.6 to 0.7 (slightly favoring dense) often outperforms pure dense or pure sparse search on enterprise corpora. Treat that range as a starting point and tune alpha against your own evaluation set.
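
To see what the weighted RRF merge does in isolation, here is a minimal pure-Python sketch of the same fusion step applied to two hand-written rankings. The function name `weighted_rrf` and the toy document ids are illustrative assumptions, not part of the retriever above; the formula mirrors the one in `HybridRetriever.retrieve`.

```python
def weighted_rrf(dense_ranking, sparse_ranking, alpha=0.65, k=60):
    """Fuse two rankings (lists of doc ids, best first) with weighted RRF.

    alpha weights the dense ranking, (1 - alpha) the sparse ranking.
    k is the usual RRF smoothing constant.
    """
    scores = {}
    for rank, doc_id in enumerate(dense_ranking):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ranking):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings: "c" is the top keyword match but only third in the dense ranking.
dense = ["a", "b", "c", "d"]
sparse = ["c", "a", "d", "b"]

fused = weighted_rrf(dense, sparse)
print(fused)  # ['a', 'c', 'b', 'd']
```

Because RRF only looks at rank positions, the dense and sparse scores never need to be normalized onto a common scale. Here the strong keyword match for "c" lifts it above "b" even though dense search alone ranked "b" higher.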

Failure Mode 2: Conversation Context Lost in Multi-Turn

Query: “What are the pricing tiers?” → retrieval works fine. Follow-up: “Which one supports SSO?” → retrieval fails because the follow-up, taken on its own, never mentions pricing tiers; “one” only makes sense in light of the previous turn.

Fix: Query Rewriting for Conversation

Before retrieval, rewrite the query using conversation history to make it self-contained:

async def rewrite_query_for_retrieval(
    conversation_history: list[dict],
    current_query: str,
    llm_client
) -> str:
    """Rewrite current_query into a standalone question with full context."""
    
    if len(conversation_history) < 2:
        return current_query  # No context to add
    
    history_text = "\n".join([
        f"{msg['role'].capitalize()}: {msg['content']}"
        for msg in conversation_history[-4:]  # Last 2 turns
    ])
    
    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",  # Use cheap model for rewriting
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's follow-up question as a complete, "
                           "standalone question that includes all necessary context "
                           "from the conversation. Output only the rewritten question."
            },
            {
                "role": "user",
                "content": f"Conversation:\n{history_text}\n\n"
                           f"Follow-up question: {current_query}\n\n"
                           "Standalone question:"
            }
        ],
        max_tokens=150
    )
    
    return response.choices[0].message.content.strip()

“Which one supports SSO?” becomes “Which pricing tier of [Product] supports Single Sign-On (SSO)?” — and retrieval works.
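
Since the rewrite adds an LLM call in front of every retrieval, it is worth guarding against timeouts and errors. The sketch below is one possible wrapper, not part of the function above: `rewrite_with_fallback`, the `timeout_s` parameter, and the stub rewriters are all hypothetical names used for illustration. It falls back to the raw query whenever the rewrite fails, times out, or comes back empty.

```python
import asyncio


async def rewrite_with_fallback(rewrite_fn, history, query, timeout_s=2.0):
    """Await rewrite_fn(history, query); return the raw query if anything goes wrong."""
    try:
        rewritten = await asyncio.wait_for(rewrite_fn(history, query), timeout=timeout_s)
    except Exception:
        # Covers timeouts as well as API errors; retrieval proceeds with the raw query.
        return query
    rewritten = (rewritten or "").strip()
    return rewritten if rewritten else query


# Stub rewriters standing in for a real LLM call (assumptions for this demo only).
async def ok_rewriter(history, query):
    return "Which pricing tier supports Single Sign-On (SSO)?"


async def slow_rewriter(history, query):
    await asyncio.sleep(1.0)
    return "too late"


async def broken_rewriter(history, query):
    raise RuntimeError("LLM unavailable")


history = [
    {"role": "user", "content": "What are the pricing tiers?"},
    {"role": "assistant", "content": "We offer Starter, Team, and Enterprise."},
]
raw = "Which one supports SSO?"

ok_result = asyncio.run(rewrite_with_fallback(ok_rewriter, history, raw))
slow_result = asyncio.run(rewrite_with_fallback(slow_rewriter, history, raw, timeout_s=0.05))
broken_result = asyncio.run(rewrite_with_fallback(broken_rewriter, history, raw))
```

In a real pipeline, `rewrite_fn` could be a small lambda that binds the LLM client, for example `lambda h, q: rewrite_query_for_retrieval(h, q, llm_client)`. A degraded retrieval on the raw query is usually preferable to failing the whole request.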

Failure Mode 3: Chunk Boundaries Cut Context

A chunk ends mid-explanation. The critical sentence is in the next chunk. Top-k retrieval gets the first chunk but not the second.

Fix: Sentence-Window Retrieval

Index at sentence level but retrieve at window level:

import re
import numpy as np

class SentenceWindowIndex:
    """
    Index individual sentences for precise retrieval,
    but return surrounding context window when a match is found.
    """
    def __init__(self, documents: list[str], embed_fn, window_size: int = 3):
        # embed_fn (assumed): maps a sentence to a 1-D, L2-normalized vector
        self.window_size = window_size
        self.sentences = []
        self.doc_sentences = []     # sentences grouped by document
        self.doc_sentence_map = []  # (doc_idx, sent_idx) for each stored sentence
        
        for doc_idx, doc in enumerate(documents):
            sents = self._split_sentences(doc)
            self.doc_sentences.append(sents)
            for sent_idx, sent in enumerate(sents):
                self.sentences.append(sent)
                self.doc_sentence_map.append((doc_idx, sent_idx))
        
        self.embeddings = np.array([embed_fn(s) for s in self.sentences])
    
    def retrieve_with_context(self, query_embedding, top_k=5):
        """Returns sentence matches expanded to their surrounding window."""
        # Vector search: rank all sentences by similarity to the query
        scores = self.embeddings @ np.asarray(query_embedding)
        top_matching_indices = np.argsort(-scores)[:top_k]
        
        results = []
        for flat_idx in top_matching_indices:
            doc_idx, s_idx = self.doc_sentence_map[flat_idx]
            sents = self.doc_sentences[doc_idx]
            
            # Expand to window, clamped to the document boundaries
            start = max(0, s_idx - self.window_size)
            end = min(len(sents), s_idx + self.window_size + 1)
            results.append(" ".join(sents[start:end]))
        
        return results
    
    def _split_sentences(self, text: str) -> list[str]:
        # Naive splitter: break after ., ! or ? followed by whitespace
        return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

Indexing at the sentence level gives fine-grained matching, while returning the surrounding window gives the model enough context to use the match. This tends to improve answer quality noticeably on long-form documents.
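
One practical wrinkle: when two matching sentences sit close together, their windows overlap and the same text gets sent to the model twice. The following sketch shows one way to merge overlapping windows within a single document before assembling context. The helper name `merge_windows` and the toy sentences are assumptions for illustration.

```python
def merge_windows(hit_positions, window_size, total_sentences):
    """Expand each hit to a [start, end) span and merge overlapping or touching spans."""
    spans = sorted(
        (max(0, pos - window_size), min(total_sentences, pos + window_size + 1))
        for pos in hit_positions
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it instead of duplicating text.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Ten toy sentences from one document; sentences 2, 3 and 8 matched the query.
sentences = [f"S{i}." for i in range(10)]
spans = merge_windows([2, 3, 8], window_size=1, total_sentences=len(sentences))
passages = [" ".join(sentences[start:end]) for start, end in spans]

print(spans)     # [(1, 5), (7, 10)]
print(passages)  # ['S1. S2. S3. S4.', 'S7. S8. S9.']
```

The trade-off is that merged passages come back in document order rather than relevance order, so if ranking matters downstream you may want to keep the best hit score per merged span and sort on that.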

Failure Mode 4: No Hallucination Detection

Your retrieval returned relevant chunks. The model answered using those chunks. But it also added a detail that wasn’t in any chunk — and it was wrong.

Fix: Citation Grounding + Faithfulness Check

Require citations, then verify them:

GROUNDED_SYSTEM_PROMPT = """
You are a helpful assistant. Answer using ONLY the provided context.
For every claim, cite the source passage in [brackets].
If the context doesn't contain enough information, say so explicitly.
Do not add information from your training data.
"""

async def answer_with_grounding_check(
    query: str,
    context_chunks: list[str],
    llm_client
) -> dict:
    
    context_text = "\n\n".join([
        f"[Source {i+1}]: {chunk}"
        for i, chunk in enumerate(context_chunks)
    ])
    
    # Get grounded answer
    response = await llm_client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {query}"}
        ]
    )
    
    answer = response.choices[0].message.content
    
    # Faithfulness check: ask a second, cheaper model to verify the answer's claims
    faithfulness_check = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context_text}\n\n"
                           f"Answer: {answer}\n\n"
                           "Does every claim in the answer appear in the context? "
                           'Reply with JSON: {"faithful": true or false, '
                           '"issues": ["<unsupported claim>", ...]}'
            }
        ],
        response_format={"type": "json_object"}
    )
    
    import json
    check_result = json.loads(faithfulness_check.choices[0].message.content)
    
    return {
        "answer": answer,
        # Treat a missing "faithful" key as a failed check instead of raising KeyError
        "faithful": bool(check_result.get("faithful", False)),
        "issues": check_result.get("issues", [])
    }

For high-stakes applications (legal, medical, financial), gate answers on the faithfulness check: if it fails, withhold or flag the answer instead of returning it. For consumer products, log the issues and use them to improve your pipeline.
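
The LLM-based check can be paired with a cheap deterministic one: parse the citations out of the answer and confirm they point at sources that were actually provided. The sketch below assumes the model cites in the form `[Source N]`, matching the labels used when building the context; the helper names, the fallback message, and the sample answer are hypothetical.

```python
import re

CITATION_RE = re.compile(r"\[Source (\d+)\]")
FALLBACK = "I couldn't find a reliable answer in the available documents."


def check_citations(answer: str, num_sources: int) -> dict:
    """Extract [Source N] citations and flag any that point outside 1..num_sources."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return {"cited": sorted(cited), "invalid": invalid, "has_citations": bool(cited)}


def gate_answer(answer: str, faithful: bool, report: dict, fallback: str = FALLBACK) -> str:
    """Return the answer only if it passed the faithfulness check and cites valid sources."""
    ok = faithful and report["has_citations"] and not report["invalid"]
    return answer if ok else fallback


# Sample answer citing a source that was never provided (only 2 chunks were retrieved).
sample = "Annual plans have a 30-day refund window [Source 1]. Refunds take 5 days [Source 4]."
report = check_citations(sample, num_sources=2)
final = gate_answer(sample, faithful=True, report=report)

print(report)  # {'cited': [1, 4], 'invalid': [4], 'has_citations': True}
print(final)   # falls back, because [Source 4] does not exist
```

This does not prove a claim is supported by the cited passage, only that the citation is structurally valid, so it complements the LLM check without replacing it.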

Evaluation: You Can’t Improve What You Don’t Measure

Build a RAG Evaluation Set

Your eval set should have:

50-200 question/answer/source triples
Coverage of common query patterns
Adversarial cases (questions where no answer exists, ambiguous queries)

eval_dataset = [
    {
        "question": "What is the refund policy for annual plans?",
        "expected_answer": "30-day money-back guarantee",
        "source_chunk_id": "pricing-faq-chunk-12",
        "difficulty": "easy"
    },
    {
        "question": "Can I downgrade mid-billing cycle?",
        "expected_answer": "Yes, but changes take effect next billing period",
        "source_chunk_id": "billing-chunk-7",
        "difficulty": "medium"
    },
    {
        "question": "What's the best plan for a team of 5?",
        "expected_answer": None,  # Opinion question, no "correct" answer
        "source_chunk_id": None,
        "difficulty": "hard"
    }
]

Key Metrics

Metric	What It Measures	Tool
Retrieval Recall@k	Did the right chunk appear in top-k?	Custom
Answer Relevance	Does the answer address the question?	LLM-as-judge
Faithfulness	Are claims grounded in context?	LLM-as-judge
Context Precision	Are the retrieved chunks relevant to the question?	LLM-as-judge
Latency (p50/p95)	End-to-end response time	Prometheus

Use RAGAS for automated LLM-as-judge evaluation — it operationalizes most of these metrics.
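
Retrieval Recall@k is the one metric in the table marked "Custom", so here is a minimal sketch of how it might be computed over an eval set shaped like the one above. The `recall_at_k` helper and the `fake_retrieve_ids` stub (which stands in for a real retriever returning ranked chunk ids) are assumptions for illustration.

```python
def recall_at_k(eval_items, retrieve_ids, k=5):
    """Fraction of answerable questions whose source chunk appears in the top-k results.

    Items without a source_chunk_id (e.g. unanswerable questions) are skipped,
    since there is no "right chunk" to find for them.
    """
    answerable = [item for item in eval_items if item.get("source_chunk_id")]
    if not answerable:
        return 0.0
    hits = sum(
        1
        for item in answerable
        if item["source_chunk_id"] in retrieve_ids(item["question"])[:k]
    )
    return hits / len(answerable)


eval_items = [
    {"question": "What is the refund policy for annual plans?",
     "source_chunk_id": "pricing-faq-chunk-12"},
    {"question": "Can I downgrade mid-billing cycle?",
     "source_chunk_id": "billing-chunk-7"},
    {"question": "What's the best plan for a team of 5?",
     "source_chunk_id": None},
]

# Canned rankings standing in for a real retriever (hypothetical results).
FAKE_RESULTS = {
    "What is the refund policy for annual plans?":
        ["pricing-faq-chunk-3", "pricing-faq-chunk-12"],
    "Can I downgrade mid-billing cycle?":
        ["billing-chunk-1", "billing-chunk-2", "billing-chunk-7"],
}


def fake_retrieve_ids(question):
    return FAKE_RESULTS.get(question, [])


print(recall_at_k(eval_items, fake_retrieve_ids, k=2))  # 0.5
print(recall_at_k(eval_items, fake_retrieve_ids, k=3))  # 1.0
```

Tracking this number at a few values of k makes it easier to tell whether a bad answer came from retrieval (the chunk never showed up) or from generation (the chunk was there and the model ignored it).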

Production Architecture

User Query
    │
    ▼
Query Rewriter (conversation history → standalone query)
    │
    ▼
Hybrid Retriever (dense + sparse → RRF merge)
    │
    ▼
Reranker (cross-encoder for final top-k selection)
    │
    ▼
Context Assembly (sentence-window expansion + dedup)
    │
    ▼
LLM Generation (grounded prompting + citations)
    │
    ▼
Faithfulness Check (async, for monitoring)
    │
    ▼
Response

Each stage is independently swappable. The reranker (e.g., Cohere Rerank, BGE-reranker) is often the highest-ROI addition for an existing pipeline — it’s a small cross-encoder that re-scores your retrieval candidates and significantly improves precision.
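
As a rough sketch of where a reranker slots in, the function below re-scores retrieval candidates with a pluggable scoring function and keeps the best few. The names `rerank` and `token_overlap` are assumptions. The token-overlap scorer is only a toy placeholder so the example runs without downloading a model; in a real pipeline `score_fn` would wrap a cross-encoder or a hosted rerank API.

```python
import re
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Re-score each (query, candidate) pair and return the top_k candidates."""
    scored = [(score_fn(query, c), i, c) for i, c in enumerate(candidates)]
    # Sort by score descending; ties keep the original retrieval order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:top_k]]


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query tokens that also occur in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "The Starter plan includes email support.",
    "SSO is available on the Enterprise plan.",
    "Invoices are sent monthly.",
]
top = rerank("Which plan supports SSO", candidates, token_overlap, top_k=2)
print(top)
# ['SSO is available on the Enterprise plan.', 'The Starter plan includes email support.']
```

A common pattern is to retrieve generously (say, 20 to 50 candidates) from the hybrid retriever and let the reranker pick the final handful, since the cross-encoder sees the query and passage together and can judge relevance more precisely than either first-stage score.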

Summary: The RAG Quality Ladder

Level 1: Basic dense retrieval → get something working
Level 2: Hybrid search → fix vocabulary mismatch
Level 3: Query rewriting → fix multi-turn conversations
Level 4: Sentence-window retrieval → fix chunk boundary issues
Level 5: Reranking → improve precision before generation
Level 6: Faithfulness checking + eval suite → know what’s breaking

Most production systems live at Levels 2-3. Levels 4-6 are where the quality gap between “okay demo” and “users trust this thing” gets closed.

Start with Level 1, measure everything, and climb the ladder based on what your metrics tell you is actually failing.
