Building Production-Ready RAG Systems: Beyond the Basics
Every developer has built a basic RAG system. You chunk some documents, embed them, shove them in a vector store, and retrieve the top-k nearest neighbors. It works fine in demos. Then you put it in production and watch it fail in creative ways.
This post covers the techniques that separate toy RAG implementations from systems that actually work reliably at scale.
The Problem with Naive RAG
Naive RAG fails in predictable ways:
- Chunking too coarsely → context is diluted, irrelevant content retrieved
- Chunking too finely → context is fragmented, answers lack coherence
- Pure semantic search → misses exact keyword matches (product names, error codes)
- Fixed top-k retrieval → always returns k chunks, even when few or none are relevant
- No evaluation → you don’t know when quality degrades
Let’s solve each one.
1. Advanced Chunking Strategies
Semantic Chunking
Instead of fixed-size chunks, split where topic changes — not at arbitrary character counts.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)
chunks = text_splitter.split_text(document_text)
Hierarchical Chunking (Parent-Child)
Store both large parent chunks and small child chunks. Retrieve small chunks for precision, but return the parent for context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Parent splitter: large context windows
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Child splitter: small, precise chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings
)
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
Proposition-Based Chunking
Convert documents into atomic, self-contained propositions. Each chunk is a single verifiable fact.
proposition_prompt = """
Decompose the following passage into a list of simple, factual propositions.
Each proposition should:
- Be a single, complete statement
- Be understandable without additional context
- Not contain pronouns that reference other propositions
Passage: {text}
Return as a JSON array of strings.
"""
import json

def extract_propositions(text: str) -> list[str]:
    response = llm.invoke(proposition_prompt.format(text=text))
    return json.loads(response.content)
2. Hybrid Search: Combining Dense and Sparse Retrieval
Pure vector search misses exact matches. Pure BM25 misses semantic similarity. Combine them.
Reciprocal Rank Fusion (RRF)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# Sparse retriever (keyword search)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Dense retriever (semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Hybrid with RRF fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]  # Tune based on your data
)
results = ensemble_retriever.invoke("What is the refund policy?")
When to Weight Toward BM25
- Queries with specific model numbers, error codes, or product names
- Legal/medical documents with precise terminology
- Code documentation (function names, API endpoints)
When to Weight Toward Dense
- Conceptual questions (“How does X work?”)
- Multi-language retrieval
- Paraphrase-heavy queries
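One way to act on these guidelines is to choose the ensemble weights per query instead of fixing them globally. A rough heuristic sketch — the regex and the weight values are my own illustration, not a standard recipe:

```python
import re

# Tokens that usually demand exact keyword matching
EXACT_MATCH_PATTERN = re.compile(
    r"[A-Z]{2,}-\d+"        # IDs like "ERR-404", "SKU-1234"
    r"|\b0x[0-9a-fA-F]+\b"  # hex error codes
    r"|\w+\(\)"             # function names like "connect()"
)

def retrieval_weights(query: str) -> tuple[float, float]:
    """Return (bm25_weight, dense_weight) for the hybrid retriever."""
    if EXACT_MATCH_PATTERN.search(query):
        return (0.7, 0.3)   # exact identifiers present: favor keyword search
    return (0.3, 0.7)       # conceptual phrasing: favor semantic search
```

In practice a small intent classifier can replace the regex, but even this crude routing avoids the worst failure mode: semantically "close" chunks drowning out the one chunk that contains the exact error code.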
3. Reranking: The Second Stage That Changes Everything
Retrieval gets you 20-50 candidates. Reranking selects the best 3-5.
Cross-Encoder Reranking
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank_documents(query: str, documents: list, top_k: int = 5) -> list:
    """Rerank documents using a cross-encoder."""
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, score in ranked[:top_k]]
LLM-as-Judge Reranking (for high-stakes scenarios)
def llm_rerank(query: str, documents: list, top_k: int = 3) -> list:
    """Use an LLM to select the most relevant documents."""
    doc_list = "\n\n".join([
        f"[{i+1}] {doc.page_content[:500]}"
        for i, doc in enumerate(documents)
    ])
    response = llm.invoke(f"""
Query: {query}

Documents:
{doc_list}

Select the {top_k} most relevant document indices for answering the query.
Return ONLY a JSON array of indices, e.g.: [2, 5, 1]
""")
    indices = json.loads(response.content)
    return [documents[i-1] for i in indices if 1 <= i <= len(documents)]
4. Query Transformation
Don’t send the raw user query to the retriever. Transform it first.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer and embed that instead of the question.
def hyde_retrieve(query: str, retriever) -> list:
    """HyDE: generate a hypothetical answer for better retrieval."""
    hypothesis = llm.invoke(f"""
Write a detailed, factual paragraph that would directly answer this question.
Write as if you're certain of the answer.

Question: {query}
Answer:
""").content
    # Embed the hypothesis, not the query
    return retriever.invoke(hypothesis)
Query Decomposition
Break complex queries into sub-queries, retrieve for each, then synthesize.
def decompose_and_retrieve(query: str, retriever) -> list:
    """Decompose complex queries into sub-queries."""
    sub_queries = llm.invoke(f"""
Break this complex question into 2-3 simpler sub-questions.
Return as a JSON array of strings.

Question: {query}
""").content
    sub_queries = json.loads(sub_queries)
    all_docs = []
    for sub_query in sub_queries:
        docs = retriever.invoke(sub_query)
        all_docs.extend(docs)
    # Deduplicate by content
    seen = set()
    unique_docs = []
    for doc in all_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)
    return unique_docs
5. RAG Evaluation: Know When Your System Breaks
Without evaluation, you’re flying blind. Use RAGAS for automated evaluation.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

def evaluate_rag_pipeline(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str]
) -> dict:
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    }
    dataset = Dataset.from_dict(data)
    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,       # Does the answer stay true to the context?
            answer_relevancy,   # Is the answer relevant to the question?
            context_precision,  # Are the retrieved docs relevant?
            context_recall,     # Did we retrieve all necessary info?
        ],
    )
    return result
Setting Up Continuous Evaluation
import mlflow

def log_rag_metrics(metrics: dict, run_name: str):
    """Log RAG quality metrics to MLflow."""
    with mlflow.start_run(run_name=run_name):
        mlflow.log_metrics({
            "faithfulness": metrics["faithfulness"],
            "answer_relevancy": metrics["answer_relevancy"],
            "context_precision": metrics["context_precision"],
            "context_recall": metrics["context_recall"],
        })
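Logging alone doesn't stop a regression from shipping. A simple quality gate on top of these metrics turns the evaluation into a release check — the thresholds below are illustrative starting points, not universal recommendations:

```python
# Minimum acceptable scores; tune per domain and dataset
QUALITY_THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}

def check_quality_gate(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that fell below their thresholds."""
    return [
        name
        for name, threshold in QUALITY_THRESHOLDS.items()
        if metrics.get(name, 0.0) < threshold
    ]

failures = check_quality_gate({
    "faithfulness": 0.91,
    "answer_relevancy": 0.78,  # below threshold
    "context_precision": 0.80,
    "context_recall": 0.82,
})
```

Run this in CI against a fixed evaluation set: if `failures` is non-empty, block the deploy and investigate before users see the regression.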
6. Production Architecture
Here’s a battle-tested architecture for production RAG:
User Query
↓
Query Preprocessing
├── Intent Classification
├── Query Expansion
└── HyDE Generation
↓
Hybrid Retrieval (BM25 + Dense)
↓
Reranking (Cross-Encoder)
↓
Context Assembly
├── Deduplication
├── Context Windowing
└── Metadata Enrichment
↓
Generation (LLM)
↓
Post-Processing
├── Citation Extraction
├── Hallucination Check
└── Response Formatting
↓
User Response + Sources
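Stitched together, the diagram above is just a sequence of stage functions. A minimal orchestration sketch — every stage here is a placeholder standing in for the components described earlier, and all names are my own:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGPipeline:
    """Thin orchestration layer: each field is one stage of the diagram."""
    transform_query: Callable[[str], str]          # expansion / HyDE
    retrieve: Callable[[str], list[str]]           # hybrid retrieval
    rerank: Callable[[str, list[str]], list[str]]  # cross-encoder stage
    generate: Callable[[str, list[str]], str]      # LLM generation

    def answer(self, query: str) -> tuple[str, list[str]]:
        """Run the full pipeline and return (answer, sources)."""
        expanded = self.transform_query(query)
        candidates = self.retrieve(expanded)
        # Rerank against the original query, not the expanded one
        context = self.rerank(query, candidates)
        return self.generate(query, context), context

# Wiring with stub stages to show the data flow
pipeline = RAGPipeline(
    transform_query=lambda q: q + " (expanded)",
    retrieve=lambda q: ["chunk1", "chunk2", "chunk3"],
    rerank=lambda q, docs: docs[:2],
    generate=lambda q, ctx: f"answer from {len(ctx)} chunks",
)
answer, sources = pipeline.answer("What is the refund policy?")
```

Keeping the stages as swappable functions makes it cheap to A/B a new reranker or retriever without touching the rest of the pipeline.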
Key Infrastructure Components
| Component | Tool Options |
|---|---|
| Vector Store | Qdrant, Weaviate, pgvector |
| Sparse Search | Elasticsearch, Typesense |
| Reranker | Cohere Rerank, cross-encoder |
| Evaluation | RAGAS, DeepEval, TruLens |
| Observability | LangSmith, Arize Phoenix |
Common Pitfalls and How to Avoid Them
1. Embedding Model Mismatch
Always use the same embedding model for indexing and querying. Changing models requires re-indexing.
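A cheap guard is to record the model name alongside the index and verify it at query time. This sketch uses a metadata convention of my own invention; the key point is failing loudly instead of silently returning garbage neighbors:

```python
class EmbeddingModelMismatch(Exception):
    """Raised when query-time embeddings don't match the indexed ones."""

def check_embedding_model(index_metadata: dict, query_model: str) -> None:
    """Compare the model recorded at index time with the one used now."""
    indexed_model = index_metadata.get("embedding_model")
    if indexed_model != query_model:
        raise EmbeddingModelMismatch(
            f"Index was built with {indexed_model!r} but queries use "
            f"{query_model!r}; re-index before switching models."
        )
```

Call this once at startup, before serving any traffic, so a mismatched deployment fails immediately rather than degrading retrieval quality quietly.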
2. Missing Metadata Filtering
Add metadata to chunks and filter before semantic search:
results = vectorstore.similarity_search(
    query,
    k=20,
    filter={"document_type": "policy", "year": {"$gte": 2024}}
)
3. No Fallback for Low-Confidence Retrievals
def retrieve_with_confidence(query: str, threshold: float = 0.7) -> tuple:
    # Note: depending on the vector store, "score" may be a similarity
    # (higher is better) or a distance (lower is better) -- check yours
    # and flip the comparison if needed.
    results = vectorstore.similarity_search_with_score(query, k=5)
    high_confidence = [(doc, score) for doc, score in results if score >= threshold]
    if not high_confidence:
        return [], "I don't have reliable information to answer this question."
    return [doc for doc, _ in high_confidence], None
Conclusion
Production RAG is an engineering discipline, not just a few API calls. The difference between a demo and a reliable system comes down to:
- Smart chunking — hierarchical and semantic
- Hybrid search — dense + sparse, fused intelligently
- Reranking — don’t trust retrieval ordering
- Query transformation — HyDE and decomposition
- Continuous evaluation — measure faithfulness and relevancy
Start with these techniques and you’ll save yourself weeks of debugging strange retrieval failures in production.
