RAG Architecture Deep Dive: Building Production-Ready Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications. By combining the power of large language models with external knowledge retrieval, RAG systems deliver accurate, contextual, and up-to-date responses without the need for constant model retraining.
Why RAG Matters in 2026
Traditional LLMs face fundamental limitations:
- Knowledge cutoff: Training data becomes stale
- Hallucinations: Confident but incorrect responses
- No access to private data: Can’t answer company-specific questions
- Cost: Fine-tuning is expensive and time-consuming
RAG solves these by retrieving relevant context at inference time.
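At its core, the loop is small: embed the query, pull the most similar chunks from a vector store, and prepend them to the prompt before calling the model. Here is a minimal sketch of that flow; embed, vector_db, and llm are placeholders for whatever stack you use, and the rest of this post builds each piece out properly.

def answer_with_rag(query: str, vector_db, embed, llm, top_k: int = 5) -> str:
    # 1. Retrieve: find the chunks most similar to the query
    hits = vector_db.search(vector=embed(query), limit=top_k)
    context = "\n\n".join(hit.text for hit in hits)

    # 2. Augment: put the retrieved context into the prompt
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    # 3. Generate: the LLM answers grounded in the retrieved context
    return llm.complete(prompt)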
RAG Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│                         RAG Pipeline                          │
├──────────────────────────────────────────────────────────────┤
│                                                                │
│ User Query ──► Query Processing ──► Retrieval ──► Ranking     │
│                                         │                      │
│                                         ▼                      │
│                                  Vector Database               │
│                                         │                      │
│                                         ▼                      │
│                    Context + Query ──► LLM ──► Response        │
│                                                                │
└──────────────────────────────────────────────────────────────┘
Building a Production RAG System
Step 1: Document Processing Pipeline
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib

class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def process_documents(self, directory: str):
        # Load documents. Path-style globs don't support {pdf,md,txt} brace
        # patterns, so load each extension separately.
        documents = []
        for pattern in ("**/*.pdf", "**/*.md", "**/*.txt"):
            loader = DirectoryLoader(directory, glob=pattern, show_progress=True)
            documents.extend(loader.load())

        # Split into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Add metadata for deduplication and inspection
        for chunk in chunks:
            chunk.metadata["chunk_hash"] = hashlib.md5(
                chunk.page_content.encode()
            ).hexdigest()
            chunk.metadata["char_count"] = len(chunk.page_content)

        return chunks
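A quick usage sketch, assuming your documents live in a local ./docs folder (the path is just an example):

processor = DocumentProcessor(chunk_size=1000, chunk_overlap=200)
chunks = processor.process_documents("./docs")
print(f"Produced {len(chunks)} chunks; first chunk metadata: {chunks[0].metadata}")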
Step 2: Embedding and Vector Storage
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

class VectorStore:
    def __init__(self, collection_name: str):
        self.client = QdrantClient(host="localhost", port=6333)
        self.openai = OpenAI()
        self.collection_name = collection_name
        self.embedding_model = "text-embedding-3-large"
        self.embedding_dim = 3072

        # Create collection if it does not exist
        self._init_collection()

    def _init_collection(self):
        collections = self.client.get_collections().collections
        if self.collection_name not in [c.name for c in collections]:
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=self.embedding_dim,
                    distance=Distance.COSINE
                )
            )

    def embed_text(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

    def add_documents(self, chunks: list):
        points = []
        for i, chunk in enumerate(chunks):
            embedding = self.embed_text(chunk.page_content)
            points.append(PointStruct(
                id=i,
                vector=embedding,
                payload={
                    "content": chunk.page_content,
                    "metadata": chunk.metadata
                }
            ))
        self.client.upsert(
            collection_name=self.collection_name,
            points=points
        )

    def search(self, query: str, top_k: int = 5):
        query_embedding = self.embed_text(query)
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=top_k
        )
        return results
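Indexing the chunks from Step 1 is then straightforward; this assumes a Qdrant instance is already running on localhost:6333, and the collection name is arbitrary:

store = VectorStore(collection_name="docs")
store.add_documents(chunks)
hits = store.search("How do chunk sizes affect retrieval?", top_k=3)  # example query

Note that add_documents embeds one chunk per API call; for large corpora you would batch requests (the OpenAI embeddings endpoint accepts a list of inputs) to cut latency and cost.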
Step 3: Query Processing and Rewriting
class QueryProcessor:
    def __init__(self):
        self.openai = OpenAI()

    def rewrite_query(self, original_query: str, chat_history: list = None) -> str:
        """Rewrite query for better retrieval"""
        system_prompt = """You are a query rewriter. Given a user query and optional chat history,
rewrite the query to be more specific and suitable for semantic search.

Rules:
1. Expand abbreviations
2. Add relevant context from chat history
3. Make implicit references explicit
4. Keep the rewritten query concise but complete

Output only the rewritten query, nothing else."""

        messages = [{"role": "system", "content": system_prompt}]
        if chat_history:
            messages.append({
                "role": "user",
                "content": f"Chat history:\n{chat_history}\n\nOriginal query: {original_query}"
            })
        else:
            messages.append({"role": "user", "content": original_query})

        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0
        )
        return response.choices[0].message.content

    def generate_hypothetical_answer(self, query: str) -> str:
        """HyDE: Generate hypothetical answer for better retrieval"""
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Write a short, factual answer to this question (even if you need to imagine the answer): {query}"
            }],
            temperature=0.7
        )
        return response.choices[0].message.content
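A small example of both strategies in action (the chat history shown is made up for illustration):

qp = QueryProcessor()
rewritten = qp.rewrite_query(
    "what about the large one?",
    chat_history=[{"role": "user", "content": "Compare the OpenAI embedding models"}]
)
hypothetical = qp.generate_hypothetical_answer("What is HyDE in retrieval?")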
Step 4: Advanced Retrieval Strategies
from typing import List
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    content: str
    score: float
    metadata: dict

class HybridRetriever:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store
        self.query_processor = QueryProcessor()

    def retrieve(
        self,
        query: str,
        top_k: int = 10,
        use_hyde: bool = True,
        use_reranking: bool = True
    ) -> List[RetrievalResult]:
        results = []

        # Strategy 1: Direct semantic search
        direct_results = self.vector_store.search(query, top_k)
        results.extend(self._to_results(direct_results))

        # Strategy 2: HyDE (Hypothetical Document Embeddings)
        if use_hyde:
            hypothetical = self.query_processor.generate_hypothetical_answer(query)
            hyde_results = self.vector_store.search(hypothetical, top_k)
            results.extend(self._to_results(hyde_results))

        # Deduplicate by content
        seen_content = set()
        unique_results = []
        for r in results:
            if r.content not in seen_content:
                seen_content.add(r.content)
                unique_results.append(r)

        # Rerank
        if use_reranking:
            unique_results = self._rerank(query, unique_results)

        return unique_results[:top_k]

    @staticmethod
    def _to_results(points) -> List[RetrievalResult]:
        """Convert raw Qdrant hits into RetrievalResult objects"""
        return [
            RetrievalResult(
                content=p.payload["content"],
                score=p.score,
                metadata=p.payload["metadata"]
            )
            for p in points
        ]

    def _rerank(self, query: str, results: List[RetrievalResult]) -> List[RetrievalResult]:
        """Rerank using a cross-encoder via the Cohere Rerank API"""
        import cohere

        co = cohere.Client()
        documents = [r.content for r in results]
        rerank_response = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=documents,
            top_n=len(documents)
        )
        reranked = []
        for r in rerank_response.results:
            original = results[r.index]
            original.score = r.relevance_score
            reranked.append(original)
        return sorted(reranked, key=lambda x: x.score, reverse=True)
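Usage, assuming the VectorStore from Step 2 has already been populated and a Cohere API key is available in the environment when reranking is enabled:

retriever = HybridRetriever(store)
results = retriever.retrieve("How does HyDE improve retrieval?", top_k=5)
for r in results:
    print(f"{r.score:.3f}  {r.content[:80]}")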
Step 5: Context Assembly and Generation
class RAGGenerator:
    def __init__(self, retriever: HybridRetriever):
        self.retriever = retriever
        self.openai = OpenAI()

    def generate(
        self,
        query: str,
        chat_history: list = None,
        max_context_tokens: int = 4000
    ) -> str:
        # Retrieve relevant documents
        retrieved_docs = self.retriever.retrieve(query, top_k=10)

        # Assemble context within the token budget
        context = self._assemble_context(retrieved_docs, max_context_tokens)

        # Generate response
        system_prompt = """You are a helpful assistant that answers questions based on the provided context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain relevant information, say so
3. Cite sources when possible using [Source: filename]
4. Be concise but thorough"""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
        if chat_history:
            messages = [messages[0]] + chat_history + [messages[-1]]

        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.3,
            max_tokens=1000
        )
        return response.choices[0].message.content

    def _assemble_context(self, docs: List[RetrievalResult], max_tokens: int) -> str:
        context_parts = []
        total_chars = 0
        char_limit = max_tokens * 4  # Rough heuristic: ~4 characters per token

        for doc in docs:
            source = doc.metadata.get("source", "unknown")
            formatted = f"[Source: {source}]\n{doc.content}\n"
            if total_chars + len(formatted) > char_limit:
                break
            context_parts.append(formatted)
            total_chars += len(formatted)

        return "\n---\n".join(context_parts)
Evaluation and Optimization
RAG Evaluation Metrics
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

def evaluate_rag_system(questions, ground_truths, rag_system):
    records = []
    for q, gt in zip(questions, ground_truths):
        response = rag_system.generate(q)
        contexts = rag_system.retriever.retrieve(q)
        records.append({
            "question": q,
            "answer": response,
            "contexts": [c.content for c in contexts],
            "ground_truth": gt
        })

    # Evaluate with RAGAS (it expects a Hugging Face Dataset, not a plain list)
    scores = evaluate(
        Dataset.from_list(records),
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall
        ]
    )
    return scores
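A small evaluation run might look like the following; the question and ground truth here are placeholders, and in practice you would use a held-out set written against your own corpus:

questions = ["What chunk size does the guide recommend?"]
ground_truths = ["Start with 500-1000 characters and tune per content type."]
scores = evaluate_rag_system(questions, ground_truths, rag)
print(scores)  # faithfulness, answer_relevancy, context_precision, context_recall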
Performance Optimization Tips
- Chunk size tuning: Start with 500-1000 characters, adjust based on content type
- Embedding model selection: Balance quality vs. cost (text-embedding-3-large vs. small)
- Caching: Cache embeddings and frequent queries (a minimal sketch follows this list)
- Batch processing: Process documents in batches for efficiency
- Index optimization: Use appropriate vector index types (HNSW, IVF)
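As a concrete example of the caching tip above, one minimal approach is to key embeddings by a hash of the text and keep them in an in-process dict; swap the dict for Redis or a disk cache in production. This is a sketch layered on top of the VectorStore from Step 2, not part of it:

import hashlib

class CachedEmbedder:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store
        self._cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.vector_store.embed_text(text)
        return self._cache[key]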
Production Considerations
Scaling Architecture
# docker-compose.yml for production RAG
version: '3.8'

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    deploy:
      resources:
        limits:
          memory: 8G

  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - QDRANT_HOST=qdrant
    deploy:
      replicas: 3

volumes:
  qdrant_data:
Conclusion
Building production RAG systems requires careful attention to each component:
- Document processing: Clean, well-chunked data is foundational
- Embedding quality: Choose appropriate models for your domain
- Retrieval strategies: Combine multiple approaches (semantic, HyDE, reranking)
- Context assembly: Optimize for relevance within token limits
- Evaluation: Continuously measure and improve
RAG is not a one-size-fits-all solution—iterate on each component based on your specific use case and evaluation metrics.
Building a RAG system? Share your architecture choices in the comments!
