RAG Architecture Deep Dive: Building Production-Ready Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications. By combining the power of large language models with external knowledge retrieval, RAG systems deliver accurate, contextual, and up-to-date responses without the need for constant model retraining.

Why RAG Matters in 2026

Traditional LLMs face fundamental limitations:

  • Knowledge cutoff: Training data becomes stale
  • Hallucinations: Confident but incorrect responses
  • No access to private data: Can’t answer company-specific questions
  • Cost: Fine-tuning is expensive and time-consuming

RAG solves these by retrieving relevant context at inference time.

RAG Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                      RAG Pipeline                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User Query ──► Query Processing ──► Retrieval ──► Ranking │
│                                          │                  │
│                                          ▼                  │
│                                   Vector Database           │
│                                          │                  │
│                                          ▼                  │
│                    Context + Query ──► LLM ──► Response     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Building a Production RAG System

Step 1: Document Processing Pipeline

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import hashlib

class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
    
    def process_documents(self, directory: str):
        # Load documents (Python's glob has no brace expansion, so a pattern like
        # "**/*.{pdf,md,txt}" would match nothing; load everything and let the
        # default loader handle each file type)
        loader = DirectoryLoader(
            directory,
            glob="**/*",
            show_progress=True
        )
        documents = loader.load()
        
        # Split into chunks
        chunks = self.text_splitter.split_documents(documents)
        
        # Add metadata
        for chunk in chunks:
            chunk.metadata["chunk_hash"] = hashlib.md5(
                chunk.page_content.encode()
            ).hexdigest()
            chunk.metadata["char_count"] = len(chunk.page_content)
        
        return chunks
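
A quick usage sketch (the ./docs path is a placeholder for wherever your source files live):

processor = DocumentProcessor(chunk_size=1000, chunk_overlap=200)
chunks = processor.process_documents("./docs")
print(f"Produced {len(chunks)} chunks")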

Step 2: Embedding and Vector Storage

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
import numpy as np

class VectorStore:
    def __init__(self, collection_name: str):
        self.client = QdrantClient(host="localhost", port=6333)
        self.openai = OpenAI()
        self.collection_name = collection_name
        self.embedding_model = "text-embedding-3-large"
        self.embedding_dim = 3072
        
        # Create collection if not exists
        self._init_collection()
    
    def _init_collection(self):
        collections = self.client.get_collections().collections
        if self.collection_name not in [c.name for c in collections]:
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=self.embedding_dim,
                    distance=Distance.COSINE
                )
            )
    
    def embed_text(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding
    
    def add_documents(self, chunks: list):
        points = []
        for i, chunk in enumerate(chunks):
            embedding = self.embed_text(chunk.page_content)
            points.append(PointStruct(
                id=i,
                vector=embedding,
                payload={
                    "content": chunk.page_content,
                    "metadata": chunk.metadata
                }
            ))
        
        self.client.upsert(
            collection_name=self.collection_name,
            points=points
        )
    
    def search(self, query: str, top_k: int = 5):
        query_embedding = self.embed_text(query)
        
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=top_k
        )
        
        return results
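
To sanity-check the store, index the chunks from Step 1 and run a test query. This assumes Qdrant is running locally on port 6333 (see the docker-compose file later); the collection name and query are placeholders:

store = VectorStore(collection_name="company_docs")
store.add_documents(chunks)

hits = store.search("What is our refund policy?", top_k=3)
for hit in hits:
    print(round(hit.score, 3), hit.payload["metadata"].get("source", "unknown"))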

Step 3: Query Processing and Rewriting

class QueryProcessor:
    def __init__(self):
        self.openai = OpenAI()
    
    def rewrite_query(self, original_query: str, chat_history: list | None = None) -> str:
        """Rewrite query for better retrieval"""
        
        system_prompt = """You are a query rewriter. Given a user query and optional chat history,
        rewrite the query to be more specific and suitable for semantic search.
        
        Rules:
        1. Expand abbreviations
        2. Add relevant context from chat history
        3. Make implicit references explicit
        4. Keep the rewritten query concise but complete
        
        Output only the rewritten query, nothing else."""
        
        messages = [{"role": "system", "content": system_prompt}]
        
        if chat_history:
            messages.append({
                "role": "user",
                "content": f"Chat history:\n{chat_history}\n\nOriginal query: {original_query}"
            })
        else:
            messages.append({"role": "user", "content": original_query})
        
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0
        )
        
        return response.choices[0].message.content
    
    def generate_hypothetical_answer(self, query: str) -> str:
        """HyDE: Generate hypothetical answer for better retrieval"""
        
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Write a short, factual answer to this question (even if you need to imagine the answer): {query}"
            }],
            temperature=0.7
        )
        
        return response.choices[0].message.content
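
Here is how the two techniques combine in practice; the chat history and questions are made-up examples:

qp = QueryProcessor()

history = [
    {"role": "user", "content": "How do we handle returns for EU customers?"},
    {"role": "assistant", "content": "EU customers can return items within 30 days."}
]

rewritten = qp.rewrite_query("what about the UK?", chat_history=history)
hypothetical = qp.generate_hypothetical_answer(rewritten)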

Step 4: Advanced Retrieval Strategies

from typing import List
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    content: str
    score: float
    metadata: dict

class HybridRetriever:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store
        self.query_processor = QueryProcessor()
    
    def retrieve(
        self,
        query: str,
        top_k: int = 10,
        use_hyde: bool = True,
        use_reranking: bool = True
    ) -> List[RetrievalResult]:
        
        results = []
        
        # Strategy 1: Direct semantic search
        direct_results = self.vector_store.search(query, top_k)
        results.extend(direct_results)
        
        # Strategy 2: HyDE (Hypothetical Document Embeddings)
        if use_hyde:
            hypothetical = self.query_processor.generate_hypothetical_answer(query)
            hyde_results = self.vector_store.search(hypothetical, top_k)
            results.extend(hyde_results)
        
        # Deduplicate
        seen_content = set()
        unique_results = []
        for r in results:
            if r.payload["content"] not in seen_content:
                seen_content.add(r.payload["content"])
                unique_results.append(r)
        
        # Rerank
        if use_reranking:
            unique_results = self._rerank(query, unique_results)
        
        # Convert raw Qdrant points into the RetrievalResult dataclass declared above
        return [
            RetrievalResult(
                content=r.payload["content"],
                score=r.score,
                metadata=r.payload["metadata"]
            )
            for r in unique_results[:top_k]
        ]
    
    def _rerank(self, query: str, results: list) -> list:
        """Rerank using cross-encoder or LLM"""
        
        # Using Cohere Rerank API
        import cohere
        co = cohere.Client()
        
        documents = [r.payload["content"] for r in results]
        
        rerank_response = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=documents,
            top_n=len(documents)
        )
        
        reranked = []
        for r in rerank_response.results:
            original = results[r.index]
            original.score = r.relevance_score
            reranked.append(original)
        
        return sorted(reranked, key=lambda x: x.score, reverse=True)
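
A usage sketch, assuming the store from Step 2 and a COHERE_API_KEY in the environment for reranking:

retriever = HybridRetriever(vector_store=store)
top_docs = retriever.retrieve("How do refunds work for EU customers?", top_k=5)

for doc in top_docs:
    print(round(doc.score, 3), doc.content[:80])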

Step 5: Context Assembly and Generation

class RAGGenerator:
    def __init__(self, retriever: HybridRetriever):
        self.retriever = retriever
        self.openai = OpenAI()
    
    def generate(
        self,
        query: str,
        chat_history: list | None = None,
        max_context_tokens: int = 4000
    ) -> str:
        
        # Retrieve relevant documents
        retrieved_docs = self.retriever.retrieve(query, top_k=10)
        
        # Assemble context with token limit
        context = self._assemble_context(retrieved_docs, max_context_tokens)
        
        # Generate response
        system_prompt = """You are a helpful assistant that answers questions based on the provided context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain relevant information, say so
3. Cite sources when possible using [Source: filename]
4. Be concise but thorough"""
        
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
        
        if chat_history:
            messages = [messages[0]] + chat_history + [messages[-1]]
        
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.3,
            max_tokens=1000
        )
        
        return response.choices[0].message.content
    
    def _assemble_context(self, docs: list, max_tokens: int) -> str:
        context_parts = []
        total_chars = 0
        char_limit = max_tokens * 4  # Approximate
        
        for doc in docs:
            content = doc.content
            source = doc.metadata.get("source", "unknown")
            
            formatted = f"[Source: {source}]\n{content}\n"
            
            if total_chars + len(formatted) > char_limit:
                break
            
            context_parts.append(formatted)
            total_chars += len(formatted)
        
        return "\n---\n".join(context_parts)

Evaluation and Optimization

RAG Evaluation Metrics

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

def evaluate_rag_system(questions, ground_truths, rag_system):
    results = []
    
    for q, gt in zip(questions, ground_truths):
        response = rag_system.generate(q)
        contexts = rag_system.retriever.retrieve(q)
        
        results.append({
            "question": q,
            "answer": response,
            "contexts": [c.payload["content"] for c in contexts],
            "ground_truth": gt
        })
    
    # Evaluate with RAGAS (it expects a HuggingFace Dataset with these column names)
    dataset = Dataset.from_list(results)
    scores = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall
        ]
    )
    
    return scores
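
A minimal smoke test with hand-written questions and reference answers (placeholder data, reusing the rag system from Step 5):

questions = [
    "How long do EU customers have to request a refund?",
    "Which carriers do we use for international shipping?"
]
ground_truths = [
    "EU customers can request a refund within 30 days of delivery.",
    "International orders ship via DHL and FedEx."
]

scores = evaluate_rag_system(questions, ground_truths, rag)
print(scores)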

Performance Optimization Tips

  1. Chunk size tuning: Start with 500-1000 characters, adjust based on content type
  2. Embedding model selection: Balance quality vs. cost (text-embedding-3-large vs. small)
  3. Caching: Cache embeddings and frequent queries (see the sketch after this list)
  4. Batch processing: Process documents in batches for efficiency
  5. Index optimization: Use appropriate vector index types (HNSW, IVF)
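
As a concrete example of tip 3, here is a minimal in-process embedding cache layered over the VectorStore from Step 2. The hashing scheme is an illustrative assumption; a production deployment would more likely back this with Redis or a persistent store.

import hashlib

class CachedVectorStore(VectorStore):
    """VectorStore with a simple in-memory embedding cache (illustrative sketch)."""

    def __init__(self, collection_name: str):
        super().__init__(collection_name)
        self._embedding_cache: dict[str, list[float]] = {}

    def embed_text(self, text: str) -> list[float]:
        # Key on a hash of the text so repeated chunks and queries skip the API call
        key = hashlib.md5(text.encode()).hexdigest()
        if key not in self._embedding_cache:
            self._embedding_cache[key] = super().embed_text(text)
        return self._embedding_cache[key]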

Production Considerations

Scaling Architecture

# docker-compose.yml for production RAG
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    deploy:
      resources:
        limits:
          memory: 8G
  
  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - QDRANT_HOST=qdrant
    deploy:
      replicas: 3

volumes:
  qdrant_data:
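
The VectorStore class above hard-codes localhost, while the compose file injects QDRANT_HOST into the API container. A small adjustment, sketched below with the environment variable name matching the compose file, lets the same code run in both settings:

import os
from qdrant_client import QdrantClient

# Fall back to localhost for local development, use QDRANT_HOST inside the container
client = QdrantClient(
    host=os.environ.get("QDRANT_HOST", "localhost"),
    port=6333
)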

Conclusion

Building production RAG systems requires careful attention to each component:

  • Document processing: Clean, well-chunked data is foundational
  • Embedding quality: Choose appropriate models for your domain
  • Retrieval strategies: Combine multiple approaches (semantic, HyDE, reranking)
  • Context assembly: Optimize for relevance within token limits
  • Evaluation: Continuously measure and improve

RAG is not a one-size-fits-all solution—iterate on each component based on your specific use case and evaluation metrics.


Building a RAG system? Share your architecture choices in the comments!

If you found this post helpful, a like and an ad click would be much appreciated :)