Building Production RAG Applications: A Practical Guide to Vector Databases and Retrieval

Learn how to build reliable Retrieval Augmented Generation (RAG) applications using vector databases. Covers Pinecone, Weaviate, pgvector, chunking strategies, and real-world optimization.



Building Production RAG Applications with Vector Databases

Everyone demos RAG. Few ship it reliably. The gap between a working prototype and a production system that actually gives useful answers is enormous. This guide covers what you need to bridge that gap.


What RAG Actually Is (And Isn’t)

Retrieval Augmented Generation connects an LLM to your data. Instead of fine-tuning a model (expensive, brittle), you:

  1. Embed your documents into vectors
  2. Store those vectors in a database
  3. Retrieve relevant chunks at query time
  4. Generate an answer using the LLM + retrieved context

RAG is not a silver bullet. It fails silently when retrieval is bad, and no amount of prompt engineering fixes fundamentally broken retrieval.
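
To make step 4 concrete, here is a minimal sketch of the generation call. It assumes a search(query, top_k) retrieval function like the ones built in the sections below and the OpenAI chat API; the model name and prompt wording are placeholders for whatever you actually use.

from openai import OpenAI

client = OpenAI()

def answer(query):
    # Step 3: retrieve; `search` is any of the retrieval functions built below
    context = "\n\n---\n\n".join(search(query, top_k=5))

    # Step 4: generate, grounded in the retrieved chunks
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. "
                           "If the context is insufficient, say you don't know.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content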

Choosing a Vector Database

Pinecone

The managed-first option. Zero infrastructure to manage:

import pinecone
from openai import OpenAI

pc = pinecone.Pinecone(api_key="your-key")
index = pc.Index("documents")
client = OpenAI()

# Upsert documents
def embed_and_store(docs):
    for doc in docs:
        embedding = client.embeddings.create(
            input=doc["text"],
            model="text-embedding-3-large"
        ).data[0].embedding

        index.upsert([(
            doc["id"],
            embedding,
            {"text": doc["text"], "source": doc["source"]}
        )])

# Query
def search(query, top_k=5):
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-large"
    ).data[0].embedding

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return [match.metadata["text"] for match in results.matches]

Pros: Zero ops, fast, serverless pricing option
Cons: Vendor lock-in, can get expensive at scale, limited filtering

pgvector (PostgreSQL Extension)

If you already run Postgres, this is the pragmatic choice:

-- Enable the extension
CREATE EXTENSION vector;

-- Create table with vector column
-- Note: pgvector's HNSW and IVFFlat indexes support at most 2,000 dimensions
-- for the vector type, so either request shorter embeddings
-- (text-embedding-3-large accepts a `dimensions` parameter, e.g. 1536)
-- or store the column as halfvec for larger vectors.
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    source VARCHAR(255),
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query: find similar documents
SELECT content, source,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE source = 'engineering-docs'
ORDER BY embedding <=> $1::vector
LIMIT 5;
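
Calling this from application code takes only a few lines. A sketch assuming psycopg 3 and the pgvector Python adapter; the connection string and function name are illustrative:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag", autocommit=True)
register_vector(conn)  # teaches psycopg to send/receive vector values

def search_pg(query_embedding, source, top_k=5):
    # Same similarity query as above, parameterized from Python
    return conn.execute(
        """
        SELECT content, source, 1 - (embedding <=> %s) AS similarity
        FROM documents
        WHERE source = %s
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (np.array(query_embedding), source, np.array(query_embedding), top_k),
    ).fetchall()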

Pros: No new infrastructure, SQL filtering, ACID transactions, familiar tooling
Cons: Scaling requires Postgres scaling, not as fast as purpose-built solutions at >10M vectors

Weaviate

A full-featured vector database with built-in ML capabilities:

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-key"),
)

# Create collection with auto-vectorization
documents = client.collections.create(
    name="Document",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large"
    ),
    properties=[
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="source", data_type=wvc.config.DataType.TEXT),
    ],
)

# Hybrid search (vector + keyword)
results = documents.query.hybrid(
    query="kubernetes deployment strategies",
    alpha=0.7,  # 0=keyword, 1=vector
    limit=5,
)
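
Loading data and reading hybrid results is equally short. A sketch using the v4 client's data.insert and the objects list on query results; the example document is made up, and note that server-side vectorization also needs your OpenAI key supplied in the connection headers.

# Insert a document; Weaviate calls the configured vectorizer for you
documents.data.insert(properties={
    "content": "Blue/green deployments keep two production environments and switch traffic between them.",
    "source": "engineering-docs",
})

# Read hybrid search results
for obj in results.objects:
    print(obj.properties["source"], obj.properties["content"][:80])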

Pros: Hybrid search (vector + BM25), built-in vectorization, GraphQL API, multi-tenancy
Cons: More complex to operate, higher resource requirements

The Part Everyone Gets Wrong: Chunking

Bad chunking is the #1 reason RAG systems give bad answers. Here’s what works:

Don’t: Fixed-Size Chunks

# This is lazy and lossy
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

You’ll split sentences, break context, and lose meaning at chunk boundaries.

Do: Semantic Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

chunks = splitter.split_text(document)

Better — respects paragraph and sentence boundaries with overlap for context preservation.

Best: Document-Structure-Aware Chunking

def chunk_by_structure(markdown_text):
    """Split by headers, keeping hierarchy context."""
    sections = []
    current_section = {"headers": [], "content": ""}
    in_code_block = False

    for line in markdown_text.split("\n"):
        if line.startswith("```"):
            # Track fenced code blocks so "# comments" inside them
            # aren't mistaken for headers
            in_code_block = not in_code_block
            current_section["content"] += line + "\n"
        elif line.startswith("#") and not in_code_block:
            if current_section["content"].strip():
                sections.append(current_section.copy())
            level = len(line.split(" ")[0])
            current_section["headers"] = (
                current_section["headers"][:level-1] + [line.lstrip("# ")]
            )
            current_section["content"] = ""
        else:
            current_section["content"] += line + "\n"

    # Don't drop the final section
    if current_section["content"].strip():
        sections.append(current_section)

    # Each chunk includes its header hierarchy for context
    return [
        {
            "text": f"{' > '.join(s['headers'])}\n\n{s['content']}",
            "metadata": {"section": s["headers"][-1] if s["headers"] else ""},
        }
        for s in sections
        if s["content"].strip()
    ]

Respects document structure. Headers, code blocks, and tables stay intact.
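
The structured chunks drop straight into the ingestion code from earlier. A short sketch reusing embed_and_store from the Pinecone section; the file name and id scheme are illustrative:

# Feed structure-aware chunks into the earlier embed_and_store()
with open("runbook.md") as f:
    sections = chunk_by_structure(f.read())

docs = [
    {
        "id": f"runbook-{i}",
        "text": chunk["text"],
        "source": "runbook.md",
    }
    for i, chunk in enumerate(sections)
]
embed_and_store(docs)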


Advanced Retrieval Strategies

Hybrid Search

Combine vector similarity with keyword matching:

def hybrid_search(query, alpha=0.7):
    """
    alpha: 0.0 = pure keyword (BM25)
           1.0 = pure vector similarity
           0.7 = mostly semantic with keyword boost
    """
    vector_results = vector_search(query, top_k=20)
    keyword_results = bm25_search(query, top_k=20)

    # Reciprocal Rank Fusion
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (rank + 60)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (rank + 60)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]

Hybrid search catches cases where exact terminology matters (error codes, product names) that pure vector search misses.
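
The bm25_search half doesn't need a separate search service. Here's a sketch using the rank_bm25 package over an in-memory corpus; the all_docs list of chunk objects is an assumption standing in for however you hold your chunks.

from rank_bm25 import BM25Okapi

# Build a keyword index over the same chunks that were embedded
corpus_tokens = [doc.text.lower().split() for doc in all_docs]
bm25 = BM25Okapi(corpus_tokens)

def bm25_search(query, top_k=20):
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(all_docs)), key=lambda i: scores[i], reverse=True)
    return [all_docs[i] for i in ranked[:top_k]]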

Re-ranking

Retrieve broadly, then re-rank with a cross-encoder for precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def search_and_rerank(query, initial_k=20, final_k=5):
    # Broad retrieval
    candidates = vector_search(query, top_k=initial_k)

    # Precise re-ranking
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]

Re-ranking typically improves answer relevance by 15-30%.

Query Transformation

The user’s query is often not the best search query:

def expand_query(original_query, llm):
    """Generate multiple search queries from user question."""
    prompt = f"""Given this user question, generate 3 diverse search queries
    that would help find relevant information. Return one per line.

    Question: {original_query}"""

    queries = llm.generate(prompt).strip().split("\n")
    queries.append(original_query)  # Keep original too

    # Search with all queries and deduplicate results
    all_results = []
    seen_ids = set()
    for q in queries:
        for result in vector_search(q, top_k=5):
            if result.id not in seen_ids:
                all_results.append(result)
                seen_ids.add(result.id)

    return all_results

Production Checklist

Before shipping RAG to production:

  • Evaluation pipeline: Build a test set of questions + expected answers. Measure retrieval precision and answer quality. Automate it (a minimal version is sketched after this list).
  • Monitoring: Track retrieval scores, answer latency, and user feedback. Low similarity scores = your retrieval is failing.
  • Fallback behavior: When retrieval confidence is low, say “I don’t know” instead of hallucinating.
  • Chunk metadata: Store source, page number, date, and author. Users need to verify answers.
  • Index maintenance: Documents change. Build a pipeline to detect updates and re-embed affected chunks.
  • Cost tracking: Embedding calls add up. Cache embeddings for repeated queries.
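
A minimal version of that evaluation pipeline measures hit rate at k over a hand-labeled test set. The test-set format and the vector_search helper are assumptions; adapt them to your retriever and labels.

# Hit rate @ k: how often at least one labeled-relevant chunk is retrieved
test_set = [
    {"question": "How do we roll back a deployment?", "relevant_ids": {"runbook-12"}},
    # ... more labeled question/chunk pairs
]

def hit_rate_at_k(test_set, k=5):
    hits = 0
    for case in test_set:
        retrieved = {doc.id for doc in vector_search(case["question"], top_k=k)}
        if retrieved & case["relevant_ids"]:
            hits += 1
    return hits / len(test_set)

print(f"hit rate@5: {hit_rate_at_k(test_set):.2%}")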

Cost Breakdown (Typical Production App)

For a system with 1M document chunks and 10K queries/day:

Component                                     Monthly Cost
Embeddings (OpenAI text-embedding-3-large)    ~$50
Vector DB (Pinecone Serverless)               ~$70
LLM generation (GPT-4o)                       ~$300
Re-ranking model (self-hosted)                ~$50 (GPU)
Total                                         ~$470/month

Using pgvector on existing infrastructure can cut the vector DB cost to near zero.

Conclusion

Production RAG is an engineering challenge, not a prompt engineering challenge. The retrieval pipeline — chunking, embedding, indexing, querying, re-ranking — determines 80% of your answer quality.

Start with pgvector if you already have Postgres. Use semantic chunking from day one. Add hybrid search and re-ranking when you need better precision. And always, always build an evaluation pipeline before you ship.

The LLM is the easy part. Retrieval is where the real work lives.


RAG done right feels like magic. RAG done wrong feels like a broken search engine with extra steps.
