Building Production RAG Applications: A Practical Guide to Vector Databases and Retrieval
Everyone demos RAG. Few ship it reliably. The gap between a working prototype and a production system that actually gives useful answers is enormous. This guide covers what you need to bridge that gap.
What RAG Actually Is (And Isn’t)
Retrieval-Augmented Generation (RAG) connects an LLM to your data. Instead of fine-tuning a model (expensive, brittle), you:
- Embed your documents into vectors
- Store those vectors in a database
- Retrieve relevant chunks at query time
- Generate an answer using the LLM + retrieved context
RAG is not a silver bullet. It fails silently when retrieval is bad, and no amount of prompt engineering fixes fundamentally broken retrieval.
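To make step 4 concrete, and to build in a guard against silent retrieval failure, here is a minimal generation sketch. The retrieve callable and the 0.3 similarity threshold are illustrative assumptions; wire in whichever retrieval function you build in the sections below.

from openai import OpenAI

client = OpenAI()

def answer(question, retrieve, min_score=0.3):
    """Generate an answer from retrieved context, or refuse when retrieval is weak."""
    # `retrieve` is assumed to return (chunk_text, similarity_score) pairs, best first
    hits = retrieve(question)
    if not hits or hits[0][1] < min_score:
        return "I don't know. Nothing relevant was found in the indexed documents."
    context = "\n\n".join(text for text, _ in hits)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. "
             "If the context does not contain the answer, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content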
Choosing a Vector Database
Pinecone
The managed-first option. Zero infrastructure to manage:
import pinecone
from openai import OpenAI

pc = pinecone.Pinecone(api_key="your-key")
index = pc.Index("documents")
client = OpenAI()

# Upsert documents
def embed_and_store(docs):
    for doc in docs:
        embedding = client.embeddings.create(
            input=doc["text"],
            model="text-embedding-3-large"
        ).data[0].embedding
        index.upsert([(
            doc["id"],
            embedding,
            {"text": doc["text"], "source": doc["source"]}
        )])

# Query
def search(query, top_k=5):
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-large"
    ).data[0].embedding
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return [match.metadata["text"] for match in results.matches]
Pros: Zero ops, fast, serverless pricing option
Cons: Vendor lock-in, can get expensive at scale, limited filtering
pgvector (PostgreSQL Extension)
If you already run Postgres, this is the pragmatic choice:
-- Enable the extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    source VARCHAR(255),
    -- text-embedding-3-large can return reduced dimensions (e.g. dimensions=1536);
    -- pgvector's HNSW/IVFFlat indexes only support vector columns up to 2,000 dims
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Query: find similar documents
SELECT content, source,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE source = 'engineering-docs'
ORDER BY embedding <=> $1::vector
LIMIT 5;
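A hedged sketch of running that query from Python, assuming the psycopg 3 driver, the pgvector Python adapter, and the schema above (the connection string and source filter are placeholders):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

openai_client = OpenAI()
conn = psycopg.connect("dbname=app")  # placeholder connection string
register_vector(conn)  # register the vector type so numpy arrays can be passed directly

def search_pg(query, source="engineering-docs", top_k=5):
    # Keep the requested dimensions in sync with the vector(1536) column above
    emb = np.array(openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-large",
        dimensions=1536,
    ).data[0].embedding)
    return conn.execute(
        """
        SELECT content, source, 1 - (embedding <=> %s) AS similarity
        FROM documents
        WHERE source = %s
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (emb, source, emb, top_k),
    ).fetchall()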
Pros: No new infrastructure, SQL filtering, ACID transactions, familiar tooling
Cons: Scaling requires Postgres scaling, not as fast as purpose-built solutions at >10M vectors
Weaviate
A full-featured vector database with built-in ML capabilities:
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=weaviate.auth.AuthApiKey("your-key"),
)

# Create collection with auto-vectorization
documents = client.collections.create(
    name="Document",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large"
    ),
    properties=[
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="source", data_type=wvc.config.DataType.TEXT),
    ],
)

# Hybrid search (vector + keyword)
results = documents.query.hybrid(
    query="kubernetes deployment strategies",
    alpha=0.7,  # 0=keyword, 1=vector
    limit=5,
)
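Two pieces the snippet leaves out are writing data and reading the hybrid results. A brief sketch, assuming the v4 client objects created above (the sample document text is made up):

# Insert a document; with text2vec-openai configured, Weaviate embeds it server-side
documents.data.insert({
    "content": "Blue/green deployments keep two identical environments and switch traffic at the load balancer.",
    "source": "engineering-docs",
})

# Each returned object carries its stored properties
for obj in results.objects:
    print(obj.properties["source"], obj.properties["content"][:80])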
Pros: Hybrid search (vector + BM25), built-in vectorization, GraphQL API, multi-tenancy
Cons: More complex to operate, higher resource requirements
The Part Everyone Gets Wrong: Chunking
Bad chunking is the #1 reason RAG systems give bad answers. Here’s what works:
Don’t: Fixed-Size Chunks
# This is lazy and lossy
chunks = [text[i:i+500] for i in range(0, len(text), 500)]
You’ll split sentences, break context, and lose meaning at chunk boundaries.
Do: Semantic Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
chunks = splitter.split_text(document)
Better — respects paragraph and sentence boundaries with overlap for context preservation.
Best: Document-Structure-Aware Chunking
def chunk_by_structure(markdown_text):
    """Split by headers, keeping hierarchy context."""
    sections = []
    current_section = {"headers": [], "content": ""}
    for line in markdown_text.split("\n"):
        if line.startswith("#"):
            if current_section["content"].strip():
                sections.append(current_section.copy())
            level = len(line.split(" ")[0])  # number of leading '#' characters
            current_section["headers"] = (
                current_section["headers"][:level - 1] + [line.lstrip("# ")]
            )
            current_section["content"] = ""
        else:
            current_section["content"] += line + "\n"
    # Don't drop the final section after the loop ends
    if current_section["content"].strip():
        sections.append(current_section.copy())
    # Each chunk includes its header hierarchy for context
    return [
        {
            "text": f"{' > '.join(s['headers'])}\n\n{s['content']}",
            "metadata": {"section": s["headers"][-1] if s["headers"] else ""},
        }
        for s in sections
        if s["content"].strip()
    ]
Respects document structure. Headers, code blocks, and tables stay intact.
Advanced Retrieval Strategies
Hybrid Search
Combine vector similarity with keyword matching:
def hybrid_search(query, alpha=0.7):
    """
    alpha: 0.0 = pure keyword (BM25)
           1.0 = pure vector similarity
           0.7 = mostly semantic with keyword boost
    """
    vector_results = vector_search(query, top_k=20)
    keyword_results = bm25_search(query, top_k=20)

    # Reciprocal Rank Fusion
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + alpha / (rank + 60)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) / (rank + 60)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]
Hybrid search catches cases where exact terminology matters (error codes, product names) that pure vector search misses.
Re-ranking
Retrieve broadly, then re-rank with a cross-encoder for precision:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def search_and_rerank(query, initial_k=20, final_k=5):
    # Broad retrieval
    candidates = vector_search(query, top_k=initial_k)
    # Precise re-ranking
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]
Re-ranking typically improves answer relevance by 15-30%.
Query Transformation
The user’s query is often not the best search query:
def expand_query(original_query, llm):
    """Generate multiple search queries from user question."""
    prompt = f"""Given this user question, generate 3 diverse search queries
that would help find relevant information. Return one per line.
Question: {original_query}"""
    queries = llm.generate(prompt).strip().split("\n")
    queries.append(original_query)  # Keep original too

    # Search with all queries and deduplicate results
    all_results = []
    seen_ids = set()
    for q in queries:
        for result in vector_search(q, top_k=5):
            if result.id not in seen_ids:
                all_results.append(result)
                seen_ids.add(result.id)
    return all_results
Production Checklist
Before shipping RAG to production:
- Evaluation pipeline: Build a test set of questions + expected answers. Measure retrieval precision and answer quality. Automate it (a minimal sketch follows this list).
- Monitoring: Track retrieval scores, answer latency, and user feedback. Low similarity scores = your retrieval is failing.
- Fallback behavior: When retrieval confidence is low, say “I don’t know” instead of hallucinating.
- Chunk metadata: Store source, page number, date, and author. Users need to verify answers.
- Index maintenance: Documents change. Build a pipeline to detect updates and re-embed affected chunks.
- Cost tracking: Embedding calls add up. Cache embeddings for repeated queries.
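For that first item, the simplest useful metric is retrieval hit rate (recall@k): does the chunk that answers the question show up in the top-k results? A minimal sketch, with a hypothetical hand-labelled test set and whichever search function you ship (assumed here to return result objects with an id):

# Hypothetical test set: each case names the chunk that should be retrieved
TEST_SET = [
    {"question": "How do we roll back a bad deploy?", "expected_chunk_id": "runbook-042"},
    {"question": "What is the on-call escalation policy?", "expected_chunk_id": "policy-007"},
]

def recall_at_k(search_fn, test_set, k=5):
    """Fraction of questions whose expected chunk appears in the top-k results."""
    hits = 0
    for case in test_set:
        results = search_fn(case["question"], top_k=k)
        if any(r.id == case["expected_chunk_id"] for r in results):
            hits += 1
    return hits / len(test_set)

# Run in CI; fail the build if retrieval quality regresses below a chosen floor
assert recall_at_k(search, TEST_SET) >= 0.8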
Cost Breakdown (Typical Production App)
For a system with 1M document chunks and 10K queries/day:
| Component | Monthly Cost |
|---|---|
| Embeddings (OpenAI text-embedding-3-large) | ~$50 |
| Vector DB (Pinecone Serverless) | ~$70 |
| LLM generation (GPT-4o) | ~$300 |
| Re-ranking model (self-hosted) | ~$50 (GPU) |
| Total | ~$470/month |
Using pgvector on existing infrastructure can cut the vector DB cost to near zero.
Conclusion
Production RAG is an engineering challenge, not a prompt engineering challenge. The retrieval pipeline — chunking, embedding, indexing, querying, re-ranking — determines 80% of your answer quality.
Start with pgvector if you already have Postgres. Use semantic chunking from day one. Add hybrid search and re-ranking when you need better precision. And always, always build an evaluation pipeline before you ship.
The LLM is the easy part. Retrieval is where the real work lives.
RAG done right feels like magic. RAG done wrong feels like a broken search engine with extra steps.
