Vector Databases for AI: Powering RAG Systems at Scale
As Large Language Models (LLMs) become central to modern applications, the need for efficient retrieval systems has skyrocketed. Vector databases have emerged as the critical infrastructure powering Retrieval-Augmented Generation (RAG) systems.
Why Vector Databases?
Traditional databases search by exact matches or keywords. Vector databases search by meaning—finding semantically similar content even when words differ.
Key Capabilities
- Semantic Search: Find similar content based on meaning
- High-Dimensional Indexing: Handle millions of embedding vectors
- Real-Time Updates: Add new documents without reindexing
- Hybrid Search: Combine vector and keyword search
Understanding Embeddings
Embeddings are numerical representations of text (or images, audio) that capture semantic meaning:
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

# Similar concepts have similar embeddings
embed1 = get_embedding("The cat sat on the mat")
embed2 = get_embedding("A feline rested on the rug")
# These vectors will be close in vector space!
```
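To make "close in vector space" concrete, here is a quick cosine-similarity check on those embeddings. This is a minimal sketch; the exact scores vary by model, and the unrelated sentence is just an illustrative contrast.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed1, embed2))  # high: near-identical meaning
print(cosine_similarity(embed1, get_embedding("Quarterly revenue grew 12%")))  # noticeably lower
```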
Popular Vector Databases Compared
1. Pinecone
Fully managed, highly scalable:
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-key")

# Create index
pc.create_index(
    name="documents",
    dimension=3072,  # text-embedding-3-large
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("documents")

# Upsert vectors
index.upsert(vectors=[
    {"id": "doc1", "values": embedding, "metadata": {"source": "wiki"}},
])

# Query
results = index.query(vector=query_embedding, top_k=10)
```
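Queries can also be narrowed with a metadata filter before the vector comparison. A small sketch, assuming the upserted "source" metadata above and Pinecone's MongoDB-style filter operators:

```python
# Only consider vectors whose metadata matches the filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"source": {"$eq": "wiki"}},
    include_metadata=True,
)
```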
2. Weaviate
Open-source with built-in vectorization:
```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Define collection with auto-vectorization
# (text2vec_openai requires an OpenAI API key available to Weaviate,
#  e.g. via connection headers or environment variables)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ]
)

# Add documents (automatically vectorized)
collection.data.insert({"content": "Your document text", "source": "manual"})

# Semantic search
results = collection.query.near_text(query="search query", limit=5)
```
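Weaviate also exposes hybrid (keyword + vector) search on the same collection. A minimal sketch with the v4 query API, where alpha blends pure keyword scoring (0.0) and pure vector scoring (1.0):

```python
# Hybrid search: BM25 and vector scores fused with weight alpha
results = collection.query.hybrid(query="search query", alpha=0.5, limit=5)
for obj in results.objects:
    print(obj.properties["content"])
```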
3. ChromaDB
Lightweight, perfect for development:
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents with automatic embedding
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "a"}, {"source": "b"}]
)

# Query
results = collection.query(query_texts=["search query"], n_results=5)
```
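ChromaDB can also pre-filter on metadata before the similarity search, using the metadata added above:

```python
# Restrict the semantic search to documents whose metadata matches
results = collection.query(
    query_texts=["search query"],
    n_results=5,
    where={"source": "a"},
)
```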
Building a Production RAG System
Here's an end-to-end RAG implementation that ties ChromaDB and the OpenAI API together:
```python
from openai import OpenAI
import chromadb

class RAGSystem:
    def __init__(self):
        self.llm = OpenAI()
        self.db = chromadb.PersistentClient(path="./data")
        self.collection = self.db.get_or_create_collection(
            name="knowledge_base",
            metadata={"hnsw:space": "cosine"}
        )

    def add_documents(self, documents: list[dict]):
        """Add documents to the knowledge base."""
        self.collection.add(
            documents=[d["content"] for d in documents],
            ids=[d["id"] for d in documents],
            metadatas=[{"source": d.get("source", "unknown")} for d in documents]
        )

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        """Retrieve relevant documents."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k
        )
        return results["documents"][0]

    def generate(self, query: str) -> str:
        """Generate answer using retrieved context."""
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)

        # Generate response
        response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""Answer based on the following context:
{context}
If the context doesn't contain relevant information, say so."""
                },
                {"role": "user", "content": query}
            ]
        )
        return response.choices[0].message.content

# Usage
rag = RAGSystem()
rag.add_documents([
    {"id": "1", "content": "Python 3.13 introduces JIT compilation...", "source": "docs"},
    {"id": "2", "content": "FastAPI is a modern web framework...", "source": "docs"},
])
answer = rag.generate("What's new in Python 3.13?")
```
Advanced Techniques
1. Hybrid Search
Combine semantic and keyword search:
```python
import numpy as np
from rank_bm25 import BM25Okapi

class HybridSearch:
    def __init__(self, vector_db, documents):
        # vector_db is assumed to expose query(text) -> one similarity score per
        # document, aligned with `documents`; adapt this to your client
        self.vector_db = vector_db
        self.bm25 = BM25Okapi([doc.split() for doc in documents])
        self.documents = documents

    def search(self, query: str, alpha: float = 0.5, top_k: int = 5):
        # Vector and BM25 keyword scores, each normalized to [0, 1]
        vector_scores = self._normalize(np.asarray(self.vector_db.query(query), dtype=float))
        bm25_scores = self._normalize(np.asarray(self.bm25.get_scores(query.split()), dtype=float))
        # Combine with weighted fusion: alpha weights the semantic signal
        combined = alpha * vector_scores + (1 - alpha) * bm25_scores
        top = np.argsort(combined)[::-1][:top_k]
        return [(self.documents[i], float(combined[i])) for i in top]

    @staticmethod
    def _normalize(scores):
        return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)
```
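In practice, alpha is worth tuning per corpus: values near 1.0 lean on the semantic signal, values near 0.0 lean on exact keyword matches, and something in between usually works best for queries that mix names, IDs, and natural language.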
2. Chunking Strategies
Split long documents into overlapping chunks so each embedding covers a focused span of text:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smart chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(long_document)
```
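Assuming long_document holds your raw text and the RAGSystem instance from earlier is available, the chunks can be loaded straight into the knowledge base (the chunk IDs here are made up for illustration):

```python
# Each chunk becomes its own retrievable document
rag.add_documents([
    {"id": f"chunk-{i}", "content": chunk, "source": "long_document"}
    for i, chunk in enumerate(chunks)
])
```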
3. Reranking
Improve results with a reranking model:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 3):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```
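In a RAG pipeline, reranking typically sits between retrieval and generation: over-retrieve, then keep only the highest-scoring chunks. A minimal sketch against the RAGSystem above, assuming its knowledge base is large enough to return 20 candidates:

```python
query = "What's new in Python 3.13?"
candidates = rag.retrieve(query, top_k=20)  # retrieve generously
best_chunks = rerank(query, candidates)     # keep the top 3 by cross-encoder score
```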
Performance Optimization
Indexing Strategies
| Index Type | Best For | Trade-offs |
|---|---|---|
| HNSW | General use | Higher memory, fast search |
| IVF | Large datasets | Requires training, good recall |
| PQ | Memory-constrained | Some accuracy loss |
| Flat | Small datasets | Exact, but slow at scale |
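The databases above manage these index types for you; to see the trade-offs directly, here is a purely illustrative sketch using the FAISS library (not covered elsewhere in this article), with made-up dimensions and random vectors standing in for real embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 1536                                             # illustrative embedding dimension
vectors = np.random.rand(5000, dim).astype("float32")  # stand-in for real embeddings

# Flat: exact nearest-neighbor search, no training, slow at large scale
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# HNSW: graph-based approximate search, fast queries, higher memory
hnsw = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
hnsw.add(vectors)

# IVF + PQ: clusters vectors and compresses them, requires a training pass
quantizer = faiss.IndexFlatL2(dim)
ivf_pq = faiss.IndexIVFPQ(quantizer, dim, 64, 48, 8)  # 64 clusters, 48 subquantizers, 8 bits
ivf_pq.train(vectors)
ivf_pq.add(vectors)

# All index types share the same search interface
distances, ids = hnsw.search(vectors[:1], 10)
```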
Scaling Tips
- Batch Operations: Insert in batches of 100-1000 vectors (see the batching sketch after this list)
- Async Queries: Use async clients for concurrent searches
- Dimension Reduction: Use Matryoshka embeddings for smaller vectors
- Filtering: Apply metadata filters before vector search
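As a sketch of batched inserts into a ChromaDB collection like the one used above (the batch size and helper name are illustrative, not a library API):

```python
def add_in_batches(collection, documents, ids, metadatas, batch_size=500):
    """Insert documents into a ChromaDB collection in fixed-size batches."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            ids=ids[start:end],
            metadatas=metadatas[start:end],
        )
```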
Monitoring and Observability
Track your RAG system performance:
```python
import time
from dataclasses import dataclass

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    generation_latency_ms: float
    num_chunks_retrieved: int
    relevance_score: float

def monitored_query(rag: RAGSystem, query: str) -> tuple[str, RAGMetrics]:
    start = time.time()
    docs = rag.retrieve(query)
    retrieval_time = (time.time() - start) * 1000

    start = time.time()
    # Note: generate() re-retrieves internally, so this timing includes one retrieval pass
    answer = rag.generate(query)
    generation_time = (time.time() - start) * 1000

    return answer, RAGMetrics(
        retrieval_latency_ms=retrieval_time,
        generation_latency_ms=generation_time,
        num_chunks_retrieved=len(docs),
        relevance_score=calculate_relevance(query, docs)  # user-supplied scorer; one sketch below
    )
```
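The snippet leaves calculate_relevance up to you. One option, sketched here, is to reuse the cross-encoder from the reranking section and average its scores over the retrieved chunks; the averaging choice is an assumption for illustration, not a standard metric:

```python
from sentence_transformers import CrossEncoder

_scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def calculate_relevance(query: str, docs: list[str]) -> float:
    """Average cross-encoder relevance of the retrieved chunks to the query."""
    if not docs:
        return 0.0
    scores = _scorer.predict([(query, doc) for doc in docs])
    return float(sum(scores) / len(scores))
```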
Conclusion
Vector databases are the backbone of modern AI applications. Whether you choose Pinecone for scale, Weaviate for features, or ChromaDB for simplicity, understanding these systems is essential for building effective RAG applications.
Start small, measure performance, and scale as needed. The combination of powerful embeddings and efficient vector search is transforming how we build AI-powered search and retrieval systems.
Ready to build your own RAG system? Pick a vector database and start experimenting!
