Vector Databases for AI: Powering RAG Systems at Scale
As Large Language Models (LLMs) become central to modern applications, the need for efficient retrieval systems has skyrocketed. Vector databases have emerged as the critical infrastructure powering Retrieval-Augmented Generation (RAG) systems.
Why Vector Databases?
Traditional databases search by exact matches or keywords. Vector databases search by meaning—finding semantically similar content even when words differ.
Key Capabilities
- Semantic Search: Find similar content based on meaning
- High-Dimensional Indexing: Handle millions of embedding vectors
- Real-Time Updates: Add new documents without reindexing
- Hybrid Search: Combine vector and keyword search
Understanding Embeddings
Embeddings are numerical representations of text (or images, audio) that capture semantic meaning:
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

# Similar concepts have similar embeddings
embed1 = get_embedding("The cat sat on the mat")
embed2 = get_embedding("A feline rested on the rug")
# These vectors will be close in vector space!
```
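To make "close in vector space" concrete, here is a quick cosine-similarity check on those embeddings. This is a minimal sketch; the exact scores vary by model, and the unrelated sentence is just an illustrative contrast.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embed1, embed2))  # high: near-identical meaning
print(cosine_similarity(embed1, get_embedding("Quarterly revenue grew 12%")))  # noticeably lower
```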
Popular Vector Databases Compared
1. Pinecone
Fully managed, highly scalable:
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-key")

# Create index
pc.create_index(
    name="documents",
    dimension=3072,  # text-embedding-3-large
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("documents")

# Upsert vectors
index.upsert(vectors=[
    {"id": "doc1", "values": embedding, "metadata": {"source": "wiki"}},
])

# Query
results = index.query(vector=query_embedding, top_k=10)
```
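Queries can also be narrowed with a metadata filter before the vector comparison. A small sketch, assuming the upserted "source" metadata above and Pinecone's MongoDB-style filter operators:

```python
# Only consider vectors whose metadata matches the filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"source": {"$eq": "wiki"}},
    include_metadata=True,
)
```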
2. Weaviate
Open-source with built-in vectorization:
```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Define collection with auto-vectorization
# (text2vec_openai requires an OpenAI API key available to Weaviate,
#  e.g. via connection headers or environment variables)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ]
)

# Add documents (automatically vectorized)
collection.data.insert({"content": "Your document text", "source": "manual"})

# Semantic search
results = collection.query.near_text(query="search query", limit=5)
```
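Weaviate also exposes hybrid (keyword + vector) search on the same collection. A minimal sketch with the v4 query API, where alpha blends pure keyword scoring (0.0) and pure vector scoring (1.0):

```python
# Hybrid search: BM25 and vector scores fused with weight alpha
results = collection.query.hybrid(query="search query", alpha=0.5, limit=5)
for obj in results.objects:
    print(obj.properties["content"])
```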
3. ChromaDB
Lightweight, perfect for development:
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents with automatic embedding
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "a"}, {"source": "b"}]
)

# Query
results = collection.query(query_texts=["search query"], n_results=5)
```
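ChromaDB can also pre-filter on metadata before the similarity search, using the metadata added above:

```python
# Restrict the semantic search to documents whose metadata matches
results = collection.query(
    query_texts=["search query"],
    n_results=5,
    where={"source": "a"},
)
```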
Building a Production RAG System
Here's an end-to-end RAG implementation that ties ChromaDB and the OpenAI API together:
```python
from openai import OpenAI
import chromadb

class RAGSystem:
    def __init__(self):
        self.llm = OpenAI()
        self.db = chromadb.PersistentClient(path="./data")
        self.collection = self.db.get_or_create_collection(
            name="knowledge_base",
            metadata={"hnsw:space": "cosine"}
        )

    def add_documents(self, documents: list[dict]):
        """Add documents to the knowledge base."""
        self.collection.add(
            documents=[d["content"] for d in documents],
            ids=[d["id"] for d in documents],
            metadatas=[{"source": d.get("source", "unknown")} for d in documents]
        )

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        """Retrieve relevant documents."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k
        )
        return results["documents"][0]

    def generate(self, query: str) -> str:
        """Generate answer using retrieved context."""
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)

        # Generate response
        response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""Answer based on the following context:
{context}
If the context doesn't contain relevant information, say so."""
                },
                {"role": "user", "content": query}
            ]
        )
        return response.choices[0].message.content

# Usage
rag = RAGSystem()
rag.add_documents([
    {"id": "1", "content": "Python 3.13 introduces JIT compilation...", "source": "docs"},
    {"id": "2", "content": "FastAPI is a modern web framework...", "source": "docs"},
])
answer = rag.generate("What's new in Python 3.13?")
```
Advanced Techniques
1. Hybrid Search
Combine semantic and keyword search:
```python
import numpy as np
from rank_bm25 import BM25Okapi

class HybridSearch:
    def __init__(self, vector_db, documents):
        # vector_db is assumed to expose query(text) -> one similarity score per
        # document, aligned with `documents`; adapt this to your client
        self.vector_db = vector_db
        self.bm25 = BM25Okapi([doc.split() for doc in documents])
        self.documents = documents

    def search(self, query: str, alpha: float = 0.5, top_k: int = 5):
        # Vector and BM25 keyword scores, each normalized to [0, 1]
        vector_scores = self._normalize(np.asarray(self.vector_db.query(query), dtype=float))
        bm25_scores = self._normalize(np.asarray(self.bm25.get_scores(query.split()), dtype=float))
        # Combine with weighted fusion: alpha weights the semantic signal
        combined = alpha * vector_scores + (1 - alpha) * bm25_scores
        top = np.argsort(combined)[::-1][:top_k]
        return [(self.documents[i], float(combined[i])) for i in top]

    @staticmethod
    def _normalize(scores):
        return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)
```
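In practice, alpha is worth tuning per corpus: values near 1.0 lean on the semantic signal, values near 0.0 lean on exact keyword matches, and something in between usually works best for queries that mix names, IDs, and natural language.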
2. Chunking Strategies
Split long documents into overlapping chunks so each embedding covers a focused span of text:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smart chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(long_document)
```
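Assuming long_document holds your raw text and the RAGSystem instance from earlier is available, the chunks can be loaded straight into the knowledge base (the chunk IDs here are made up for illustration):

```python
# Each chunk becomes its own retrievable document
rag.add_documents([
    {"id": f"chunk-{i}", "content": chunk, "source": "long_document"}
    for i, chunk in enumerate(chunks)
])
```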
3. Reranking
Improve results with a reranking model:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 3):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```
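In a RAG pipeline, reranking typically sits between retrieval and generation: over-retrieve, then keep only the highest-scoring chunks. A minimal sketch against the RAGSystem above, assuming its knowledge base is large enough to return 20 candidates:

```python
query = "What's new in Python 3.13?"
candidates = rag.retrieve(query, top_k=20)  # retrieve generously
best_chunks = rerank(query, candidates)     # keep the top 3 by cross-encoder score
```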
Performance Optimization
Indexing Strategies
| Index Type | Best For | Trade-offs |
|---|---|---|
| HNSW | General use | Higher memory, fast search |
| IVF | Large datasets | Requires training, good recall |
| PQ | Memory-constrained | Some accuracy loss |
| Flat | Small datasets | Exact, but slow at scale |
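The databases above manage these index types for you; to see the trade-offs directly, here is a purely illustrative sketch using the FAISS library (not covered elsewhere in this article), with made-up dimensions and random vectors standing in for real embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 1536                                             # illustrative embedding dimension
vectors = np.random.rand(5000, dim).astype("float32")  # stand-in for real embeddings

# Flat: exact nearest-neighbor search, no training, slow at large scale
flat = faiss.IndexFlatL2(dim)
flat.add(vectors)

# HNSW: graph-based approximate search, fast queries, higher memory
hnsw = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
hnsw.add(vectors)

# IVF + PQ: clusters vectors and compresses them, requires a training pass
quantizer = faiss.IndexFlatL2(dim)
ivf_pq = faiss.IndexIVFPQ(quantizer, dim, 64, 48, 8)  # 64 clusters, 48 subquantizers, 8 bits
ivf_pq.train(vectors)
ivf_pq.add(vectors)

# All index types share the same search interface
distances, ids = hnsw.search(vectors[:1], 10)
```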
Scaling Tips
- Batch Operations: Insert in batches of 100-1000 vectors (see the batching sketch after this list)
- Async Queries: Use async clients for concurrent searches
- Dimension Reduction: Use Matryoshka embeddings for smaller vectors
- Filtering: Apply metadata filters before vector search
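As a sketch of batched inserts into a ChromaDB collection like the one used above (the batch size and helper name are illustrative, not a library API):

```python
def add_in_batches(collection, documents, ids, metadatas, batch_size=500):
    """Insert documents into a ChromaDB collection in fixed-size batches."""
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            ids=ids[start:end],
            metadatas=metadatas[start:end],
        )
```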
Monitoring and Observability
Track your RAG system performance:
```python
import time
from dataclasses import dataclass

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    generation_latency_ms: float
    num_chunks_retrieved: int
    relevance_score: float

def monitored_query(rag: RAGSystem, query: str) -> tuple[str, RAGMetrics]:
    start = time.time()
    docs = rag.retrieve(query)
    retrieval_time = (time.time() - start) * 1000

    start = time.time()
    # Note: generate() re-retrieves internally, so this timing includes one retrieval pass
    answer = rag.generate(query)
    generation_time = (time.time() - start) * 1000

    return answer, RAGMetrics(
        retrieval_latency_ms=retrieval_time,
        generation_latency_ms=generation_time,
        num_chunks_retrieved=len(docs),
        relevance_score=calculate_relevance(query, docs)  # user-supplied scorer; one sketch below
    )
```
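The snippet leaves calculate_relevance up to you. One option, sketched here, is to reuse the cross-encoder from the reranking section and average its scores over the retrieved chunks; the averaging choice is an assumption for illustration, not a standard metric:

```python
from sentence_transformers import CrossEncoder

_scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def calculate_relevance(query: str, docs: list[str]) -> float:
    """Average cross-encoder relevance of the retrieved chunks to the query."""
    if not docs:
        return 0.0
    scores = _scorer.predict([(query, doc) for doc in docs])
    return float(sum(scores) / len(scores))
```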
Conclusion
Vector databases are the backbone of modern AI applications. Whether you choose Pinecone for scale, Weaviate for features, or ChromaDB for simplicity, understanding these systems is essential for building effective RAG applications.
Start small, measure performance, and scale as needed. The combination of powerful embeddings and efficient vector search is transforming how we build AI-powered search and retrieval systems.
Ready to build your own RAG system? Pick a vector database and start experimenting!
