Vector Databases for AI: Powering RAG Systems at Scale

As Large Language Models (LLMs) become central to modern applications, the need for efficient retrieval systems has skyrocketed. Vector databases have emerged as the critical infrastructure powering Retrieval-Augmented Generation (RAG) systems.


Why Vector Databases?

Traditional databases search by exact matches or keywords. Vector databases search by meaning—finding semantically similar content even when words differ.

Key Capabilities

  • Semantic Search: Find similar content based on meaning
  • High-Dimensional Indexing: Handle millions of embedding vectors
  • Real-Time Updates: Add new documents without reindexing
  • Hybrid Search: Combine vector and keyword search

Understanding Embeddings

Embeddings are numerical representations of text (or images, audio) that capture semantic meaning:

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    return response.data[0].embedding

# Similar concepts have similar embeddings
embed1 = get_embedding("The cat sat on the mat")
embed2 = get_embedding("A feline rested on the rug")
# These vectors will be close in vector space!
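
To make "close in vector space" concrete, cosine similarity is the standard yardstick; here is a minimal pure-Python helper (no extra dependencies) you can run on the embeddings above:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The paraphrase pair should score well above an unrelated sentence
print(cosine_similarity(embed1, embed2))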

Popular Vector Databases

1. Pinecone

Fully managed, highly scalable:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-key")

# Create index
pc.create_index(
    name="documents",
    dimension=3072,  # text-embedding-3-large
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("documents")

# Upsert vectors ("embedding" comes from get_embedding above)
index.upsert(vectors=[
    {"id": "doc1", "values": embedding, "metadata": {"source": "wiki"}},
])

# Query
results = index.query(vector=query_embedding, top_k=10)
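
Pinecone queries also accept a metadata filter (Mongo-style operators), which restricts the candidate set before similarity scoring:

# query_embedding as above; the filter narrows candidates by metadata
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"source": {"$eq": "wiki"}},
    include_metadata=True,
)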

2. Weaviate

Open-source with built-in vectorization:

import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

# Define collection with auto-vectorization
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ]
)

# Add documents (automatically vectorized)
collection.data.insert({"content": "Your document text", "source": "manual"})

# Semantic search
results = collection.query.near_text(query="search query", limit=5)
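
Weaviate v4 also exposes hybrid search directly on the collection, blending vector and BM25 keyword scores; a minimal sketch (alpha weights the vector side):

# alpha=0.5 weights vector and keyword scores equally
results = collection.query.hybrid(query="search query", alpha=0.5, limit=5)

client.close()  # v4 clients hold an open connection; close it when done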

3. ChromaDB

Lightweight, perfect for development:

import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents with automatic embedding
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    ids=["doc1", "doc2"],
    metadatas=[{"source": "a"}, {"source": "b"}]
)

# Query
results = collection.query(query_texts=["search query"], n_results=5)
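
ChromaDB queries can also be scoped with a metadata where filter, which pairs nicely with the filtering tip later in this post:

# Only consider documents whose metadata matches the filter
results = collection.query(
    query_texts=["search query"],
    n_results=5,
    where={"source": "a"},
)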


Building a Production RAG System

Here’s a complete RAG implementation:

from openai import OpenAI
import chromadb

class RAGSystem:
    def __init__(self):
        self.llm = OpenAI()
        self.db = chromadb.PersistentClient(path="./data")
        self.collection = self.db.get_or_create_collection(
            name="knowledge_base",
            metadata={"hnsw:space": "cosine"}
        )
    
    def add_documents(self, documents: list[dict]):
        """Add documents to the knowledge base."""
        self.collection.add(
            documents=[d["content"] for d in documents],
            ids=[d["id"] for d in documents],
            metadatas=[{"source": d.get("source", "unknown")} for d in documents]
        )
    
    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        """Retrieve relevant documents."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k
        )
        return results["documents"][0]
    
    def generate(self, query: str) -> str:
        """Generate answer using retrieved context."""
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)
        
        # Generate response
        response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""Answer based on the following context:

{context}

If the context doesn't contain relevant information, say so."""
                },
                {"role": "user", "content": query}
            ]
        )
        return response.choices[0].message.content

# Usage
rag = RAGSystem()
rag.add_documents([
    {"id": "1", "content": "Python 3.13 introduces JIT compilation...", "source": "docs"},
    {"id": "2", "content": "FastAPI is a modern web framework...", "source": "docs"},
])

answer = rag.generate("What's new in Python 3.13?")

Advanced Techniques

1. Hybrid Search

Combine semantic and keyword search:

import numpy as np
from rank_bm25 import BM25Okapi

class HybridSearch:
    def __init__(self, vector_db, documents):
        # Assumes vector_db exposes get_scores(query) -> one similarity
        # score per document, in the same order as `documents`
        self.vector_db = vector_db
        self.bm25 = BM25Okapi([doc.split() for doc in documents])
        self.documents = documents
    
    def search(self, query: str, alpha: float = 0.5):
        # Vector search scores
        vector_scores = np.asarray(self.vector_db.get_scores(query), dtype=float)
        
        # BM25 keyword scores
        bm25_scores = self.bm25.get_scores(query.split())
        
        # Normalize both to [0, 1] so the two scales are comparable
        if vector_scores.max() > 0:
            vector_scores = vector_scores / vector_scores.max()
        if bm25_scores.max() > 0:
            bm25_scores = bm25_scores / bm25_scores.max()
        
        # Weighted fusion, then return documents ranked best-first
        combined = alpha * vector_scores + (1 - alpha) * bm25_scores
        ranked = np.argsort(combined)[::-1]
        return [self.documents[i] for i in ranked]
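
Here alpha = 1.0 is pure semantic search and alpha = 0.0 is pure BM25; a value around 0.5-0.7 is a common starting point, but it is worth tuning on a held-out set of real queries.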

2. Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smart chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(long_document)
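
The resulting chunks can be fed straight into the RAGSystem from earlier; a small sketch (the chunk IDs here are synthetic):

rag.add_documents([
    {"id": f"doc-chunk-{i}", "content": chunk, "source": "docs"}
    for i, chunk in enumerate(chunks)
])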

3. Reranking

Improve results with a reranking model:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 3):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
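
A common pattern is to over-retrieve, then rerank down to a small, high-quality context; a sketch reusing the RAGSystem retriever from above:

query = "What's new in Python 3.13?"
candidates = rag.retrieve(query, top_k=20)     # cast a wide net
top_docs = rerank(query, candidates, top_k=3)  # keep only the best few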

Performance Optimization

Indexing Strategies

Index Type | Best For           | Trade-offs
-----------|--------------------|--------------------------------
HNSW       | General use        | Higher memory, fast search
IVF        | Large datasets     | Requires training, good recall
PQ         | Memory-constrained | Some accuracy loss
Flat       | Small datasets     | Exact, but slow at scale
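
As a concrete reference point, here is how two of these index types look in FAISS, one common open-source library implementing them (a sketch with random vectors standing in for real embeddings):

import numpy as np
import faiss

d = 3072  # embedding dimension (text-embedding-3-large)
xb = np.random.rand(10_000, d).astype("float32")

flat = faiss.IndexFlatL2(d)        # exact search: no training, slow at scale
hnsw = faiss.IndexHNSWFlat(d, 32)  # graph-based ANN: fast, more memory

flat.add(xb)
hnsw.add(xb)

# Both return (distances, ids); HNSW trades a little recall for speed
distances, ids = hnsw.search(xb[:1], 10)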

Scaling Tips

  1. Batch Operations: Insert in batches of 100-1000 vectors (see the sketch after this list)
  2. Async Queries: Use async clients for concurrent searches
  3. Dimension Reduction: Use Matryoshka embeddings for smaller vectors
  4. Filtering: Apply metadata filters before vector search
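
For the batching tip, a minimal sketch against the ChromaDB collection used earlier:

def add_in_batches(collection, documents, ids, batch_size=500):
    """Insert documents in fixed-size batches to keep requests small."""
    for i in range(0, len(documents), batch_size):
        collection.add(
            documents=documents[i:i + batch_size],
            ids=ids[i:i + batch_size],
        )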

Monitoring and Observability

Track your RAG system performance:

import time
from dataclasses import dataclass

@dataclass
class RAGMetrics:
    retrieval_latency_ms: float
    generation_latency_ms: float
    num_chunks_retrieved: int
    relevance_score: float

def monitored_query(rag: RAGSystem, query: str) -> tuple[str, RAGMetrics]:
    start = time.time()
    docs = rag.retrieve(query)
    retrieval_time = (time.time() - start) * 1000
    
    # Note: generate() runs its own retrieval internally, so this
    # timing covers a second retrieval pass plus the LLM call
    start = time.time()
    answer = rag.generate(query)
    generation_time = (time.time() - start) * 1000
    
    return answer, RAGMetrics(
        retrieval_latency_ms=retrieval_time,
        generation_latency_ms=generation_time,
        num_chunks_retrieved=len(docs),
        relevance_score=calculate_relevance(query, docs)  # defined below
    )
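
calculate_relevance is left abstract above; one lightweight stand-in is the average cosine similarity between the query embedding and each retrieved chunk, reusing get_embedding and the cosine_similarity helper from earlier (a rough proxy, not a rigorous relevance metric):

def calculate_relevance(query: str, docs: list[str]) -> float:
    # Average query-to-chunk cosine similarity as a cheap relevance proxy
    q = get_embedding(query)
    sims = [cosine_similarity(q, get_embedding(d)) for d in docs]
    return sum(sims) / len(sims) if sims else 0.0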

Conclusion

Vector databases are the backbone of modern AI applications. Whether you choose Pinecone for scale, Weaviate for features, or ChromaDB for simplicity, understanding these systems is essential for building effective RAG applications.

Start small, measure performance, and scale as needed. The combination of powerful embeddings and efficient vector search is transforming how we build AI-powered search and retrieval systems.


Ready to build your own RAG system? Pick a vector database and start experimenting!

If you found this post helpful, a like and an ad click would be much appreciated :)