Vector Databases Explained: Powering the Next Generation of AI Applications

As AI applications explode in complexity, traditional databases struggle with semantic search, similarity matching, and high-dimensional data. Vector databases have emerged as the critical infrastructure layer for modern AI systems, enabling everything from RAG pipelines to recommendation engines.


What Are Vector Databases?

Vector databases store and query high-dimensional vectors—numerical representations of data. Unlike traditional databases that match exact values, vector databases find similar items based on mathematical distance.

Traditional Database:
Query: WHERE name = 'iPhone 15'
Result: Exact match or nothing

Vector Database:
Query: "smartphone with great camera"
Result: Similar products ranked by relevance
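
Under the hood, "similar" just means "close in vector space." A toy sketch with NumPy (the vectors below are made up for illustration) shows the idea:

import numpy as np

# Hypothetical 4-dimensional embeddings for three products
catalog = {
    "iPhone 15": np.array([0.9, 0.1, 0.8, 0.3]),
    "Pixel 8": np.array([0.85, 0.15, 0.75, 0.35]),
    "Cast-iron skillet": np.array([0.05, 0.9, 0.1, 0.7]),
}

# Made-up embedding of the query "smartphone with great camera"
query = np.array([0.88, 0.12, 0.8, 0.3])

# Rank items by Euclidean distance to the query (smaller = more similar)
ranked = sorted(catalog.items(), key=lambda kv: np.linalg.norm(kv[1] - query))
for name, _ in ranked:
    print(name)  # the phones rank above the skillet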

Understanding Embeddings

Embeddings convert data into dense vectors that capture semantic meaning:

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Example
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"
text3 = "Stock prices rose sharply"

# text1 and text2 will have similar vectors
# text3 will be distant from both
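
To verify that claim, you can compare the embeddings directly (a quick sketch that assumes the get_embedding helper above and a valid OpenAI API key):

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

e1, e2, e3 = get_embedding(text1), get_embedding(text2), get_embedding(text3)
print(cosine(e1, e2))  # relatively high: same meaning, different words
print(cosine(e1, e3))  # noticeably lower: unrelated topic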

Embedding Dimensions

Model | Dimensions | Use Case
text-embedding-3-small | 1536 | General purpose
text-embedding-3-large | 3072 | High accuracy
Cohere embed-v3 | 1024 | Multilingual
BGE-large | 1024 | Open source
all-MiniLM-L6-v2 | 384 | Fast, lightweight
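
The open-source models in the table run entirely on your own hardware. For example, a quick sketch with the sentence-transformers library (assuming it is installed via pip):

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings and runs fine on CPU
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "The cat sat on the mat",
    "A feline rested on the rug",
])
print(embeddings.shape)  # (2, 384)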

Similarity Metrics

Cosine Similarity

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Ranges from -1 to 1 (1 = identical direction)

Euclidean Distance

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return np.linalg.norm(a - b)

# Ranges from 0 to infinity (0 = identical)

Dot Product

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b)

# Requires normalized vectors for meaningful comparison
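
The three metrics are closely related: on normalized (unit-length) vectors, the dot product equals cosine similarity, and Euclidean distance becomes a direct function of it. A quick check using the functions above:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)

print(cosine_similarity(a, b))      # ~0.9839
print(dot_product(a_norm, b_norm))  # same value once the vectors are normalized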

Popular Vector Databases

Pinecone (Managed)

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")

# Upsert vectors
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding,
        "metadata": {"title": "Document 1", "category": "tech"}
    }
])

# Query
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    filter={"category": {"$eq": "tech"}}
)
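
The snippet above assumes the index already exists. Creating one looks roughly like this (a sketch using Pinecone's serverless spec; the cloud and region values are placeholders):

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

pc.create_index(
    name="my-index",
    dimension=1536,  # must match the embedding model's output size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)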

Weaviate (Self-hosted or Managed)

import weaviate
from weaviate.classes.config import Property, DataType, Configure

client = weaviate.connect_to_local()

# Create collection
collection = client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add objects (auto-vectorized)
collection.data.insert({
    "title": "Introduction to AI",
    "content": "Artificial intelligence is...",
    "category": "tech"
})

# Semantic search
results = collection.query.near_text(
    query="machine learning basics",
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("category").equal("tech")
)

ChromaDB (Lightweight, Local)

import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

# Add documents (auto-embedded with default model)
collection.add(
    documents=["Doc about AI", "Doc about cooking", "Doc about ML"],
    ids=["id1", "id2", "id3"],
    metadatas=[{"topic": "tech"}, {"topic": "food"}, {"topic": "tech"}]
)

# Query
results = collection.query(
    query_texts=["artificial intelligence"],
    n_results=2,
    where={"topic": "tech"}
)


Qdrant (High Performance)

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

# Insert vectors
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding,
            payload={"title": "AI Guide", "category": "tech"}
        )
    ]
)

# Search with filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    )
)

pgvector (PostgreSQL Extension)

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(1536)
);

-- Create index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Insert document
INSERT INTO documents (title, content, embedding)
VALUES ('AI Guide', 'Content...', '[0.1, 0.2, ...]'::vector);

-- Similarity search
SELECT title, 1 - (embedding <=> query_vector) AS similarity
FROM documents
ORDER BY embedding <=> query_vector
LIMIT 10;
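
From Python, the same table can be queried with psycopg and the pgvector helper package (a rough sketch assuming both are installed and PostgreSQL is running; the connection string is a placeholder):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=mydb user=postgres")  # placeholder connection string
register_vector(conn)  # lets psycopg pass numpy arrays as vector values

query_embedding = np.array(get_embedding("intro to AI"))  # reuses the helper above

rows = conn.execute(
    "SELECT title, 1 - (embedding <=> %s) AS similarity "
    "FROM documents ORDER BY embedding <=> %s LIMIT 10",
    (query_embedding, query_embedding),
).fetchall()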

Indexing Algorithms

HNSW (Hierarchical Navigable Small World)

The most popular algorithm for approximate nearest neighbor search:

        Level 2:  [A]-------[B]
                   |         |
        Level 1:  [A]--[C]--[B]--[D]
                   |    |    |    |
        Level 0:  [A][E][C][F][B][G][D][H]

Pros: Fast queries, good recall
Cons: High memory usage, slow inserts
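
Most engines expose HNSW's build parameters. As a sketch, Qdrant lets you tune m (links per node) and ef_construct (build-time search width) when creating a collection; higher values trade memory and indexing time for better recall:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents_hnsw",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    # Higher m / ef_construct: better recall, more memory, slower indexing
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)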

IVF (Inverted File Index)

Partitions vectors into clusters:

# Conceptual example
clusters = {
    "cluster_1": [vec1, vec5, vec9],
    "cluster_2": [vec2, vec6, vec10],
    "cluster_3": [vec3, vec4, vec7, vec8]
}

# Query: find cluster first, then search within
nearest_cluster = find_nearest_cluster(query_vector)
results = search_within_cluster(nearest_cluster, query_vector)

Pros: Lower memory, faster builds
Cons: Lower recall than HNSW
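
FAISS makes the two-stage IVF flow explicit: train cluster centroids, add vectors, then probe only a few clusters at query time. A self-contained sketch on random data (assuming faiss-cpu is installed):

import numpy as np
import faiss

d, n_vectors, n_clusters = 128, 10_000, 100
data = np.random.random((n_vectors, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)                      # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, n_clusters)
index.train(data)                                     # learn the cluster centroids
index.add(data)

index.nprobe = 5                                      # search only the 5 nearest clusters
distances, ids = index.search(data[:1], 10)           # top-10 neighbors of the first vector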

Building a RAG Pipeline

from openai import OpenAI
import chromadb

class RAGPipeline:
    def __init__(self):
        self.client = OpenAI()
        self.chroma = chromadb.Client()
        self.collection = self.chroma.create_collection("knowledge_base")
    
    def ingest_documents(self, documents: list[dict]):
        """Add documents to the knowledge base."""
        texts = [doc["content"] for doc in documents]
        ids = [doc["id"] for doc in documents]
        metadatas = [{"source": doc.get("source", "")} for doc in documents]
        
        self.collection.add(
            documents=texts,
            ids=ids,
            metadatas=metadatas
        )
    
    def retrieve(self, query: str, n_results: int = 5) -> list[str]:
        """Retrieve relevant documents."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results
        )
        return results["documents"][0]
    
    def generate(self, query: str) -> str:
        """Generate answer using retrieved context."""
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)
        
        # Generate response
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n\n{context}"
                },
                {"role": "user", "content": query}
            ]
        )
        
        return response.choices[0].message.content

# Usage
rag = RAGPipeline()
rag.ingest_documents([
    {"id": "1", "content": "Python is a programming language...", "source": "wiki"},
    {"id": "2", "content": "Machine learning uses algorithms...", "source": "book"},
])

answer = rag.generate("What is Python used for?")

Chunking Strategies

Fixed-Size Chunking

def fixed_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # stop at the end of the text to avoid a duplicate tail chunk
        start = end - overlap
    return chunks

Semantic Chunking

Despite the name, the most common practical approach is LangChain's RecursiveCharacterTextSplitter, which splits on structural separators (paragraphs, then sentences, then words) rather than on embedding-based semantic boundaries:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)

Sentence-Based Chunking

import nltk
nltk.download('punkt')

def sentence_chunks(text: str, sentences_per_chunk: int = 5) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks = []
    
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    
    return chunks

Performance Optimization

Batch Operations

# Bad: one insert call per document
for doc, doc_id in zip(documents, ids):
    collection.add(documents=[doc], ids=[doc_id])

# Good: insert in batches (ChromaDB's add has no batch_size argument,
# so slice the lists yourself)
batch_size = 100
for i in range(0, len(documents), batch_size):
    collection.add(
        documents=documents[i:i + batch_size],
        ids=ids[i:i + batch_size]
    )

Hybrid Search

Combine vector and keyword search:

# Weaviate hybrid search
results = collection.query.hybrid(
    query="machine learning basics",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector
    limit=10
)

Metadata Filtering

Apply metadata filters as part of the query rather than filtering results afterwards:

# More efficient: filter then search (Pinecone)
results = index.query(
    vector=query_embedding,
    filter={"category": "tech"},  # Applied before similarity search
    top_k=10
)

Choosing a Vector Database

Database | Best For | Hosting | Scaling
Pinecone | Production, managed | Cloud only | Automatic
Weaviate | Hybrid search, GraphQL | Both | Manual/Managed
ChromaDB | Prototyping, local dev | Local | Limited
Qdrant | High performance | Both | Manual
pgvector | PostgreSQL users | Self-hosted | With PostgreSQL
Milvus | Enterprise, large scale | Both | Manual

Conclusion

Vector databases are essential infrastructure for modern AI applications:

  • Semantic Search: Find meaning, not just keywords
  • RAG Pipelines: Ground LLMs in your data
  • Recommendations: Similar items, users, content
  • Deduplication: Find near-duplicates at scale

Start with ChromaDB for prototyping, then graduate to Pinecone or Qdrant for production. The choice depends on your scale, hosting preferences, and feature requirements.


Ready to power your AI applications with vector search?

If this article was helpful, please give it a like and click on an ad :)