Vector Databases Explained: The Infrastructure Layer Powering Modern AI Applications

Ask any team building a production AI application what surprised them most, and the answer is often the same: data infrastructure. Specifically, the need for a system that can store, index, and query high-dimensional vectors at scale — something traditional databases were never designed to do.

Enter vector databases. Two years ago, they were a niche curiosity. Today they’re foundational infrastructure for RAG (Retrieval-Augmented Generation), semantic search, recommendation engines, anomaly detection, and multimodal applications.

This is the explainer I wish I’d had before I started evaluating them.

[Image: server room with glowing lights. Photo by Taylor Vick on Unsplash]


What Is a Vector, and Why Does It Matter?

A vector is an array of floating-point numbers that represents the semantic meaning of some content. Embedding models (like OpenAI’s text-embedding-3-large, Google’s text-embedding-004, or open-source models like nomic-embed-text) convert text, images, or audio into these high-dimensional representations.

The magic: semantically similar content has vectors that are numerically close to each other. “dog” and “puppy” have nearby vectors. “king” - “man” + “woman” ≈ “queen” — the famous Word2Vec demonstration.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-large"
    )
    return response.data[0].embedding

# These will have high cosine similarity
v1 = embed("machine learning model training")
v2 = embed("neural network gradient descent")
v3 = embed("French cuisine baguette")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(v1, v2))  # ~0.82 — highly similar
print(cosine_similarity(v1, v3))  # ~0.31 — unrelated

A vector database stores these embeddings and supports approximate nearest neighbor (ANN) search — finding the K vectors most similar to a query vector, fast, at scale.
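
To make that concrete, here is what exact (non-approximate) nearest neighbor search looks like as a minimal NumPy sketch. This brute-force scan is what ANN indexes approximate, and it is exactly what becomes too slow once you have millions of vectors:

import numpy as np

def exact_top_k(query: list[float], vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force top-k by cosine similarity: score every stored vector.
    ANN indexes (HNSW, IVF) return roughly the same answer without the full scan."""
    q = np.asarray(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]  # indices of the k most similar rows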


The RAG Architecture: Why Vector DBs Became Essential

RAG — Retrieval-Augmented Generation — is the dominant pattern for giving LLMs access to external knowledge while reducing hallucination.

User Query → Embed Query → Vector Search → Retrieve Top-K Chunks
                                                    ↓
                              LLM Generates Answer Using Retrieved Context

Without a vector database, RAG doesn’t scale. You can’t do semantic search over millions of documents with a traditional SQL LIKE query.

The pipeline:

  1. Ingestion: Split documents into chunks (typically 512-1024 tokens)
  2. Embedding: Convert each chunk to a vector
  3. Storage: Store chunk text + vector in the vector database
  4. Query: Embed the user’s question, find similar chunks, inject into LLM context
  5. Generation: LLM answers based on retrieved context
# Simplified RAG pipeline
from pinecone import Pinecone
from openai import OpenAI

pc = Pinecone(api_key="...")
index = pc.Index("knowledge-base")
openai_client = OpenAI()

def rag_query(question: str) -> str:
    # Step 1: Embed the question
    query_vector = embed(question)
    
    # Step 2: Find similar document chunks
    results = index.query(
        vector=query_vector,
        top_k=5,
        include_metadata=True
    )
    
    # Step 3: Build context from retrieved chunks
    context = "\n\n".join([r.metadata["text"] for r in results.matches])
    
    # Step 4: Generate answer with LLM
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
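
The query side above assumes documents were already ingested. Here is a hedged sketch of that side, reusing the embed() helper from earlier; chunk_text is a hypothetical splitter standing in for whatever chunking strategy you choose:

def ingest(documents: list[str]) -> None:
    # chunk_text is a placeholder: split each document into ~512-1024 token chunks
    records = []
    for i, doc in enumerate(documents):
        for j, chunk in enumerate(chunk_text(doc)):
            records.append({
                "id": f"doc{i}-chunk{j}",
                "values": embed(chunk),        # same model as at query time
                "metadata": {"text": chunk},   # stored so rag_query can rebuild context
            })
    index.upsert(vectors=records)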

The Landscape: Major Vector Database Options

Pinecone

The managed, cloud-only option. Fully serverless, scales to billions of vectors, zero operational overhead. The right choice when you don’t want to think about infrastructure.

Pros: Zero ops, excellent performance, good SDK support
Cons: Vendor lock-in, cost at scale, no self-hosting

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="my-index",
    dimension=1536,  # OpenAI text-embedding-3-small dimensions
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

Weaviate

Open source, self-hostable, with a managed cloud offering. Strong schema support, built-in BM25 hybrid search, GraphQL and REST APIs.

Pros: Hybrid search, rich filtering, open source, multimodal support
Cons: More complex setup, resource-intensive
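
For parity with the other options, a minimal sketch using the v4 Python client against a local instance; the collection name and vectorizer choice are just illustrative:

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

articles = client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),  # Weaviate embeds on ingest
)

# Hybrid query: alpha blends BM25 keyword scoring (0.0) with vector similarity (1.0)
results = articles.query.hybrid(query="vector database indexing", alpha=0.5, limit=5)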

Qdrant

Written in Rust (see the previous post!). Fast, memory-efficient, excellent filtering capabilities, Docker-friendly. Strong choice for teams who want self-hosted with minimal ops overhead.

docker run -p 6333:6333 qdrant/qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
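
Writing and querying follow the same pattern. A hedged sketch, assuming the vectors come from the embed() helper defined earlier:

from qdrant_client.models import PointStruct

client.upsert(
    collection_name="articles",
    points=[PointStruct(id=1, vector=embed("Rust is fast"), payload={"text": "Rust is fast"})],
)

hits = client.search(
    collection_name="articles",
    query_vector=embed("high-performance languages"),
    limit=5,
)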

pgvector (PostgreSQL Extension)

Add vector search to your existing PostgreSQL database. If you’re already on Postgres, this removes an entire service from your architecture.

-- Enable the extension
CREATE EXTENSION vector;

-- Add a vector column to existing table
ALTER TABLE articles ADD COLUMN embedding vector(1536);

-- Create an index for fast ANN search
CREATE INDEX ON articles USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- Semantic search query
SELECT title, 1 - (embedding <=> '[...]') AS similarity
FROM articles
ORDER BY embedding <=> '[...]'
LIMIT 10;

Pros: No new service, ACID transactions, joins with relational data
Cons: Not as performant as purpose-built solutions at massive scale

Chroma

Developer-friendly, with an embedded in-memory mode for prototyping and a persistent mode for durable storage. The quickest way to get vector search running locally.

import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path="...") to persist
collection = client.create_collection("my_collection")

# Chroma embeds documents automatically with its default embedding function
collection.add(
    documents=["This is about machine learning", "Python is great"],
    ids=["doc1", "doc2"]
)

results = collection.query(
    query_texts=["neural networks"],
    n_results=2
)

Choosing the Right Vector Database

Use Case                               Recommended
Prototype / local dev                  Chroma
Already on PostgreSQL                  pgvector
Self-hosted, production                Qdrant
Managed, zero ops                      Pinecone
Complex filtering + hybrid search      Weaviate
Massive scale, multi-tenant            Pinecone or Weaviate Cloud

Key Concepts to Understand

ANN Algorithms

Vector databases use approximate nearest neighbor algorithms rather than exact search — exact search over millions of vectors is too slow.

  • HNSW (Hierarchical Navigable Small World) — Graph-based, fast queries, high memory usage. Used by Qdrant, Weaviate, pgvector.
  • IVF (Inverted File Index) — Clustering-based, lower memory, slightly slower. Used by pgvector’s ivfflat (see the example after this list).
  • Flat — Exact search, only practical for small datasets.
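
pgvector makes the trade-off easy to compare, since it supports both index types (HNSW since 0.5.0). A sketch; the parameters shown are the documented defaults, not tuned values:

-- HNSW: graph-based, faster queries and better recall, more memory
CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- IVFFlat: clustering-based, lower memory; docs suggest lists ≈ rows / 1000
CREATE INDEX ON articles USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);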

Metadata Filtering

Pure vector search often isn’t enough. You frequently need to filter by structured attributes (date range, category, user ID) while ranking by semantic similarity.

# Pinecone filtered query
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$in": ["tech", "science"]},
        # Pinecone range operators compare numbers, so store dates as Unix timestamps
        "published_ts": {"$gte": 1735689600}  # 2025-01-01 UTC
    }
)
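
The same idea in Qdrant, as a hedged sketch; the field names mirror the Pinecone example and are assumptions about your payload schema:

from qdrant_client.models import FieldCondition, Filter, MatchAny, Range

hits = client.search(
    collection_name="articles",
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="category", match=MatchAny(any=["tech", "science"])),
        FieldCondition(key="published_ts", range=Range(gte=1735689600)),
    ]),
    limit=10,
)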

Hybrid Search

Hybrid search combines vector (semantic) search with BM25 keyword search. It often beats pure vector search for queries with specific technical terms, product codes, or named entities.
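
One common way to merge the two result lists is Reciprocal Rank Fusion; a minimal sketch (k=60 is the conventional constant from the original RRF paper):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists (e.g. one from BM25, one from vector search).
    Documents ranked highly by either retriever float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_ids, vector_ids])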


Production Considerations

Embedding consistency: The model you use to query must match the model used during ingestion. Mixing models breaks similarity scores.

Chunking strategy matters: How you split documents significantly impacts retrieval quality. Experiment with chunk size and overlap.
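
As a starting point for that experimentation, here is a naive character-based chunker with overlap; real pipelines usually count tokens and respect sentence boundaries:

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    # Character counts are a rough proxy: roughly 4 characters per token in English
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks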

Index warm-up: Large ANN indexes may need warm-up time before reaching full performance.

Cost modeling: Embedding API calls add up. Batch where possible. Cache embeddings for static content.
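
A minimal on-disk cache keyed by model plus text is often enough. A sketch, where the cache path is hypothetical and embed() is the helper from the first example:

import hashlib
import json
import os

CACHE_PATH = "embedding_cache.json"  # hypothetical location
_cache: dict[str, list[float]] = (
    json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
)

def embed_cached(text: str, model: str = "text-embedding-3-large") -> list[float]:
    # Key on model + text so a model upgrade never serves stale vectors
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return _cache[key]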

Versioning: When you upgrade embedding models, you need to re-embed your entire corpus.


Conclusion

Vector databases have moved from “interesting AI research infrastructure” to “standard production tooling” in the span of two years. If you’re building anything that involves semantic search, RAG, or recommendation at scale, you need one.

The good news: the ecosystem has matured dramatically. Whether you want managed simplicity (Pinecone), self-hosted performance (Qdrant), or to minimize new services (pgvector), there’s a well-supported option that fits.

Start with Chroma for your prototype. Graduate to production infrastructure when you know what your scale requirements actually are.

