DuckDB: The In-Process Analytics Engine That Changes Everything



For the past decade, serious analytics meant a two-step journey: load data into a warehouse (Redshift, BigQuery, Snowflake), then write SQL queries against it. For smaller datasets or exploratory work, you'd pull data into pandas and fight Python's memory limits.

DuckDB is a third path that didn’t exist before: a full-featured OLAP SQL engine that runs in-process — embedded in your Python script, your Go binary, your browser tab — with no server, no cluster, no configuration. It queries Parquet files on S3 at near-Spark speed, replaces pandas for most data science workflows, and makes “spin up a quick analytics pipeline” mean 5 minutes instead of 5 days.



What Makes DuckDB Different

DuckDB’s design choices are the opposite of most databases:

|               | Traditional OLAP (Snowflake, BigQuery) | DuckDB                             |
|---------------|----------------------------------------|------------------------------------|
| Architecture  | Client-server, distributed             | In-process, embedded               |
| Deployment    | Cloud service                          | Single binary / Python package     |
| Setup         | Minutes to hours                       | pip install duckdb                 |
| Parallelism   | Distributed (many machines)            | Columnar, vectorized (one machine) |
| Best for      | PB-scale, many concurrent users        | GB–TB scale, single user / batch   |
| Cost          | Per-query pricing                      | Free / open source                 |

DuckDB's vectorized execution engine is the key. Instead of processing one row at a time (like SQLite), it processes data in fixed-size vectors of up to 2,048 values, which keeps hot data in CPU caches and lets the compiler use SIMD instructions. For analytics queries (aggregations, joins, filters) this is typically 10–100x faster than row-at-a-time engines.
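To make the difference concrete, here is a small benchmark sketch (not from the DuckDB docs; the table name, data sizes, and timings are illustrative) that runs the same GROUP BY through SQLite's row-at-a-time executor and DuckDB's vectorized one:

import sqlite3
import time

import duckdb
import numpy as np
import pandas as pd

# Synthetic data: 5 million rows, 100 categories (sizes are arbitrary)
n = 5_000_000
df = pd.DataFrame({
    "category": np.random.randint(0, 100, n),
    "amount": np.random.rand(n),
})

# SQLite: row-at-a-time execution
sq = sqlite3.connect(":memory:")
df.to_sql("t", sq, index=False)
t0 = time.perf_counter()
sq.execute("SELECT category, SUM(amount) FROM t GROUP BY category").fetchall()
print(f"sqlite: {time.perf_counter() - t0:.2f}s")

# DuckDB: vectorized execution, reading the same DataFrame in place
t0 = time.perf_counter()
duckdb.sql("SELECT category, SUM(amount) FROM df GROUP BY category").fetchall()
print(f"duckdb: {time.perf_counter() - t0:.2f}s")

The exact numbers depend on your machine, but the gap comes from the vectorized executor described above, not from anything exotic in the query.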


Getting Started in 60 Seconds

import duckdb

# No setup needed — create a DB or work in-memory
con = duckdb.connect()  # In-memory

# Query CSV directly (no import step!)
con.execute("""
    SELECT 
        category,
        COUNT(*) as orders,
        SUM(amount) as revenue,
        AVG(amount) as avg_order
    FROM read_csv_auto('orders_2026.csv')
    WHERE date >= '2026-01-01'
    GROUP BY category
    ORDER BY revenue DESC
""").df()  # Returns a pandas DataFrame

That’s it. No schema definition, no import, no server. DuckDB scans the CSV, infers the schema, and executes the query — all in-memory, in your Python process.
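If you want to check what DuckDB inferred before trusting it, DESCRIBE and SUMMARIZE work on the same query; a quick sketch, reusing the orders_2026.csv file and connection from above:

# Inspect the inferred column names and types
print(con.execute("DESCRIBE SELECT * FROM read_csv_auto('orders_2026.csv')").df())

# Per-column statistics: min/max, null counts, approximate distinct counts
print(con.execute("SUMMARIZE SELECT * FROM read_csv_auto('orders_2026.csv')").df())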


Querying Parquet on S3 Without a Warehouse

This is where DuckDB becomes genuinely remarkable. You can query Parquet files sitting in S3 as if they were local tables:

import duckdb

con = duckdb.connect()

# Install the HTTP extension for S3 access
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    SET s3_region = 'us-east-1';
    SET s3_access_key_id = '...';
    SET s3_secret_access_key = '...';
""")

# Query Parquet files directly from S3 (with predicate pushdown!)
result = con.execute("""
    SELECT 
        user_id,
        SUM(purchase_amount) as ltv,
        COUNT(*) as transactions,
        MAX(event_date) as last_purchase
    FROM read_parquet('s3://my-data-lake/events/year=2026/month=*/day=*/*.parquet')
    WHERE event_type = 'purchase'
      AND event_date >= '2026-01-01'
    GROUP BY user_id
    HAVING SUM(purchase_amount) > 1000
    ORDER BY ltv DESC
    LIMIT 1000
""").df()

DuckDB pushes predicates into the Parquet reader — it only reads the row groups that match the filter, skipping irrelevant data. For a 500GB Parquet dataset where your query matches 5%, DuckDB reads ~25GB, not 500GB. This competes with Spark on single-table analytics.
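You can see the pushdown for yourself: prefixing a query with EXPLAIN prints the physical plan, and the WHERE predicates appear on the Parquet scan operator rather than as a separate filter step. A sketch against the same (assumed) bucket layout, reusing the con configured above:

# Print the physical plan; the filters show up under the Parquet scan operator
plan = con.execute("""
    EXPLAIN
    SELECT user_id, SUM(purchase_amount) as ltv
    FROM read_parquet('s3://my-data-lake/events/year=2026/month=*/day=*/*.parquet')
    WHERE event_type = 'purchase'
      AND event_date >= '2026-01-01'
    GROUP BY user_id
""").fetchall()
print(plan[0][1])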


Replacing Pandas for Data Science

Pandas gets slow above roughly 1GB because it is single-threaded and eagerly materializes every intermediate result in memory. DuckDB's Python integration lets you keep your data in pandas where that is convenient and push the heavy queries through DuckDB's columnar, multi-core engine:

import duckdb
import pandas as pd

# Load a pandas DataFrame
df = pd.read_parquet("events.parquet")  # 10GB file

# DuckDB can query pandas DataFrames directly!
# (It scans the DataFrame's columns in place, largely without copying the data)
result = duckdb.sql("""
    SELECT 
        date_trunc('week', event_time) as week,
        platform,
        COUNT(DISTINCT user_id) as dau,
        COUNT(*) as events
    FROM df
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()

# Benchmark: pandas groupby on 10M rows → 8.2 seconds
# DuckDB on same data → 0.4 seconds (20x faster)

The typical data science workflow with pandas hits a wall around 4–8GB on a standard laptop. DuckDB processes the same data using all CPU cores, with vectorized operations that pandas can’t match.
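The duckdb.sql call above finds df by looking it up in the calling scope. If you prefer something more explicit, you can register the DataFrame on a connection under a name of your own and pull results back as pandas or Arrow; a sketch reusing the df loaded above (the view name events_view is made up):

import duckdb

con = duckdb.connect()
con.register("events_view", df)  # expose the DataFrame to SQL under an explicit name

weekly = con.sql("""
    SELECT date_trunc('week', event_time) as week, COUNT(*) as events
    FROM events_view
    GROUP BY 1
    ORDER BY 1
""")

print(weekly.df().head())      # materialize as a pandas DataFrame
arrow_table = weekly.arrow()   # or as an Arrow table for handoff to other tools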

The Relational API (No SQL Required)

For those who prefer method chaining over SQL:

import duckdb

rel = (
    duckdb.read_parquet("events.parquet")
    .filter("event_type = 'purchase'")
    .project("user_id, amount, event_date")
    .aggregate(
        "user_id, SUM(amount) as total_spend, COUNT(*) as orders, MAX(event_date) as last_order",
        "user_id"
    )
    .filter("total_spend > 100")
    .order("total_spend DESC")
    .limit(100)
)

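# Nothing has run yet: the relation is built lazily and only executes when materialized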
df = rel.df()

Persistent Databases and the DuckDB File Format

For local-first applications and reporting tools, DuckDB works as a persistent database:

import duckdb

# Create a persistent DuckDB database
con = duckdb.connect("analytics.duckdb")

# Create tables with column-store format
con.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT PRIMARY KEY,
        user_id VARCHAR,
        event_type VARCHAR,
        event_time TIMESTAMP,
        properties JSON
    )
""")

# Bulk insert from Parquet (extremely fast)
con.execute("""
    INSERT INTO events 
    SELECT * FROM read_parquet('raw_events/*.parquet')
""")

# Now query with full SQL including window functions
con.execute("""
    SELECT 
        user_id,
        event_type,
        event_time,
        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event,
        DATEDIFF('minute', 
            LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
            event_time
        ) as minutes_since_last
    FROM events
    WHERE event_time >= NOW() - INTERVAL 7 DAYS
""").df()

A .duckdb database is a single portable file: you can email it, check it into git, or deploy it alongside your app. Within one process, readers and a writer can work concurrently under MVCC; across processes, you get either one read-write process or any number of read-only processes, a limitation to plan around in multi-user applications.
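In practice that means query-serving processes open the file with read_only=True while writes stay in a single batch job; a minimal sketch, assuming the analytics.duckdb file created above:

import duckdb

# Reporting / API processes: any number of these can open the file read-only at once
con = duckdb.connect("analytics.duckdb", read_only=True)
print(con.execute("SELECT COUNT(*) FROM events").fetchone())

# Keep writes in one place (for example, a nightly ETL job) that opens the file
# read-write while no other read-write connection is held.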


DuckDB in Production: Real Use Cases

1. ETL Pipelines Without Spark

# Transform 50GB of raw JSON logs into clean Parquet 
# in 12 minutes on a single m6i.2xlarge (8 CPU, 32GB)
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 access; set credentials as in the earlier example
con.execute("""
    COPY (
        -- read_ndjson_objects yields one raw JSON value per line in a column named "json"
        SELECT 
            json_extract_string(json, '$.user_id') as user_id,
            json_extract_string(json, '$.event') as event_type,
            CAST(json_extract_string(json, '$.timestamp') AS TIMESTAMP) as event_time,
            json_extract(json, '$.properties') as properties
        FROM read_ndjson_objects('s3://raw-logs/2026-02-22/*.ndjson.gz')
        WHERE json_extract_string(json, '$.event') IS NOT NULL
    )
    TO 's3://processed-data/events/date=2026-02-22/'
    (FORMAT PARQUET, PARTITION_BY (event_type), COMPRESSION ZSTD)
""")

This replaces a Spark cluster + EMR setup with a single Python script on a regular EC2 instance. Cost: ~$0.40 for the EC2 time vs ~$8–15 for the Spark cluster.
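On a shared or memory-constrained box it can help to pin DuckDB's resource usage explicitly. The settings below are real DuckDB options, applied to the con from the script above; the specific values are placeholders to adapt to your instance:

# Cap memory and parallelism for the ETL job (values are illustrative)
con.execute("SET memory_limit = '24GB'")
con.execute("SET threads = 8")
# Relaxing insertion order lets large COPY jobs stream with a smaller memory footprint
con.execute("SET preserve_insertion_order = false")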

2. Embedded Analytics in Applications

# Ship analytics inside your application — no external warehouse needed
from fastapi import FastAPI
import duckdb

app = FastAPI()

# One shared read-only connection for the app (see the per-request cursor note below)
@app.on_event("startup")
async def startup():
    app.state.db = duckdb.connect("app_analytics.duckdb", read_only=True)

@app.get("/api/analytics/revenue")
async def get_revenue(period: str = "30d"):
    interval_map = {"7d": "7 DAYS", "30d": "30 DAYS", "90d": "90 DAYS"}
    interval = interval_map.get(period, "30 DAYS")
    
    result = app.state.db.execute(f"""
        SELECT 
            date_trunc('day', order_date) as date,
            SUM(amount) as revenue,
            COUNT(*) as orders
        FROM orders
        WHERE order_date >= NOW() - INTERVAL {interval}
        GROUP BY 1
        ORDER BY 1
    """).fetchall()
    
    return {"data": result}
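Two practical notes on this pattern: hand each request its own cursor instead of sharing one connection object under concurrency, and pass user input as bound parameters rather than interpolating it into SQL. A sketch of both (the endpoint path and days parameter are made up for illustration):

@app.get("/api/analytics/orders_by_day")
async def orders_by_day(days: int = 30):
    # cursor() creates a lightweight, independent connection to the same database
    cur = app.state.db.cursor()
    rows = cur.execute(
        """
        SELECT date_trunc('day', order_date) as date, COUNT(*) as orders
        FROM orders
        WHERE order_date >= NOW() - to_days(?)
        GROUP BY 1
        ORDER BY 1
        """,
        [days],  # bound parameter, not string interpolation
    ).fetchall()
    return {"data": rows}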

3. Local-First Data Science Notebooks

DuckDB has become a go-to engine in notebook environments. Jupyter plus DuckDB (for example via the jupysql extension) lets data scientists work on full-scale datasets without uploading them to a cloud warehouse:

# In a Jupyter notebook: query 10GB of data locally.
# A sketch assuming the jupysql extension (pip install jupysql duckdb-engine)
%load_ext sql
%sql duckdb://

%%sql
SELECT 
    cohort_month,
    months_since_signup,
    COUNT(DISTINCT user_id) as users,
    SUM(revenue) as revenue
FROM read_parquet('~/data/user_cohorts/*.parquet')
GROUP BY 1, 2
ORDER BY 1, 2



DuckDB-Wasm: Analytics in the Browser

DuckDB compiles to WebAssembly and runs entirely in the browser — no backend required:

import * as duckdb from '@duckdb/duckdb-wasm';

const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

const worker_url = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], {type: 'text/javascript'})
);

const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(worker_url));
await db.instantiate(bundle.mainModule);

const conn = await db.connect();

// Fetch a Parquet file and query it entirely in-browser
await db.registerFileURL('sales.parquet', 'https://example.com/sales.parquet', duckdb.DuckDBDataProtocol.HTTP, false);

const result = await conn.query(`
    SELECT region, SUM(revenue) as total
    FROM parquet_scan('sales.parquet')
    GROUP BY region
    ORDER BY total DESC
`);

console.table(result.toArray());

This enables a class of applications that were previously impossible — data apps that analyze sensitive data entirely client-side, without sending data to a server.


Limitations to Know

DuckDB isn’t a general-purpose database. It’s not designed for:

  • High-concurrency OLTP — use Postgres/MySQL for transactional workloads
  • Multiple concurrent writers — DuckDB supports one writer at a time
  • Data larger than disk — it’s bounded by the local machine’s storage
  • Distributed computation — use Spark or Dask for truly massive datasets

For analytics on datasets up to a few TB on a single machine, it’s hard to beat. Beyond that, you need a distributed system.


The Ecosystem

The DuckDB ecosystem has exploded in 2025–2026:

  • Evidence.dev — BI tool built on DuckDB + Markdown
  • MotherDuck — Cloud DuckDB with a shared catalog
  • Rill Data — Fast dashboards on DuckDB
  • Ibis — Python DataFrame library with DuckDB backend
  • dbt + DuckDB — Run dbt transformations locally without a warehouse

The momentum is real. DuckDB has become the standard for local-first analytics, replacing Spark for single-machine workloads and pandas for analytical data science. If you haven’t added it to your data toolkit, now is the time.
