DuckDB: The In-Process Analytics Engine That Changes Everything
On DuckDB, Data Engineering, Analytics, Python, OLAP
For the past decade, serious analytics meant a two-step journey: load data into a warehouse (Redshift, BigQuery, Snowflake), then write SQL queries against it. For smaller datasets or exploratory work, you’d pull the data into pandas and fight Python’s memory limits.
DuckDB is a third path that didn’t exist before: a full-featured OLAP SQL engine that runs in-process — embedded in your Python script, your Go binary, your browser tab — with no server, no cluster, no configuration. It queries Parquet files on S3 at near-Spark speed, replaces pandas for most data science workflows, and makes “spin up a quick analytics pipeline” mean 5 minutes instead of 5 days.
What Makes DuckDB Different
DuckDB’s design choices are the opposite of most databases:
| | Traditional OLAP (Snowflake, BigQuery) | DuckDB |
|---|---|---|
| Architecture | Client-server, distributed | In-process, embedded |
| Deployment | Cloud service | Single binary / Python package |
| Setup | Minutes to hours | pip install duckdb |
| Parallelism | Distributed (many machines) | Columnar, vectorized (one machine) |
| Best for | PB-scale, many concurrent users | GB–TB scale, single user / batch |
| Cost | Per-query pricing | Free / open source |
DuckDB’s vectorized execution engine is the key. Instead of processing one row at a time (like SQLite), it processes values in batches of 2,048 (the default vector size), which modern CPUs can churn through with SIMD instructions on every core. For analytics queries (aggregations, joins, filters), this is typically 10–100x faster than row-at-a-time engines.
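You don’t have to take that on faith: EXPLAIN ANALYZE prints the physical plan with per-operator timings, so you can watch the parallel, vectorized operators at work. A minimal sketch against a synthetic 10-million-row table:
import duckdb

con = duckdb.connect()

# EXPLAIN ANALYZE returns the annotated plan as a single row;
# the second column holds the plan text with per-operator timings.
plan = con.execute("""
    EXPLAIN ANALYZE
    SELECT i % 100 AS bucket, COUNT(*) AS n, SUM(i) AS total
    FROM range(10000000) AS t(i)
    GROUP BY bucket
""").fetchall()[0][1]
print(plan)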
Getting Started in 60 Seconds
import duckdb
# No setup needed — create a DB or work in-memory
con = duckdb.connect() # In-memory
# Query CSV directly (no import step!)
con.execute("""
SELECT
category,
COUNT(*) as orders,
SUM(amount) as revenue,
AVG(amount) as avg_order
FROM read_csv_auto('orders_2026.csv')
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY revenue DESC
""").df() # Returns a pandas DataFrame
That’s it. No schema definition, no import, no server. DuckDB scans the CSV, infers the schema, and executes the query — all in-memory, in your Python process.
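The reverse direction is just as direct: COPY writes any query result straight back to disk as Parquet (or CSV), so a quick clean-up script needs no warehouse on either end. A minimal sketch reusing the same CSV; the output filename is just an example:
import duckdb

con = duckdb.connect()

# Aggregate the raw CSV and write the result straight to Parquet,
# with no intermediate table and no pandas round-trip.
con.execute("""
    COPY (
        SELECT category, SUM(amount) AS revenue
        FROM read_csv_auto('orders_2026.csv')
        GROUP BY category
    ) TO 'revenue_by_category.parquet' (FORMAT PARQUET)
""")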
Querying Parquet on S3 Without a Warehouse
This is where DuckDB becomes genuinely remarkable. You can query Parquet files sitting in S3 as if they were local tables:
import duckdb
con = duckdb.connect()
# Install the HTTP extension for S3 access
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';
""")
# Query Parquet files directly from S3 (with predicate pushdown!)
result = con.execute("""
SELECT
user_id,
SUM(purchase_amount) as ltv,
COUNT(*) as transactions,
MAX(event_date) as last_purchase
FROM read_parquet('s3://my-data-lake/events/year=2026/month=*/day=*/*.parquet')
WHERE event_type = 'purchase'
AND event_date >= '2026-01-01'
GROUP BY user_id
HAVING SUM(purchase_amount) > 1000
ORDER BY ltv DESC
LIMIT 1000
""").df()
DuckDB pushes predicates into the Parquet reader — it only reads the row groups that match the filter, skipping irrelevant data. For a 500GB Parquet dataset where your query matches 5%, DuckDB reads ~25GB, not 500GB. This competes with Spark on single-table analytics.
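If you want to verify the pushdown, EXPLAIN shows the filters attached directly to the Parquet scan operator rather than applied in a separate step afterwards. A quick check, continuing with the connection from the snippet above:
# The plan prints the Parquet scan with "Filters: ..." attached to it;
# that is the predicate DuckDB uses to skip non-matching row groups.
plan = con.execute("""
    EXPLAIN
    SELECT user_id, SUM(purchase_amount) AS ltv
    FROM read_parquet('s3://my-data-lake/events/year=2026/month=*/day=*/*.parquet')
    WHERE event_type = 'purchase'
      AND event_date >= '2026-01-01'
    GROUP BY user_id
""").fetchall()[0][1]
print(plan)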
Replacing Pandas for Data Science
Pandas slows down past ~1GB because it’s single-threaded and eagerly materializes every intermediate result in memory. DuckDB’s Python integration lets you keep your data in DataFrames where that’s convenient, but run the heavy queries on DuckDB’s multi-threaded columnar engine:
import duckdb
import pandas as pd
# Load a pandas DataFrame
df = pd.read_parquet("events.parquet") # 10GB file
# DuckDB can query pandas DataFrames in place: it scans the underlying
# columns directly, so there is no import step and, for most column types, no copy
result = duckdb.sql("""
SELECT
date_trunc('week', event_time) as week,
platform,
COUNT(DISTINCT user_id) as dau,
COUNT(*) as events
FROM df
GROUP BY 1, 2
ORDER BY 1, 2
""").df()
# Benchmark: pandas groupby on 10M rows → 8.2 seconds
# DuckDB on same data → 0.4 seconds (20x faster)
The typical data science workflow with pandas hits a wall around 4–8GB on a standard laptop. DuckDB processes the same data using all CPU cores, with parallel, vectorized execution that single-threaded pandas can’t match.
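Often you can skip the pd.read_parquet step entirely: let DuckDB scan the file itself and only materialize the small aggregated result as a DataFrame, so peak memory tracks the result size rather than the raw file size. A sketch with the same hypothetical events.parquet:
import duckdb

# Scan the Parquet file directly; only the aggregated result
# (a few thousand rows) is ever converted to pandas.
weekly = duckdb.sql("""
    SELECT
        date_trunc('week', event_time) AS week,
        platform,
        COUNT(DISTINCT user_id) AS weekly_users,
        COUNT(*) AS events
    FROM read_parquet('events.parquet')
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()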
The Relational API (No SQL Required)
For those who prefer method chaining over SQL:
import duckdb
# Build the pipeline lazily as a relation; nothing executes until .df()
rel = (
    duckdb.read_parquet("events.parquet")
    .filter("event_type = 'purchase'")
    .project("user_id, amount, event_date")
    # aggregate() takes the aggregate expressions first, then the GROUP BY columns
    .aggregate(
        "user_id, SUM(amount) AS total_spend, COUNT(*) AS orders, MAX(event_date) AS last_order",
        "user_id",
    )
    .filter("total_spend > 100")
    .order("total_spend DESC")
    .limit(100)
)
df = rel.df()  # Execution happens here
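You can also drop back into SQL at any point: duckdb.sql() resolves Python variables that hold relations (or DataFrames) by name, so the two styles compose. A small sketch continuing from rel above:
# Reference the relation above by its Python variable name inside SQL.
repeat_buyers = duckdb.sql("""
    SELECT user_id, total_spend
    FROM rel
    WHERE orders >= 5
""").df()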
Persistent Databases and the DuckDB File Format
For local-first applications and reporting tools, DuckDB works as a persistent database:
import duckdb
# Create a persistent DuckDB database
con = duckdb.connect("analytics.duckdb")
# Create tables with column-store format
con.execute("""
CREATE TABLE IF NOT EXISTS events (
event_id BIGINT PRIMARY KEY,
user_id VARCHAR,
event_type VARCHAR,
event_time TIMESTAMP,
properties JSON
)
""")
# Bulk insert from Parquet (extremely fast)
con.execute("""
INSERT INTO events
SELECT * FROM read_parquet('raw_events/*.parquet')
""")
# Now query with full SQL including window functions
con.execute("""
SELECT
user_id,
event_type,
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event,
DATEDIFF('minute',
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
event_time
) as minutes_since_last
FROM events
WHERE event_time >= NOW() - INTERVAL 7 DAYS
""").df()
A .duckdb file is a single portable file — you can email it, check it into git, or deploy it alongside your app. It supports concurrent readers (many can query simultaneously), but only one writer at a time — a limitation to be aware of for multi-user applications.
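In practice that usually means one batch job owns the read-write connection while everything else opens the file with read_only=True. A minimal sketch:
import duckdb

# The nightly job opens the file read-write, loads data, and closes it...
writer = duckdb.connect("analytics.duckdb")
writer.execute("INSERT INTO events SELECT * FROM read_parquet('raw_events/*.parquet')")
writer.close()

# ...after which any number of processes can open it read-only and query away.
reader = duckdb.connect("analytics.duckdb", read_only=True)
event_count = reader.execute("SELECT COUNT(*) FROM events").fetchone()[0]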
DuckDB in Production: Real Use Cases
1. ETL Pipelines Without Spark
# Transform 50GB of raw JSON logs into clean Parquet
# in 12 minutes on a single m6i.2xlarge (8 CPU, 32GB)
import duckdb
con = duckdb.connect()
con.execute("""
COPY (
SELECT
json_extract_string(log, '$.user_id') as user_id,
json_extract_string(log, '$.event') as event_type,
CAST(json_extract_string(log, '$.timestamp') AS TIMESTAMP) as event_time,
json_extract(log, '$.properties') as properties
FROM read_ndjson_auto('s3://raw-logs/2026-02-22/*.ndjson.gz')
WHERE json_extract_string(log, '$.event') IS NOT NULL
)
TO 's3://processed-data/events/date=2026-02-22/'
(FORMAT PARQUET, PARTITION_BY (event_type), COMPRESSION ZSTD)
""")
This replaces a Spark cluster + EMR setup with a single Python script on a regular EC2 instance. Cost: ~$0.40 for the EC2 time vs ~$8–15 for the Spark cluster.
2. Embedded Analytics in Applications
# Ship analytics inside your application — no external warehouse needed
from fastapi import FastAPI
import duckdb
app = FastAPI()
# Open one shared read-only connection at startup
@app.on_event("startup")
async def startup():
app.state.db = duckdb.connect("app_analytics.duckdb", read_only=True)
@app.get("/api/analytics/revenue")
async def get_revenue(period: str = "30d"):
interval_map = {"7d": "7 DAYS", "30d": "30 DAYS", "90d": "90 DAYS"}
interval = interval_map.get(period, "30 DAYS")
result = app.state.db.execute(f"""
SELECT
date_trunc('day', order_date) as date,
SUM(amount) as revenue,
COUNT(*) as orders
FROM orders
WHERE order_date >= NOW() - INTERVAL {interval}
GROUP BY 1
ORDER BY 1
""").fetchall()
return {"data": result}
3. Local-First Data Science Notebooks
Several notebook environments now ship DuckDB as their built-in SQL engine, and in Jupyter it lets data scientists work on full-scale datasets without uploading anything to a cloud warehouse:
# In a Jupyter notebook — query 10GB of data locally
# (using the jupysql extension and the DuckDB SQLAlchemy driver:
#  pip install jupysql duckdb-engine)
%load_ext sql
%sql duckdb://

%%sql
SELECT
cohort_month,
months_since_signup,
COUNT(DISTINCT user_id) as users,
SUM(revenue) as revenue
FROM read_parquet('~/data/user_cohorts/*.parquet')
GROUP BY 1, 2
ORDER BY 1, 2
DuckDB-Wasm: Analytics in the Browser
DuckDB compiles to WebAssembly and runs entirely in the browser — no backend required:
import * as duckdb from '@duckdb/duckdb-wasm';
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
const worker_url = URL.createObjectURL(
new Blob([`importScripts("${bundle.mainWorker}");`], {type: 'text/javascript'})
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(worker_url));
await db.instantiate(bundle.mainModule);
const conn = await db.connect();
// Fetch a Parquet file and query it entirely in-browser
await db.registerFileURL('sales.parquet', 'https://example.com/sales.parquet', duckdb.DuckDBDataProtocol.HTTP, false);
const result = await conn.query(`
SELECT region, SUM(revenue) as total
FROM parquet_scan('sales.parquet')
GROUP BY region
ORDER BY total DESC
`);
console.table(result.toArray());
This enables a class of applications that were previously impossible — data apps that analyze sensitive data entirely client-side, without sending data to a server.
Limitations to Know
DuckDB isn’t a general-purpose database. It’s not designed for:
- High-concurrency OLTP — use Postgres/MySQL for transactional workloads
- Multiple concurrent writers — DuckDB supports one writer at a time
- Data larger than disk — it’s bounded by the local machine’s storage
- Distributed computation — use Spark or Dask for truly massive datasets
For analytics on datasets up to a few TB on a single machine, it’s hard to beat. Beyond that, you need a distributed system.
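Within those single-machine bounds you still have some control: larger-than-memory aggregations and sorts can spill to disk, and the memory cap, thread count, and spill location are all configurable. A minimal sketch (the values are illustrative):
import duckdb

con = duckdb.connect("analytics.duckdb")

# Cap memory, limit parallelism, and point spill files at a fast local disk.
con.execute("SET memory_limit = '16GB'")
con.execute("SET threads = 8")
con.execute("SET temp_directory = '/mnt/scratch/duckdb_tmp'")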
The Ecosystem
The DuckDB ecosystem has exploded in 2025–2026:
- Evidence.dev — BI tool built on DuckDB + Markdown
- MotherDuck — Cloud DuckDB with a shared catalog
- Rill Data — Fast dashboards on DuckDB
- Ibis — Python DataFrame library with DuckDB backend
- dbt + DuckDB — Run dbt transformations locally without a warehouse
The momentum is real. DuckDB has become the standard for local-first analytics, replacing Spark for single-machine workloads and pandas for analytical data science. If you haven’t added it to your data toolkit, now is the time.
