Apache Iceberg and the Data Lakehouse Revolution: A Complete Guide
For two decades, enterprises chose between two worlds: data lakes (cheap, flexible, no reliability) or data warehouses (expensive, reliable, no flexibility). The lakehouse concept promised to merge them — and in 2026, Apache Iceberg is delivering on that promise at scale.
Iceberg is now the default table format for production data at Netflix, Apple, LinkedIn, Airbnb, and thousands of other companies. If you’re building or running a modern data platform, you need to understand it.
The Problem with Traditional Data Lakes
A data lake stores files (Parquet, ORC, CSV) in object storage like S3. It’s cheap and scalable, but it has critical limitations:
- No ACID transactions: Concurrent writes corrupt data; partial failures leave inconsistent state
- No schema evolution: Changing a column type means rewriting every file
- Slow queries on small files: Hundreds of thousands of small Parquet files kill query performance
- No time travel: You can’t query data as it existed yesterday
- Hidden data quality issues: No way to enforce constraints or catch bad writes
These limitations pushed companies toward data warehouses (Snowflake, BigQuery, Redshift) — but at 10–50× the storage cost.
What is Apache Iceberg?
Iceberg is an open table format for huge analytic datasets. It adds a metadata layer on top of regular file storage (S3, GCS, ADLS) that provides:
- ✅ ACID transactions — safe concurrent reads and writes
- ✅ Schema evolution — add/rename/drop columns without rewriting data
- ✅ Hidden partitioning — partition pruning without partition columns in queries
- ✅ Time travel — query any historical snapshot with TIMESTAMP AS OF
- ✅ Partition evolution — change partition strategy without rewriting data
- ✅ Row-level deletes — efficient MERGE/UPDATE/DELETE at petabyte scale
Iceberg isn’t a query engine or a storage system. It’s a specification for how to organize and track table metadata, implemented as a library that any engine can use.
Iceberg’s Architecture
Three Layers
Query Engine (Spark, Trino, Flink, DuckDB, Snowflake, BigQuery...)
↓
Iceberg Catalog (Glue, Hive Metastore, Nessie, REST Catalog)
↓
Iceberg Metadata Files → Data Files (Parquet on S3)
Metadata Structure
s3://my-bucket/warehouse/
└── orders/
├── metadata/
│ ├── v1.metadata.json ← table schema, partition spec
│ ├── v2.metadata.json ← schema evolution
│ ├── snap-001.avro ← snapshot 1 manifest list
│ └── snap-002.avro ← snapshot 2 manifest list
└── data/
├── year=2025/month=01/ ← partition directories
│ └── 00001.parquet
└── year=2025/month=02/
└── 00002.parquet
Each snapshot points to a manifest list, which points to manifest files, which list the actual data files. This chain enables:
- Time travel (each snapshot is preserved)
- Concurrent writes (optimistic concurrency with atomic metadata commits)
- Partition pruning (manifests contain column-level statistics)
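The chain above can be sketched in plain Python. These dataclasses are hypothetical stand-ins for Iceberg's real metadata structures, but they show the key idea: each snapshot is an immutable tree, and manifests carry column-level stats that let a planner skip data files without opening them.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_month: int  # simplified column-level statistic
    max_month: int

@dataclass
class Manifest:
    files: list

@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list

# Two snapshots preserved side by side: time travel is simply
# choosing which snapshot to start the chain from.
snap1 = Snapshot(1, [Manifest([DataFile("00001.parquet", 1, 1)])])
snap2 = Snapshot(2, [Manifest([DataFile("00001.parquet", 1, 1),
                               DataFile("00002.parquet", 2, 2)])])

def plan_scan(snapshot, month):
    """Partition pruning: keep only files whose stats can match."""
    return [f.path
            for m in snapshot.manifests
            for f in m.files
            if f.min_month <= month <= f.max_month]

print(plan_scan(snap2, 2))  # only 00002.parquet survives pruning
```

A query against snapshot 1 never sees files added in snapshot 2, which is exactly why concurrent readers stay consistent while writers commit.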
Getting Started: Spark + Iceberg + S3
1. Setup
# PySpark with Iceberg
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "s3://my-bucket/warehouse") \
    .getOrCreate()
2. Create an Iceberg Table
-- Create table with hidden partitioning
CREATE TABLE local.db.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2),
    status      STRING,
    created_at  TIMESTAMP
) USING iceberg
PARTITIONED BY (days(created_at))  -- hidden: no partition column in schema
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.target-file-size-bytes' = '134217728'  -- 128 MB files
);
3. ACID Writes
# Safe concurrent writes — Iceberg handles conflicts
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1001, 42, 99.99, "pending", "2025-01-15 10:00:00"),
    (1002, 43, 149.50, "completed", "2025-01-15 11:00:00"),
], ["order_id", "customer_id", "amount", "status", "created_at"])

# Cast columns to match the table schema (TIMESTAMP, DECIMAL(10,2))
# before appending — writeTo() enforces the Iceberg schema
df = (df.withColumn("created_at", F.to_timestamp("created_at"))
        .withColumn("amount", F.col("amount").cast("decimal(10,2)")))
df.writeTo("local.db.orders").append()
4. Time Travel
-- Query data as of a specific time
SELECT * FROM local.db.orders
TIMESTAMP AS OF '2025-01-01 00:00:00';
-- Query a specific snapshot
SELECT * FROM local.db.orders VERSION AS OF 1234567890;
-- List all snapshots
SELECT * FROM local.db.orders.snapshots;
5. Schema Evolution
-- Add a column — no data rewriting needed
ALTER TABLE local.db.orders ADD COLUMN discount DECIMAL(5,2);
-- Rename a column — backward compatible
ALTER TABLE local.db.orders RENAME COLUMN status TO order_status;
-- Change column type (widening only without full rewrite)
ALTER TABLE local.db.orders ALTER COLUMN amount TYPE DECIMAL(15, 2);
Row-Level Operations: MERGE, UPDATE, DELETE
One of Iceberg’s killer features is efficient row-level mutations — something traditional Parquet lakes can’t do without rewriting entire partitions.
-- UPSERT (MERGE INTO)
MERGE INTO local.db.orders t
USING staging.new_orders s
ON t.order_id = s.order_id
WHEN MATCHED AND s.status != t.order_status THEN
UPDATE SET order_status = s.status, updated_at = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, order_status, created_at)
VALUES (s.order_id, s.customer_id, s.amount, s.status, s.created_at);
-- Targeted delete (GDPR compliance)
DELETE FROM local.db.orders WHERE customer_id = 12345;
-- Update
UPDATE local.db.orders
SET order_status = 'cancelled'
WHERE order_id = 1001 AND order_status = 'pending';
Iceberg implements these using delete files (position deletes or equality deletes) that are merged at read time, avoiding full partition rewrites for small updates.
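The merge-on-read behavior of position deletes can be illustrated with a toy sketch (hypothetical data, not the real Iceberg reader): a delete file records row positions within a data file, and the reader filters them out on the fly instead of rewriting the file.

```python
def read_with_deletes(data_rows, position_deletes):
    """Merge-on-read: skip row positions listed in a position
    delete file, leaving the original data file untouched."""
    deleted = set(position_deletes)
    return [row for pos, row in enumerate(data_rows) if pos not in deleted]

rows = [
    {"order_id": 1001, "status": "pending"},
    {"order_id": 1002, "status": "completed"},
    {"order_id": 1003, "status": "pending"},
]
# A position delete file says: row 1 of this data file is deleted.
print(read_with_deletes(rows, [1]))
```

Equality deletes work analogously but match on column values (e.g. customer_id = 12345) rather than positions, which is why a GDPR delete can commit in seconds even on a petabyte table — the rewrite cost is deferred to compaction.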
Table Maintenance
Iceberg tables accumulate metadata and small files over time. Regular maintenance is essential:
from pyspark.sql import SparkSession
# Compact small files into larger ones
spark.sql("""
CALL local.system.rewrite_data_files(
table => 'db.orders',
strategy => 'binpack',
options => map('target-file-size-bytes', '134217728')
)
""")
# Expire snapshots older than a cutoff, always retaining the last 5
spark.sql("""
CALL local.system.expire_snapshots(
table => 'db.orders',
older_than => TIMESTAMP '2025-12-01 00:00:00',
retain_last => 5
)
""")
# Remove orphaned files
spark.sql("""
CALL local.system.remove_orphan_files(table => 'db.orders')
""")
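The binpack strategy used above is, at its core, greedy bin-packing. Here is a simplified sketch of the idea (not Iceberg's actual planner): group small files into batches near the target size, and each batch gets rewritten as one larger file.

```python
def binpack(file_sizes, target):
    """Greedy grouping of small files into rewrite batches that
    approach the target size without exceeding it."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes):
        if total + size > target and current:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# Ten 20 MB files against a 128 MB target: two rewrite groups
# (six files, then four), instead of ten tiny scans at query time.
mb = 1024 * 1024
print([len(g) for g in binpack([20 * mb] * 10, 128 * mb)])
```

Fewer, larger files mean fewer S3 GET requests and less per-file planning overhead, which is where most small-file query pain comes from.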
Query Engines That Support Iceberg
In 2026, virtually every major query engine supports Iceberg natively:
| Engine | Support Level | Notes |
|---|---|---|
| Apache Spark | Full (read/write/DDL) | Reference implementation |
| Trino | Full | Best for ad-hoc queries |
| Apache Flink | Full (streaming) | Native streaming ingestion |
| DuckDB | Read (+ write via extension) | Local analytics, fast |
| Snowflake | Full via Iceberg Tables | Native catalog integration |
| BigQuery | Full via BigLake | Managed open table format |
| AWS Athena | Full | Serverless S3 queries |
| Databricks | Full | + Delta Lake interop via UniForm |
| StarRocks | Full | High-performance OLAP |
Catalogs: The Missing Piece
A catalog tracks which tables exist and where their current metadata pointer is. Without a catalog, you’d need to pass the metadata path to every query.
| Catalog | Type | Best For |
|---|---|---|
| AWS Glue | Managed | AWS-native workloads |
| Hive Metastore | Self-hosted | Legacy Hadoop environments |
| Project Nessie | Git-like | Multi-branch data development |
| Polaris / Unity | Managed open | Multi-engine, vendor-neutral |
| REST Catalog | Standard interface | Any implementation |
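The "current metadata pointer" is also what makes Iceberg's optimistic concurrency work. A minimal sketch (a toy in-memory catalog, not any real implementation): commits are compare-and-swap operations on the pointer, so a stale writer fails cleanly and retries on top of the new state.

```python
class Catalog:
    """Toy catalog: table name -> current metadata file, with
    compare-and-swap commits. Real catalogs make the swap atomic
    (e.g. a conditional update in Glue or a database)."""
    def __init__(self):
        self.pointers = {}

    def commit(self, table, expected, new):
        if self.pointers.get(table) != expected:
            return False  # another writer won the race; retry
        self.pointers[table] = new
        return True

cat = Catalog()
cat.commit("db.orders", None, "v1.metadata.json")
# A concurrent writer still pointing at the old state loses:
print(cat.commit("db.orders", None, "v2.metadata.json"))
# After rebasing on v1, the retry succeeds:
print(cat.commit("db.orders", "v1.metadata.json", "v2.metadata.json"))
```

Since data and metadata files are immutable, the pointer swap is the only mutable step — which is why any catalog that can do an atomic conditional update is enough to get full ACID semantics.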
Project Nessie deserves special mention: it gives Iceberg tables Git-like branching. You can create a branch, run transformations, and merge the changes back — data version control like code version control.
# Create a data branch
nessie branch dev main
# All writes on the dev branch don't affect main
spark.sql("USE REFERENCE dev IN nessie")
spark.sql("INSERT INTO db.orders VALUES ...")
# Merge when ready
nessie merge dev --into main
Iceberg vs Delta Lake vs Apache Hudi
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| ACID | ✅ | ✅ | ✅ |
| Time Travel | ✅ | ✅ | ✅ |
| Schema Evolution | ✅ Full | ✅ | ✅ |
| Partition Evolution | ✅ | ❌ | ❌ |
| Hidden Partitioning | ✅ | ❌ | ❌ |
| Engine Support | Broadest | Databricks-centric | Spark/Flink focus |
| Governance | Open (Apache) | Linux Foundation | Apache |
Iceberg’s partition evolution and hidden partitioning are its strongest differentiators. Delta Lake has the strongest Databricks integration. Hudi excels in streaming upsert workloads.
In 2026, Iceberg has the broadest multi-engine support and is the de facto standard for greenfield data platforms.
Real-World Architecture: Modern Data Platform
┌─────────────────────────────────────────────┐
│ Data Sources │
│ (Kafka, CDC, APIs, batch files) │
└──────────────────────┬──────────────────────┘
│
Apache Flink (streaming)
Apache Spark (batch)
│ write
▼
┌─────────────────────────────────────────────┐
│ Iceberg Lakehouse on S3 │
│ ├── raw/ (landing zone) │
│ ├── curated/ (cleaned, typed) │
│ └── serving/ (aggregated marts) │
│ │
│ Catalog: AWS Glue / Polaris │
└──────────────┬──────────────────────────────┘
│ query
┌────────┼────────┐
▼ ▼ ▼
Trino BigQuery DuckDB
(adhoc) (BI tools) (local)
Conclusion
Apache Iceberg is the connective tissue of the modern data stack. It gives you warehouse-grade reliability (ACID, schema evolution, time travel) at lake economics (S3 storage, any query engine). The lakehouse isn’t a buzzword anymore — it’s running at petabyte scale in production.
If you’re building a new data platform in 2026 and you’re not using Iceberg (or Delta Lake/Hudi), you’re taking on technical debt from day one. The tools are mature, the ecosystem is broad, and the operational overhead is manageable.
Start with Iceberg + AWS Glue + Trino for ad-hoc queries. Add Spark for heavy ETL. Add Flink when you need streaming. Your data platform will thank you in five years.
