Apache Iceberg and the Data Lakehouse Revolution: A Complete Guide




For two decades, enterprises chose between two worlds: data lakes (cheap, flexible, no reliability) or data warehouses (expensive, reliable, no flexibility). The lakehouse concept promised to merge them — and in 2026, Apache Iceberg is delivering on that promise at scale.

Iceberg is now the default table format for production data at Netflix, Apple, LinkedIn, Airbnb, and thousands of other companies. If you’re building or running a modern data platform, you need to understand it.



The Problem with Traditional Data Lakes

A data lake stores files (Parquet, ORC, CSV) in object storage like S3. It’s cheap and scalable, but it has critical limitations:

  • No ACID transactions: Concurrent writes corrupt data; partial failures leave inconsistent state
  • No schema evolution: Changing a column type means rewriting every file
  • Slow queries on small files: Hundreds of thousands of small Parquet files kill query performance
  • No time travel: You can’t query data as it existed yesterday
  • Hidden data quality issues: No way to enforce constraints or catch bad writes

These limitations pushed companies toward data warehouses (Snowflake, BigQuery, Redshift) — but at 10–50× the storage cost.


What is Apache Iceberg?

Iceberg is an open table format for huge analytic datasets. It adds a metadata layer on top of regular file storage (S3, GCS, ADLS) that provides:

  • ACID transactions — safe concurrent reads and writes
  • Schema evolution — add/rename/drop columns without rewriting data
  • Hidden partitioning — partition pruning without partition columns in queries
  • Time travel — query any historical snapshot with TIMESTAMP AS OF
  • Partition evolution — change partition strategy without rewriting data
  • Row-level deletes — efficient MERGE/UPDATE/DELETE at petabyte scale

Iceberg isn’t a query engine or a storage system. It’s a specification for how to organize and track table metadata, implemented as a library that any engine can use.
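To make "a specification for table metadata" concrete, here is a toy Python model of the kind of state a `v1.metadata.json` file tracks. This is an illustrative sketch, not the real spec's full schema — but the key idea is real: the table is whatever the current metadata file says it is.

```python
# Toy model (not the full Iceberg spec): a metadata file records the
# schema, partition spec, and the list of snapshots; the table's current
# state is just a pointer into that snapshot list.
table_metadata = {
    "format-version": 2,
    "location": "s3://my-bucket/warehouse/orders",
    "schema": {"fields": [
        {"id": 1, "name": "order_id", "type": "long"},
        {"id": 2, "name": "amount", "type": "decimal(10,2)"},
    ]},
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-001.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-002.avro"},
    ],
}

def current_manifest_list(meta):
    """Resolve the manifest list for the table's current snapshot."""
    current = meta["current-snapshot-id"]
    return next(s["manifest-list"] for s in meta["snapshots"]
                if s["snapshot-id"] == current)

print(current_manifest_list(table_metadata))  # snap-002.avro
```

Because every engine resolves the table through this one structure, Spark, Trino, and DuckDB all see the same consistent state.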


Iceberg’s Architecture

Three Layers

Query Engine (Spark, Trino, Flink, DuckDB, Snowflake, BigQuery...)
        ↓
Iceberg Catalog (Glue, Hive Metastore, Nessie, REST Catalog)
        ↓
Iceberg Metadata Files → Data Files (Parquet on S3)

Metadata Structure

s3://my-bucket/warehouse/
└── orders/
    ├── metadata/
    │   ├── v1.metadata.json       ← table schema, partition spec
    │   ├── v2.metadata.json       ← schema evolution
    │   ├── snap-001.avro          ← snapshot 1 manifest list
    │   └── snap-002.avro          ← snapshot 2 manifest list
    └── data/
        ├── year=2025/month=01/    ← partition directories
        │   └── 00001.parquet
        └── year=2025/month=02/
            └── 00002.parquet

Each snapshot points to a manifest list, which points to manifest files, which list the actual data files. This chain enables:

  • Time travel (each snapshot is preserved)
  • Concurrent writes (optimistic concurrency with atomic metadata commits)
  • Partition pruning (manifests contain column-level statistics)
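The pruning point deserves a concrete illustration. The sketch below (stdlib Python, with made-up file names and stats) shows how a planner can skip data files using only per-column min/max bounds stored in manifests, without opening a single Parquet file:

```python
# Sketch of manifest-based pruning: each manifest entry carries per-column
# min/max statistics, so the planner can discard files whose value range
# cannot possibly satisfy the predicate.
manifest_entries = [
    {"file": "data/year=2025/month=01/00001.parquet",
     "stats": {"amount": (5.00, 80.00)}},      # (min, max) for the column
    {"file": "data/year=2025/month=02/00002.parquet",
     "stats": {"amount": (120.00, 950.00)}},
]

def prune(entries, column, lower_bound):
    """Keep only files whose max for `column` can satisfy value > lower_bound."""
    return [e["file"] for e in entries
            if e["stats"][column][1] > lower_bound]

# A query like `WHERE amount > 100` only ever reads the second file:
print(prune(manifest_entries, "amount", 100.0))
```

Real Iceberg manifests track bounds, null counts, and value counts per column, but the planning principle is exactly this filter.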

Getting Started: Spark + Iceberg + S3

1. Setup

# PySpark with Iceberg
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages", 
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "s3://my-bucket/warehouse") \
    .getOrCreate()

2. Create an Iceberg Table

-- Create table with hidden partitioning
CREATE TABLE local.db.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    amount       DECIMAL(10, 2),
    status       STRING,
    created_at   TIMESTAMP
) USING iceberg
PARTITIONED BY (days(created_at))   -- hidden: no partition column in schema
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.target-file-size-bytes' = '134217728'  -- 128MB files
);
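What `days(created_at)` actually computes is simple: days since the Unix epoch. A minimal sketch of the transform, assuming UTC timestamps:

```python
from datetime import datetime, timezone

# Sketch of Iceberg's `days` partition transform: the engine derives the
# partition value from created_at itself, so queries filter on the
# timestamp column and never mention a partition column.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts: datetime) -> int:
    """Days since epoch — the partition value for a timestamp."""
    return (ts - EPOCH).days

ts = datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc)
print(days_transform(ts))  # 20103
```

Because the transform is recorded in the table spec, a predicate like `WHERE created_at >= '2025-01-15'` is rewritten into a day-value range and prunes partitions automatically.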

3. ACID Writes

# Safe concurrent writes — Iceberg handles conflicts
from datetime import datetime

df = spark.createDataFrame([
    (1001, 42, 99.99, "pending", datetime(2025, 1, 15, 10, 0)),
    (1002, 43, 149.50, "completed", datetime(2025, 1, 15, 11, 0)),
], ["order_id", "customer_id", "amount", "status", "created_at"])

df.writeTo("local.db.orders").append()
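The "handles conflicts" part works through optimistic concurrency: a commit atomically swaps the catalog's metadata pointer only if it still points at the version the writer started from. A toy compare-and-swap sketch (hypothetical `Catalog` class, not Iceberg's API):

```python
# Sketch of optimistic concurrency: writers race to swap the metadata
# pointer; the loser detects the conflict and retries against the new base.
class Catalog:
    def __init__(self):
        self.pointer = "v1.metadata.json"

    def commit(self, expected, new):
        """Compare-and-swap: succeed only if no one committed in between."""
        if self.pointer != expected:
            return False
        self.pointer = new
        return True

catalog = Catalog()
base = catalog.pointer
assert catalog.commit(base, "v2.metadata.json")       # writer A wins
assert not catalog.commit(base, "v2b.metadata.json")  # writer B conflicts
# Writer B re-reads the current pointer and reapplies its changes:
assert catalog.commit(catalog.pointer, "v3.metadata.json")
```

Since data files are immutable and only the pointer swap must be atomic, a failed writer retries cheaply without rewriting its data.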

4. Time Travel

-- Query data as of a specific time
SELECT * FROM local.db.orders
TIMESTAMP AS OF '2025-01-01 00:00:00';

-- Query a specific snapshot
SELECT * FROM local.db.orders VERSION AS OF 1234567890;

-- List all snapshots
SELECT * FROM local.db.orders.snapshots;
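Under the hood, `TIMESTAMP AS OF` is a lookup in the table's snapshot log: pick the latest snapshot committed at or before the requested time. A minimal sketch with invented snapshot names:

```python
from datetime import datetime

# Sketch of time-travel resolution: the snapshot log is ordered by
# commit time; a timestamp query binds to the last snapshot at or
# before that instant.
snapshot_log = [
    (datetime(2024, 12, 20), "snap-001.avro"),
    (datetime(2025, 1, 10), "snap-002.avro"),
    (datetime(2025, 2, 1), "snap-003.avro"),
]

def snapshot_as_of(log, ts):
    candidates = [manifest for (committed, manifest) in log if committed <= ts]
    if not candidates:
        raise ValueError("no snapshot exists at or before that time")
    return candidates[-1]  # log is ordered, so the last match is newest

print(snapshot_as_of(snapshot_log, datetime(2025, 1, 1)))  # snap-001.avro
```

This is also why `expire_snapshots` limits time travel: once a snapshot is expired, its entry (and its files) are gone from the log.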

5. Schema Evolution

-- Add a column — no data rewriting needed
ALTER TABLE local.db.orders ADD COLUMN discount DECIMAL(5,2);

-- Rename a column — backward compatible
ALTER TABLE local.db.orders RENAME COLUMN status TO order_status;

-- Change column type (widening only without full rewrite)
ALTER TABLE local.db.orders ALTER COLUMN amount TYPE DECIMAL(15, 2);
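Renames are free because Iceberg tracks columns by stable field IDs, not names; the current schema is just an ID-to-name mapping applied at read time. A toy sketch:

```python
# Sketch of ID-based schema evolution: data files store values keyed by
# field ID, so renaming a column only changes the name mapping in
# metadata — no data file is touched.
data_file_row = {1: 1001, 4: "pending"}   # row as written under the old schema

schema_v1 = {1: "order_id", 4: "status"}
schema_v2 = {1: "order_id", 4: "order_status"}  # after RENAME COLUMN

def read_row(row, schema):
    """Project a stored row through the current schema's ID→name mapping."""
    return {schema[field_id]: value for field_id, value in row.items()}

print(read_row(data_file_row, schema_v2))
# {'order_id': 1001, 'order_status': 'pending'}
```

The same mechanism makes adds and drops metadata-only operations: a new field gets a fresh ID, and old files simply have no value for it.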

Row-Level Operations: MERGE, UPDATE, DELETE

One of Iceberg’s killer features is efficient row-level mutations — something traditional Parquet lakes can’t do without rewriting entire partitions.

-- UPSERT (MERGE INTO)
MERGE INTO local.db.orders t
USING staging.new_orders s
ON t.order_id = s.order_id
WHEN MATCHED AND s.status != t.order_status THEN
    UPDATE SET order_status = s.status
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, amount, order_status, created_at)
    VALUES (s.order_id, s.customer_id, s.amount, s.status, s.created_at);

-- Targeted delete (GDPR compliance)
DELETE FROM local.db.orders WHERE customer_id = 12345;

-- Update
UPDATE local.db.orders
SET order_status = 'cancelled'
WHERE order_id = 1001 AND order_status = 'pending';

Iceberg implements these using delete files (position deletes or equality deletes) that are merged at read time, avoiding full partition rewrites for small updates.
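A position delete is conceptually just a set of (data file, row position) pairs that readers subtract while scanning. A simplified sketch of merge-on-read, with made-up file names and rows:

```python
# Sketch of merge-on-read with position deletes: the DELETE writes a
# small delete file listing (data_file, row_position) pairs; readers
# filter those rows out instead of rewriting the Parquet file.
data_file = "00001.parquet"
rows = [(1001, "pending"), (1002, "completed"), (1003, "pending")]

position_deletes = {(data_file, 0)}  # a DELETE hit row 0 of this file

def read_with_deletes(file, file_rows, deletes):
    return [row for pos, row in enumerate(file_rows)
            if (file, pos) not in deletes]

print(read_with_deletes(data_file, rows, position_deletes))
```

The trade-off is read amplification: accumulated delete files slow scans, which is why compaction (below) eventually folds them back into rewritten data files.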


Table Maintenance

Iceberg tables accumulate metadata and small files over time. Regular maintenance is essential:

# Reusing the Spark session configured earlier

# Compact small files into larger ones
spark.sql("""
    CALL local.system.rewrite_data_files(
        table => 'db.orders',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire snapshots older than the cutoff (always retaining the last 5)
spark.sql("""
    CALL local.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-12-01 00:00:00',
        retain_last => 5
    )
""")

# Remove orphaned files
spark.sql("""
    CALL local.system.remove_orphan_files(table => 'db.orders')
""")

Query Engines That Support Iceberg

In 2026, virtually every major query engine supports Iceberg natively:

Engine        | Support Level                | Notes
--------------|------------------------------|----------------------------------
Apache Spark  | Full (read/write/DDL)        | Reference implementation
Trino         | Full                         | Best for ad-hoc queries
Apache Flink  | Full (streaming)             | Native streaming ingestion
DuckDB        | Read (+ write via extension) | Local analytics, fast
Snowflake     | Full via Iceberg Tables      | Native catalog integration
BigQuery      | Full via BigLake             | Managed open table format
AWS Athena    | Full                         | Serverless S3 queries
Databricks    | Full                         | + Delta Lake interop via UniForm
StarRocks     | Full                         | High-performance OLAP

Catalogs: The Missing Piece

A catalog tracks which tables exist and where their current metadata pointer is. Without a catalog, you’d need to pass the metadata path to every query.

Catalog         | Type               | Best For
----------------|--------------------|------------------------------
AWS Glue        | Managed            | AWS-native workloads
Hive Metastore  | Self-hosted        | Legacy Hadoop environments
Project Nessie  | Git-like           | Multi-branch data development
Polaris / Unity | Managed open       | Multi-engine, vendor-neutral
REST Catalog    | Standard interface | Any implementation

Project Nessie deserves special mention: it gives Iceberg tables Git-like branching. You can create a branch, run transformations, and merge the changes back — data version control like code version control.

-- Create a data branch (Nessie Spark SQL extensions)
CREATE BRANCH dev IN nessie FROM main;

-- All writes on the dev branch don't affect main
USE REFERENCE dev IN nessie;
INSERT INTO db.orders VALUES ...;

-- Merge when ready
MERGE BRANCH dev INTO main IN nessie;
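The Git analogy is precise: in this model a branch is just a named pointer to a commit, and each commit records which metadata file each table is at. A toy sketch of that pointer model (simplified; real Nessie merges do conflict detection, not only fast-forwards):

```python
# Sketch of Nessie-style branching: commits map table names to metadata
# locations, and branches are mutable refs to commits — like Git refs.
commits = {"c0": {"db.orders": "v1.metadata.json"}}
refs = {"main": "c0"}

# Creating branch dev from main copies the pointer, not any data:
refs["dev"] = refs["main"]

# A write on dev produces a new commit and moves only the dev ref:
commits["c1"] = {"db.orders": "v2.metadata.json"}
refs["dev"] = "c1"
assert commits[refs["main"]]["db.orders"] == "v1.metadata.json"  # main untouched

# Merging advances main to dev's commit:
refs["main"] = refs["dev"]
print(commits[refs["main"]]["db.orders"])  # v2.metadata.json
```

Branch creation is O(1) regardless of table size, which is what makes "dev branches of production data" practical.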

Iceberg vs Delta Lake vs Apache Hudi

Feature             | Iceberg       | Delta Lake         | Hudi
--------------------|---------------|--------------------|------------------
ACID                | ✅            | ✅                 | ✅
Time Travel         | ✅            | ✅                 | ✅
Schema Evolution    | ✅ Full       | ✅                 | ✅
Partition Evolution | ✅            | ❌                 | ❌
Hidden Partitioning | ✅            | ❌                 | ❌
Engine Support      | Broadest      | Databricks-centric | Spark/Flink focus
Governance          | Open (Apache) | Linux Foundation   | Apache

Iceberg’s partition evolution and hidden partitioning are its strongest differentiators. Delta Lake has the strongest Databricks integration. Hudi excels in streaming upsert workloads.

In 2026, Iceberg has the broadest multi-engine support and is the de facto standard for greenfield data platforms.


Real-World Architecture: Modern Data Platform

┌─────────────────────────────────────────────┐
│  Data Sources                                │
│  (Kafka, CDC, APIs, batch files)            │
└──────────────────────┬──────────────────────┘
                       │
            Apache Flink (streaming)
            Apache Spark (batch)
                       │ write
                       ▼
┌─────────────────────────────────────────────┐
│  Iceberg Lakehouse on S3                     │
│  ├── raw/        (landing zone)              │
│  ├── curated/    (cleaned, typed)            │
│  └── serving/    (aggregated marts)          │
│                                              │
│  Catalog: AWS Glue / Polaris                 │
└──────────────┬──────────────────────────────┘
               │ query
      ┌────────┼────────┐
      ▼        ▼        ▼
   Trino   BigQuery  DuckDB
   (adhoc) (BI tools) (local)

Conclusion

Apache Iceberg is the connective tissue of the modern data stack. It gives you warehouse-grade reliability (ACID, schema evolution, time travel) at lake economics (S3 storage, any query engine). The lakehouse isn’t a buzzword anymore — it’s running at petabyte scale in production.

If you’re building a new data platform in 2026 and you’re not using Iceberg (or Delta Lake/Hudi), you’re taking on technical debt from day one. The tools are mature, the ecosystem is broad, and the operational overhead is manageable.

Start with Iceberg + AWS Glue + Trino for ad-hoc queries. Add Spark for heavy ETL. Add Flink when you need streaming. Your data platform will thank you in five years.

