Skip to content

Apache Iceberg Explained: Architecture, Metadata, and Query Optimization

Apache Iceberg is an open-source, high-performance table format designed for massive analytical datasets stored in cloud or on-premises object storage. Originally developed at Netflix in 2017 to address the operational and performance limitations of legacy Hive-style tables, Iceberg was donated to the Apache Software Foundation and graduated to a top-level project in 2020. Today, it serves as a foundational pillar of modern data lakehouse architectures, providing transactional consistency, schema flexibility, and multi-engine compatibility at petabyte scale.

This comprehensive guide explores the inner workings of Apache Iceberg, detailing its metadata architecture, query optimization mechanisms, transactional model, and engine integration patterns. It contrasts Iceberg with traditional table structures and provides data engineers and platform architects with the technical insight needed to build scalable, high-performance data lakes.

The Evolution of Lakehouse Table Formats

To understand why Apache Iceberg is necessary, we must examine the history of data lakes and the architectural challenges that preceded open table formats.

The Legacy of the Hadoop and Hive Ecosystem

The early era of big data analytics was dominated by Apache Hadoop and Apache Hive. Hive defined a table structure on top of files stored in the Hadoop Distributed File System (HDFS). Under the Hive model, a table is mapped directly to a directory path, and partitions are mapped to nested subdirectories (for example, /user/hive/warehouse/orders/year=2024/month=05/).

While this directory-based abstraction enabled query engines to partition datasets and prune directories, it introduced critical limitations as data volumes grew and cloud object storage replaced HDFS.

Key Limitations of Hive-Style Tables

Apache Iceberg solves these problems by shifting the source of truth from the directory structure to a tree of immutable metadata files. In an Iceberg table, the files that belong to a table are explicitly tracked in metadata, allowing query engines to plan queries without listing directories, perform atomic commits, and evolve schemas safely.

The Core Architecture: Three Layers of Metadata

The architecture of an Apache Iceberg table is divided into three distinct layers: the Catalog Layer, the Metadata Layer, and the Data Layer. This hierarchical design separates table tracking from file management, enabling optimistic concurrency and consistent read isolation.

graph TD subgraph Catalog_Layer ["Catalog Layer"] CAT["Catalog (e.g., Polaris, REST, Glue)"] end subgraph Metadata_Layer ["Metadata Layer"] MET["Table Metadata JSON (schema, partitions, snapshot list)"] MLIST["Manifest List (snapshot files & statistics)"] MAN1["Manifest File (data file paths & column stats)"] MAN2["Manifest File (data file paths & column stats)"] end subgraph Data_Layer ["Data Layer"] DF1["Data File (Parquet/ORC/Avro)"] DF2["Data File (Parquet/ORC/Avro)"] DF3["Data File (Parquet/ORC/Avro)"] end CAT -->|points to| MET MET -->|contains| MLIST MLIST -->|references| MAN1 MLIST -->|references| MAN2 MAN1 -->|points to| DF1 MAN1 -->|points to| DF2 MAN2 -->|points to| DF3

1. The Catalog Layer

The Catalog Layer is the entry point for any operation on an Iceberg table. The primary responsibility of the catalog is to maintain a single, atomic pointer to the current table-metadata.json file.

When an engine reads a table, it requests the current metadata location from the catalog. When an engine commits a write, it requests an atomic swap of the pointer, replacing the old metadata file path with the new metadata file path.

Iceberg supports multiple catalog implementations, including standard REST catalogs, Apache Polaris, Project Nessie, AWS Glue Catalog, Hive Metastore, and Snowflake Open Catalog. The REST catalog specification has emerged as the industry standard, providing an engine-neutral HTTP API for metadata management.

The REST Catalog Specification and API Endpoints

The Apache Iceberg REST Catalog specification defines a standard set of HTTP endpoints that allow compute engines to interact with a catalog service. By standardizing this interface, Iceberg enables complete engine-neutral interoperability, meaning any engine that implements the REST client can read from or write to any catalog that implements the REST server.

When an engine interacts with a REST catalog, it follows a standard protocol consisting of several key endpoints:

2. The Metadata Layer

The Metadata Layer manages the structural definition and historical states of the table. It is composed of three file types:

3. The Data Layer

The Data Layer contains the physical files stored in object storage. This includes:

The Metadata Specification: JSON and Avro Internals

To understand how Apache Iceberg coordinates query planning and transactional operations, we must inspect the internal fields of the metadata files. By analyzing the structural definitions of these files, data architects can better understand how Iceberg manages state transitions and optimizes query pruning.

The Table Metadata JSON Fields

The table metadata JSON file acts as the source of truth for the table configuration. Each commit creates a new metadata JSON file. The key fields in this file include:

The Manifest List Avro Schema

The manifest list file represents a single snapshot. Because engines must evaluate this file quickly during query planning, it is stored in Avro, which provides efficient serialization. Each entry in a manifest list file contains:

The Manifest File Avro Schema

Manifest files track the individual physical data files and delete files. Each manifest entry records:

Comparison: Hive-Style vs. Apache Iceberg

The physical layout and tracking mechanisms of Apache Iceberg represent a fundamental departure from legacy directory-based architectures. The table below highlights the core differences between the two models.

Architectural Feature Hive-Style Tables (Legacy Data Lake) Apache Iceberg Tables (Modern Lakehouse)
Table State Tracking Physical directory scans and listings in object storage. Explicit tracking via a structured tree of metadata files.
ACID Guarantees No native transaction support; prone to dirty reads and write conflicts. Full ACID compliance via Optimistic Concurrency Control (OCC).
Metadata Storage Location Central relational database (SQL backing the Hive Metastore). Distributed metadata files located alongside data files in object storage.
Partitioning Model Physical folder hierarchies; requires query author awareness. Hidden partitioning; transforms are applied automatically in metadata.
Schema Evolution Brittle column indexing; reordering or dropping columns causes data corruption. Safe evolution via stable column IDs; metadata-only operations.
Time Travel & Rollback Not supported; historic states are lost when directories are modified. Supported natively; query historic snapshots via metadata pointers.
File-Level Statistics Basic file lists; requires scanning file headers during execution. Advanced column-level stats (min/max, nulls) stored in manifest files.

How a Transactional Write Commit Works

Apache Iceberg guarantees ACID transactions by implementing Optimistic Concurrency Control (OCC). Instead of lock-based locking, which restricts concurrency, Iceberg assumes that write conflicts are rare and resolves them during the commit phase.

The following sequence diagram illustrates the lifecycle of a write commit operation, from loading the table schema to executing the catalog pointer swap.

sequenceDiagram participant E as Compute Engine (Spark/Dremio) participant C as Catalog (e.g., REST/Polaris) participant S as Object Storage (S3/ADLS) E->>C: Request Current Metadata Location C-->>E: Return current table-metadata.json (v1) E->>E: Plan Write (read schema and partition spec) E->>S: Write physical Data Files (Parquet) E->>S: Write new Manifest Files (Avro) E->>S: Write new Manifest List (Avro) representing new snapshot E->>S: Write new table-metadata.json (v2) E->>C: Commit Swap Pointer (v1 to v2) alt Pointer unchanged (no concurrent commits) C->>C: Atomically update pointer to v2 C-->>E: Commit Success else Pointer changed (concurrent commit occurred) C-->>E: Commit Conflict (409 Conflict) E->>E: Retry: Load new metadata, check for overlaps, re-write metadata end

Detailed Step-by-Step Commit Flow

  1. Initialization: The compute engine (such as Apache Spark, Apache Flink, or Dremio) queries the catalog to get the path of the current table-metadata.json file (version 1).
  2. Write Execution: The engine writes the new data files to the storage layer. These files are typically stored in folders organized by partition spec, but the physical location does not dictate table membership.
  3. Manifest Creation: The engine writes manifest files that point to the newly created data files, detailing their column-level statistics and formatting details.
  4. Manifest List Creation: The engine writes a manifest list file that represents a new snapshot. This manifest list merges the new manifest files with existing manifests from the previous snapshot.
  5. Metadata Creation: The engine writes a new table-metadata.json file (version 2) that registers the new manifest list and marks the new snapshot as the current state of the table.
  6. Catalog Swap: The engine attempts to commit by requesting the catalog to swap the metadata pointer from version 1 to version 2. The catalog performs this swap atomically. If another writer committed a change in the meantime, the catalog swap fails, triggering a conflict resolution routine.
  7. Conflict Resolution (Retry): The engine reads the new table metadata file, checks if the concurrent commit overlaps with the files it wrote (for example, writing to the same partition), and if no conflict exists, regenerates the manifest list and retries the catalog swap.

Deep Dive: Catalog Implementations and Pointer Swap Mechanics

The pointer swap is the core mechanism of transactional security in an open data lakehouse. Because object stores like Amazon S3 do not natively support atomic file renaming, the catalog must handle the serialization of metadata swaps. Let us examine how the primary catalog types achieve this:

Project Nessie and Git-like Branching Internals

Project Nessie is a transactional catalog for Apache Iceberg that brings Git-like version control to data lakehouses. It allows developers to create branches, merge changes, and tag specific states of the entire catalog, enabling multi-table consistency and isolation.

Nessie achieves this by storing catalog state as a directed acyclic graph of commit objects, similar to Git. Let us examine the internal components of Nessie's commit database:

Query Optimization: Hidden Partitioning and Partition Evolution

Partitioning is critical for analytical query performance because it allows the engine to skip scanning irrelevant data files. However, traditional partitioning introduces operational complexity and leads to poorly optimized queries when analysts do not filter on physical partition columns.

Hidden Partitioning Mechanics

Apache Iceberg introduces hidden partitioning to decouple physical partitioning from logical query structure. When you create an Iceberg table, you define partition transformations on existing columns rather than creating new columns.

For example, if you partition a table by days(event_time), Iceberg does not add a date column to the table schema. Instead, it tracks the relationship between the event_time timestamp and the partition values in the metadata.

When an analyst queries the table using a filter on the original column:

SELECT * FROM analytics.events
WHERE event_time >= '2026-05-22 00:00:00'
  AND event_time < '2026-05-23 00:00:00';

The query planner reads the partition transformations from the table metadata, identifies that the query targets a specific day, and automatically applies partition pruning. The analyst does not need to know how the table is partitioned, and queries remain optimal regardless of user behavior.

Partition Transforms

Iceberg supports multiple partition transforms out of the box, allowing for flexible physical layouts:

Partition Evolution

As tables grow, partition schemes that were optimal in the past may become inefficient. For example, a table that started with monthly partitioning may grow to the point where daily partitioning is required to keep file sizes manageable.

In Hive, changing the partition scheme requires creating a new table, writing a migration query, and rewriting all historical data. In Apache Iceberg, partition schemes can evolve in place using a metadata-only operation.

/* Evolve the partition scheme from monthly to daily */
ALTER TABLE analytics.events
  REPLACE PARTITION FIELD months(event_time)
  WITH days(event_time);

When this command is run, Iceberg updates the table metadata with a new partition specification. Historical data files remain partitioned by month, and new data files are written partitioned by day.

The query planner uses both partition specs during execution. When a query is planned, Iceberg splits the planning phase: applying the monthly spec to filter old files, and the daily spec to filter new files. This evolution occurs without downtime or historical data rewrites.

Schema Evolution: Safely Evolving Table Structures

In legacy systems, schema changes like renaming, dropping, or reordering columns are risky operations. In Apache Iceberg, schema evolution is a first-class, metadata-only operation that guarantees safety and consistency.

Iceberg achieves this by assigning a stable, unique numeric field ID to every column in the table schema. When a table is written, the field IDs are stored inside the physical Parquet data files alongside the data values.

graph LR subgraph Table_Schema ["Logical Schema in metadata.json"] S1["Field ID 1: order_id (BIGINT)"] S2["Field ID 2: customer_name (STRING) - Renamed from name"] S3["Field ID 3: total (DOUBLE)"] S4["Field ID 4: region (STRING) - Added later"] end subgraph Parquet_File ["Physical Parquet File (Written before Schema Changes)"] F1["field_id=1: 10045"] F2["field_id=2: 'Alice'"] F3["field_id=3: 150.00"] end Table_Schema -.->|resolves fields by ID| Parquet_File

How Column IDs Prevent Corruption

Row-Level Updates: Copy-on-Write vs. Merge-on-Read

Data lakes are primarily read-heavy systems, but modern workloads require row-level mutation capabilities to handle Change Data Capture (CDC), data cleansing, and regulatory deletions (such as GDPR right-to-be-forgotten requests). Apache Iceberg supports two row-level update strategies: Copy-on-Write and Merge-on-Read.

1. Copy-on-Write (CoW)

In Copy-on-Write mode, any update or delete operation rewrites the entire data file containing the affected rows.

2. Merge-on-Read (MoR)

In Merge-on-Read mode, update and delete operations do not rewrite existing data files. Instead, they write separate delete files that record which rows have been modified.

Query Engine Integration: The Role of Dremio

One of the main strengths of Apache Iceberg is its open, engine-neutral design. Rather than locking users into a specific compute stack, Iceberg allows different engines to read and write to the same tables concurrently.

Dremio: High-Performance Data Lakehouse Engine

Dremio is a unified query engine built specifically to deliver sub-second SQL performance on open lakehouse formats like Apache Iceberg. Dremio bypasses the limitations of traditional query planning through several architectural optimizations:

Dremio Sabot and Calcite Optimization Details

Dremio implements a distributed SQL execution engine named Sabot that utilizes Apache Calcite for logical and physical query optimization. When a query is submitted to Dremio, the planner uses Calcite to parse the SQL statement and construct an abstract syntax tree. The Calcite optimizer then applies a series of relational algebra transformation rules to optimize the query structure.

For Apache Iceberg tables, this planning phase is highly optimized:

Other Core Engines in the Ecosystem

Storage Reclamation and Table Maintenance

Although Apache Iceberg optimizes performance and provides ACID transactions, the lifecycle of a table introduces physical fragmentation. Streaming ingestion, high-concurrency writes, and recurring updates create small files and redundant metadata. To maintain optimal query latencies and control storage costs, administrators must implement three core maintenance operations.

1. Compaction: Resolving the Small Files Problem

Streaming jobs write data in frequent intervals, creating thousands of small Parquet files. This introduces the small files problem: query engines must open and close thousands of files to read a small volume of records, which increases query execution time.

Compaction solves this problem by merging small files into larger, optimized files. Iceberg supports two main compaction strategies:

2. Snapshot Expiration: Controlling Storage Growth

Every transaction on an Iceberg table creates a new snapshot, preserving old data files to allow time-travel queries. If snapshots are never removed, storage consumption grows indefinitely.

Running snapshot expiration jobs removes historical snapshot references from the metadata JSON. The expiration job checks which data and manifest files are no longer referenced by any active snapshot and deletes them from object storage, reclaiming storage space.

3. Orphan File Cleanup: Deleting Unreferenced Bytes

Sometimes, compute engines crash during a write transaction. While the write fails to commit to the catalog, the physical data files may have already been written to S3 or object storage. These files are orphan files: they exist on disk but are not referenced by any metadata.

An orphan file cleanup job compares the physical files stored in the object store path with the list of files registered across all table metadata snapshots. Any physical file that does not appear in the metadata is deleted, keeping storage clean and reducing cloud costs.

When Not to Use Apache Iceberg

While Apache Iceberg is an excellent default choice for enterprise data lakehouses, it is not a universal solution for every workload. Architects should avoid Iceberg in the following scenarios:

Frequently Asked Questions (FAQ)

How does Iceberg handle concurrent writes?

Iceberg uses Optimistic Concurrency Control (OCC). Both writers perform their operations independently. At the commit phase, the catalog validates whether the snapshot pointer has changed. If another writer committed first, the second writer retries the operation by merging the changes and attempting another commit.

What is the difference between a table format and a file format?

A file format (like Parquet or ORC) defines how data values are physically laid out and compressed inside a single file. A table format (like Iceberg) defines how multiple files are grouped, structured, and managed together to represent a single logical database table.

Can I use multiple query engines on the same Iceberg table at the same time?

Yes. Because the catalog acts as the single source of truth and manages commits atomically, you can read the table with Trino or Dremio while Spark is performing writes. Readers will continue to see the old snapshot until the write transaction commits.

How does partition evolution affect historical queries?

It does not. When you evolve a partition spec, old files remain in their original layout. Iceberg keeps track of historical partition specs in the table metadata, and the query planner automatically generates split plans to prune files written under different specifications.

How does the catalog ensure credential vending and data storage isolation?

Modern enterprise catalogs like Apache Polaris implement credential vending. When a compute engine requests access to a table, the catalog verifies the engine's RBAC permissions. If authorized, the catalog requests short-lived, scoped security tokens (such as temporary AWS IAM credentials) from the cloud provider. The catalog vends these credentials to the engine, allowing it to read and write directly to the S3 bucket path containing the table's data and metadata files. This design ensures that compute engines never require direct, permanent read and write access to the entire storage bucket, enforcing storage isolation and enhancing data lakehouse security.

Does Iceberg support Git-like operations?

Yes, Iceberg support for branching and tagging is built directly into the metadata model. You can create named branches for write-audit-publish patterns, or tags to bookmark historical snapshots for machine learning model reproducibility.

๐Ÿ“š Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.