Apache Iceberg Explained: Architecture, Metadata, and Query Optimization
Apache Iceberg is an open-source, high-performance table format designed for massive analytical datasets stored in cloud or on-premises object storage. Originally developed at Netflix in 2017 to address the operational and performance limitations of legacy Hive-style tables, Iceberg was donated to the Apache Software Foundation and graduated to a top-level project in 2020. Today, it serves as a foundational pillar of modern data lakehouse architectures, providing transactional consistency, schema flexibility, and multi-engine compatibility at petabyte scale.
This comprehensive guide explores the inner workings of Apache Iceberg, detailing its metadata architecture, query optimization mechanisms, transactional model, and engine integration patterns. It contrasts Iceberg with traditional table structures and provides data engineers and platform architects with the technical insight needed to build scalable, high-performance data lakes.
The Evolution of Lakehouse Table Formats
To understand why Apache Iceberg is necessary, we must examine the history of data lakes and the architectural challenges that preceded open table formats.
The Legacy of the Hadoop and Hive Ecosystem
The early era of big data analytics was dominated by Apache Hadoop and
Apache Hive. Hive defined a table structure on top of files stored in the
Hadoop Distributed File System (HDFS). Under the Hive model, a table is
mapped directly to a directory path, and partitions are mapped to nested
subdirectories (for example, /user/hive/warehouse/orders/year=2024/month=05/).
While this directory-based abstraction enabled query engines to partition datasets and prune directories, it introduced critical limitations as data volumes grew and cloud object storage replaced HDFS.
Key Limitations of Hive-Style Tables
- No Atomic Transactions: Hive tables track state at the directory level. When a writer adds data, it writes new files directly into the partition folders. If a query reads the table while a write operation is active, the query planner will see a partial list of files, leading to dirty reads and inconsistent results. There is no native mechanism to isolate reads from concurrent writes.
- Expensive Directory Listings: To plan a query, the query engine must perform a directory listing operation across all partition folders to identify which data files exist. On cloud object storage systems like Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS), directory listing is a slow, expensive operation that requires multiple HTTP requests. For tables with millions of files, directory listing can take minutes, severely bottlenecking query planning.
- Rigid Partition Schemes: In Hive, partitioning is physical.
If a table is partitioned by date, query authors must explicitly include
the partition column in their
WHEREclauses. If an analyst queries a timestamp column directly without referencing the physical partition column, the query engine is forced to scan the entire table, resulting in wasted compute and high costs. Furthermore, changing a partition scheme (for example, transitioning from monthly to daily partitioning) requires rewriting the entire table. - Brittle Schema Evolution: Hive schema changes rely on matching column names to index positions in files. Dropping or reordering columns often leads to silent data corruption, as the reader maps existing physical data fields to the wrong logical schema positions.
Apache Iceberg solves these problems by shifting the source of truth from the directory structure to a tree of immutable metadata files. In an Iceberg table, the files that belong to a table are explicitly tracked in metadata, allowing query engines to plan queries without listing directories, perform atomic commits, and evolve schemas safely.
The Core Architecture: Three Layers of Metadata
The architecture of an Apache Iceberg table is divided into three distinct layers: the Catalog Layer, the Metadata Layer, and the Data Layer. This hierarchical design separates table tracking from file management, enabling optimistic concurrency and consistent read isolation.
1. The Catalog Layer
The Catalog Layer is the entry point for any operation on an Iceberg
table. The primary responsibility of the catalog is to maintain a single,
atomic pointer to the current table-metadata.json file.
When an engine reads a table, it requests the current metadata location from the catalog. When an engine commits a write, it requests an atomic swap of the pointer, replacing the old metadata file path with the new metadata file path.
Iceberg supports multiple catalog implementations, including standard REST catalogs, Apache Polaris, Project Nessie, AWS Glue Catalog, Hive Metastore, and Snowflake Open Catalog. The REST catalog specification has emerged as the industry standard, providing an engine-neutral HTTP API for metadata management.
The REST Catalog Specification and API Endpoints
The Apache Iceberg REST Catalog specification defines a standard set of HTTP endpoints that allow compute engines to interact with a catalog service. By standardizing this interface, Iceberg enables complete engine-neutral interoperability, meaning any engine that implements the REST client can read from or write to any catalog that implements the REST server.
When an engine interacts with a REST catalog, it follows a standard protocol consisting of several key endpoints:
- Get Config (
GET /v1/config): Before connecting to the catalog, the engine sends a configuration request. The REST server responds with catalog properties, storage credentials, and token configurations, bootstrap-connecting the engine to the catalog environment. - List Namespaces (
GET /v1/namespaces): This endpoint allows the engine to discover the logical database namespaces available in the catalog. Namespaces can be nested, representing hierarchical folder-like groupings (for example,analytics.sales.regional). - Load Table (
GET /v1/namespaces/{namespace}/tables/{table}): When a query engine plans a query, it calls the Load Table endpoint. The catalog returns the schema, partition specifications, sort orders, and, most importantly, the exact location of the currenttable-metadata.jsonfile. The query engine then accesses the storage layer directly to read the metadata and data files. - Commit Table (
POST /v1/namespaces/{namespace}/tables/{table}): When an engine commits a transaction, it sends a commit request containing table updates, such as additions or removals of snapshots. The request payload includes the metadata file path that the engine expects to swap from (the current state) and the path it wants to swap to. The catalog performs an atomic validation and pointer swap. If successful, it updates the pointer and returns a 200 OK status. If another engine committed a write concurrently, the catalog rejects the request with a 409 Conflict, forcing the client to reload the metadata and retry.
2. The Metadata Layer
The Metadata Layer manages the structural definition and historical states of the table. It is composed of three file types:
- Table Metadata File (JSON): This file contains the table's current schema, partition specification, properties, and a list of all active snapshots. Every commit creates a new table metadata JSON file. It stores the table's history by chaining snapshots together, allowing engines to look up historical states for time travel queries.
- Manifest List File (Avro): A manifest list represents a single snapshot of the table. It contains a list of manifest files that make up that snapshot. The manifest list file stores critical optimization data: including the partition specs used to write each manifest, partition min/max summaries, and snapshot statistics. This enables query engines to skip entire manifest files during query planning if their partition summaries do not match the query filters.
- Manifest File (Avro): Manifest files track the actual data and delete files. Each manifest file contains a list of file paths, formatting details, and row counts, along with detailed column-level statistics (min/max values, null counts, and NaN counts). These statistics enable fine-grained file-level pruning, allowing the query engine to identify precisely which files contain the requested data before opening them.
3. The Data Layer
The Data Layer contains the physical files stored in object storage. This includes:
- Data Files: Stored in open columnar formats like Apache Parquet, Apache ORC, or Apache Avro.
- Delete Files: Used in Merge-on-Read configurations to record deleted or updated rows. These files track either the specific positions of deleted rows (Position Deletes) or the values of deleted fields (Equality Deletes).
The Metadata Specification: JSON and Avro Internals
To understand how Apache Iceberg coordinates query planning and transactional operations, we must inspect the internal fields of the metadata files. By analyzing the structural definitions of these files, data architects can better understand how Iceberg manages state transitions and optimizes query pruning.
The Table Metadata JSON Fields
The table metadata JSON file acts as the source of truth for the table configuration. Each commit creates a new metadata JSON file. The key fields in this file include:
-
format-version: An integer (1 or 2) that defines the table specification version. Version 2 tables add support for row-level deletes, introducing position and equality delete file formats to support Merge-on-Read patterns. -
table-uuid: A globally unique identifier generated when the table is created, ensuring that the table remains identifiable even if it is renamed or moved across catalogs. -
location: The base URI of the table directory, defining where data files and metadata files are written by default. -
last-sequence-number: A long integer tracking the commit sequence of the table, used in Version 2 tables to resolve updates and deletes. -
last-updated-ms: A timestamp in milliseconds recording when the metadata file was written. -
properties: A map of configuration options, such as the default file format (parquet, orc, or avro) and metadata file retention limits. -
schemas: A list of all historical schema versions. Each schema is assigned a unique integer ID, allowing engines to read older Parquet files written under previous schemas without mapping columns by name. -
partition-specs: A list of all partition specifications. Like schemas, partition specs are versioned by ID to enable partition evolution in place. -
snapshots: A list of snapshots representing historical states of the table. Each snapshot object records the snapshot ID, parent snapshot ID, creation time, the path to the manifest list file, and a summary map describing the operation (such as append, overwrite, or delete) and row-level statistics.
The Manifest List Avro Schema
The manifest list file represents a single snapshot. Because engines must evaluate this file quickly during query planning, it is stored in Avro, which provides efficient serialization. Each entry in a manifest list file contains:
-
manifest_path: The absolute string path to the manifest Avro file. -
manifest_length: The long integer size of the manifest file in bytes, allowing the engine to plan parallel read operations. -
partition_spec_id: The integer ID of the partition spec used to write this manifest file. -
added_snapshot_id: The snapshot ID that created this manifest entry. -
added_files_count,existing_files_count,deleted_files_count: Integers tracking file modifications, allowing engines to evaluate table changes without scanning the manifests. -
partitions: A list of partition field summaries. For each partitioned column, the manifest list records whether the partition contains null values, whether it contains NaN values, and the lower and upper bounds of the partition values (stored as binary representations).
The Manifest File Avro Schema
Manifest files track the individual physical data files and delete files. Each manifest entry records:
-
status: An integer defining the state of the data file (0 for existing, 1 for added, and 2 for deleted). -
snapshot_id: The long integer snapshot ID in which the file was added to the table. -
data_file: A struct containing details about the physical file:-
file_path: The URI of the physical file (such as Parquet or ORC) stored in the object store. -
file_format: The format string (PARQUET, ORC, or AVRO). -
partition: The struct of partition values indicating which partition folder this file belongs to. -
record_count: The total number of rows in the data file. file_size_in_bytes: The physical file size.-
column_sizes: A map of column field IDs to their physical size in bytes. -
value_counts: A map of column field IDs to the count of values in that column. -
null_value_counts: A map of column field IDs to the count of null values. -
lower_boundsandupper_bounds: Maps of column field IDs to their minimum and maximum values, enabling file-level query pruning.
-
Comparison: Hive-Style vs. Apache Iceberg
The physical layout and tracking mechanisms of Apache Iceberg represent a fundamental departure from legacy directory-based architectures. The table below highlights the core differences between the two models.
| Architectural Feature | Hive-Style Tables (Legacy Data Lake) | Apache Iceberg Tables (Modern Lakehouse) |
|---|---|---|
| Table State Tracking | Physical directory scans and listings in object storage. | Explicit tracking via a structured tree of metadata files. |
| ACID Guarantees | No native transaction support; prone to dirty reads and write conflicts. | Full ACID compliance via Optimistic Concurrency Control (OCC). |
| Metadata Storage Location | Central relational database (SQL backing the Hive Metastore). | Distributed metadata files located alongside data files in object storage. |
| Partitioning Model | Physical folder hierarchies; requires query author awareness. | Hidden partitioning; transforms are applied automatically in metadata. |
| Schema Evolution | Brittle column indexing; reordering or dropping columns causes data corruption. | Safe evolution via stable column IDs; metadata-only operations. |
| Time Travel & Rollback | Not supported; historic states are lost when directories are modified. | Supported natively; query historic snapshots via metadata pointers. |
| File-Level Statistics | Basic file lists; requires scanning file headers during execution. | Advanced column-level stats (min/max, nulls) stored in manifest files. |
How a Transactional Write Commit Works
Apache Iceberg guarantees ACID transactions by implementing Optimistic Concurrency Control (OCC). Instead of lock-based locking, which restricts concurrency, Iceberg assumes that write conflicts are rare and resolves them during the commit phase.
The following sequence diagram illustrates the lifecycle of a write commit operation, from loading the table schema to executing the catalog pointer swap.
Detailed Step-by-Step Commit Flow
- Initialization: The compute engine (such as Apache Spark,
Apache Flink, or Dremio) queries the catalog to get the path of the current
table-metadata.jsonfile (version 1). - Write Execution: The engine writes the new data files to the storage layer. These files are typically stored in folders organized by partition spec, but the physical location does not dictate table membership.
- Manifest Creation: The engine writes manifest files that point to the newly created data files, detailing their column-level statistics and formatting details.
- Manifest List Creation: The engine writes a manifest list file that represents a new snapshot. This manifest list merges the new manifest files with existing manifests from the previous snapshot.
- Metadata Creation: The engine writes a new
table-metadata.jsonfile (version 2) that registers the new manifest list and marks the new snapshot as the current state of the table. - Catalog Swap: The engine attempts to commit by requesting the catalog to swap the metadata pointer from version 1 to version 2. The catalog performs this swap atomically. If another writer committed a change in the meantime, the catalog swap fails, triggering a conflict resolution routine.
- Conflict Resolution (Retry): The engine reads the new table metadata file, checks if the concurrent commit overlaps with the files it wrote (for example, writing to the same partition), and if no conflict exists, regenerates the manifest list and retries the catalog swap.
Deep Dive: Catalog Implementations and Pointer Swap Mechanics
The pointer swap is the core mechanism of transactional security in an open data lakehouse. Because object stores like Amazon S3 do not natively support atomic file renaming, the catalog must handle the serialization of metadata swaps. Let us examine how the primary catalog types achieve this:
- REST Catalog: This is the modern, vendor-neutral standard catalog interface. Compute engines make standardized HTTP requests to a REST catalog service. The service itself coordinates the pointer swap using its own backend, which is typically a relational database like PostgreSQL or MySQL. This relational backend uses standard database transactions (such as using transactional locks or SELECT FOR UPDATE queries) to update the current metadata URI location.
- Apache Polaris: An open-source REST catalog designed for multi-engine governance. Polaris stores catalog state in a persistence store (such as a relational database) and provides role-based access control. Pointer swaps are validated through the Polaris console or APIs, ensuring that only authenticated compute engines with table write privileges can swap metadata pointers.
- AWS Glue Catalog: AWS Glue functions as a catalog by storing table metadata within a managed metastore. Commits are performed using the Glue API. The compute engine makes an update table call, passing the expected version check parameter. Glue implements optimistic locking at the database catalog layer to ensure that if a concurrent commit modified the table metadata during the transaction, the Glue service returns a validation error, forcing the engine to retry.
- Project Nessie: Nessie implements a Git-like branching catalog model. Instead of a single pointer swap, Nessie commits are represented in a hash tree database. The database records commits as hashes, and branches point to specific commit hashes. Committing to a table in Nessie is similar to a Git commit on a branch. The pointer swap is executed using a database commit that updates the branch reference (such as updating the main branch pointer to the new commit hash). If a concurrent commit occurred, Nessie requires the client to rebase its branch commit, resolving conflict overlays automatically.
- Hive Metastore (HMS): The legacy catalog option. HMS stores table metadata locations in a relational database. Pointer updates are executed by running standard SQL updates against the Metastore database. Because HMS was designed for file system directory mappings rather than metadata trees, concurrent updates can lead to locks or write conflicts under high execution concurrency.
Project Nessie and Git-like Branching Internals
Project Nessie is a transactional catalog for Apache Iceberg that brings Git-like version control to data lakehouses. It allows developers to create branches, merge changes, and tag specific states of the entire catalog, enabling multi-table consistency and isolation.
Nessie achieves this by storing catalog state as a directed acyclic graph of commit objects, similar to Git. Let us examine the internal components of Nessie's commit database:
- Commit Objects: Each commit in Nessie represents a catalog-wide
transaction. A commit contains a unique commit hash (a cryptographic SHA-256
hash), a parent commit reference, the author, a timestamp, and a set of key-value
operations. The keys are table identifiers (such as
analytics.orders), and the values are pointers to the specifictable-metadata.jsonfiles for those tables at that moment. - Branches and Named References: Branches in Nessie are mutable pointers to specific commit hashes. By default, the main branch tracks the production state of the catalog. When an ETL pipeline creates a staging branch, Nessie creates a new named reference pointing to the same commit hash as main. As the pipeline writes data, Nessie appends new commit objects to the staging branch, updating its branch pointer while leaving the main branch pointer unchanged.
- Atomic Merging and Conflict Resolution: When the ETL pipeline completes, it requests Nessie to merge the staging branch back into main. Nessie performs a catalog-level validation to ensure that no conflicting commits occurred on the main branch while staging was active. If there are no conflicts, Nessie advances the main branch pointer to the latest commit hash on the staging branch, publishing all table updates atomically in a single catalog operation.
Query Optimization: Hidden Partitioning and Partition Evolution
Partitioning is critical for analytical query performance because it allows the engine to skip scanning irrelevant data files. However, traditional partitioning introduces operational complexity and leads to poorly optimized queries when analysts do not filter on physical partition columns.
Hidden Partitioning Mechanics
Apache Iceberg introduces hidden partitioning to decouple physical partitioning from logical query structure. When you create an Iceberg table, you define partition transformations on existing columns rather than creating new columns.
For example, if you partition a table by days(event_time),
Iceberg does not add a date column to the table schema. Instead,
it tracks the relationship between the event_time timestamp and
the partition values in the metadata.
When an analyst queries the table using a filter on the original column:
SELECT * FROM analytics.events
WHERE event_time >= '2026-05-22 00:00:00'
AND event_time < '2026-05-23 00:00:00'; The query planner reads the partition transformations from the table metadata, identifies that the query targets a specific day, and automatically applies partition pruning. The analyst does not need to know how the table is partitioned, and queries remain optimal regardless of user behavior.
Partition Transforms
Iceberg supports multiple partition transforms out of the box, allowing for flexible physical layouts:
- Identity: Partitions data using the raw values of a column
(suitable for low-cardinality categories like
country). - Time Transforms (year, month, day, hour): Extracts temporal ranges from timestamp or date columns.
- Bucket: Hashes a column's values into a fixed number of
buckets (useful for distributing high-cardinality keys like
user_idacross a uniform set of files). - Truncate: Truncates string or numeric columns to a specific width (for example, grouping text search data by prefix).
Partition Evolution
As tables grow, partition schemes that were optimal in the past may become inefficient. For example, a table that started with monthly partitioning may grow to the point where daily partitioning is required to keep file sizes manageable.
In Hive, changing the partition scheme requires creating a new table, writing a migration query, and rewriting all historical data. In Apache Iceberg, partition schemes can evolve in place using a metadata-only operation.
/* Evolve the partition scheme from monthly to daily */
ALTER TABLE analytics.events
REPLACE PARTITION FIELD months(event_time)
WITH days(event_time); When this command is run, Iceberg updates the table metadata with a new partition specification. Historical data files remain partitioned by month, and new data files are written partitioned by day.
The query planner uses both partition specs during execution. When a query is planned, Iceberg splits the planning phase: applying the monthly spec to filter old files, and the daily spec to filter new files. This evolution occurs without downtime or historical data rewrites.
Schema Evolution: Safely Evolving Table Structures
In legacy systems, schema changes like renaming, dropping, or reordering columns are risky operations. In Apache Iceberg, schema evolution is a first-class, metadata-only operation that guarantees safety and consistency.
Iceberg achieves this by assigning a stable, unique numeric field ID to every column in the table schema. When a table is written, the field IDs are stored inside the physical Parquet data files alongside the data values.
How Column IDs Prevent Corruption
- Column Renames: When you rename a column, Iceberg updates
the name associated with the field ID in the
table-metadata.jsonfile. The physical data files are not modified. When a reader processes an old file, it maps the physical data labeled with the field ID to the new column name, avoiding any reliance on column string matches. - Column Additions: Adding a column generates a new field
ID in the schema. When the engine reads historical files written before the
addition, it notices that the new field ID is missing from the file metadata
and automatically returns
NULLvalues for that column. - Column Deletions: Dropping a column retires its field ID. The physical data remains in the old files, but the engine ignores it during read operations because the field ID is no longer present in the active table schema.
Row-Level Updates: Copy-on-Write vs. Merge-on-Read
Data lakes are primarily read-heavy systems, but modern workloads require row-level mutation capabilities to handle Change Data Capture (CDC), data cleansing, and regulatory deletions (such as GDPR right-to-be-forgotten requests). Apache Iceberg supports two row-level update strategies: Copy-on-Write and Merge-on-Read.
1. Copy-on-Write (CoW)
In Copy-on-Write mode, any update or delete operation rewrites the entire data file containing the affected rows.
- Write Path: When a row is modified, the engine reads the existing data file, applies the changes, and writes out a brand-new Parquet data file. The table metadata is updated to point to the new data file and exclude the old one.
- Read Path: Readers access the table without any join overhead. Files are read directly as standard columnar structures.
- Trade-off: High write amplification but optimal read performance. CoW is best suited for tables with frequent reads but infrequent, bulk updates.
2. Merge-on-Read (MoR)
In Merge-on-Read mode, update and delete operations do not rewrite existing data files. Instead, they write separate delete files that record which rows have been modified.
- Write Path: When a row is deleted or updated, the engine writes a new file containing the deleted row information (either its file path and position, or its primary key value) and commits it to metadata.
- Read Path: When a query engine reads the table, it must read the base data files and perform a join with the delete files to filter out modified rows on the fly.
- Trade-off: Low write latency and low write amplification, but higher read overhead due to real-time joins. MoR is ideal for streaming pipelines and high-frequency CDC ingestion.
Query Engine Integration: The Role of Dremio
One of the main strengths of Apache Iceberg is its open, engine-neutral design. Rather than locking users into a specific compute stack, Iceberg allows different engines to read and write to the same tables concurrently.
Dremio: High-Performance Data Lakehouse Engine
Dremio is a unified query engine built specifically to deliver sub-second SQL performance on open lakehouse formats like Apache Iceberg. Dremio bypasses the limitations of traditional query planning through several architectural optimizations:
- Apache Arrow Execution: Dremio executes queries using vectorization patterns built on Apache Arrow, an open-source columnar memory format. Dremio's execution engine (Sabot) processes query tasks vectorially, executing operations directly in memory. This eliminates the JVM garbage collection pauses and CPU cycles spent on data serialization and deserialization, maximizing query speed.
- Dremio Data Reflections: Reflections are pre-computed materializations of Iceberg tables that are stored back in the lakehouse as Iceberg tables. Dremio's query planner uses the Apache Calcite optimization framework to automatically rewrite query plans to use these reflections. When a query requires aggregations or joins, the planner intercepts the request, routes execution to read from the cached reflection, and accelerates the output without requiring users to alter their SQL code.
- Unified Semantic Layer: Dremio provides a logical, governed view of the data lake, mapping physical Iceberg tables to virtual spaces. This layer supports row-level and column-level access control, ensuring secure data access for BI tools and data analysts.
- Polaris Catalog Integration: Dremio integrates natively with Apache Polaris, acting as an optimized compute provider that reads Polaris-governed catalogs with strict RBAC validation.
- Metadata Caching: Dremio caches the Iceberg metadata (manifest list files and manifests) within its local catalog. During query planning, the planner queries the local metadata cache rather than scanning S3 or object storage, reducing planning time to milliseconds. Dremio monitors catalog updates and automatically refreshes its cache when changes commit.
- Optimizing Merge-on-Read Queries: Joining delete files with base files in Merge-on-Read tables is a resource-intensive operation. Dremio accelerates these queries by caching positional deletes in memory, performing the filtering checks vectorially in Arrow, and parallelizing file reads across the compute cluster.
Dremio Sabot and Calcite Optimization Details
Dremio implements a distributed SQL execution engine named Sabot that utilizes Apache Calcite for logical and physical query optimization. When a query is submitted to Dremio, the planner uses Calcite to parse the SQL statement and construct an abstract syntax tree. The Calcite optimizer then applies a series of relational algebra transformation rules to optimize the query structure.
For Apache Iceberg tables, this planning phase is highly optimized:
- Logical Plan Transformation: Calcite matches query filters with partition transformations defined in the Iceberg metadata. If a query filters on a timestamp column, Calcite maps that filter to the underlying partition spec and rewrites the logical plan to prune unneeded partitions before physical execution begins.
- Data Reflection Matching: Dremio's optimizer compares the logical query plan with the logical plans of active Data Reflections. If a match is found, Calcite automatically rewrites the plan to read from the pre-computed reflection instead of scanning the raw Iceberg data files. This query acceleration is transparent to the user, requiring no modifications to the original SQL query.
- Arrow-Native Vectorized Execution: Once the optimized plan is finalized, the Sabot execution engine schedules task execution across the coordinator and executor nodes. Sabot executes tasks vectorially using the Apache Arrow memory layout. Instead of processing rows individually, Sabot executes operations on entire vectors of columnar values in CPU registers, minimizing memory bandwidth bottlenecks and maximizing throughput.
Other Core Engines in the Ecosystem
- Apache Spark: A general-purpose batch compute engine widely used for heavy ETL pipelines, bulk writes, and large-scale data ingestion into Iceberg tables.
- Trino: A distributed SQL query engine optimized for fast, ad-hoc interactive queries across multi-cloud environments.
- Apache Flink: A stateful stream processing framework that acts as a continuous streaming sink for Iceberg, committing data at checkpoint intervals.
- PyIceberg: A lightweight, python-native client library that enables data scientists to load Iceberg tables directly into pandas or Polars DataFrames without requiring a Java runtime environment.
Storage Reclamation and Table Maintenance
Although Apache Iceberg optimizes performance and provides ACID transactions, the lifecycle of a table introduces physical fragmentation. Streaming ingestion, high-concurrency writes, and recurring updates create small files and redundant metadata. To maintain optimal query latencies and control storage costs, administrators must implement three core maintenance operations.
1. Compaction: Resolving the Small Files Problem
Streaming jobs write data in frequent intervals, creating thousands of small Parquet files. This introduces the small files problem: query engines must open and close thousands of files to read a small volume of records, which increases query execution time.
Compaction solves this problem by merging small files into larger, optimized files. Iceberg supports two main compaction strategies:
- Bin-Packing: The default strategy, which groups small files together into standard target file sizes (usually 128 MB to 512 MB). This is a metadata-intensive, low-compute operation because it does not sort or reorder rows.
- Sort and Z-Order: These strategies reorder rows while writing. Z-ordering groups similar column values across multiple dimensions into identical physical files. This maximizes the efficiency of column-level min/max statistics, enabling Dremio to skip files during query execution.
2. Snapshot Expiration: Controlling Storage Growth
Every transaction on an Iceberg table creates a new snapshot, preserving old data files to allow time-travel queries. If snapshots are never removed, storage consumption grows indefinitely.
Running snapshot expiration jobs removes historical snapshot references from the metadata JSON. The expiration job checks which data and manifest files are no longer referenced by any active snapshot and deletes them from object storage, reclaiming storage space.
3. Orphan File Cleanup: Deleting Unreferenced Bytes
Sometimes, compute engines crash during a write transaction. While the write fails to commit to the catalog, the physical data files may have already been written to S3 or object storage. These files are orphan files: they exist on disk but are not referenced by any metadata.
An orphan file cleanup job compares the physical files stored in the object store path with the list of files registered across all table metadata snapshots. Any physical file that does not appear in the metadata is deleted, keeping storage clean and reducing cloud costs.
When Not to Use Apache Iceberg
While Apache Iceberg is an excellent default choice for enterprise data lakehouses, it is not a universal solution for every workload. Architects should avoid Iceberg in the following scenarios:
- Low-Latency Transactional (OLTP) Applications: Iceberg is designed for analytical queries on large datasets. High-frequency, single-row inserts (such as logging user events one by one) will create millions of tiny metadata files, causing severe metadata bloat. Traditional relational databases like PostgreSQL or MySQL are far better suited for transactional systems.
- Small, Static Datasets: If your tables are only a few megabytes or gigabytes and are rarely modified, the overhead of maintaining an Iceberg catalog and reading hierarchical metadata is unnecessary. A plain directory of Parquet files or even CSVs is easier to manage.
- Raw unstructured data processing: Iceberg is a table format for structured and semi-structured tabular data. For unstructured data (such as raw images, audio files, or text logs), direct object storage paths are the appropriate destination.
Frequently Asked Questions (FAQ)
How does Iceberg handle concurrent writes?
Iceberg uses Optimistic Concurrency Control (OCC). Both writers perform their operations independently. At the commit phase, the catalog validates whether the snapshot pointer has changed. If another writer committed first, the second writer retries the operation by merging the changes and attempting another commit.
What is the difference between a table format and a file format?
A file format (like Parquet or ORC) defines how data values are physically laid out and compressed inside a single file. A table format (like Iceberg) defines how multiple files are grouped, structured, and managed together to represent a single logical database table.
Can I use multiple query engines on the same Iceberg table at the same time?
Yes. Because the catalog acts as the single source of truth and manages commits atomically, you can read the table with Trino or Dremio while Spark is performing writes. Readers will continue to see the old snapshot until the write transaction commits.
How does partition evolution affect historical queries?
It does not. When you evolve a partition spec, old files remain in their original layout. Iceberg keeps track of historical partition specs in the table metadata, and the query planner automatically generates split plans to prune files written under different specifications.
How does the catalog ensure credential vending and data storage isolation?
Modern enterprise catalogs like Apache Polaris implement credential vending. When a compute engine requests access to a table, the catalog verifies the engine's RBAC permissions. If authorized, the catalog requests short-lived, scoped security tokens (such as temporary AWS IAM credentials) from the cloud provider. The catalog vends these credentials to the engine, allowing it to read and write directly to the S3 bucket path containing the table's data and metadata files. This design ensures that compute engines never require direct, permanent read and write access to the entire storage bucket, enforcing storage isolation and enhancing data lakehouse security.
Does Iceberg support Git-like operations?
Yes, Iceberg support for branching and tagging is built directly into the metadata model. You can create named branches for write-audit-publish patterns, or tags to bookmark historical snapshots for machine learning model reproducibility.