Apache Iceberg vs. Delta Lake vs. Apache Hudi: A Technical Deep Dive
In the modern data lakehouse architecture, the storage layer has evolved from a simple repository of raw files into a transactional database environment. This transformation is made possible by open table formats: software layers that sit between raw data files (such as Apache Parquet or Apache ORC) and compute engines (such as Dremio, Apache Spark, and Trino).
Three formats have emerged as the industry standards for managing mutable analytical tables on object storage: Apache Iceberg, Delta Lake, and Apache Hudi. While all three solve the same core problem (providing ACID transactions, consistent reads, schema enforcement, and time travel over cheap storage), they were designed with different architectural priorities. Consequently, each format is optimized for different workloads, engineering ecosystems, and query patterns.
This guide provides an engine-neutral comparison of the three table formats. We examine their historical origins, compare their metadata architectures, analyze performance benchmarks across different engines, evaluate their row-level deletion strategies, and establish a decision framework to help data architects select the optimal format for their analytical workloads.
Origins and Governance Models
The design priorities of each format are deeply rooted in their historical origins. Understanding where and why these formats were built explains their core architectural trade-offs.
Apache Iceberg: Open Standards and Multi-Engine Interoperability
Apache Iceberg was originally developed at Netflix in 2017 by Ryan Blue and Dan Weeks. At the time, Netflix managed massive datasets on AWS S3 and relied on Apache Hive to structure tables. As data volumes expanded, Netflix engineers encountered major operational bottlenecks with Hive. These included directory listing latency on S3, atomic commit failures during concurrent writes, and query planning overhead.
Iceberg was designed from the ground up to solve these problems by shifting table state tracking from directory locations to a tree of metadata files. Netflix's primary goal was to ensure that multiple compute engines (such as Spark, Trino, and Flink) could read and write to the same tables concurrently without lock-in. Netflix donated Iceberg to the Apache Software Foundation in 2018, and it graduated to a top-level project in 2020.
Delta Lake: Spark-First Optimizations and Commercial Integration
Delta Lake was created by Databricks in 2019. Databricks built Delta Lake to address reliability issues in Apache Spark workloads running on cloud object storage. Before Delta Lake, engineers writing Spark pipelines to S3 or ADLS struggled with partial write failures, which left data lakes in a corrupted state, and lacked transaction isolation.
Delta Lake solved these issues by implementing a transaction log directory
(named _delta_log) alongside the data files. This log acts as
a single source of truth, tracking transactions sequentially. Initially,
Delta Lake was tightly coupled with Apache Spark, and several advanced
features were proprietary to the Databricks platform. Databricks
open-sourced the format in 2019 under the Linux Foundation, and
subsequently released Delta Lake 3.0 in 2023 to bring greater parity
between the open-source library and its proprietary features.
Apache Hudi: High-Frequency Streaming and Incremental CDC
Apache Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber in 2016 by Vinoth Chandar. Uber needed to ingest massive volumes of ride-sharing and passenger data in real time, executing high-frequency updates and deletes (slowly changing dimensions and change data capture feeds) over Hadoop HDFS.
Uber designed Hudi specifically to optimize write performance for frequent key-based updates and to enable incremental query processing. Hudi was donated to the Apache Software Foundation in 2019 and became a top-level project in 2020. Hudi is unique in its focus on streaming ingestion, utilizing index structures (such as Bloom filters or HBase index tables) to perform fast upsert lookups during write operations.
Metadata Architecture: How State is Tracked
The core differentiator between the three formats is how they track which physical data files make up the current state of a logical table. This structural tracking determines how query planning occurs, how concurrent writes are managed, and how files are pruned during execution.
1. Apache Iceberg: Hierarchical Metadata Tree
Iceberg tracks table state using a three-tiered metadata tree stored directly alongside data files in object storage. This hierarchical layout enables engines to execute query planning without performing costly directory listings.
- Table Metadata File (JSON): This file acts as the root of the table's state. A catalog pointer points to the current table-metadata.json file. Inside this file, Iceberg records the table's format version (either version 1 or version 2), a unique table UUID, the current schema ID, a history of all schema definitions, the current partition specification ID, a history of all partition specifications, sort orders, and a list of snapshots. Each schema entry assigns a unique field ID to every column. Compute engines track columns by these field IDs rather than names, preventing column renaming from corrupting old files. The snapshots array lists every snapshot of the table, tracking the snapshot ID, parent snapshot ID, timestamp in milliseconds, the path to the manifest list file, and a summary map describing the operation (such as append, overwrite, or delete).
- Manifest List File (Avro): Each snapshot points to a single manifest list file. The manifest list file acts as an index of the snapshot, listing the manifest files that make up the snapshot. For each manifest file, the list records the file path, the partition specification ID used to write it, the number of added files, the number of existing files, the number of deleted files, and partition summaries. These summaries store the minimum and maximum values of the partition columns for all data files tracked by that manifest. Query engines read the manifest list first, evaluating query filters against the partition summaries. This allows engines to skip reading entire manifest files during query planning if their partition bounds do not overlap with the query predicate.
- Manifest File (Avro): The lowest metadata level. Manifest files track individual data and delete files. Each manifest entry records the physical file path, the file format (such as Parquet, ORC, or Avro), the partition tuple, the file size in bytes, the row count, and column-level statistics. These statistics include lower and upper bounds for each column, null counts, and NaN counts. Manifests also assign a status code to each file entry (0 for existing, 1 for added, and 2 for deleted). When a query engine executes a scan, it reads the manifest entries remaining after manifest list pruning and uses the column statistics to skip individual Parquet files that do not contain matching data.
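The field-ID behavior described for the table metadata file can be illustrated with a small sketch. This is a simplified, hypothetical model rather than Iceberg's actual reader: the dictionaries `schema_v1` and `schema_v2`, the `project` helper, and the sample row are all assumptions made for illustration.

```python
# Hypothetical, simplified model of ID-based column resolution.
# A row from a data file written under the old schema, keyed by field ID.
old_file_row = {1: 1001, 2: "alice@example.com"}  # field_id -> value

# Column name -> field ID, before and after a rename. The IDs never change.
schema_v1 = {"order_id": 1, "email": 2}
schema_v2 = {"order_id": 1, "contact_email": 2}  # "email" was renamed


def project(row_by_id, schema):
    """Resolve each column of the given schema through its field ID."""
    return {name: row_by_id.get(field_id) for name, field_id in schema.items()}


# The old file is still readable under the new column name.
renamed_view = project(old_file_row, schema_v2)
```

Because the lookup goes through the ID, the rename touches only metadata; no data file has to be rewritten.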
The three-level hierarchy means query planning needs no directory listings: the number of metadata reads scales with the manifests that survive pruning, not with the number of directories or files in the table. Compute engines query the catalog to find the metadata JSON path, read the manifest list, prune manifests based on query filters, and then scan only the relevant manifests, which eliminates cloud object storage listing penalties.
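The two pruning stages (manifest list, then manifest entries) can be sketched as follows. This is a minimal in-memory model under stated assumptions, not Iceberg's planner: the `Manifest` and `DataFile` classes, the `plan_scan` function, and the sample paths are hypothetical, and partition bounds are modeled as ISO date strings so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    lower: int  # lower bound of order_id in this file
    upper: int  # upper bound of order_id in this file


@dataclass
class Manifest:
    path: str
    part_min: str  # min order_date across the files this manifest tracks
    part_max: str  # max order_date across the files this manifest tracks
    files: list = field(default_factory=list)


def overlaps(lo, hi, q_lo, q_hi):
    """True when [lo, hi] intersects the query range [q_lo, q_hi]."""
    return lo <= q_hi and q_lo <= hi


def plan_scan(manifest_list, date_lo, date_hi, id_lo, id_hi):
    """Return the data files to scan and how many manifests were opened."""
    selected, manifests_read = [], 0
    for m in manifest_list:
        # Stage 1: skip whole manifests using partition summaries.
        if not overlaps(m.part_min, m.part_max, date_lo, date_hi):
            continue
        manifests_read += 1
        # Stage 2: skip individual files using column bounds.
        for f in m.files:
            if overlaps(f.lower, f.upper, id_lo, id_hi):
                selected.append(f.path)
    return selected, manifests_read


manifest_list = [
    Manifest("m1.avro", "2024-01-01", "2024-01-31",
             [DataFile("a.parquet", 1, 500), DataFile("b.parquet", 501, 900)]),
    Manifest("m2.avro", "2024-02-01", "2024-02-29",
             [DataFile("c.parquet", 901, 1500)]),
]
files, manifests_read = plan_scan(
    manifest_list, "2024-01-10", "2024-01-20", 600, 700
)
```

In this example only one of the two manifests is opened and only one of the three data files is selected, without any listing of the storage location.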
2. Delta Lake: Sequential Transaction Log
Delta Lake tracks state using a directory named _delta_log/ located
at the root of the table. Rather than using a hierarchical tree, Delta Lake
relies on a sequential log of transaction files.
- Commit Files (JSON): Every write transaction appends a new JSON file to the log directory, named sequentially (for example, 00000000000000000000.json, 00000000000000000001.json). These JSON files record the individual actions applied to the table during that transaction. The primary actions include metaData (defines the schema, partition columns, and configuration parameters), add (records the physical path of a new data file, its size in bytes, modification time, and partition values, along with a JSON string of statistics containing row counts, null counts, and min/max column bounds), remove (marks existing files as deleted, recording the physical path and deletion timestamp), and protocol (defines the minimum reader and writer version requirements for engines interacting with the table).
- Checkpoint Files (Parquet): To prevent compute engines from having to read millions of JSON files to reconstruct the table state, Delta Lake generates a checkpoint file (stored in Parquet format) every 10 commits by default (for example, 00000000000000000010.checkpoint.parquet). The checkpoint file aggregates all active file entries and their statistics up to that commit version, removing entries for files that were removed by previous transactions.
To read a Delta Lake table, the query engine reads the _last_checkpoint file in the log directory to find the latest checkpoint version. It reads
that checkpoint Parquet file directly, then lists the _delta_log/ directory to locate any subsequent JSON commit files written after the checkpoint.
The engine replays these newer JSON commits in memory to compile the final
list of active data files. Because directory listing is required to discover
the latest JSON files after a checkpoint, Delta Lake query planning performance
can degrade if log directories accumulate too many uncompacted commits, requiring
regular log cleanups.
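The read path described above (start from a checkpoint, then replay later commits) can be sketched in a few lines. This is an illustrative model that assumes the checkpoint has already been decoded into a set of active paths and that each commit is available as a newline-delimited JSON string; the `replay` and `commit_name` helpers and the sample file names are hypothetical, and only the add and remove actions are handled.

```python
import json


def commit_name(version: int) -> str:
    """Commit files are named with a zero-padded 20-digit version."""
    return f"{version:020d}.json"


def replay(checkpoint_paths, commits):
    """Apply add/remove actions from later commits on top of a checkpoint.

    checkpoint_paths: set of data file paths active at the checkpoint.
    commits: commit bodies in version order, one JSON action per line.
    """
    active = set(checkpoint_paths)
    for body in commits:
        for line in body.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
            # metaData and protocol actions are ignored in this sketch.
    return active


checkpoint = {"part-0000.parquet", "part-0001.parquet"}
commit_11 = '{"add":{"path":"part-0002.parquet","size":1024}}'
commit_12 = (
    '{"remove":{"path":"part-0000.parquet","deletionTimestamp":1700000000000}}\n'
    '{"add":{"path":"part-0003.parquet","size":2048}}'
)
active_files = replay(checkpoint, [commit_11, commit_12])
```

The more commits accumulate after the last checkpoint, the longer this replay loop runs, which is the planning cost the paragraph above refers to.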
3. Apache Hudi: Timeline and Index Layers
Apache Hudi uses a metadata directory named .hoodie/ to maintain
a transactional timeline of commits. Hudi is designed to optimize write performance
for frequent key-based updates, relying on indexes to locate files rather than
parsing metadata trees.
- Timeline: Hudi tracks all operations (such as commits, delta commits, compactions, and cleanups) as instants on a timeline. Each instant contains a state (requested, inflight, or completed) and an action type. The timeline acts as a transaction log, ensuring write isolation and enabling incremental reads. The actions include commit (writing a set of base Parquet files), deltacommit (appending write updates directly to log files in Merge-on-Read tables), clean (deleting files that are no longer needed by older snapshots), compaction (merging Avro delta log files into base Parquet files), and rollback (reversing a failed transaction).
- Index Layer: Unlike Iceberg and Delta, which locate files by scanning metadata, Hudi relies on index structures to map record keys directly to the physical files containing them. When a writer receives an update, it checks the index to determine which file group contains the matching key. Hudi supports several indexing strategies. The Bloom Filter Index stores Bloom filters in the footers of base Parquet files, allowing writers to quickly prune files during updates. The Bucket Index uses static hashing to allocate records to specific files based on the record key. The Metadata Table Index maintains an internal metadata table that stores column stats, bloom filters, and file listings, preventing expensive file footer scans and directory listings during updates and reads.
When writing data, Hudi uses this index layer to check whether an incoming record already exists in the table. If it does, the writer updates the file group that holds it (rewriting the base file or appending to a delta log); if not, the record is inserted as new data. This index-centric architecture makes key-based updates efficient, but every upsert pays the cost of an index lookup and of keeping the index current.
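The idea behind a hash-based (bucket) index can be sketched as follows. This is a deliberately simplified illustration, not Hudi's implementation: the bucket count, the use of CRC32 as the hash, the `existing_keys_by_bucket` lookup structure, and the `bucket_for` and `route` helpers are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 4  # assumed, fixed when the table is created


def bucket_for(record_key: str) -> int:
    """Map a record key to a bucket with a stable hash (CRC32 here)."""
    return zlib.crc32(record_key.encode("utf-8")) % NUM_BUCKETS


def route(batch, existing_keys_by_bucket):
    """Tag each incoming key with its bucket and whether it is an update."""
    tagged = []
    for key in batch:
        bucket = bucket_for(key)
        known = existing_keys_by_bucket.get(bucket, set())
        op = "update" if key in known else "insert"
        tagged.append((key, bucket, op))
    return tagged


# Keys already stored in the table, grouped by bucket.
existing = {}
for key in ["ride-1", "ride-2"]:
    existing.setdefault(bucket_for(key), set()).add(key)

tagged = route(["ride-2", "ride-9"], existing)
```

The same key always hashes to the same bucket, so the writer can go straight to one file group instead of searching the whole table.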
Aggregated Performance Benchmarks
Performance comparison across table formats is not static. It depends heavily on the query engine utilized, library versions, clustering layouts, and query workloads. We have aggregated real-world benchmark data comparing these formats.
Workload Categories and Results
The following evaluations are based on enterprise testing using Spark 3.5, Dremio 25.x, and Trino 450 query engines, with table libraries set to Iceberg 1.6, Delta 3.2, and Hudi 0.15. Workloads are categorized into scan throughput, concurrent writes, and point lookups.
| Workload Type | Apache Iceberg | Delta Lake | Apache Hudi | Workload Summary and Performance Drivers |
|---|---|---|---|---|
| Scan Throughput (Read-Heavy) | Excellent (Fast Pruning) | Excellent (Fast Pruning) | Moderate (Index Overhead) | Iceberg and Delta prune files efficiently using column stats. Hudi's index checks add scan overhead. |
| Concurrent Writes (Optimistic Lock) | High (Conflict Retries) | High (Conflict Retries) | Moderate (Queue Block) | Iceberg and Delta resolve concurrent appends via retry loops. Hudi uses lock providers to serialize commits. |
| Point Lookups (Key Searches) | Moderate (Full Scan) | Moderate (Full Scan) | Excellent (Index Lookup) | Hudi locates record keys directly using Bloom filters or HBase indexes, bypassing full table scans. |
| CDC Ingestion (Upsert/Delete) | High (Merge-on-Read) | High (Deletion Vectors) | Excellent (Timeline Compaction) | Hudi optimizes streaming CDC using log merges. Iceberg applies delete files at read time, and Delta applies deletion vectors. |
Optimistic Concurrency Control vs. Pessimistic Lock Providers
Data write conflicts are managed differently across the formats. Apache Iceberg and Delta Lake rely primarily on Optimistic Concurrency Control (OCC). Under OCC, writers assume that conflicts are rare. When a transaction starts, the writer reads the table's current snapshot and prepares its changes (writing new data or delete files) in isolation. When the writer attempts to commit, it checks if another writer has committed a new snapshot since the transaction began. If no conflict is found, the commit succeeds. If a conflict occurs (for example, if another writer modified the same files or partitions), the transaction fails and the writer must retry. Iceberg handles this by reading the updated metadata, checking if the changes overlap, and applying a retry loop up to a configured threshold. This model works exceptionally well for appends and disjoint updates, but experiences high commit failure rates during heavy concurrent updates to the same partitions.
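The optimistic commit-and-retry cycle can be sketched with an in-memory catalog that exposes a compare-and-swap operation. This is a minimal model under stated assumptions, not the API of Iceberg or Delta Lake: the `Catalog` class, `commit_append`, and the `before_commit` hook (used here only to simulate a rival writer) are hypothetical.

```python
class Catalog:
    """Toy catalog: a version number plus the set of live data files."""

    def __init__(self):
        self.version = 0
        self.files = frozenset()

    def compare_and_swap(self, expected_version, new_files):
        """Commit only if nobody else committed since expected_version."""
        if self.version != expected_version:
            return False
        self.version += 1
        self.files = frozenset(new_files)
        return True


def commit_append(catalog, new_file, max_retries=3, before_commit=None):
    """Append one file, re-reading table state and retrying on conflict."""
    for attempt in range(1, max_retries + 1):
        base_version, base_files = catalog.version, catalog.files
        if before_commit:
            before_commit(attempt)  # test hook: lets a rival writer commit
        if catalog.compare_and_swap(base_version, base_files | {new_file}):
            return attempt
    raise RuntimeError("commit failed after retries")


catalog = Catalog()


def rival(attempt):
    # A second writer sneaks in a commit during our first attempt only.
    if attempt == 1:
        catalog.compare_and_swap(catalog.version, catalog.files | {"rival.parquet"})


attempts = commit_append(catalog, "mine.parquet", before_commit=rival)
```

Appends retry cheaply because the new file never overlaps with the rival's change; a conflicting update to the same files would instead have to redo its work, which is where the commit failure rate climbs.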
In contrast, Apache Hudi supports OCC as well, but coordinates multiple writers through explicit lock providers (such as ZooKeeper, AWS DynamoDB, or Hive Metastore locks). When multiple writers attempt updates, Hudi uses these lock providers to serialize the final commit step. Forcing writers to acquire a lock before committing reduces the compute wasted on retry loops during high-concurrency workloads, but adds an external dependency that must be deployed and managed.
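Lock-based serialization can be illustrated with a process-local lock standing in for an external lock provider. This is only an analogy under stated assumptions: `LockedTimeline` is a hypothetical class, and `threading.Lock` substitutes for a distributed lock such as one held in ZooKeeper or DynamoDB.

```python
import threading


class LockedTimeline:
    """Toy timeline whose commit step is guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for an external lock provider
        self.instants = []

    def commit(self, writer_id):
        # Writers queue on the lock, so commits land one at a time
        # and none of them has to be retried.
        with self._lock:
            instant = len(self.instants) + 1
            self.instants.append((instant, writer_id))
            return instant


timeline = LockedTimeline()
threads = [
    threading.Thread(target=timeline.commit, args=(f"writer-{i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every writer commits exactly once and the instants form a gap-free sequence, at the cost of writers waiting on the lock instead of proceeding in parallel.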
Engine Selection Caveats: Dremio vs. Spark vs. Trino
The choice of compute engine exerts a larger influence on performance than the choice of table format itself. An unoptimized engine configuration can nullify the benefits of a format's metadata layout.
- Dremio 25.x (Sub-Second BI Acceleration): Dremio targets sub-second query performance on Apache Iceberg tables by processing data in columnar Apache Arrow buffers rather than row-oriented JVM objects. Dremio's Sabot execution engine executes SQL queries in vectorized form in memory using Apache Arrow, avoiding serialization bottlenecks. Furthermore, Dremio implements a Coordinator Metadata Cache. During query planning, the planner queries this local metadata cache rather than scanning cloud object storage, reducing planning latency to milliseconds.
  - Dremio's Vectorized Parquet Reader: Dremio reads Parquet column blocks directly into Apache Arrow memory layouts. Because Arrow and Parquet share a similar columnar model, Dremio avoids CPU-intensive serialization and row-to-column translation.
  - Dremio's Positional Delete Caching: When reading Iceberg Merge-on-Read tables, Dremio caches positional delete bitmaps in the executor node memory (Data Cache). When scanning data files, Dremio applies the cached delete masks in memory, reducing the scan-time join penalties associated with Merge-on-Read tables.
  - Dremio's Data Reflections: Dremio automatically rewrites queries using Apache Calcite to match pre-computed Iceberg reflections (aggregations or raw projections), providing sub-second latency for enterprise BI dashboards without manual caching management.
- Apache Spark 3.5 (Batch Ingestion and Transformations): Spark remains a strong default for batch ingestion, bulk transformations, and heavy write workloads. Spark interacts with all three formats natively. For Delta Lake, Spark is the preferred engine, and the Databricks runtime adds its own optimizations. When writing to Iceberg tables, Spark requires explicit catalog configuration; commits are atomic per table, so multi-table atomicity depends on the catalog (for example, Nessie).
- Trino 450 (Ad-Hoc SQL Federation): Trino is optimized for fast, interactive ad-hoc querying across multi-cloud object stores. Trino has robust support for Iceberg and Delta Lake, using parallel executors to split planning tasks. However, Trino's write path to Delta Lake is slower compared to Spark due to catalog connector limitations, and its Hudi connector provides read access only.
Physical Layout Optimization: Iceberg Z-Order vs. Delta Liquid Clustering
To achieve sub-second read performance, data files must be organized to group related values together. This maximizes the efficiency of file pruning.
Apache Iceberg uses Z-Ordering (a multi-dimensional space-filling
curve) to cluster data files. When compacting a table, Iceberg reorganizes
rows across multiple columns (such as customer_id and order_date) to ensure that the min/max ranges for these columns are highly
localized. This allows query engines to skip scanning files during
execution. However, running Z-order compaction is a compute-intensive
batch operation that must be scheduled periodically.
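The space-filling curve behind Z-ordering is built by interleaving the bits of the clustering columns. The sketch below shows the idea for two non-negative integer columns; the `z_value` helper and the 16-bit width are assumptions for illustration, and a real compaction job would first map values such as dates or strings to integers.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sort a small grid of (customer_id, order_day) pairs by Z-value.
rows = [(x, y) for x in range(4) for y in range(4)]
rows_sorted = sorted(rows, key=lambda r: z_value(r[0], r[1]))

# Writing the sorted rows out in chunks keeps both columns' min/max ranges narrow.
chunks = [rows_sorted[i:i + 4] for i in range(0, len(rows_sorted), 4)]
```

Each chunk of four rows covers a 2x2 square of the grid, so a filter on either column can skip about half of the chunks, whereas a plain sort on one column would leave the other column's range spanning every chunk.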
Delta Lake 3.0+ introduces Liquid Clustering as an alternative to Z-ordering. Liquid clustering is a dynamic, incremental clustering strategy. Instead of requiring developers to select fixed partition columns or run massive compaction jobs, Liquid clustering partitions and clusters data dynamically as writes occur. It adapts to changing query patterns without requiring table schema rewrites, reducing write amplification compared to Z-order compactions.
Row-Level Mutations: Copy-on-Write vs. Merge-on-Read
Analytical data lakes are primarily append-only. However, regulatory requirements (such as GDPR delete requests) and CDC pipelines require row-level updates and deletes. The formats implement distinct write-and-read strategies to handle these mutations.
1. Copy-on-Write (CoW)
Copy-on-Write is supported by all three formats. In CoW mode, any update or delete operation requires the compute engine to rewrite the physical data files containing the targeted records.
If a table contains a 100 MB Parquet file with 1,000,000 rows, and a query deletes a single row, the engine reads the entire 100 MB file, filters out the deleted row, and writes a new 100 MB Parquet file. The metadata is updated to reference the new file and ignore the old one.
- Pros: Optimal read performance. Because there are no additional files or joins, query engines read data files directly as standard columnar structures.
- Cons: Severe write amplification. Updating or deleting a small number of rows across many files requires massive storage I/O and compute, making CoW unsuitable for streaming updates.
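The write amplification of Copy-on-Write can be made concrete with a small in-memory sketch. Data files are modeled as plain lists of rows; the `copy_on_write_delete` function and the `rewritten-` file naming are hypothetical and chosen only for illustration.

```python
def copy_on_write_delete(files, predicate):
    """Delete matching rows by rewriting every file that contains one.

    files: dict mapping file path -> list of rows.
    Returns the new file set and the number of surviving rows rewritten.
    """
    new_files = {}
    rows_rewritten = 0
    for path, rows in files.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            new_files[path] = rows  # untouched file is carried over as-is
        else:
            new_files[f"rewritten-{path}"] = kept  # whole file is rewritten
            rows_rewritten += len(kept)
    return new_files, rows_rewritten


files = {
    "f1.parquet": list(range(0, 1000)),
    "f2.parquet": list(range(1000, 2000)),
}
new_files, rows_rewritten = copy_on_write_delete(files, lambda row: row == 1500)
```

Deleting one row forces 999 unaffected rows to be written again, while readers afterwards see only ordinary data files with nothing to merge.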
2. Merge-on-Read (MoR)
Merge-on-Read optimizes write performance by deferring data rewrites. When an update or delete occurs, the engine writes a separate, smaller file recording the mutation and commits it to metadata.
When a query engine reads a MoR table, it must read the base data files and join them with the delete files to filter out modified rows on the fly.
The formats implement MoR differently:
- Apache Iceberg v2: Iceberg supports two types of delete files:
  - Position Deletes: The delete file contains the exact file path and row positions of the deleted records. This is highly efficient for query engines to read and apply.
  - Equality Deletes: The delete file contains column values (such as customer_id = 502). Query engines must perform hash joins on execution, which increases read latencies.
- Delta Lake Deletion Vectors: Delta Lake implements deletion vectors (introduced in the 2.x releases and extended to more operations in 3.x). A deletion vector is a bitmap stored alongside data files. When a row is deleted, Delta Lake records the row's position in the bitmap instead of rewriting the data file. During execution, query engines load the bitmap into memory and skip the flagged positions without performing expensive join operations, accelerating read performance compared to standard MoR.
- Apache Hudi MoR: Hudi writes delta records into log files stored in Avro format. During read operations, the engine merges the Avro log files with the base Parquet files of the same file group, matching records by key. Hudi runs background compaction jobs to merge Avro log files back into Parquet files, reclaiming read efficiency.
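The read-side merge that position-based deletes require can be sketched as follows. This is a simplified in-memory model, not any engine's reader: base files are lists of rows, position deletes are (path, row position) pairs, and the `read_merge_on_read` function is hypothetical.

```python
def read_merge_on_read(base_files, position_deletes):
    """Return live rows by masking out deleted positions per file.

    base_files: dict mapping file path -> list of rows.
    position_deletes: iterable of (file path, row position) pairs.
    """
    # Group deleted positions by file, like a per-file bitmap.
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)

    live_rows = []
    for path, rows in base_files.items():
        skip = deleted.get(path, set())
        live_rows.extend(row for pos, row in enumerate(rows) if pos not in skip)
    return live_rows


base_files = {"f1.parquet": ["a", "b", "c"], "f2.parquet": ["d", "e"]}
position_deletes = [("f1.parquet", 1), ("f2.parquet", 0)]
live_rows = read_merge_on_read(base_files, position_deletes)
```

The write that produced these deletes only had to record two small entries, but every subsequent read pays for the masking step until compaction rewrites the base files.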
Runnable SQL Examples and Configurations
To illustrate the difference between these mutation strategies, let us
review concrete SQL configurations and operations. We use the standard analytics.orders and analytics.customers schemas to demonstrate write modes.
Merge-on-Read Configuration and MERGE INTO (analytics.orders)
Below is the Spark SQL configuration to create the analytics.orders table with Merge-on-Read enabled for updates, deletes, and merges. This configuration
optimizes write performance by writing deletes as separate position-delete
files rather than rewriting Parquet files.
/* Create analytics.orders table using Apache Iceberg with Merge-on-Read write mode */
CREATE TABLE local.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status STRING
) USING iceberg
TBLPROPERTIES (
    'write.update.mode' = 'merge-on-read',
    'write.delete.mode' = 'merge-on-read',
    'write.merge.mode' = 'merge-on-read'
);

/* Execute an upsert using MERGE INTO from a source delta updates table */
MERGE INTO local.analytics.orders AS target
USING (
    SELECT order_id, customer_id, order_date, amount, status
    FROM local.analytics.orders_updates
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET
        target.amount = source.amount,
        target.status = source.status,
        target.order_date = source.order_date
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, order_date, amount, status)
    VALUES (source.order_id, source.customer_id, source.order_date, source.amount, source.status);
Copy-on-Write Configuration and UPDATE (analytics.customers)
For tables where query performance is critical and write frequency is low,
Copy-on-Write is preferred. Below is the configuration for the analytics.customers table, followed by a row-level update.
/* Create analytics.customers table using Apache Iceberg with Copy-on-Write write mode */
CREATE TABLE local.analytics.customers (
    customer_id BIGINT,
    name STRING,
    email STRING,
    country STRING
) USING iceberg
TBLPROPERTIES (
    'write.update.mode' = 'copy-on-write',
    'write.delete.mode' = 'copy-on-write'
);

/* Execute an update that rewrites only the Parquet files containing the matching records */
UPDATE local.analytics.customers
SET email = 'updated_customer@example.com'
WHERE customer_id = 1045;
Ecosystem, Catalogs, and Governance
A table format does not operate in a vacuum. It requires a catalog to track table locations and enforce access controls. The catalog architecture is critical for multi-engine interoperability and preventing vendor lock-in.
1. Iceberg Catalog Model: Decentralized and Open
Iceberg defines an open REST Catalog API specification. Any service that implements this specification can function as an Iceberg catalog. Compute engines make HTTP requests to the REST service to load schemas and request atomic commits.
This open model has led to multiple implementations:
- Apache Polaris: An open-source, stateless catalog providing role-based access control (RBAC) and credential vending. Polaris allows engines to authenticate via OAuth2, checks permissions, and vends temporary security credentials to read/write storage directly.
- Project Nessie: A transaction catalog that stores commits in a database hash tree, enabling Git-like branching and merging of table changes across multiple tables.
- AWS Glue Catalog: A fully managed AWS catalog that implements optimistic lock checks for Iceberg pointer swaps.
2. Delta Lake Catalog Model: Unity Catalog
Historically, Delta Lake relied on the Hive Metastore to track directories. To provide advanced catalog capabilities, Databricks introduced Unity Catalog, a unified governance and access control layer. Unity Catalog coordinates transactions, manages schemas, and enforces access control rules. While Databricks open-sourced Unity Catalog in 2024 to address lock-in concerns, its deployment and execution remain heavily optimized for the Databricks cloud platform. Operating Delta Lake tables outside the Databricks ecosystem typically means falling back to the Hive Metastore, deploying Unity Catalog, or relying on translation layers, which increases operational complexity compared to Iceberg's open REST API.
3. Hudi Catalog Model: Engine Metadata Table
Hudi does not require a separate transactional catalog layer to coordinate
pointer swaps. Instead, the Hudi library itself tracks transactions
directly in the table's .hoodie/ directory, writing metadata updates
alongside commits. While Hudi can register tables with the Hive Metastore or
AWS Glue for discovery, the source of truth remains the table timeline itself.
This design simplifies write operations but makes multi-engine concurrency
validation more complex.
Decision Framework: How to Choose
To assist data architects and database engineers in selecting the correct open table format, we have established a workload-driven decision matrix.
| Your Core Requirement | Recommended Format | Architectural Rationale |
|---|---|---|
| Multi-Engine Portability & Governance | Apache Iceberg | The open REST Catalog API allows Dremio, Spark, and Trino to read/write concurrently with central RBAC via Polaris. |
| Databricks Ecosystem Integration | Delta Lake | If your organization runs primarily on Databricks, Delta Lake offers native platform speed and Unity integration. |
| High-Frequency Key Updates & CDC | Apache Hudi | Hudi's key indexes (Bloom/Bucket) and background log compaction minimize update latencies for write-heavy CDC. |
| Sub-Second BI Queries | Apache Iceberg (with Dremio) | Dremio executes vectorized Arrow queries, caches Iceberg metadata locally, and accelerates scans using reflections. |
| Version Control (Branch/Merge) | Apache Iceberg (with Nessie) | Nessie enables Git-like branching for Write-Audit-Publish patterns across multiple tables simultaneously. |
Conclusion and Next Steps
Apache Iceberg, Delta Lake, and Apache Hudi are all mature, enterprise-grade formats. However, their architectural differences are significant. Apache Iceberg represents the standard for engine-neutral lakehouse architectures, offering a clean hierarchical metadata tree and an open catalog API. Delta Lake remains the preferred choice for Spark-heavy, Databricks-centric environments. Apache Hudi provides optimized key indexing and timeline compactions for streaming CDC pipelines.
Data engineers should evaluate their compute ecosystems, update frequency requirements, and catalog governance strategies before selecting a format. In many modern environments, Iceberg has become the default choice due to its open standards, lack of vendor lock-in, and integration with high-performance query acceleration engines like Dremio.