2025 Comprehensive Guide to Apache Iceberg

Apache Iceberg had a monumental 2024, with significant announcements and advancements from major players like Dremio, Snowflake, Databricks, AWS, and other leading data platforms. The Iceberg ecosystem is evolving rapidly, making it essential for professionals to stay up-to-date with the latest innovations. To help navigate this ever-changing space, I’m introducing an annual guide dedicated to Apache Iceberg. This guide aims to provide a comprehensive overview of Iceberg, highlight key resources, and offer valuable insights for anyone looking to deepen their knowledge. Whether you’re just starting with Iceberg or are a seasoned user, this guide will serve as your go-to resource for 2025.

Read this article for details on migrating to Apache Iceberg.

What is a Table Format?

A table format, often referred to as an “open table format” or “lakehouse table format,” is a foundational component of the data lakehouse architecture. This architecture is gaining popularity for its ability to address the complexities of modern data management. Table formats transform how data stored in collections of analytics-optimized Parquet files is accessed and managed. Instead of treating these files as standalone units to be opened and read individually, a table format enables them to function like traditional database tables, complete with ACID guarantees.

With a table format, users can interact with data through SQL to create, read, update, and delete records, bringing the functionality of a data warehouse directly to the data lake. This capability allows enterprises to treat their data lake as a unified platform, supporting both data warehousing and data lake use cases. It also enables teams across an organization to work with a single copy of data in their tool of choice — whether for analytics, machine learning, or operational reporting — eliminating redundant data movements, reducing costs, and improving consistency across the enterprise.
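To make this concrete, here is a minimal sketch of that warehouse-style workflow in PySpark. It assumes a Spark session already configured with the Iceberg runtime, the Iceberg SQL extensions, and a catalog named lakehouse; the catalog, namespace, and table names are illustrative.

```python
# Minimal sketch: warehouse-style CRUD on an Iceberg table with Spark SQL.
# Assumes the SparkSession is already configured with the Iceberg runtime,
# SQL extensions, and a catalog named "lakehouse" (all names illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a table backed by the Iceberg table format
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id   BIGINT,
        customer   STRING,
        amount     DOUBLE,
        order_date DATE
    ) USING iceberg
""")

# Inserts, updates, and deletes work like a traditional warehouse table
spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1, 'acme', 120.50, DATE '2025-01-15')")
spark.sql("UPDATE lakehouse.sales.orders SET amount = 130.00 WHERE order_id = 1")
spark.sql("DELETE FROM lakehouse.sales.orders WHERE order_id = 1")

# Read it back with plain SQL
spark.sql("SELECT * FROM lakehouse.sales.orders").show()
```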

Currently, there are four primary table formats driving innovation in this space: Apache Iceberg, Apache Hudi, Delta Lake, and Apache Paimon.

Each of these table formats plays a role in the evolving data lakehouse landscape, enabling organizations to unlock the full potential of their data lakehouse.

How Table Formats Work

At the core of every table format is a metadata layer that transforms collections of files into a table-like structure. This metadata serves as a blueprint for understanding the data, providing essential details such as the table’s schema, how the table is partitioned, and which data files make up the current version of the table.

This metadata acts as an entry point, allowing tools to treat the underlying files as a cohesive table. Instead of scanning all files in a directory, query engines use the metadata to understand the structure and contents of the table. Additionally, the metadata often includes statistics about partitions and individual files. These statistics enable advanced query optimization techniques, such as pruning or skipping files that are irrelevant to a specific query, significantly improving performance.
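As a rough illustration of metadata-driven planning, the sketch below uses PyIceberg to plan a filtered scan; only the files whose partition values and column statistics could match the predicate are returned. The catalog settings and table name are assumptions for the example.

```python
# Minimal sketch: planning a scan from Iceberg metadata with PyIceberg.
# The catalog name, REST endpoint, and table identifier are illustrative.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# Planning happens against metadata only: partition values and column
# statistics are used to skip files that cannot satisfy the filter.
scan = table.scan(row_filter="order_date >= '2025-01-01'")
for task in scan.plan_files():
    print(task.file.file_path, task.file.record_count)
```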

While all table formats rely on metadata to bridge the gap between raw files and table functionality, each format structures and optimizes its metadata differently. These differences can influence performance, compatibility, and the features each format provides.

How Apache Iceberg’s Metadata is Structured

Apache Iceberg’s metadata structure is what enables it to transform raw data files into highly performant and queryable tables. This structure consists of several interrelated components, each designed to provide specific details about the table and optimize query performance. Here’s an overview of Iceberg’s key metadata elements:

- The metadata file (metadata.json): the root of the table, recording its schemas, partition specs, sort orders, current snapshot, and snapshot history.
- Manifest lists: one per snapshot, listing the manifest files that make up that snapshot along with partition-level statistics used for pruning.
- Manifest files: tracking the individual data files (and delete files), including per-file column statistics such as record counts and min/max value ranges.
- Data files: the Parquet, ORC, or Avro files that hold the actual table data.
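To see these layers on a real table, here is a small PyIceberg sketch that walks from the catalog down to the snapshots; the catalog configuration and table name are illustrative assumptions.

```python
# Minimal sketch: inspecting Iceberg's metadata layers with PyIceberg.
# Catalog settings and the table identifier are illustrative.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# metadata.json: the current schema, partition spec, and snapshot history
print(table.metadata_location)  # location of the current metadata.json
print(table.schema())
print(table.spec())

# Each snapshot points to a manifest list describing that version of the table
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.manifest_list)
```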

The Evolution of Iceberg’s Specification

Apache Iceberg’s specification is constantly evolving through community contributions and proposals. These innovations benefit the entire ecosystem, as improvements made by one platform are shared across others. For example, work on the version 3 specification introduces features such as deletion vectors for more efficient row-level deletes, a variant type for semi-structured data, and row lineage for tracking row-level changes.

This collaborative approach ensures that Apache Iceberg continues to evolve as a cutting-edge table format for modern data lakehouses.

Read this article on the Apache Iceberg Metadata tables.

The Role of Catalogs in Apache Iceberg

One of the key features of Apache Iceberg is its immutable file structure, which makes snapshot isolation possible. Every time the data or structure of a table changes, a new metadata.json file is generated. This immutability raises an important question: how does a tool know which metadata.json file is the latest one?

This is where Lakehouse Catalogs come into play. A Lakehouse Catalog serves as an abstraction layer that tracks each table’s name and links it to the most recent metadata.json file. When a table’s data or structure is updated, the catalog is also updated to point to the new metadata.json file. This update is the final step in any transaction, ensuring that the change is completed successfully and meets the atomicity requirement of ACID compliance.
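The sketch below illustrates this pointer-swap idea with PyIceberg: a commit writes a brand-new metadata.json, and the last step is the catalog atomically repointing the table to it. The catalog configuration, table, and columns are assumptions carried over from the earlier examples.

```python
# Minimal sketch: the catalog's pointer to the latest metadata.json only
# moves when a commit succeeds. Catalog settings and names are illustrative.
import datetime
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")
print("before:", table.metadata_location)

# Appending data writes new data files and a brand-new metadata.json;
# the final step of the commit swaps the catalog pointer to that file.
table.append(pa.table({
    "order_id": [2],
    "customer": ["acme"],
    "amount": [75.0],
    "order_date": [datetime.date(2025, 1, 16)],
}))
print("after: ", table.metadata_location)
```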

Lakehouse Catalogs are distinct from Enterprise Data Catalogs or Metadata Catalogs, such as those provided by companies like Alation and Collibra. While Lakehouse Catalogs focus on managing the technical details of tables and transactions, enterprise data catalogs are designed for end-users. They act as tools to help users discover, understand, and request access to datasets across an organization, enhancing data governance and usability.

Read this article to learn more about Iceberg catalogs.

The Apache Iceberg REST Catalog Spec

As more catalog implementations emerged, each with unique features and APIs, interoperability between tools and catalogs became a significant challenge. This lack of a unified standard created a bottleneck for seamless table management and cross-platform compatibility.

To address this issue and drive innovation, the REST Catalog specification was developed. Rather than requiring all catalog providers to adopt a standardized server-side implementation, the specification introduced a universal REST API interface. This approach ensures that:

- Catalog providers remain free to innovate in their server-side implementations.
- Any tool that speaks the REST API can work with any catalog that implements the specification, without custom client code for each catalog.

With the REST Catalog specification, interoperability and ease of integration have dramatically improved. This innovation allows developers and enterprises to adopt or build catalogs that align with their technical and business requirements while still being compatible with any tool that supports the REST API interface. This forward-thinking design has strengthened the role of catalogs in modern lakehouse architectures, ensuring that Iceberg tables remain accessible and manageable across diverse platforms.
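Because the API surface is standardized, a generic HTTP client can talk to any compliant catalog. Here is a rough sketch using plain HTTP against a hypothetical local REST catalog; the endpoint, namespace, and table names are assumptions, and real deployments add authentication.

```python
# Minimal sketch: the same REST endpoints work against any spec-compliant
# catalog. The base URI, namespace, and table name are illustrative, and
# production catalogs require authentication.
import requests

BASE = "http://localhost:8181/v1"

# Catalog configuration and defaults
print(requests.get(f"{BASE}/config").json())

# Discover namespaces, then load a table; the response includes the
# location of the table's current metadata.json
print(requests.get(f"{BASE}/namespaces").json())
load_table = requests.get(f"{BASE}/namespaces/sales/tables/orders").json()
print(load_table["metadata-location"])
```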

Read more about the Iceberg REST Spec in this article.

Soft Deletes vs. Hard Deletes

When working with table formats like Apache Iceberg, it’s important to understand how data deletion is handled. Unlike traditional databases, where deleted data is immediately removed from the storage layer, Iceberg follows a different approach to maintain snapshot isolation and enable features like time travel.

When you execute a delete query, the data is not physically deleted. Instead:

- A new snapshot of the table is created to reflect the change.
- The affected records are marked as deleted, either through delete files (merge-on-read) or by rewriting the data files that contained them (copy-on-write).
- The previous snapshots, and the data files they reference, remain in storage.

This approach allows users to query previous versions of the table using time travel, providing a powerful mechanism for auditing, debugging, and historical analysis.
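As a rough illustration, the PySpark sketch below deletes a row and then reads the table as of the previous snapshot; it assumes the same illustrative lakehouse catalog and table as the earlier examples.

```python
# Minimal sketch: a delete commits a new snapshot instead of erasing data,
# so earlier snapshots remain queryable via time travel. Assumes a Spark
# session with an Iceberg catalog named "lakehouse"; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Capture the current snapshot id before the delete
before = spark.sql("""
    SELECT snapshot_id
    FROM lakehouse.sales.orders.snapshots
    ORDER BY committed_at DESC
    LIMIT 1
""").first()[0]

# The delete commits a new snapshot; prior data files are left in place
spark.sql("DELETE FROM lakehouse.sales.orders WHERE order_id = 1")

# Time travel back to the pre-delete snapshot
spark.sql(f"SELECT * FROM lakehouse.sales.orders VERSION AS OF {before}").show()
```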

However, this also means that data marked for deletion continues to occupy storage until it is physically removed. To address this, snapshot expiration procedures are performed during table maintenance using tools like Spark or Dremio. These procedures:

- Expire snapshots that fall outside a configured retention window.
- Remove the metadata files associated with those expired snapshots.
- Physically delete any data files that are no longer referenced by a remaining snapshot.
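In Spark, this maintenance is typically run through Iceberg’s stored procedures. The sketch below is illustrative: the catalog name, table, retention cutoff, and snapshot count are assumptions to adapt to your own retention policy.

```python
# Minimal sketch: routine snapshot expiration with Iceberg's Spark procedures.
# The catalog name, table, cutoff timestamp, and retained count are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expire old snapshots; data files no longer referenced by any remaining
# snapshot are physically deleted as part of this procedure
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2025-01-01 00:00:00',
        retain_last => 5
    )
""")

# Optionally clean up files in the table location that no snapshot references
spark.sql("CALL lakehouse.system.remove_orphan_files(table => 'sales.orders')")
```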

Regular maintenance is a critical part of managing Iceberg tables to ensure storage efficiency and maintain optimal performance while leveraging the benefits of its snapshot-based architecture.

Optimizing Iceberg Data

Minimizing Storage

The first step in reducing storage costs is selecting the right compression algorithm for your data. Compression not only reduces the amount of space required to store data but can also improve performance by accelerating data transfer across networks. These compression settings can typically be adjusted at both the table and query engine levels to suit your specific use case.
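For example, the Parquet codec can be set as an Iceberg table property so every engine writing to the table honors it. The sketch below shows one way to do this from Spark; the codec and level are illustrative choices rather than recommendations.

```python
# Minimal sketch: setting the Parquet compression codec as a table property.
# Assumes a Spark session with an Iceberg catalog named "lakehouse"; the
# codec and compression level are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# zstd generally balances compression ratio and speed; gzip compresses
# harder, while snappy favors write speed
spark.sql("""
    ALTER TABLE lakehouse.sales.orders SET TBLPROPERTIES (
        'write.parquet.compression-codec' = 'zstd',
        'write.parquet.compression-level' = '3'
    )
""")
```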

Improving Performance

Optimizing performance largely depends on how data is distributed across files. This can be achieved through regular maintenance procedures using tools like Spark or Dremio. These optimizations result in two key outcomes:

- Fewer, right-sized files, so query engines spend less time opening and planning across many small files.
- Data that is sorted or clustered on commonly filtered columns, which makes file-level pruning based on column statistics far more effective.
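In Spark, both outcomes are commonly achieved with the rewrite_data_files procedure, sketched below against the same illustrative catalog and table; the target file size and sort column are assumptions.

```python
# Minimal sketch: compacting small files and clustering data with Iceberg's
# rewrite_data_files procedure. Catalog, table, target file size, and sort
# column are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bin-pack small files into fewer, right-sized files
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Or rewrite with a sort order so related rows land in the same files,
# improving statistics-based file pruning
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'sales.orders',
        strategy => 'sort',
        sort_order => 'order_date ASC'
    )
""")
```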

By leveraging these strategies, Iceberg users can maintain a balance between efficient storage and fast query performance, ensuring their data lakehouse operates at peak efficiency. Regular maintenance is essential for reaping the full benefits of these optimizations.

Read this article for more detail on optimizing Apache Iceberg tables.

Hands-on Tutorials