Skip to content

Apache Iceberg Schema Evolution and Hidden Partitioning: Under the Hood

In a production data lakehouse, change is the only constant. Business requirements evolve, application schemas shift, and partition strategies that were optimal at gigabyte scales become bottlenecks when data grows to petabytes. Managing these changes in traditional Hadoop Hive-style architectures has historically been a major source of operational friction. Schema changes often led to silent data corruption or required expensive, multi-day table rewrites.

Distributed data architectures decouple compute from storage, allowing different query engines like Dremio, Spark, and Trino to read and write to the same underlying files. This decoupling increases the difficulty of schema coordination. If one engine updates the schema of a table while another engine continues writing files using an older structure, the mismatch can disrupt downstream dashboards and pipelines. Therefore, a robust table format must provide transactionally safe, metadata-driven schema evolution that operates independently of the query engine or file format.

Apache Iceberg solves these problems by separating logical table layouts from physical file structures. Two of Iceberg's core features, schema evolution and partition evolution, operate entirely as metadata updates. This guide explores the architectural mechanics of these features, explaining how unique column IDs prevent data corruption, how hidden partitioning optimizes query planning, and how Dremio accelerates queries against evolved schemas and partitions.

The Legacy Problem: Hive-Style Schema Evolution

To understand why Iceberg's design is revolutionary, we must first examine how legacy Hive tables manage schemas. Hive structures tables as directories on a file system, such as HDFS or cloud object stores like AWS S3, using folders to represent partition values (for example, year=2026/month=05/). Within these directories, data is stored in files like Parquet or ORC.

Hive does not store column mappings in the data files themselves. Instead, it maintains a schema definition in a centralized database (the Hive Metastore) that maps column names to index positions (for example, the first column is customer_id, the second is name, and the third is email). When a query engine reads a Hive table, it asks the Metastore for the schema, then reads the files and maps the data columns based on their physical index positions.

This name-by-position mapping is fragile. If a data engineer renames a column, drops a column, or reorders fields, the Metastore schema changes, but the physical data files remain unmodified. When the query engine reads older data files using the updated schema, columns align incorrectly. For example, if a column is dropped, the engine reads the next physical column in the file into the slot of the dropped column. This causes silent data corruption or query failures because the engine attempts to parse data with mismatching types.

Let us trace a concrete example of this corruption. Consider a Hive table that contains three columns: order_id (index 0), customer_id (index 1), and billing_amount (index 2). If a data engineer drops the customer_id column, the Hive Metastore updates the schema. The new schema definition maps order_id to index 0 and billing_amount to index 1. However, the physical files written before this change still contain three columns. When a query engine scans these older files, it reads the second physical column (which contains customer_id values) and maps it to the billing_amount column. Since customer_id contains identifiers that do not match the expected decimal values of billing_amount, queries fail with type conversion errors or, worse, populate the billing field with incorrect numeric data.

The situation becomes worse when columns are dropped and then re-added with the same name but different types. Hive maps the physical column in old files to the new column definition based on name match, leading to errors. For example, if an engineer drops a column named status (which was a string) and later adds a new column named status (which is an integer representing a lookup code), query engines attempting to read historical files will fail. They will try to read the historical string data as integers, resulting in parser crashes. Data engineers running Hive architectures are often forced to write cleanup scripts, create copy tables, and execute massive batch jobs to rewrite directories and clean up index positions, resulting in significant resource usage and operational overhead.

The Foundation: Column ID Mapping

Apache Iceberg eliminates position-based column mapping by assigning a unique, immutable integer ID (field ID) to every column in a table's schema. This mapping is recorded in the table's JSON metadata files.

graph TD subgraph SCHEMA["Table Schema in metadata.json"] C1["Field ID 1: order_id (BIGINT)"] C2["Field ID 2: cust_id (BIGINT) (renamed from customer_id)"] C3["Field ID 3: total (DOUBLE)"] C4["Field ID 4: region (STRING) (added later)"] C5["Field ID 5: status (STRING)"] end subgraph FILE1["data-00001.parquet (written before rename + add)"] F1["field_id=1: order values"] F2["field_id=2: customer values"] F3["field_id=3: total values"] F5["field_id=5: status values"] NOTE["field_id=4 missing (reader returns NULL)"] end SCHEMA --> FILE1

When an Iceberg table is created, the catalog assigns field ID 1 to the first column, field ID 2 to the second, and so on. The catalog tracks the highest assigned ID in a metadata field called last-column-id. When a compute engine writes a data file (such as a Parquet file) to the table, Iceberg writes these field IDs directly into the Parquet file's metadata blocks.

When reading the table, the query engine does not rely on column names or physical positions to resolve columns. Instead, it reads the field IDs from the Parquet metadata and aligns them with the field IDs recorded in the table's active schema in the JSON metadata. This design enables safe, metadata-only schema evolution. Renaming a column does not break the layout because the field ID remains unchanged. Adding a column assigns a new field ID, and readers automatically return null values for this field when scanning files written before the column was added.

The Schema History Array in Metadata JSON

Every Apache Iceberg table maintains its state inside a series of JSON metadata files. Within these JSON files, Iceberg records a historical array of every schema configuration the table has ever utilized. Each entry in the schemas list contains a unique schema-id, a root type (typically a struct), and a complete array of nested field definitions.

Each field in the schema definition is represented as a structured object containing several key metadata elements: an id (the immutable integer field ID), a name (the logical column name used in SQL queries), a type (the physical or logical data type), and a required property (a boolean value defining whether the column supports null values).

To trace which schema represents the current state of the table, the metadata file includes a root-level current-schema-id property. When an engineer executes an alter table statement, such as renaming or adding a column, Iceberg does not edit or replace existing schema versions. Instead, it creates a new schema representation, increments the schema ID, appends the new schema to the historical array, and updates the current-schema-id pointer.

This immutable schema logging model is essential for time-travel queries. When a user requests data from a past snapshot (for instance, querying a state from a week ago), the query engine retrieves the metadata associated with that historical snapshot, extracts the active schema ID at that specific point in time, and uses that historical schema to resolve columns during the file scan. This prevents errors that would otherwise arise if old files were read with the current schema.

How Field IDs are Embedded in Physical Storage

The integration of field IDs is handled at the file format writer level. For Parquet files, Iceberg maps columns by writing the field IDs to the Parquet schema metadata. Inside the file's footer, Parquet stores the schema definition where each node contains name and type details. Iceberg adds a custom metadata key named org.apache.iceberg.field-id to each schema element, recording the integer field ID.

When an Iceberg reader initializes a file scan, it bypasses the physical column names present in the Parquet file. It extracts the org.apache.iceberg.field-id values, compares them to the active schema IDs defined in the table's JSON metadata, and constructs a projection map. If a physical column has ID 2 but is named customer_id in the file and cust_id in the table metadata, the reader maps them without issue, resolving the rename on the fly.

For other file formats, Iceberg implements similar mappings. In Apache ORC, field IDs are written as attributes within the nested type description tree. In Apache Avro, Iceberg inserts field ID properties directly into the JSON schema record definition. This format-agnostic field mapping ensures that multi-engine architectures can read and write to Iceberg tables using different file configurations while preserving schema consistency.

Managing Nested Structures and Complex Types

Modern schemas often include nested structures like structs, maps, and lists. Iceberg handles nested schema evolution by recursively assigning unique field IDs to all child elements.

For example, in a struct column named shipping_address, the struct itself receives a field ID, and each child field (such as street, city, and zip_code) receives its own unique field ID. If a child field is added to the struct later, it receives a new ID, and older rows resolve this child field as NULL.

Let us trace this behavior with a concrete example. Suppose we define a column named user_profile as a struct with field ID 5. Inside this struct, we define first_name as field ID 6, and last_name as field ID 7. If the application evolves and we add a nested field named postal_code, the catalog assigns field ID 8 to this new field. When scanning older Parquet files that contain only data for field IDs 6 and 7, the Iceberg reader observes that field ID 8 is missing from the physical file. Instead of failing or misaligning columns, the reader fills the postal_code attribute with a null value for every row read from those historical files.

In map types, Iceberg assigns distinct field IDs to both the map key and the map value. This means a map of integer keys to string values maps the key to a specific field ID and the value to another field ID. For lists, the list element receives a unique field ID. This detailed tracking ensures that nested fields can be renamed, reordered, or promoted using the same rules as top-level columns. It allows engineers to alter deep object hierarchies without risking structural corruption or requiring complete table rebuilds.

How Iceberg Processes Schema Mutations

Because Iceberg aligns fields using unique IDs, common schema operations require zero physical data modifications. The following table describes what occurs in metadata and how readers resolve data for each type of schema change.

Operation Data Files Rewritten? Under-the-Hood Metadata and Reader Behavior
Add Column No The catalog assigns a new unique field ID to the column and appends it to the schema in the JSON metadata. When reading old data files (which do not contain this field ID), the reader automatically returns NULL values for the column. When writing new data files, the writer includes the new field ID and its written values.
Rename Column No The catalog updates the name mapping for the target field ID in the JSON metadata. The physical Parquet files still store data using the immutable field ID. The reader maps the new name to the existing ID during query execution, guaranteeing that old data remains readable.
Reorder Columns No The schema definition in the JSON metadata is updated to specify the new display order of columns. The physical data layout is unaffected because the reader uses the schema's field ID ordering to project columns, sorting them in memory before returning results to the query.
Drop Column No The field ID is removed from the active schema in the JSON metadata, marking the ID as retired. While the physical bytes remain in older Parquet files, the reader filters out the dropped field ID during execution, ignoring the column's contents. New files are written without the dropped column.

Type Promotion Rules and Constraints

In addition to layout changes, schemas often require type modifications to accommodate larger values. Iceberg supports type promotion as a metadata-only operation, provided the change is safe and does not lose precision.

Iceberg allows the following type promotions:

If an engineer attempts to promote a type outside these rules, such as changing a string to an integer, Iceberg rejects the operation to prevent runtime parse failures. When a valid promotion occurs, the reader handles the widening dynamically. For example, when reading an older file containing 32-bit integers for a column that was promoted to long, the reader widens the values to 64-bit in memory during deserialization.

Deep Dive into Safe and Unsafe Type Mutations

The restriction of type promotion to widening operations is a design requirement to protect data integrity. Widening operations are mathematically safe because every value in the source domain is representable in the target domain without loss of precision or scale.

For example, promoting an int to a long is safe because 64-bit integers represent a superset of 32-bit integers. If a query reader reads a 32-bit integer, it can store it directly in a 64-bit memory register by padding the upper bits. This promotion can occur at the register level using CPU sign-extension or zero-extension instructions, which completes in a single CPU cycle. In memory formats like Apache Arrow, this conversion allocates a new 64-bit vector and copies the widened values vectorially.

Widening a 32-bit float to a 64-bit double is similarly safe. The IEEE 754 standard for floating-point arithmetic defines single-precision floats with 1 sign bit, 8 exponent bits, and 23 fraction bits. Double-precision floats expand this representation to 1 sign bit, 11 exponent bits, and 52 fraction bits. Moving from single to double precision requires moving the exponent and fraction bits without dropping precision, which represents an exact conversion.

For decimal types, promoting decimal(P, S) to decimal(P', S) operates safely because the scale (S), which represents the number of digits to the right of the decimal point, remains identical. Only the precision (P'), which defines the total count of allowable digits, is widened. Since the decimal scale is unchanged, the underlying unscaled integer value of each decimal representation requires only widening to fit a larger register, preserving the exact position of the decimal point.

Conversely, narrowing operations are unsafe. Promoting a long to an int would result in data loss or overflow errors if a physical file contained values exceeding the 32-bit signed integer boundary.

Similarly, altering a double to a float is blocked because it drops precision bits, which would return incorrect floating-point approximations. Changing a column from string to any numeric type is disallowed at the metadata layer. While some strings can be parsed into integers, other strings may contain non-numeric characters, which would cause query planning or execution failures during runtime deserialization.

When an unsupported type conversion is necessary, data platforms must execute a migration. The standard process involves:

  1. Adding a new column to the table with the target type (for example, temp_customer_id as bigint).
  2. Running an update query to copy and cast data from the old column to the new column (for example, casting customer_id from string to bigint).
  3. Verifying data accuracy and completeness.
  4. Dropping the old column from the schema.
  5. Renaming the new column to the original column name.

While this requires writing new data files, it ensures that all values are validated and prevents runtime failures for downstream applications.

Spark SQL and PySpark Implementation Examples

Let us review how to configure and execute schema evolution using Apache Spark SQL. In these examples, we manipulate the analytics.orders and analytics.customers tables to illustrate different schema operations.

Table Creation and Schema Setup

First, we create our standard e-commerce schemas using Iceberg.

/* Create the analytics.orders table */
CREATE TABLE local.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status STRING
) USING iceberg;

/* Create the analytics.customers table */
CREATE TABLE local.analytics.customers (
    customer_id BIGINT,
    name STRING,
    email STRING,
    country STRING
) USING iceberg;

Executing Alterations: Altering analytics.orders

We can execute multiple schema changes on the analytics.orders table without rebuilding the table.

/* Add a new column to track the acquisition channel */
ALTER TABLE local.analytics.orders 
ADD COLUMN sales_channel STRING;

/* Rename the status column to order_status */
ALTER TABLE local.analytics.orders 
RENAME COLUMN status TO order_status;

/* Reorder the sales_channel column to sit after customer_id */
ALTER TABLE local.analytics.orders 
ALTER COLUMN sales_channel AFTER customer_id;

/* Drop the order_status column from the active schema */
ALTER TABLE local.analytics.orders 
DROP COLUMN order_status;

Schema Merging via Spark Writes

When writing data using Spark dataframes, you can enable automatic schema merging. This configuration permits Spark to append new columns to the Iceberg table schema if they are present in the incoming dataframe.

# PySpark configuration to write data with schema merge enabled
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergSchemaMerge") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse") \
    .getOrCreate()

# Create a DataFrame with a new column 'customer_segment' not present in analytics.customers
new_data = [
    (1045, "Alice Smith", "alice@example.com", "USA", "VIP"),
    (1046, "Bob Jones", "bob@example.com", "Canada", "Standard")
]
columns = ["customer_id", "name", "email", "country", "customer_segment"]
df = spark.createDataFrame(new_data, columns)

# Write the dataframe, enabling schema merge
df.write \
    .format("iceberg") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("local.analytics.customers")

Hidden Partitioning: Eliminating Query-Time Errors

In legacy Hive tables, partitioning is physical. The table directory is split into subdirectories based on partition values, such as /orders/year=2026/month=05/. This structure introduces two critical flaws.

First, query planning requires recursively listing directories to find data files. On cloud object storage, directory listings are slow, metadata-heavy operations that introduce significant latency and can cause API rate-limiting or throttling when scanning thousands of directories.

Second, partition filtering is manual. If the table is partitioned by a derived column like month, the query writer must explicitly include that column in the query's WHERE clause, such as WHERE order_date >= '2026-05-15' AND month = '2026-05'. If the user omits the partition filter and queries only the date, the engine scans the entire table, leading to slow queries and massive cloud costs.

The Iceberg Solution: Logical Partition Transforms

Iceberg eliminates physical directory-based partitioning through a concept called Hidden Partitioning. In Iceberg, you define partitioning on real columns using partition transforms.

graph LR subgraph USER["What the user writes"] Q["SELECT * FROM orders WHERE order_date >= '2026-05-01'"] end subgraph ICEBERG["What Iceberg does internally"] PRUNE["Evaluate manifest partition summaries: only open manifests where max(months(order_date)) >= 2026-05"] SKIP["Skip manifests and files outside the range"] READ["Read only matching files"] end USER --> ICEBERG

When data is written, Iceberg reads the source column value, applies the partition transform, such as extracting the month from a date, and stores the resulting partition value in the manifest file metadata alongside the data file path. The physical layout of files is managed dynamically, and query planning uses metadata summaries to prune files.

Because partitioning is managed in metadata, query authors do not need to know how the table is partitioned. They write standard queries filtering on the source column, such as WHERE order_date >= '2026-05-01'. During query planning, Iceberg automatically applies the partition transform to the filter predicate and prunes manifests and files that do not match the partition criteria.

Under the Hood: Query Pruning Mechanics

To understand how Iceberg translates query filters into file pruning without physical directory indexes, we can trace the execution of a query through the planning phase.

  1. Query Parsing: The user submits a query to the compute engine, such as SELECT * FROM analytics.orders WHERE order_date >= '2026-05-15'. The engine's query parser parses the SQL and generates an AST (Abstract Syntax Tree), isolating the filter predicate on order_date.
  2. Transform Application: The query planner inspects the table's partition specification. It discovers that the table is partitioned using the month(order_date) transform. The planner applies this transform function to the query predicate value, translating '2026-05-15' into the partition key '2026-05'.
  3. Manifest List Evaluation: The planner reads the manifest list file associated with the table's active snapshot. For each manifest file entry, the manifest list contains partition summary statistics, specifically the minimum and maximum values of the partition keys present within that manifest. The planner evaluates the transformed partition filter against these min/max summaries. If a manifest's range is entirely outside the filter boundary, such as a manifest containing data only from 2026-01 through 2026-04, the planner prunes the manifest, skipping it entirely.
  4. Manifest File Evaluation: For the manifest files that pass the manifest list filter, the planner reads the individual manifest entries. Each entry tracks a single data file and records its partition values. The planner matches the filter against these partition values. If a file's partition does not match, it is skipped.
  5. File Scanning: The planner generates a final scan list containing only the physical Parquet file paths that match the partition criteria. The compute engine scans only these files, avoiding directory listings and minimizing cloud storage read requests.

Supported Partition Transforms

Iceberg provides several built-in partition transforms. These transforms generate partition values at write time and enable automatic pruning at query time.

Transform Name Source Column Type Example Mapping Resulting Partition Value
identity(col) Any Type identity(country) The raw column value (such as "USA")
year(col) Date, Timestamp year(order_date) The year value (such as 2026)
month(col) Date, Timestamp month(order_date) The year and month (such as 2026-05)
day(col) Date, Timestamp day(order_date) The specific day (such as 2026-05-22)
hour(col) Timestamp hour(event_ts) The day and hour (such as 2026-05-22-09)
bucket(col, N) Int, Long, String, etc. bucket(customer_id, 16) A hash value from 0 to 15, grouping values into N buckets
truncate(col, W) String, Decimal truncate(name, 1) The value truncated to width W (such as "A" for "Alice")

Comparing Hive Partitioning and Iceberg Hidden Partitioning

The differences between physical directory-based partitioning and logical metadata-based partitioning are significant. The table below compares the two approaches across key architectural dimensions.

Feature / Dimension Legacy Hive Partitioning Apache Iceberg Hidden Partitioning
Physical File Structure Strict directory hierarchies. Files are nested within folders named by partition keys. Logical. Files are stored in directories, but location does not define partition membership.
Query Predicates Users must filter explicitly by partition columns. Omitting partition filters triggers full scans. Users filter on real data columns. Iceberg translates filters to prune partitions automatically.
Write Execution Writers must compute partition directories and structure output paths manually. Writers write rows directly. The transform engine calculates partition values and records them.
Schema Coupling Partition columns must exist as physical fields in data files or be appended as virtual columns. Partition transforms are decoupled from physical schemas, operating as logical transformations.
Query Planning Latency High. Requires recursive directory listings across cloud object storage to discover files. Low. planning uses O(1) manifest reading, avoiding directory listings.

Partition Evolution: Layout Changes Without Migrations

As tables grow, partition requirements change. For example, a table that was originally partitioned by month(order_date) may require partitioning by day(order_date) as write volume increases.

In legacy Hive environments, changing the partition scheme requires creating a new table with the new partition layout, writing a migration job to read all historical data and write it to the new table, and updating all downstream applications to point to the new location. This migration introduces downtime and requires substantial compute resources.

Apache Iceberg supports Partition Evolution. Because Iceberg resolves partitions using metadata, you can alter a table's partition scheme at any time. This operation is metadata-only and requires no data rewrites.

graph TD A["Old files: partitioned by months(order_date), manifest records spec_id=0"] B["New files: partitioned by days(order_date), manifest records spec_id=1"] C["Query planner: reads both manifest sets, applies correct pruning for each spec_id"] A --> C B --> C

How Partition Evolution Works Under the Hood

When a partition scheme is updated, Iceberg creates a new Partition Spec inside the table's JSON metadata. Each partition spec is assigned a unique, sequential partition spec ID, such as spec ID 0 for the original layout, and spec ID 1 for the new layout. The metadata tracks these specs in the partition-specs array and points to the active layout using the default-spec-id property.

Existing data files written before the partition change remain in their current physical locations and are tracked by manifest files using partition spec ID 0. New writes are automatically organized according to the new partition spec and are tracked by manifest files using partition spec ID 1.

When a query is executed, the query planner checks the partition spec ID associated with each manifest. It evaluates the query filters against the partition values using the matching partition spec definition for that ID. This allows the planner to prune files written under spec 0 and spec 1 independently, producing a unified, pruned scan plan.

Multi-Spec Partition Planning: A Detailed Timeline Walkthrough

To visualize how Iceberg executes queries across evolved partition specifications, let us trace a table's history as its partition strategy evolves over three stages.

Runnable DDL Example: Partition Evolution

Below is the Spark SQL code to evolve the partition scheme of the analytics.orders table.

/* First, create the table partitioned by month */
CREATE TABLE local.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status STRING
) USING iceberg
PARTITIONED BY (month(order_date));

/* Insert some initial data (written under partition spec 0) */
INSERT INTO local.analytics.orders VALUES
(1, 101, CAST('2026-04-15' AS DATE), 150.00, 'Shipped'),
(2, 102, CAST('2026-04-20' AS DATE), 200.00, 'Processing');

/* Evolve the partition scheme: replace monthly partitioning with daily */
ALTER TABLE local.analytics.orders
REPLACE PARTITION FIELD month(order_date)
WITH day(order_date);

/* Insert new data (written under partition spec 1) */
INSERT INTO local.analytics.orders VALUES
(3, 103, CAST('2026-05-22' AS DATE), 75.50, 'Processing'),
(4, 104, CAST('2026-05-22' AS DATE), 120.00, 'Completed');

After executing these commands, the table contains data files with different partition layouts. The query engine reads both sets of files seamlessly, using the appropriate metadata mapping for each file.

Dremio Query Acceleration for Evolved Tables

While Iceberg's metadata architecture makes schema and partition evolution possible, executing queries efficiently against tables with mixed partition specs and evolved schemas requires query engine integration. Dremio is designed specifically to optimize these scenarios, delivering sub-second BI query latency over evolved Iceberg tables.

Vectorized Arrow Projector for Evolved Schemas

When a query scans a Parquet data file, the engine must project the physical Parquet columns into memory buffer allocations. If the table schema has evolved, different files may contain different columns or data types.

Dremio's Sabot execution engine compiles vectorized execution plans that read Parquet data directly into in-memory Apache Arrow record batches. During this scan, Dremio implements a Vectorized Arrow Projector. Apache Arrow represents columnar data in memory as contiguous arrays of values, validity (null) buffers, and offset buffers. The Sabot engine executes processing loops directly over these contiguous memory buffers, maximizing CPU instruction cache hits and using SIMD instructions.

If a query requests a column that was added after a physical file was written, Dremio does not execute complex conditional row checks. Instead, Dremio's vectorized reader allocates a null vector buffer in Arrow memory and projects it directly alongside the active data vectors. This vectorized allocation bypasses row-by-row branch evaluations, preserving execution performance.

Similarly, if a column type was promoted, such as from integer to long, Dremio's reader performs a vectorized type promotion in the CPU registers as the data is loaded into memory, avoiding serialization overhead and ensuring zero performance penalty for schema changes.

In-Memory Processing of Dropped Fields

When a column is dropped from an Iceberg table, the physical data remains inside older Parquet files to prevent expensive rewrite operations. When executing queries against these older files, Dremio's vectorized reader uses the metadata schema map to identify dropped field IDs.

Instead of allocating memory and loading the data vectors for these dropped fields, Dremio's reader skips reading those column blocks from disk. During record batch assembly, the projector projects only the active columns into Arrow memory, ensuring that dropped fields consume zero memory or network IO during execution.

Metadata Caching and Pruning Across Multiple Partition Specs

When a table contains evolved partitions, the query planner must apply different pruning math depending on the partition spec ID of each manifest. If the planner must scan cloud object storage to resolve these spec definitions and manifest list boundaries, query planning times degrade.

Dremio accelerates this planning phase using its Coordinator Metadata Cache. Dremio caches the Iceberg table's JSON metadata, partition spec definitions, and manifest lists locally on its coordinator nodes. This caching eliminates the need for query planning steps to execute remote HTTP requests to cloud object storage (such as AWS S3 or Google Cloud Storage) to read table state.

During query planning, Dremio's Cost-Based Optimizer reads this local cache and uses Apache Calcite to parse the query predicates. It applies partition pruning formulas to both the monthly and daily partition specs in parallel. This local metadata resolution reduces query planning times to milliseconds, ensuring that BI dashboards experience sub-second response times even when querying tables with highly complex partition evolution histories.

Positional Delete Caching for Merge-on-Read Optimization

When executing updates and deletes on Iceberg tables using Merge-on-Read (MoR), the table writes positional delete files that track which rows in specific data files have been modified. When query engines read these tables, they must merge the data files with the delete files on the fly, which traditionally introduces significant CPU and memory overhead.

Dremio optimizes this process by caching positional delete vectors directly in memory. When a query scans a partition containing deletes, Dremio's executor loads the corresponding delete file, decodes the row offsets into an in-memory bitmap, and caches this bitmap on executor nodes. As the data reader scans Parquet column chunks, it references the cached delete bitmap to skip deleted rows during the vectorized Arrow buffer projection. This caching mechanism prevents redundant reads of delete files across concurrent queries and minimizes memory churn, accelerating analytical queries on mutable transactional tables.

Reflections and Evolved Tables

Dremio's Data Reflections provide pre-computed materializations (stored as Iceberg tables) that Dremio queries automatically to accelerate analytical workloads. When a reflection is created on an Iceberg table that subsequently undergoes schema or partition evolution, Dremio's compiler manages the relationship automatically.

If a schema alteration occurs, Dremio determines whether the reflection can still satisfy incoming queries, such as if the query does not require a newly added column. If the partition scheme evolves, Dremio leverages Apache Calcite to rewrite execution paths, ensuring that queries mapped to the reflection resolve correctly. The reflection is updated incrementally according to the configured refresh schedule, aligning it with the table's updated metadata spec without requiring manual intervention from database administrators.

Deep Dive: Field ID Mapping for Complex and Nested Types

While column ID tracking is straightforward for flat structures, managing nested data types like structs, arrays, and maps requires a recursive approach. Apache Iceberg solves this by treating every nested field within a container type as a first-class schema element, assigning each its own unique integer ID. For example, if a table has a struct column named shipping_address with field ID 10, the nested fields within it (such as street, city, and zip_code) might receive IDs 11, 12, and 13 respectively.

If a data engineer subsequently adds a new nested column named state_province to the shipping_address struct, the catalog assigns it the next available ID, such as 14. This assignment occurs at the catalog level, which ensures that all query engines agree on the structural layout. When a query scans Parquet files written prior to the addition of state_province, the reader detects that field ID 14 is absent from the file schema. The reader then dynamically projects a stream of null values for that column during query execution.

For map structures, Iceberg assigns two separate field IDs: one for the map key and one for the map value. Similarly, list structures have a specific field ID assigned to their elements. This design ensures that if a list containing structs is evolved, each nested attribute inside the list is mapped by its unique ID, rather than its relative index position. Consequently, nested columns can be renamed, reordered, or promoted using the same rules as top-level fields. The schema metadata JSON keeps a complete log of these transitions, which allows historic snapshots of nested structures to remain accessible for time-travel queries.

Arrow Widening and Vectorized Projection Mechanics

At the hardware level, data type promotion requires modifying the physical representation of values in memory. When a query engine reads a Parquet file containing 32-bit integers but the table schema has evolved the column to a 64-bit long integer, the engine must perform a type promotion. Dremio accelerates this operation by performing type widening directly inside its vectorized Arrow execution memory.

During a scan, Dremio's Sabot execution engine reads columnar data chunks from disk into memory-aligned Apache Arrow record batches. In Apache Arrow, columns are represented as contiguous vectors in memory. If a type promotion is required, Dremio allocates a 64-bit target vector for the long integer values. As it deserializes the 32-bit integers from the Parquet page, Dremio applies CPU-level sign-extension instructions to widen the values vectorially. By executing this operation in the CPU registers before assembling the Arrow record batch, Dremio avoids row-by-row casting loops, keeping memory copy operations efficient.

This vectorized widening is also applied to floating-point numbers, such as promoting 32-bit floats to 64-bit doubles. When promoting decimals, Dremio widens the precision by padding the underlying unscaled integer representations while preserving the scale factor. Because the scale remains constant, the decimal widening is an exact arithmetic operation, which prevents rounding errors or precision loss during query execution.

Evolved Tables in Multi-Engine Environments

A key advantage of Apache Iceberg is its ability to support concurrent access from multiple query engines like Spark, Dremio, Flink, and Trino. However, schema evolution in a multi-engine environment requires strict transaction coordination. Iceberg manages this coordination through catalog transactions, which enforce optimistic concurrency control (OCC).

When a write engine like Apache Spark adds a column to an Iceberg table, it attempts to commit the change by writing a new metadata JSON file and updating the catalog pointer. If another engine tries to write data using the older schema at the same time, the catalog detects the conflict. Because Iceberg's metadata files are versioned, the writing engine reads the new schema definition, aligns its write task with the new column layout, and retries the commit. This design prevents schema mismatches from corrupting the table state.

Furthermore, engines like Dremio use metadata caching to avoid reading JSON metadata files from object storage for every query. When a schema change occurs, Dremio detects the updated catalog pointer and invalidates its coordinator metadata cache. This caching model ensures that queries always resolve against the latest schema version, preventing situations where query engines read stale layouts and return outdated fields.

Schema Evolution Best Practices and Design Guidelines

To ensure optimal performance and avoid operational issues, data engineers should follow key design guidelines when evolving Iceberg tables:

Conclusion

Apache Iceberg's metadata-first architecture transforms schema evolution and partitioning from high-risk operations into simple, routine metadata updates. By tracking columns using unique, immutable field IDs rather than names, Iceberg prevents silent data corruption and enables safe alterations, renames, and drops. Through hidden partitioning and partition evolution, Iceberg ensures query optimization is handled automatically by the metadata layer, eliminating query-time errors and costly migrations.

When paired with high-performance execution engines like Dremio, which accelerates reads using vectorized Arrow projection and cached metadata pruning, evolved tables deliver the same sub-second BI latencies as static, newly built tables. This combination allows data platforms to evolve as business requirements change, without interrupting active analytics or requiring expensive engineering maintenance.

Go Deeper

๐Ÿ“š Go Deeper on Apache Iceberg

Alex Merced has authored three hands-on books covering Apache Iceberg, the Agentic Lakehouse, and modern data architecture. Pick up a copy to master the full ecosystem.