If you are evaluating modern data platforms, you have probably heard people talk about Apache Iceberg. It is often described as a “table format for the data lake” or “the foundation of an open lakehouse”. Those phrases sound good, but they do not always make it clear what Iceberg actually does, or why so many data teams are moving towards it.
This article explains what Apache Iceberg is, how it organises your data, and why it is becoming a key building block for open lakehouse architectures. We also cover the operational challenges that appear once you move into production—issues most vendor demonstrations skip over entirely.
This is Part 1 of a two-part series. Part 2 looks at how to choose the best Iceberg platform and how to turn open lakehouse architecture into a strategic priority.
Apache Iceberg is an open table format for analytical data stored in cloud object storage such as Amazon S3, Azure Data Lake Storage or Google Cloud Storage. It brings data warehouse-style guarantees – such as ACID transactions, schema evolution and time travel – to the flexibility and low cost of a cloud data lake.
At a high level, Iceberg allows you to:

- treat a collection of files in object storage as a single, reliable table with ACID guarantees
- evolve a table's schema and partitioning without rewriting historic data
- query earlier versions of a table through snapshots (time travel)
- read and write the same table from multiple query engines
Where a traditional data warehouse is tightly coupled to a single engine, Iceberg is deliberately engine-agnostic. The same Apache Iceberg table can be queried by Spark, Snowflake, Trino, Flink, Athena, Qlik and others. That engine neutrality is one of the main reasons Iceberg sits at the centre of many open lakehouse strategies.
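To make this concrete, here is a minimal PySpark sketch of registering an Iceberg catalog with Spark and creating a table. The catalog name (`lake`), warehouse path, catalog type and the `sales.orders` table are all placeholders; the right values depend entirely on your environment.

```python
# A minimal sketch: configure a Spark session with an Iceberg catalog and
# create a partitioned Iceberg table. All names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-intro")
    # Iceberg's Spark runtime and SQL extensions must be on the classpath.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Create an Iceberg table partitioned by day; any engine that can reach the
# same catalog and storage can query it afterwards.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id  INT,
        customer  STRING,
        amount    DECIMAL(10, 2),
        order_ts  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")
```

The later examples in this article reuse this hypothetical `lake.sales.orders` table.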
To understand why Iceberg is powerful, it helps to look at how it structures metadata. Rather than pointing directly at a folder of Parquet files, Iceberg uses several layers of metadata to keep track of what belongs to your table and how it has changed over time.
At the bottom of the hierarchy are the data files: the Parquet, ORC or Avro files that hold the rows of your table. Iceberg does not replace your files; it manages and organises them so they behave like a single logical table.
Manifest files are like catalogue cards for your data. Each manifest records:

- the data files it tracks and where they live in object storage
- the partition values each file belongs to
- column-level statistics such as minimum and maximum values, null counts and record counts
These statistics allow query engines to skip entire files that are irrelevant to a query, which is essential for performance at scale.
A manifest list is the index of all manifests for a given snapshot of the table. It tells Iceberg which manifests together describe the current state of the table.
Each Iceberg table has a metadata file that acts as the entry point. It stores:

- the table's current schema and partition specification
- the history of snapshots and a pointer to the current one
- references to the manifest lists that describe each snapshot
This layered approach – data files, manifests, manifest lists and the metadata file – is what enables Iceberg’s advanced capabilities such as time travel, safe concurrent writes and consistent reads across engines.
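You can see these layers for yourself, because Iceberg exposes them as queryable metadata tables. A short sketch, assuming a Spark session already configured with the hypothetical `lake` catalog from the earlier example:

```python
# Inspect Iceberg's metadata layers via its metadata tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Snapshots: one row per commit, the basis for time travel.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.sales.orders.snapshots
""").show(truncate=False)

# Manifests referenced by the current snapshot.
spark.sql("""
    SELECT path, added_data_files_count
    FROM lake.sales.orders.manifests
""").show(truncate=False)

# Individual data files with their per-file statistics.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM lake.sales.orders.files
""").show(truncate=False)
```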
The architecture is interesting, but the real question is what it delivers in practice. The features that usually matter most to organisations are:

- ACID transactions and safe concurrent writes
- schema evolution without breaking downstream consumers
- time travel across table snapshots
- partition evolution without rewriting historic data
- engine and vendor neutrality
Iceberg supports ACID transactions, so you can have multiple jobs writing to the same table at the same time without corrupting it. Each set of changes is committed as a new snapshot. Readers always see a consistent view of the data, even during updates.
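As an illustration, here is a sketch of a transactional upsert with `MERGE INTO`, which requires the Iceberg SQL extensions. The table follows the earlier hypothetical example, and the `updates` view is a stand-in for whatever source of changes you actually have.

```python
# A transactional upsert: the whole MERGE commits as a single new snapshot.
from datetime import datetime
from decimal import Decimal

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Incoming changes, registered as a temporary view for the MERGE below.
updates = spark.createDataFrame(
    [(1001, "C-42", Decimal("19.99"), datetime(2024, 6, 1, 12, 0))],
    "order_id INT, customer STRING, amount DECIMAL(10,2), order_ts TIMESTAMP",
)
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Readers see the table either before or after this commit, never in between.
spark.sql("""
    SELECT snapshot_id, operation, committed_at
    FROM lake.sales.orders.snapshots
    ORDER BY committed_at DESC
    LIMIT 3
""").show(truncate=False)
```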
Business requirements evolve, and so does your schema. Iceberg allows you to:

- add, drop and rename columns
- widen column types, for example from int to bigint
- reorder columns and update nested fields
Because the schema is part of the table metadata rather than baked into file layout in a rigid way, Iceberg can maintain backward compatibility and avoid breaking existing queries and dashboards.
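A sketch of what these changes look like in practice, again against the hypothetical example table:

```python
# Common schema changes. Iceberg tracks columns by ID in the table metadata,
# so these statements do not rewrite data files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Add a new optional column; existing rows read it as NULL.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN channel STRING")

# Rename a column without touching the underlying files.
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN customer TO customer_id")

# Widen a numeric type (INT -> BIGINT is one of the allowed promotions).
spark.sql("ALTER TABLE lake.sales.orders ALTER COLUMN order_id TYPE BIGINT")
```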
Every commit creates a new snapshot. You can query a table as of a specific time or snapshot ID. This time travel capability is extremely useful for:

- debugging and auditing, by comparing the table before and after a change
- reproducing reports or machine learning training sets exactly as they were originally run
- recovering from bad writes by rolling back to a known-good snapshot
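A sketch of both forms of time travel query; the timestamp is a placeholder and must fall after the table's first snapshot for the query to succeed.

```python
# Time travel: query the table as of a timestamp or a snapshot ID.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# As of a point in time...
spark.sql("""
    SELECT count(*) AS orders_at_that_time
    FROM lake.sales.orders TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()

# ...or as of a specific snapshot ID taken from the snapshots metadata table.
first_snapshot = spark.sql("""
    SELECT snapshot_id
    FROM lake.sales.orders.snapshots
    ORDER BY committed_at
    LIMIT 1
""").first()["snapshot_id"]

spark.sql(
    f"SELECT count(*) AS orders_in_first_snapshot "
    f"FROM lake.sales.orders VERSION AS OF {first_snapshot}"
).show()
```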
Iceberg decouples the logical partitioning from the physical layout of files. That means you can change your partitioning strategy as your data or query patterns change, without having to rewrite all your historic data. For example, you might start with daily partitions and later switch to hourly or add geography as a dimension, while still retaining all previous data.
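The daily-to-hourly switch mentioned above looks roughly like this (requires the Iceberg SQL extensions):

```python
# Partition evolution: move from daily to hourly partitioning. Files written
# under the old spec stay as they are; only new data uses the new layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("ALTER TABLE lake.sales.orders ADD PARTITION FIELD hours(order_ts)")
spark.sql("ALTER TABLE lake.sales.orders DROP PARTITION FIELD days(order_ts)")
```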
Because Apache Iceberg is an open standard, the same table can be accessed by multiple query engines across clouds. This gives you flexibility in how you use your data and reduces the risk of being locked into a single platform or vendor.
From a consultancy perspective, the decision to adopt Iceberg is usually driven by a combination of technical and strategic factors:

- engine and vendor neutrality, so the same data can serve multiple teams and tools
- warehouse-style reliability on low-cost object storage
- a single, governed copy of the data instead of engine-specific silos
Modern data strategies are converging on an open lakehouse model: a shared, governed data layer on cloud storage, with multiple engines on top. Apache Iceberg plays a central role in that architecture by acting as the standard table format for that shared layer.
Apache Iceberg unlocks powerful capabilities for modern data lakehouse architectures. Once you move from prototypes into production, however, different questions start to appear:

- How do you keep queries fast as small files and metadata accumulate?
- Who runs compaction, snapshot expiry and other maintenance, and how often?
- How do you handle deletes and schema changes at scale?
- What does it cost to keep hundreds or thousands of tables healthy?
These are not academic issues. They are the real operational challenges we see when organisations adopt Iceberg at scale. In this article, we focus on those challenges and what it takes to address them.
One of the most common Apache Iceberg performance issues is the small files problem. Every time you write to an Iceberg table – especially with streaming or frequent micro-batches – you create new data files.
Individually, each file may be small and quick to write. Over days and weeks of ingestion, they accumulate into thousands or even hundreds of thousands of tiny files. When a user runs a query, the engine must:

- read the table metadata and manifests to work out which files are relevant
- plan a scan across all of those files
- open and fetch each small file from object storage
This leads to longer query planning times, more object storage calls and higher memory usage in the query engine. Queries that started out as sub-second can creep into seconds or minutes if small files are not managed and compacted.
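You can spot this building up before users complain. A sketch, using the `files` metadata table on the hypothetical example table:

```python
# File counts and average file sizes per partition, straight from Iceberg's
# "files" metadata table. Many files with a tiny average size is a warning sign.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    SELECT
        `partition`,
        count(*)                                        AS file_count,
        round(avg(file_size_in_bytes) / 1024 / 1024, 1) AS avg_file_mb
    FROM lake.sales.orders.files
    GROUP BY `partition`
    ORDER BY file_count DESC
""").show(truncate=False)
```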
Small files are only part of the story. Iceberg tracks every data file in its metadata, and that metadata grows over time as well.
Each write operation creates new manifest files that reference the newly written data files. Old manifests are not automatically discarded, because they form part of the historical record that drives time travel and snapshot isolation. Over time you can end up with:

- thousands of manifest files, many referencing data files that have since been rewritten
- large manifest lists for every snapshot
- a long snapshot history and an ever-growing table metadata file
This metadata growth, sometimes called metadata bloat, is a natural outcome of Iceberg’s design. Without active management, it can have a significant impact on both performance and cost.
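The usual countermeasure is snapshot expiry. A sketch using Iceberg's `expire_snapshots` Spark procedure; the seven-day window and `retain_last` value are purely illustrative, not recommendations.

```python
# Expire old snapshots so Iceberg can drop the manifests, manifest lists and
# data files that no remaining snapshot references.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 20
    )
""")
```

Note that expiring snapshots also shortens your time travel window, so retention needs to be agreed with the people who rely on it.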
Iceberg supports two main types of deletes:

- position deletes, which record the file and row position of each deleted row
- equality deletes, which record the column values that identify deleted rows
Both approaches avoid rewriting large data files immediately, which makes deletes cheap in the short term. However, they also introduce what we call delete debt.
Every delete operation results in delete files that must be read alongside data files at query time. As delete files accumulate:

- every scan has to merge more delete files with the data files it reads
- query latency and memory usage grow
- the eventual cost of applying the deletes permanently keeps rising
At some point, you need to physically rewrite data files and apply the deletes permanently. This is a compute-intensive operation that needs to be planned and coordinated.
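To know when that point has arrived, it helps to measure the debt. A sketch using the `delete_files` metadata table, which is available for format-v2 tables in recent Iceberg releases; recent releases also ship maintenance procedures for compacting delete files, and ordinary compaction (covered next) can apply deletes as part of a rewrite.

```python
# Quantify delete debt: how many delete files exist and how many row-level
# deletes they carry, for the hypothetical example table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    SELECT count(*)          AS delete_file_count,
           sum(record_count) AS deleted_row_records
    FROM lake.sales.orders.delete_files
""").show()
```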
The main mechanism to address small files, metadata bloat and delete debt is compaction: rewriting many small files into fewer larger ones, and optionally applying deletes at the same time.
Compaction sounds straightforward, but in practice you must make several non-trivial decisions:

- which tables and partitions to compact, and how often
- the target file size and rewrite strategy
- whether to apply accumulated deletes in the same pass
- how much compute to allocate, and how to avoid conflicting with live writers
Many organisations start with simple scheduled jobs and discover that they either over-compact, wasting money on unnecessary rewrites, or under-compact and suffer from poor query performance.
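A basic compaction run uses Iceberg's `rewrite_data_files` Spark procedure. In this sketch the target file size and delete threshold are illustrative values that would need to be tuned per table.

```python
# Compact small files (and files weighed down by deletes) into larger ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.orders',
        strategy => 'binpack',
        options => map(
            'target-file-size-bytes', '536870912',  -- aim for roughly 512 MB files
            'delete-file-threshold', '5'            -- rewrite files carrying several deletes
        )
    )
""")
```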
Schema evolution is one of Iceberg’s strengths, but it still needs to be treated carefully in production.
Downstream tools may cache schema information and not immediately pick up changes. Type changes can introduce subtle bugs. Dropping columns too early can remove information that is still needed for historical reporting or regulatory purposes.
From an operational perspective, Apache Iceberg schema evolution should be treated as a controlled change process. It needs coordination between data engineers, BI developers and business stakeholders, rather than happening ad hoc.
If you are running Apache Iceberg on top of open-source engines such as Spark, Trino or Flink, you are responsible for building and maintaining:

- compaction and file maintenance jobs
- snapshot expiry and orphan file clean-up
- monitoring and alerting on table health, such as file counts, metadata size and delete debt
- the catalog, access controls and version upgrades that sit around the tables
All of this is achievable, but it is not free. Over time, many organisations discover that operating Iceberg at scale is a platform engineering problem in its own right. The hidden cost is not in the Iceberg format itself, but in the tooling and processes needed to keep tables healthy.
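To give a flavour of what that platform engineering looks like, here is a sketch of the kind of scheduled maintenance job teams end up writing themselves. The table list, retention window and target file size are placeholders, and a real version would add error handling, metrics and per-table tuning.

```python
# A home-grown maintenance routine: compact, expire snapshots and remove orphan
# files for every table the job owns. All values below are placeholders.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

TABLES = ["sales.orders", "sales.refunds"]   # hypothetical list of managed tables
SNAPSHOT_RETENTION = timedelta(days=7)

def maintain(table: str) -> None:
    cutoff = (datetime.utcnow() - SNAPSHOT_RETENTION).strftime("%Y-%m-%d %H:%M:%S")

    # 1. Compact small files into larger ones.
    spark.sql(f"""
        CALL lake.system.rewrite_data_files(
            table => '{table}',
            options => map('target-file-size-bytes', '536870912')
        )
    """)

    # 2. Expire old snapshots so metadata and storage stop growing without bound.
    spark.sql(f"""
        CALL lake.system.expire_snapshots(
            table => '{table}',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 20
        )
    """)

    # 3. Remove files that no snapshot references any more (e.g. from failed writes).
    spark.sql(f"""
        CALL lake.system.remove_orphan_files(
            table => '{table}',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)

for table in TABLES:
    maintain(table)
```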
There are two broad approaches to delivering Apache Iceberg in production:

- build and operate it yourself on open-source engines, owning the maintenance tooling described above
- adopt a managed platform or product that automates table maintenance and optimisation for you
Neither route is inherently right or wrong. The right choice depends on:

- the size and maturity of your platform engineering team
- the scale and latency requirements of your workloads
- how quickly you need to deliver value, and at what operational cost
Part 2 of this series compares the main Iceberg platforms and products, including Qlik Open Lakehouse, and looks at how they can reduce the operational burden of running Iceberg at scale.
If you’re exploring Iceberg or modern lakehouse architectures, our experts can help you evaluate the right approach for your business. If you’d like to talk it through, schedule a call with our team.