If you are evaluating modern data platforms, you have probably heard people talk about Apache Iceberg. It is often described as a “table format for the data lake” or “the foundation of an open lakehouse”. Those phrases sound good, but they do not always make it clear what Iceberg actually does, or why so many data teams are moving towards it.
This article explains what Apache Iceberg is, how it organises your data, and why it is becoming a key building block for open lakehouse architectures. We also cover the operational challenges that appear once you move into production—issues most vendor demonstrations skip over entirely.
This is Part 1 of a two-part series. Part 2 looks at how to choose the best Iceberg platform and how to turn open lakehouse architecture into a strategic priority.
Apache Iceberg is an open table format for analytical data stored in cloud object storage such as Amazon S3, Azure Data Lake Storage or Google Cloud Storage. It brings data warehouse-style guarantees – such as ACID transactions, schema evolution and time travel – to the flexibility and low cost of a cloud data lake.
At a high level, Iceberg allows you to:

- treat a collection of files in object storage as a single, reliable table with ACID guarantees
- evolve a table's schema and partitioning without rewriting historic data
- query earlier versions of a table through snapshots (time travel)
- read and write the same table from multiple query engines
Where a traditional data warehouse is tightly coupled to a single engine, Iceberg is deliberately engine-agnostic. The same Apache Iceberg table can be queried by Spark, Snowflake, Trino, Flink, Athena, Qlik and others. That engine neutrality is one of the main reasons Iceberg sits at the centre of many open lakehouse strategies.
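To make this concrete, here is a minimal PySpark sketch of registering an Iceberg catalog with Spark and creating a table. The catalog name (`lake`), warehouse path, catalog type and the `sales.orders` table are all placeholders; the right values depend entirely on your environment.

```python
# A minimal sketch: configure a Spark session with an Iceberg catalog and
# create a partitioned Iceberg table. All names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-intro")
    # Iceberg's Spark runtime and SQL extensions must be on the classpath.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Create an Iceberg table partitioned by day; any engine that can reach the
# same catalog and storage can query it afterwards.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id  INT,
        customer  STRING,
        amount    DECIMAL(10, 2),
        order_ts  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")
```

The later examples in this article reuse this hypothetical `lake.sales.orders` table.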
To understand why Iceberg is powerful, it helps to look at how it structures metadata. Rather than pointing directly at a folder of Parquet files, Iceberg uses several layers of metadata to keep track of what belongs to your table and how it has changed over time.
At the bottom of the hierarchy are the data files: the Parquet, ORC or Avro files that hold the rows of your table. Iceberg does not replace your files; it manages and organises them so they behave like a single logical table.
Manifest files are like catalogue cards for your data. Each manifest records:

- the data files it tracks and where they live in object storage
- the partition values each file belongs to
- column-level statistics such as minimum and maximum values, null counts and record counts
These statistics allow query engines to skip entire files that are irrelevant to a query, which is essential for performance at scale.
A manifest list is the index of all manifests for a given snapshot of the table. It tells Iceberg which manifests together describe the current state of the table.
Each Iceberg table has a metadata file that acts as the entry point. It stores:

- the table's current schema and partition specification
- the history of snapshots and a pointer to the current one
- references to the manifest lists that describe each snapshot
This layered approach – data files, manifests, manifest lists and the metadata file – is what enables Iceberg’s advanced capabilities such as time travel, safe concurrent writes and consistent reads across engines.
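You can see these layers for yourself, because Iceberg exposes them as queryable metadata tables. A short sketch, assuming a Spark session already configured with the hypothetical `lake` catalog from the earlier example:

```python
# Inspect Iceberg's metadata layers via its metadata tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Snapshots: one row per commit, the basis for time travel.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.sales.orders.snapshots
""").show(truncate=False)

# Manifests referenced by the current snapshot.
spark.sql("""
    SELECT path, added_data_files_count
    FROM lake.sales.orders.manifests
""").show(truncate=False)

# Individual data files with their per-file statistics.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM lake.sales.orders.files
""").show(truncate=False)
```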
The architecture is interesting, but the real question is what it delivers in practice. The features that usually matter most to organisations are:

- ACID transactions and safe concurrent writes
- schema evolution without breaking downstream consumers
- time travel across table snapshots
- partition evolution without rewriting historic data
- engine and vendor neutrality
Iceberg supports ACID transactions, so you can have multiple jobs writing to the same table at the same time without corrupting it. Each set of changes is committed as a new snapshot. Readers always see a consistent view of the data, even during updates.
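As an illustration, here is a sketch of a transactional upsert with `MERGE INTO`, which requires the Iceberg SQL extensions. The table follows the earlier hypothetical example, and the `updates` view is a stand-in for whatever source of changes you actually have.

```python
# A transactional upsert: the whole MERGE commits as a single new snapshot.
from datetime import datetime
from decimal import Decimal

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Incoming changes, registered as a temporary view for the MERGE below.
updates = spark.createDataFrame(
    [(1001, "C-42", Decimal("19.99"), datetime(2024, 6, 1, 12, 0))],
    "order_id INT, customer STRING, amount DECIMAL(10,2), order_ts TIMESTAMP",
)
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Readers see the table either before or after this commit, never in between.
spark.sql("""
    SELECT snapshot_id, operation, committed_at
    FROM lake.sales.orders.snapshots
    ORDER BY committed_at DESC
    LIMIT 3
""").show(truncate=False)
```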
Business requirements evolve, and so does your schema. Iceberg allows you to:

- add, drop and rename columns
- widen column types, for example from int to bigint
- reorder columns and update nested fields
Because the schema is part of the table metadata rather than baked into file layout in a rigid way, Iceberg can maintain backward compatibility and avoid breaking existing queries and dashboards.
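A sketch of what these changes look like in practice, again against the hypothetical example table:

```python
# Common schema changes. Iceberg tracks columns by ID in the table metadata,
# so these statements do not rewrite data files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# Add a new optional column; existing rows read it as NULL.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN channel STRING")

# Rename a column without touching the underlying files.
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN customer TO customer_id")

# Widen a numeric type (INT -> BIGINT is one of the allowed promotions).
spark.sql("ALTER TABLE lake.sales.orders ALTER COLUMN order_id TYPE BIGINT")
```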
Every commit creates a new snapshot. You can query a table as of a specific time or snapshot ID. This time travel capability is extremely useful for:

- debugging and auditing, by comparing the table before and after a change
- reproducing reports or machine learning training sets exactly as they were originally run
- recovering from bad writes by rolling back to a known-good snapshot
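A sketch of both forms of time travel query; the timestamp is a placeholder and must fall after the table's first snapshot for the query to succeed.

```python
# Time travel: query the table as of a timestamp or a snapshot ID.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

# As of a point in time...
spark.sql("""
    SELECT count(*) AS orders_at_that_time
    FROM lake.sales.orders TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()

# ...or as of a specific snapshot ID taken from the snapshots metadata table.
first_snapshot = spark.sql("""
    SELECT snapshot_id
    FROM lake.sales.orders.snapshots
    ORDER BY committed_at
    LIMIT 1
""").first()["snapshot_id"]

spark.sql(
    f"SELECT count(*) AS orders_in_first_snapshot "
    f"FROM lake.sales.orders VERSION AS OF {first_snapshot}"
).show()
```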
Iceberg decouples the logical partitioning from the physical layout of files. That means you can change your partitioning strategy as your data or query patterns change, without having to rewrite all your historic data. For example, you might start with daily partitions and later switch to hourly or add geography as a dimension, while still retaining all previous data.
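The daily-to-hourly switch mentioned above looks roughly like this (requires the Iceberg SQL extensions):

```python
# Partition evolution: move from daily to hourly partitioning. Files written
# under the old spec stay as they are; only new data uses the new layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("ALTER TABLE lake.sales.orders ADD PARTITION FIELD hours(order_ts)")
spark.sql("ALTER TABLE lake.sales.orders DROP PARTITION FIELD days(order_ts)")
```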
Because Apache Iceberg is an open standard, the same table can be accessed by multiple query engines across clouds. This gives you flexibility in how you use your data and reduces the risk of being locked into a single platform or vendor.
From a consultancy perspective, the decision to adopt Iceberg is usually driven by a combination of technical and strategic factors:

- engine and vendor neutrality, so the same data can serve multiple teams and tools
- warehouse-style reliability on low-cost object storage
- a single, governed copy of the data instead of engine-specific silos
Modern data strategies are converging on an open lakehouse model: a shared, governed data layer on cloud storage, with multiple engines on top. Apache Iceberg plays a central role in that architecture by acting as the standard table format for that shared layer.
Apache Iceberg unlocks powerful capabilities for modern data lakehouse architectures. Once you move from prototypes into production, however, different questions start to appear:

- How do you keep queries fast as small files and metadata accumulate?
- Who runs compaction, snapshot expiry and other maintenance, and how often?
- How do you handle deletes and schema changes at scale?
- What does it cost to keep hundreds or thousands of tables healthy?
These are not academic issues. They are the real operational challenges we see when organisations adopt Iceberg at scale. In this article, we focus on those challenges and what it takes to address them.
One of the most common Apache Iceberg performance issues is the small files problem. Every time you write to an Iceberg table – especially with streaming or frequent micro-batches – you create new data files.
Individually, each file may be small and quick to write. Over days and weeks of ingestion, they accumulate into thousands or even hundreds of thousands of tiny files. When a user runs a query, the engine must:

- read the table metadata and manifests to work out which files are relevant
- plan a scan across all of those files
- open and fetch each small file from object storage
This leads to longer query planning times, more object storage calls and higher memory usage in the query engine. Queries that started out as sub-second can creep into seconds or minutes if small files are not managed and compacted.
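You can spot this building up before users complain. A sketch, using the `files` metadata table on the hypothetical example table:

```python
# File counts and average file sizes per partition, straight from Iceberg's
# "files" metadata table. Many files with a tiny average size is a warning sign.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    SELECT
        `partition`,
        count(*)                                        AS file_count,
        round(avg(file_size_in_bytes) / 1024 / 1024, 1) AS avg_file_mb
    FROM lake.sales.orders.files
    GROUP BY `partition`
    ORDER BY file_count DESC
""").show(truncate=False)
```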
Small files are only part of the story. Iceberg tracks every data file in its metadata, and that metadata grows over time as well.
Each write operation creates new manifest files that reference the newly written data files. Old manifests are not automatically discarded, because they form part of the historical record that drives time travel and snapshot isolation. Over time you can end up with:

- thousands of manifest files, many referencing data files that have since been rewritten
- large manifest lists for every snapshot
- a long snapshot history and an ever-growing table metadata file
This metadata growth, sometimes called metadata bloat, is a natural outcome of Iceberg’s design. Without active management, it can have a significant impact on both performance and cost.
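The usual countermeasure is snapshot expiry. A sketch using Iceberg's `expire_snapshots` Spark procedure; the seven-day window and `retain_last` value are purely illustrative, not recommendations.

```python
# Expire old snapshots so Iceberg can drop the manifests, manifest lists and
# data files that no remaining snapshot references.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 20
    )
""")
```

Note that expiring snapshots also shortens your time travel window, so retention needs to be agreed with the people who rely on it.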
Iceberg supports two main types of deletes:

- position deletes, which record the file and row position of each deleted row
- equality deletes, which record the column values that identify deleted rows
Both approaches avoid rewriting large data files immediately, which makes deletes cheap in the short term. However, they also introduce what we call delete debt.
Every delete operation results in delete files that must be read alongside data files at query time. As delete files accumulate:

- every scan has to merge more delete files with the data files it reads
- query latency and memory usage grow
- the eventual cost of applying the deletes permanently keeps rising
At some point, you need to physically rewrite data files and apply the deletes permanently. This is a compute-intensive operation that needs to be planned and coordinated.
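To know when that point has arrived, it helps to measure the debt. A sketch using the `delete_files` metadata table, which is available for format-v2 tables in recent Iceberg releases; recent releases also ship maintenance procedures for compacting delete files, and ordinary compaction (covered next) can apply deletes as part of a rewrite.

```python
# Quantify delete debt: how many delete files exist and how many row-level
# deletes they carry, for the hypothetical example table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    SELECT count(*)          AS delete_file_count,
           sum(record_count) AS deleted_row_records
    FROM lake.sales.orders.delete_files
""").show()
```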
The main mechanism to address small files, metadata bloat and delete debt is compaction: rewriting many small files into fewer larger ones, and optionally applying deletes at the same time.
Compaction sounds straightforward, but in practice you must make several non-trivial decisions:

- which tables and partitions to compact, and how often
- the target file size and rewrite strategy
- whether to apply accumulated deletes in the same pass
- how much compute to allocate, and how to avoid conflicting with live writers
Many organisations start with simple scheduled jobs and discover that they either over-compact, wasting money on unnecessary rewrites, or under-compact and suffer from poor query performance.
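A basic compaction run uses Iceberg's `rewrite_data_files` Spark procedure. In this sketch the target file size and delete threshold are illustrative values that would need to be tuned per table.

```python
# Compact small files (and files weighed down by deletes) into larger ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.orders',
        strategy => 'binpack',
        options => map(
            'target-file-size-bytes', '536870912',  -- aim for roughly 512 MB files
            'delete-file-threshold', '5'            -- rewrite files carrying several deletes
        )
    )
""")
```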
Schema evolution is one of Iceberg’s strengths, but it still needs to be treated carefully in production.
Downstream tools may cache schema information and not immediately pick up changes. Type changes can introduce subtle bugs. Dropping columns too early can remove information that is still needed for historical reporting or regulatory purposes.
From an operational perspective, Apache Iceberg schema evolution should be treated as a controlled change process. It needs coordination between data engineers, BI developers and business stakeholders, rather than happening ad hoc.
If you are running Apache Iceberg on top of open-source engines such as Spark, Trino or Flink, you are responsible for building and maintaining:

- compaction and file maintenance jobs
- snapshot expiry and orphan file clean-up
- monitoring and alerting on table health, such as file counts, metadata size and delete debt
- the catalog, access controls and version upgrades that sit around the tables
All of this is achievable, but it is not free. Over time, many organisations discover that operating Iceberg at scale is a platform engineering problem in its own right. The hidden cost is not in the Iceberg format itself, but in the tooling and processes needed to keep tables healthy.
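To give a flavour of what that platform engineering looks like, here is a sketch of the kind of scheduled maintenance job teams end up writing themselves. The table list, retention window and target file size are placeholders, and a real version would add error handling, metrics and per-table tuning.

```python
# A home-grown maintenance routine: compact, expire snapshots and remove orphan
# files for every table the job owns. All values below are placeholders.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed configured with the "lake" catalog

TABLES = ["sales.orders", "sales.refunds"]   # hypothetical list of managed tables
SNAPSHOT_RETENTION = timedelta(days=7)

def maintain(table: str) -> None:
    cutoff = (datetime.utcnow() - SNAPSHOT_RETENTION).strftime("%Y-%m-%d %H:%M:%S")

    # 1. Compact small files into larger ones.
    spark.sql(f"""
        CALL lake.system.rewrite_data_files(
            table => '{table}',
            options => map('target-file-size-bytes', '536870912')
        )
    """)

    # 2. Expire old snapshots so metadata and storage stop growing without bound.
    spark.sql(f"""
        CALL lake.system.expire_snapshots(
            table => '{table}',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 20
        )
    """)

    # 3. Remove files that no snapshot references any more (e.g. from failed writes).
    spark.sql(f"""
        CALL lake.system.remove_orphan_files(
            table => '{table}',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)

for table in TABLES:
    maintain(table)
```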
There are two broad approaches to delivering Apache Iceberg in production:

- build and operate it yourself on open-source engines, owning the maintenance tooling described above
- adopt a managed platform or product that automates table maintenance and optimisation for you
Neither route is inherently right or wrong. The right choice depends on:

- the size and maturity of your platform engineering team
- the scale and latency requirements of your workloads
- how quickly you need to deliver value, and at what operational cost
Part 2 of this series compares the main Iceberg platforms and products, including Qlik Open Lakehouse, and looks at how they can reduce the operational burden of running Iceberg at scale.
If you’re exploring Iceberg or modern lakehouse architectures, our experts can help you evaluate the right approach for your business. If you’d like to talk it through, schedule a call with our team.