SL#84 - AWS AI Series (21/30) - Build an Iceberg Lakehouse on S3 Tables: Time Travel and Auto-Compaction, Queried From Two Engines

What we are building

By the end you will have an Iceberg table living in an S3 table bucket, registered in the SageMaker lakehouse so two different engines can read it, with three things demonstrated that are painful to get right on a roll-your-own Iceberg setup: time travel across snapshots, the same table queried from Athena and from Spark without copying data, and compaction that runs automatically instead of as a cron job you babysit.

The reason this is its own episode is that "build a lakehouse" usually means standing up a catalog, wiring an Iceberg runtime into Spark, configuring a metastore, and then scheduling OPTIMIZE and VACUUM jobs so small files do not strangle your query performance six weeks later. S3 Tables collapses most of that. A table bucket is a first-class S3 resource that stores Iceberg tables as subresources, and S3 runs the maintenance, including compaction, snapshot expiration, and unreferenced file cleanup, on your behalf. The non-obvious payoff is operational: the part of an Iceberg deployment that actually costs teams their weekends is the maintenance, and that is the part you are handing to S3.

Here is the destination: one CLI command creates the table bucket, a handful of SQL statements in the SageMaker Unified Studio Query Editor create and populate an Iceberg table, a FOR VERSION AS OF query reads a past snapshot, a Spark cell reads the live table, and a single CLI call shows you the compaction config S3 is already enforcing.

Prerequisites

You need an AWS account where you can create S3 table buckets, act as a Lake Formation administrator, and you should already have a SageMaker Unified Studio domain and project. If you do not, episode 20 of this series walks through one-click onboarding in about ten minutes, and you want that done before you start here.

You need the AWS CLI v2 configured (aws sts get-caller-identity should succeed) and an AWS Region where S3 Tables is available. Use us-east-1 if you have no preference. You should be comfortable reading SQL and a few lines of PySpark. No prior Iceberg knowledge is required. The Iceberg-specific concepts, snapshots and time travel, are explained as we hit them.

A note on cost. Following along costs well under a dollar. S3 Tables bills for storage, requests, and the managed maintenance (compaction is billed per GB of objects processed, plus a small monthly per-object monitoring charge), and our table is a few thousand rows, so all of that rounds to cents. Athena SQL is 5 USD per TB scanned, and these scans are kilobytes. Athena for Spark is roughly 0.35 USD per DPU-hour billed per second, so a short session is a few cents. The full teardown is in the cleanup section.

Setup: create a table bucket with integration on

A table bucket is not a normal S3 bucket. It is a separate bucket type that stores tabular data and exposes Iceberg semantics. Create one from the CLI:

aws s3tables create-table-bucket --name sl-lakehouse
# note the tableBucketARN it returns, you will reuse it

Now turn on the analytics integration so engines like Athena and Spark can see the bucket through the catalog. The simplest path is the S3 console: open Table buckets, and when you create or open a table bucket, choose Enable integration. Behind that one click S3 creates a service role that lets Lake Formation manage access to all table buckets in the Region, and it creates the s3tablescatalog entry in the AWS Glue Data Catalog. That catalog is the bridge that makes an S3 Tables bucket queryable from the lakehouse.

Integration is the part people skip and then wonder why their table is invisible to Athena. Without it there is no Glue catalog entry, so no engine can resolve the table name. Confirm it worked: in the Lake Formation console, choose Catalogs, and you should see s3tablescatalog with your sl-lakehouse bucket underneath it.

Last setup step, grant your SageMaker project access. In Lake Formation, open s3tablescatalog, choose your sl-lakehouse bucket, and from Actions choose Grant. Select your SageMaker Unified Studio project role and grant Super user for this tutorial. In a real deployment you would grant narrower table and column permissions, but Super user keeps the focus on the lakehouse mechanics here.

Step 1: Register the catalog in SageMaker Unified Studio

Open the SageMaker console, select your domain, and open your project. In the navigation choose Data, then the + to add a new data source, and choose Create Lakehouse catalog. Pick Amazon S3 Tables as the source, give the catalog a name, and a database name. Choose Create catalog.

This wires the S3 Tables bucket into the project as a queryable catalog. From here the table you create is reachable by every integrated engine: Athena, Redshift, EMR, Glue, and the Spark runtime in your notebook. You did not copy data anywhere. The table bucket holds the bytes, the Glue catalog holds the metadata, and the lakehouse catalog is the project-scoped pointer that lets your tools resolve it.

If the catalog creation errors with an access problem, the project role is missing the Lake Formation grant from setup. Go back and grant it before continuing. The order matters: integration first, then the Lake Formation grant, then the catalog registration.

Step 2: Create an Iceberg table and write your first snapshots

From the Build menu open the Query Editor and make sure the connection is Amazon Athena with your new S3 Tables catalog and database selected. Create the database (this also creates the Iceberg namespace), then the table:

CREATE DATABASE "s3tablescatalog/sl-lakehouse".sales;

CREATE TABLE "s3tablescatalog/sl-lakehouse".sales.orders (
  order_id   bigint,
  category   string,
  amount     double,
  order_ts   timestamp)
TBLPROPERTIES ('table_type' = 'ICEBERG');

The table_type of ICEBERG is what makes this a managed Iceberg table rather than a plain Hive table over files. Now write some data. Each INSERT creates a new Iceberg snapshot, which is the unit time travel rewinds to:

INSERT INTO "s3tablescatalog/sl-lakehouse".sales.orders VALUES
  (1, 'electronics', 299.00, timestamp '2026-06-01 09:00:00'),
  (2, 'grocery',      42.50, timestamp '2026-06-01 10:15:00');

INSERT INTO "s3tablescatalog/sl-lakehouse".sales.orders VALUES
  (3, 'apparel',      89.99, timestamp '2026-06-02 11:30:00'),
  (4, 'electronics', 149.00, timestamp '2026-06-02 14:05:00');

Run a quick SELECT count(*) and you should get 4. You now have a table with at least two snapshots, one per insert. Each snapshot is a consistent, immutable view of the table at the moment it was written. That immutability is the whole basis of Iceberg's reliability story, and it is what the next step exploits.

Step 3: Travel back in time

Iceberg keeps a versioned manifest of every snapshot, and Athena exposes that history through metadata tables and time travel clauses. First, list the snapshots so you have real IDs to work with. With the catalog and database selected in the Query Editor, query the $snapshots metadata table:

SELECT committed_at, snapshot_id, operation
FROM "orders$snapshots"
ORDER BY committed_at;

You will see one row per insert, each with a committed_at timestamp and a numeric snapshot_id. Now read the table as it existed at the first snapshot, before the second insert landed. Use version travel with one of the IDs from that list:

SELECT * FROM orders FOR VERSION AS OF 8027658604211071520;

That returns only the two rows from the first insert. Swap in your own snapshot ID from the $snapshots output. You can also travel by wall-clock time, which is what you reach for during an incident when you know roughly when bad data landed but not the snapshot ID:

SELECT * FROM orders
FOR TIMESTAMP AS OF TIMESTAMP '2026-06-01 23:00:00 UTC';

This is the capability that makes "someone ran a bad batch update at 2am" a recoverable event rather than a restore-from-backup event. Note the syntax: Athena engine version 3 uses FOR TIMESTAMP AS OF and FOR VERSION AS OF. If you find older examples using FOR SYSTEM_TIME AS OF, those are the engine version 2 spelling and will not run here.

Step 4: Query the same table from a second engine

The point of a lakehouse, as opposed to a warehouse, is that one copy of the data serves many engines. You just queried orders from Athena. Now read the identical table from Spark, no export, no copy. In your project open a serverless notebook, add a Python cell, and select Athena for Apache Spark as the runtime. The notebook gives you a ready spark session:

df = spark.sql("""
    SELECT category,
           count(*)              AS orders,
           round(sum(amount), 2) AS revenue
    FROM sales.orders
    GROUP BY category
    ORDER BY revenue DESC
""")
df.show()

You should see the per-category aggregation over all four rows, computed by Spark, reading the same Iceberg table Athena just wrote and queried. Spark resolves sales.orders through the same lakehouse catalog. The bytes never moved.

This is the part that justifies the lakehouse architecture over a pile of per-tool copies. An analyst runs interactive SQL in Athena, a data engineer runs a heavy Spark transform, and a downstream Redshift job reads the same governed table, all against one physical dataset with one set of Lake Formation permissions. The cost of getting this wrong on a hand-built stack is three diverging copies and a governance nightmare. Here it is one table.

Step 5: Confirm compaction is running without you

On a self-managed Iceberg table, every insert above added small files, and small files are how Iceberg deployments quietly rot: query planners open thousands of tiny objects and latency climbs. The usual fix is a scheduled OPTIMIZE ... REWRITE DATA job. S3 Tables does this automatically. Inspect the compaction configuration S3 is already enforcing on your table:

aws s3tables get-table-maintenance-configuration \
  --table-bucket-arn arn:aws:s3tables:us-east-1:<ACCOUNT>:bucket/sl-lakehouse \
  --namespace sales \
  --name orders

The response shows icebergCompaction with a status of enabled and a target file size (512 MB by default). You configured none of this. S3 continuously compacts small objects into fewer, larger files, expires old snapshots on a schedule, and removes unreferenced files, which is what keeps both query performance and storage cost from drifting over time. If you want a different target file size or a different snapshot retention window, you change it with put-table-maintenance-configuration, but the defaults are sane and on from creation.

The conceptual point worth keeping: compaction and time travel pull in opposite directions. Compaction wants to delete old files, time travel wants to keep them. S3 Tables reconciles that through snapshot management, expiring snapshots past the retention window so cleanup can proceed, which is exactly the bookkeeping you would otherwise hand-tune. You get both behaviors without arbitrating between them yourself.

Verify it works

You have a working lakehouse if five things hold. First, SELECT count(*) FROM orders returns 4. Second, the "orders$snapshots" query returned at least two rows with distinct snapshot_id values. Third, a FOR VERSION AS OF query against the earliest snapshot returned only the first two rows, proving the older version is still readable. Fourth, the Spark cell in Step 4 returned the per-category aggregation, proving a second engine reads the same table. Fifth, get-table-maintenance-configuration reported icebergCompaction enabled.

If all five are true, you have an Iceberg lakehouse on S3 Tables with time travel, multi-engine access, and managed compaction, and you wrote zero maintenance jobs to get it.

When it breaks

If CREATE TABLE fails with an access-denied or "catalog not found" error, the Lake Formation grant or the integration is missing. Confirm s3tablescatalog exists in Lake Formation and that your SageMaker project role has a grant on the bucket. This is the single most common failure, and it is always a permissions gap, not a SQL gap.

If the $snapshots query returns "table not found," you are likely missing the catalog or database context in the Query Editor, or you quoted the metadata table name wrong. The dollar-suffix metadata tables must be quoted as one identifier, "orders$snapshots", not orders$snapshots.

If a time travel query returns "no such snapshot," your snapshot ID is stale or mistyped, or the snapshot was already expired by snapshot management. Re-run the $snapshots query to get a current ID. On a brand-new table nothing will have expired yet, so this is almost always a copy-paste error.

If the Spark cell cannot resolve sales.orders, the notebook runtime is using a different catalog than the Query Editor. Make sure the Athena for Spark runtime is selected and the lakehouse catalog from Step 1 is the active catalog for the session.

If get-table-maintenance-configuration errors, check that the --table-bucket-arn, --namespace, and --name all match exactly. The namespace is your database name (sales), and the bucket ARN is the one create-table-bucket returned, not a normal S3 bucket ARN.

Clean up

Tear down in dependency order so nothing is left billing. From the Query Editor, drop the table, then delete the namespace and bucket from the CLI:

DROP TABLE "s3tablescatalog/sl-lakehouse".sales.orders;

aws s3tables delete-namespace \
  --table-bucket-arn arn:aws:s3tables:us-east-1:<ACCOUNT>:bucket/sl-lakehouse \
  --namespace sales

aws s3tables delete-table-bucket \
  --table-bucket-arn arn:aws:s3tables:us-east-1:<ACCOUNT>:bucket/sl-lakehouse

Then, in Lake Formation, revoke the grant you gave the project role, and in SageMaker remove the lakehouse catalog data source from the project if you do not plan to keep it. The s3tablescatalog and its service role are account-level and harmless to leave, but the table bucket is the thing that holds storage, so deleting it is what actually stops the meter.

Where to take it next

Wire in a third engine. The same table is readable from Amazon Redshift through the lakehouse catalog. Run the Step 4 aggregation as a Redshift query and confirm you get identical numbers from a warehouse engine reading the lake table, which is the clearest possible demonstration that there is one copy of the data.

Tune compaction for a real access pattern. Define a sort order on the table with ALTER TABLE sales.orders WRITE ORDERED BY order_ts, then switch the compaction strategy from the default bin-pack to sort. On a table with frequent range queries on a timestamp column, sort compaction clusters related rows together and cuts the bytes scanned per query. That is the bridge from "it works" to "it is fast under load."

Then make it stream. The next episodes in this series push live events into the lake. An S3 Tables Iceberg table is a natural sink for that: appends become snapshots, snapshots become time travel, and S3 keeps the file count under control while the data keeps arriving. The question to sit with: once maintenance is free, what is actually left of the operational case for a separate data warehouse?