This website uses cookies

Read our Privacy policy and Terms of use for more information.

What we are building

By the end of this tutorial you will have a working analytics project in Amazon SageMaker Unified Studio sitting on top of data you put in S3 yourself, not a canned sample. You will load a retail orders dataset into an S3 bucket, register it in the AWS Glue Data Catalog, then open a serverless notebook and run three kinds of analysis against the same table without provisioning a single cluster.

The non-obvious thing about the new SageMaker Unified Studio is that the notebook is not bound to one engine. A SQL cell talks to Athena SQL, a PySpark cell talks to Athena for Apache Spark (or EMR Serverless if you switch the runtime in the side panel), and a built-in AI agent can write either from a plain English prompt. You pick the engine per cell, and the compute spins up just in time and scales back down when the session ends. That means you stop paying the moment you stop working, which is the whole reason this is worth an hour of your evening.

The finished artifact is a notebook that reads a Glue-cataloged table, aggregates it with SQL, reprocesses it at scale with Spark, and renders a chart, all over your own bucket. You could push it to a repo and hand it to a teammate as the starting point for a real analytics project.

Prerequisites

You need an AWS account where you have administrator access, or at least enough IAM permissions to create an IAM role, an S3 bucket, Glue databases and tables, and to run the SageMaker Unified Studio one-click onboarding. The onboarding creates a domain and project under the hood, so a locked-down developer role will not be enough for the setup step.

You need the AWS CLI v2 configured (aws configure) and Python 3.9 or later with boto3 installed locally for the data-staging step. Everything after staging happens in the browser.

You must run this in a Region where SageMaker Unified Studio is available. As of June 2026 that includes us-east-1, us-east-2, us-west-2, eu-west-1, and several others. Check the current list in the supported Regions page linked in Sources before you start, because onboarding silently assumes you are in a supported Region. This tutorial uses us-east-1.

Assumed knowledge: you are comfortable reading Python and SQL, and you have seen a Jupyter-style notebook before. You do not need any prior Spark experience. Budget roughly 1 to 2 dollars for Athena SQL scans, a short Spark session, and a few cents of S3 storage.

Setup

First, stage a dataset in S3 and register it in the Glue Data Catalog. Run this locally. It generates a small synthetic retail orders CSV, uploads it, and creates a Glue database and external table pointing at it.

# stage_data.py
import boto3, csv, io, random, datetime
REGION = "us-east-1"
ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
BUCKET = f"sl83-smus-demo-{ACCOUNT}"
PREFIX = "retail/orders"
DB = "sl83_retail"
s3 = boto3.client("s3", region_name=REGION)
glue = boto3.client("glue", region_name=REGION)
# 1. Create the bucket (us-east-1 needs no LocationConstraint)
try:
    s3.create_bucket(Bucket=BUCKET)
except s3.exceptions.BucketAlreadyOwnedByYou:
    pass
# 2. Generate 20k synthetic orders as CSV
regions = ["NA", "EMEA", "APAC", "LATAM"]
buf = io.StringIO()
w = csv.writer(buf)
for i in range(20_000):
    day = datetime.date(2026, 1, 1) + datetime.timedelta(days=random.randint(0, 180))
    w.writerow([i, day.isoformat(), random.choice(regions),
                round(random.uniform(5, 500), 2), random.randint(1, 6)])
s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/orders.csv", Body=buf.getvalue())
print(f"Uploaded s3://{BUCKET}/{PREFIX}/orders.csv")

The CSV has five columns: an order id, an order date, a sales region, an amount, and a quantity. Twenty thousand rows is enough to make Spark non-trivial while staying inside cents of storage. Notice there is no header row, which matters for the table definition next.

Now register the database and table in Glue so SageMaker Unified Studio can discover it through the catalog:

# stage_data.py (continued)
try:
    glue.create_database(DatabaseInput={"Name": DB})
except glue.exceptions.AlreadyExistsException:
    pass
glue.create_table(
    DatabaseName=DB,
    TableInput={
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "order_date", "Type": "string"},
                {"Name": "region", "Type": "string"},
                {"Name": "amount", "Type": "double"},
                {"Name": "quantity", "Type": "int"},
            ],
            "Location": f"s3://{BUCKET}/{PREFIX}/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                "Parameters": {"separatorChar": ","},
            },
        },
        "Parameters": {"classification": "csv"},
        "TableType": "EXTERNAL_TABLE",
    },
)
print(f"Registered {DB}.orders in the Glue Data Catalog")

Run it: python stage_data.py. The smoke test that proves the setup is correct is to query the table once from the Athena console before touching SageMaker Unified Studio. Open Athena, pick the sl83_retail database, and run SELECT count(*) FROM orders. You should get 20000. If that works, the catalog and S3 wiring are sound, and the rest is just the notebook.

Step 1: One-click onboarding into SageMaker Unified Studio

This is the step that used to take a platform team a week. Go to the SageMaker console at console.aws.amazon.com/datazone/home and choose the Get started button. SageMaker offers to either reuse an existing IAM role that already has access to your data and compute, or create a new one. For this tutorial, let it create a new role so onboarding wires up the Glue, Lake Formation, and S3 permissions for you, then choose Set up.

Onboarding takes a few minutes. Under the covers it creates an IAM-based domain and a project with default serverless compute already attached: Athena SQL, Athena for Apache Spark, AWS Glue Spark, and Managed Workflows for Apache Airflow serverless. You provision nothing and tune nothing. When it finishes you land on the SageMaker Unified Studio portal with your Glue Data Catalog tables, including sl83_retail.orders, visible in the left navigation along with the analytics and ML tools.

The reason this matters beyond convenience is the permission model. Onboarding preserves your existing Data Catalog, Lake Formation, and S3 permissions instead of inventing a parallel set. The data you could already see, you can see here. The data you could not, you still cannot. That is what makes the one-click path safe to run against a real account rather than a throwaway sandbox.

Step 2: Run your first SQL cell in a serverless notebook

In the portal left navigation, choose Notebooks to open a new serverless notebook. A notebook here is a sequence of cells, and each cell has a connection selector that decides which engine runs it. For your first cell, set the connection to the Athena (SQL) connection, then type a plain SQL aggregation:

SELECT region,
       count(*)        AS orders,
       round(sum(amount), 2) AS revenue,
       round(avg(amount), 2) AS avg_order
FROM sl83_retail.orders
GROUP BY region
ORDER BY revenue DESC;

Run the cell. Within a few seconds you get four rows back, one per region, sorted by revenue. This is Athena SQL underneath, which means you are billed only for the bytes scanned, with a 10 MB minimum per query. Our CSV is well under a megabyte, so this query rounds to the minimum and costs a fraction of a cent.

The point of starting in SQL is that most analytics questions are SQL questions, and the serverless notebook does not make you leave SQL to get them answered. You are not configuring a JDBC driver or a workgroup. The connection was created for you during onboarding, and the catalog reference sl83_retail.orders resolves to the same S3 location your stage_data.py script wrote.

Step 3: Scale the same data up with Athena for Spark

SQL is the right tool until it is not. The moment you need a window function across a large partition, a Python UDF, or a join that spills past what you want to scan repeatedly, you move to Spark. In this notebook you do not switch tools. You add a new cell and set its connection to Athena Spark. The cell gives you a ready spark session object, so you write ordinary PySpark:

df = spark.sql("SELECT * FROM sl83_retail.orders")
from pyspark.sql import functions as F
daily = (df
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"),
         F.count("*").alias("orders"))
    .orderBy("order_date"))
daily.show(10)
print("rows:", daily.count())

The first Spark cell in a session pays a cold-start cost while the serverless application warms, usually under a minute. Every cell after that runs against the warm session. Athena for Apache Spark bills per DPU-hour, where one DPU is 4 vCPUs and 16 GB of memory, at 0.35 dollars per DPU-hour in 1-second increments. A short interactive session with a driver and a handful of small calculations lands around 1 to 3 DPU-hours, so a few cents to about a dollar for this tutorial. The billing stops when the session ends.

If you would rather run on EMR Serverless than Athena Spark, the June 2026 release lets you select the Spark runtime from the notebook side panel, and the choice applies to both Python and SQL cells in that session. The PySpark code above is identical either way. EMR Serverless adds pre-initialized capacity for faster session starts and VPC connectivity if your workload needs network isolation, which is the reason to reach for it over Athena Spark on a real project.

Step 4: Let the Data Agent write the analysis

The serverless notebook ships with the SageMaker Data Agent, a built-in assistant that turns a natural-language prompt into SQL, Python, or Spark code with an execution plan. It reads your Data Catalog metadata, so it knows your schemas and relationships before it writes a line. Open the agent panel and give it something specific:

Using the sl83_retail.orders table, compute monthly revenue per
region for 2026, then plot it as a grouped bar chart with matplotlib.

The agent responds with a step-by-step plan and code: a query to pull and aggregate the data, then a matplotlib block to chart it. Specific prompts win here. "Analyze my data" produces vague output, while naming the table, the grouping, and the chart type produces code you can run as-is. If a generated cell throws an error, the Fix with AI action feeds the traceback back to the agent and it patches the cell.

Treat the agent as a fast first draft, not an oracle. It is genuinely good at the boilerplate you would otherwise copy from Stack Overflow, the connection setup and the matplotlib incantations, and it is wrong often enough about business logic that you should read the SQL before you trust the number. Here is the kind of visualization cell it produces, which you can also paste yourself:

import matplotlib.pyplot as plt
pdf = daily.groupBy(F.substring("order_date", 1, 7).alias("month"),
                    "region").agg(F.sum("revenue").alias("rev")).toPandas()
pivot = pdf.pivot(index="month", columns="region", values="rev").fillna(0)
pivot.plot(kind="bar", figsize=(10, 5), title="Monthly revenue by region")
plt.tight_layout(); plt.show()

Verify it works

You have three checkpoints. The SQL cell in Step 2 should return exactly four rows, one per region, with a revenue column in the low millions given 20,000 random orders averaging around 250 dollars. The Spark cell in Step 3 should print rows: followed by a number in the hundreds (roughly 180 days times 4 regions, minus any date and region combinations that drew zero orders), and daily.show(10) should print a clean table of order_date, region, revenue, orders.

The chart cell should render a grouped bar chart with one cluster per month and four bars per cluster. If all three render, the project is real: you onboarded your own S3 data and queried it through SQL, Spark, and an AI agent in one serverless notebook, with no cluster in sight. That is the contract of this tutorial.

When it breaks

If onboarding fails or the Get started button is missing, you are almost certainly in an unsupported Region. Switch to us-east-1 and retry. The console does not warn you clearly when SageMaker Unified Studio is unavailable where you are.

If sl83_retail.orders does not appear in the portal, the most common cause is Lake Formation. On accounts where Lake Formation manages the Glue databases, the onboarding role needs a Lake Formation grant on the database before the catalog surfaces it. Grant DESCRIBE and SELECT on sl83_retail to the SageMaker Unified Studio project role in the Lake Formation console, then refresh.

If the SQL cell returns zero rows but the Athena console smoke test returned 20000, your cell is pointed at the wrong connection or database. Confirm the cell's connection selector is the Athena SQL connection and that you fully qualified the table as sl83_retail.orders rather than relying on a default database.

If the first Spark cell hangs for more than a couple of minutes, the serverless application is cold-starting, which is normal once per session. If it errors immediately with an access-denied message, the project role lacks permission on the Athena Spark workgroup or the results S3 location; re-run onboarding with a freshly created role rather than a reused least-privilege one.

If the Data Agent generates code referencing columns that do not exist, your prompt was too vague and it guessed the schema. Name the exact table and columns in the prompt, and the grounded metadata pulls it back to reality.

Where to take it next

Swap the Spark runtime from Athena Spark to EMR Serverless in the notebook side panel and re-run Step 3 unchanged, then compare cold-start time and the Spark UI for the two engines on the same workload. This is the cheapest way to build intuition for when each engine is the right call.

Next, replace the synthetic CSV with a real columnar dataset. Convert orders to Parquet and watch the Athena SQL cost drop, because Athena only scans the columns a query touches, turning a full-table scan into a fraction of the bytes. Then wire the notebook into a schedule with the MWAA serverless environment that onboarding already created, so the daily aggregation runs on its own.

The harder extension is governance. Point a second project at the same catalog with a restricted role and confirm that Lake Formation column-level permissions actually hide a column from the notebook. If they do, you have the beginnings of a multi-team analytics platform on the same data, which is the entire pitch of Unified Studio. Does your team's current setup let you prove that as quickly?

Cost and cleanup

Following this tutorial costs roughly 1 to 2 dollars: a handful of Athena SQL queries at the 10 MB minimum (fractions of a cent each), one short Athena for Spark session at 0.35 dollars per DPU-hour for around 1 to 3 DPU-hours, and a few cents of S3 storage and requests. The serverless compute bills only while a session is active, so the dominant variable is how long you leave the Spark session open. Close it when you finish.

Clean up to avoid trickle charges. Delete the SageMaker Unified Studio project and domain from the portal or the datazone console first, since the domain keeps default compute associated. Then run this locally to remove the data and catalog entries:

# cleanup.py
import boto3
REGION = "us-east-1"
ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
BUCKET = f"sl83-smus-demo-{ACCOUNT}"
glue = boto3.client("glue", region_name=REGION)
s3 = boto3.resource("s3", region_name=REGION)
glue.delete_table(DatabaseName="sl83_retail", Name="orders")
glue.delete_database(Name="sl83_retail")
bucket = s3.Bucket(BUCKET)
bucket.objects.all().delete()
bucket.delete()
print("Removed table, database, and bucket")

Verify in the billing console the next day that no Athena workgroup or domain is still incurring charges. With the project deleted and the bucket emptied, the running cost returns to zero.

Sources

Keep Reading