What we are building
Every useful LLM feature needs context the model does not have. A support agent that knows the customer's last 90 days of orders. A churn email that references the actual refund the user just got. A risk summary that cites real numbers. The model is only as good as the facts you hand it, and those facts live in your data, not in the weights.
The naive answer is to stand up a database, write an ingestion job, and query it from your app. That is a lot of moving parts for what is really a read-only lookup over data you already dumped in S3. We are going to skip all of it. Athena lets you run standard SQL directly against files in S3 with nothing to provision. You point a table definition at a prefix, run a SELECT, and pay per terabyte scanned. It is the closest thing AWS has to a serverless data API.
By the end you will have a Python module, athena_features.py, that takes a customer ID, runs one aggregation query against order data in S3, parses the result into a feature dictionary, and feeds it to Amazon Bedrock to generate a grounded, personalized message. The non-obvious move is turning on Athena's query result reuse so that repeated lookups for the same customer hit a cache and cost nothing, which is exactly the access pattern an AI feature produces.
Here is what the finished thing does: you call build_customer_features("C00042"), it returns {order_count: 7, total_spend: 412.50, top_category: "outdoor", refund_rate: 0.14, ...}, and a second call hands those features to Nova to write a two-sentence retention note that actually references the numbers.
Prerequisites
You need an AWS account with permissions to use Athena, Glue, S3, and Bedrock. You need Python 3.10 or newer and boto3 installed. You need Bedrock model access enabled for an Amazon Nova model in your region (the Bedrock console, Model access page, takes a minute to approve Nova). You need an S3 bucket you can write to.
Assumed knowledge: you can read Python, you have written SQL before, and you have run a boto3 script with credentials configured (via aws configure, an SSO profile, or an instance role). You do not need any prior Athena experience. We set up the table from scratch.
One cost note up front so nobody is surprised: Athena bills $5.00 per terabyte scanned, rounded up to a 10 MB minimum per query. The sample dataset in this tutorial is a few kilobytes, so every query you run costs the 10 MB minimum, which is $0.00005. You could run this tutorial a thousand times for a nickel. Bedrock Nova calls are fractions of a cent each. Full numbers are in the cost section.
Setup
Pick a region that has both Athena and Bedrock Nova (this tutorial uses us-east-1) and set up a working bucket. Replace the bucket name with something globally unique.
export AWS_REGION=us-east-1
export BUCKET=sl-athena-features-$(date +%s)
aws s3 mb "s3://$BUCKET" --region "$AWS_REGION"
# Two prefixes: one for source data, one for Athena query output.
export DATA_PREFIX="s3://$BUCKET/orders/"
export OUTPUT_PREFIX="s3://$BUCKET/athena-results/"
pip install "boto3>=1.40,<2"
Now generate a tiny sample dataset and upload it. Athena reads many formats; we use newline-delimited JSON because it is trivial to produce and read. Each row is one order.
python - <<'PY'
import json, random, datetime, os
random.seed(7)
cats = ["outdoor", "kitchen", "books", "electronics"]
rows = []
for i in range(400):
cid = f"C{random.randint(1, 60):05d}"
days_ago = random.randint(0, 200)
d = (datetime.date(2026, 6, 29) - datetime.timedelta(days=days_ago)).isoformat()
amount = round(random.uniform(8, 120), 2)
rows.append({"customer_id": cid, "order_date": d, "category": random.choice(cats),
"amount": amount, "refunded": random.random() < 0.15})
with open("orders.json", "w") as f:
for r in rows:
f.write(json.dumps(r) + "\n")
print("wrote", len(rows), "orders")
PY
aws s3 cp orders.json "${DATA_PREFIX}orders.json"
Smoke test that the file landed:
aws s3 ls "$DATA_PREFIX"
# 2026-06-29 06:30:11 28433 orders.json
If you see the object listed, your bucket and data are ready. We have not touched Athena yet; that is the first step.
Step 1: Define the table over S3
Athena needs a table definition that maps columns to the JSON fields. There is no data movement here. The DDL writes metadata into the Glue Data Catalog and points it at your S3 prefix. We run it from Python so the whole pipeline is one language. Start athena_features.py with the client and a single helper that submits any SQL statement.
import boto3, time, os
REGION = os.environ["AWS_REGION"]
OUTPUT = os.environ["OUTPUT_PREFIX"]
DATA = os.environ["DATA_PREFIX"]
athena = boto3.client("athena", region_name=REGION)
def start_query(sql, database="default", reuse_minutes=None):
kwargs = {
"QueryString": sql,
"QueryExecutionContext": {"Database": database, "Catalog": "AwsDataCatalog"},
"ResultConfiguration": {"OutputLocation": OUTPUT},
}
if reuse_minutes:
kwargs["ResultReuseConfiguration"] = {
"ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": reuse_minutes}
}
return athena.start_query_execution(**kwargs)["QueryExecutionId"]
Three things matter in kwargs. QueryExecutionContext names the database and catalog (AwsDataCatalog is the default Glue catalog). ResultConfiguration tells Athena where to dump results in S3. ResultReuseConfiguration is the cache toggle we will use in Step 3. The call returns a query execution ID, not results, because execution is asynchronous; Step 2 handles the waiting.
Now the DDL. CREATE EXTERNAL TABLE describes the JSON shape and the location. The LOCATION is the prefix, not the file, so Athena reads every object under it.
CREATE_TABLE = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
customer_id string,
order_date string,
category string,
amount double,
refunded boolean
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '{DATA}'
"""
The JsonSerDe is the deserializer that turns each JSON line into a row. The column names must match the JSON keys. That is the entire schema setup. No loading, no copy, no warehouse.
Step 2: Run a query and poll for the result
start_query_execution is asynchronous. It hands back a query execution ID and returns immediately; the query runs in the background. You poll get_query_execution until the state leaves QUEUED/RUNNING, then pull rows with get_query_results. Here is the polling loop and a result parser.
def wait_for(qid, poll=0.5, timeout=60):
deadline = time.time() + timeout
while time.time() < deadline:
ex = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
state = ex["Status"]["State"]
if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
if state != "SUCCEEDED":
reason = ex["Status"].get("StateChangeReason", "unknown")
raise RuntimeError(f"Query {state}: {reason}")
return ex
time.sleep(poll)
raise TimeoutError(f"Query {qid} did not finish in {timeout}s")
The get_query_results response is verbose. The first row of the first page is the column header, and every value is wrapped in a {"VarCharValue": ...} envelope. This helper flattens a result set into a list of dictionaries keyed by column name.
def fetch_rows(qid):
paginator = athena.get_paginator("get_query_results")
rows, header = [], None
for page in paginator.paginate(QueryExecutionId=qid):
for r in page["ResultSet"]["Rows"]:
values = [c.get("VarCharValue") for c in r["Data"]]
if header is None:
header = values
else:
rows.append(dict(zip(header, values)))
return rows
Everything comes back as a string because Athena serializes the result set as text. You cast in the next step where you know the target types. Wrapping the three primitives into one call keeps the feature code readable:
def query(sql, **kw):
qid = start_query(sql, **kw)
wait_for(qid)
return fetch_rows(qid)
Run the table creation once with query(CREATE_TABLE), then query("SELECT count(*) AS n FROM orders") and confirm the single row reads {'n': '400'}.
Step 3: Turn a customer into a feature dictionary, cached
This is the heart of the build. One aggregation query computes every feature we want for a customer in a single scan, instead of four separate lookups. We enable result reuse so that calling this twice for the same customer inside the reuse window returns instantly and scans zero bytes.
def build_customer_features(customer_id, reuse_minutes=60):
sql = f"""
SELECT
count(*) AS order_count,
round(sum(amount), 2) AS total_spend,
round(avg(amount), 2) AS avg_order,
max(order_date) AS last_order,
round(avg(CASE WHEN refunded THEN 1 ELSE 0 END), 3) AS refund_rate,
arbitrary(category) AS a_category
FROM orders
WHERE customer_id = '{customer_id}'
AND order_date >= date_format(current_date - interval '90' day, '%Y-%m-%d')
"""
qid = start_query(sql, reuse_minutes=reuse_minutes)
wait_for(qid)
rows = fetch_rows(qid)
if not rows or rows[0]["order_count"] == "0":
return {"customer_id": customer_id, "order_count": 0}
r = rows[0]
return {
"customer_id": customer_id,
"order_count": int(r["order_count"]),
"total_spend": float(r["total_spend"]),
"avg_order": float(r["avg_order"]),
"last_order": r["last_order"],
"refund_rate": float(r["refund_rate"]),
}
Two things earn their keep here. First, the WHERE clause filters to the last 90 days inside SQL, so Athena scans and returns only what you need rather than dragging the whole table into Python. On a partitioned production table this is the difference between scanning a day and scanning a year. Second, reuse_minutes=60 sets ResultReuseByAgeConfiguration with a 60-minute max age. The first call for C00042 runs the query and scans data. A second identical call within the hour skips execution entirely and returns the stored result, which Athena bills as zero bytes scanned. AI features hammer the same few entities repeatedly, so this cache is close to free money. Note the limits: reuse only works in Athena engine v3, only for deterministic SELECTs, and a query that overrides the workgroup result location is not eligible.
For string interpolation into SQL: in production, never format a raw user value into a query. Validate or allowlist the ID format first (here, C followed by digits), or use a parameterized prepared statement via EXECUTE. The tutorial keeps it inline for readability, but a customer ID that arrives from an HTTP request is untrusted input.
Step 4: Feed the features to Bedrock
Now the payoff. We format the feature dictionary into a compact context block and hand it to Nova through the Bedrock Converse API. The model writes a grounded message that references the real numbers, not a generic template.
import json
bedrock = boto3.client("bedrock-runtime", region_name=REGION)
MODEL_ID = "amazon.nova-lite-v1:0"
def retention_note(features):
if features["order_count"] == 0:
return "No orders in the last 90 days. Skip or send a win-back offer."
context = json.dumps(features, indent=2)
prompt = (
"You write one or two sentence retention notes for a support rep. "
"Use ONLY the facts in the JSON. Cite concrete numbers. No greetings.\n\n"
f"Customer features:\n{context}"
)
resp = bedrock.converse(
modelId=MODEL_ID,
messages=[{"role": "user", "content": [{"text": prompt}]}],
inferenceConfig={"maxTokens": 120, "temperature": 0.2},
)
return resp["output"]["message"]["content"][0]["text"].strip()
The design choice worth calling out: the model never touches the database. It only sees the small, structured feature block your SQL produced. That keeps the prompt short (cheaper, faster, fewer tokens for the model to misread) and keeps the data access auditable. Every fact the model can state traces back to one query you can log and replay. This is the pattern that scales: SQL does retrieval and aggregation, the LLM does language. Do not ask the model to do math over raw rows; it is worse at it than sum() and you pay tokens for the privilege.
Wire the two halves together in a __main__ block so the module runs end to end:
if __name__ == "__main__":
for cid in ["C00007", "C00042", "C00099"]:
feats = build_customer_features(cid)
print(cid, feats)
print(" ->", retention_note(feats), "\n")
Verify it works
Run the module after creating the table once:
python athena_features.py
Expected output looks like this (your numbers vary because the data is random, but the shape is fixed):
C00007 {'customer_id': 'C00007', 'order_count': 5, 'total_spend': 287.4, 'avg_order': 57.48, 'last_order': '2026-06-10', 'refund_rate': 0.2}
-> This customer placed 5 orders for $287.40 in 90 days but has a 20% refund rate, which is worth a proactive quality check.
C00042 {'customer_id': 'C00042', 'order_count': 7, 'total_spend': 412.5, ...}
-> Strong repeat buyer at 7 orders and $412.50; a loyalty perk would protect a high-value account.
C00099 {'customer_id': 'C00099', 'order_count': 0}
-> No orders in the last 90 days. Skip or send a win-back offer.
To prove result reuse is working, run build_customer_features("C00042") twice and inspect the second query's stats. Add a quick probe: after wait_for(qid), read ex["Statistics"]["DataScannedInBytes"]. The first call shows a non-zero scan; the second call within the hour shows 0 and a ResultReuseInformation block with "ReusedPreviousResult": true. That zero is the cache paying off.
When it breaks
If start_query_execution raises InvalidRequestException: ... OutputLocation, your OUTPUT_PREFIX env var is empty or not an s3:// URL. Athena must have a result location, either per-query (as here) or set on the workgroup. Echo $OUTPUT_PREFIX and confirm it ends with a slash.
If the query fails with HIVE_CANNOT_OPEN_SPLIT or a SerDe error, your JSON does not match the schema. The most common cause is pretty-printed JSON; JsonSerDe needs one complete JSON object per line, not multi-line objects. Regenerate with json.dumps(r) per line, no indent.
If you get AccessDeniedException on get_query_results, the IAM principal can start queries but lacks s3:GetObject on the result prefix. Athena writes results to S3 and reads them back to return them to you; you need read access to the output location, not just Athena permissions.
If bedrock.converse raises AccessDeniedException or ValidationException about the model, Nova access is not enabled in this region, or the model ID is wrong for your region. Open the Bedrock console, Model access, and enable Amazon Nova. Confirm MODEL_ID matches a model listed as access-granted.
If result reuse never seems to hit (scan bytes stay non-zero on repeat calls), check that you are on engine v3 (workgroups created recently default to it), that the query string is byte-for-byte identical, and that you did not change the output location between calls. Any of those breaks the match.
The IAM permissions the script needs, minimally: athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults, glue:GetTable, glue:GetDatabase, glue:CreateTable, s3:GetObject and s3:PutObject on the bucket, s3:GetBucketLocation, s3:ListBucket, and bedrock:InvokeModel on the Nova model ARN. Scope the S3 statements to your bucket, not *.
Estimated cost
Athena charges $5.00 per TB scanned with a 10 MB minimum per query. The sample dataset is about 28 KB, so every query bills the 10 MB floor: $0.00005. Creating the table is a DDL statement and scans nothing. Running the full module touches three customers, so roughly $0.00015 per run before reuse, and less once the cache warms. Bedrock Nova Lite is well under a cent for these short prompts. Following this entire tutorial, including a dozen debugging runs, costs a few cents. The only thing that lingers is S3 storage for the data and the accumulated query results, which is pennies per month for kilobytes.
Cleanup
Tear it all down so nothing accrues:
# Drop the Athena table (metadata only).
aws athena start-query-execution \
--query-string "DROP TABLE IF EXISTS orders" \
--query-execution-context Database=default \
--result-configuration "OutputLocation=$OUTPUT_PREFIX" \
--region "$AWS_REGION"
# Empty and delete the bucket (data + query results).
aws s3 rm "s3://$BUCKET" --recursive
aws s3 rb "s3://$BUCKET"
Dropping the table removes only the Glue catalog entry; it never touches your S3 data. Deleting the bucket removes the source JSON and every cached query result Athena wrote under athena-results/. There is nothing else running, because there was never a server to stop.
Where to take it next
First, partition the table. Real order data is huge, and scanning all of it for one customer is the expensive mistake Athena makes easy. Add partition projection by writing data under s3://bucket/orders/dt=2026-06-29/ and declaring partitioned by (dt string) with projection.enabled=true; queries that filter on dt then scan only the matching days, cutting cost by orders of magnitude. Athena's partition projection docs walk through the table properties.
Second, swap JSON for Parquet. Columnar Parquet lets Athena read only the columns your query touches, so a six-column aggregation does not pay to scan fields it ignores. Convert with a one-time CTAS (CREATE TABLE ... AS SELECT) and repoint the table.
Third, put this behind a tool. Wrap build_customer_features as a function-calling tool for a Strands agent (episode 7) so the agent decides when it needs customer context and pulls it on demand. At that point Athena stops being a script dependency and becomes a live data API your agent queries mid-conversation, which is the whole reason serverless SQL belongs in an AI stack.
The reframing: you do not need a database to give a model facts. You need a query engine over the data you already have, a cache for the lookups you repeat, and the discipline to let SQL do retrieval so the LLM only does language.

