What we are building
By the end you will have two things set up in your AWS account. First, a versioned prompt living in Bedrock Prompt Management, callable by ARN, so the prompt text is no longer an f-string buried in a Python file that nobody reviews. Second, an automated evaluation job that takes a small dataset of test inputs, generates responses with three candidate models, and uses a fourth model as a judge to score each response on correctness, completeness, and helpfulness. You get a per-model report in S3 and a one-screen summary along the lines of: model A scored 0.81, model B scored 0.74, model C scored 0.69.
The non-obvious design choice is to separate the prompt from the model. Most teams hardcode both together, so when they want to try a cheaper model they edit the same string that holds the prompt logic, and they can never answer "did the prompt change or the model change?" Prompt Management makes the prompt a versioned artifact. Evaluation jobs make the model a swappable input. Once those two are separate, "should we switch from Sonnet to Haiku here?" stops being a vibe and becomes a number you can put in a PR description.
Here is what the finished thing does: you run one script to create and version the prompt, one script to launch three eval jobs, and one script to print the scores. Once you finish, you can stop arguing about prompts in standup and start linking to eval results.
Prerequisites
You need an AWS account with Amazon Bedrock enabled in a region where model evaluation is available (us-east-1 and us-west-2 are the safe choices). In the Bedrock console under Model access, request access to the models you want to compare and to the model you'll use as the judge. The judge and every model under test must be in the same region, or the job fails at submit time.
On your machine: Python 3.9 or later, and boto3 1.40 or newer (the prompt management and evaluation shapes below were stable as of boto3 1.43). You should be comfortable reading Python and have run at least one Bedrock call before, ideally episode 1 of this series. You also need an S3 bucket you can write to.
IAM is the part people get wrong. Your own user or role needs bedrock:CreatePrompt, bedrock:CreatePromptVersion, bedrock:CreateEvaluationJob, bedrock:GetEvaluationJob, and bedrock:ListEvaluationJobs, plus bedrock:InvokeModel for the smoke test and iam:PassRole. Separately, the evaluation job runs under a service role that Bedrock assumes on your behalf. That role needs bedrock:InvokeModel on the generator and judge models and read/write on your S3 bucket, with a trust policy that lets bedrock.amazonaws.com assume it. The exact service role policy is in the AWS docs linked at the bottom; create that role before you run Step 4.
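To make the service-role shape concrete, here is a rough sketch that builds the two policy documents as Python dicts. The function names, the account ID, and the bucket name are placeholders, and the aws:SourceAccount condition and the "*" resource are assumptions made for brevity. Treat the AWS docs as the authority for the exact policy.

```python
import json

def trust_policy(account_id):
    """Trust policy that lets the Bedrock service assume the evaluation role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            # Assumption: limit use of the role to your own account.
            "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
        }],
    }

def permissions_policy(bucket):
    """Permissions the job needs: invoke models, read the dataset, write results."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["bedrock:InvokeModel"],
             # Assumption: "*" for brevity; scope this to specific model ARNs in real use.
             "Resource": "*"},
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
             "Resource": ["arn:aws:s3:::%s" % bucket, "arn:aws:s3:::%s/*" % bucket]},
        ],
    }

if __name__ == "__main__":
    print(json.dumps(trust_policy("123456789012"), indent=2))
    print(json.dumps(permissions_policy("my-bedrock-eval-bucket"), indent=2))
```

You can paste the printed JSON into the IAM console when creating the role, then compare it against the documented policy before relying on it.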
Setup
Install the SDK and confirm your identity and region.
python -m venv .venv && source .venv/bin/activate
pip install "boto3>=1.40"
export AWS_REGION=us-east-1
export EVAL_BUCKET=my-bedrock-eval-bucket # a bucket you own
export EVAL_ROLE_ARN=arn:aws:iam::123456789012:role/BedrockEvaluationRole
aws sts get-caller-identity
Model IDs drift, and on-demand access to Nova and Claude usually goes through cross-region inference profiles (the IDs that start with us.), not the bare foundation-model IDs. Do not copy a model ID from a blog post, including this one. Pull the current list yourself:
aws bedrock list-inference-profiles --region $AWS_REGION \
  --query "inferenceProfileSummaries[].inferenceProfileId" --output text
Pick three generator candidates and one judge from that output. A sensible cheap-to-capable spread for the generators is a Nova Micro, a Nova Lite, and a Claude Haiku class model; for the judge, use something at least as capable as your best candidate so it isn't grading work it couldn't do itself. The smoke test that proves your setup works is the list-inference-profiles call returning a non-empty list. If it's empty, you have not requested model access yet.
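If the list is long, a small helper can narrow it to the us.-prefixed profiles for the families you care about. This is only a sketch: the function name, the family substrings, and the sample IDs are made up for illustration, so feed it the real output of the list call.

```python
def us_profiles(profile_ids, families=("nova-micro", "nova-lite", "haiku")):
    """Group us.-prefixed inference profile IDs by a family substring."""
    grouped = {family: [] for family in families}
    for pid in profile_ids:
        if not pid.startswith("us."):
            continue  # skip profiles for other geographies
        for family in families:
            if family in pid:
                grouped[family].append(pid)
    return grouped

# Made-up IDs for illustration only; paste your own list-inference-profiles output.
sample = ["us.example.nova-micro-vX:0", "eu.example.nova-micro-vX:0",
          "us.example.haiku-vX:0"]
print(us_profiles(sample))
```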
Step 1: Create and version the prompt
We'll register a prompt with one input variable and pin a default model and inference config to it.
import boto3, os
agent = boto3.client("bedrock-agent", region_name=os.environ["AWS_REGION"])
resp = agent.create_prompt(
    name="support-summarizer",
    description="Summarize a customer support thread into 3 bullet points.",
    defaultVariant="v-default",
    variants=[{
        "name": "v-default",
        "templateType": "TEXT",
        "templateConfiguration": {
            "text": {
                "text": "Summarize the following support thread into exactly three "
                        "bullet points a manager can scan in 10 seconds.\n\n{{thread}}",
                "inputVariables": [{"name": "thread"}],
            }
        },
        "modelId": "us.amazon.nova-lite-v1:0",  # verify against your list output
        "inferenceConfiguration": {"text": {"temperature": 0.2, "maxTokens": 400}},
    }],
)
prompt_id = resp["id"]
print("prompt id:", prompt_id, "draft arn:", resp["arn"])
ver = agent.create_prompt_version(promptIdentifier=prompt_id)
print("published version:", ver["version"], "arn:", ver["arn"])
The {{thread}} token is the template variable, declared once in inputVariables. The create_prompt call only ever writes to a mutable DRAFT. create_prompt_version snapshots the current draft into an immutable, numbered version, and that is the thing you reference from production code. This is the whole point: version 1 is frozen, so a change to the draft tomorrow cannot silently alter what production runs. Save the version ARN from ver["arn"]; it looks like arn:aws:bedrock:us-east-1:123456789012:prompt/ABCD1234:1 with the trailing :1 being the version.
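Since the version suffix is what keeps production frozen, it can be worth guarding against an unpinned ARN sneaking into config. A minimal sketch, assuming an ARN without a trailing version number refers to the mutable draft; both function names are hypothetical.

```python
def split_prompt_arn(arn):
    """Return (prompt_id, version); version is None when the ARN has no suffix."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("prompt/"):
        raise ValueError("not a prompt ARN: %r" % arn)
    prompt_id, _, version = parts[5][len("prompt/"):].partition(":")
    return prompt_id, (version or None)

def require_pinned(arn):
    """Raise unless the ARN names a numbered version, so callers never follow the draft."""
    prompt_id, version = split_prompt_arn(arn)
    if version is None or not version.isdigit():
        raise ValueError("prompt %s is not pinned to a numbered version" % prompt_id)
    return arn
```

Calling require_pinned on the value you load from configuration turns "someone pasted the draft ARN" into a startup error instead of a silent behavior change.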
Step 2: Invoke the managed prompt by ARN
Before evaluating anything, prove the prompt actually runs. With the Converse API you pass the prompt version ARN as the modelId and supply the variable values in promptVariables. You send no messages; the template provides them.
import boto3, os
runtime = boto3.client("bedrock-runtime", region_name=os.environ["AWS_REGION"])
PROMPT_ARN = "arn:aws:bedrock:us-east-1:123456789012:prompt/ABCD1234:1"  # from Step 1
out = runtime.converse(
    modelId=PROMPT_ARN,
    promptVariables={"thread": {"text":
        "Customer: my export keeps timing out at 30s.\n"
        "Agent: try the async export endpoint.\n"
        "Customer: that worked, thanks."}},
)
print(out["output"]["message"]["content"][0]["text"])
You should see three bullet points. The model and inference settings came from the variant you pinned in Step 1, not from this call, which is exactly what you want: callers ask for a behavior by name, not by wiring up a model string and a temperature every time. If you later decide Nova Lite was the wrong default, you change it in one place and cut a new version, and no caller code changes.
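One way to make "ask for a behavior by name" literal is to hide the call behind a tiny function. This is a sketch, not part of the tutorial's scripts: summarize_thread is a hypothetical name, and the client is passed in so the function can be exercised with a stub.

```python
def summarize_thread(runtime, prompt_arn, thread):
    """Call the managed prompt; model and temperature come from the prompt version."""
    out = runtime.converse(
        modelId=prompt_arn,
        promptVariables={"thread": {"text": thread}},
    )
    return out["output"]["message"]["content"][0]["text"]
```

In real use you would pass the bedrock-runtime client and the version ARN from Step 1. Callers then depend on summarize_thread and an ARN, never on a model string.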
Step 3: Build the evaluation dataset
An automated evaluation job reads a JSONL file from S3 where each line is one test case. For an LLM-as-a-judge job, each record needs a prompt key; a referenceResponse is optional, but include one for metrics like correctness so the judge has something to compare against.
import json, os, boto3
rows = [
    {"prompt": "Summarize: Customer can't log in after password reset. "
               "Agent cleared the session cache and it worked.",
     "referenceResponse": "- Login failed post password reset\n"
                          "- Cause: stale session cache\n- Fix: cache cleared"},
    {"prompt": "Summarize: Billing charged twice. Agent issued a refund "
               "and added a duplicate-charge alert.",
     "referenceResponse": "- Double charge reported\n- Refund issued\n"
                          "- Duplicate-charge alert added"},
    # ... add 15-20 rows total; more rows = more stable scores
]
with open("prompts.jsonl", "w") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")
s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
s3.upload_file("prompts.jsonl", os.environ["EVAL_BUCKET"], "eval/prompts.jsonl")
print("uploaded to s3://%s/eval/prompts.jsonl" % os.environ["EVAL_BUCKET"])
Aim for 15 to 20 rows minimum. Scores from a 3-row dataset are noise; the judge's per-row variance washes out anything you'd want to act on. Keep the dataset in version control next to your code so the eval is reproducible, and so a teammate can add a row when they find a case the model botches in production. That growing file is your regression suite for prompts.
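Because teammates will be appending rows by hand, a quick pre-upload check is cheap insurance. The sketch below is one possible linter, with a hypothetical function name and a default of 15 taken from the guidance above; it only checks what this tutorial relies on (valid JSON, a non-empty prompt, no blank lines).

```python
import json

def check_dataset(lines, min_rows=15):
    """Return a list of problems found in JSONL dataset lines (empty list means OK)."""
    problems, rows = [], 0
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append("line %d: blank line" % n)
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append("line %d: invalid JSON (%s)" % (n, e.msg))
            continue
        if not isinstance(rec, dict) or not str(rec.get("prompt", "")).strip():
            problems.append("line %d: missing or empty 'prompt'" % n)
            continue
        rows += 1
    if rows < min_rows:
        problems.append("only %d valid rows; aim for at least %d" % (rows, min_rows))
    return problems
```

Run it as check_dataset(open("prompts.jsonl")) before the upload and refuse to continue if the list is non-empty.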
Step 4: Launch one judge job per candidate model
An automated job evaluates one generator model. To compare three, you launch three jobs that share the same dataset, the same metrics, and the same judge, varying only the model under test. Holding the dataset, metrics, and judge constant is what makes the scores comparable.
import boto3, os, time
bedrock = boto3.client("bedrock", region_name=os.environ["AWS_REGION"])
DATASET = "s3://%s/eval/prompts.jsonl" % os.environ["EVAL_BUCKET"]
JUDGE = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"  # verify against your list
CANDIDATES = [
    "us.amazon.nova-micro-v1:0",
    "us.amazon.nova-lite-v1:0",
    "us.anthropic.claude-3-5-haiku-20241022-v1:0",
]
METRICS = ["Builtin.Correctness", "Builtin.Completeness", "Builtin.Helpfulness"]
job_arns = {}
for model in CANDIDATES:
    short = model.split(":")[0].split(".")[-1]
    r = bedrock.create_evaluation_job(
        jobName="cmp-%s-%d" % (short, int(time.time())),
        roleArn=os.environ["EVAL_ROLE_ARN"],
        evaluationConfig={"automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {"name": "support-set",
                            "datasetLocation": {"s3Uri": DATASET}},
                "metricNames": METRICS,
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{"modelIdentifier": JUDGE}]},
        }},
        inferenceConfig={"models": [{"bedrockModel": {"modelIdentifier": model}}]},
        outputDataConfig={"s3Uri": "s3://%s/eval/out/" % os.environ["EVAL_BUCKET"]},
    )
    job_arns[model] = r["jobArn"]
    print("launched", short, "->", r["jobArn"])
The model under test is the bedrockModel inside inferenceConfig; the grader is the bedrockEvaluatorModels entry inside evaluatorModelConfig. The Builtin. prefix on metric names is required for the judge's built-in rubrics. Correctness and Completeness lean on the referenceResponse; Helpfulness is reference-free. Jobs run asynchronously and typically take 10 to 30 minutes for a 20-row dataset, so launching all three at once is the move.
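If you ever build the job requests somewhere other than a single loop, it is easy to drift the judge or metrics between jobs without noticing. A small guard like the following can catch that before you submit. It is a sketch with a hypothetical name, and it assumes each request is a dict shaped like the keyword arguments passed to create_evaluation_job in this step.

```python
def assert_comparable(requests):
    """Raise if job requests differ in dataset, metrics, or judge; return the models."""
    first = requests[0]["evaluationConfig"]
    for req in requests[1:]:
        if req["evaluationConfig"] != first:
            raise ValueError(
                "jobs differ in dataset, metrics, or judge; scores are not comparable")
    models = [r["inferenceConfig"]["models"][0]["bedrockModel"]["modelIdentifier"]
              for r in requests]
    if len(set(models)) != len(models):
        raise ValueError("the same model appears in more than one job")
    return models
```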
Step 5: Read the scores and pick a winner
Poll the jobs to completion, then read the results. The console shows a histogram per metric, but the machine-readable scores are in the S3 output prefix, written as JSONL under a folder named after each job.
import boto3, os, time, json
bedrock = boto3.client("bedrock", region_name=os.environ["AWS_REGION"])
s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
# Rebuild job_arns so this script runs on its own: take the three most recent
# jobs whose names contain the "cmp-" prefix used in Step 4.
recent = bedrock.list_evaluation_jobs(
    nameContains="cmp-", sortBy="CreationTime", sortOrder="Descending", maxResults=3)
job_arns = {j["jobName"]: j["jobArn"] for j in recent["jobSummaries"]}
def wait(job_arn):
    while True:
        st = bedrock.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if st in ("Completed", "Failed", "Stopped"):
            return st
        time.sleep(30)
for name, arn in job_arns.items():
    print(name, "->", wait(arn))
# Each job writes per-record results as JSONL under the output prefix.
# Inspect one job's objects, then average the metric scores across rows:
prefix = "eval/out/"
objs = s3.list_objects_v2(Bucket=os.environ["EVAL_BUCKET"], Prefix=prefix)
for o in objs.get("Contents", []):
    if o["Key"].endswith("_output.jsonl"):
        print("results file:", o["Key"])
Open one of the _output.jsonl files. Each line carries the input, the generated response, and the judge's score and explanation per metric. Average each metric across all rows for a job, do that for all three jobs, and you have a comparison table. The explanations matter as much as the numbers: when Nova Micro loses on Completeness, the judge tells you it dropped the third bullet on long threads, which is a far more useful finding than a scalar. That is the difference between "Haiku scored higher" and "Haiku scored higher because the cheaper models truncate multi-issue threads, which is a real failure for our use case."
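The averaging itself is a few lines. The sketch below assumes each output line nests its scores as automatedEvaluationResult.scores, a list of objects with metricName and result keys. That shape is an assumption, not something this article specifies, so open a real _output.jsonl first and adjust the keys if yours differs. The function name is hypothetical.

```python
import json
from collections import defaultdict

def average_scores(lines):
    """Average each metric's score across the rows of one job's output file.

    ASSUMPTION: each line looks like
    {"automatedEvaluationResult": {"scores": [{"metricName": ..., "result": ...}]}}
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for score in record.get("automatedEvaluationResult", {}).get("scores", []):
            if score.get("result") is None:
                continue  # skip rows the judge could not score
            totals[score["metricName"]] += float(score["result"])
            counts[score["metricName"]] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Download each job's file, pass its lines through average_scores, and put the three resulting dicts side by side to get the comparison table.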
Verify it works
Step 1 prints a prompt id and a published version of 1. Step 2 prints three bullet points generated through the managed prompt with no messages array in your call. Step 3 prints an uploaded to s3://... line and the object exists if you run aws s3 ls s3://$EVAL_BUCKET/eval/prompts.jsonl. Step 4 prints three launched ... -> arn:aws:bedrock:...:evaluation-job/... lines. Step 5 eventually prints Completed for all three and lists at least one _output.jsonl result file per job.
The end-state contract: in the Bedrock console under Inference and Assessment, then Evaluations, you see three completed jobs over the same dataset, each with a metrics summary card. If you can open two of them side by side and read different average Correctness scores, the tutorial worked and you have a defensible model choice.
When it breaks
If create_evaluation_job throws AccessDeniedException mentioning a role, your service role is wrong: either the trust policy doesn't allow bedrock.amazonaws.com to assume it, or the role lacks bedrock:InvokeModel on the judge or candidate, or it can't read the dataset bucket. Fix the role, not your user permissions; this error is almost always about the assumed role.
If the job fails fast with a validation error on the model identifier, you used a bare foundation-model ID where on-demand access requires a cross-region inference profile. Re-run the list-inference-profiles call from Setup and use a us.-prefixed ID. The same error appears if the judge and a candidate live in different regions.
If Step 2 returns a validation error about promptVariables, you either pointed modelId at the prompt id instead of the full ARN, or you forgot the :1 version suffix, or your variable name doesn't match the {{thread}} token in the template. The variable key in promptVariables must exactly equal the inputVariables name.
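A name mismatch is easy to check locally before you call the API. This sketch compares the {{tokens}} in a template string with the keys you plan to send. The function name is hypothetical, and the regex assumes variable names are letters, digits, and underscores.

```python
import re

def variable_mismatch(template, prompt_variables):
    """Compare {{tokens}} in a template with the keys passed as promptVariables."""
    # Assumption: variable names consist of letters, digits, and underscores.
    declared = set(re.findall(r"\{\{\s*([A-Za-z0-9_]+)\s*\}\}", template))
    supplied = set(prompt_variables)
    return {"missing": sorted(declared - supplied),
            "unexpected": sorted(supplied - declared)}
```

An empty "missing" and an empty "unexpected" list mean the keys line up; anything else names the variable to fix.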
If a job completes but the output folder is empty, check the job's failureMessages via get_evaluation_job; a common cause is the dataset JSONL having a trailing blank line or a record missing the prompt key, which makes the whole job succeed with zero scored rows.
Cost of following along
There is no surcharge for the evaluation feature itself or for using the console. You pay standard on-demand token rates for two things: the generator producing a response to each test prompt, and the judge reading that response plus its rubric and emitting a score. For a 20-row dataset, three Nova and Haiku class generators, and a Sonnet class judge, expect well under a few dollars total: the judge calls are the largest cost, and there are only 60 responses to grade (20 rows times 3 candidates), each scored on three metrics. Swapping the judge for a Nova Lite model drops it under a dollar at the cost of a slightly noisier grader. The Prompt Management storage is negligible.
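To sanity-check the bill for your own dataset, the arithmetic can be sketched as below. Everything numeric you pass in is your own estimate: the function name is hypothetical, the prices are blended per-1,000-token figures you look up on the current pricing page, and whether the judge is called once per response or once per metric is left as an explicit parameter because it is not something this article pins down.

```python
def estimate_cost(rows, n_models, judge_calls_per_response,
                  gen_tokens, judge_tokens, gen_price_per_1k, judge_price_per_1k):
    """Rough dollar estimate for one comparison run, using blended per-1k-token prices."""
    responses = rows * n_models                       # one generated answer per row per model
    judge_calls = responses * judge_calls_per_response
    gen_cost = responses * gen_tokens / 1000 * gen_price_per_1k
    judge_cost = judge_calls * judge_tokens / 1000 * judge_price_per_1k
    return {"responses": responses, "judge_calls": judge_calls,
            "total_usd": round(gen_cost + judge_cost, 4)}
```

For the setup in this tutorial, rows is 20 and n_models is 3, which gives the 60 generated responses; plug in real token counts and prices to get a dollar figure.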
Cleanup
Delete the prompt and the eval artifacts when you're done so nothing lingers on the bill or in your account.
aws bedrock-agent delete-prompt --prompt-identifier <PROMPT_ID> --region $AWS_REGION
aws s3 rm s3://$EVAL_BUCKET/eval/ --recursive --region $AWS_REGION
Evaluation jobs are terminal records and incur no ongoing cost once Completed, so there's nothing to delete there; the only standing resources are the prompt and the S3 objects. If you created a dedicated service role only for this tutorial, delete that too.
Where to take it next
First, wire the prompt ARN into the code from episode 1 so your CLI calls the managed, versioned prompt instead of an inline string, and cut a new version when you tune it. Second, turn Step 4 into a CI step: on every change to prompts.jsonl or the prompt template, launch the jobs and fail the build if the winning model's average Correctness drops below a threshold you set, which makes prompt regressions visible in code review. Third, and harder, replace the built-in metrics with a custom metric: the API accepts your own judge rubric, so you can score against a property your product actually cares about, like "never invents a refund amount," instead of generic correctness. That last one is where eval stops being a checkbox and starts catching the bugs that page you.
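The CI gate from the second idea can be very small. Here is a sketch, assuming you already have per-model metric averages in a dict; the function names, the model labels in the usage note, and the 0.75 threshold are all placeholders to replace with your own.

```python
def gate(averages, metric="Builtin.Correctness", threshold=0.75):
    """Return (winner, passed): the best model on `metric` and whether it clears the bar."""
    winner = max(averages, key=lambda m: averages[m][metric])
    return winner, averages[winner][metric] >= threshold

def exit_code(averages, metric="Builtin.Correctness", threshold=0.75):
    """0 when the winning model clears the threshold, 1 otherwise (for sys.exit in CI)."""
    _, passed = gate(averages, metric, threshold)
    return 0 if passed else 1
```

In a pipeline you would end the script with sys.exit(exit_code(averages)), where averages maps each model to its metric averages, for example {"model-a": {"Builtin.Correctness": 0.81}}.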

