SL#62 - AWS AI Series (5/30) - Make a Small Model Act Big: Fine-Tune Nova Micro on Bedrock and Benchmark It vs Base

We will fine-tune Amazon Nova Micro on a real intent-classification dataset, deploy the custom model for on-demand inference, and measure it against the base model on a held-out test set. Plan for about 90 minutes of hands-on work plus 1 to 2 hours of unattended training. The training run costs a couple of dollars.

What we are building

By the end of this tutorial you will have a fine-tuned version of Nova Micro that classifies airline-support queries into intents, a custom model deployment you can call through the standard Converse API, and a benchmark script that prints the accuracy of the base model and your fine-tuned model side by side on the same test set.

The non-obvious design choice is the model size. The instinct when accuracy is low is to reach for a bigger model. We are going to do the opposite: take the smallest, cheapest model in the Nova family and teach it one job until it beats models many times its size on that job. AWS published numbers for exactly this setup. Base Nova Micro scores 41.4% on the ATIS intent benchmark. After a single supervised fine-tuning run it scores 97%, a jump of more than 55 points, for a training cost of $2.18. And because Nova supports on-demand inference for customized models, the fine-tuned model is billed per token at the same rate as base Nova Micro. You pay for the training once and then run the smarter model at the cheap-model price.

The finished thing: you send a query like "show me the morning flights from boston to philadelphia" and the model returns a single intent label, flight, reliably, where the base model would have guessed wrong most of the time.

Prerequisites

You need an AWS account with access to Amazon Bedrock and to the Nova models in the us-east-1 (N. Virginia) Region. On-demand inference for customized Nova models is currently available only in us-east-1 and us-west-2, and this build calls Nova Micro on demand in us-east-1, so do everything in N. Virginia. If you have not already done so, request access to Nova Micro in the Bedrock console under Model access before you start.

You need Python 3.10 or later with boto3 installed (version 1.42 or newer so the custom-model-deployment calls exist), and AWS credentials configured locally with permission to call Bedrock and to read and write one S3 bucket. You should be comfortable reading Python and running shell commands. No machine-learning background is required: Bedrock handles the GPUs, the distributed training, and the hyperparameter defaults. You bring the data and one API call.

One cost note up front so nobody is surprised: the $2.18 figure quoted above is for a run of about 1.75 million tokens over 3 epochs. The 800-row synthetic set built in this tutorial is much smaller, so expect the training charge to come in under that. On top of training there is a recurring $1.75 per month to store the custom model until you delete it. Inference during the benchmark is a few cents. Deleting the custom model when you are finished stops the storage charge.

Setup

Install the SDK and confirm your identity and Region.

pip install "boto3>=1.42"
export AWS_REGION=us-east-1
aws sts get-caller-identity
aws bedrock list-foundation-models --region us-east-1 \
  --query "modelSummaries[?contains(modelId, 'nova-micro')].modelId"

The last command should list a Nova Micro model id. If it returns nothing, you have not been granted model access yet. Fix that in the console before continuing.

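Create an S3 bucket for the training data and the training outputs. Keep it in the same Region as the job and block public access. New S3 buckets already encrypt objects at rest with SSE-S3 by default, so no extra encryption command is needed unless you want to use your own KMS key.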

export BUCKET=sl-nova-ft-$(aws sts get-caller-identity --query Account --output text)
aws s3 mb s3://$BUCKET --region us-east-1
aws s3api put-public-access-block --bucket $BUCKET \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

That is the whole environment. Bedrock provisions and tears down the training cluster for you, so there is no infrastructure to stand up.

Step 1: Build the training data in the Nova conversation format

Bedrock expects fine-tuning data as JSONL, one training example per line, in the bedrock-conversation-2024 schema. Each line carries a system prompt, the user turn, and the assistant turn that you want the model to learn to produce. We will generate a small synthetic intent dataset so the tutorial is fully reproducible without downloading anything, but the format is identical to what you would use on the real ATIS dataset.

import json, random

INTENTS = ["flight", "airfare", "airline", "ground_service",
           "abbreviation", "aircraft", "flight_time", "quantity"]

SYSTEM = ("Classify the intent of airline queries. Choose one intent from "
          "this list: flight, airfare, airline, ground_service, abbreviation, "
          "aircraft, flight_time, quantity\n\nRespond with only the intent "
          "name, nothing else.")

TEMPLATES = {
    "flight": ["show me flights from {a} to {b}", "i need a flight to {b} from {a}"],
    "airfare": ["how much is a ticket from {a} to {b}", "cheapest fare {a} to {b}"],
    "airline": ["which airlines fly from {a} to {b}", "what carrier serves {a}"],
    "ground_service": ["car rental in {b}", "ground transportation at {b} airport"],
    "abbreviation": ["what does the fare code y mean", "what is the abbreviation ap57"],
    "aircraft": ["what kind of plane flies {a} to {b}", "aircraft type for this route"],
    "flight_time": ["what time does the {a} flight leave", "departure times from {a}"],
    "quantity": ["how many flights from {a} to {b}", "number of daily flights to {b}"],
}
CITIES = ["boston", "denver", "atlanta", "dallas", "seattle", "miami", "chicago"]

def make_row():
    intent = random.choice(INTENTS)
    a, b = random.sample(CITIES, 2)
    text = random.choice(TEMPLATES[intent]).format(a=a, b=b)
    return {"schemaVersion": "bedrock-conversation-2024",
            "system": [{"text": SYSTEM}],
            "messages": [{"role": "user", "content": [{"text": text}]},
                         {"role": "assistant", "content": [{"text": intent}]}]}

random.seed(7)
rows = [make_row() for _ in range(900)]
with open("train.jsonl", "w") as f:
    for r in rows[:800]:
        f.write(json.dumps(r) + "\n")
with open("valid.jsonl", "w") as f:
    for r in rows[800:]:
        f.write(json.dumps(r) + "\n")

Two things matter here and both come straight from the AWS guidance. First, the system prompt is part of the training example, and the same system prompt must be sent at inference time. The model learns the system prompt as the context that triggers its fine-tuned behavior, so a mismatch at inference quietly degrades accuracy. Second, small and clean beats large and noisy. AWS got to 97% with roughly 5,000 examples; you do not need hundreds of thousands. Curate examples that cover your real cases and drop anything ambiguous or contradictory before training.

Upload both files. Keep training and validation under separate prefixes so the job configuration is unambiguous.

aws s3 cp train.jsonl s3://$BUCKET/training-data/train.jsonl
aws s3 cp valid.jsonl s3://$BUCKET/validation-data/valid.jsonl

Step 2: Create the IAM role Bedrock assumes during training

Bedrock runs the training job under a service role that it assumes on your behalf. The role needs a trust policy allowing bedrock.amazonaws.com to assume it, and a permissions policy scoped to the exact bucket paths for input and output. Do not hand it s3:* on *; scope it.

import json, boto3
iam = boto3.client("iam")
BUCKET = "REPLACE_WITH_YOUR_BUCKET"

trust = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "bedrock.amazonaws.com"},
    "Action": "sts:AssumeRole"}]}

perms = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [f"arn:aws:s3:::{BUCKET}",
                 f"arn:aws:s3:::{BUCKET}/training-data/*",
                 f"arn:aws:s3:::{BUCKET}/validation-data/*",
                 f"arn:aws:s3:::{BUCKET}/output-data/*"]}]}

role = iam.create_role(RoleName="BedrockNovaFineTune",
                       AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="BedrockNovaFineTune",
                    PolicyName="s3-scoped",
                    PolicyDocument=json.dumps(perms))
print(role["Role"]["Arn"])

Copy the printed role ARN. If you later see an AccessDenied on the training data, this policy is almost always the culprit: the prefix in the policy does not match the prefix you uploaded to. Keep them in sync.
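Because a prefix mismatch is the usual cause, it can help to check the policy against the S3 URIs you plan to pass to the job before launching it. The following is a rough local check, not an IAM policy evaluator: it assumes Allow statements with plain Resource patterns and approximates IAM wildcards with Python's fnmatch, which is close for simple patterns such as training-data/*. The function name uncovered_uris is hypothetical.

```python
from fnmatch import fnmatchcase


def uncovered_uris(policy, s3_uris):
    """Return the S3 URIs whose object ARN matches no Allow Resource pattern."""
    patterns = []
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        resource = stmt.get("Resource", [])
        patterns.extend([resource] if isinstance(resource, str) else resource)

    missing = []
    for uri in s3_uris:
        if not uri.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {uri}")
        # s3://bucket/key  ->  arn:aws:s3:::bucket/key
        arn = "arn:aws:s3:::" + uri[len("s3://"):]
        if not any(fnmatchcase(arn, pattern) for pattern in patterns):
            missing.append(uri)
    return missing
```

Calling uncovered_uris(perms, [...]) with the three URIs from the job configuration should return an empty list. Anything it returns is a path the role probably cannot reach.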

Step 3: Launch the supervised fine-tuning job

This is the one API call that does the work. customizationType is FINE_TUNING for supervised fine-tuning. The hyperparameters are passed as strings, and Bedrock applies sensible defaults if you omit them. We follow the AWS reference values: 3 epochs, a learning-rate multiplier of 1e-5, and 10 warmup steps.

import boto3, time
bedrock = boto3.client("bedrock", region_name="us-east-1")
BUCKET = "REPLACE_WITH_YOUR_BUCKET"
ROLE_ARN = "REPLACE_WITH_ROLE_ARN"

job = bedrock.create_model_customization_job(
    jobName="nova-micro-intent-sft",
    customModelName="nova-micro-intent",
    roleArn=ROLE_ARN,
    baseModelIdentifier="amazon.nova-micro-v1:0:128k",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": f"s3://{BUCKET}/training-data/train.jsonl"},
    validationDataConfig={"validators": [
        {"s3Uri": f"s3://{BUCKET}/validation-data/valid.jsonl"}]},
    outputDataConfig={"s3Uri": f"s3://{BUCKET}/output-data/"},
    hyperParameters={"epochCount": "3",
                     "learningRateMultiplier": "0.00001",
                     "learningRateWarmupSteps": "10"})
print(job["jobArn"])

Confirm the exact baseModelIdentifier string for your account with list-foundation-models; model id suffixes change as new context-window variants ship, and an outdated id is the most common reason this call fails. Under the hood Bedrock uses parameter-efficient fine-tuning (it trains a small adapter matrix rather than rewriting all the weights), which is why a run this cheap can move accuracy this far.

Now poll until the job leaves InProgress. Training takes minutes to a couple of hours depending on dataset size; this one is small.

arn = job["jobArn"]
while True:
    s = bedrock.get_model_customization_job(jobIdentifier=arn)["status"]
    print(s, time.strftime("%H:%M:%S"))
    if s in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

When the status reaches Completed, Bedrock has written a custom model plus the training metrics and loss curves to your output bucket. Pull the new model ARN from get_model_customization_job(jobIdentifier=arn)["outputModelArn"]; you need it in the next step.

Step 4: Deploy the custom model for on-demand inference

A fine-tuned Nova model is not immediately callable. You first create a custom model deployment, which gives you an ARN that behaves like any other modelId. This is the piece that makes the economics work: on-demand deployments bill per token at the base model rate, with no provisioned capacity to reserve.

import uuid
out_model_arn = bedrock.get_model_customization_job(
    jobIdentifier=arn)["outputModelArn"]

dep = bedrock.create_custom_model_deployment(
    modelDeploymentName="nova-micro-intent-od",
    modelArn=out_model_arn,
    description="Fine-tuned Nova Micro intent classifier",
    clientRequestToken=f"deploy-{uuid.uuid4()}")
deployment_arn = dep["customModelDeploymentArn"]
print(deployment_arn)

Wait for the deployment to go Active before calling it. Poll with get_custom_model_deployment(customModelDeploymentIdentifier=deployment_arn) and check the status field. Once it is Active, pass the customModelDeploymentArn as modelId to the Converse API. Remember that on-demand deployment works only for models customized on or after 2025-07-16 in the two supported Regions, which is why we pinned everything to us-east-1.
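The polling described here can be sketched as a small helper. It takes the bedrock client from Step 3 as an argument and uses only get_custom_model_deployment and its status field. The timeout, the poll interval, and the name wait_for_deployment are assumptions chosen for illustration, and failureMessage is read defensively in case the response does not include it.

```python
import time


def wait_for_deployment(client, deployment_arn, poll_seconds=30,
                        timeout_seconds=1800, sleep=time.sleep):
    """Poll until the deployment is Active; raise on Failed or on timeout."""
    waited = 0
    while True:
        resp = client.get_custom_model_deployment(
            customModelDeploymentIdentifier=deployment_arn)
        status = resp["status"]
        if status == "Active":
            return resp
        if status == "Failed":
            reason = resp.get("failureMessage", "no failure message returned")
            raise RuntimeError(f"deployment failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

A typical call would be wait_for_deployment(bedrock, deployment_arn). The sleep parameter exists only so the loop can be exercised without real waiting.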

Step 5: Benchmark the fine-tuned model against base

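The whole point is to prove the fine-tune worked, so we run the same held-out queries through both the base model and the deployment and compare. Build a small labeled test set first with the generator from Step 1 and a different seed, then score both models. A new seed changes the sample, but the generator draws from a small pool of templates and cities, so some test queries will repeat training queries exactly. Treat the synthetic score as a smoke test rather than a true held-out measurement.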

import boto3, json, random
rt = boto3.client("bedrock-runtime", region_name="us-east-1")
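# On-demand Converse calls take the plain base model id; the ":128k" variant
# is the customization target used in Step 3. Confirm both ids for your
# account with list-foundation-models.
BASE = "amazon.nova-micro-v1:0"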
TUNED = "REPLACE_WITH_DEPLOYMENT_ARN"

def classify(model_id, text):
    r = rt.converse(
        modelId=model_id,
        system=[{"text": SYSTEM}],            # same SYSTEM string from Step 1
        messages=[{"role": "user", "content": [{"text": text}]}],
        inferenceConfig={"maxTokens": 8, "temperature": 0})
    return r["output"]["message"]["content"][0]["text"].strip().lower()

random.seed(99)
test = [make_row() for _ in range(120)]   # make_row from Step 1
def score(model_id):
    hits = 0
    for row in test:
        text = row["messages"][0]["content"][0]["text"]
        gold = row["messages"][1]["content"][0]["text"]
        if classify(model_id, text) == gold:
            hits += 1
    return hits / len(test)

print(f"base  accuracy: {score(BASE):.1%}")
print(f"tuned accuracy: {score(TUNED):.1%}")

Two details decide whether the comparison is honest. First, send the identical system prompt to both models; the fine-tuned model expects it, and giving the base model the same prompt is the fair test. Second, set temperature to 0 with a tiny maxTokens so the output is as repeatable as possible and the model cannot ramble past the single word you want. AWS reports the gap on the full ATIS benchmark as 41.4% for base Nova Micro versus 97% for the fine-tune. Your synthetic numbers will differ, but the shape will be the same: the tuned model should be dramatically and consistently more accurate.
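A third check is worth considering on the synthetic data. Because the Step 1 generator draws from a small pool of templates and cities, a different seed can still produce queries that already appear in train.jsonl. The sketch below separates test rows into those the model saw during training and those it did not, assuming the single-turn row shape from Step 1. The helper names are illustrative.

```python
def query_text(row):
    """Normalized user text of a single-turn row."""
    return row["messages"][0]["content"][0]["text"].strip().lower()


def split_seen_unseen(test_rows, train_rows):
    """Partition test rows by whether their query also appears in training."""
    trained_on = {query_text(row) for row in train_rows}
    seen = [row for row in test_rows if query_text(row) in trained_on]
    unseen = [row for row in test_rows if query_text(row) not in trained_on]
    return seen, unseen
```

With the variables from the earlier steps, seen, unseen = split_seen_unseen(test, rows[:800]) shows how much overlap there is. Assigning test = unseen before calling score gives a stricter number, at the cost of a smaller test set.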

Verify it works

You have a working result when three things are true. The customization job status reads Completed and get_model_customization_job returns a non-empty outputModelArn. The custom model deployment status reads Active. And the benchmark prints two lines where the tuned accuracy is well above the base accuracy, for example:

base  accuracy: 38.3%
tuned accuracy: 95.8%
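With only 120 test rows, each printed accuracy carries sampling noise, so it helps to put rough bounds on it. The sketch below computes a Wilson score interval, a standard statistical formula rather than anything from the AWS guidance. The default z of 1.96 corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval (low, high) for a proportion of hits out of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# The example output above corresponds to 46 and 115 correct out of 120.
base_low, base_high = wilson_interval(46, 120)
tuned_low, tuned_high = wilson_interval(115, 120)
print(f"base  95% interval: {base_low:.1%} to {base_high:.1%}")
print(f"tuned 95% interval: {tuned_low:.1%} to {tuned_high:.1%}")
```

For the example numbers the two intervals are far apart, which is what you want to see before claiming the fine-tune helped. If your own intervals overlap, enlarge the test set before drawing conclusions.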

If you want to confirm a single prediction by hand, call classify(TUNED, "car rental in denver") and check it returns ground_service, then call the same text against BASE and watch it return something vaguer or wrong. That before-and-after on one example is the most convincing demo when you show this to your team.

You can also open the Bedrock console under Custom models, select your model, and view the training loss curve. A smooth downward curve confirms the model converged. That curve is your receipt that training did something real.
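Once the numbers are recorded, the monthly storage charge mentioned in the prerequisites stops only when the custom model is deleted. A minimal cleanup sketch follows, taking the bedrock client from Step 3 as an argument. It uses delete_custom_model_deployment and delete_custom_model. The ordering (deployment first, then model) and the function name cleanup are assumptions. The model delete may be rejected while the deployment is still being removed, in which case retry it after a short wait.

```python
def cleanup(client, deployment_arn, model_arn):
    """Delete the on-demand deployment, then the custom model behind it."""
    # Remove the deployment first so nothing still references the model.
    client.delete_custom_model_deployment(
        customModelDeploymentIdentifier=deployment_arn)
    # Deleting the custom model is what ends the storage charge.
    client.delete_custom_model(modelIdentifier=model_arn)
```

A typical call would be cleanup(bedrock, deployment_arn, out_model_arn). The training files and outputs in S3 are separate and can be removed with aws s3 rm --recursive if you no longer need them.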

When it breaks

ValidationException on the base model identifier. The baseModelIdentifier string is wrong for your account or Region. Run aws bedrock list-foundation-models --query "modelSummaries[?contains(modelId,'nova-micro')]" and copy the exact id. Do not hardcode an id you saw in a blog post; they carry context-window suffixes that change.

AccessDenied reading the training data. The Bedrock service role cannot reach your S3 prefix. The fix is almost always that the prefix in the IAM policy (training-data/*) does not match where you uploaded the file. Re-check both, and confirm the trust policy names bedrock.amazonaws.com.

The deployment never goes Active, or create_custom_model_deployment errors. You are probably not in us-east-1/us-west-2, or the model was customized before 2025-07-16, or your boto3 is too old to have the call. Upgrade boto3, stay in N. Virginia, and use a freshly trained model.

Tuned accuracy is barely better than base. Look at the loss curve. A flat curve means the model did not learn enough: raise epochCount by one or two, or double learningRateMultiplier. Wild oscillations mean the learning rate is too high: halve learningRateMultiplier. And make sure the system prompt at inference is byte-for-byte the one you trained with.
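The byte-for-byte requirement is easy to verify mechanically. The sketch below hashes the system prompt found in each training line and compares it with the prompt you send at inference. It assumes the JSONL layout from Step 1, where each line has a system list whose first entry holds the text. The function names are illustrative.

```python
import hashlib
import json


def prompt_fingerprint(text):
    """Short SHA-256 fingerprint of the exact UTF-8 bytes of a prompt."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def training_prompt_fingerprints(jsonl_lines):
    """Fingerprints of every distinct system prompt in the training lines."""
    prints = set()
    for line in jsonl_lines:
        if line.strip():
            row = json.loads(line)
            prints.add(prompt_fingerprint(row["system"][0]["text"]))
    return prints


def prompt_matches_training(inference_system, jsonl_lines):
    """True only if training used one prompt and it equals the inference prompt."""
    return training_prompt_fingerprints(jsonl_lines) == {
        prompt_fingerprint(inference_system)}
```

Opening train.jsonl and passing the file object together with SYSTEM returns False if even a trailing space differs, which is exactly the kind of drift that is hard to spot by eye.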

The model returns extra words instead of a bare label. Keep maxTokens at 8 and temperature at 0, as in the benchmark script, and confirm that your training assistant turns contain only the label, with no trailing punctuation or whitespace.
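If stray punctuation or a trailing word still slips through, a small normalizer on the client side can map the raw output back to a known label or flag it as invalid. This is a sketch under the assumption that labels are single tokens like the eight intents in Step 1; normalize_label is a hypothetical helper. Note that applying it changes what the benchmark counts as an exact match, so apply it to both models or to neither.

```python
import string

_STRIP_CHARS = string.punctuation + string.whitespace


def normalize_label(raw, intents):
    """Map raw model output to a known intent, or None if it is not one."""
    cleaned = raw.strip().lower().strip(_STRIP_CHARS)
    if cleaned in intents:
        return cleaned
    # Fall back to the first word, for outputs such as "flight, because ...".
    words = cleaned.split()
    first = words[0].strip(string.punctuation) if words else ""
    return first if first in intents else None
```

Returning None rather than guessing keeps malformed outputs visible, so they can be counted as errors instead of being silently repaired.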

Where to take it next

First, swap the synthetic generator for the real ATIS dataset and reserve a clean test split the model never sees, so your benchmark number means something. That is the difference between a demo and an evaluation.

Second, replace the hand-rolled accuracy loop with Amazon Bedrock Evaluations and an LLM-as-a-judge run, so you grade harder tasks than exact-match classification. The same custom deployment ARN drops straight into an evaluation job.

Third, and this is the bigger leap, try distillation instead of supervised fine-tuning when you do not have labels. Set customizationType to DISTILLATION and pass a customizationConfig with a distillationConfig.teacherModelConfig naming a larger teacher such as Nova Pro or Nova Premier. Bedrock generates synthetic responses from the teacher (data synthesis can expand your set up to 15,000 prompt-response pairs) and fine-tunes the small student on them. You provide prompts, not answers, and still end up with a cheap model that punches above its size. That is the same idea as this tutorial, minus the labeling work. Which raises the real question for your next high-volume task: are you reaching for a bigger model because you need more intelligence, or because you have not yet taught a small one this one job?
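For readers who want to try the distillation route, the job request differs from Step 3 mainly in customizationType and the added customizationConfig. The sketch below only builds the keyword arguments; it does not call AWS. The job name, model name, prompts.jsonl path, and default response length are placeholders. The teacher and student model ids are assumptions that should be confirmed with list-foundation-models, since supported teacher and student pairs vary.

```python
def distillation_job_kwargs(bucket, role_arn, teacher_model_id,
                            student_model_id="amazon.nova-micro-v1:0:128k",
                            max_response_tokens=64):
    """Build kwargs for create_model_customization_job with DISTILLATION."""
    return {
        "jobName": "nova-micro-intent-distill",            # placeholder name
        "customModelName": "nova-micro-intent-distilled",  # placeholder name
        "roleArn": role_arn,
        "baseModelIdentifier": student_model_id,           # the small student
        "customizationType": "DISTILLATION",
        # Prompts only: the teacher generates the responses.
        "trainingDataConfig": {
            "s3Uri": f"s3://{bucket}/training-data/prompts.jsonl"},
        "outputDataConfig": {"s3Uri": f"s3://{bucket}/output-data/"},
        "customizationConfig": {"distillationConfig": {"teacherModelConfig": {
            "teacherModelIdentifier": teacher_model_id,
            "maxResponseLengthForInference": max_response_tokens}}},
    }
```

The result would be passed as bedrock.create_model_customization_job(**kwargs), with the Step 2 role extended to cover whatever prefix holds the prompt file.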