SL#57 - AWS AI Series (1/30) - One API, Two Model Families: Build a Bedrock CLI with Converse, Streaming, and Validated JSON

What we are building

This is episode 1 of a 30-part series on the AWS AI, agentic, and data stack. We start where every Bedrock project starts whether it admits it or not: the inference call.

The artifact is a single-file Python CLI, bx, that does three things: send a prompt to any Bedrock model and print the answer, stream the answer token by token with --stream, and force the answer into a JSON schema with --json. Around 120 lines total, no framework, just boto3.

The non-obvious part is why this is worth an episode. Bedrock's Converse API makes the model a config value instead of an architecture decision: the same request shape works for Amazon's Nova family, Anthropic's Claude, Meta's Llama, and the other chat models Converse supports, so swapping models is a string change, not a refactor. And since late 2025, Bedrock enforces structured output server-side: you attach a JSON schema to the request and the service constrains decoding to match it. That is not "please respond in JSON" prompting with a retry loop on top. It is a compiled grammar, cached for 24 hours, applied at the token level. Most of the JSON-repair code you have in production today exists because this feature didn't.

By the end, bx "summarize this incident" --json incident.schema.json --model claude returns parseable, schema-guaranteed JSON, and switching to Nova is --model nova.

Prerequisites

You need an AWS account where you can use Amazon Bedrock, and a region where the models we use are live (I use us-east-1 throughout; us-west-2 also works). You need model access granted for Amazon Nova 2 Lite and Anthropic Claude Sonnet 4.6: in the Bedrock console, go to Model access and enable both. Anthropic models require submitting a short use-case form the first time.

Your IAM identity needs bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream (the Converse and ConverseStream APIs check those two actions respectively). For a sandbox account, the AmazonBedrockLimitedAccess managed policy covers it.

Locally: Python 3.10+, the AWS CLI configured with credentials (aws sts get-caller-identity should return your account), and comfort reading Python. No prior Bedrock experience assumed.

Cost of following along: a few dozen short requests on Nova 2 Lite and Sonnet 4.6, which lands well under one dollar. Check the Bedrock pricing page for current per-token rates; Nova 2 Lite is the cheap one, which is exactly why we default to it.

Setup

Create a project and install a recent boto3. Recent matters: structured output uses the outputConfig request field, and an old botocore will reject it client-side before the request ever leaves your machine.

mkdir bx && cd bx
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade "boto3>=1.40"
export AWS_REGION=us-east-1

Smoke test, before writing any code. This proves your credentials, region, and model access in one shot:

aws bedrock list-foundation-models \
  --by-provider amazon \
  --query "modelSummaries[?contains(modelId,'nova-2')].modelId" \
  --region $AWS_REGION

You should see amazon.nova-2-lite-v1:0 in the output (alongside its siblings). If the list is empty, you are in a region where Nova 2 isn't deployed; switch to us-east-1. If you get AccessDeniedException, fix IAM before continuing, because nothing below will work.

Step 1: The first Converse call

We start with the smallest possible inference call to Nova 2 Lite. Create bx.py:

import boto3
client = boto3.client("bedrock-runtime")
def ask(model_id: str, prompt: str, system: str | None = None) -> dict:
    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }
    if system:
        kwargs["system"] = [{"text": system}]
    return client.converse(**kwargs)
if __name__ == "__main__":
    resp = ask("amazon.nova-2-lite-v1:0", "In two sentences: what does the Converse API abstract away?")
    print(resp["output"]["message"]["content"][0]["text"])
    print(resp["usage"])

Run python bx.py. You get an answer plus a usage dict with inputTokens, outputTokens, and totalTokens.

Three things deserve attention in this tiny block. The Converse API takes messages as a list of role/content turns, so multi-turn conversation is just appending the assistant's reply and your next user message to the same list; we will not build a chat loop here, but nothing about the shape changes when you do. inferenceConfig is the model-agnostic home for maxTokens, temperature, topP, and stopSequences; before Converse existed, each provider spelled these differently inside InvokeModel bodies, and that spelling difference is exactly the lock-in Converse removed. And the usage block is your cost meter; log it from day one, because the day finance asks what the AI feature costs, this dict is the answer.
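To make the first and third points concrete, here is a minimal sketch of what "just append to the list" and "log usage from day one" could look like. Both next_turn and UsageMeter are hypothetical helpers, not part of bx.py, and the response dict at the bottom is hand-written to mirror the shape Step 1 prints rather than captured from a real call.

```python
from dataclasses import dataclass


def next_turn(messages: list[dict], response: dict, user_text: str) -> list[dict]:
    """Return a new history: old turns + the assistant reply + the next user turn."""
    # Converse returns the reply already shaped as {"role": "assistant", "content": [...]},
    # so it can be appended to the history unchanged.
    assistant_message = response["output"]["message"]
    return messages + [
        assistant_message,
        {"role": "user", "content": [{"text": user_text}]},
    ]


@dataclass
class UsageMeter:
    """Hypothetical running total of the usage block across calls."""
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, response: dict) -> None:
        usage = response["usage"]
        self.input_tokens += usage["inputTokens"]
        self.output_tokens += usage["outputTokens"]

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


# Hand-written stand-in for a Converse response (assumed shape, not real output).
fake_response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "It hides provider formats."}]}},
    "usage": {"inputTokens": 12, "outputTokens": 6, "totalTokens": 18},
}

history = [{"role": "user", "content": [{"text": "What does Converse abstract away?"}]}]
history = next_turn(history, fake_response, "Give me one concrete example.")

meter = UsageMeter()
meter.record(fake_response)
print([m["role"] for m in history], meter.total_tokens)
```

Passing the returned history as messages on the next converse call is all a chat loop would need; the meter is the thing you would flush to your logs.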

Step 2: Swap the model family without touching the code

Now the claim that makes Converse interesting: point the same function at Claude. Change the model ID in the main block:

resp = ask(
    "global.anthropic.claude-sonnet-4-6",
    "In two sentences: what does the Converse API abstract away?",
)

Same request shape, same response shape, different frontier lab. That is the whole pitch.

The model ID deserves a pause, because it is the first thing that breaks for newcomers. global.anthropic.claude-sonnet-4-6 is not a model ID; it is an inference profile ID. Recent Claude models on Bedrock are not invokable by their base ID (anthropic.claude-sonnet-4-6) with on-demand throughput. Instead you call a cross-Region inference profile, and Bedrock routes the request to whichever supported Region has capacity. Two flavors exist: geographic profiles (us., eu. prefixes and friends) keep traffic inside a geography for data residency, and the global. prefix routes anywhere for maximum throughput. If your compliance team has opinions about where prompts are processed, this prefix is where those opinions go.

If you use the base ID, you get a ValidationException telling you on-demand throughput isn't supported and pointing you at inference profiles. That error is one of the most-asked Bedrock questions on Stack Overflow and re:Post, and now it will never cost you twenty minutes.
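If you want the CLI to tell you which kind of ID it was handed, the prefix rule above is easy to sketch. describe_model_id is a hypothetical helper, and GEO_PREFIXES is an illustrative, non-exhaustive set that I am assuming for the example (check the inference profiles page for the real list in your account). The us. ID below is only there to show the pattern.

```python
# Assumption: an illustrative, non-exhaustive set of geographic profile prefixes.
GEO_PREFIXES = {"us", "eu", "apac"}


def describe_model_id(model_id: str) -> str:
    """Classify an ID by its first dot-separated segment."""
    prefix = model_id.split(".", 1)[0]
    if prefix == "global":
        return "global inference profile"
    if prefix in GEO_PREFIXES:
        return f"geographic inference profile ({prefix})"
    # Anything else starts with a provider name, e.g. "amazon." or "anthropic.".
    return "base model ID"


for candidate in (
    "amazon.nova-2-lite-v1:0",
    "global.anthropic.claude-sonnet-4-6",
    "us.anthropic.claude-sonnet-4-6",  # pattern example, not verified against a live account
):
    print(f"{candidate}: {describe_model_id(candidate)}")
```

A check like this is only a hint for humans. Whether a base ID is actually invokable on demand is decided by Bedrock, not by the string.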

Step 3: Stream the response

Waiting three seconds staring at a blank terminal is bad UX in a CLI and worse in a product. ConverseStream is the same request with an event-stream response. Add to bx.py:

def ask_stream(model_id: str, prompt: str, system: str | None = None) -> None:
    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }
    if system:
        kwargs["system"] = [{"text": system}]
    stream = client.converse_stream(**kwargs)
    for event in stream["stream"]:
        if "contentBlockDelta" in event:
            print(event["contentBlockDelta"]["delta"].get("text", ""), end="", flush=True)
        elif "metadata" in event:
            print(f"\n\n[tokens: {event['metadata']['usage']['totalTokens']}]")

The stream yields typed events, not raw text: messageStart, then a sequence of contentBlockDelta events each carrying a few characters, then contentBlockStop and messageStop, then a final metadata event with the same usage accounting you got in Step 1. We only print the deltas and the token count. The event types matter once you go past toy usage; tool calls arrive as their own block types on this same stream, which is how every agent framework on AWS (including Strands, episode 7) consumes models under the hood.
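Once you need more than printing, it helps to fold the events into one result. The sketch below is a hypothetical collect_stream helper that takes any iterable of event dicts, so it works on stream["stream"] from converse_stream and equally on a hand-written list. The sample events are written by hand to mirror the sequence described above, not captured from a real call.

```python
from typing import Iterable


def collect_stream(events: Iterable[dict]) -> dict:
    """Fold ConverseStream-style events into text, stop reason, and usage."""
    text_parts: list[str] = []
    stop_reason = None
    usage = None
    for event in events:
        if "contentBlockDelta" in event:
            # Text deltas carry a "text" key; other delta kinds are skipped here.
            text_parts.append(event["contentBlockDelta"]["delta"].get("text", ""))
        elif "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason")
        elif "metadata" in event:
            usage = event["metadata"].get("usage")
        # messageStart, contentBlockStart, and contentBlockStop need no handling here.
    return {"text": "".join(text_parts), "stop_reason": stop_reason, "usage": usage}


# Hand-written events in the order the stream is described to deliver them.
sample_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Hel"}, "contentBlockIndex": 0}},
    {"contentBlockDelta": {"delta": {"text": "lo"}, "contentBlockIndex": 0}},
    {"contentBlockStop": {"contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
    {"metadata": {"usage": {"inputTokens": 5, "outputTokens": 2, "totalTokens": 7}}},
]

result = collect_stream(sample_events)
print(result["text"], result["stop_reason"], result["usage"]["totalTokens"])
```

Taking an iterable instead of the boto3 response keeps the function testable without credentials, which matters the first time a streaming bug shows up in CI.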

Note the permission split: this call needs bedrock:InvokeModelWithResponseStream, not just InvokeModel. Locked-down production roles regularly have one and not the other, and the resulting AccessDeniedException confuses people because "the same call worked yesterday" - it was the non-streaming variant that worked.
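For reference, here is a sketch of a policy document that grants both halves of that split. bedrock_invoke_policy is a hypothetical helper, the Sid is made up, and Resource "*" is a sandbox simplification I am assuming for brevity; a production role would scope it to specific model and inference profile ARNs.

```python
import json


def bedrock_invoke_policy(allow_streaming: bool = True) -> dict:
    """Build an IAM policy document for Converse, optionally with ConverseStream."""
    actions = ["bedrock:InvokeModel"]  # checked by Converse
    if allow_streaming:
        actions.append("bedrock:InvokeModelWithResponseStream")  # checked by ConverseStream
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BxInference",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",  # assumption: sandbox only, scope this down in production
            }
        ],
    }


print(json.dumps(bedrock_invoke_policy(), indent=2))
```

Calling it with allow_streaming=False produces exactly the role that makes --stream fail while the plain call keeps working.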

Step 4: JSON the model cannot break

Here is the feature that retires your regex-repair helpers. We ask the model to extract structured data from a messy incident description, and we attach a schema the response must satisfy. Add:

import json
def ask_json(model_id: str, prompt: str, schema: dict, name: str = "extraction") -> dict:
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
        outputConfig={
            "textFormat": {
                "type": "json_schema",
                "structure": {
                    "jsonSchema": {
                        "schema": json.dumps(schema),
                        "name": name,
                        "description": "Extract structured data from text",
                    }
                }
            }
        },
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])

And a schema plus a test drive:

INCIDENT_SCHEMA = {
    "type": "object",
    "properties": {
        "service": {"type": "string"},
        "severity": {"type": "string", "enum": ["sev1", "sev2", "sev3"]},
        "duration_minutes": {"type": "integer"},
        "root_cause": {"type": "string"},
    },
    "required": ["service", "severity", "duration_minutes", "root_cause"],
    "additionalProperties": False,
}
text = ("Checkout was degraded for about 47 minutes yesterday evening. "
        "Pretty bad, lots of carts dropped. Turned out a config push "
        "exhausted the connection pool on the payments DB.")
print(ask_json("amazon.nova-2-lite-v1:0", f"Extract the incident: {text}", INCIDENT_SCHEMA))

You get back a dict with exactly those four keys, severity guaranteed to be one of the three enum values, duration_minutes guaranteed to be an integer. Per the structured output documentation, Bedrock validates your schema against a supported subset of JSON Schema Draft 2020-12, compiles it into a decoding grammar on first use (which can take noticeable extra seconds on the very first call), caches the compiled grammar for 24 hours, and serves subsequent identical schemas at normal latency. The validation is enforced during generation, not checked after.
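One guard is still worth having around json.loads. I am assuming here, as is usual for constrained decoding, that a response cut off by maxTokens can end mid-object, since the grammar constrains which tokens come next but cannot make the budget larger. The sketch below checks the top-level stopReason before parsing. parse_structured and IncompleteOutput are hypothetical names, and the responses are hand-written.

```python
import json


class IncompleteOutput(Exception):
    """Hypothetical error for responses that stopped before the JSON was finished."""


def parse_structured(response: dict) -> dict:
    """Parse the first text block as JSON, refusing responses cut off by maxTokens."""
    stop_reason = response.get("stopReason")
    if stop_reason == "max_tokens":
        # Assumption: truncated output may be an unfinished JSON document.
        raise IncompleteOutput("hit maxTokens before the object was complete; raise the limit")
    return json.loads(response["output"]["message"]["content"][0]["text"])


# Hand-written stand-ins for Converse responses (assumed shape, not real output).
complete = {
    "stopReason": "end_turn",
    "output": {"message": {"role": "assistant", "content": [{"text": '{"service": "checkout"}'}]}},
}
truncated = {
    "stopReason": "max_tokens",
    "output": {"message": {"role": "assistant", "content": [{"text": '{"service": "chec'}]}},
}

print(parse_structured(complete))
try:
    parse_structured(truncated)
except IncompleteOutput as exc:
    print(f"refused: {exc}")
```

If your schemas allow long strings or arrays, size maxTokens for the worst case rather than the typical one.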

Two practical notes. The schema travels as a JSON string inside jsonSchema.schema, hence the json.dumps; passing a dict there is the most common first-try error. And additionalProperties: false plus a complete required list is what turns "mostly valid" into "valid": without them the model can legally add or omit fields and still satisfy your schema. There is a sibling feature, strict: true on tool definitions, that applies the same grammar enforcement to tool inputs; we will use it when we build agents.
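Both notes can be checked mechanically before a request goes out. The sketch below is a hypothetical pre-flight lint: schema_gaps walks a schema and reports every object that leaves additionalProperties open or a property out of required, and output_config builds the same outputConfig shape as Step 4 with the schema serialized to a string. It only understands type, properties, required, additionalProperties, and items, which is an assumption that keeps it short, not a statement about what Bedrock supports.

```python
import json


def schema_gaps(schema: dict, path: str = "$") -> list[str]:
    """List places where the schema still lets the model add or omit fields."""
    problems: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties is not false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {', '.join(missing)}")
        for name, sub in props.items():
            problems.extend(schema_gaps(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(schema_gaps(schema["items"], f"{path}[]"))
    return problems


def output_config(schema: dict, name: str) -> dict:
    """Build the Step 4 outputConfig; the schema is passed as a JSON string, not a dict."""
    return {
        "textFormat": {
            "type": "json_schema",
            "structure": {
                "jsonSchema": {
                    "schema": json.dumps(schema),
                    "name": name,
                    "description": "Extract structured data from text",
                }
            },
        }
    }


loose = {
    "type": "object",
    "properties": {"service": {"type": "string"}, "severity": {"type": "string"}},
    "required": ["service"],
}
for problem in schema_gaps(loose):
    print(problem)
```

An empty list from schema_gaps means every object in the schema is closed and fully required, which is the "valid, not mostly valid" condition described above.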

Step 5: Wire it into one CLI

Tie the three functions together with model aliases so the swap test is a flag. Append:

import argparse
MODELS = {
    "nova": "amazon.nova-2-lite-v1:0",
    "claude": "global.anthropic.claude-sonnet-4-6",
}
def main():
    p = argparse.ArgumentParser(prog="bx", description="Bedrock from the terminal")
    p.add_argument("prompt")
    p.add_argument("--model", choices=MODELS, default="nova")
    p.add_argument("--stream", action="store_true")
    p.add_argument("--json", dest="schema_file", help="path to a JSON schema file")
    p.add_argument("--system", default=None)
    a = p.parse_args()
    model_id = MODELS[a.model]
    if a.schema_file:
        with open(a.schema_file) as f:
            schema = json.load(f)
        print(json.dumps(ask_json(model_id, a.prompt, schema), indent=2))
    elif a.stream:
        ask_stream(model_id, a.prompt, a.system)
    else:
        r = ask(model_id, a.prompt, a.system)
        print(r["output"]["message"]["content"][0]["text"])
if __name__ == "__main__":
    main()

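Replace the earlier __main__ experiments with this block, and delete the Step 4 test-drive lines (the text sample and its print call), or every run of the CLI will fire an extra extraction request first. Note the precedence: --json wins over --stream, and ask_json as written does not take a system prompt, so --system is ignored in that mode. The MODELS dict is doing quiet architectural work: it is the entire model-selection surface of the app. When Nova 3 or the next Claude lands, supporting it is one line here, zero lines anywhere else. That property does not survive contact with provider-specific SDKs, and it is the main reason to standardize on Converse even when you only use one model today.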
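If you want to push "model as config value" one step further, the alias table can take overrides from the environment, so trying a new profile ID needs no code change at all. This is a sketch under assumptions: resolve_model and the BX_MODEL_ variable naming are hypothetical, not something bx.py or Bedrock defines, and the eu. ID in the demo is only a pattern example.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODELS = {
    "nova": "amazon.nova-2-lite-v1:0",
    "claude": "global.anthropic.claude-sonnet-4-6",
}


def resolve_model(alias: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Map an alias to a model or profile ID, letting BX_MODEL_<ALIAS> override the default."""
    env = os.environ if env is None else env
    override = env.get(f"BX_MODEL_{alias.upper()}")  # hypothetical variable name
    if override:
        return override
    try:
        return DEFAULT_MODELS[alias]
    except KeyError:
        raise ValueError(f"unknown model alias {alias!r}; known: {sorted(DEFAULT_MODELS)}") from None


# Passing a plain dict as env keeps the demo independent of the real environment.
print(resolve_model("nova", env={}))
print(resolve_model("claude", env={"BX_MODEL_CLAUDE": "eu.anthropic.claude-sonnet-4-6"}))
```

The override path is also a natural place for the compliance conversation from Step 2: the geography prefix becomes a deployment setting instead of a code review.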

Verify it works

Save INCIDENT_SCHEMA from Step 4 into incident.schema.json as plain JSON (the schema object itself, not the Python assignment, so Python's False becomes lowercase false). Then:

python bx.py "Three reasons availability zones are not independent failure domains" --model claude --stream
python bx.py "Extract: checkout degraded 47 min, config push exhausted the payments DB pool" \
  --json incident.schema.json --model nova

The first command prints Claude's answer progressively with a final [tokens: N] line. The second prints something very close to:

{
  "service": "checkout",
  "severity": "sev2",
  "duration_minutes": 47,
  "root_cause": "config push exhausted the payments DB connection pool"
}

If both commands behave like that, you have the working artifact: one CLI, two model families, streaming, and schema-guaranteed JSON. Push it to GitHub; episodes 2 and 7 build on this file.

When it breaks

AccessDeniedException ... is not authorized to perform: bedrock:InvokeModel - either IAM (your role lacks the action) or model access (you never enabled the model in the console's Model access page). The error text names the model ARN; if the ARN is there and IAM has the action, it is the console toggle you are missing.

ValidationException: Invocation of model ID ... with on-demand throughput isn't supported - you used a base model ID where an inference profile is required. Use the global. (or your geography's) profile ID, exactly as in MODELS.

ParamValidationError: Unknown parameter in input: "outputConfig" - this one is local, not from AWS. Your boto3/botocore predates structured output. pip install --upgrade boto3 inside the venv and confirm python -c "import boto3; print(boto3.__version__)" reports the version you just installed and not a system-wide one.

First --json call takes much longer than the rest - expected, that is the one-time grammar compilation described in Step 4. If it bothers users, warm the schema at deploy time with a dummy request.

ThrottlingException under load - on-demand quotas are account-and-Region level. Before requesting quota raises, check whether the global. profile already solves it by routing to spare capacity; that is what it is for.
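For the throttling case, a client-side retry with exponential backoff and full jitter is the usual first line of defense. The sketch below is a generic wrapper, not a Bedrock API: call_with_backoff and error_code are hypothetical helpers, and the only thing they assume about the exception is a botocore-style response dict with Error.Code in it. The sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def error_code(exc: Exception) -> Optional[str]:
    """Read a botocore-style error code from exc.response, if the exception has one."""
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return response.get("Error", {}).get("Code")
    return None


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on ThrottlingException with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Anything that is not throttling, or the final attempt, is re-raised as-is.
            if error_code(exc) != "ThrottlingException" or attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

In bx.py this would wrap the request as call_with_backoff(lambda: ask(model_id, prompt)). botocore also has built-in retry modes you can set through its Config object, so check what your client already does before stacking a second retry layer on top.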

Where to take it next

Easiest: add a --compare flag that fires the same prompt at both models concurrently with concurrent.futures and prints answers side by side with token counts; the uniform response shape makes this a ten-line change, and it is a surprisingly effective informal eval harness. Next: make the incident extractor a strict tool instead of a text format, using toolConfig with strict: true, and notice the model can now decide whether the text contains an incident at all. Hardest: point the CLI at a Bedrock Knowledge Base so answers are grounded in your own documents - which is precisely episode 2, where we build production RAG on S3 Vectors.
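As a starting point for the first idea, here is a sketch of the fan-out itself. compare is a hypothetical function that takes the ask callable as a parameter, so it can be fed the real ask from Step 1 or, as in the demo, a fake that returns a hand-written response in the Converse shape without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def compare(ask: Callable[[str, str], dict], models: dict[str, str], prompt: str) -> dict[str, dict]:
    """Send one prompt to every model concurrently and collect text plus token counts."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {alias: pool.submit(ask, model_id, prompt) for alias, model_id in models.items()}
        results = {}
        for alias, future in futures.items():
            resp = future.result()  # re-raises here if that model's call failed
            results[alias] = {
                "text": resp["output"]["message"]["content"][0]["text"],
                "total_tokens": resp["usage"]["totalTokens"],
            }
    return results


def fake_ask(model_id: str, prompt: str) -> dict:
    """Stand-in for ask(): returns a hand-written response in the Converse shape."""
    return {
        "output": {"message": {"role": "assistant", "content": [{"text": f"[{model_id}] {prompt}"}]}},
        "usage": {"inputTokens": 4, "outputTokens": 6, "totalTokens": 10},
    }


models = {"nova": "amazon.nova-2-lite-v1:0", "claude": "global.anthropic.claude-sonnet-4-6"}
for alias, row in compare(fake_ask, models, "Why streaming?").items():
    print(f"{alias} ({row['total_tokens']} tokens): {row['text']}")
```

Threads are enough here because each call spends its time waiting on the network; wiring it to a --compare flag is then a matter of calling compare(ask, MODELS, a.prompt) and formatting the rows.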

The deeper habit this episode installs: on Bedrock, treat the model like you treat an instance type. You would not hardcode m5.xlarge into your application logic. Stop hardcoding your model.