What we are building

By the end of this tutorial you will have a single Python script, bench.py, that runs the same summarization workload three ways and prints a cost-per-1000-requests comparison. The workload is the one almost every team ends up with: a long static instruction block plus a reference document, followed by a short, changing question. That shape is where Bedrock costs quietly balloon, because you pay full input price to reprocess the same 5000 tokens of context on every single call.

The script does three things. First it runs a naive baseline with the Converse API and records the input and output token counts. Then it inserts a cachePoint so the static prefix is billed at the reduced cache-read rate after the first call. Finally it shows the asynchronous path: the same prompts submitted as a batch job, which Bedrock prices at 50 percent of on-demand. Along the way we swap the bare model ID for a cross-Region inference profile, which costs nothing extra and, in the global variant, is about 10 percent cheaper.

The non-obvious design choice: we never trust a vendor blog's "up to X percent savings" number. We measure our own workload, because the savings from caching depend entirely on your prefix-to-question ratio, and the savings from batch depend on whether you can tolerate a 24-hour turnaround. The harness makes those trade-offs concrete for your data, not a demo's.

Prerequisites

You need an AWS account with Amazon Bedrock enabled in a US Region (this tutorial uses us-east-1), and model access granted for at least one Claude model. Go to the Bedrock console, open Model access, and enable Claude Haiku 4.5 if you have not already. Access is usually instant for Claude models.

You need Python 3.10 or newer and boto3 version 1.40 or newer (pip install -U boto3). Older boto3 releases predate the cachePoint field in the Converse API and reject it during client-side parameter validation, before any request is sent. You also need the AWS CLI configured with credentials that can call Bedrock (aws configure), and for the batch section, permission to create an S3 bucket and an IAM service role.

Assumed knowledge: you are comfortable reading Python, you have called an AWS API with boto3 before, and you understand what an IAM role is. You do not need prior Bedrock experience. Budget roughly 1 to 3 US dollars of model spend to follow along, almost all of it in the baseline runs before caching kicks in. The only resource that lingers after you finish is the S3 bucket from the batch section, which holds pennies of output; delete it when you are done.

Setup

Create a working directory and a virtual environment, then confirm Bedrock answers.

mkdir bedrock-cost && cd bedrock-cost
python3 -m venv .venv && source .venv/bin/activate
pip install -U "boto3>=1.40"
export AWS_REGION=us-east-1
export AWS_DEFAULT_REGION=us-east-1   # the region variable boto3 reads

Smoke test with a one-line Converse call so we fail fast if model access or credentials are wrong:

python3 - <<'PY'
import boto3
br = boto3.client("bedrock-runtime")
r = br.converse(
    modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    messages=[{"role": "user", "content": [{"text": "Reply with the single word: ready"}]}],
)
print(r["output"]["message"]["content"][0]["text"])
print("usage:", r["usage"])
PY

If you see ready and a usage dict with inputTokens and outputTokens, you are set. Notice that the model ID already starts with the us. prefix. That prefix marks a Geographic cross-Region inference profile, not a bare model ID. We are using it from the first line because there is no reason not to: it costs the same as single-Region on-demand, raises your effective throughput, and absorbs traffic bursts by routing across us-east-1, us-east-2, and us-west-2. The bare ID anthropic.claude-haiku-4-5-20251001-v1:0 would also work, but you would cap yourself at one Region's quota for no benefit.

Step 1: Define the workload and a cost model

Before measuring anything, we need a way to turn token counts into dollars. Create costs.py. The multipliers are the part that matters and the part that does not change: a cache read is billed at roughly 10 percent of the standard input rate, and a cache write is billed at a premium over standard input (about 1.25x) as a one-time cost when the prefix first lands in the cache. The per-million base prices below are illustrative; pull the real numbers for your model from the Bedrock pricing page before you quote a figure to anyone.

# costs.py  -- example list prices, VERIFY on the Bedrock pricing page
PRICE = {
    "us.anthropic.claude-haiku-4-5-20251001-v1:0": {
        "input_per_mtok": 1.00,   # standard input, per 1M tokens
        "output_per_mtok": 5.00,  # output, per 1M tokens
    },
}
CACHE_READ_MULT = 0.10   # cache reads ~ 10% of input price
CACHE_WRITE_MULT = 1.25  # cache writes ~ 125% of input price (one-time)

def dollars(model, usage):
    p = PRICE[model]
    inp = usage.get("inputTokens", 0)
    out = usage.get("outputTokens", 0)
    cr = usage.get("cacheReadInputTokens", 0)
    cw = usage.get("cacheWriteInputTokens", 0)
    return (
        inp * p["input_per_mtok"]
        + out * p["output_per_mtok"]
        + cr * p["input_per_mtok"] * CACHE_READ_MULT
        + cw * p["input_per_mtok"] * CACHE_WRITE_MULT
    ) / 1_000_000

The important detail, straight from the Bedrock docs: when caching is on, the inputTokens field counts only the non-cached tokens. The cached ones move into cacheReadInputTokens and cacheWriteInputTokens. So total input equals inputTokens + cacheReadInputTokens + cacheWriteInputTokens, and our dollars() function prices each bucket at its own rate. Get this wrong and you will either over-report your spend or conclude caching did nothing because you forgot to read the new fields.
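To make the bucket arithmetic concrete, here is a small self-contained sketch that prices one uncached call and one cache-hit call. It repeats the illustrative prices from costs.py so it runs on its own, and the two usage dicts are made-up sample values rather than measured output.

```python
# Illustrative prices copied from costs.py; verify on the Bedrock pricing page.
INPUT_PER_MTOK = 1.00
OUTPUT_PER_MTOK = 5.00
CACHE_READ_MULT = 0.10
CACHE_WRITE_MULT = 1.25


def dollars(usage):
    """Price each token bucket at its own rate and return US dollars."""
    inp = usage.get("inputTokens", 0)
    out = usage.get("outputTokens", 0)
    cr = usage.get("cacheReadInputTokens", 0)
    cw = usage.get("cacheWriteInputTokens", 0)
    return (
        inp * INPUT_PER_MTOK
        + out * OUTPUT_PER_MTOK
        + cr * INPUT_PER_MTOK * CACHE_READ_MULT
        + cw * INPUT_PER_MTOK * CACHE_WRITE_MULT
    ) / 1_000_000


def total_input_tokens(usage):
    """Total input is the sum of the three input buckets."""
    return (
        usage.get("inputTokens", 0)
        + usage.get("cacheReadInputTokens", 0)
        + usage.get("cacheWriteInputTokens", 0)
    )


# Hypothetical usage dicts: same 5010 input tokens, split differently.
uncached = {"inputTokens": 5010, "outputTokens": 80}
cache_hit = {"inputTokens": 10, "cacheReadInputTokens": 5000, "outputTokens": 80}

for name, usage in [("uncached", uncached), ("cache hit", cache_hit)]:
    print(f"{name:10} input={total_input_tokens(usage)}  cost=${dollars(usage):.5f}")
```

Both calls process the same number of input tokens; only the bucket each token lands in, and therefore its rate, differs.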

Step 2: Measure the naive baseline

Now bench.py. Build a workload with a fat static prefix (an instruction block plus a reference document) and a short variable question. The prefix needs to clear the model's cache-checkpoint minimum, which for Claude Haiku 4.5 is 4096 tokens, so we pad the document to comfortably exceed that.

# bench.py
import boto3
from costs import dollars

MODEL = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
br = boto3.client("bedrock-runtime")

# ~5k tokens of static context. In real life this is your system prompt
# plus a policy doc, a schema, or a retrieved document.
DOC = ("You are a precise support assistant. Follow these rules strictly.\n"
       + "Rule: cite the policy section. " * 900)
QUESTIONS = [
    "What is the refund window?",
    "Can a customer transfer their plan?",
    "Is tax included in the listed price?",
    "How do I escalate a billing dispute?",
]

def run_baseline():
    total = 0.0
    for q in QUESTIONS:
        r = br.converse(
            modelId=MODEL,
            system=[{"text": DOC}],
            messages=[{"role": "user", "content": [{"text": q}]}],
        )
        total += dollars(MODEL, r["usage"])
    return total

if __name__ == "__main__":
    cost = run_baseline()
    print(f"baseline 4 calls: ${cost:.5f}  -> per 1000 req: ${cost/4*1000:.2f}")

Run it. Every call re-sends and reprocesses the full DOC at standard input price, because nothing is cached. Note the per-1000 number. With a 5000-token prefix and a 10-token question, you are paying for roughly 5010 input tokens every call, and 5000 of those are identical waste. This is the line item that makes finance ask what happened.
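The baseline run also gives you a cheap way to confirm the prefix is large enough to cache. The sketch below estimates the static prefix size from an uncached call's usage dict and compares it with the 4096-token minimum this tutorial states for Claude Haiku 4.5. The helper names and the 20-token allowance for the question are assumptions for illustration.

```python
CACHE_MIN_TOKENS = 4096  # minimum stated in this tutorial for Claude Haiku 4.5


def prefix_tokens_estimate(usage, question_tokens_estimate=20):
    """Estimate the static prefix size from an UNCACHED call's usage dict.

    inputTokens covers prefix plus question, so subtract a rough allowance
    for the question. The allowance is a guess, not a measured value.
    """
    return max(usage.get("inputTokens", 0) - question_tokens_estimate, 0)


def clears_cache_minimum(usage, minimum=CACHE_MIN_TOKENS):
    """True when the estimated prefix meets the model's cache minimum."""
    return prefix_tokens_estimate(usage) >= minimum


# Hypothetical usage dict shaped like a baseline Converse response.
sample = {"inputTokens": 5010, "outputTokens": 80}
print("prefix estimate:", prefix_tokens_estimate(sample))
print("large enough to cache:", clears_cache_minimum(sample))
```

If the check comes back False, lengthen the static block before adding a checkpoint.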

Step 3: Add a cache checkpoint

A cachePoint marks the end of the contiguous prefix you want cached. Place it right after the static content, before anything that changes. The first call pays a one-time cache write; every subsequent call within the TTL pays the cheap cache-read rate for that prefix instead of full input price.

def run_cached():
    total = 0.0
    for q in QUESTIONS:
        r = br.converse(
            modelId=MODEL,
            system=[
                {"text": DOC},
                {"cachePoint": {"type": "default"}},   # cache everything above
            ],
            messages=[{"role": "user", "content": [{"text": q}]}],
        )
        u = r["usage"]
        total += dollars(MODEL, u)
        print(f"  q={q[:24]:24}  in={u.get('inputTokens',0):5} "
              f"cw={u.get('cacheWriteInputTokens',0):5} "
              f"cr={u.get('cacheReadInputTokens',0):5}")
    return total

Add a call to run_cached() in __main__ and run again. The first question prints a large cw (cache write) and a small in. Every question after that prints cw=0 and a large cr (cache read), because the prefix is now served from cache. The default TTL is 5 minutes and resets on every hit, so a steady stream of traffic keeps the cache warm for free. For Claude Haiku 4.5, Sonnet 4.5, and Opus 4.5 you can extend that to 1 hour by writing {"cachePoint": {"type": "default", "ttl": "1h"}}, which is worth it when calls arrive more than 5 minutes apart, such as a slow agent loop or a user who walks away mid-conversation.

Two gotchas. First, the prefix must be byte-stable across calls. Inject a timestamp or a request ID into the cached block and every call becomes a cache miss, so keep volatile content below the cachePoint. Second, the checkpoint only takes effect once the prefix clears the model's minimum, which varies by model (4096 tokens for Claude Haiku 4.5, 1024 for Claude 3.7 Sonnet; check the Bedrock prompt caching documentation for other models). A short prefix silently falls back to uncached billing with no error.
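One way to guard against the first gotcha is to build the system blocks in a single helper and fingerprint everything above the checkpoint. The sketch below is a local sanity check only: build_system and prefix_fingerprint are hypothetical helpers, and the hash is not how Bedrock keys its cache, just a quick signal that your own code is not mutating the prefix.

```python
import hashlib


def build_system(static_doc, volatile_note=None):
    """Static text first, then the cachePoint, then anything that changes."""
    blocks = [{"text": static_doc}, {"cachePoint": {"type": "default"}}]
    if volatile_note:
        blocks.append({"text": volatile_note})  # below the checkpoint, not cached
    return blocks


def prefix_fingerprint(system_blocks):
    """SHA-256 over the text blocks that precede the first cachePoint."""
    digest = hashlib.sha256()
    for block in system_blocks:
        if "cachePoint" in block:
            break
        digest.update(block.get("text", "").encode("utf-8"))
    return digest.hexdigest()


doc = "You are a precise support assistant."
first = build_system(doc, "request-id: 001")
second = build_system(doc, "request-id: 002")
print("prefix stable:", prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint alongside each call's usage dict makes it easy to spot the request where the prefix changed.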

Step 4: Push the bulk volume through batch inference

Caching wins when the same context is reused across nearby calls. The other big lever is for volume that does not need to be real time: nightly summarization, backfills, evaluation runs, bulk classification. Bedrock prices batch inference at 50 percent of on-demand, processes the job asynchronously, and writes results back to S3, typically within 24 hours. You hand it a JSONL file where each line is one request.

import json, boto3

def build_jsonl(path="batch_input.jsonl"):
    with open(path, "w") as f:
        for i, q in enumerate(QUESTIONS):
            rec = {
                "recordId": f"REQ{i:07d}",
                "modelInput": {
                    "system": [{"text": DOC}],
                    "messages": [{"role": "user", "content": [{"text": q}]}],
                    "inferenceConfig": {"maxTokens": 512},
                },
            }
            f.write(json.dumps(rec) + "\n")
    return path

Each line carries a recordId (Bedrock echoes it in the output so you can rejoin results to inputs) and a modelInput whose shape matches the API you select. We will tell the job to read these as Converse-format requests. Upload the file to S3, then submit the job with the control-plane client (bedrock, not bedrock-runtime):

def submit_batch(bucket, role_arn):
    s3 = boto3.client("s3")
    s3.upload_file("batch_input.jsonl", bucket, "in/batch_input.jsonl")
    bedrock = boto3.client("bedrock")
    r = bedrock.create_model_invocation_job(
        jobName="cost-demo-batch",
        roleArn=role_arn,
        modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
        modelInvocationType="Converse",
        inputDataConfig={"s3InputDataConfig": {"s3Uri": f"s3://{bucket}/in/"}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": f"s3://{bucket}/out/"}},
    )
    return r["jobArn"]

The roleArn is a service role Bedrock assumes to read your input bucket and write the output bucket. Create one whose trust policy allows bedrock.amazonaws.com and whose permission policy grants s3:GetObject and s3:ListBucket on the input prefix and s3:PutObject on the output prefix. Poll for completion with bedrock.get_model_invocation_job(jobIdentifier=job_arn) and watch status move from Submitted to InProgress to Completed. When it finishes, every output line carries the original recordId plus a modelOutput field with the model's response, so a dict keyed on recordId reunites questions with answers. The catch is latency: this is a throughput tool, not a request tool. If a user is waiting, batch is the wrong choice no matter how cheap.
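The polling and rejoin steps described above can be sketched as follows. This is a minimal version under stated assumptions: the set of terminal statuses is my reading of the API and should be checked against the current documentation, the bedrock argument is a boto3 "bedrock" client you create yourself, and output_lines is whatever iterable of JSONL strings you have already read from the output prefix (the S3 download is not shown).

```python
import json
import time

# Assumed terminal statuses; confirm against the GetModelInvocationJob docs.
TERMINAL = {"Completed", "PartiallyCompleted", "Failed", "Stopped", "Expired"}


def wait_for_job(bedrock, job_arn, poll_seconds=60, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return that status."""
    while True:
        status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL:
            return status
        sleep(poll_seconds)


def join_results(questions, output_lines):
    """Map each question to its modelOutput using the REQ0000000-style recordId."""
    by_id = {}
    for line in output_lines:
        if line.strip():
            rec = json.loads(line)
            by_id[rec["recordId"]] = rec.get("modelOutput")
    # Records the job did not return map to None rather than raising.
    return {q: by_id.get(f"REQ{i:07d}") for i, q in enumerate(questions)}
```

A typical call would be wait_for_job(boto3.client("bedrock"), job_arn) followed by join_results(QUESTIONS, lines). Passing sleep as a parameter keeps the loop easy to test without waiting.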

Step 5: Take the last 10 percent with global routing

We have been calling the us. Geographic profile, which keeps data inside US Regions for residency and adds throughput at no cost premium. If you have no data-residency constraint, the Global inference profile routes to the optimal commercial Region worldwide and is priced about 10 percent below standard. The cleanest way to find the exact profile ID is to ask Bedrock rather than hardcode it, because the available profiles differ by model and account:

def list_profiles(prefix="global"):
    bedrock = boto3.client("bedrock")
    resp = bedrock.list_inference_profiles(typeEquals="SYSTEM_DEFINED")
    # Only the first page is read here; follow nextToken if you have many profiles.
    for p in resp["inferenceProfileSummaries"]:
        pid = p["inferenceProfileId"]
        if pid.startswith(prefix) and "haiku-4-5" in pid:
            # Each "models" entry carries a modelArn; the Region is the fourth ARN field.
            print(pid, "->", [m["modelArn"].split(":")[3] for m in p["models"]])

Swap the discovered global profile ID into MODEL and rerun the cached benchmark. The token counts are identical; the per-token price is lower, so the per-1000 figure drops another tenth. One operational note worth keeping: because global routing can serve a request from any Region, log analysis changes. Every cross-Region call is recorded in CloudTrail in your source Region with an additionalEventData.inferenceRegion field telling you where it actually ran. If you have a Service Control Policy that blocks Regions, you must allow every destination Region in the profile or the routing fails, and with the global profile that means allowing aws:RequestedRegion of unspecified.
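To see where requests actually ran, you can tally the inferenceRegion field across CloudTrail records. The sketch below assumes you have already fetched and parsed the events into Python dicts (how you export them is not shown), and the sample events are invented for illustration.

```python
from collections import Counter


def inference_regions(events):
    """Count CloudTrail events by additionalEventData.inferenceRegion."""
    counts = Counter()
    for event in events:
        extra = event.get("additionalEventData") or {}
        counts[extra.get("inferenceRegion") or "unknown"] += 1
    return counts


# Hypothetical, already-parsed CloudTrail records.
sample_events = [
    {"eventName": "Converse", "additionalEventData": {"inferenceRegion": "us-east-2"}},
    {"eventName": "Converse", "additionalEventData": {"inferenceRegion": "us-west-2"}},
    {"eventName": "Converse", "additionalEventData": {"inferenceRegion": "us-east-2"}},
    {"eventName": "Converse"},  # no routing data recorded
]

for region, n in inference_regions(sample_events).most_common():
    print(region, n)
```

Comparing this tally with the Regions your Service Control Policy allows shows quickly whether a blocked destination is in play.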

Verify it works

Run the full script. With caching and the cost model wired correctly, your terminal should look roughly like this (your dollar amounts depend on the prices in costs.py and your exact token counts):

baseline 4 calls: $0.02180  -> per 1000 req: $5.45
  q=What is the refund windo  in=   12 cw= 5024 cr=    0
  q=Can a customer transfer   in=   11 cw=    0 cr= 5024
  q=Is tax included in the l  in=   12 cw=    0 cr= 5024
  q=How do I escalate a bill  in=   13 cw=    0 cr= 5024
cached 4 calls:  $0.00940  -> per 1000 req: $2.35

The contract for "it worked" is three signals. First, exactly one call shows a nonzero cw (cache write) and the rest show cw=0 with a large cr (cache read). Second, the cached per-1000 cost is well below the baseline. In this four-call run, with a heavy prefix and light questions, it should land somewhere near half, because the one-time cache write (billed at about 1.25x) is spread over only four calls; over a longer run inside the TTL the write amortizes and the ratio falls further. The ratio climbs back toward the baseline as your questions get longer relative to the prefix. Third, the batch job reaches Completed and the output JSONL in s3://your-bucket/out/ contains one modelOutput per recordId. If all three hold, the harness is measuring real Bedrock billing behavior, not a mock.
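The "near half" figure depends on how many calls share one cache write. The sketch below computes the cached-to-baseline cost ratio for a given number of calls, using the illustrative prices and multipliers from costs.py. It assumes every call lands inside the TTL so only the first call pays the write, and the 80-token output length is a made-up value.

```python
def cached_to_baseline_ratio(prefix, question, output, n_calls,
                             in_price=1.00, out_price=5.00,
                             read_mult=0.10, write_mult=1.25):
    """Cached cost divided by uncached cost for n_calls sharing one prefix."""
    per_call_rest = question * in_price + output * out_price
    baseline = n_calls * (prefix * in_price + per_call_rest)
    cached = (
        prefix * in_price * write_mult                     # first call writes the cache
        + (n_calls - 1) * prefix * in_price * read_mult    # later calls read it
        + n_calls * per_call_rest                          # question and output, every call
    )
    return cached / baseline


for n in (1, 4, 20, 1000):
    ratio = cached_to_baseline_ratio(prefix=5024, question=12, output=80, n_calls=n)
    print(f"{n:5} calls: cached cost is {ratio:.0%} of baseline")
```

With these sample numbers a single call costs more with caching than without, four calls come out a little under half, and the ratio keeps falling as the write is spread over more calls.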

When it breaks

If cacheReadInputTokens is always 0 and cacheWriteInputTokens is always large, your prefix is changing between calls. Something volatile (a timestamp, a UUID, a per-user string) is sitting above the cachePoint. Move it below the checkpoint and the reads will appear.

If both cache fields are always 0, your prefix is under the model's minimum token count, so the checkpoint is being ignored. Claude Haiku 4.5 needs 4096 tokens before the cache engages, and other models have their own thresholds; pad the static block or pick a model with a lower threshold.
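The two cache symptoms above are easy to tell apart in code. Here is a hedged sketch of a hypothetical diagnose_cache helper that takes the list of usage dicts from a run and names the likely cause; the labels are illustrative, not Bedrock error codes.

```python
def diagnose_cache(usages):
    """Classify a run's cache behavior from its per-call usage dicts."""
    writes = [u.get("cacheWriteInputTokens", 0) for u in usages]
    reads = [u.get("cacheReadInputTokens", 0) for u in usages]

    if not any(writes) and not any(reads):
        # Checkpoint ignored: the prefix is probably under the model minimum.
        return "no-cache"
    if not any(reads) and sum(1 for w in writes if w) > 1:
        # Every call rewrites the cache: something above the cachePoint changes.
        return "prefix-changing"
    return "ok"


# Hypothetical runs.
healthy = [{"cacheWriteInputTokens": 5024}, {"cacheReadInputTokens": 5024}]
unstable = [{"cacheWriteInputTokens": 5024}, {"cacheWriteInputTokens": 5031}]
too_short = [{"inputTokens": 900}, {"inputTokens": 905}]

for name, run in [("healthy", healthy), ("unstable", unstable), ("too_short", too_short)]:
    print(name, "->", diagnose_cache(run))
```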

If converse fails with a ParamValidationError naming cachePoint as an unknown parameter, your boto3 is too old; botocore rejects the field client-side before any request is sent. Run pip install -U "boto3>=1.40" inside the venv and confirm with python -c "import boto3; print(boto3.__version__)".

If the batch job fails immediately with an access error, the service role is wrong. The trust policy must let bedrock.amazonaws.com assume the role, and the permission policy must cover both the input prefix (s3:GetObject, s3:ListBucket) and the output prefix (s3:PutObject). A job that submits but never leaves Submitted usually means Bedrock could not read the input location.
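For reference, the two policy documents described here can be sketched as plain Python dicts. This is a minimal version built only from the requirements in the text; the function name and default prefixes are assumptions, and AWS guidance generally recommends narrowing the trust policy further with source-account conditions, which are omitted here.

```python
import json


def batch_role_policies(bucket, in_prefix="in/", out_prefix="out/"):
    """Return (trust_policy, permission_policy) dicts for the batch service role."""
    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket applies to the bucket itself, not to objects in it.
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}"},
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{bucket}/{in_prefix}*"},
            {"Effect": "Allow", "Action": "s3:PutObject",
             "Resource": f"arn:aws:s3:::{bucket}/{out_prefix}*"},
        ],
    }
    return trust, permissions


trust, permissions = batch_role_policies("your-bucket")  # placeholder bucket name
print(json.dumps(trust, indent=2))
print(json.dumps(permissions, indent=2))
```

Comparing the printed JSON with the role attached to a failing job is usually the fastest way to spot a missing action or a mismatched prefix.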

If cross-Region calls fail only sometimes, an SCP is blocking a destination Region. Check the inferenceRegion field in CloudTrail to see which Region the failed routing targeted, then allow it in the policy.

Where to take it next

First, parameterize the prefix-to-question ratio and chart cost versus ratio. The crossover point where caching stops paying for itself is specific to your workload, and seeing the curve tells you whether caching is a real win or a rounding error for your traffic shape. Second, combine the levers: run a batch job whose modelInput records each carry a cachePoint, since the 1-hour TTL was built partly for long-running batch scenarios, and measure whether stacking caching on top of the 50 percent batch discount compounds. Third, wire the per-call usage dict into CloudWatch as a custom metric so cost-per-request becomes a dashboard you watch, not a number you compute after the invoice lands.
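As a starting point for the CloudWatch idea, the sketch below turns one usage dict into the MetricData list that CloudWatch's put_metric_data call accepts. The metric names, the ModelId dimension, and the BedrockCost namespace in the trailing comment are all hypothetical choices, and the actual publish call is left commented out so the block runs without AWS credentials.

```python
def usage_to_metric_data(usage, model_id, cost_usd):
    """Build a CloudWatch MetricData list from one Converse usage dict."""
    dims = [{"Name": "ModelId", "Value": model_id}]
    token_fields = [
        "inputTokens",
        "outputTokens",
        "cacheReadInputTokens",
        "cacheWriteInputTokens",
    ]
    data = [
        {"MetricName": name, "Dimensions": dims,
         "Value": float(usage.get(name, 0)), "Unit": "Count"}
        for name in token_fields
    ]
    data.append({"MetricName": "CostUSD", "Dimensions": dims,
                 "Value": float(cost_usd), "Unit": "None"})
    return data


# Hypothetical usage dict and cost for one cached call.
metrics = usage_to_metric_data(
    {"inputTokens": 12, "outputTokens": 80, "cacheReadInputTokens": 5024},
    model_id="us.anthropic.claude-haiku-4-5-20251001-v1:0",
    cost_usd=0.00091,
)
print(len(metrics), "metric points")

# To publish (requires credentials; the namespace is an assumption):
# boto3.client("cloudwatch").put_metric_data(Namespace="BedrockCost", MetricData=metrics)
```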

The reframing worth keeping: none of these three features is a discount you switch on once. Caching pays off only for reused context, batch only for work that can wait, global routing only when residency does not bind you. The win is not picking one. It is measuring your own workload well enough to know which lever each request should pull.
