SL#70 - AWS AI Series (11/30) - Give Your Agent Hands: Build a Fact-Checker With AgentCore Code Interpreter and Browser

What we are building

A model alone is bad at two things that matter for trustworthy answers: it cannot see anything that happened after its training cut-off, and it cannot do arithmetic reliably. It will confidently tell you a year-over-year growth rate and get the division wrong. It will quote a price from memory that changed last quarter.

The fix is to give it hands. By the end of this tutorial you'll have a factcheck agent that does both. Hand it a claim like "the latest Python release is 3.x and that's at least 4 minor versions ahead of 3.9", and it will open a browser to python.org to read the actual current version, then run Python in a sandboxed interpreter to compute the version gap, and return an answer grounded in what it saw and what it computed, not what it remembered.

The two tools come from Amazon Bedrock AgentCore. The Browser tool is a managed, isolated Chrome you drive over the Chrome DevTools Protocol. The Code Interpreter is a sandboxed Python/JavaScript/TypeScript runtime with numpy, pandas, and matplotlib pre-installed. Both run in microVMs in your account, both bill only for active compute, and both plug into a Strands agent as ordinary tools.

The non-obvious design choice: these two tools attack opposite failure modes. The browser imports ground truth the model never had. The interpreter removes the model from the arithmetic entirely. Wire them together and the agent stops guessing on the two things it's worst at.

Prerequisites

You need an AWS account with credentials configured (aws sts get-caller-identity must return your identity), Python 3.10 or newer, and the AWS CLI set up locally. You need Bedrock model access for Anthropic Claude Sonnet 4 enabled in the Bedrock console, in the region you'll work in. AgentCore is not in every region yet; us-west-2 is the safe default and is what every example here uses.

You should be comfortable reading Python and running a virtualenv. No prior AgentCore experience is assumed, but if you followed episode 7 (Strands Agents 101) the agent wiring will look familiar; this is the same Agent object with different tools bolted on.

One cost note up front: both tools bill per second of active compute, and you'll also pay Bedrock token charges for the Claude calls. Following this whole tutorial costs well under two dollars. The exact math is in the cost section near the end.

Setup

Make a project folder and a clean virtualenv, then install the SDK, the Strands framework, its tool pack, and Playwright (the Browser tool drives Chrome through it).

mkdir factcheck-agent && cd factcheck-agent
python3 -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install "bedrock-agentcore>=0.1.0" strands-agents strands-agents-tools playwright nest-asyncio boto3

Now attach IAM permissions to the identity that get-caller-identity returned. You need two policies, one per tool. Create them as inline policies in the IAM console (replace <region> and <account_id>):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CodeInterpreter",
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:StartCodeInterpreterSession",
        "bedrock-agentcore:InvokeCodeInterpreter",
        "bedrock-agentcore:StopCodeInterpreterSession",
        "bedrock-agentcore:GetCodeInterpreterSession",
        "bedrock-agentcore:ListCodeInterpreterSessions"
      ],
      "Resource": "arn:aws:bedrock-agentcore:<region>:<account_id>:code-interpreter/*"
    }
  ]
}

For the Browser tool, add a second policy with StartBrowserSession, StopBrowserSession, GetBrowserSession, ConnectBrowserAutomationStream, and ConnectBrowserLiveViewStream on arn:aws:bedrock-agentcore:<region>:<account_id>:browser/*, plus bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream on * so the agent can call Claude. The full Browser policy is in the AgentCore Browser quickstart linked in Sources.

Smoke test that credentials and region line up before writing any agent code:

aws sts get-caller-identity --region us-west-2

If that returns your account, you're ready.

Step 1: Prove the Code Interpreter works on its own

Before wiring anything into an agent, confirm the sandbox runs your code. The SDK gives you a CodeInterpreter client that manages the session lifecycle. Create smoke_code.py:

from bedrock_agentcore.tools.code_interpreter_client import CodeInterpreter
import json
client = CodeInterpreter("us-west-2")
client.start()
try:
    response = client.invoke("executeCode", {
        "language": "python",
        "code": "print(sum(range(1, 101)))",
    })
    for event in response["stream"]:
        print(json.dumps(event["result"], indent=2))
finally:
    client.stop()

Run it with python smoke_code.py. You should see a streamed result containing 5050. The important moves here: start() boots a fresh microVM and stop() tears it down. The session is where state lives, so variables you define in one invoke survive into the next call on the same client. Always wrap the work in try/finally so a crash still calls stop(). A session left running keeps billing until it hits its timeout (default 900 seconds, max 8 hours).

The executeCode tool name and the {"language", "code"} argument shape are fixed by the API. The interpreter ships with numpy, pandas, and matplotlib, so data work runs without a pip install. If you need a package that isn't there, you'd build a custom interpreter, which is out of scope here.

Step 2: Prove the Browser works on its own

Same idea for the browser: confirm you can drive the managed Chrome before handing it to an agent. The browser_session context manager gives you a CDP WebSocket URL and auth headers; you connect Playwright to that remote browser. Create smoke_browser.py:

import asyncio
from playwright.async_api import async_playwright
from bedrock_agentcore.tools.browser_client import browser_session
async def main():
    with browser_session("us-west-2") as client:
        ws_url, headers = client.generate_ws_headers()
        async with async_playwright() as pw:
            browser = await pw.chromium.connect_over_cdp(ws_url, headers=headers)
            page = browser.contexts[0].pages[0]
            await page.goto("https://www.python.org/downloads/")
            title = await page.title()
            print("Page title:", title)
            await browser.close()
asyncio.run(main())

Run python smoke_browser.py. You should see the downloads page title printed. What just happened: AgentCore booted an isolated Chrome in a microVM in your account, handed you a CDP endpoint, and Playwright connected to it as if it were a local browser. The page never touched your machine. While the script runs you can watch it live in the AgentCore console under Built-in tools, which is the single best debugging feature here; when an agent's browser run goes sideways, the live view shows you exactly where it got stuck.

This is the raw, deterministic path. You're scripting the browser yourself. In the next step we hand the wheel to the model instead.

Step 3: Wire both tools into one Strands agent

Now the actual build. Strands wraps each AgentCore tool in an adapter so the agent can call it through the model's tool-use loop. Create factcheck.py:

from strands import Agent
from strands_tools.code_interpreter import AgentCoreCodeInterpreter
from strands_tools.browser import AgentCoreBrowser
REGION = "us-west-2"
code_tool = AgentCoreCodeInterpreter(region=REGION)
browser_tool = AgentCoreBrowser(region=REGION)
SYSTEM_PROMPT = """You are a fact-checker. You never trust your own memory for
current facts or for arithmetic.
- To learn a current fact (a version number, a price, a date), use the browser
  tool to read it directly from an authoritative page.
- To compute or compare anything numeric, write Python and run it with the code
  interpreter. Never do math in your head.
State the claim, what you found, what you computed, and a clear VERDICT."""
agent = Agent(
    tools=[code_tool.code_interpreter, browser_tool.browser],
    system_prompt=SYSTEM_PROMPT,
)
if __name__ == "__main__":
    claim = (
        "The latest stable Python 3 release shown on python.org/downloads "
        "is at least 4 minor versions ahead of Python 3.9."
    )
    result = agent(claim)
    print(result.message["content"][0]["text"])

Two things are doing the work here. The system prompt is not decoration; it's the control surface. The line "never do math in your head" is what pushes the model to reach for the interpreter instead of emitting a plausible-looking number. Vague tool descriptions are the number one reason agents skip a tool they should use, so the prompt names exactly when each tool applies. The second thing is that tools=[...] takes both adapters, so a single agent now has a browser and a sandbox and decides which to call, in what order, from the task alone.

Step 4: Run it and read the tool-use trace

Run python factcheck.py. The agent will reason roughly like this: the claim has a "current fact" part (what's the latest Python version) and an "arithmetic" part (is the gap from 3.9 at least 4). It opens the browser, reads the version off python.org, then writes a tiny Python snippet in the interpreter to subtract the minor versions and compare against 4, and finally reports a verdict.

You don't have to take the agent's word for which tools it used. Strands records every tool invocation on the result object. Add this before the final print to see the trace:

for block in result.message["content"]:
    if "toolUse" in block:
        print("TOOL CALLED:", block["toolUse"]["name"])

In a clean run you'll see both browser and code_interpreter appear. That trace is the proof the agent actually fetched and actually computed rather than pattern-matching an answer from training data. If you only see one tool, your system prompt isn't forcing the split; tighten it (see the troubleshooting section). This separation is the whole point of the build: the browser supplies a fact the model could not have known, and the interpreter supplies arithmetic the model cannot be trusted to do.

Verify it works

A successful run produces three observable things. First, terminal output ending in a clear verdict, something like: "Claim: latest Python is at least 4 minor versions ahead of 3.9. Found: 3.13 on python.org. Computed: 13 - 9 = 4, which is >= 4. VERDICT: TRUE." The exact version will be whatever python.org shows the day you run it, which is exactly the point; the number comes from the live page, not from this tutorial.

Second, the tool trace from Step 4 prints both browser and code_interpreter. Third, if you open the AgentCore console under Built-in tools while the script runs, you'll see an active browser session reach status Ready, and a code interpreter session appear and then stop. If you see all three, the agent has hands and is using them. If the verdict is right but only one tool fired, the answer happened to be in the model's training data; change the claim to something the model cannot know (today's date arithmetic, a price) to force both tools.

When it breaks

AccessDeniedException on StartBrowserSession or StartCodeInterpreterSession. Your IAM identity is missing the tool's permissions, or the resource ARN region doesn't match. Re-check that the policy Resource uses the same region you pass to the tool constructor, and that aws sts get-caller-identity returns the identity you actually attached the policy to.

The agent answers without calling any tool. This is a prompt problem, not a code problem. The model decided it already knew the answer. Make the system prompt imperative ("you MUST use the browser for any current fact") and pick a test claim whose answer postdates the model's training. An ambiguous tool description also causes loops or skips; describe the tool by when to use it, not just what it is.

Browser session hangs or times out. Default session timeout is 900 seconds. A real run should finish in well under a minute; if it hangs, open the live view in the console to see whether Chrome is stuck on a consent banner or a redirect. The model sometimes needs an explicit nudge in the prompt to dismiss cookie dialogs.

ImportError inside the interpreter. Only the pre-installed libraries (numpy, pandas, matplotlib, and the standard library) are available. If the agent's generated code imports something exotic, it fails inside the sandbox. Tell the agent in the prompt to stick to the standard library and numpy/pandas for analysis tasks.

Sessions you forgot to stop keep billing. The Strands adapters manage their own session lifecycle, but if you wrote direct-client code (Steps 1 and 2) and crashed before stop(), a session can linger until timeout. List and stop stragglers with the cleanup snippet below.

Cost and cleanup

Both the Browser and Code Interpreter bill identically: $0.0895 per vCPU-hour and $0.00945 per GB-hour, per second, with a one-second minimum and a 128 MB memory floor. Crucially, you're billed only for active compute; the 30-70% of an agentic run spent waiting on the model or the network is free. AWS's own pricing example puts a 10-minute browser session at roughly $0.012 and a 2-minute code execution at roughly $0.0036. Running this tutorial a dozen times, you're looking at a few cents of AgentCore compute. The larger line item is Bedrock token usage for the Claude Sonnet 4 calls, which for these short runs is still well under a dollar total. Budget under two dollars to follow the whole thing end to end.

The Strands-managed sessions stop when the agent finishes. For any direct-client work, confirm nothing is left running:

import boto3
c = boto3.client("bedrock-agentcore", region_name="us-west-2")
for s in c.list_code_interpreter_sessions(
        codeInterpreterIdentifier="aws.codeinterpreter.v1").get("items", []):
    print("stopping", s["sessionId"])
    c.stop_code_interpreter_session(
        codeInterpreterIdentifier="aws.codeinterpreter.v1",
        sessionId=s["sessionId"])

You didn't create any persistent resources (no custom browser, no S3 recording bucket), so there's nothing else to delete. Sessions are ephemeral by design.

Where to take it next

First, swap the test claim for one the model genuinely cannot know: "the current USD/EUR rate on a given page implies X euros for $1000." This forces the browser-then-compute split every time and is the cleanest demo of why both tools earn their place.

Second, turn on Browser session recording. Create a custom browser with recording.enabled pointed at an S3 bucket (the Browser quickstart has the exact create_browser call and the execution-role trust policy), then replay the run in the console with a video timeline, DOM snapshots, and network events. For anything user-facing, that audit trail is the difference between "the agent said so" and "here's exactly what it did."

Third, deploy this agent to AgentCore Runtime so it's an HTTP endpoint instead of a script, reusing the harness from episode 8. The tools don't change; only the wrapper does. At that point you have a hosted fact-checking service that browses and computes on demand. The interesting question then isn't whether the agent can use its hands, but how you keep it from using them on the wrong page, which is where Identity and Guardrails come in later in this series.