SL#86 - Your Agent Reads 150,000 Tokens Before It Reads Your Request

There is a genre of Hacker News thread that shows up whenever a piece of infrastructure crosses from novelty into load-bearing. Someone posts a clean benchmark, a hundred people pile in, and the consensus crystallizes into a slogan. This month the slogan is "we're moving off MCP." Direct API calls instead. CLIs instead. The Model Context Protocol, barely eighteen months old, is suddenly the thing serious teams claim to be outgrowing because the overhead doesn't justify the convenience when you're paying for every token in a production pipeline.

The complaint is not made up. If you wire an agent to a few dozen MCP servers and let the client do what most clients do by default, your model processes a staggering amount of text before it has read a single word of the user's actual request. Anthropic put a number on it: an example workflow that should cost about 2,000 tokens was costing roughly 150,000. That is not a rounding error. That is the model reading the equivalent of a short novel to decide it needs to call two functions.

So the bill is real. But the diagnosis on those threads is wrong, and it's wrong in a way that matters, because the fix people are reaching for (throw out the protocol) discards the one part that was actually worth keeping. The thing that's bloated isn't MCP. It's a convention almost everyone adopted without noticing they had a choice: loading every tool definition into the context window up front, and then shuttling every intermediate result back through the model. Separate the protocol from that convention and the entire backlash evaporates.

The two taxes

Walk through where the tokens actually go, because the failure has two distinct halves and most people only see the first.

The first tax is tool definitions. Most MCP clients load every connected tool's schema into context before the conversation starts. Each definition is a small block of text: name, description, parameters, return shape. Individually trivial. But agents in production today routinely connect to hundreds or thousands of tools across dozens of servers, and the text adds up linearly. Connect enough servers and the model burns six figures of tokens just holding the menu in its head, every single turn, whether or not it touches any of those tools.
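To put rough numbers on that linear growth, here is a back-of-the-envelope sketch. It is a model, not a measurement: the 150 tokens per definition and the five turns are illustrative assumptions, and `projectDefinitionTokens` is a hypothetical helper rather than part of any MCP client.

```typescript
// Back-of-the-envelope model of the tool-definition tax.
// Every input here is an illustrative assumption, not a measured value.
interface DefinitionLoad {
  toolCount: number;              // tools connected across all servers
  avgTokensPerDefinition: number; // assumed average size of one schema
  turns: number;                  // model turns in one conversation
}

function projectDefinitionTokens(load: DefinitionLoad): { perTurn: number; perConversation: number } {
  // Definitions are resent on every turn, so the cost scales with both
  // the number of tools and the number of turns.
  const perTurn = load.toolCount * load.avgTokensPerDefinition;
  return { perTurn, perConversation: perTurn * load.turns };
}

const small = projectDefinitionTokens({ toolCount: 3, avgTokensPerDefinition: 150, turns: 5 });
const large = projectDefinitionTokens({ toolCount: 1000, avgTokensPerDefinition: 150, turns: 5 });

console.log(small.perTurn); // 450
console.log(large.perTurn); // 150000
```

Under these assumptions the same client behavior costs a few hundred tokens at three tools and six figures at a thousand, which is the whole shape of the problem.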

The second tax is sneakier. Even after the model picks the right tools, the data those tools return flows back through the context window. Anthropic's example is the one to sit with. You ask an agent to pull a meeting transcript from Google Drive and attach it to a Salesforce lead. The model calls gdrive.getDocument, the full transcript loads into context, then the model calls salesforce.updateRecord and has to write the entire transcript back out again as an argument. For a two-hour sales meeting, that single hop can mean 50,000 extra tokens, and the transcript passes through the model twice for a task where the model never needed to read a word of it. Worse, on large or structured payloads the model starts making copy errors, mangling the data it's mechanically relaying between two calls.
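The double pass is easier to see when you lay out the conversation history. The sketch below is a simplified model under stated assumptions: the `Message` shape is invented for illustration, the transcript is a 200,000-character placeholder, and tokens are estimated at roughly four characters each rather than with a real tokenizer.

```typescript
// Rough token estimate: about 4 characters per token.
// This is a heuristic assumption, not a real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Simplified, hypothetical message shapes for a direct tool-calling loop.
type Message =
  | { role: "user"; content: string }
  | { role: "assistant_tool_call"; tool: string; args: Record<string, string> }
  | { role: "tool_result"; content: string };

function contextTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => {
    const text = m.role === "assistant_tool_call" ? JSON.stringify(m.args) : m.content;
    return sum + estimateTokens(text);
  }, 0);
}

// Placeholder for a two-hour transcript (about 50,000 tokens by the heuristic).
const transcript = "x".repeat(200000);

const toolCallHistory: Message[] = [
  { role: "user", content: "Attach the meeting transcript to the Salesforce lead." },
  { role: "assistant_tool_call", tool: "gdrive.getDocument", args: { documentId: "abc123" } },
  // Pass 1: the full transcript comes back as a tool result the model must read.
  { role: "tool_result", content: transcript },
  // Pass 2: the model writes the full transcript out again as an argument.
  {
    role: "assistant_tool_call",
    tool: "salesforce.updateRecord",
    args: { recordId: "00Q5f000001abcXYZ", Notes: transcript },
  },
];

console.log(estimateTokens(transcript));     // one copy of the payload
console.log(contextTokens(toolCallHistory)); // a little over two copies
```

Everything except the transcript is noise in that total. The payload appears once as input and once as output, for a task that needed neither.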

Notice that neither tax has anything to do with the protocol. MCP didn't mandate that you preload every schema. It didn't mandate that results round-trip through the model. Those are choices the client makes, defaults that were fine at three tools and quietly catastrophic at three hundred. The slogan blames the standard for the behavior of the harness sitting on top of it.

The fix nobody on those threads is proposing

Here's the part the "drop MCP" crowd skips. The two largest infrastructure shops to publish on this independently did not conclude you should abandon the protocol. They concluded you should stop using it as a tool-calling interface and start using it as a code API.

Anthropic's version: instead of exposing tools as direct calls, generate a filesystem of TypeScript files, one per tool, organized by server. The agent explores the tree the way a developer explores an unfamiliar repo, listing the ./servers/ directory, reading only the specific tool files it needs for the task at hand. It writes ordinary code against those files. The Google Drive to Salesforce task stops being two tool calls with a 50,000-token transcript in the middle and becomes a short script:

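// Generated wrappers, one module per connected MCP server.
import * as gdrive from './servers/google-drive';
import * as salesforce from './servers/salesforce';

// The transcript is held in a variable in the execution environment and
// handed straight to Salesforce; it is never printed back to the model.
const transcript = (await gdrive.getDocument({ documentId: 'abc123' })).content;
await salesforce.updateRecord({
  objectType: 'SalesMeeting',
  recordId: '00Q5f000001abcXYZ',
  data: { Notes: transcript }
});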

The transcript lives in the execution environment. It never enters the model's context at all. The model wrote code that moved data from A to B; it never had to read the data. Together with loading only the tool files the task actually needs, that is what takes the workflow from 150,000 tokens to 2,000, a 98.7% cut, and Anthropic reports the pattern lands somewhere between 50% and 98% across implementations.
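If you want to watch the data stay out of context, here is a runnable stand-in for the same task. The `gdrive` and `salesforce` objects are hypothetical in-memory stubs, not the generated wrappers, and `attachTranscript` is an assumed name; the point is only that the string handed back to the model is a one-line summary while the payload moves between the two calls.

```typescript
// In-memory stand-ins for the generated ./servers/* modules (hypothetical stubs).
const store: Record<string, Record<string, string>> = {};

const gdrive = {
  async getDocument(input: { documentId: string }): Promise<{ content: string }> {
    // Simulate a large document so the payload size is visible.
    const body = "transcript line\n".repeat(10000);
    return { content: `# ${input.documentId}\n${body}` };
  },
};

const salesforce = {
  async updateRecord(input: {
    objectType: string;
    recordId: string;
    data: Record<string, string>;
  }): Promise<void> {
    store[input.recordId] = input.data;
  },
};

// The script the model would write. Only the returned summary would be
// shown to the model; the transcript itself stays in this process.
async function attachTranscript(): Promise<string> {
  const transcript = (await gdrive.getDocument({ documentId: "abc123" })).content;
  await salesforce.updateRecord({
    objectType: "SalesMeeting",
    recordId: "00Q5f000001abcXYZ",
    data: { Notes: transcript },
  });
  return `Attached ${transcript.length} characters to 00Q5f000001abcXYZ`;
}

attachTranscript().then((summary) => console.log(summary));
```

The summary is a few dozen characters no matter how long the meeting ran, which is why the cost stops scaling with the payload.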

Cloudflare published the same insight a few weeks earlier and called it Code Mode. When you connect an MCP server, their Agents SDK fetches the schema and converts it into a typed TypeScript API with doc comments. The model gets exactly one tool, codemode({ code }), which runs its code in a sandboxed Worker isolate that can only reach the outside world through bindings to the connected servers. Their headline measurement is even more lopsided than Anthropic's: the token footprint of working across more than 2,500 API endpoints dropped from over 1.17 million tokens to roughly 1,000. Around 99.9%.
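The schema-to-API step is less exotic than it sounds. Below is a minimal sketch of the idea, not Cloudflare's generator: `ToolSchema` is a deliberately simplified shape (name, description, and a flat input schema of primitive fields), and `toTypeScriptDeclaration` is a hypothetical function name. A real converter would need to handle nested objects, arrays, enums, and return types.

```typescript
// Simplified tool definition: an assumption for illustration, covering only
// flat inputs with primitive field types.
interface ToolSchema {
  name: string;
  description: string;
  inputSchema: {
    properties: Record<string, { type: "string" | "number" | "boolean"; description?: string }>;
    required?: string[];
  };
}

// Render one tool as a typed declaration with doc comments, the kind of
// text a model can read like any other library signature.
function toTypeScriptDeclaration(tool: ToolSchema): string {
  const required = new Set(tool.inputSchema.required ?? []);
  const fields = Object.entries(tool.inputSchema.properties).map(([key, prop]) => {
    const doc = prop.description ? `  /** ${prop.description} */\n` : "";
    const optional = required.has(key) ? "" : "?";
    return `${doc}  ${key}${optional}: ${prop.type};`;
  });
  return [
    `/** ${tool.description} */`,
    `export declare function ${tool.name}(input: {`,
    ...fields,
    `}): Promise<unknown>;`,
  ].join("\n");
}

console.log(
  toTypeScriptDeclaration({
    name: "getDocument",
    description: "Fetch a document by ID.",
    inputSchema: {
      properties: { documentId: { type: "string", description: "ID of the document" } },
      required: ["documentId"],
    },
  }),
);
// Prints:
// /** Fetch a document by ID. */
// export declare function getDocument(input: {
//   /** ID of the document */
//   documentId: string;
// }): Promise<unknown>;
```

The output is the same information the JSON schema carried, in the format the model has seen the most of.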

The reason both teams give for why this works is the most interesting sentence in either post, and it's Cloudflare's: making an LLM do tasks through tool calls "is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it." Tool-calling rides on special tokens the model only ever saw in a small, synthetic, vendor-constructed training set. Code is the opposite. The model has read millions of real repositories. Asking it to write a for loop that calls an API is asking it to do the thing it has the most practice in the world at. Asking it to emit a precise JSON tool-call is asking it to perform in a language it was tutored in for a month. Same underlying capability, wildly different fluency.

What this says about MCP itself

If code execution wins this hard, the obvious question is the one the HN threads think they're answering: why keep MCP at all? If you're going to write code against TypeScript wrappers anyway, why not write code against the real APIs those wrappers were built on top of, and skip the protocol entirely?

Because the protocol was never the tool-calling part. That was always a default, not the point. What MCP actually provides is a uniform way to discover an API, read its documentation, and authorize against it, regardless of whether the agent's authors and the server's authors have ever heard of each other. That uniformity is what lets a sandbox hand an agent a set of capabilities and nothing else. Cloudflare's design leans on exactly this: the sandboxed code has no general network access, only bindings to the MCP servers it was granted, and those bindings carry the auth tokens so the model literally cannot write code that leaks an API key. You don't get that from pointing an agent at fifty bespoke REST APIs. You get it from a standard that handles connectivity, auth, and discovery the same way every time.
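The "bindings carry the auth" idea is ordinary capability passing, and a small sketch makes it concrete. Everything named here is hypothetical: `makeBinding`, `runSnippet`, and the fake transport are illustrations, not the Agents SDK. One loud caveat: `new Function` is not a security boundary. It only demonstrates the shape of the interface; the actual isolation comes from running the snippet in an isolate with no ambient network access.

```typescript
// Records what the fake transport would send; stands in for a real MCP connection.
const sentRequests: { server: string; tool: string; authorization: string }[] = [];

interface Binding {
  call(tool: string, args: Record<string, unknown>): string;
}

// The token is captured in a closure. The returned object exposes only `call`,
// so code holding the binding has no property through which to read the token.
function makeBinding(server: string, token: string): Binding {
  return Object.freeze({
    call(tool: string, args: Record<string, unknown>): string {
      sentRequests.push({ server, tool, authorization: `Bearer ${token}` });
      return `${server}.${tool} ok (${Object.keys(args).length} args)`;
    },
  });
}

// Hypothetical runner: the snippet is handed the granted bindings and nothing else
// by name. NOTE: this is NOT a sandbox. A real deployment needs an isolate that
// also removes globals and general network access.
function runSnippet(code: string, granted: Record<string, Binding>): unknown {
  const fn = new Function("bindings", `"use strict"; ${code}`) as (
    b: Record<string, Binding>,
  ) => unknown;
  return fn(granted);
}

const grantedBindings = { salesforce: makeBinding("salesforce", "secret-token-123") };

const snippetResult = runSnippet(
  `return bindings.salesforce.call("updateRecord", { recordId: "00Q5f000001abcXYZ" });`,
  grantedBindings,
);

console.log(snippetResult);                   // salesforce.updateRecord ok (1 args)
console.log(JSON.stringify(grantedBindings)); // {"salesforce":{}}
```

The request went out with the credential attached, yet nothing the snippet can serialize or log contains it. Fifty bespoke REST clients would each need their own version of this; a uniform protocol lets the harness do it once.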

So the real picture inverts the slogan. The thing people want to throw out, the protocol, is the part worth keeping. The thing they're keeping by accident, tool definitions dumped into context, is the part worth throwing out. The backlash got the right symptom and the wrong organ.

Yes, but code execution is not free

The honest objection: running model-authored code is a heavier operational posture than letting a client dispatch tool calls. You need a real sandbox, with resource limits, timeouts, and monitoring. Anthropic says so plainly in the same post: these infrastructure requirements add overhead and security considerations that direct tool calls simply avoid. A sandbox that can run arbitrary generated code is a meaningfully larger attack surface than a fixed dispatcher that only ever invokes named functions with validated arguments.
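To give a sense of what that overhead looks like at the smallest possible scale, here is a sketch of two guards a harness might wrap around a snippet: a wall-clock timeout and a cap on how much output flows back to the model. Both `withTimeout` and `capOutput` are hypothetical helpers, and the limits are arbitrary. Note what this does not do: it stops waiting for runaway code, it does not stop the code. Memory, CPU, and network limits have to come from the isolate or process boundary.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
// Caveat: this abandons the wait; it cannot interrupt the running code.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer); // avoid leaving a stray timer behind on success
  }
}

// Truncate whatever the snippet logged before it is shown to the model,
// so a careless console.log cannot reintroduce the payload into context.
function capOutput(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return `${text.slice(0, maxChars)}... [truncated ${text.length - maxChars} chars]`;
}

// Usage: a slow task loses the race against a 50 ms deadline.
const slowTask = new Promise<string>((resolve) => setTimeout(() => resolve("done"), 200));
withTimeout(slowTask, 50).catch((err: Error) => console.log(err.message)); // timed out after 50 ms

console.log(capOutput("a".repeat(50), 10)); // aaaaaaaaaa... [truncated 40 chars]
```

None of this exists in a fixed dispatcher because none of it is needed there, which is exactly the objection.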

And the token math only bites at scale. If your agent connects to three tools and pipes around small JSON objects, the definitions cost you a few hundred tokens and nothing round-trips a 50,000-token transcript. At that size, standing up an isolate per code snippet is solving a problem you don't have, and the direct tool-calling path is the correct, boring choice. The 150,000-token horror story is a large-surface, large-payload story. It is not a universal one, and anyone selling code execution as the answer to every agent is overfitting to two blog posts.

Where it flips is exactly where the threads are complaining from: many servers, many tools, large intermediate data, long-running pipelines where token cost compounds across thousands of runs. That is the regime where the 98% number is life-or-death for your margins, and it is also the regime that produced the backlash. The people who feel the pain most are the people best served by the fix they're talking themselves out of.

What to do Monday

Before you migrate anything, measure the tax. Open your agent's actual request payload and count the tokens spent on tool definitions before the user's message appears. If it's under ten thousand and your tools return small results, you do not have this problem, and ripping out MCP for direct calls is motion without progress. Leave it alone.
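A first pass at that measurement can be a dozen lines. This sketch assumes your request payload has a `tools` array and estimates tokens at about four characters each; both are assumptions, so treat the result as an order-of-magnitude check and use your provider's tokenizer or usage report for anything you plan to act on. `measureDefinitionTax` and the payload shape are hypothetical.

```typescript
// Heuristic: about 4 characters per token. Good enough to tell 500 from 50,000.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Assumed request shape; adjust the field names to match your client.
interface RequestPayload {
  tools: unknown[]; // tool definitions sent along with the conversation
  messages: { role: string; content: string }[];
}

function measureDefinitionTax(payload: RequestPayload, threshold = 10000) {
  const definitionTokens = approxTokens(JSON.stringify(payload.tools));
  return { definitionTokens, overThreshold: definitionTokens >= threshold };
}

// Illustrative payload: 400 placeholder tools with small schemas.
const samplePayload: RequestPayload = {
  tools: Array.from({ length: 400 }, (_, i) => ({
    name: `tool_${i}`,
    description: "Illustrative placeholder description for a connected tool.",
    input_schema: { type: "object", properties: { id: { type: "string" } } },
  })),
  messages: [{ role: "user", content: "Attach the transcript to the lead." }],
};

console.log(measureDefinitionTax(samplePayload));
```

The threshold only covers the first tax. A payload that passes this check can still be paying the second one, so look at the size of your largest tool result as well.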

If that number is large, or if you can point to a specific hop where a fat payload travels tool to model to tool, you have found your 98%. The move is not to abandon the protocol. It's to put the tools behind a code-execution boundary: expose them as a typed API or a tool filesystem, give the model one tool that runs code in a sandbox, and let intermediate data stay out of context unless the model explicitly logs it. Keep MCP for what it was always good at, the uniform discovery and auth, and drop the convention that was never actually required, the part where the model reads a novel before it reads the request. The slogan on the thread is wrong. You don't have an MCP problem. You have a context-management problem wearing an MCP costume.
