Your LLM Agent Is Drowning in Its Own Context Window

Context windows just hit two million tokens. So why are 5% of production AI requests still failing? Because the industry confused having more space with knowing what to put in it.

Somewhere along the way, "context engineering" became a real job title. That's worth pausing on. A few years ago, the biggest complaint about LLMs was that they couldn't remember anything: you'd hit the token limit, and the model would forget what you were talking about three paragraphs ago. Anthropic, OpenAI, and Google poured enormous effort into extending context windows: from 4K to 32K to 128K to, as of this year, two million tokens for some Gemini tiers.


We got what we asked for. And then we promptly misused it.


Datadog just released its State of AI Engineering 2026 report, based on telemetry from thousands of organizations running LLMs in production. The headline finding sounds good: teams are using more context, deploying more models, building more sophisticated agents. The detail is more uncomfortable. About 5% of all LLM call spans return an error. Of those, nearly 60% fail due to rate limits — not bugs in the model, not bad prompts, but because teams are hammering the provider's capacity ceiling. Meanwhile, 69% of all input tokens go to system prompts. And only 28% of LLM calls use any prompt caching at all, even though most major models support it.


Those numbers, together, tell a story. Teams got bigger windows and filled them with the wrong things, in the wrong order, without caching the parts that don't change. The result is an agent that is expensive to run, fragile under load, and often no more accurate than it would be with a third of the context.


What actually lives inside your system prompt


The 69% figure deserves more unpacking. Datadog analyzed the sender role of tokens across LLM call traces in March 2026. System prompts — the scaffolding developers put above the user conversation — accounted for 69% of all input tokens. That means the actual user input, retrieved documents, tool outputs, and conversation history together accounted for only 31%.


This is backwards. The system prompt is supposed to give the model standing instructions: its role, its constraints, its tools. That content should be compact, stable, and — crucially — cacheable. What's actually happening in production is that teams are putting everything into the system prompt: verbose role descriptions, long policy sections, full tool schemas for every tool whether it's relevant to this call or not, and safety guardrails copy-pasted verbatim from the previous prompt version.


One common pattern: a team builds an agent with 15 tools. Rather than injecting only the tools relevant to the current step, they include all 15 tool schemas in every call. Each schema might be 200–400 tokens. That's 3,000–6,000 tokens of tool definitions the model has to process on every single LLM call in the loop, most of which are irrelevant to what the model is being asked to do right now.
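
Per-step tool injection can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical registry in which each tool is tagged with the agent steps it applies to; the tool names, token sizes, step tags, and the `select_tools` helper are all invented for the example and do not come from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    schema_tokens: int               # approximate size of the tool schema, in tokens
    steps: frozenset = frozenset()   # hypothetical tags: which agent steps use this tool


# Hypothetical registry: 15 tools at 200, 300, or 400 tokens each.
REGISTRY = [
    Tool(f"tool_{i}", 200 + (i % 3) * 100, frozenset({"search" if i < 3 else "other"}))
    for i in range(15)
]


def select_tools(registry, step):
    """Return only the tools tagged for the current agent step."""
    return [t for t in registry if step in t.steps]


all_tokens = sum(t.schema_tokens for t in REGISTRY)
step_tools = select_tools(REGISTRY, "search")
step_tokens = sum(t.schema_tokens for t in step_tools)
print(f"all tools: {all_tokens} tokens; 'search' step only: {step_tokens} tokens")
```

One trade-off to keep in mind: if tool schemas sit inside a cached prefix, changing the tool set on every step also changes the prefix. A small number of fixed tool bundles, one per step type, may cache better than fully dynamic selection.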


The model isn't getting smarter from those extra tokens. It's getting noisier. And your bill is going up.


The caching problem: 72% of calls are paying full price


If your system prompt is large and mostly stable across calls, prompt caching is the single highest-leverage optimization available to you right now. The math is brutal: most providers charge significantly less for cached-read tokens than for fresh input tokens. Anthropic's caching, for instance, charges cache writes at 1.25x the base rate but cache reads at 0.1x — a 90% reduction for any token that hits the cache.
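
To make that arithmetic concrete, here is a small worked example using the 1.25x write and 0.1x read multipliers quoted above. The workload (a 6,000-token stable prefix, 1,000 dynamic tokens, 10 calls) is hypothetical, and the sketch assumes the cache never expires between calls, which will not hold for every traffic pattern.

```python
def relative_input_cost(prefix_tokens, dynamic_tokens, calls,
                        write_mult=1.25, read_mult=0.10):
    """Compare input cost, in base-rate token units, with and without caching.

    Assumes the first call writes the prefix to the cache and every later
    call reads it (no expiry between calls).
    """
    uncached = calls * (prefix_tokens + dynamic_tokens)
    cached = (prefix_tokens * write_mult                  # first call: cache write
              + (calls - 1) * prefix_tokens * read_mult   # later calls: cache reads
              + calls * dynamic_tokens)                   # dynamic part is never cached
    return uncached, cached


uncached, cached = relative_input_cost(prefix_tokens=6000, dynamic_tokens=1000, calls=10)
print(f"uncached: {uncached:.0f}, cached: {cached:.0f}, "
      f"saving: {1 - cached / uncached:.0%}")
```

Under these assumptions the cached layout costs roughly a third of the uncached one for input tokens. The saving grows with the number of calls and with the share of the prompt that is stable.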


Datadog found that only 28% of LLM call spans showed any cached-read input tokens, even when the analysis was restricted to models that actually support the feature. That means roughly 72% of calls to cache-capable models pay full price for every input token, including content that was often already sent on the previous call.


The reason is almost always prompt layout. Caching works by prefix matching — the provider caches the beginning of your prompt up to a stable breakpoint, and reuses that cached state on subsequent calls. If your prompt injects dynamic content early (say, the current timestamp, or the user's name, or a freshly retrieved document), the prefix changes every call and the cache never hits. The stable content — your system instructions, your tool schemas, your safety policy — needs to live at the top of the prompt, before anything that varies per call.
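
The effect of ordering can be illustrated with a toy comparison. This sketch measures the shared leading characters of two consecutive prompts as a rough proxy for a cacheable prefix; real providers match on tokens and have their own minimum lengths and breakpoint rules, so treat the numbers as illustrative only. The prompt contents and helper names are hypothetical.

```python
import os

# Hypothetical stable content: identical on every call.
STABLE_PREFIX = "\n".join([
    "ROLE: You are a support agent.",
    "POLICY: Never reveal account numbers.",
    "TOOLS: lookup_order, refund_order",
])


def build_prompt(user_name, timestamp, question, stable_first=True):
    """Assemble a prompt with the per-call content after or before the stable part."""
    dynamic = f"TIME: {timestamp}\nUSER: {user_name}\nQUESTION: {question}"
    parts = [STABLE_PREFIX, dynamic] if stable_first else [dynamic, STABLE_PREFIX]
    return "\n".join(parts)


def shared_prefix_chars(a, b):
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


call_1 = ("Ada", "2026-03-01T10:00:00Z", "Where is my order?")
call_2 = ("Bob", "2026-03-01T10:00:07Z", "Can I get a refund?")

good = shared_prefix_chars(build_prompt(*call_1), build_prompt(*call_2))
bad = shared_prefix_chars(build_prompt(*call_1, stable_first=False),
                          build_prompt(*call_2, stable_first=False))
print(f"stable-first shares {good} chars; dynamic-first shares {bad} chars")
```

With the stable block first, the two calls share the whole of it. With the timestamp first, the prompts diverge before the stable block is even reached.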


Context quality: the retrieval problem nobody's benchmarking


Here's the other side of the context engineering problem: for agentic workloads that do retrieval, the optimization instinct is to retrieve more. The model has a million tokens now — why not pull in 50 documents instead of 5?


Because retrieval at scale introduces noise faster than signal. When you retrieve 50 documents and inject them into context, you're betting that the model can find the 3 relevant passages buried in 47 mostly-irrelevant ones. Research consistently shows that LLM accuracy degrades when the relevant information is surrounded by irrelevant but plausible-looking text. A related position effect is often called the "lost in the middle" problem: models tend to use information near the start and end of a long context more reliably than information in the middle, so relevant facts buried mid-context get systematically underweighted.


The implication is that context size is not a substitute for retrieval quality. A well-tuned retrieval system that returns 5 highly relevant chunks will outperform a naive system that returns 50 loosely related ones, even if the second system uses a model with a much larger context window. The constraint moved from "how many tokens can I fit?" to "how reliably can I identify which tokens actually matter?"


This is what Datadog means when they say context quality — not volume — is the new limiting factor. The majority of their customers aren't anywhere near the context ceiling of their chosen models. The ceiling isn't the problem. The floor is: what's the minimum well-structured context that gives the model what it needs to act correctly?


Concretely, this means investing in retrieval quality over retrieval quantity, deduplication and compression of conversation history, and dynamic tool loading that injects only the schemas relevant to each step.
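
As one example of the history side of this, the sketch below drops exact-duplicate messages and then keeps only the most recent ones that fit a token budget. It is a deliberately simple illustration: the `compact_history` helper is hypothetical, whitespace splitting stands in for a real tokenizer, and production systems would more likely summarize old turns than discard them.

```python
def compact_history(messages, max_tokens, count_tokens=lambda s: len(s.split())):
    """Drop exact-duplicate messages, then keep the newest ones within max_tokens.

    messages: list of (role, text) tuples in chronological order.
    count_tokens: crude whitespace proxy by default; swap in a real tokenizer.
    """
    seen = set()
    newest_first = []
    for role, text in reversed(messages):    # walk newest first so the latest copy wins
        if (role, text) in seen:
            continue
        seen.add((role, text))
        newest_first.append((role, text))

    kept, total = [], 0
    for role, text in newest_first:
        cost = count_tokens(text)
        if total + cost > max_tokens:
            break                            # everything older is dropped
        kept.append((role, text))
        total += cost
    return list(reversed(kept))              # restore chronological order


history = [
    ("user", "find order 42"),
    ("tool", "order 42 shipped"),
    ("user", "find order 42"),
    ("tool", "order 42 shipped"),
    ("user", "thanks"),
]
print(compact_history(history, max_tokens=7))
```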


The rate limit failure cascade


The 5% error rate with 60% of failures attributed to rate limits reveals a systems problem downstream of context bloat. When agents use more tokens per call than necessary, they burn through rate limits faster. When they don't cache, they re-process stable content that would otherwise be cheap. The result is that the provider's TPM (tokens per minute) cap becomes the ceiling for the whole system.


But there's a compounding effect specific to agents that Datadog flags: agent loops with variable iteration depth. A ReAct-style agent that calls tools in a loop doesn't have a fixed token budget per user request. A simple query might resolve in 3 hops; an ambiguous one might take 12. Because each hop typically re-sends the accumulated history, a 12-hop request usually consumes well over four times the tokens of a 3-hop one. When the 12-hop cases cluster (say, during a traffic spike), multiple concurrent agents, each in a long loop, can exhaust shared rate limits in seconds and trigger a cascade of retries that makes the problem worse.
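
A toy model shows why long loops are disproportionately expensive. It assumes every hop re-sends a fixed base prompt plus everything accumulated so far; the 4,000-token base and 800 tokens added per hop are made-up figures, not measurements.

```python
def cumulative_input_tokens(hops, base_tokens=4000, tokens_added_per_hop=800):
    """Total input tokens over a loop that re-sends the full history on each hop."""
    return sum(base_tokens + i * tokens_added_per_hop for i in range(hops))


short_loop = cumulative_input_tokens(3)
long_loop = cumulative_input_tokens(12)
print(f"3 hops: {short_loop} tokens; 12 hops: {long_loop} tokens; "
      f"ratio: {long_loop / short_loop:.1f}x")
```

In this model, four times as many hops costs seven times as many input tokens, because the history term grows with the square of the hop count.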


The operational fixes: set token budgets per agent loop (not just step limits), route lightweight tasks to smaller models, and implement backpressure at the request queue rather than relying on retry storms.
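
The first of those fixes, a token budget per loop, might look something like this. It is a sketch under stated assumptions: `LoopBudget` and `run_agent` are hypothetical names, and `step_fn` stands in for one model-plus-tool hop that reports how many tokens it used and whether the task is finished. Queue-level backpressure is a separate concern and is not shown.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    max_tokens: int
    max_steps: int
    used_tokens: int = 0
    steps: int = 0

    def charge(self, tokens):
        """Record one completed step and the tokens it consumed."""
        self.used_tokens += tokens
        self.steps += 1

    @property
    def exhausted(self):
        return self.used_tokens >= self.max_tokens or self.steps >= self.max_steps


def run_agent(step_fn, budget):
    """Call step_fn until it reports done or the budget runs out.

    step_fn is a hypothetical callable returning (tokens_used, done).
    """
    while not budget.exhausted:
        tokens, done = step_fn()
        budget.charge(tokens)
        if done:
            return "done"
    return "budget_exhausted"


# A loop that never converges is cut off by tokens long before the step limit.
runaway = LoopBudget(max_tokens=20_000, max_steps=12)
print(run_agent(lambda: (5_000, False), runaway), runaway.steps)
```

The point of budgeting tokens rather than only steps is that a few verbose hops can cost as much as many short ones; the step limit alone does not bound spend.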


Yes, but: doesn't more context improve accuracy?


The strongest objection to the "less is more" framing is empirical: on many benchmarks, giving the model more context does improve accuracy. The RULER and NIAH (needle-in-a-haystack) benchmarks show that frontier models can attend correctly to information in very long contexts when the surrounding content is low-noise. So isn't the right move just to include everything and let the model sort it out?


The issue is that production agents aren't operating on a benchmark. The content injected into production contexts isn't curated; it includes retrieval results of variable quality, tool outputs that may be verbose, conversation history with outdated assumptions, and system-prompt content written by multiple engineers over six months. In that environment, the "more is better" thesis breaks down because the noise floor rises faster than the signal. There's also the cost and latency dimension the benchmarks ignore: twice as many input tokens means roughly twice the input cost, absent caching, and meaningfully higher latency at inference time.


What to do on Monday


First, instrument your cache-hit rate. If you're running on Anthropic, OpenAI, or Gemini, your API responses return cached token counts. A cache-hit rate under 30% almost certainly means a prompt ordering problem. Move stable content — tool schemas, policy text, role descriptions — to the top of the system prompt, before any dynamic injections.
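
A minimal way to compute the rate from logged usage data is sketched below. The record keys `input_tokens` and `cached_tokens` are hypothetical normalized names; each provider reports cached counts under its own field names, so the mapping from a real API response is left as an assumption to check against the provider's reference.

```python
def cache_stats(calls):
    """Summarize caching over a batch of logged calls.

    calls: iterable of dicts with hypothetical normalized keys
      'input_tokens'  - total input tokens, including any served from cache
      'cached_tokens' - input tokens served from cache on that call
    """
    calls = list(calls)
    total_in = sum(c["input_tokens"] for c in calls)
    total_cached = sum(c["cached_tokens"] for c in calls)
    calls_with_hit = sum(1 for c in calls if c["cached_tokens"] > 0)
    return {
        # Share of calls with any cached read, the measure Datadog reports.
        "call_hit_rate": calls_with_hit / len(calls) if calls else 0.0,
        # Share of all input tokens served from cache, closer to what the bill reflects.
        "token_hit_rate": total_cached / total_in if total_in else 0.0,
    }


logged = [
    {"input_tokens": 7000, "cached_tokens": 0},
    {"input_tokens": 7000, "cached_tokens": 6000},
    {"input_tokens": 7000, "cached_tokens": 6000},
    {"input_tokens": 5000, "cached_tokens": 0},
]
stats = cache_stats(logged)
print(stats)
```

Tracking both numbers is useful: a high call-level rate with a low token-level rate suggests the cache is hitting, but only on a short prefix.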


Second, audit your system prompt token share. If system prompt tokens are over 50% of your total input, you have room to trim. Start with tool schemas: are you injecting all tools on every call, or just the relevant ones for each agent step?
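
The audit itself can be as simple as summing tokens by message role over a sample of calls. In this sketch, `token_share_by_role` is a hypothetical helper and whitespace splitting is a crude stand-in for the model's real tokenizer, so the shares are approximate.

```python
from collections import Counter


def token_share_by_role(messages, count_tokens=lambda s: len(s.split())):
    """Return each role's share of total input tokens.

    messages: iterable of (role, text) tuples.
    count_tokens: whitespace proxy by default; swap in a real tokenizer.
    """
    totals = Counter()
    for role, text in messages:
        totals[role] += count_tokens(text)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {role: n / grand_total for role, n in totals.items()}


sample = [
    ("system", "policy " * 70),
    ("user", "question " * 20),
    ("tool", "result " * 10),
]
shares = token_share_by_role(sample)
print({role: f"{share:.0%}" for role, share in shares.items()})
```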


Third, if you're doing RAG, add a reranking step before context injection. Pull the top 20 candidates from your vector store, then rerank them down to the top 5 using a cross-encoder. The latency cost is small relative to the accuracy gain from tighter context.
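
The shape of that retrieve-then-rerank step is sketched below. Note the substitution: `overlap_score` is a toy word-overlap scorer standing in for a cross-encoder so that the example runs without a model. In practice `score_fn` would wrap a real reranking model, and the candidate list would come from your vector store rather than being hard-coded.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Sort candidates by score_fn(query, doc), highest first, and keep top_k."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]


def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words present in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Hypothetical top-20 candidates from a first-stage vector search.
candidates = [f"unrelated note number {i}" for i in range(17)] + [
    "how to reset a password",
    "password reset policy",
    "reset your password by email",
]

top_5 = rerank("reset password", candidates, overlap_score, top_k=5)
for doc in top_5:
    print(doc)
```

Keeping the scorer as a parameter means the first-stage retrieval and the reranker can be tuned or swapped independently.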


The context window arms race solved the problem engineers were complaining about in 2023. The problem in 2026 is different: it's not that there isn't enough space. It's that most teams haven't developed the discipline to decide what belongs in it.


Sources