This website uses cookies

Read our Privacy policy and Terms of use for more information.

There is a table in a recent ICLR 2026 blog post that should make anyone who has spent a quarter building agent infrastructure a little uncomfortable. It compares two software engineering agents on SWE-bench Verified, both running Claude 4 Sonnet as the backbone. The first is SWE-Agent, a mature, specialized system: planning, state tracking, custom tool handlers, error recovery, around 4,161 lines of Python. It scores 67% and costs roughly $2.50 per task. The second is Mini SWE-Agent, a 131-line loop that has no tools except bash and a linear message history. It scores 65% and costs $0.37 per task.

Two percentage points. Thirty times less code. Seven times cheaper. That is the trade you actually made when you built the elaborate version, and almost nobody measured it before building.

The thesis of this post is uncomfortable but defensible: most of the orchestration you wrap around a frontier model, the planning modules, the retriever layers, the bespoke memory, the hand-tuned error state machines, is buying you single-digit gains over a minimal loop, and you are paying for it in cost, latency, and the maintenance tax of a few thousand lines you now own. The leverage moved. It used to live in the scaffold. It now lives in the model and the environment you hand it. If you are still spending most of your engineering budget on the scaffold, you are optimizing the part that stopped mattering.

What the field is actually measuring now

For two years the interesting question was "how do we get a language model to act?" The answer was scaffolding: ReAct loops, then planners, then memory systems, then full domain-specific agents with components for planning, state tracking, tool use, and error handling. Those systems topped the leaderboards, and they earned it. The model on its own could not reliably hold a multi-step task together, so the structure around it did the holding.

That premise is quietly expiring. The ICLR 2026 blog post Ready For General Agents? Let's Test It. argues that the field is shifting from domain-specialized agents toward general ones: minimal, mostly model-driven loops that can be dropped into many environments. The authors are not making a vibes argument. They put the specialized and the minimal agents side by side on the same benchmarks with the same models and report the cost, the line count, and the score. That is the comparison most teams never run on their own systems, because once you have built the big version you have no incentive to discover that the small version was almost as good.

This matters now because the default has inverted. Every framework vendor recommends an agent loop. The consumer and developer LLM products people use all day have "quietly evolved from a conversation with an LLM to an interaction with an agent equipped with tools," as the post puts it. The question is no longer whether to build an agent. It is how much agent to build, and the honest answer turns out to be: less than you think, and you should prove otherwise before you add a line.

The SWE-bench case, in detail

Mini SWE-Agent is not a toy somebody wrote to make a point. It comes from the Princeton and Stanford team behind SWE-bench and the original SWE-agent. Their own repository headline is blunt: "The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but scores >74% on SWE-bench verified." The architecture is a loop. It takes a prompt, executes actions with bash and subprocess.run, and appends every step to a flat message list. No retrievers. No plugins. No planning subsystem. The entire control flow fits on a screen.

The ICLR table reports it at 65% with Claude 4 Sonnet against SWE-Agent's 67% with the same model, while the repo advertises >74% under a stronger configuration. Pick whichever number you like. The relationship holds either way: a few hundred lines of general loop lands within a couple of points of a few thousand lines of specialized machinery, and does it for a fraction of the per-task cost. The 4,000 lines were not wasted, exactly. They were the right answer for the model that existed when they were written. They are simply no longer where the marginal point comes from.

The cost gap is the part teams underweight. At $2.50 versus $0.37 a task, the specialized agent is not a little more expensive, it is nearly an order of magnitude more expensive for two points. If you are running a handful of tasks a day, fine, pay for the two points. If you are running a hundred thousand, that ratio is the difference between a viable product and a line item your CFO asks about.

It is not a SWE-bench fluke

The easy dismissal is that coding agents are special, that bash plus a model is unusually well-suited to GitHub issues, and the result will not generalize. The same blog post anticipates that and shows the pattern in a completely different domain: scientific research agents.

On ASTA Bench, a benchmark for deep research agents, the specialized ASTA-v0 system scores 53% at $3.40 per task and carries subsystems exceeding 13,768 lines of code. The second-best system on the board is a 358-line ReAct agent running GPT-5, scoring 44% at $0.31 per task. On the literature-understanding subtask specifically, that minimal ReAct agent scores 53%, beating both the specialized ASTA Paper Finder (21%) and OpenAI Deep Research (19%). A general loop in 358 lines outscored two purpose-built research systems on the task they were purpose-built for.

The authors summarize the regularity across both domains in one sentence: "small general agents of a few hundred lines consistently achieve 70% to 95% of the performance of systems that are thousands of lines." That is the number to internalize. Not "minimal agents win," they usually do not win outright. The claim is that they capture most of the value for a rounding error of the complexity, and the last slice of performance is the expensive slice.

Why the floor rose

The mechanism is not mysterious. The capabilities you used to implement in the scaffold migrated into the model. Planning, decomposing a task, recovering from a failed command, deciding what to read next: these were once things you wrote explicit code for because the model could not be trusted to do them in-context. Frontier models now do them in-context, well enough that an external planner is often re-deriving a plan the model already had.

So what is the scaffold for, once the model absorbed the reasoning? The honest answer is that the durable value of an agent system was never the orchestration. It was the environment. The tools the agent can call, the sandbox it runs in, the data it can reach, the actions that actually change the world. Mini SWE-Agent works because bash in a real repo is an absurdly powerful action space, not because its loop is clever. The leverage that survived is the connection between the model and the world. The leverage that evaporated is the elaborate machinery in between.

This is why I would now build any new agent in roughly this order. Start with the smallest possible loop: model, a flat history, and the tools the task genuinely requires. Get a number on your real task. Only then, and only against that baseline, add structure: a planner, a memory store, a retriever, whatever you think you need. Keep each addition exactly as long as it moves your number by more than its maintenance cost. Most will not survive that test. The ones that do are the real infrastructure, and you will be able to point to the percentage points they bought.

Yes, but the last two points are sometimes the whole game

The strongest objection is that this reasoning flatters the average case and ignores the frontier, and it is a fair objection. Two percentage points on SWE-bench is two percentage points only in aggregate. On a hard enough distribution of tasks, the specialized system's planning and error recovery are precisely what carry the cases the minimal loop drops, and those cases may be the ones you care about most. If you are shipping an agent where a 2% higher failure rate means a corrupted production database or a wrong number in a financial report, the cost math inverts. You will pay $2.50 a task gladly, and you should.

The gap also compounds with horizon. These benchmarks measure tasks that resolve in a bounded number of steps. Push to genuinely long-running work, the multi-hour autonomous runs that are becoming common, and the things scaffolding provides, durable memory, state that survives context limits, structured recovery, stop being luxuries. A flat message history that works beautifully for a twenty-step bug fix will not hold a seven-hour task together. The minimal-agent result is real, but it is a statement about a class of tasks, not a license to delete your infrastructure for every task.

And specialized systems still top the leaderboards. They win. The argument here is not that complexity is worthless. It is that complexity is now something you should be made to justify against a cheap, strong baseline, rather than something you reach for by default because two years ago it was the only thing that worked. The default changed. The justification did not used to be required. Now it is.

What to do Monday

Run the comparison the blog post ran, on your own agent. Take whatever elaborate system you are shipping, and build the 131-line version next to it: same model, a flat history, only the tools the task actually needs. Point both at your real evaluation set and read the two numbers and the two costs.

If the gap is two points, you have a decision to make in daylight instead of by inertia, and depending on your volume and your tolerance for failure, the small one might be the system you ship. If the gap is twenty points, you have just learned exactly which part of your scaffold is load-bearing, and you can stop apologizing for maintaining it. Either way you replaced a belief with a measurement. The teams that keep winning with agents over the next year will not be the ones with the most infrastructure. They will be the ones who can tell you, in percentage points, what each piece of their infrastructure is for.

Sources

Keep Reading