A bug nobody can explain
A teammate of mine spent two days last month trying to fix a routing bug in a service none of us had touched in six weeks. The code was readable. The tests passed. The pull request that originally introduced the bug had a clean diff and a one-line summary. He still couldn't fix it.
The reason was simple and uncomfortable: nobody on the team could tell him why the routing logic was structured the way it was. The PR had been merged after an LLM had written most of it; the reviewer had skimmed it; the author had moved to the next ticket. By the time the bug surfaced, the system contained dozens of small decisions that no human had ever held in their head at the same time. He wasn't fighting messy code. He was fighting a system that had no living theory behind it.
This is the part of the AI-assisted development story that the productivity dashboards keep missing. We've spent five years optimizing the metric "code that gets written." Almost nobody is tracking the metric that actually predicts whether a six-month-old codebase is fixable: the number of people who can explain why it looks the way it does. That number is dropping, fast, in a lot of teams. And refactoring, the move every senior engineer reaches for first, doesn't fix it.
Where the debt actually went
Two pieces of evidence from this year are worth putting next to each other.
The first is GitClear's 2025 AI Code Quality study, which analyzed 211 million changed lines of code from repositories at Google, Microsoft, Meta, and a slate of enterprise customers between 2020 and 2024. The headline number is that the share of changes classified as refactoring fell from 25% in 2021 to under 10% in 2024. Copy-pasted blocks rose from 8.3% to 12.3% over the same window. Code reuse (the kind that shows up as moved lines in a structured diff) kept dropping. The trend is consistent across language and company size. Teams are adding more code per unit time and improving less of what they already have.
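The figures above are percentage-point shares. Restating them as relative changes makes the size of the shift easier to see. Here is a small sketch that uses only the numbers quoted in this section; because the 2024 refactoring share is reported as "under 10%", 10.0 is treated as an upper bound, so the real decline is at least this large. The helper name `relative_change` is just an illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Shares quoted above, in percent of changed lines.
# 10.0 is an upper bound: the study reports "under 10%" for 2024.
refactoring = relative_change(25.0, 10.0)
copy_paste = relative_change(8.3, 12.3)

print(f"refactoring share: {refactoring:.0%} (or a steeper drop)")
print(f"copy-paste share: {copy_paste:+.0%}")
```

In other words, a fall from 25% to under 10% means the refactoring share dropped by at least three fifths, while the copy-paste share grew by roughly half.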
The second is Margaret-Anne Storey's post on cognitive debt, written from a panel at this year's Future of Software Engineering retreat with Martin Fowler and Thoughtworks. Storey, who holds the Canada Research Chair in Human and Social Aspects of Software Engineering, makes a single claim that reframes the GitClear data: the most expensive debt that AI-assisted development creates does not live in the code. It lives in the developers' minds. She borrows Peter Naur's old framing of a program as a theory, held jointly across the heads of the people who built it, and points out that an agent that generates fifty merged PRs per week is perfectly capable of stripping that theory out of the team faster than anyone can rebuild it.
Hold those two findings next to each other. Refactoring is collapsing at the same moment the system's theory is most at risk. That isn't a coincidence. You refactor code when you understand it; if you don't, you only dare to add.
Technical debt has a fixable surface. Cognitive debt does not.
Ward Cunningham's original metaphor was always about the gap between the code as written and the code as it should have been written, given what you now know. The remediation was always understood: you refactor. You rename. You extract. You delete. The work is unpleasant, but it is bounded. A senior engineer with two free days and a working test suite can make a measurable dent.
This worked because the metaphor assumed one thing it never had to name: the engineer doing the refactor understood what the code was supposed to do. The understanding was the asset. The code was the liability. Refactoring traded effort against a clearer expression of an idea that already existed in the team.
Cognitive debt breaks that assumption. The asset is gone. There is no clean idea in anybody's head to refactor against. The senior engineer staring at the AI-generated module has the same problem the original author had: she can read the lines, but she cannot reconstruct the chain of small decisions that produced them, because that chain was made by a model in a context window that no longer exists. There is nothing to extract a method into, because nobody knows what the method was supposed to express. The code is the only artifact left, and the code is not enough.
This is why the conventional remediation playbook quietly fails on AI-heavy codebases. The team agrees on a tech-debt sprint, picks the gnarliest module, and finds that two days of refactoring produces something marginally cleaner that nobody trusts any more than the original. The refactoring is, in the precise sense, unanchored: it isn't shaped by a deeper understanding of the system because there is no deeper understanding to anchor it to.
The MIT EEG study that puts numbers on it
If you want hard data instead of metaphor, the Kosmyna et al. paper out of the MIT Media Lab is the most quoted piece of evidence right now, and it deserves the attention. They took 54 participants, split them into three groups (LLM-assisted, search-assisted, brain-only), and had each group write SAT-style essays over multiple sessions while their brain activity was recorded with EEG. The brain-only group showed the strongest, most widely distributed neural connectivity. The search group sat in the middle. The LLM group was, neurally, the least engaged: the connectivity in their alpha and beta bands was flattest, and they reported the lowest ownership of their own essays. Many of them couldn't accurately quote text they had submitted minutes earlier.
The crossover session is the part to take seriously. Participants who had spent three sessions writing with the LLM and were then forced to write without it showed reduced neural engagement compared to the brain-only group's baseline. The four-month effect on a population of college kids writing 500-word essays is striking enough. Extrapolate to engineering teams shipping a few thousand lines of agent-authored code a week, for years, and the question isn't whether the team is losing its theory of the system. It's how far gone it already is.
Yes, the paper is preprint-stage, the sample is small, and the task is essay writing rather than software engineering. None of those caveats kills the basic finding, which is that cognitive engagement scales down with tool sophistication, and the deficit persists after the tool is removed. If you've ever taken over a codebase that an LLM-heavy team built and felt the eerie sense that nobody knows what the code does, you've felt the engineering version of this effect.
Yes, but: hasn't code always been a black box at scale?
The strongest objection to all of this is honest, and senior engineers will raise it: large codebases at Google or Microsoft have always been opaque to most engineers who touch them. The theory of a thirty-million-line system never lived fully in any one head. We invented code review, design docs, ownership boundaries, and ADRs precisely because the theory had to be reconstructed continuously across people. AI-generated code is just a more aggressive version of the same problem.
That objection is partly right and partly misses the rate change. In the pre-AI world, theory was distributed across many engineers, each holding a small piece, with the joints between pieces enforced by interfaces and conversation. Losing the theory took years and tended to happen at organizational seams (reorgs, attrition, acquisitions) rather than inside a single sprint. The remediation was social: hire someone who knew it, write a design doc, run a deep code review.
What's new is the compression. A team of five can now generate the volume of code that a team of fifteen produced two years ago, but the theory still has to live in five heads, and only some of the time, since the original author of any given change was a model with no memory. The capacity to hold theory hasn't scaled with the capacity to write code. The pre-AI playbook (ADRs, design docs, ownership) is still correct, just badly underused. The teams that retain their theory are the ones that have already made writing it down a non-negotiable part of any PR that an agent touched.
What to do on Monday morning
Storey's recommendation in her post is right, and it has a cheap version that doesn't require any process meeting to ship. Add one rule to your PR template, and enforce it on the AI-touched changes specifically: no merge without a human reviewer who can explain, in their own words and without rereading the diff, why this change exists and what would have happened if it hadn't. Not "looks good to me." Not a thumbs-up emoji. A two-sentence explanation in the PR description, written by the human rather than the agent, and required before the merge button works.
You will be amazed at how often the reviewer cannot do this. That's the signal. When the explanation comes out garbled, the change isn't ready, no matter how clean the diff is. Send it back. The cost is real (you will ship fewer PRs in the first month) and the savings are real and lagging (you will not pay the six-month tax of a system nobody understands).
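A minimal sketch of how the mechanical half of that rule could be checked, for instance from a CI step that receives the PR description as text. Everything named here is an assumption made for illustration: the `## Reviewer explanation` heading, the function names, and the two-sentence threshold as code. None of it comes from Storey's post. The check can confirm only that an explanation of the right shape is present. It cannot tell whether a human wrote it or whether it is correct, so that judgment stays with the team.

```python
import re

# Hypothetical heading a team might add to its PR template.
RATIONALE_HEADING = "## Reviewer explanation"

# Placeholder approvals that should never count as an explanation.
PLACEHOLDERS = {"", "lgtm", "looks good to me"}


def extract_rationale(pr_body: str) -> str:
    """Return the text under the rationale heading, up to the next heading."""
    collected = []
    in_section = False
    for line in pr_body.splitlines():
        if line.strip().lower() == RATIONALE_HEADING.lower():
            in_section = True
            continue
        if in_section and line.startswith("#"):
            break  # the next heading ends the section
        if in_section:
            collected.append(line.strip())
    return " ".join(part for part in collected if part)


def count_sentences(text: str) -> int:
    """Rough count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return len([p for p in parts if p.strip()])


def ready_to_merge(pr_body: str, min_sentences: int = 2) -> bool:
    """True only if the rationale section holds a multi-sentence explanation."""
    rationale = extract_rationale(pr_body)
    if rationale.strip().lower().rstrip(".!") in PLACEHOLDERS:
        return False
    return count_sentences(rationale) >= min_sentences


example = """Fix fallback route for retired regions

## Reviewer explanation
The router fell through to the default handler when a region was retired. Without this change those requests would keep returning 404.
"""
print(ready_to_merge(example))
print(ready_to_merge("## Reviewer explanation\nLGTM"))
```

The first call passes and the second is rejected. A gate like this only makes the empty explanation visible; whether the explanation makes sense is still the reviewer's call.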
The deeper habit, the one that doesn't fit in a PR template, is to notice when refactoring stops being a useful tool and start treating the missing theory as the actual asset to rebuild. That means pairing on the AI-generated module before you touch it, writing a one-page explanation of how the system is supposed to work and circulating it for argument, and scheduling a regular session where someone presents the parts of the codebase they own and gets challenged on the why. None of these are new ideas. They are the practices that senior engineers used to follow informally when the volume of code per person was low enough for the theory to survive on its own. It isn't anymore.
The refactor-your-way-out playbook had a long, useful run. It needs a successor for the codebases we're building now, and the successor isn't a tool. It's a discipline about whose head the system has to live in.

