SL#76 - Code Review Was Never About Reading the Diff

A teammate of mine approved a 600-line pull request in four minutes last month. The diff was generated by an agent, it passed CI, and it looked right. Two weeks later we found out it had silently changed the rounding behavior on a billing path. Nobody wrote that bug. Nobody really read it either. The PR had an author field with a human name on it, but the human had prompted for a feature, skimmed the output, and clicked merge.

This is the part of the AI coding story that the productivity charts miss. Generation got an order of magnitude cheaper. Review did not get cheaper at all. A model can produce 600 lines in twenty seconds; a competent reviewer still needs the better part of an hour to actually understand them. That asymmetry is the whole problem, and most teams are resolving it in the worst possible way: by reviewing less.

The thesis I want to defend is that code review was never primarily a defect filter. It is a context-transfer mechanism. And AI broke it not by writing bad code but by writing code that arrives with no author who holds the context. Once you see review that way, almost every popular fix for the "review bottleneck" looks like it is solving the wrong problem.

What review actually does

When you write a piece of code, you carry a stack of invisible state with you: the three approaches you tried and rejected, the edge case you discovered in the existing function, the reason you reached for a map instead of a list. The diff is the residue of all that thinking. It is not the thinking itself.

A reviewer has to reconstruct the thinking from the residue. Thomas Johnson, CTO of Multiplayer, put this well in a LeadDev piece in June: reviewing code has always been harder than writing it because of a context asymmetry that has nothing to do with AI. The author carries everything; the reviewer rebuilds it from the diff alone. That is hard when the author is a colleague who can answer in Slack. It is much harder when the author is an agent that made hundreds of non-deterministic decisions you have zero visibility into.

So the value a good review produces is not "found three bugs." It is "a second human now understands this change well enough to maintain it, and the design got pressure-tested by someone who didn't write it." Bug-catching is a side effect of that understanding, not the main event. This is why review-by-checklist always felt hollow, and why the best reviews you have ever received were arguments about intent, not nitpicks about syntax.

Hold that definition, because it predicts exactly what happens when you flood the pipeline with authorless code.

The data is already in, and it is not subtle

The 2025 DORA Report, which Google published in September drawing on survey responses from nearly 5,000 technology professionals, found that AI adoption now correlates positively with software delivery throughput. That is a reversal from 2024 and it is real. Teams are shipping more. But the same report found that AI adoption continues to have a negative relationship with software delivery stability. More change going out the door, less of it staying up.

DORA's own explanation is the one to internalize: "AI doesn't fix a team; it amplifies what's already there." Acceleration exposes weaknesses downstream. Without strong automated testing, mature version control, and fast feedback loops, a jump in change volume turns into instability. The report is blunt that 30% of respondents still report little or no trust in AI-generated code, even as 90% use it and 80% say it makes them more productive. People do not trust the code, and they are merging more of it faster. That is not a contradiction the tooling will resolve on its own.

Now pair DORA with GitClear's 2025 code-quality study, which analyzed 211 million changed lines of code from 2020 through 2024, including repositories at Google, Microsoft, and Meta. The findings line up disturbingly well with the "authorless code" theory. Copy-pasted code rose from 8.3% of lines in 2020 to 12.3% in 2024. Blocks with five or more duplicated lines increased eightfold in 2024 alone. For the first time in the history of their measurement, "copy/paste" exceeded "moved" code, and moved code is the fingerprint of refactoring. The share of changed lines associated with refactoring fell from 25% in 2021 to under 10% in 2024.

Read those two reports together and the picture is coherent. We are generating more, duplicating more, refactoring less, and shipping less stable software. The generation side of the pipeline got faster and the verification side did not move. The review step, the place where someone is supposed to notice "this is the fourth copy of this function," is being asked to absorb all of that and is instead quietly giving up.

Why the obvious fixes make it worse

Confronted with a review bottleneck, teams tend to follow two instincts. Both keep the same mental model - review is where quality gets added at the end - and both fail for the same reason.

The first instinct is to go faster. Approve quicker, batch less, trust the author. This is rubber-stamping with a nicer name. There is a rough heuristic going around that is uncomfortable but useful: if your average review time for an AI-generated PR matches your average for a human-written one, you are not reviewing, you are approving. The evidence on reviewer behavior backs up the discomfort: AI-generated PRs containing substantially more redundant code have been observed to draw fewer critical reactions from reviewers, not more. Surface-level plausibility suppresses scrutiny. The code reads as competent, so the reviewer's guard drops, exactly when it should rise.
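The review-time heuristic is easy to check against your own history. Here is a minimal sketch, assuming you can export review durations in minutes for each group. The function name, the input shape, and the 10% tolerance used to decide that two averages "match" are my assumptions, not part of the heuristic as it circulates.

```python
from statistics import mean


def looks_like_rubber_stamping(ai_review_minutes, human_review_minutes, tolerance=0.10):
    """Return True when AI-generated PRs get no meaningfully more review time.

    Both arguments are lists of review durations in minutes (hypothetical
    export format). `tolerance` is an assumed threshold: the AI mean counts
    as "matching" if it is at most 10% above the human mean.
    """
    if not ai_review_minutes or not human_review_minutes:
        raise ValueError("need at least one review time in each group")
    ai_mean = mean(ai_review_minutes)
    human_mean = mean(human_review_minutes)
    return ai_mean <= human_mean * (1 + tolerance)


if __name__ == "__main__":
    # AI PRs reviewed in about 5 minutes, human PRs in about 20: flagged.
    print(looks_like_rubber_stamping([4, 6, 5], [20, 25, 15]))  # True
    # AI PRs reviewed far longer than human PRs: not flagged.
    print(looks_like_rubber_stamping([45, 60], [20, 25]))  # False
```

A raw average hides PR size, so treat a True here as a prompt to look closer, not as a verdict.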

The second instinct is to add an AI reviewer. Bolt a model onto the PR, let it comment, let it approve. This helps with the mechanical layer and I will give it real credit in a moment. But as a replacement for human review it is a category error. The thing missing from authorless code is intent. A second model cannot supply intent it never had; it can only generate a plausible-looking assessment of plausible-looking code. You have now automated both sides of a conversation that exists to transfer understanding between humans, and understanding is the one thing neither participant is producing. GitHub's Octoverse 2025 report gave the failure mode a name that open-source maintainers had already coined: AI slop. High volume, low quality, confidently wrong, expensive to triage.

The deeper issue, as Johnson argues, is that agents are making decisions on data that was never designed for machine reasoning. They work from raw logs and incomplete context and produce a 400-line PR that passes CI, looks syntactically clean, and fixes the symptom instead of the failure. A human catching that has to reconstruct a causal chain the agent never had. You cannot review your way out of that at the end of the pipeline. The context had to exist earlier or it does not exist at all.

Move the burden of proof to the author

If review is context transfer, the fix is to make the change carry its own context before a human ever opens it. Shift the burden of proof from the reviewer to the author, where "author" now means the human plus whatever agent they drove.

Concretely, that means a PR should not be eligible for human review until it arrives with three things. A statement of intent: the spec or the issue it satisfies, in enough detail that a reviewer can check the code against a target instead of guessing the target from the code. A test that pins the new behavior: not coverage theater, but the specific assertion that would fail if the change were reverted, so the reviewer can see what "working" was defined to mean. And a small diff. The 400-line agent PR is not a unit of review; it is four to ten units that were never decomposed because decomposition is the one thing the agent made no cheaper.
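To make "a test that pins the new behavior" concrete, here is a minimal sketch built around the kind of billing rounding change described at the top of this piece. The function `invoice_total` and the half-up rounding rule are hypothetical, chosen for illustration; the point is the shape of the assertion, which uses an input that sits exactly on a half cent so that a different rounding mode produces a different answer.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def invoice_total(line_amounts):
    """Sum line amounts, then round once to cents.

    Hypothetical billing rule: ties round away from zero (ROUND_HALF_UP).
    """
    total = sum(line_amounts, Decimal("0"))
    return total.quantize(CENT, rounding=ROUND_HALF_UP)


def test_total_rounds_half_cent_up():
    # 10.100 + 0.025 = 10.125, exactly on a half cent.
    # ROUND_HALF_UP gives 10.13; banker's rounding (ROUND_HALF_EVEN)
    # would give 10.12, so this assertion fails if the mode is swapped.
    assert invoice_total([Decimal("10.100"), Decimal("0.025")]) == Decimal("10.13")


if __name__ == "__main__":
    test_total_rounds_half_cent_up()
    print("rounding behavior pinned")
```

A test that only used amounts like 10.12 or 10.50 would pass under either rounding mode and pin nothing, which is the difference between a behavior-pinning test and coverage theater.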

This is the same logic DORA arrives at from the data and that Johnson arrives at from the trenches: a layered, Swiss-cheese system where automated verification catches mechanical errors, specs validate intent before code exists, and the scarce resource - human judgment - is spent only on the questions humans are uniquely good at. Is this the right design? Does it fit the architecture? What breaks in six months? Everything below that line should be caught by a linter, a contract test, or a type, not by a tired person reading a diff at 5pm.

Notice this also rehabilitates the AI reviewer. As a slice of the cheese, an automated reviewer that runs strict static analysis and flags the N+1 query and the missing index is genuinely useful, because those are mechanical and it never gets tired. The error is letting that slice stand in for the intent slice. Use the machine for what is mechanical. Reserve the human for what is judgment. Do not let either pretend to be the other.

Yes, but the line-by-line review is dying anyway

The honest objection is that I am defending a practice that is on its way out. If AI keeps getting better, the argument goes, the 400-line PR is just the new atomic unit, and insisting on small human-readable diffs is nostalgia. Johnson himself predicts the line-by-line diff review will go extinct because the volume makes it cognitively and economically unsustainable.

He is probably right about the mechanics, and I am not arguing to preserve the ritual of a human reading every line. I am arguing about where the understanding lives. You can absolutely stop reading diffs line by line. What you cannot do is stop transferring context and expect the system to stay stable, because the DORA stability numbers are what happens when you try. If the future is humans defining intent and constraints while machines verify the constraints were met, then the spec and the test are not optional artifacts you generate to satisfy a process. They are the context, made legible, so it survives the fact that no human held it. The review moved. It did not disappear.

What to change on Monday

Pick one repository and change the definition of "ready for review" before you change anything else. A PR is not reviewable until its description states the intent and links the spec, includes the one test that would fail if the change were reverted, and comes in under a size your team can actually hold in its head. Three hundred lines is a defensible ceiling; four hundred is where, by most accounts, real review stops happening.
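That definition of "ready for review" can be enforced mechanically before a human is asked to look. What follows is a sketch under stated assumptions, not a drop-in tool: the `PullRequest` class is a hypothetical stand-in rather than any real hosting API, the "Intent:" and "Spec:" description lines are a convention I made up for the example, and "a test file was changed" is only a proxy for "a test that would fail if the change were reverted."

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath

MAX_CHANGED_LINES = 300  # the ceiling suggested above; tune it per team


@dataclass
class PullRequest:
    """Hypothetical, minimal view of a PR (not a real hosting API)."""
    description: str
    changed_files: dict[str, int] = field(default_factory=dict)  # path -> changed lines


def is_test_path(path: str) -> bool:
    # Assumed layout: tests live under a "tests" directory or in test_*.py files.
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def readiness_problems(pr: PullRequest) -> list[str]:
    """Return the reasons this PR is not yet reviewable (empty list means ready)."""
    problems = []
    # Assumed convention: a description line "Intent: <text>".
    if not re.search(r"^Intent:[ \t]*\S", pr.description, re.MULTILINE):
        problems.append("missing intent statement")
    # Assumed convention: a description line "Spec: <url>".
    if not re.search(r"^Spec:[ \t]*https?://\S+", pr.description, re.MULTILINE):
        problems.append("missing spec link")
    # Proxy only: touching a test file does not prove the test pins the behavior.
    if not any(is_test_path(path) for path in pr.changed_files):
        problems.append("no test file changed")
    total = sum(pr.changed_files.values())
    if total > MAX_CHANGED_LINES:
        problems.append(f"diff too large: {total} > {MAX_CHANGED_LINES} changed lines")
    return problems


if __name__ == "__main__":
    pr = PullRequest(
        description="Intent: round invoice totals half-up\nSpec: https://example.com/specs/rounding",
        changed_files={"billing/totals.py": 40, "tests/test_totals.py": 25},
    )
    print(readiness_problems(pr))  # []
```

Returning a list of reasons instead of a single boolean lets the author see everything that is missing in one pass, which keeps the burden of proof on the author's side of the handoff.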

Then change what you measure. Stop tracking review by approval latency, which rewards speed alone and so quietly rewards rubber-stamping. Track the share of PRs that arrive with intent and a behavior-pinning test, and track your two-week churn the way GitClear does, because rising churn is the early smoke of code nobody understood well enough to get right the first time. The teams that win the AI trade will not be the ones who review fastest. They will be the ones who made sure the context was in the change before anyone had to go looking for it in the diff.
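Both measurements are simple to compute once the inputs exist. Here is a minimal sketch, assuming you already have per-PR flags and per-line timestamps from your own tooling. The record types and field names are hypothetical, and the churn calculation is a simplified approximation (share of added lines rewritten or deleted within 14 days), not GitClear's exact methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)


@dataclass
class PRRecord:
    """Hypothetical per-PR flags, e.g. filled in by a readiness check."""
    has_intent: bool
    has_pinning_test: bool


@dataclass
class AddedLine:
    """Hypothetical per-line history: when it was added and first changed."""
    added_at: datetime
    rewritten_at: Optional[datetime] = None  # first later edit or deletion, if any


def context_share(prs):
    """Share of PRs that arrived with both intent and a behavior-pinning test."""
    if not prs:
        return 0.0
    ready = sum(1 for pr in prs if pr.has_intent and pr.has_pinning_test)
    return ready / len(prs)


def two_week_churn(lines):
    """Share of added lines rewritten or deleted within 14 days of being added."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.rewritten_at is not None
        and line.rewritten_at - line.added_at <= CHURN_WINDOW
    )
    return churned / len(lines)


if __name__ == "__main__":
    prs = [PRRecord(True, True), PRRecord(True, True), PRRecord(True, True), PRRecord(True, False)]
    print(context_share(prs))  # 0.75

    day0 = datetime(2025, 1, 1)
    lines = [
        AddedLine(day0, day0 + timedelta(days=3)),   # churned
        AddedLine(day0, day0 + timedelta(days=30)),  # changed, but outside the window
        AddedLine(day0),
        AddedLine(day0),
    ]
    print(two_week_churn(lines))  # 0.25
```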
