SL#55 - GitHub Just Shipped Stacked PRs. They Fix the Wrong Half of Code Review.

On 13 April 2026, GitHub put stacked pull requests into private preview. The pitch is clean and the timing is obvious. AI agents now produce code faster than any human can read it, the average diff has ballooned, and large pull requests are "hard to review, slow to merge, and prone to conflicts." Stacked PRs let you break a big change into a chain of small ones, each targeting the branch below it, each reviewable on its own. Meta and Google have worked this way for a decade. The tooling finally landed in the place most of us actually work.

The launch leans on a number that has been making the rounds: an analysis of 1.5 million pull requests found that PRs between 200 and 400 lines had 40% fewer defects and were approved three times faster than larger ones. GitHub's framing is that stacking keeps each PR inside that golden band even when the feature is huge. Smaller units, faster approvals, fewer bugs. Who would argue with that.

I would, a little. Not with the tool, which is good and overdue. With the story we are telling ourselves about why it helps. Code review in 2026 is not bottlenecked on diff size. It is bottlenecked on reviewer attention, and stacked PRs do not create any. Worse, they make it cheaper to spend the attention you do have badly. If you adopt stacking without changing how you actually review, you will ship the same bugs you ship today, just spread across five green checkmarks instead of one.

The number measures the reviews people already chose to do

Start with that 1.5-million-PR stat, because it is doing a lot of load-bearing work and it does not say what people think it says.

"PRs of 200 to 400 lines get approved three times faster" is an observation about PRs that got reviewed. It is not a controlled experiment where the same change was reviewed at two sizes. Small PRs get approved faster for several reasons at once, and reviewer thoroughness is tangled up in all of them. A 40-line PR gets a fast approval partly because it is genuinely easy to reason about, and partly because it looks trivial, so the reviewer skims it, sees nothing alarming, and clicks approve. The speed is real. Some of it is comprehension and some of it is inattention, and the aggregate number cannot tell you the ratio.

There is a well-known piece of folk data in this area, the Cisco/SmartBear review study, which found reviewer defect-detection drops sharply once a review exceeds roughly 400 lines, and that effectiveness also falls off when reviews run longer than about 60 minutes regardless of size. The honest reading of that work is not "small PRs are reviewed well." It is "human review effectiveness is capped, and the cap is low." Past a few hundred lines per sitting, you are not really reviewing anymore. You are pattern-matching for things that look wrong.

Stacked PRs accept that cap and route around it by cutting the work into cap-sized pieces. That is a real win for the readability of any single diff. But the cap was never only about the size of one diff. It is about the total attention a reviewer brings to your change on a given afternoon. Splitting a 2,000-line feature into five PRs does not give the reviewer five fresh afternoons. It gives them the same afternoon and five things to click.

Stacking multiplies approvals, not attention

Here is the failure mode I would bet money on seeing in teams six months from now.

A developer, very plausibly an AI agent prompted to "break this into stacked PRs, each under 200 lines, each doing one logical thing," produces a clean stack of five. Layer one: a new interface. Layer two: the implementation. Layer three: wiring. Layer four: tests. Layer five: the call site. Each PR is small. Each one, in isolation, does one logical thing and reads fine. The reviewer opens the stack map, walks the layers, and approves all five in fifteen minutes, feeling productive and rigorous because look how clean each piece was.

The bug, if there is one, is almost never inside a single 180-line layer. It lives in the seam between layer two and layer three, in the assumption layer four's tests quietly encode, in the thing layer one's interface made impossible to express and so layer five worked around. Those are exactly the defects that a reviewer holding the whole change in their head catches and a reviewer clicking through focused diffs does not. Stacking optimizes for local readability and, as a side effect, dissolves the global picture. You traded "one diff too big to fully grasp" for "five diffs too separated to connect."

This is not hypothetical hand-waving. It is the same dynamic that makes microservices harder to reason about than a monolith even though each service is simpler. Decomposition moves complexity from inside the units to the boundaries between them, and boundaries are where review attention is weakest because no single PR owns them. The stack map helps you navigate the layers. It does not make you reconstruct the feature.

And there is a quieter incentive problem. Approving is satisfying. A green checkmark feels like progress for both author and reviewer. When a change is one big PR, the reviewer feels the weight of that single approval and tends to take it seriously. When it is five small PRs, each approval feels minor, the social cost of waving one through is lower, and the path of least resistance is to keep the author unblocked. Stacking lowers the activation energy for approval at exactly the moment AI has raised the volume of things asking to be approved. That is the wrong direction.

The lever that actually moves review is depth, and it is not human

Buried in InfoQ's coverage of the stacked-PR launch is the most important number in this whole conversation, and almost nobody is repeating it. Anthropic's agent-based code review for Claude Code found that substantive review comments on PRs with more than 1,000 lines changed rose from 16% to 84% after teams adopted automated review tooling.

Sit with that. On large PRs, the kind humans are worst at, only 16% of review comments said anything substantive before. The other 84% were nits, style, "can you rename this," and approvals with no comment at all. The thing that fixed it was not making the PRs smaller. It was putting a reviewer on them that does not get tired at line 400, does not lose the thread across a 2,000-line change, and reads every layer with the same attention it gave the first one.

That is the real shape of the 2026 review problem. The supply of code went up because agents write it. The supply of careful human attention is fixed, and was already maxed out below the size of a typical AI diff. You cannot close that gap by reshaping the work into human-sized bites, because the volume of bites is also going up. You close it by adding a reviewer whose attention scales with the volume, which means an automated one, and reserving the scarce human for the things automation is bad at: does this change make sense, is this the right design, what does this break that no diff will show you.

Stacked PRs and agent review are not competitors. They are complements, and the order matters. Use the agent to do the depth pass on every layer, every line, no fatigue. Use the human to read the stack as a whole and answer the questions that require knowing what the company is trying to do. The stack structure is genuinely useful here, because it gives the human a clean spine to reason along. But that only works if you have explicitly handed the line-by-line correctness pass to something that does not skim.

Yes, but small PRs really are easier to review

The strongest objection to all of this is also the most obvious one: smaller diffs are easier to review, full stop, and pretending otherwise to be contrarian is silly. That is true, and I am not pretending otherwise. A 300-line PR genuinely is easier to reason about than a 3,000-line one. The Cisco data, the 1.5-million-PR data, and the lived experience of every engineer who has been handed a giant diff all point the same way. Stacked PRs make small diffs the default instead of an act of discipline, and defaults win. That is a real improvement and I will use the feature.

The point is narrower than "stacking is bad." It is that stacking improves the readability of each unit while doing nothing for, and possibly hurting, the thing that was already the binding constraint: whether a reviewer with finite attention actually understands the change as a whole and is motivated to push back. If your reviews were thorough before, stacking makes them more pleasant. If your reviews were rubber stamps before, stacking gives you five rubber stamps and a nicer UI, and the cleaner workflow can make the rubber-stamping feel more legitimate than it did. The tool is an amplifier. What it amplifies depends entirely on whether real review attention exists upstream of it.

What to actually do on Monday

If you turn on stacked PRs, decide first what each reviewer in the chain is for, and write it down. Splitting the work does not split the responsibility unless you say so. The most common silent failure with stacks is diffusion: five reviewers each assume someone else owns the integration, and nobody reads the seams.

Concretely. Put an automated reviewer on every layer and let it own line-level correctness, the boundary checks, the "you handled the happy path but not the error" catches. That is the part that scales with AI-generated volume and the part humans were already failing at on big diffs. Then assign one human to own the stack as a whole, not a layer, and give them the question that no individual diff answers: does this set of changes, taken together, do the right thing in the right way. Make that person read the stack top to bottom as one change, because that is the view stacking takes away from you by default.

And watch your approval latency per layer. If layers in a stack are getting approved dramatically faster than your old single PRs of the same total size, that is not a productivity win you earned. That is the rubber stamp showing up in your metrics. The whole feature was sold on speed, so speed is exactly the signal that will lie to you about whether review still means anything. Smaller PRs that approve faster can mean the work got clearer. It can also mean nobody is really looking. The diff size will not tell you which. Only a defect that ships through five green checkmarks will, and by then it is in production.

SL#55 - GitHub Just Shipped Stacked PRs. They Fix the Wrong Half of Code Review.

The number measures the reviews people already chose to do

Stacking multiplies approvals, not attention

The lever that actually moves review is depth, and it is not human

Yes, but small PRs really are easier to review

What to actually do on Monday

Sources

Keep Reading

Software Letters

Home