When the same AI writes both the code and the tests from the same prompt, coverage stops being a signal. The green dashboard is a mirror, not a check. Here's the gap, and how to close it.
A 94% coverage failure
A backend engineer I'll call Sara opened her laptop on a Saturday morning because her on-call phone was going off. Production was down. The crash was a null pointer exception in an order-processing path. The PR that introduced it had shipped Friday afternoon with 94% test coverage. Every test had passed. CI was green across the board.
The tests had all been AI-generated. They looked thorough. They covered every function. And they were close to useless.
This is not a story about a careless engineer. The PR had been reviewed. The tests had been read. The numbers had been checked. The whole pipeline did exactly what it was designed to do, and it shipped a bug into production anyway. I have heard four versions of this story in the last month and the shape is always the same: high coverage, all green, post-mortem reveals tests that asserted the implementation rather than the requirements.
The thesis of this essay is short. The old invariant - "if the tests pass, the code probably works" - is broken when an AI writes both the code and the tests in the same session. Coverage no longer measures what you think it measures. The fix is not better AI. It is a discipline shift in who decides what gets tested.
What's actually being asserted
The cleanest illustration I've seen of the problem is from Vaibhav Verma's piece "AI-Generated Tests Give False Confidence," which I'm going to lean on heavily because the example is hard to improve on. Here's a discount calculator:
    function calculateDiscount(order: Order): number {
      if (order.total > 100) return order.total * 0.1;
      if (order.items.length > 5) return order.total * 0.05;
      return 0;
    }
And here's what the AI typically generates for it: three tests, one per branch, each picking values that fall comfortably inside the branch. Coverage hits 100%. Every test is green. The PR looks bulletproof.
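To make that concrete, here is a minimal sketch of what such a suite tends to look like. It is a reconstruction for illustration, not Verma's actual tests: the `Order` shape, the case names, and the table-driven runner are all assumptions, and a real suite would use Jest or Vitest rather than a hand-rolled loop.

```typescript
// Assumed shape: the article does not show the Order type.
interface Order {
  total: number;
  items: string[];
}

function calculateDiscount(order: Order): number {
  if (order.total > 100) return order.total * 0.1;
  if (order.items.length > 5) return order.total * 0.05;
  return 0;
}

interface BranchCase {
  name: string;
  order: Order;
  expected: number;
}

// One case per branch, each comfortably inside its branch.
const branchCases: BranchCase[] = [
  { name: "10% when total is over 100", order: { total: 200, items: ["a"] }, expected: 20 },
  {
    name: "5% when there are more than 5 items",
    order: { total: 50, items: ["a", "b", "c", "d", "e", "f"] },
    expected: 2.5,
  },
  { name: "no discount otherwise", order: { total: 50, items: ["a"] }, expected: 0 },
];

// Names of failing cases; an empty list means the suite is green.
const failedBranchCases: string[] = branchCases
  .filter((c) => calculateDiscount(c.order) !== c.expected)
  .map((c) => c.name);
```

Every branch executes, so the coverage report is complete, yet none of the three inputs sits on a boundary or outside the comfortable middle of its branch.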
Now read the function again and ask the questions a thoughtful reviewer would ask. What happens when total is exactly 100? Does the boundary belong to the 10% bracket or fall just below it? The implementation excludes it, because the check is > 100 rather than >= 100. Was that the intent? What happens when total is negative because of a refund coupon flow you forgot about? What happens when an order qualifies for both discounts? The code applies only one, the 10% rule, because that branch is checked first. Is that a deliberate priority decision or a bug? What happens when items is empty? Should the function even run?
The AI didn't ask any of those questions. It couldn't. It doesn't know the business rules. It only has the code. And tests written from the code can only ever assert what the code does, not what the code should do. That is a profound shift, and it breaks the social contract test suites used to provide.
When humans wrote tests, the implicit invariant was that the test reflected at least one human's understanding of the requirement. When you read a human-written test, you were reading their interpretation of the spec, sometimes wrong but never tautological. When you read an AI-generated test, you are reading the AI's interpretation of the code it just wrote. There is no second source. There is no second mind. The check is missing.
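The reviewer's questions about the discount calculator can be turned into inputs. The sketch below only records what the current implementation returns for each one. It deliberately does not say what the right answer is, because that has to come from whoever owns the pricing rules. The `Order` shape and the probe values are assumptions for illustration.

```typescript
// Assumed shape: the article does not show the Order type.
interface Order {
  total: number;
  items: string[];
}

function calculateDiscount(order: Order): number {
  if (order.total > 100) return order.total * 0.1;
  if (order.items.length > 5) return order.total * 0.05;
  return 0;
}

const sixItems: string[] = ["a", "b", "c", "d", "e", "f"];

interface Probe {
  question: string;
  order: Order;
}

// Each probe corresponds to one question a reviewer might ask.
const probes: Probe[] = [
  { question: "total exactly 100, one item", order: { total: 100, items: ["a"] } },
  { question: "total exactly 100, six items", order: { total: 100, items: sixItems } },
  { question: "negative total, six items", order: { total: -50, items: sixItems } },
  { question: "qualifies for both discounts", order: { total: 200, items: sixItems } },
  { question: "empty order", order: { total: 0, items: [] } },
];

// What the code does today. Whether each value is correct is a business decision.
const currentBehavior = probes.map((p) => ({
  question: p.question,
  discount: calculateDiscount(p.order),
}));
```

With the implementation as written, the five probes return 0, 5, -2.5, 20, and 0. A negative discount and a silently dropped second discount are exactly the kind of result a test derived from the code would happily lock in.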
What the mutation testing data actually says
If the argument so far feels like vibes, the numbers tighten it. Verma ran mutation testing against three projects and the gap is brutal.
| Project | Coverage | Mutation Score | Gap (points) |
|---|---|---|---|
| Project A (AI tests) | 91% | 34% | 57 |
| Project B (AI tests) | 87% | 41% | 46 |
| Project C (human tests) | 76% | 68% | 8 |
A short refresher because mutation testing is still niche outside dedicated QA teams: the tool flips operators in your code, removes statements, swaps < for <=, that kind of thing. Then it reruns your tests. If the tests pass against the mutated code, the mutation "survived" and the tests didn't actually verify that behavior. The mutation score is the percentage of mutations the tests killed.
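A hand-rolled version of one mutation shows the mechanics. This is a sketch of the idea, not how Stryker or PIT work internally: the two-argument signature, the case lists, and the `isKilled` helper are hypothetical, and the boundary case assumes the spec really does exclude a total of exactly 100.

```typescript
type Discount = (total: number, itemCount: number) => number;

// The discount logic from earlier, flattened to two numbers for brevity.
const original: Discount = (total, itemCount) => {
  if (total > 100) return total * 0.1;
  if (itemCount > 5) return total * 0.05;
  return 0;
};

// Mutant: > swapped for >= in the first condition.
const mutant: Discount = (total, itemCount) => {
  if (total >= 100) return total * 0.1;
  if (itemCount > 5) return total * 0.05;
  return 0;
};

interface Case {
  total: number;
  itemCount: number;
  expected: number;
}

// One case per branch, none on a boundary.
const branchOnly: Case[] = [
  { total: 200, itemCount: 1, expected: 20 },
  { total: 50, itemCount: 6, expected: 2.5 },
  { total: 50, itemCount: 1, expected: 0 },
];

// Same suite plus one boundary case (assumes exactly 100 earns no 10% discount).
const withBoundary: Case[] = [...branchOnly, { total: 100, itemCount: 1, expected: 0 }];

// A mutant is killed when at least one case fails against it.
function isKilled(impl: Discount, cases: Case[]): boolean {
  return cases.some((c) => impl(c.total, c.itemCount) !== c.expected);
}

const survivesBranchOnly = !isKilled(mutant, branchOnly);
const killedByBoundary = isKilled(mutant, withBoundary);
```

The mutant survives the branch-only suite and is killed as soon as one boundary case is added, while the original passes both. A mutation tool automates this loop across many mutants.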
A 34% mutation score against 91% coverage means roughly two-thirds of the injected mutations survived: for that class of change, about two bugs in three would slip past the test suite. The tests touch the lines but they don't check them in a way that resists change. The 57-point gap is not a small statistical artifact. It is a coverage system gaslighting its operators.
Project C, with lower coverage and higher mutation score, is what tests written from requirements look like. Less surface area touched, but what is touched is actually checked. That's the trade you want.
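The arithmetic behind the table is small enough to spell out. The sketch below uses the figures from the table; the `ProjectStats` type and the helper names are mine, not Verma's.

```typescript
interface ProjectStats {
  name: string;
  coveragePct: number;
  mutationScorePct: number;
}

// Mutation score: percentage of generated mutants that the tests killed.
function mutationScore(killed: number, total: number): number {
  if (total <= 0) throw new RangeError("total mutants must be positive");
  return (killed * 100) / total;
}

// Figures taken from the table above.
const projects: ProjectStats[] = [
  { name: "Project A (AI tests)", coveragePct: 91, mutationScorePct: 34 },
  { name: "Project B (AI tests)", coveragePct: 87, mutationScorePct: 41 },
  { name: "Project C (human tests)", coveragePct: 76, mutationScorePct: 68 },
];

// Gap in percentage points, plus the share of mutants that survived.
const summary = projects.map((p) => ({
  name: p.name,
  gapPoints: p.coveragePct - p.mutationScorePct,
  survivingPct: 100 - p.mutationScorePct,
}));
```

For Project A this gives a 57-point gap with 66% of mutants surviving; for Project C, an 8-point gap with 32% surviving.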
A 2026 empirical study on arXiv points the same way: LLM-generated test suites achieve mutation scores of roughly 20% on complex real-world functions. Put differently, roughly 80% of injected faults go undetected. The shiny dashboard is mostly noise.
The five patterns that make this so hard to spot
If you sit with a few hundred AI-generated test files you start to see the same anti-patterns repeated, and Verma's piece names them well. I'll restate them in my own words because spotting these in review is now part of the job.
The mirror test restates the implementation. formatName("John", "Doe") returns "John Doe". So the test asserts that. There is no extra information in the test. If you replaced the test file with console.log(implementation.toString()) you would have learned the same thing.
The happy-path-only test covers the success case and nothing else. No timeouts, no malformed responses, no rate limits, no concurrent writes. The path that breaks in production is, by definition, not the happy path.
The over-mocked test is the most dangerous because it looks robust. Every external dependency is mocked, the test runs in milliseconds, the assertions all pass. But the mocks were generated by the same AI that wrote the code. The test verifies that processOrder calls db.create, payment.charge, and email.send. It does not verify that the order is correctly constructed, that the charge amount matches the order total, or that the email contains the right items. It verifies sequence, not correctness.
The snapshot trap generates large snapshot files that break on every cosmetic change. Teams stop reading the diffs after the second false alarm. After the fourth, they auto-accept. After the eighth, snapshot tests are decoration.
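The "it works" assertion is expect(result).toBeDefined(). It passes for any return value other than undefined, including null, an empty string, or the wrong number. It fails only if the function throws or returns undefined, which makes it barely distinguishable from no test at all.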
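A small sketch shows how little the "it works" assertion protects. The broken `formatName` below is hypothetical, and `passesToBeDefined` is a stand-in that mirrors the condition behind Jest's `toBeDefined` (anything except `undefined` passes) so the example runs without a test framework.

```typescript
// Hypothetical broken implementation: wrong order, no space.
function formatName(first: string, last: string): string {
  return last + first;
}

// Stand-in for expect(value).toBeDefined(): passes unless the value is undefined.
function passesToBeDefined(value: unknown): boolean {
  return value !== undefined;
}

// The kind of check a stated requirement would dictate.
function passesExactCheck(value: string, expected: string): boolean {
  return value === expected;
}

const result = formatName("John", "Doe");

const weakCheckPasses = passesToBeDefined(result);
const exactCheckPasses = passesExactCheck(result, "John Doe");
```

The weak check stays green against an obviously wrong result, while the exact check fails. Only the second one carries information about what the function is supposed to do.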
These are not exotic edge cases. They are the median output of asking Claude or Copilot to "write tests for this." If you've never sat with a junior engineer's AI-generated test file and counted how many of these five show up, do it this week. The exercise rewires what "comprehensive tests" means.
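The over-mocked pattern is the easiest of the five to miss, so it is worth one sketch. Everything here is hypothetical: the `processOrder` body, the deliberate bug, the `Deps` shape, and the hand-rolled recording fakes that stand in for a mocking library.

```typescript
interface LineItem {
  sku: string;
  price: number;
  qty: number;
}

interface Deps {
  db: { create(items: LineItem[]): void };
  payment: { charge(amount: number): void };
  email: { send(items: LineItem[]): void };
}

// Hypothetical implementation with a deliberate bug: qty is ignored in the total.
function processOrder(items: LineItem[], deps: Deps): void {
  const total = items.reduce((sum, item) => sum + item.price, 0);
  deps.db.create(items);
  deps.payment.charge(total);
  deps.email.send(items);
}

// Recording fakes: they note which call happened and what was charged.
const recorded = { calls: [] as string[], charges: [] as number[] };
const deps: Deps = {
  db: { create: () => { recorded.calls.push("db.create"); } },
  payment: {
    charge: (amount) => {
      recorded.calls.push("payment.charge");
      recorded.charges.push(amount);
    },
  },
  email: { send: () => { recorded.calls.push("email.send"); } },
};

const items: LineItem[] = [{ sku: "A1", price: 10, qty: 3 }];
processOrder(items, deps);

// Sequence-only check: the kind an over-mocked test makes.
const sequenceCheckPasses =
  recorded.calls.join(",") === "db.create,payment.charge,email.send";

// Correctness check: 3 units at 10 each should be charged as 30.
const amountCheckPasses = recorded.charges[0] === 30;
```

The sequence check passes even though the customer is charged 10 instead of 30. Asking "what real behavior is this checking?" in review is what surfaces the second assertion.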
Yes, but mutation testing is expensive
The honest objection to the fix I'm about to propose is that mutation testing is slow and noisy. Stryker, PIT, and Mutmut can take ten or twenty times as long as your regular test suite on a real codebase. The mutations are sometimes irrelevant. The signal-to-noise ratio is not great out of the box.
That objection is true and I want to take it seriously. Mutation testing is not free. You cannot run it on every PR for every file. The throughput cost is real and engineering leaders should not pretend otherwise.
But here's the move. You don't need mutation testing as a per-PR gate. You need it as a calibration tool. Run it nightly against the modules where AI generates the most code. Watch the mutation score over time. When it drifts down, you have actionable evidence that the test suite is rotting even though coverage is stable. That is information you cannot get any other way.
Verma's CI workflow uses mutation testing as a quality gate with thresholds. That's the high-end version. The lightweight version is just: pick the five most business-critical modules, run mutation testing on them once a week, look at the score, and treat a drop as a regression worth investigating. You can run that against a real production codebase for the cost of a few hours of CI compute per week. It is the cheapest way I know to detect when AI-generated tests have started lying to you.
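The lightweight version needs very little tooling beyond the mutation run itself. The sketch below assumes you already export a per-module score from each run; the `ModuleScore` shape, the module names, the numbers, and the two-point tolerance are all hypothetical.

```typescript
interface ModuleScore {
  module: string;
  mutationScorePct: number;
}

// Returns the modules whose mutation score fell by more than tolerancePoints
// between the baseline run and the latest run.
function findScoreDrops(
  baseline: ModuleScore[],
  latest: ModuleScore[],
  tolerancePoints: number = 2,
): string[] {
  const before: Record<string, number> = {};
  for (const s of baseline) before[s.module] = s.mutationScorePct;

  return latest
    .filter((s) => {
      const previous = before[s.module];
      return previous !== undefined && previous - s.mutationScorePct > tolerancePoints;
    })
    .map((s) => s.module);
}

// Hypothetical scores from two weekly runs.
const lastWeek: ModuleScore[] = [
  { module: "orders", mutationScorePct: 68 },
  { module: "billing", mutationScorePct: 72 },
  { module: "auth", mutationScorePct: 80 },
];
const thisWeek: ModuleScore[] = [
  { module: "orders", mutationScorePct: 55 },
  { module: "billing", mutationScorePct: 71 },
  { module: "auth", mutationScorePct: 81 },
];

const drops = findScoreDrops(lastWeek, thisWeek);
```

Here only `orders` is flagged: it lost 13 points, while `billing` moved within tolerance and `auth` improved. The tolerance exists because mutation scores wobble slightly from run to run as code changes.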
What to do on Monday morning
There are three changes that actually move the needle, in order of how soon they pay off.
First, switch the test-writing order. The new discipline is: humans write describe/it blocks describing the behaviors that matter, then the AI fills in the test bodies. The descriptions are the requirements; the bodies are the mechanical work. This sounds like a tiny process tweak. It is the central move. As long as the AI gets to decide what to test, your tests will mirror the implementation. As soon as a human writes the descriptions first, the AI is constrained to test what you actually care about. Most teams I've watched make this shift get a working version in a sprint.
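Here is roughly what the human-written half might look like before the AI touches it. This is a sketch using Node's built-in test runner (`node:test`) so it runs without extra dependencies; Jest and Vitest offer the same `describe`/`it.todo` shape. The behavior descriptions are hypothetical placeholders, since the real ones have to come from whoever owns the pricing rules.

```typescript
import { describe, it } from "node:test";

// Written by a human, before any test body exists.
// Each description states a requirement, not what the code happens to do.
describe("calculateDiscount", () => {
  it.todo("gives 10% off when the total is above 100");
  it.todo("applies the agreed rule when the total is exactly 100");
  it.todo("gives 5% off for more than 5 items when the 10% rule does not apply");
  it.todo("applies the agreed priority when an order qualifies for both discounts");
  it.todo("handles a negative total according to the refund policy");
  it.todo("handles an order with no items");
});
```

The AI's job is then to replace each `todo` with a body that satisfies the description. If a description cannot be turned into an assertion, that usually means the requirement itself is still undecided, which is worth knowing before the code ships.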
Second, add mutation testing to one critical module. Just one, this week. Pick the module that would cost you the most if it shipped a silent bug. Run Stryker (or your language equivalent) against it. Look at the score. If it's below 60%, you have a backlog of test-writing to do that coverage was hiding from you. If it's above 70%, you have evidence that the test discipline in that module works, and you have a baseline to defend.
Third, name the patterns out loud in code review. When you see a happy-path-only test, write a review comment that says "this is a happy-path-only test, can we add the failure modes?" When you see an over-mocked test, ask "what real behavior is this checking?" Naming the anti-patterns turns them from vague unease into actionable feedback, and it teaches juniors what bad AI-generated tests look like before they internalize them as normal.
The hard part of this whole shift is that the green dashboard is comfortable. Tests passing feels like progress. Mutation scores below 50% feel like a regression even though they're just an honest measurement of what was already true. You'll have to be willing to make your CI dashboard less green in exchange for it being less of a lie. That trade is worth making. The alternative is more Saturday-morning pages.

