On July 20, 2026, GitHub Code Quality leaves public preview and becomes a paid product at $10 per active committer per month. More than 10,000 enterprises already ran it during the preview, and the GA release adds the parts that will actually change how teams behave: repository- and organization-level quality scoring, dashboards, and quality gates that block a pull request from merging when it falls below a maintainability, reliability, or coverage threshold. GitHub is doing something SonarQube and its clones have done for years, except now it is wired directly into the branch ruleset every team already uses, one click from on.
I have watched enough of these rollouts to predict the first meeting. Someone screenshots the new repository score, it is a C, and the room decides the number should be a B by end of quarter. A ruleset goes up that blocks any PR with unresolved maintainability findings. Six weeks later the score is a B, the dashboard is green, and the same three services page the on-call engineer at 2am that they always did. Nothing that hurt actually got better.
The reason is not that the tool is bad. CodeQL is good, the findings are real, the per-file view is genuinely useful. The reason is that a single number for a whole repository is an average, and code decay is one of the least evenly distributed phenomena in software. When you optimize an average, you get told exactly where you are allowed to do the least valuable work, and you do it, because the number rewards you for it.
Decay concentrates. The score smears it flat.
The empirical shape of a rotting codebase has been known for a long time, and it is not a gentle uniform gradient. It is a handful of files on fire and everything else basically fine. Adam Tornhill built an entire company, CodeScene, on this observation, first laid out in his book Your Code as a Crime Scene. His measure of a "hotspot" is not complexity alone. It is complexity multiplied by change frequency: the file that is both complicated and touched constantly. Complexity that nobody edits is harmless. It is the intersection that predicts pain. CodeScene's own research reports that the top hotspots make up a minor fraction of the code yet account for somewhere between 25 and 70 percent of the defects a team actually reports and fixes.
Sit with that ratio. If something like two to five percent of your files carries anywhere from a quarter to more than two-thirds of your defect load, then the aggregate maintainability of the other ninety-five percent is close to noise for the outcome you care about. A repository score is computed across all of it. You can move that score by fixing a hundred trivial findings in cold, stable files that no one has edited since 2023 and no user will ever feel. The grade goes up. The dashboard goes green. Your hotspots, the files that page you, are untouched, because touching them is hard and scary and the score does not make you.
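To make the arithmetic concrete, here is a toy model. It is not GitHub's scoring formula; the "score" is simply the mean number of findings per file, and the repository shape (3 hotspots, 97 cold files) and the helper name avg_findings are made up for illustration.

```shell
# Toy model, NOT GitHub's scoring formula: the "score" here is just the
# mean number of findings per file across a hypothetical 100-file repo.
# Usage: avg_findings HOT_FILES FINDINGS_PER_HOT COLD_FILES FINDINGS_PER_COLD
avg_findings() {
  awk -v hf="$1" -v hn="$2" -v cf="$3" -v cn="$4" \
    'BEGIN { printf "%.2f\n", (hf * hn + cf * cn) / (hf + cf) }'
}

# 3 hotspots with 40 findings each, 97 cold files with 2 findings each.
echo "baseline:          $(avg_findings 3 40 97 2)"
# One trivial fix in every cold file; hotspots untouched.
echo "cold-file sweep:   $(avg_findings 3 40 97 1)"
# Half of the hotspot debt removed; cold files untouched.
echo "hotspot cleanup:   $(avg_findings 3 20 97 2)"
```

Under these invented numbers the mean goes from 3.14 to 2.17 after the cold-file sweep, and only to 2.54 after halving the hotspot debt. Ninety-seven trivial fixes beat sixty hard ones on the number, and nothing in the number says which set mattered.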
This is not a hypothetical failure mode. It is the default one, because the cheap findings are always in the cold code. Legacy files that nobody changes accumulate lint-level debt and then sit still, which is exactly what makes them safe to refactor for points. The hot files resist cleanup precisely because they are load-bearing and constantly moving. An average-based objective points every engineer at the safe, low-value work and away from the dangerous, high-value work. It is Goodhart's law with a progress bar.
The gate rewards the wrong pull requests
The dashboard is passive. The gate is not, and the gate is where the real distortion lives. GitHub's model is a branch rule, "Require code quality results," where you pick the lowest severity that must be resolved before merge: errors only, warnings and higher, notes and higher, or all. It reads as a clean, strict standard. In practice it measures the absolute state of the code a PR touches, and absolute state is the wrong thing to measure at the PR boundary.
Consider two pull requests. The first adds a brand new module in clean greenfield code; it sails through, because new code written this week against a modern style has few findings. The second is a one-line bug fix inside your oldest, gnarliest payments file, the worst hotspot you own. The moment that PR's diff makes CodeQL re-surface the pile of pre-existing findings in that file, the gate blocks the merge. The engineer who dared to go into the scariest file in the building to fix a real bug now owes you a cleanup of debt they did not create, or they route around the gate with an exception, or, most likely, they stop volunteering for work in that file at all. You have taxed exactly the behavior you needed and subsidized the behavior that was already easy.
The fix is not to turn the gate off. It is to gate on the delta, not the absolute. The idea SonarQube named "clean as you code" is the right instinct: judge the code a PR changed, and require that it does not get worse, rather than requiring the whole file to be clean. A PR that leaves the lines it touched at least as healthy as it found them passes. A PR that adds new findings fails. Under that rule the payments bug fix merges, because it introduced nothing new, and the greenfield module still has to keep its own house clean. GitHub's severity gate can approximate this if you scope it to new findings on the diff rather than all unresolved findings on the file, and you should configure it that way before you configure anything else.
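The delta rule itself is small enough to sketch. What follows is an illustration of the idea, not GitHub's or SonarQube's implementation. It assumes you can export findings for the base branch and for the PR head into two plain-text files, one finding per line, in a hypothetical rule|path|fingerprint format that deliberately leaves out line numbers so that shifted lines do not look like new findings. The function names are mine.

```shell
# new_findings BASE_FILE HEAD_FILE
# Print findings that appear in HEAD_FILE but not in BASE_FILE.
new_findings() {
  base_sorted=$(mktemp)
  head_sorted=$(mktemp)
  # comm needs both inputs sorted under the same collation.
  LC_ALL=C sort -u "$1" > "$base_sorted"
  LC_ALL=C sort -u "$2" > "$head_sorted"
  # -13 suppresses base-only and shared lines, leaving head-only lines.
  LC_ALL=C comm -13 "$base_sorted" "$head_sorted"
  rm -f "$base_sorted" "$head_sorted"
}

# delta_gate BASE_FILE HEAD_FILE
# Exit 0 when the PR introduces no new findings, 1 otherwise.
# Pre-existing findings never fail the gate, however many there are.
delta_gate() {
  added=$(new_findings "$1" "$2")
  if [ -n "$added" ]; then
    printf 'new findings introduced by this change:\n%s\n' "$added" >&2
    return 1
  fi
  return 0
}
```

With this shape, the one-line payments fix passes as long as its head file lists the same findings as its base file, while a greenfield module that adds even one finding fails. How you produce the two input files depends on your scanner and is outside this sketch.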
The second thing worth doing is cheap and does not need a license at all. Your git history already knows where the hotspots are. You can rank your files by raw change frequency in one command:
git log --since='1 year ago' --name-only --pretty=format: \
| grep -E '\.(ts|tsx|py|go|java|rb)$' \
| sort | uniq -c | sort -rn | head -20
That list, crossed against a complexity signal as blunt as lines-of-code or as precise as a cyclomatic count, is your real risk map. A ten-line script gets you most of what a dashboard sells, and unlike the dashboard it tells you where to point people, not just what letter you scored. Combine the two into a single ordering:
git log --since='1 year ago' --name-only --pretty=format: \
| grep -E '\.(ts|tsx|py|go|java|rb)$' | sort | uniq -c \
| while read -r count file; do
    [ -f "$file" ] && echo "$((count * $(wc -l < "$file"))) $file"
  done | sort -rn | head -15
The top of that list is where a quality budget should go. Not the repository average. The top of that list.
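If line count feels too blunt and a real cyclomatic count is more tooling than you want, total indentation depth is a language-neutral middle ground. The sketch below is one way to compute it. The function name indent_complexity and the rule that one tab or four spaces equals one level are my assumptions, and will misjudge files that indent with two spaces.

```shell
# indent_complexity FILE
# Sum the indentation level of every non-blank line in FILE.
# Assumption: one tab or four leading spaces counts as one level.
indent_complexity() {
  awk '
    /^[ \t]*$/ { next }                      # skip blank lines
    {
      match($0, /^[ \t]*/)                   # leading whitespace only
      lead = substr($0, 1, RLENGTH)
      tabs = gsub(/\t/, "", lead)            # count tabs, leave the spaces
      total += tabs + int(length(lead) / 4)  # add space-based levels
    }
    END { print total + 0 }
  ' "$1"
}
```

To use it in the ranking above, swap the wc -l < "$file" term for indent_complexity "$file". Deeply nested files then outrank long but flat ones at the same change count.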
What the score is actually good for
The honest concession is that an aggregate score is not useless; it is just mis-aimed when a single team stares at its own repo. Where it earns its keep is one level up. If you run fifty services and you want to know which three to fund a reliability quarter for, an organization-wide maintainability ranking is a reasonable triage input. Trend also matters more than level: a repo sliding from B to C over two quarters is a signal worth acting on even when the absolute grade looks fine, and the org dashboard is a decent place to catch that drift early. Used as a portfolio-level attention allocator, the score is doing something the git-churn script on a single repo cannot.
The trouble is only that the number is so easy to turn into a target. The instant "raise the score to B" becomes an OKR, every incentive I described above switches on, and the tool starts producing motion in the cold corners of the codebase while the fires keep burning. The SIG State of Software 2026 report, drawn from more than 30,000 systems and over 400 billion lines of code, found that 86 percent of code already sits below their recommended maintainability rating, and that AI-assisted output magnifies whatever engineering discipline a team already had rather than supplying its own. A score does not supply discipline either. It supplies a number, and a number that is easy to game will be gamed by tired people under deadline, which is all of us.
Yes, but the findings are per-file, not just one grade
The strongest objection is fair: GitHub Code Quality is not only a letter grade. It surfaces findings per rule and per file, it groups them so you can prioritize, and the reliability and maintainability scores are meant to summarize severity, not replace the detail underneath. A careful team can absolutely open the findings, sort by the files that matter, and ignore the aggregate. All of that is true.
But tools are used the way their defaults and their headline numbers steer people, not the way their most disciplined users could theoretically use them. The default artifact is the score. The default enforcement is the gate. Both are computed from a static snapshot of the current code, and neither knows which of your files changed forty times last quarter, because change frequency is a property of your git history, not of a static scan of the current tree. The temporal dimension, the single most predictive input to where defects live, is exactly the dimension a snapshot-based quality score cannot see. That is not a bug GitHub will patch. It is what a static analysis of one point in time is, by definition, blind to.
What to do before July 20
If Code Quality is coming to your org, spend an hour now, not after the first green-dashboard meeting. Run the churn command against your two or three most important repositories and write down the top ten files. That list is your ground truth; keep it next to the GitHub score and notice how little they agree. Configure the PR gate to fire on new findings introduced by the diff, not on all unresolved findings in the touched files, so that the people brave enough to work in your hotspots are not punished for it. And refuse, out loud and in the meeting, to make the repository grade a target. Make the target the health of the top of your hotspot list, measured PR by PR as clean-as-you-code, and let the aggregate score be a thermometer you read rather than a dial you turn. The number GitHub is about to sell you is real. It is just an average, and the thing that will page you at 2am has never once been average.

