SL#54 - Your 12-Million-Token Context Window Is a Lie Until You Measure It

A few weeks ago, Subquadratic launched SubQ, a 12-million-token context model that scores 86.2% on MRCR v2 against Claude Opus 4.6's 78.3% and Gemini 3.1 Pro's 26.3%. The headlines wrote themselves. Twelve million tokens. Fifty-six times faster than FlashAttention at one million tokens. A subquadratic frontier model from a $29M seed startup in Miami. The dense-attention era is over, etc.

The architecture is real and the benchmarks were run by Appen after the initial release, so this isn't a vapor announcement. But the most useful number in Subquadratic's own writeup is the one nobody quoted in the launch coverage: Opus 4.7 scores 32.2 on MRCR v2. Opus 4.6 scores 78.3. Same model family, point release apart, both shipped within months of each other, and the score collapses by 46 points on the benchmark Subquadratic is using to make its case.

That's not a SubQ story. That's the story.

The thesis: the benchmark game on long context has decoupled from anything you can use in production. The number on Subquadratic's slide and the number on Anthropic's slide are both real, both reproducible under the right conditions, and both nearly useless for predicting whether the model will work on your actual million-token codebase. If you're picking an LLM today based on its context window or its MRCR score, you're shopping for the wrong thing.

The benchmark Subquadratic chose tells you what's broken

MRCR v2 is one of the better long-context benchmarks. It evaluates whether a model can retrieve and integrate multiple non-adjacent pieces of evidence distributed across a long input, where the relevant set isn't specified in advance. This is closer to real work than needle-in-a-haystack. Real work is "find every place in the codebase where this contract is enforced, even the ones that paraphrase it," not "find the sentence with the magic word."

Subquadratic published a comparison table that looks like this:

Model           MRCR v2
SSA / SubQ      86.2%
Gemini 3.1 Pro  26.3%
Opus 4.6        78.3%
Opus 4.7        32.2%
GPT 5.4         36.6%
GPT 5.5         74.0%

Look at the variance inside a single vendor's lineup. Anthropic ships Opus 4.7 as a point upgrade to 4.6 and the MRCR v2 score drops from 78.3 to 32.2. OpenAI moves from GPT 5.4 to GPT 5.5 and goes the other direction, 36.6 to 74.0. These aren't different models from different eras. They are sibling releases, shipped close together, on the benchmark Subquadratic is asking you to use to pick a vendor.

One of three things is happening. Either Anthropic shipped a regression that nobody noticed, or the benchmark is fragile to small changes in post-training that don't affect what most people actually use the model for, or Subquadratic ran the benchmark in a configuration that disadvantages Opus 4.7. Probably some combination. The point isn't to litigate the table. The point is that if MRCR v2 is the best long-context benchmark we have, and a forty-six-point swing between consecutive minor versions of the same model is plausible, then the benchmark isn't measuring what the marketing copy implies it's measuring.
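To put the swing in proportion, here is a small sketch that uses only the six scores from the table. It compares the change between sibling releases with SubQ's lead over the best competing score. The helper names are mine, not anything from Subquadratic's writeup.

```python
# MRCR v2 scores as printed in the comparison table (percent).
scores = {
    "SubQ": 86.2,
    "Gemini 3.1 Pro": 26.3,
    "Opus 4.6": 78.3,
    "Opus 4.7": 32.2,
    "GPT 5.4": 36.6,
    "GPT 5.5": 74.0,
}


def swing(older: str, newer: str) -> float:
    """Score change in points from one release to the next."""
    return round(scores[newer] - scores[older], 1)


def lead_over_best_rival(model: str) -> float:
    """Gap in points between `model` and the best other score in the table."""
    best_rival = max(v for k, v in scores.items() if k != model)
    return round(scores[model] - best_rival, 1)


opus_swing = swing("Opus 4.6", "Opus 4.7")   # change within Anthropic's lineup
gpt_swing = swing("GPT 5.4", "GPT 5.5")      # change within OpenAI's lineup
subq_lead = lead_over_best_rival("SubQ")     # SubQ's margin over the runner-up

print(opus_swing, gpt_swing, subq_lead)
```

The within-vendor swings come out several times larger than the lead the launch headline rests on. That comparison is the whole argument in three numbers.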

This isn't unique to MRCR. Long-context evals have a long history of breaking when the test passages are perturbed slightly, when haystacks contain syntactically similar distractors, when the needles are paraphrased rather than copied verbatim, or when the questions require any integration step beyond surface retrieval. The numbers move because the benchmarks haven't fully cracked the underlying capability. We're measuring the thing we know how to measure, then assuming it generalizes.

What SubQ actually solved, and what it didn't

The architecture itself is the part that deserves serious attention. Dense attention is O(n²). FlashAttention is still O(n²); it just executes the work more efficiently against the memory hierarchy. Subquadratic Sparse Attention (SSA) is the first commercial attempt to shrink the work itself: for each query, the model selects a subset of positions to attend to, then computes exact attention over that subset, so the cost per query grows sublinearly in the number of keys rather than linearly. This is the bet that prior approaches like Mamba and RWKV punted on. State-space models give you linear scaling by compressing the past into a fixed-capacity state. They preserve gist, but they lose retrieval. SSA tries to keep both.
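The select-then-attend idea is easier to see in code. What follows is a toy sketch in plain Python, not Subquadratic's implementation: the writeup as summarized here doesn't say how SSA picks positions, so the block-mean selector below is purely an assumption chosen for illustration, and every function name is hypothetical. A real system would also precompute the block summaries once rather than per query.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Exact softmax attention restricted to the given key positions."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, positions))
        for d in range(dim)
    ]


def select_positions(query, keys, block_size, top_blocks):
    """Hypothetical selector: score each block by its mean key, keep the best."""
    blocks = [
        list(range(start, min(start + block_size, len(keys))))
        for start in range(0, len(keys), block_size)
    ]

    def block_score(block):
        mean_key = [
            sum(keys[i][d] for i in block) / len(block)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_blocks]
    return sorted(i for block in chosen for i in block)


def sparse_attention(query, keys, values, block_size=2, top_blocks=1):
    """Pick a subset of positions, then run exact attention over only those."""
    positions = select_positions(query, keys, block_size, top_blocks)
    return attend(query, keys, values, positions)
```

The design point to notice: the attention over the chosen positions is exact, so nothing is compressed away the way it is in a fixed-size state. All of the risk moves into the selector. If it skips the block that holds the evidence, the model never sees it.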

The wall-clock numbers in the SSA writeup are the genuinely impressive part:

Context     FA2 latency     SSA latency     Speedup
128K        319.88 ms       46.5 ms         6.88x
256K        1,272.19 ms     94.2 ms         13.51x
512K        5,228.55 ms     189.85 ms       27.54x
1M          21,410.51 ms    380.96 ms       56.2x

At 128K, the speedup is meaningful. At 1M, it's the difference between an interactive system and a batch job. If you've ever tried to feed a million-token codebase into a dense-attention model and waited twenty seconds for the prefill before the first token, you know exactly why that gap matters.
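You can check the scaling claim against the table directly. The sketch below estimates the empirical exponent p in latency ~ context^p between consecutive rows, and recomputes the speedup column. One assumption is labeled in the code: I treat "1M" as 1024K so that each row doubles the one before it.

```python
import math

# Context sizes in K tokens and latencies in ms, copied from the table.
# Assumption: "1M" is treated as 1024K, so each row doubles the previous one.
context_k = [128, 256, 512, 1024]
fa2_ms = [319.88, 1272.19, 5228.55, 21410.51]
ssa_ms = [46.5, 94.2, 189.85, 380.96]


def scaling_exponents(contexts, latencies):
    """Empirical p in latency ~ context**p, measured between consecutive rows."""
    return [
        math.log(latencies[i + 1] / latencies[i])
        / math.log(contexts[i + 1] / contexts[i])
        for i in range(len(contexts) - 1)
    ]


fa2_exponents = scaling_exponents(context_k, fa2_ms)
ssa_exponents = scaling_exponents(context_k, ssa_ms)
speedups = [round(f / s, 2) for f, s in zip(fa2_ms, ssa_ms)]

print("FA2 exponents:", [round(p, 2) for p in fa2_exponents])
print("SSA exponents:", [round(p, 2) for p in ssa_exponents])
print("Speedups:", speedups)
```

The FA2 exponents land close to 2 and the SSA exponents close to 1, which is why the speedup roughly doubles every time the context doubles. On these four data points, SSA's latency behaves like a linear-time system.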

But "we made attention faster" is not the same thing as "our model is better at long context than yours." Speedup numbers describe the system you can build. They don't describe the quality of the answers you'll get out of it. The MRCR v2 number is the one that's supposed to bridge the two, and the MRCR v2 number is the one whose stability has been collapsing in front of us in real time.

There's a more honest framing of SubQ that the launch coverage didn't reach for: it's a research-grade architectural win that ships at the same time as a marketing claim about retrieval quality that the benchmark ecosystem isn't ready to support. Both things are true. Treating them as one claim does a disservice to the architecture work.

The functional context window is the only number that matters

Here is the distinction Subquadratic gets exactly right in their own writeup and then walks away from: a nominal context window is the number of tokens a model will accept. A functional context window is the number of tokens it can actually reason over reliably enough that you'd bet a production decision on the output.

Every frontier model today has a nominal window that's far larger than its functional window. Anthropic ships a 1M-token beta for Opus 4.6. Gemini ships 1M and 2M variants. GPT 5.5 supports long context. SubQ ships 12M. Inside each of those windows, there's a soft frontier where retrieval quality degrades, attention starts missing things in the middle of the sequence, and the model falls back on whatever's nearby because nearby evidence is easier for it to use than far-away evidence.

You can't read this number off a benchmark. You have to measure it on your corpus, with your task, with the way you actually compose prompts. A model that scores 95% on RULER at 128K might score 60% on your specific use case at 200K because the structure of your data fights the model's attention prior in a way the eval doesn't.

This is the operational version of the lesson. The MRCR v2 number is interesting but not actionable. The number that's actionable is one you have to produce yourself: pick the 20 hardest queries from your real workload, run them at increasing context sizes against the candidate models, and watch where the quality cliff is. Every model has a cliff. The cliff is at a different place for different tasks. The cliff is usually well below the nominal window. Knowing where your cliff is for your problem is worth ten times any vendor benchmark.
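A minimal sketch of that measurement, assuming you already have a way to run one query at a given context size and grade the answer. Everything here is hypothetical scaffolding: `run_query` stands in for your own code that builds the prompt, calls the candidate model, and returns True if the answer passes, and the 10-point drop threshold is an arbitrary default you should tune.

```python
from typing import Callable, Dict, Optional, Sequence


def accuracy_by_context(
    queries: Sequence[dict],
    context_sizes: Sequence[int],
    run_query: Callable[[dict, int], bool],
) -> Dict[int, float]:
    """Fraction of queries answered correctly at each context size."""
    return {
        size: sum(run_query(q, size) for q in queries) / len(queries)
        for size in context_sizes
    }


def find_cliff(curve: Dict[int, float], max_drop: float = 0.10) -> Optional[int]:
    """First context size where accuracy falls more than `max_drop`
    below the best accuracy seen at any smaller size. None if no cliff."""
    best = None
    for size in sorted(curve):
        acc = curve[size]
        if best is not None and best - acc > max_drop:
            return size
        best = acc if best is None else max(best, acc)
    return None
```

Run it once per candidate model and per task type. The cliff for multi-hop questions will rarely sit at the same context size as the cliff for single-fact lookups, so keep the curves separate rather than averaging them.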

What this means for picking a model in 2026

For most teams shipping LLM features today, the optimization is no longer "pick the model with the biggest context window." The nominal window is effectively unlimited for any reasonable single-document task. The next-mile optimization is figuring out which model degrades least on your specific shape of long context, and that requires private evals.

A few practical heuristics hold up better than vendor benchmarks.

If your task is single-needle retrieval inside a long document, almost any current frontier model handles it; pick on price.

If your task is multi-hop integration across many documents, the model choice matters and you need to test, because the benchmark numbers are noise at this resolution.

If your task is long-running agentic work where context accumulates across many turns, the dominant failure mode is context decay across turns, not the attention mechanism's raw capacity; the fix is in your harness, not the model.

If your task is a million-token codebase, prefill latency is a real constraint and SubQ's speedup numbers matter; quality you'll have to measure yourself.

The deeper shift is what counts as table stakes. Five years ago, building an LLM eval harness was something only labs did. In 2026, if you ship a feature on top of a model, owning a private eval set is closer to operations than research. Vendor benchmarks tell you which models are roughly in the conversation. They don't tell you which one will work for you. Closing that gap is on you.

Yes, but: the benchmark critique cuts both ways

The honest counterargument: yes, MRCR v2 is unstable across minor versions, but it's also one of the only public benchmarks that meaningfully separates models at long context. If we throw it out because the numbers move around, we're back to vibes-based vendor selection, and vibes are even worse. Better to have a noisy signal than no signal.

This is fair. The conclusion isn't "ignore MRCR." The conclusion is "treat MRCR as one weak indicator among many, not as a ranking." If your model needs to handle 500K-token contexts, you should care that SubQ scores well on MRCR v2. You should care more that two consecutive Opus versions disagree by 46 points on the same benchmark, because that fact tells you the signal-to-noise ratio of any single MRCR comparison is lower than the marketing implies. Both can be true.

There's also a generous reading of Subquadratic's MRCR position that I want to acknowledge: they're a young company shipping a real architectural innovation, and they used the strongest public long-context benchmark available to make their case. That's the right move. The criticism here is less about Subquadratic specifically and more about the ecosystem we're in, where the strongest public benchmark is still weak enough that consecutive model versions can swing 46 points and nobody pauses to ask what that means.

What to do Monday morning

If you operate an LLM feature in production, the cheapest high-leverage move this quarter is to build a 50-item private eval set targeting your hardest long-context tasks. Not a synthetic haystack. Real queries from your real users, with the kind of context they actually drag in. Run it against every candidate model at the context lengths you actually use. Track the score over time as vendors ship updates, because the variance you'll see between minor versions will tell you more about your model risk than any vendor blog post.
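Tracking the score over time needs very little machinery. Here is one way to sketch it, assuming you log a pass count per run. The `EvalRun` record, the model labels, and the 10-point regression threshold are all illustrative choices, not a standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvalRun:
    model: str           # whatever label you use for the model version (hypothetical)
    context_tokens: int  # context length the eval set was run at
    passed: int          # items from your private eval set that passed
    total: int           # size of the eval set, e.g. 50

    @property
    def score(self) -> float:
        return self.passed / self.total


Delta = Tuple[str, str, int, float]


def version_deltas(runs: List[EvalRun]) -> List[Delta]:
    """Score change between consecutive runs at the same context length.
    Runs are assumed to be listed in release order."""
    last: Dict[int, EvalRun] = {}
    deltas: List[Delta] = []
    for run in runs:
        prev = last.get(run.context_tokens)
        if prev is not None:
            change = round(run.score - prev.score, 3)
            deltas.append((prev.model, run.model, run.context_tokens, change))
        last[run.context_tokens] = run
    return deltas


def regressions(runs: List[EvalRun], threshold: float = 0.10) -> List[Delta]:
    """Deltas where the newer version lost more than `threshold` of its score."""
    return [d for d in version_deltas(runs) if d[3] < -threshold]
```

Append a run every time a vendor ships an update and check `regressions` before you switch traffic. With 50 items, a single flipped answer moves the score by two points, so treat small deltas as noise and only act on the large ones.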

When SubQ becomes generally available, run it through the same harness. Maybe it lives up to the 86.2 on your data, maybe it scores 50. You'll know in an afternoon. The teams who know what their cliff is will pick models with confidence in 2026. The teams who don't will keep being surprised every time a point release moves their numbers around.

Twelve million tokens is an impressive engineering result. It's also a marketing number until you measure how many of those tokens your task can actually use. Do the measurement. Pick on the result.
