SWE-bench broke because static benchmarks inevitably get gamed (Goodhart’s Law). Code Review Bench proposes a self-correcting alternative: an offline benchmark (controlled, gold-set evaluation) continuously checked against real-world developer behavior (online data on which review comments get acted on). When the two diverge, that signals the benchmark is wrong. The goal is a continuously refreshed, reality-grounded foundation for measuring and improving code generation.

The best models take billions of dollars to train. Only thousands have been spent on our best benchmarks. This is an experiment in closing that gap.

We're open-sourcing Code Review Bench, a benchmark for code review tools that uses real-world developer behavior to avoid going stale. It has 200K+ PRs and updates daily.

The methodology and code are available at github.com/withmartian/code-review-benchmark.

SWE-bench Verified, created by OpenAI through three iterations of careful human review, has been the standard benchmark for code generation. Earlier this week, they killed it, recommending that everyone stop reporting scores. The reason: frontier models can reproduce gold patches from memory. When you ask GPT-5.2 about a specific SWE-bench task, it outputs the exact golden diff. Claude Opus quotes inline comments verbatim. Gemini produces the precise regex fix with line numbers. On top of that, 59.4% of the problems models still couldn't solve turned out to have broken tests.

SWE-bench was good work. It still broke. We think that's the interesting question: not "how do we fix this benchmark" but "why do benchmarks keep breaking, and can you build one that doesn't?"

The rest of this post explains how we're trying, and why we think a code review benchmark is the best foundation for measuring code generation.

Why code review is the highest-leverage thing to measure

Code generation is the most economically valuable application of AI right now. Measuring it well matters. And the best way to measure code generation might not be to measure code generation directly.

It's easier to check code than to write it. A reviewer looking at a diff can spot a bug without having written the correct implementation themselves. This asymmetry is fundamental: verifying a solution is cheaper than producing one. It's why code review exists as a practice in the first place.

This has a direct consequence for benchmarks. If you try to benchmark code generation directly, you need to verify whether the generated code is correct. That's what SWE-bench does with its test suites, and we've seen how that breaks. But if you can benchmark the verifier well, you get a measure of code generation for free: run a generator, run the verifier, and the verifier's accuracy tells you how much you can trust the result.

It also has a direct consequence for training. RL training requires a reward signal. For code generation, that reward signal is a judgment about whether the code is correct. That's code review. A good code review benchmark is effectively a benchmark for the reward function. If the reward function is well-measured and well-understood, you can train better generators. If it's not, you're optimizing against a signal you can't trust, which is the Goodhart problem.

So when we say "code review benchmark," we don't just mean a leaderboard for code review tools. We mean a foundation for measuring and improving code generation itself.

The Goodhart Problem: Why benchmarks keep breaking

The natural response to a broken benchmark is to fix the dataset and re-release. SWE-bench did this three times: original → Verified → Pro. Each version addresses real problems with the last one. Pro uses copyleft repos to reduce contamination, tests in multiple languages instead of just Python, and requires larger patches across more files. These are real improvements. They don't address the underlying problem.

Consider what happens when you have 500 problems from 12 Python repos and that's your benchmark. There are many ways to improve your score. You could make a model that's genuinely better at writing code. But you could also include those repos in your training data. You could train against features of those specific problems that don't generalize. You could train multiple checkpoints and release the one with the highest score. You could engineer your scaffolding to handle the specific patterns that appear in those 12 repos. All of these improve the number. Only the first one improves the product.

This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. What makes it hard is that each specific failure mode looks like a fixable bug:

Contamination looks like a data hygiene problem: just filter the benchmark from training data. But public benchmarks end up in training data because training data is built from public code and public discussions about code. You can use copyleft repos, but you can't prevent people from discussing the problems, writing blog posts about solutions, or referencing the same codebases for other work. OpenAI's contamination post is itself now training data for the next generation of models.

Broken tests look like a curation problem: just review the tests more carefully. But writing tests that verify correctness while remaining agnostic to implementation is genuinely hard. OpenAI hired 93 software engineers to review the problems. Three independent reviewers per problem. They still found that 59.4% of the hardest remaining problems had flawed tests. The proxy ("tests pass") drifts from the intent ("problem is solved"), and human review slows the drift but doesn't stop it.

Staleness looks like a refresh problem: just add new problems. But as long as the set is fixed between refreshes, the same pressures apply to the new set. And each refresh is expensive enough that it happens infrequently, leaving long windows where the pressures operate.

The usual response is to patch each of these individually. But Goodhart's Law isn't a bug you patch. It's a continuous pressure that operates on every static benchmark. The question is whether you can build something that holds up against it.

That requires two things. A way to detect when your benchmark is drifting from reality, ideally before you've spent months reporting inflated scores. And a way to fix it continuously, not through periodic re-releases but as a structural property of the benchmark itself.

Code review has a property typical AI benchmarks don't

When a developer gets a comment from a code review tool, they do something with it: they fix the issue, they dismiss it, they ignore it. This happens thousands of times a day across open source. And it's not controlled by anyone. No benchmark designer, no vendor, no model provider decides which flags developers act on.

This is unusual. Most AI benchmarks have no equivalent signal. If a model generates code, you can check whether it passes tests, but that has all the problems we just described. If a model writes a summary, there's no natural moment where a human reveals whether the summary was actually useful. Code review is one of the few AI applications where the tool's output gets directly tested by the person it's meant to help, and their response is observable at scale.

This behavioral data can't replace a controlled benchmark. You can't run two tools on the same repo, because each tool sees the other's output. Tool adoption correlates with repo characteristics, so comparing raw acceptance rates across tools would be an apples-to-oranges comparison. A tool installed mostly on well-maintained TypeScript repos will look different from one installed mostly on legacy Java codebases, regardless of quality.

But behavioral data can do something a benchmark alone can't: tell you when the benchmark is wrong. If a tool scores 80% on your offline benchmark but developers ignore 60% of its comments in practice, your benchmark has a problem. You don't need six engineers each manually auditing 138 problems to discover this. You can see it in the data, and you can see it continuously.

A benchmark that checks itself

Code Review Bench has an offline benchmark and an online benchmark.

The offline benchmark is the controlled comparison. We run every tool on the same PRs, with the same context, and score them against a curated gold set of known issues. This is what lets us say "tool A found 70% of the bugs, tool B found 55%" and have that comparison mean something. Any tool or model can be evaluated, including new ones with no public installation. The v0 builds on work published by Augment and Greptile.

The online benchmark is the reality check. Each day, we collect data from code review tools operating in open source repos. We track which comments developers act on, which they ignore, and the characteristics of the repos, PRs, and diffs involved. This tells us what tools are actually doing in practice and how useful developers find them.

The offline benchmark will Goodhart. That's not a prediction of failure; it's an inevitability we've designed around. When it does, the online data catches it. If offline rankings stop matching what we see in the wild, that's the signal to update: expand the gold set, recalibrate the judge, adjust what counts as a bug. When the two agree, we have some confidence the offline benchmark is measuring something real. When they diverge, we know where to look.
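As a concrete sketch of that check: one simple way to detect drift is to compare the offline and online rankings with a rank correlation and flag a re-audit when they stop agreeing. The tool names, scores, and threshold below are hypothetical illustrations, not real benchmark data or the benchmark's actual mechanism.

```python
# Sketch: detecting offline/online divergence via rank agreement.
# All tool names and scores below are made up for illustration.

def rank(values):
    # Simple ranking; assumes no ties, for brevity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    # Pearson correlation of ranks. With no ties, both rank vectors are
    # permutations of 0..n-1, so their variances are equal and cov/var works.
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

offline = {"tool_a": 0.70, "tool_b": 0.55, "tool_c": 0.62}  # gold-set F1
online  = {"tool_a": 0.41, "tool_b": 0.48, "tool_c": 0.44}  # acted-on rate
tools = sorted(offline)
rho = spearman([offline[t] for t in tools], [online[t] for t in tools])
if rho < 0.5:  # threshold is arbitrary here
    print(f"rankings diverge (rho={rho:.2f}): time to re-audit the gold set")
```

In this toy example the two rankings are exactly inverted, so the check fires; in practice the interesting cases are partial disagreements concentrated in one region of the leaderboard.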

We also refresh monthly. Each iteration uses PRs from the prior month, versioned and numbered. A static set of 500 problems gives the ecosystem time to overfit; a moving target makes that much harder. Anchor models run on every version so scores remain comparable across iterations.

The online benchmark is the headline metric at launch. Not because it's better, but because it's the one we can trust today. The offline benchmark exists from day one but has significant divergence from what we see in the wild. We'll be posting regular updates showing how we close that gap. Once the offline benchmark reliably reflects real-world tool value, it becomes the primary metric, because it can do what the online benchmark can't: controlled, fair comparisons on identical inputs.

The full methodology, including how we measure precision and recall and why both are harder than they sound, is here.

What we've learned so far

We can see an example of the benchmark by looking at the F1 scores of different tools (a measure which values precision and recall equally). We use F1 scores here as an example comparison, but we don't take an opinionated stance on the relative value of precision vs recall. If you want a low-noise tool that surfaces only the issues that really matter, F1 might not be the right metric. Play around with the data on your own here.
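To make the metric concrete, here is the F1 computation on invented counts. The numbers are illustrative only; they show why a quiet, precise tool and a noisy, high-recall tool can land in very different places under a metric that weights precision and recall equally.

```python
# Sketch: the F1 comparison on made-up counts.
def f1(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, weighting both equally."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# A quiet, precise tool: few comments, mostly right, misses a lot.
quiet = f1(true_pos=10, false_pos=2, false_neg=40)   # precision ~0.83, recall 0.20
# A noisy, high-recall tool: many comments, often wrong, misses little.
noisy = f1(true_pos=35, false_pos=50, false_neg=15)  # precision ~0.41, recall 0.70
```

Under F1 the noisy tool wins here, which is exactly why F1 is a reasonable default but the wrong metric if what you value is a low-noise signal.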

The most important finding is that the online and offline benchmarks disagree. This is what we expected, and it's why we built both. Some tools are consistent. Graphite has the highest precision and lowest recall in both benchmarks — it comments rarely, but when it does, it's usually right. Across all tools, F1 rankings at the bottom of the benchmark correlate more closely between offline and online than rankings at the top. These agreements give us some confidence that both benchmarks are picking up real signal, even where they disagree on magnitude. But the deltas are interesting because they tell us the methodology is working.

These kinds of disagreements existed in SWE-bench too — broken tests, contamination, scores that didn't reflect real capability. But OpenAI had to hire 93 engineers for a dedicated audit to spot them. Here, you can see them in the data on day one. That's the structural advantage of building both benchmarks: problems that would stay hidden in a traditional benchmark announce themselves. And it means that once we close the gap (once the offline and online results correlate in the ways that matter) you can have genuine confidence that the offline benchmark is measuring something real, not just something that hasn't been audited yet.

This means we can use the methods described in our methodology doc to understand and close the distance. That's what we're doing ahead of the benchmark's v1. By building it in public, we hope to earn the field's trust and surface the most effective improvements.

Equally exciting: no tool found more than 63% of the known issues. The best tool still missed over a third of the bugs. The verifier is far from solved.

The gold set is wrong, and it matters. When we built the offline benchmark, we started with Augment and Greptile's previously published datasets of PRs with known issues. When we ran tools against those datasets, some of the comments we initially scored as false positives turned out to be real issues the gold set didn't include. This is the same thing Augment found when they expanded Greptile's original dataset: multiple real issues per PR that the first round of human annotation missed. The offline benchmark may be biased toward the kinds of issues those tools prioritize.

This isn't a surprise — it's the problem we described earlier with recall measurement. But seeing it concretely changes how you read the offline results. Every tool's precision is being understated (real finds are scored as false positives) and every tool's recall is being overstated (the denominator of "bugs that exist" is too small). The magnitude of this effect is large enough to change rankings. It's one of the first things we'll be investigating as we improve the offline benchmark against the online signal.
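A small worked example, with invented numbers, shows how both distortions fall directly out of an incomplete gold set:

```python
# Sketch: how an incomplete gold set distorts both metrics.
# Suppose a PR actually contains 10 real bugs, but the gold set lists only 6.
# A tool flags 8 issues: 5 in the gold set, 2 real but unlisted, 1 genuinely wrong.
flags, gold_hits, unlisted_real = 8, 5, 2
gold_size, true_bugs = 6, 10

measured_precision = gold_hits / flags                   # 5/8 = 0.625
actual_precision = (gold_hits + unlisted_real) / flags   # 7/8 = 0.875

measured_recall = gold_hits / gold_size                  # 5/6 ~ 0.83 (vs. gold set)
actual_recall = (gold_hits + unlisted_real) / true_bugs  # 7/10 = 0.70

# Precision is understated (0.625 vs 0.875) and recall is overstated
# (0.83 vs 0.70): the pattern described in the text.
```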

Noisy tools might be better than they look. The offline data seems to show a clear pattern: tools that comment more have lower precision. But the online data complicates that story. Coderabbit has the highest recall in the online data (0.54) and the most PRs reviewed (5,035). Developers are acting on its comments at a substantial rate despite the volume. One possibility: noisy tools surface real issues that quieter tools miss, but the offline benchmark scores those discoveries as false positives because the gold set doesn't include them. Another possibility: developers on repos with high-volume tools develop different habits around which comments to engage with. Separating these explanations is exactly the kind of question the online/offline comparison is designed to answer, and it's one of the places we'll be focusing next.

How we plan to close the gaps. The gold set problem and the volume question both point to the same underlying issue: we don't have good enough ground truth. As examples of how the benchmark will evolve, we want to highlight two ideas we think can help.

The first is what we call bug lifetime analysis. From GitHub data, we can trace bug fixes back to the PRs that introduced them, giving us the lifespan of each bug — time from introduction to discovery. If you fit a survival curve to this data, you can estimate how many bugs in a given PR are likely still latent. This gives us a probabilistic bound on how complete the gold set is, rather than assuming it's ground truth. It also serves as a natural severity proxy: a bug that survives years without causing problems is, by revealed preference, less important than one that gets caught in a week.
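A minimal version of that idea can be sketched with an exponential survival model. This is not the benchmark's actual model: it ignores right-censoring (bugs never yet discovered), and the lifetimes below are fabricated. It only illustrates how observed time-to-fix data yields an estimate of how many bugs in a PR are still latent.

```python
import math

# Sketch: exponential survival model over bug lifetimes (fabricated data,
# censoring ignored for brevity).
observed_lifetimes = [3, 7, 14, 21, 45, 90, 180, 365]  # days, introduction to fix

# MLE for an exponential model: rate = 1 / mean observed lifetime.
mean_lifetime = sum(observed_lifetimes) / len(observed_lifetimes)

def still_latent(pr_age_days):
    """Estimated probability a bug introduced in this PR is still undiscovered."""
    return math.exp(-pr_age_days / mean_lifetime)

# For a 30-day-old PR, a gold set derived from known fixes undercounts
# roughly by this factor.
latent_30d = still_latent(30)
```

A real implementation would use a censoring-aware estimator (e.g. a Kaplan–Meier curve), since the longest-lived bugs are precisely the ones not yet in the data.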

The second is what we call comment afterlife. Right now, if a developer ignores a comment, we don't know if they thought it was wrong or just didn't want to deal with it in that PR. But we can track what happens afterward. If a tool flags a potential race condition, the developer ignores it, and six weeks later they fix exactly that race condition — that's not a false positive. If a certain class of flag gets acted on 80% of the time across all repos, the other 20% probably aren't all hallucinations either. Comment afterlife lets us separate "the tool was wrong" from "the developer wasn't ready to hear it yet," which is the core ambiguity in precision measurement.
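In pseudocode terms, comment afterlife is a reclassification pass over ignored comments. The data model, labels, and the 90-day window below are hypothetical placeholders; the hard part in practice (matching "the same issue fixed later") is elided.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the comment-afterlife reclassification. Field names, labels,
# and the window are illustrative assumptions, not the benchmark's schema.
@dataclass
class Comment:
    acted_on: bool                     # developer addressed it in the original PR
    fixed_later_days: Optional[int]    # days until the same issue was fixed, if ever

def classify(c: Comment, window_days: int = 90) -> str:
    if c.acted_on:
        return "true_positive"
    if c.fixed_later_days is not None and c.fixed_later_days <= window_days:
        return "true_positive_delayed"  # "wasn't ready to hear it yet"
    return "unresolved"                 # still ambiguous; not necessarily wrong

history = [Comment(True, None), Comment(False, 42), Comment(False, None)]
labels = [classify(c) for c in history]
# -> ["true_positive", "true_positive_delayed", "unresolved"]
```

The key design choice is the third bucket: an ignored comment with no later fix stays ambiguous rather than being scored as a false positive.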

Neither of these is implemented yet. We're describing them because they show the kind of investigation the online/offline loop makes possible, and because we'd welcome collaboration from anyone who's thought about these problems. The full methodology, including several other approaches we're exploring, is here: https://github.com/withmartian/code-review-benchmark/blob/main/methodology/full.md

We'll be publishing detailed analyses as we investigate each divergence. The goal isn't just a leaderboard — it's to show, transparently, how the online data changes our understanding of the offline results and what we do about it.

Conclusion

Today we're releasing a v0 of both benchmarks. The offline benchmark has significant divergence from the online data. We'll be posting regular updates showing what we learn and how we change the offline methodology in response.

Doing this well is expensive. New data every month, human annotation, calibration against behavioral signals, ongoing maintenance as the ecosystem shifts. This is why benchmarks have historically been either academic (rigorous but under-resourced and eventually stale) or vendor-published (well-resourced but potentially biased toward the vendor's tool). We're trying a third option: a well-funded research lab that doesn't train models or sell coding tools, and has no stake in which tool wins. But we don't want to do it alone. The benchmark and methodology are open source. We want tool builders and model makers reviewing our work at each step, and we'd welcome collaboration with peers in industry and academia. We'd rather catch errors before publishing results than correct them after.

The best models take billions of dollars to train. Only thousands have been spent on our best benchmarks. This is an experiment in closing that gap.

Website | Methodology | Repo