benchmark-evaluation


METR researchers find that approximately 50% of AI-generated pull requests that pass SWE-bench would not be merged by real repository maintainers, a 24-percentage-point gap between automated benchmark scores and maintainer merge rates. To quantify the difference between benchmark performance and real-world code-quality expectations, the study had 4 actual open-source maintainers review 296 AI patches across 3 repositories.

Tags: SWE-bench Verified, METR, Parker Whitfill, Cheryl Wu, Joel Becker, Nate Rush, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, GPT-5, scikit-learn, Sphinx, pytest, Epoch AI
metr.org · mustaphah · 2 days ago