benchmark-evaluation


METR researchers find that approximately 50% of AI-generated pull requests that pass SWE-bench would not be merged by real repository maintainers, a 24-percentage-point gap between automated benchmark scores and maintainer merge rates. To quantify the difference between benchmark performance and real-world code-quality expectations, the study had 4 actual open-source maintainers review 296 AI patches across 3 repositories.

Tags: SWE-bench Verified, METR, Parker Whitfill, Cheryl Wu, Joel Becker, Nate Rush, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, GPT-5, scikit-learn, Sphinx, pytest, Epoch AI
metr.org · mustaphah · 2 days ago