bug-bounty (507)
xss (274)
rce (154)
google (122)
bragging-post (119)
account-takeover (115)
facebook (111)
privilege-escalation (101)
exploit (98)
malware (97)
authentication-bypass (95)
open-source (94)
microsoft (90)
csrf (87)
access-control (78)
stored-xss (75)
cve (73)
ai-agents (67)
web-security (66)
reflected-xss (63)
phishing (60)
information-disclosure (52)
input-validation (52)
sql-injection (51)
smart-contract (49)
privacy (49)
cross-site-scripting (48)
ssrf (48)
defi (48)
tool (46)
reverse-engineering (46)
ethereum (46)
writeup (45)
api-security (45)
ai-security (41)
apple (40)
vulnerability-disclosure (40)
web-application (38)
llm (38)
opinion (37)
burp-suite (37)
automation (36)
web3 (36)
responsible-disclosure (35)
credential-theft (35)
remote-code-execution (34)
supply-chain (34)
race-condition (34)
browser (33)
infrastructure (33)
METR researchers find that approximately 50% of SWE-bench-passing AI-generated pull requests would not be merged by real repository maintainers, a gap of 24 percentage points between automated benchmark scores and maintainer merge rates. To quantify the difference between benchmark performance and real-world code-quality expectations, the study had 4 actual open-source maintainers review 296 AI-generated patches across 3 repositories.
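The headline "percentage point gap" is just the arithmetic difference between two rates computed over the same patch set. A minimal sketch of that calculation, using the study's sample size of 296 patches but otherwise made-up pass/merge counts chosen purely to illustrate the computation:

```python
# Illustrative sketch: gap, in percentage points, between an automated
# benchmark pass rate and the rate at which human maintainers would
# actually merge the same patches. Counts here are hypothetical.

def rate(successes: int, total: int) -> float:
    """Fraction of successes, e.g. 0.74 for 74%."""
    return successes / total

def gap_in_percentage_points(benchmark_rate: float, merge_rate: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return (benchmark_rate - merge_rate) * 100

# 296 patches reviewed (the study's sample size); pass/merge counts
# below are invented only to make the arithmetic concrete.
benchmark_pass = rate(219, 296)    # ~74% pass the automated benchmark
maintainer_merge = rate(148, 296)  # ~50% would be merged by maintainers

gap = gap_in_percentage_points(benchmark_pass, maintainer_merge)
print(f"Gap: {gap:.0f} percentage points")  # prints "Gap: 24 percentage points"
```

The point of the metric is that both rates are measured on the identical set of patches, so the subtraction isolates the difference between the automated grader and human review rather than differences in the underlying tasks.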
ai-agents
benchmark-evaluation
software-engineering
code-review
swe-bench
llm-capabilities
methodology
SWE-bench Verified
METR
Parker Whitfill
Cheryl Wu
Joel Becker
Nate Rush
Claude 3.5 Sonnet
Claude 3.7 Sonnet
Claude 4 Opus
Claude 4.5 Sonnet
GPT-5
scikit-learn
Sphinx
pytest
Epoch AI