swe-bench

2 articles

Research demonstrates that removing code comments from SWE-bench Verified tasks unexpectedly improves performance for GPT-5-mini but not for GPT-5.2, suggesting that the semantic content of comments creates model-dependent 'memetic' effects (distraction, anchoring, overgeneralization) that can either help or hinder an AI agent's reasoning. The study frames codebases as informational organisms and proposes antimemetics: using documentation as a defensive system to guide or constrain agent behavior.

SWE-bench Verified mini-swe-agent GPT-5-mini GPT-5.2 OpenAI requests Matplotlib Antimemetic AI
antimemeticai.com · irgolic · 1 day ago · details · hn
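The intervention described above, stripping comments from a task's source files, can be sketched with Python's stdlib tokenizer; this is a minimal illustration, not the study's actual pipeline, and assumes Python sources:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Return Python source with all # comments removed.

    Uses the stdlib tokenizer so that '#' inside string literals
    is left untouched; only real COMMENT tokens are dropped.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    # untokenize pads with spaces where comments used to be,
    # so the result stays syntactically valid.
    return tokenize.untokenize(kept)
```

A tokenizer-based pass avoids the classic regex pitfall of deleting `#` characters that appear inside strings.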

METR researchers find that approximately 50% of SWE-bench-passing AI-generated pull requests would not be merged by real repository maintainers, a 24 percentage point gap between automated benchmark scores and maintainer merge rates. The study had 4 actual open-source maintainers review 296 AI-generated patches across 3 repositories to quantify the difference between benchmark performance and real-world code quality expectations.

SWE-bench Verified METR Parker Whitfill Cheryl Wu Joel Becker Nate Rush Claude 3.5 Sonnet Claude 3.7 Sonnet Claude 4 Opus Claude 4.5 Sonnet GPT-5 scikit-learn Sphinx pytest Epoch AI
metr.org · mustaphah · 2 days ago · details · hn
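The headline metric above, a percentage-point gap between a benchmark pass rate and a maintainer merge rate, can be made concrete with a small helper; the function name and inputs are hypothetical, purely to illustrate the arithmetic:

```python
def pp_gap(bench_pass: list[bool], would_merge: list[bool]) -> float:
    """Percentage-point gap between benchmark pass rate and merge rate.

    Each list has one entry per patch: did it pass the benchmark harness,
    and would a maintainer actually merge it?
    """
    n = len(bench_pass)
    pass_rate = 100.0 * sum(bench_pass) / n
    merge_rate = 100.0 * sum(would_merge) / n
    return pass_rate - merge_rate
```

A percentage-point gap subtracts the two rates directly (e.g. 75% vs 51% is a 24 pp gap), which is not the same as a 24% relative drop.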