swe-bench

2 articles

Research demonstrates that removing code comments from SWE-bench Verified tasks unexpectedly improves performance for GPT-5-mini but not for GPT-5.2, suggesting that the semantic content of comments creates model-dependent 'memetic' effects (distraction, anchoring, overgeneralization) that can either help or hinder an AI agent's reasoning. The study frames codebases as informational organisms and proposes antimemetics: using documentation as a defensive system to guide or constrain agent behavior.

SWE-bench Verified mini-swe-agent GPT-5-mini GPT-5.2 OpenAI requests Matplotlib Antimemetic AI
antimemeticai.com · irgolic · 1 day ago · details · hn
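The intervention described above, stripping comments from a task's source files, can be sketched with Python's stdlib tokenizer; this is a minimal illustration, not the study's actual pipeline, and assumes Python sources:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Return Python source with all # comments removed.

    Uses the stdlib tokenizer so that '#' inside string literals
    is left untouched; only real COMMENT tokens are dropped.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    # untokenize pads with spaces where comments used to be,
    # so the result stays syntactically valid.
    return tokenize.untokenize(kept)
```

A tokenizer-based pass avoids the classic regex pitfall of deleting `#` characters that appear inside strings.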

METR researchers find that approximately 50% of SWE-bench-passing AI-generated pull requests would not be merged by real repository maintainers, a 24 percentage point gap between automated benchmark scores and maintainer merge rates. The study had 4 actual open-source maintainers review 296 AI-generated patches across 3 repositories to quantify the difference between benchmark performance and real-world code quality expectations.

SWE-bench Verified METR Parker Whitfill Cheryl Wu Joel Becker Nate Rush Claude 3.5 Sonnet Claude 3.7 Sonnet Claude 4 Opus Claude 4.5 Sonnet GPT-5 scikit-learn Sphinx pytest Epoch AI
metr.org · mustaphah · 2 days ago · details · hn
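The headline metric above, a percentage-point gap between a benchmark pass rate and a maintainer merge rate, can be made concrete with a small helper; the function name and inputs are hypothetical, purely to illustrate the arithmetic:

```python
def pp_gap(bench_pass: list[bool], would_merge: list[bool]) -> float:
    """Percentage-point gap between benchmark pass rate and merge rate.

    Each list has one entry per patch: did it pass the benchmark harness,
    and would a maintainer actually merge it?
    """
    n = len(bench_pass)
    pass_rate = 100.0 * sum(bench_pass) / n
    merge_rate = 100.0 * sum(would_merge) / n
    return pass_rate - merge_rate
```

A percentage-point gap subtracts the two rates directly (e.g. 75% vs 51% is a 24 pp gap), which is not the same as a 24% relative drop.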