Research shows that removing code comments from SWE-bench Verified tasks unexpectedly improves performance for GPT-5-mini but not for GPT-5.2. The result suggests that the semantic content of comments produces model-dependent 'memetic' effects (distraction, anchoring, overgeneralization) that can either help or hinder an AI agent's reasoning. The study frames codebases as informational organisms and proposes 'antimemetics': using documentation as a defensive system to guide or constrain agent behavior.
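The comment-removal step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not say how the study stripped comments, the helper name `strip_comments` is hypothetical, and the sketch handles `#` comments only (docstrings are left in place). It uses the standard-library `tokenize` module so that a `#` inside a string literal is not mistaken for a comment.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove `#` comments from Python source, leaving all other text as-is.

    Hypothetical helper for illustration; not the procedure used in the study.
    Docstrings are string literals, not comments, so they are kept.
    """
    lines = source.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        row, col = tok.start  # row is 1-indexed, col is 0-indexed
        line = lines[row - 1]
        newline = "\n" if line.endswith("\n") else ""
        # A comment always runs to the end of its line, so cut the line at
        # the comment's start column and drop the whitespace before it.
        lines[row - 1] = line[:col].rstrip() + newline
    return "".join(lines)


example = (
    "x = 1  # set x\n"
    's = "# not a comment"\n'
    "# full-line note\n"
    "y = x + 1\n"
)
stripped = strip_comments(example)
print(stripped)
```

Full-line comments become blank lines rather than being deleted, which keeps line numbers stable; that matters if patches or stack traces produced by an agent are later compared against the original files.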
METR researchers find that approximately 50% of AI-generated pull requests that pass SWE-bench would not be merged by real repository maintainers, with a 24-percentage-point gap between automated benchmark scores and maintainer merge rates. In the study, 4 open-source maintainers review 296 AI-generated patches across 3 repositories to quantify the difference between benchmark performance and real-world code quality expectations.
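A percentage-point gap is an absolute difference between two rates, not a relative change. The sketch below shows the arithmetic with made-up counts chosen only for illustration; they are not the study's data, and the function names are hypothetical.

```python
def rate(successes: int, total: int) -> float:
    """Return successes / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def percentage_point_gap(benchmark_rate: float, merge_rate: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return benchmark_rate - merge_rate


# Illustrative counts only (NOT taken from the METR study).
total_patches = 200
passed_tests = 120       # patches that pass the automated benchmark tests
would_be_merged = 90     # patches a maintainer would accept

benchmark_rate = rate(passed_tests, total_patches)   # 60.0
merge_rate = rate(would_be_merged, total_patches)    # 45.0
gap_pp = percentage_point_gap(benchmark_rate, merge_rate)
relative_drop = 100.0 * gap_pp / benchmark_rate

print(f"gap: {gap_pp:.1f} percentage points")
print(f"relative drop: {relative_drop:.1f}%")
```

With these illustrative counts the gap is 15 percentage points, while the relative drop from the benchmark rate is 25%. Reporting the gap in percentage points, as the summary does, avoids conflating the two.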