llm-evaluation

3 articles

Research demonstrates that removing code comments from SWE-bench Verified tasks unexpectedly improves performance for GPT-5-mini but not for GPT-5.2, revealing that the semantic content of comments creates model-dependent 'memetic' effects (distraction, anchoring, overgeneralization) that can either help or hinder an AI agent's reasoning. The study frames codebases as informational organisms and proposes antimemetics: using documentation as a defensive system to guide or constrain agent behavior. (A sketch of the comment-stripping step follows the entry below.)

SWE-bench Verified mini-swe-agent GPT-5-mini GPT-5.2 OpenAI requests Matplotlib Antimemetic AI
antimemeticai.com · irgolic · 1 day ago
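
The comment-removal intervention is easy to reproduce locally. The article does not publish its preprocessing code, so the following is only a minimal sketch assuming the stripping is done at the token level; for Python files, the standard tokenize module can drop '#' comments while leaving code and string literals intact:

    import io
    import tokenize

    def strip_comments(source: str) -> str:
        """Drop '#' comment tokens from Python source; strings and docstrings survive."""
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
        # untokenize pads removed spans with spaces, so some lines keep
        # trailing whitespace; the executable code is unchanged.
        return tokenize.untokenize(kept)

    # Example: both inline comments disappear, the program still runs.
    code = "x = 1  # set x\nprint(x)  # show it\n"
    print(strip_comments(code))

Note that docstrings are string literals rather than comments, so an experiment targeting all semantic annotation would need an extra AST pass to remove those as well.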

TabbyML created jj-benchmark, a dataset of 63 evaluation tasks that test how well current AI coding agents can use the Jujutsu version control system. Results show Claude 4.6 Sonnet leading with a 92% success rate, while open-weight models like Kimi-k2.5 achieve a competitive 79% on this novel VCS tool. (A harness sketch over such a task set follows the entry below.)

TabbyML Jujutsu jj Harbor Pochi Claude 4.6 Sonnet GPT-5.4 Gemini-3.1-pro Kimi-k2.5 Meng
tabbyml.github.io · wsxiaoys · 1 day ago
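
jj-benchmark's actual task schema and runner (Harbor/Pochi) are not described here, so the sketch below is an assumption about the shape of such a harness: each task gets a fresh jj repo, optional setup commands, an agent turn, and a check command whose exit code decides pass or fail:

    import subprocess
    import tempfile
    from dataclasses import dataclass

    @dataclass
    class Task:
        prompt: str        # instruction handed to the coding agent
        setup: list[str]   # shell commands preparing the repo state
        check: str         # exits 0 iff the agent's jj state is correct

    def sh(cmd: str, cwd: str) -> bool:
        return subprocess.run(cmd, shell=True, cwd=cwd,
                              capture_output=True).returncode == 0

    def evaluate(tasks: list[Task], agent) -> float:
        """Fraction of tasks whose check passes, i.e. a success-rate metric."""
        passed = 0
        for task in tasks:
            with tempfile.TemporaryDirectory() as repo:
                sh("jj git init", repo)          # git-backed Jujutsu repo
                for cmd in task.setup:
                    sh(cmd, repo)
                agent(task.prompt, repo)         # agent issues jj commands here
                passed += sh(task.check, repo)
        return passed / len(tasks)

    # Hypothetical task: did the agent create a change described 'add README'?
    tasks = [Task(
        prompt="Create a new change described 'add README' containing a README.md",
        setup=[],
        check="jj log --no-graph -T description | grep -q 'add README'",
    )]

The Task fields and the example task are illustrative, not taken from the dataset.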

Newsletter commentary on the escalating legal conflict between Anthropic and the Department of War over supply-chain risk designations and government AI policy, alongside analysis of recent LLM improvements and reliability concerns in AI systems.

Anthropic Department of War OpenAI GPT-5.4 Claude Opus 4.6 Zvi Mowshowitz Sayash Kapoor Dario Amodei Peter Wildeford Terence Tao Bernie Sanders
thezvi.substack.com · 7777777phil · 2 days ago