llm-evaluation

3 articles

Research demonstrates that removing code comments from SWE-bench Verified tasks unexpectedly improves performance for GPT-5-mini but not for GPT-5.2, revealing that the semantic content of comments creates model-dependent 'memetic' effects (distraction, anchoring, overgeneralization) that can either help or hinder an AI agent's reasoning. The study frames codebases as informational organisms and proposes antimemetics: using documentation as a defensive system to guide or constrain agent behavior. (A sketch of the comment-stripping step follows the entry below.)

SWE-bench Verified mini-swe-agent GPT-5-mini GPT-5.2 OpenAI requests Matplotlib Antimemetic AI
antimemeticai.com · irgolic · 1 day ago
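
The comment-removal intervention is easy to reproduce locally. The article does not publish its preprocessing code, so the following is only a minimal sketch assuming the stripping is done at the token level; for Python files, the standard tokenize module can drop '#' comments while leaving code and string literals intact:

    import io
    import tokenize

    def strip_comments(source: str) -> str:
        """Drop '#' comment tokens from Python source; strings and docstrings survive."""
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
        # untokenize pads removed spans with spaces, so some lines keep
        # trailing whitespace; the executable code is unchanged.
        return tokenize.untokenize(kept)

    # Example: both inline comments disappear, the program still runs.
    code = "x = 1  # set x\nprint(x)  # show it\n"
    print(strip_comments(code))

Note that docstrings are string literals rather than comments, so an experiment targeting all semantic annotation would need an extra AST pass to remove those as well.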

TabbyML created jj-benchmark, a dataset of 63 evaluation tasks that test how well current AI coding agents can use the Jujutsu version control system. Results show Claude 4.6 Sonnet leading with a 92% success rate, while open-weight models like Kimi-k2.5 achieve a competitive 79% on this novel VCS tool. (A harness sketch over such a task set follows the entry below.)

TabbyML Jujutsu jj Harbor Pochi Claude 4.6 Sonnet GPT-5.4 Gemini-3.1-pro Kimi-k2.5 Meng
tabbyml.github.io · wsxiaoys · 1 day ago
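
jj-benchmark's actual task schema and runner (Harbor/Pochi) are not described here, so the sketch below is an assumption about the shape of such a harness: each task gets a fresh jj repo, optional setup commands, an agent turn, and a check command whose exit code decides pass or fail:

    import subprocess
    import tempfile
    from dataclasses import dataclass

    @dataclass
    class Task:
        prompt: str        # instruction handed to the coding agent
        setup: list[str]   # shell commands preparing the repo state
        check: str         # exits 0 iff the agent's jj state is correct

    def sh(cmd: str, cwd: str) -> bool:
        return subprocess.run(cmd, shell=True, cwd=cwd,
                              capture_output=True).returncode == 0

    def evaluate(tasks: list[Task], agent) -> float:
        """Fraction of tasks whose check passes, i.e. a success-rate metric."""
        passed = 0
        for task in tasks:
            with tempfile.TemporaryDirectory() as repo:
                sh("jj git init", repo)          # git-backed Jujutsu repo
                for cmd in task.setup:
                    sh(cmd, repo)
                agent(task.prompt, repo)         # agent issues jj commands here
                passed += sh(task.check, repo)
        return passed / len(tasks)

    # Hypothetical task: did the agent create a change described 'add README'?
    tasks = [Task(
        prompt="Create a new change described 'add README' containing a README.md",
        setup=[],
        check="jj log --no-graph -T description | grep -q 'add README'",
    )]

The Task fields and the example task are illustrative, not taken from the dataset.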

Newsletter commentary on the escalating legal conflict between Anthropic and the Department of War over supply-chain risk designations and government AI policy, alongside analysis of recent LLM improvements and reliability concerns in AI systems.

Anthropic Department of War OpenAI GPT-5.4 Claude Opus 4.6 Zvi Mowshowitz Sayash Kapoor Dario Amodei Peter Wildeford Terence Tao Bernie Sanders
thezvi.substack.com · 7777777phil · 2 days ago