ai-evaluation

2 articles

A critical analysis rejecting vague claims about generative-model utility and proposing a scientific framework built on three factors: encoding cost, verification cost, and task process-dependency. The author argues that most current generative-AI deployment lacks rigorous justification and predicts that usefulness decreases as task complexity grows (a toy version of the scoring is sketched below).

williamjbowman.com · takira · 3 days ago
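A minimal sketch of how such a three-factor scoring could look in code; the `Task` fields, the linear cost composition, and the process-dependency discount are illustrative assumptions, not the author's actual model:

```python
from dataclasses import dataclass

@dataclass
class Task:
    # All fields are illustrative assumptions, not the author's definitions.
    value: float               # value of a correct output
    encoding_cost: float       # effort to specify the task to the model
    verification_cost: float   # effort to check the model's output
    process_dependency: float  # 0.0 (only the output matters) .. 1.0 (the process itself matters)

def expected_net_utility(task: Task) -> float:
    """Net utility of delegating a task to a generative model.

    Hypothetical composition: the payoff is discounted by how much the
    task depends on the process (which a model cannot substitute for),
    then encoding and verification costs are subtracted.
    """
    payoff = task.value * (1.0 - task.process_dependency)
    return payoff - task.encoding_cost - task.verification_cost

# Example: a complex task where verifying the output is nearly as hard
# as doing the work directly.
hard = Task(value=10.0, encoding_cost=3.0, verification_cost=8.0, process_dependency=0.2)
print(expected_net_utility(hard))  # -3.0: delegation is not worth it
```

On this toy model, a task whose output is expensive to verify scores negative, matching the author's prediction that usefulness falls as complexity rises.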

SWE-CI is a new benchmark for evaluating LLM-powered agents on long-term code-maintenance tasks through continuous-integration loops. It shifts evaluation from static, one-shot bug fixes to dynamic, multi-iteration codebase evolution across 100 real-world repository tasks, each spanning 233 days and 71 commits on average (the loop is sketched below).

SWE-CI · SWE-bench · Jialong Chen · Xander Xu · Hu Wei · Chuan Chen · Bing Zhao
arxiv.org · mpweiher · 6 days ago
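The continuous-integration loop the summary describes might look roughly like this; the `Agent` interface, the `pytest` CI command, and the iteration cap are all assumptions, since the paper's actual harness is not shown here:

```python
import subprocess
from typing import Protocol

class Agent(Protocol):
    """Hypothetical agent interface; SWE-CI's real API may differ."""
    def propose_patch(self, task: str, feedback: str) -> str: ...

def apply_patch(patch: str) -> None:
    # Apply a unified diff to the working tree; `git apply` is a stand-in
    # for whatever patch mechanism the benchmark actually uses.
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

def run_ci() -> tuple[bool, str]:
    # Stand-in CI step: run the repository's test suite and capture the log.
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def evaluate_task(agent: Agent, task: str, max_iterations: int = 5) -> bool:
    """Drive patch -> CI -> feedback rounds instead of scoring one shot."""
    feedback = ""
    for _ in range(max_iterations):
        apply_patch(agent.propose_patch(task, feedback))
        passed, log = run_ci()
        if passed:
            return True
        feedback = log  # CI failures become the next round's context
    return False
```

The key shift is that `feedback` carries CI failures back into the agent's next attempt, so the benchmark measures iterative maintenance rather than one-shot patching.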