code-maintenance

SWE-CI is a new benchmark that evaluates LLM-powered agents on long-term code maintenance through continuous integration loops. It shifts evaluation from static, one-shot bug fixes to dynamic, multi-iteration codebase evolution. The benchmark comprises 100 tasks drawn from real-world repositories, each spanning an average of 233 days and 71 commits.

SWE-CI · SWE-bench · Jialong Chen · Xander Xu · Hu Wei · Chuan Chen · Bing Zhao
arxiv.org · mpweiher · 6 days ago