tool-usage

TabbyML created jj-benchmark, a dataset of 63 evaluation tasks that tests how well current AI coding agents can use the Jujutsu (jj) version control system. The results show Claude 4.6 Sonnet leading with a 92% success rate, while open-weight models such as Kimi-k2.5 reached a competitive 79% on this novel VCS tool.

TabbyML Jujutsu jj Harbor Pochi Claude 4.6 Sonnet GPT-5.4 Gemini-3.1-pro Kimi-k2.5 Meng
tabbyml.github.io · wsxiaoys · 2 days ago · details · hn