tool-usage

1 article

Show HN: jj-benchmark – Evaluating AI agents on Jujutsu version control

tool

TabbyML created jj-benchmark, a dataset of 63 evaluation tasks that tests how well current AI coding agents can use the Jujutsu version control system. Results show Claude 4.6 Sonnet leading with a 92% success rate, while open-weight models such as Kimi-k2.5 reached a competitive 79% on this relatively new VCS tool.

ai-agents version-control jujutsu benchmark llm-evaluation coding-agents tool-usage

TabbyML Jujutsu jj Harbor Pochi Claude 4.6 Sonnet GPT-5.4 Gemini-3.1-pro Kimi-k2.5 Meng

tabbyml.github.io · wsxiaoys · 2 days ago