model-evaluation

2 articles

Cursor describes CursorBench, its internal benchmark suite for evaluating AI coding agents on real developer tasks. Built from actual user sessions and measuring multi-dimensional agent behavior beyond simple correctness, it provides better model discrimination and closer developer alignment than public benchmarks like SWE-bench.

Cursor CursorBench SWE-bench Terminal-Bench OpenAI Haiku GPT-5
cursor.com · xdotli · 16 hours ago · details · hn

A meta-ranking system that ranks AI leaderboards on Hugging Face in real time based on community engagement (trending scores and likes), providing transparency into which benchmarks the research community actually trusts across nine domain categories.

Hugging Face Open LLM Leaderboard Chatbot Arena MTEB BigCodeBench FINAL Bench Smol AI WorldCup ALL Bench mayafree
huggingface.co · seawolf2357 · 17 hours ago · details · hn