model-evaluation

2 articles

Cursor describes CursorBench, its internal benchmark suite for evaluating AI coding agents on real developer tasks. Built from actual user sessions and measuring multi-dimensional agent behavior beyond simple correctness, it provides better model discrimination and closer developer alignment than public benchmarks like SWE-bench.

Cursor CursorBench SWE-bench Terminal-Bench OpenAI Haiku GPT-5
cursor.com · xdotli · 16 hours ago · details · hn

A meta-ranking system that ranks AI leaderboards on Hugging Face in real time based on community engagement (trending scores and likes), providing transparency into which benchmarks the research community actually trusts across nine domain categories.

Hugging Face Open LLM Leaderboard Chatbot Arena MTEB BigCodeBench FINAL Bench Smol AI WorldCup ALL Bench mayafree
huggingface.co · seawolf2357 · 17 hours ago · details · hn