Ask HN: What benchmarks do you trust most when comparing large LLMs?

QubridAI · 3 days ago
So, I was reading a research paper that compares Nemotron-3-Super-120B, GPT-OSS-120B, and Qwen3.5-122B, looking at how they perform on benchmarks like IFBench, SWE-Bench, Tau Bench, and RULER.

One thing that stood out was the trade-off between accuracy and inference throughput, especially across precision formats like NVFP4 vs BF16.
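
For context, when I talk about throughput I mean something like the rough harness below, not anything from the paper: a back-of-envelope tokens/sec measurement for the BF16 side (the model id is a placeholder, and NVFP4 would need inference-stack support I won't pretend to know the details of):

    # Rough sketch: measure generation throughput (tokens/sec) for a BF16 checkpoint.
    # "some-org/some-120b-model" is a placeholder, not a model from the paper.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/some-120b-model"  # hypothetical
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prompts = ["Explain RULER in one sentence.", "Write fizzbuzz in Python."]
    total_new_tokens, start = 0, time.time()
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=128)
        # Count only newly generated tokens, not the prompt.
        total_new_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{total_new_tokens / (time.time() - start):.1f} tokens/sec")

Obviously that says nothing about accuracy, which is exactly why the trade-off is interesting.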

I'm really interested to know which benchmarks folks here actually rely on when evaluating models for real-world tasks. What seems to work best for you?

Do you rely more on reasoning benchmarks, coding benchmarks, or long-context tests?