Grok 4.20 brings minimal improvements over Grok-4.1-fast

aibenchy.com · XCSme · 2 days ago · view on HN · not-security-related
quality 1/10 · low quality
0 net
AI Summary

This article is not security-related. It appears to be a benchmark comparison page for AI language models (Grok, Gemini, GPT, Claude, etc.) with performance rankings, not a cybersecurity topic.

Tags
xAI: Grok 4.20 Beta vs xAI: Grok 4.20 Multi-Agent Beta vs xAI: Grok 4.1 Fast vs Google: Gemini 3 Flash Preview | AI BENCHY
Made by XCS
#1 Gemini 3 Flash Preview (medium)
#2 Gemini 3.1 Pro Preview (medium)
#3 Seed-2.0-Lite (medium)
#4 GPT-5.3-Codex (medium)
#5 Qwen3.5 Plus 2026-02-15 (medium)
#6 Gemini 3 Flash Preview (low)
#7 Gemini 3 Pro Preview (medium)
#8 Qwen3.5-27B (medium)
#9 Gemini 3.1 Flash Lite Preview (high) · Archived
#10 GPT-5.4 (medium)
#11 Qwen3.5-122B-A10B (medium)
#12 Claude Sonnet 4.6 (medium)
#13 Gemini 3.1 Flash Lite Preview (medium)
#14 Step 3.5 Flash (medium)
#15 GLM 5 (medium)
#16 GPT-5.2 Chat (none)
#17 Gemini 2.5 Flash (medium)
#18 Gemini 3.1 Flash Lite Preview (low)
#19 DeepSeek V3.2 (medium)
#20 GPT-5.3 Chat (none)
#21 Gemini 3 Flash Preview (none)
#22 MiMo-V2-Flash (medium)
#23 Gemini 3.1 Flash Lite Preview (none)
#24 Grok 4.20 Beta (medium)
#25 Seed-2.0-Mini (medium)
#26 Qwen3.5-Flash (medium)
#27 Claude Sonnet 4.6 (none)
#28 Claude Opus 4.6 (medium)
#29 GPT-5.2 (medium)
#30 Kimi K2.5 (medium)
#31 Qwen3.5 Plus 2026-02-15 (none)
#32 Grok 4.1 Fast (medium)
#33 GLM 5 (none)
#34 GPT-5 Mini (medium)
#35 Hunter Alpha (medium)
#36 Nemotron 3 Super 120b A12b (medium)
#37 DeepSeek V3.2 (none)
#38 GPT-5 Nano (medium)
#39 Qwen3.5-35B-A3B (medium)
#40 Mercury 2 (medium)
#41 Qwen3.5-Flash (none)
#42 Gemini 2.5 Flash (none)
#43 gpt-oss-120b (medium)
#44 Qwen3.5-122B-A10B (none)
#45 Seed-2.0-Lite (none)
#46 Qwen3.5-27B (none)
#47 Grok 4.20 Multi-Agent Beta (medium)
#48 Qwen3.5-35B-A3B (none)
#49 MiniMax M2.5 (medium)
#50 Hunter Alpha (none)
#51 GPT-5.4 (none)
#52 Grok 4.20 Beta (none)
#53 Trinity Large Preview (none)
#54 Kimi K2.5 (none)
#55 GPT-4o-mini (none)
#56 Qwen3 Coder Next (none)
#57 GLM 4.7 Flash (none)
#58 Qwen3 Coder Next (medium)
#59 Nemotron 3 Super 120b A12b (none)
#60 Qwen3.5-9B (none)
#61 Mercury 2 (none)
#62 GLM 4.7 Flash (medium)
#63 Grok 4.1 Fast (none)
#64 MiMo-V2-Flash (none)
#65 LFM2-24B-A2B (none) · Archived
#66 Qwen3.5-9B (medium)

AI BENCHY Compare

Compared models: Grok 4.20 Beta, Grok 4.20 Multi-Agent Beta, Grok 4.1 Fast, Gemini 3 Flash Preview
Last updated at: 2026-03-12

| Metric | Grok 4.20 Beta (medium) | Grok 4.20 Multi-Agent Beta (medium) | Grok 4.1 Fast (medium) | Gemini 3 Flash Preview (medium) |
| --- | --- | --- | --- | --- |
| Release | 2026-03-12 | 2026-03-12 | 2025-11-19 | 2025-12-17 |
| Rank | #24 | #47 | #32 | #1 |
| Avg Score | 7.0 | 4.9 | 6.2 | 10.0 |
| Consistency | 9.0 | 7.1 | 7.9 | 10.0 |
| Cost per result (cents) | 5.989 | 97.178 | 0.563 | 1.025 |
| Total Cost | $0.599 | $4.859 | $0.051 | $0.164 |
| Tests Correct | 10/16 | 5/16 | 9/16 | 16/16 |
| Failed answers | Did not follow instructions: 3; Wrong answer: 3 | Did not follow instructions: 4; Wrong answer: 3; API error: 2; Extra formatting: 2 | Did not follow instructions: 3; Wrong answer: 2; No answer: 1; Timed out: 1 | No failed answers |
| Attempt pass rate | 70.8% | 52.1% | 66.7% | 100.0% |
| Flaky tests | 2 | 6 | 4 | 0 |
| Total Runs | 48 | 48 | 48 | 48 |
| Output Tokens | 1,481 | 293,634 | 1,183 | 1,634 |
| Reasoning Tokens | 86,628 | 291,260 | 83,875 | 47,907 |
| Response Time (avg) | 8.89s | 9.08s | 26.35s | 12.36s |
| Response Time (max) | 24.21s | 35.28s | 121.79s | 50.16s |
| Response Time (total) | 142.18s | 127.09s | 237.11s | 111.21s |

Metric definitions:
Avg Score: Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Consistency: Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: Shows the average cost per correct benchmark answer in cents (lower is better).
Tests Correct: A test is fully passed only if every run passed for that test.
Attempt pass rate: Attempt pass rate = passed attempts / total attempts across runs.
Flaky tests: Flaky tests had mixed outcomes across runs (at least one pass and one fail).

Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg); Total Output Tokens; Avg Score vs Total Output Tokens

Category Breakdown

Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 7.0 | 7.2 | 88.9% | 1 | 2/3 | Wrong answer: 1 | 3.19s | 3.44s | 9.57s | 262 | 6,289 |
| Grok 4.20 Multi-Agent Beta | 4.0 | 4.4 | 66.7% | 2 | 1/3 | Extra formatting: 1; Wrong answer: 1 | 3.77s | 4.38s | 11.31s | 28,392 | 27,808 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 5.65s | 5.65s | 5.65s | 102 | 4,021 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 5.61s | 5.61s | 5.61s | 299 | 3,127 |

Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 20.93s | 20.93s | 20.93s | 227 | 12,212 |
| Grok 4.20 Multi-Agent Beta | 10.0 | 10.0 | 0.0% | 0 | 0/1 | API error: 1 | 0ms | 0ms | 0ms | 0 | 0 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 37.64s | 37.64s | 37.64s | 261 | 12,272 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 50.16s | 50.16s | 50.16s | 351 | 12,645 |

Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 4.01s | 4.27s | 8.02s | 180 | 5,281 |
| Grok 4.20 Multi-Agent Beta | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 5.54s | 7.51s | 11.08s | 25,306 | 25,051 |
| Grok 4.1 Fast | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 6.63s | 6.63s | 6.63s | 180 | 5,409 |
| Gemini 3 Flash Preview | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 4.72s | 4.72s | 4.72s | 279 | 5,333 |

Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 4.0 | 10.0 | 33.3% | 0 | 1/3 | Wrong answer: 2 | 21.33s | 24.21s | 64.00s | 251 | 40,255 |
| Grok 4.20 Multi-Agent Beta | 10.0 | 7.2 | 11.1% | 1 | 0/3 | Wrong answer: 2; Extra formatting: 1 | 24.67s | 35.28s | 74.02s | 164,609 | 163,647 |
| Grok 4.1 Fast | 4.0 | 4.4 | 66.7% | 2 | 1/3 | Timed out: 1; Wrong answer: 1 | 121.79s | 121.79s | 121.79s | 11 | 37,657 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 21.12s | 21.12s | 21.12s | 12 | 14,908 |

General Intelligence

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 5.78s | 5.78s | 5.78s | 72 | 3,440 |
| Grok 4.20 Multi-Agent Beta | 4.0 | 2.8 | 66.7% | 1 | 0/1 | Did not follow instructions: 1 | 6.40s | 6.40s | 6.40s | 15,848 | 15,746 |
| Grok 4.1 Fast | 3.0 | 9.9 | 0.0% | 0 | 0/1 | Did not follow instructions: 1 | 16.25s | 16.25s | 16.25s | 127 | 3,456 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 4.09s | 4.09s | 4.09s | 111 | 1,285 |

Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 9.0 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 4.97s | 6.05s | 9.94s | 57 | 7,107 |
| Grok 4.20 Multi-Agent Beta | 9.0 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 4.63s | 5.46s | 9.26s | 25,457 | 25,322 |
| Grok 4.1 Fast | 5.5 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 5.30s | 5.30s | 5.30s | 55 | 3,489 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 6.10s | 6.10s | 6.10s | 72 | 4,558 |

Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 7.0 | 7.2 | 88.9% | 1 | 2/3 | Did not follow instructions: 1 | 3.85s | 4.53s | 11.55s | 249 | 6,660 |
| Grok 4.20 Multi-Agent Beta | 6.3 | 5.1 | 77.8% | 2 | 1/3 | Did not follow instructions: 2 | 5.01s | 5.49s | 15.03s | 34,022 | 33,686 |
| Grok 4.1 Fast | 4.0 | 7.2 | 44.4% | 1 | 1/3 | Did not follow instructions: 1; Wrong answer: 1 | 8.08s | 8.38s | 16.17s | | |
โ€ฆ 8.08s Response Time (avg) โ€ฆ 187 Output Tokens โ€ฆ 6,086 Reasoning Tokens โ€ฆ Gemini 3 Flash Preview 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โ€ฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โ€ฆ 100.0% Attempt pass rate = passed attempts / total attempts across runs. โ€ฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โ€ฆ 3/3 A test is fully passed only if every run passed for that test. No failed answers. Response Time (avg) 4.43s Response Time (max) 4.68s Response Time (total) 8.85s A test is fully passed only if every run passed for that test. โ€ฆ 4.43s Response Time (avg) โ€ฆ 276 Output Tokens โ€ฆ 4,921 Reasoning Tokens โ€ฆ Tool Calling Score Consistency Attempt pass rate Flaky tests Tests Correct Response Time (avg) Output Tokens Reasoning Tokens tr+tr>td]:border-t [&>tr+tr>td]:border-slate-300 [&>tr>td+td]:border-l [&>tr>td+td]:border-slate-300 dark:[&>tr+tr>td]:border-slate-800 dark:[&>tr>td+td]:border-slate-800"> Grok 4.20 Beta 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โ€ฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โ€ฆ 0.0% Attempt pass rate = passed attempts / total attempts across runs. โ€ฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โ€ฆ 0/1 A test is fully passed only if every run passed for that test. Did not follow instructions: 1 Response Time (avg) 12.39s Response Time (max) 12.39s Response Time (total) 12.39s A test is fully passed only if every run passed for that test. โ€ฆ 12.39s Response Time (avg) โ€ฆ 183 Output Tokens โ€ฆ 5,384 Reasoning Tokens โ€ฆ Grok 4.20 Multi-Agent Beta 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. 
โ€ฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โ€ฆ 0.0% Attempt pass rate = passed attempts / total attempts across runs. โ€ฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โ€ฆ 0/1 A test is fully passed only if every run passed for that test. API error: 1 Response Time (avg) 0ms Response Time (max) 0ms Response Time (total) 0ms A test is fully passed only if every run passed for that test. โ€ฆ 0ms Response Time (avg) โ€ฆ 0 Output Tokens โ€ฆ 0 Reasoning Tokens โ€ฆ Grok 4.1 Fast 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โ€ฆ 1.6 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โ€ฆ 33.3% Attempt pass rate = passed attempts / total attempts across runs. โ€ฆ 1 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โ€ฆ 0/1 A test is fully passed only if every run passed for that test. No answer: 1 Response Time (avg) 27.71s Response Time (max) 27.71s Response Time (total) 27.71s A test is fully passed only if every run passed for that test. โ€ฆ 27.71s Response Time (avg) โ€ฆ 260 Output Tokens โ€ฆ 11,485 Reasoning Tokens โ€ฆ Gemini 3 Flash Preview 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โ€ฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โ€ฆ 100.0% Attempt pass rate = passed attempts / total attempts across runs. โ€ฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โ€ฆ 1/1 A test is fully passed only if every run passed for that test. No failed answers. Response Time (avg) 10.55s Response Time (max) 10.55s Response Time (total) 10.55s A test is fully passed only if every run passed for that test. 
โ€ฆ 10.55s Response Time (avg) โ€ฆ 234 Output Tokens โ€ฆ 1,130 Reasoning Tokens โ€ฆ Quick Compare Switch Comparison Pair Qwen3.5 Plus 2026-02-15 none vs Grok 4.1 Fast medium Qwen3.5-27B none vs Grok 4.20 Multi-Agent Beta medium Seed-2.0-Lite none vs Grok 4.20 Multi-Agent Beta medium Gemini 3.1 Flash Lite Preview none vs Grok 4.20 Beta medium Qwen3.5-122B-A10B none vs Grok 4.20 Multi-Agent Beta medium Grok 4.1 Fast medium vs GLM 5 none Qwen3.5-35B-A3B none vs Grok 4.20 Multi-Agent Beta medium Gemini 3 Flash Preview none vs Grok 4.20 Beta medium Claude Sonnet 4.6 none vs Grok 4.20 Beta medium GPT-5.3 Chat none vs Grok 4.20 Beta medium Gemini 2.5 Flash none vs Grok 4.20 Multi-Agent Beta medium Gemini 3.1 Flash Lite Preview low vs Grok 4.20 Beta medium
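The attempt pass rate, flaky-test count and tests-correct figures follow directly from the definitions given with the tables, so they can be sketched in a few lines of Python. This is only an illustration: the `summarize` function, the `Summary` type, the test names and the three-runs-per-test layout are all assumptions, not the site's actual scoring code, and the Score and Consistency columns are left out because no formula is given for them.

```python
from typing import Dict, List, NamedTuple


class Summary(NamedTuple):
    attempt_pass_rate: float  # passed attempts / total attempts, in percent
    flaky_tests: int          # tests with at least one pass and one fail
    fully_passed: int         # tests where every run passed
    total_tests: int


def summarize(runs: Dict[str, List[bool]]) -> Summary:
    """Aggregate per-test run outcomes (True = pass) using the page's definitions."""
    total_attempts = sum(len(outcomes) for outcomes in runs.values())
    passed_attempts = sum(sum(outcomes) for outcomes in runs.values())
    rate = 100.0 * passed_attempts / total_attempts if total_attempts else 0.0
    # Flaky: mixed outcomes, i.e. some run passed but not all of them did.
    flaky = sum(1 for o in runs.values() if any(o) and not all(o))
    # Fully passed: the test has at least one run and every run passed.
    full = sum(1 for o in runs.values() if o and all(o))
    return Summary(round(rate, 1), flaky, full, len(runs))


# Hypothetical outcomes: three tests, three runs each (the run count is an assumption).
example = {
    "puzzle_a": [True, True, True],
    "puzzle_b": [True, True, True],
    "puzzle_c": [True, False, True],
}
result = summarize(example)
print(result)
```

These made-up outcomes give 8 passed attempts out of 9 (88.9%), 1 flaky test and 2 of 3 tests fully passed, which matches the figures listed for Grok 4.20 Beta under Puzzle Solving. The real per-run results are not shown on the page, so this is one combination that fits rather than the recorded data.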