Grok 4.20 brings minimal improvements over Grok-4.1-fast
AI Summary
This article is not security-related. It appears to be a benchmark comparison page for AI language models (Grok, Gemini, GPT, Claude, etc.) with performance rankings, not a cybersecurity topic.
Tags
xAI: Grok 4.20 Beta vs xAI: Grok 4.20 Multi-Agent Beta vs xAI: Grok 4.1 Fast vs Google: Gemini 3 Flash Preview | AI BENCHY
#1 Gemini 3 Flash Preview (medium)
#2 Gemini 3.1 Pro Preview (medium)
#3 Seed-2.0-Lite (medium)
#4 GPT-5.3-Codex (medium)
#5 Qwen3.5 Plus 2026-02-15 (medium)
#6 Gemini 3 Flash Preview (low)
#7 Gemini 3 Pro Preview (medium)
#8 Qwen3.5-27B (medium)
#9 Gemini 3.1 Flash Lite Preview (high) · Archived
#10 GPT-5.4 (medium)
#11 Qwen3.5-122B-A10B (medium)
#12 Claude Sonnet 4.6 (medium)
#13 Gemini 3.1 Flash Lite Preview (medium)
#14 Step 3.5 Flash (medium)
#15 GLM 5 (medium)
#16 GPT-5.2 Chat (none)
#17 Gemini 2.5 Flash (medium)
#18 Gemini 3.1 Flash Lite Preview (low)
#19 DeepSeek V3.2 (medium)
#20 GPT-5.3 Chat (none)
#21 Gemini 3 Flash Preview (none)
#22 MiMo-V2-Flash (medium)
#23 Gemini 3.1 Flash Lite Preview (none)
#24 Grok 4.20 Beta (medium)
#25 Seed-2.0-Mini (medium)
#26 Qwen3.5-Flash (medium)
#27 Claude Sonnet 4.6 (none)
#28 Claude Opus 4.6 (medium)
#29 GPT-5.2 (medium)
#30 Kimi K2.5 (medium)
#31 Qwen3.5 Plus 2026-02-15 (none)
#32 Grok 4.1 Fast (medium)
#33 GLM 5 (none)
#34 GPT-5 Mini (medium)
#35 Hunter Alpha (medium)
#36 Nemotron 3 Super 120b A12b (medium)
#37 DeepSeek V3.2 (none)
#38 GPT-5 Nano (medium)
#39 Qwen3.5-35B-A3B (medium)
#40 Mercury 2 (medium)
#41 Qwen3.5-Flash (none)
#42 Gemini 2.5 Flash (none)
#43 gpt-oss-120b (medium)
#44 Qwen3.5-122B-A10B (none)
#45 Seed-2.0-Lite (none)
#46 Qwen3.5-27B (none)
#47 Grok 4.20 Multi-Agent Beta (medium)
#48 Qwen3.5-35B-A3B (none)
#49 MiniMax M2.5 (medium)
#50 Hunter Alpha (none)
#51 GPT-5.4 (none)
#52 Grok 4.20 Beta (none)
#53 Trinity Large Preview (none)
#54 Kimi K2.5 (none)
#55 GPT-4o-mini (none)
#56 Qwen3 Coder Next (none)
#57 GLM 4.7 Flash (none)
#58 Qwen3 Coder Next (medium)
#59 Nemotron 3 Super 120b A12b (none)
#60 Qwen3.5-9B (none)
#61 Mercury 2 (none)
#62 GLM 4.7 Flash (medium)
#63 Grok 4.1 Fast (none)
#64 MiMo-V2-Flash (none)
#65 LFM2-24B-A2B (none) · Archived
#66 Qwen3.5-9B (medium)
AI BENCHY Compare

Compared models: Grok 4.20 Beta, Grok 4.20 Multi-Agent Beta, Grok 4.1 Fast, Gemini 3 Flash Preview
Last updated at: 2026-03-12

| Metric | Grok 4.20 Beta (medium) | Grok 4.20 Multi-Agent Beta (medium) | Grok 4.1 Fast (medium) | Gemini 3 Flash Preview (medium) |
| --- | --- | --- | --- | --- |
| Release | 2026-03-12 | 2026-03-12 | 2025-11-19 | 2025-12-17 |
| Rank | #24 | #47 | #32 | #1 |
| Avg Score | 7.0 | 4.9 | 6.2 | 10.0 |
| Consistency | 9.0 | 7.1 | 7.9 | 10.0 |
| Cost per result (cents) | 5.989 | 97.178 | 0.563 | 1.025 |
| Total Cost | $0.599 | $4.859 | $0.051 | $0.164 |
| Tests Correct | 10/16 | 5/16 | 9/16 | 16/16 |
| Failed answers | Did not follow instructions: 3; Wrong answer: 3 | Did not follow instructions: 4; Wrong answer: 3; API error: 2; Extra formatting: 2 | Did not follow instructions: 3; Wrong answer: 2; No answer: 1; Timed out: 1 | No failed answers |
| Attempt pass rate | 70.8% | 52.1% | 66.7% | 100.0% |
| Flaky tests | 2 | 6 | 4 | 0 |
| Total Runs | 48 | 48 | 48 | 48 |
| Output Tokens | 1,481 | 293,634 | 1,183 | 1,634 |
| Reasoning Tokens | 86,628 | 291,260 | 83,875 | 47,907 |
| Response Time (avg) | 8.89s | 9.08s | 26.35s | 12.36s |
| Response Time (max) | 24.21s | 35.28s | 121.79s | 50.16s |
| Response Time (total) | 142.18s | 127.09s | 237.11s | 111.21s |

Metric notes:
Avg Score: Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Consistency: Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: Shows the average cost per correct benchmark answer in cents (lower is better).
Tests Correct: A test is fully passed only if every run passed for that test.
Attempt pass rate: Attempt pass rate = passed attempts / total attempts across runs.
Flaky tests: Flaky tests had mixed outcomes across runs (at least one pass and one fail).

Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg); Total Output Tokens; Avg Score vs Total Output Tokens.
Category Breakdown

Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 7.0 | 7.2 | 88.9% | 1 | 2/3 | Wrong answer: 1 | 3.19s / 3.44s / 9.57s | 262 | 6,289 |
| Grok 4.20 Multi-Agent Beta | 4.0 | 4.4 | 66.7% | 2 | 1/3 | Extra formatting: 1; Wrong answer: 1 | 3.77s / 4.38s / 11.31s | 28,392 | 27,808 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 5.65s / 5.65s / 5.65s | 102 | 4,021 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 5.61s / 5.61s / 5.61s | 299 | 3,127 |

Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 20.93s / 20.93s / 20.93s | 227 | 12,212 |
| Grok 4.20 Multi-Agent Beta | 10.0 | 10.0 | 0.0% | 0 | 0/1 | API error: 1 | 0ms / 0ms / 0ms | 0 | 0 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 37.64s / 37.64s / 37.64s | 261 | 12,272 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 50.16s / 50.16s / 50.16s | 351 | 12,645 |

Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 4.01s / 4.27s / 8.02s | 180 | 5,281 |
| Grok 4.20 Multi-Agent Beta | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 5.54s / 7.51s / 11.08s | 25,306 | 25,051 |
| Grok 4.1 Fast | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 6.63s / 6.63s / 6.63s | 180 | 5,409 |
| Gemini 3 Flash Preview | 9.9 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 4.72s / 4.72s / 4.72s | 279 | 5,333 |

Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 4.0 | 10.0 | 33.3% | 0 | 1/3 | Wrong answer: 2 | 21.33s / 24.21s / 64.00s | 251 | 40,255 |
| Grok 4.20 Multi-Agent Beta | 10.0 | 7.2 | 11.1% | 1 | 0/3 | Wrong answer: 2; Extra formatting: 1 | 24.67s / 35.28s / 74.02s | 164,609 | 163,647 |
| Grok 4.1 Fast | 4.0 | 4.4 | 66.7% | 2 | 1/3 | Timed out: 1; Wrong answer: 1 | 121.79s / 121.79s / 121.79s | 11 | 37,657 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 3/3 | No failed answers | 21.12s / 21.12s / 21.12s | 12 | 14,908 |

General Intelligence

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 5.78s / 5.78s / 5.78s | 72 | 3,440 |
| Grok 4.20 Multi-Agent Beta | 4.0 | 2.8 | 66.7% | 1 | 0/1 | Did not follow instructions: 1 | 6.40s / 6.40s / 6.40s | 15,848 | 15,746 |
| Grok 4.1 Fast | 3.0 | 9.9 | 0.0% | 0 | 0/1 | Did not follow instructions: 1 | 16.25s / 16.25s / 16.25s | 127 | 3,456 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 1/1 | No failed answers | 4.09s / 4.09s / 4.09s | 111 | 1,285 |

Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 9.0 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 4.97s / 6.05s / 9.94s | 57 | 7,107 |
| Grok 4.20 Multi-Agent Beta | 9.0 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 4.63s / 5.46s / 9.26s | 25,457 | 25,322 |
| Grok 4.1 Fast | 5.5 | 10.0 | 50.0% | 0 | 1/2 | Did not follow instructions: 1 | 5.30s / 5.30s / 5.30s | 55 | 3,489 |
| Gemini 3 Flash Preview | 10.0 | 10.0 | 100.0% | 0 | 2/2 | No failed answers | 6.10s / 6.10s / 6.10s | 72 | 4,558 |

Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Failed answers | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 7.0 | 7.2 | 88.9% | 1 | 2/3 | Did not follow instructions: 1 | 3.85s / 4.53s / 11.55s | 249 | 6,660 |
| Grok 4.20 Multi-Agent Beta | 6.3 | 5.1 | 77.8% | 2 | 1/3 | Did not follow instructions: 2 | 5.01s / 5.49s / 15.03s | 34,022 | 33,686 |
| Grok 4.1 Fast | 4.0 | 7.2 | 44.4% | 1 | 1/3 | Did not follow instructions: 1; Wrong answer: 1 | 8.08s / 8.38s / 16.17s | 187 | 6,086 |
| Gemini 3 Flash Preview | 10.0 |
โฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โฆ 100.0% Attempt pass rate = passed attempts / total attempts across runs. โฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โฆ 3/3 A test is fully passed only if every run passed for that test. No failed answers. Response Time (avg) 4.43s Response Time (max) 4.68s Response Time (total) 8.85s A test is fully passed only if every run passed for that test. โฆ 4.43s Response Time (avg) โฆ 276 Output Tokens โฆ 4,921 Reasoning Tokens โฆ Tool Calling Score Consistency Attempt pass rate Flaky tests Tests Correct Response Time (avg) Output Tokens Reasoning Tokens tr+tr>td]:border-t [&>tr+tr>td]:border-slate-300 [&>tr>td+td]:border-l [&>tr>td+td]:border-slate-300 dark:[&>tr+tr>td]:border-slate-800 dark:[&>tr>td+td]:border-slate-800"> Grok 4.20 Beta 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โฆ 0.0% Attempt pass rate = passed attempts / total attempts across runs. โฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โฆ 0/1 A test is fully passed only if every run passed for that test. Did not follow instructions: 1 Response Time (avg) 12.39s Response Time (max) 12.39s Response Time (total) 12.39s A test is fully passed only if every run passed for that test. โฆ 12.39s Response Time (avg) โฆ 183 Output Tokens โฆ 5,384 Reasoning Tokens โฆ Grok 4.20 Multi-Agent Beta 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โฆ 0.0% Attempt pass rate = passed attempts / total attempts across runs. 
โฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โฆ 0/1 A test is fully passed only if every run passed for that test. API error: 1 Response Time (avg) 0ms Response Time (max) 0ms Response Time (total) 0ms A test is fully passed only if every run passed for that test. โฆ 0ms Response Time (avg) โฆ 0 Output Tokens โฆ 0 Reasoning Tokens โฆ Grok 4.1 Fast 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โฆ 1.6 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โฆ 33.3% Attempt pass rate = passed attempts / total attempts across runs. โฆ 1 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โฆ 0/1 A test is fully passed only if every run passed for that test. No answer: 1 Response Time (avg) 27.71s Response Time (max) 27.71s Response Time (total) 27.71s A test is fully passed only if every run passed for that test. โฆ 27.71s Response Time (avg) โฆ 260 Output Tokens โฆ 11,485 Reasoning Tokens โฆ Gemini 3 Flash Preview 10.0 Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance. โฆ 10.0 Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong). โฆ 100.0% Attempt pass rate = passed attempts / total attempts across runs. โฆ 0 Flaky tests had mixed outcomes across runs (at least one pass and one fail). โฆ 1/1 A test is fully passed only if every run passed for that test. No failed answers. Response Time (avg) 10.55s Response Time (max) 10.55s Response Time (total) 10.55s A test is fully passed only if every run passed for that test. 
โฆ 10.55s Response Time (avg) โฆ 234 Output Tokens โฆ 1,130 Reasoning Tokens โฆ Quick Compare Switch Comparison Pair Qwen3.5 Plus 2026-02-15 none vs Grok 4.1 Fast medium Qwen3.5-27B none vs Grok 4.20 Multi-Agent Beta medium Seed-2.0-Lite none vs Grok 4.20 Multi-Agent Beta medium Gemini 3.1 Flash Lite Preview none vs Grok 4.20 Beta medium Qwen3.5-122B-A10B none vs Grok 4.20 Multi-Agent Beta medium Grok 4.1 Fast medium vs GLM 5 none Qwen3.5-35B-A3B none vs Grok 4.20 Multi-Agent Beta medium Gemini 3 Flash Preview none vs Grok 4.20 Beta medium Claude Sonnet 4.6 none vs Grok 4.20 Beta medium GPT-5.3 Chat none vs Grok 4.20 Beta medium Gemini 2.5 Flash none vs Grok 4.20 Multi-Agent Beta medium Gemini 3.1 Flash Lite Preview low vs Grok 4.20 Beta medium
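The three per-test metrics defined with the tables (attempt pass rate, flaky tests, and fully passed tests) can be illustrated with a small Python sketch. This is not the benchmark's actual code: the `summarize` function, the `CategorySummary` type, and the input data are hypothetical, and the page does not publish per-run outcomes. The Score and Consistency columns are left out because the page gives no formula for them.

```python
from dataclasses import dataclass


@dataclass
class CategorySummary:
    attempt_pass_rate: float  # passed attempts / total attempts, as a percentage
    flaky_tests: int          # tests with at least one pass and at least one fail
    fully_passed: int         # tests where every run passed
    total_tests: int


def summarize(runs_by_test: dict[str, list[bool]]) -> CategorySummary:
    """Aggregate per-run pass/fail outcomes using the definitions quoted on the page."""
    total_attempts = sum(len(runs) for runs in runs_by_test.values())
    passed_attempts = sum(sum(runs) for runs in runs_by_test.values())
    rate = 100.0 * passed_attempts / total_attempts if total_attempts else 0.0
    # Flaky: mixed outcomes across runs.
    flaky = sum(1 for runs in runs_by_test.values() if any(runs) and not all(runs))
    # Fully passed: every run passed (a test with no runs does not count).
    fully = sum(1 for runs in runs_by_test.values() if runs and all(runs))
    return CategorySummary(round(rate, 1), flaky, fully, len(runs_by_test))


# Hypothetical outcomes: three tests, three runs each (assumed, not taken from the page).
example = {
    "test_a": [True, True, True],
    "test_b": [True, True, True],
    "test_c": [True, True, False],
}
summary = summarize(example)
print(f"{summary.attempt_pass_rate}% pass rate, "
      f"{summary.flaky_tests} flaky, "
      f"{summary.fully_passed}/{summary.total_tests} fully passed")
```

With these made-up inputs, eight of nine attempts pass, one test is flaky, and two of three tests pass fully. A single failed run lowers the attempt pass rate only slightly, but it removes the whole test from the fully passed count.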